[dpdk-dev] Aligned rte_mempool for storage applications

DPDK patches and discussions

* [dpdk-dev] Aligned rte_mempool for storage applications
@ 2019-03-25 21:06 Howell, Seth
  2019-03-25 21:06 ` Howell, Seth
  2019-03-25 21:13 ` Harris, James R
  0 siblings, 2 replies; 14+ messages in thread
From: Howell, Seth @ 2019-03-25 21:06 UTC (permalink / raw)
  To: dev; +Cc: Harris, James R

Hello,

In SPDK, we use the rte_mempool struct for many internal structure collections. The per-thread cache and ease of allocation of mempools are very useful features.
Some of the collections we store in SPDK are pools of I/O buffers. Typically, these pools contain elements of at least 4096 bytes, and we would like them to be aligned to 4k for performance reasons.
Currently, the rte_mempool API doesn't support aligned mempool objects. This means that when we allocate a 4k buffer and want it aligned to 4k, we actually need to allocate an 8k buffer and calculate an offset into it each time we want to use it.
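
As a rough illustration of the workaround described above, the sketch below rounds an element address up to the next 4k boundary. The names align_up, io_buf_from_elem and the IO_BUF_* macros are hypothetical and are not taken from SPDK or DPDK; the element is assumed to be 8k so that a full 4k window always fits after the rounding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IO_BUF_SIZE  4096u  /* usable bytes the caller wants */
#define IO_BUF_ALIGN 4096u  /* required alignment, a power of two */
#define ELEM_SIZE    (IO_BUF_SIZE + IO_BUF_ALIGN)  /* what the pool must hold */

/* Round addr up to the next multiple of align (align is a power of two). */
static inline uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr + align - 1) & ~(align - 1);
}

/*
 * Given an element of ELEM_SIZE bytes taken from an unaligned pool,
 * return the first IO_BUF_ALIGN-aligned address inside it.
 */
static inline void *io_buf_from_elem(void *elem)
{
	return (void *)align_up((uintptr_t)elem, IO_BUF_ALIGN);
}
```

Every caller has to repeat this calculation (or cache the result somewhere), and half of each element is never used for data.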
We recently did a proof of concept using the rte_mempool_ops hook where we allocated a mempool and populated it with aligned entries. This allowed us to retrieve aligned addresses directly from rte_mempool_get(), but didn't help with the allocation size.
Because the rte_mempool struct assumes that each element has a header attached to it, we still need to live up to that assumption for each object we create in a mempool. This means that the actual size of a buffer becomes 4k + 24 bytes. In order to get to our next aligned address, we need to add about 4k of padding to each element.
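
The arithmetic behind "about 4k of padding" can be sketched as follows. The 24-byte header size is the figure quoted above; the helper names are hypothetical, and the layout assumed here is simply header followed by object, with each object starting on an aligned address.

```c
#include <assert.h>
#include <stddef.h>

#define OBJ_HDR_SIZE 24u    /* per-object header size quoted above */
#define OBJ_SIZE     4096u
#define OBJ_ALIGN    4096u

/* Round n up to the next multiple of align (align is a power of two). */
static inline size_t round_up(size_t n, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (n + align - 1) & ~(align - 1);
}

/* Distance between consecutive aligned objects when each carries a header. */
static inline size_t elem_stride(size_t hdr, size_t obj, size_t align)
{
	return round_up(hdr + obj, align);
}

/* Bytes per element that hold neither header nor payload. */
static inline size_t elem_padding(size_t hdr, size_t obj, size_t align)
{
	return elem_stride(hdr, obj, align) - hdr - obj;
}
```

With a 24-byte header, elem_stride(OBJ_HDR_SIZE, OBJ_SIZE, OBJ_ALIGN) is 8192 and the padding is 4072 bytes per element; with no header the stride drops to 4096 and the padding to zero.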
Modifying the current rte_mempool struct to allow entries without headers seems impossible since it would break rte_mempool_obj_iter and rte_mempool_from_obj. However, I still think there is a lot of benefit to be gained from a mempool structure that supports aligned objects without headers.
I am wondering if DPDK would be open to us introducing an rte_mempool_aligned structure. This structure would essentially be a wrapper around a regular mempool struct. However, it would not require headers or trailers for each object in the pool.

This structure would only be applicable to a subset of mempools with the following characteristics:
	1. Mempools for which the following flags are set: MEMPOOL_F_NO_CACHE_ALIGN, MEMPOOL_F_NO_IOVA_CONTIG, MEMPOOL_F_NO_SPREAD.
	2. Mempools that do not require rte_mempool_from_obj (which needs a pointer to the mp in the header of each obj) or rte_mempool_obj_iter.
	3. Mempools built without RTE_LIBRTE_MEMPOOL_DEBUG: any attempt to create this object with it enabled would necessarily fail, since we can't check the header cookies.

My thought would be that we could implement this data structure in a header and it would look something like this:

struct rte_mempool_aligned {
	struct rte_mempool mp;
	size_t obj_alignment;
};

The rest of the functions in the header would primarily be wrappers around the original functions. Most functions (rte_mempool_alloc, rte_mempool_free, rte_mempool_enqueue/dequeue, rte_mempool_get_count, etc.) could be implemented directly as wrappers, and others such as rte_mempool_create and the populate functions would have to be re-implemented to some degree in the new header. The remaining functions (check_cookies, obj_iter) would not be implemented in the rte_mempool_aligned.h file.
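
To make the idea of a header-less aligned pool concrete, here is a minimal stand-alone sketch in plain C. It is not the DPDK rte_mempool API and not the proposed rte_mempool_aligned implementation: struct aligned_pool and the aligned_pool_* functions are hypothetical, the pool is single-threaded with no per-thread cache, and it relies on C11 aligned_alloc(). The point it illustrates is that free objects can be tracked in a separate pointer array, so no header or trailer needs to sit next to each object.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in; not the DPDK rte_mempool API. */
struct aligned_pool {
	unsigned char *base;   /* one contiguous, aligned region */
	void **free_objs;      /* LIFO stack of free object pointers */
	size_t nb_free;
	size_t nb_objs;
	size_t obj_size;       /* must be a multiple of obj_alignment */
	size_t obj_alignment;
};

static int aligned_pool_init(struct aligned_pool *p, size_t nb_objs,
			     size_t obj_size, size_t obj_alignment)
{
	if (nb_objs == 0 || obj_size == 0 || obj_alignment == 0)
		return -1;
	/* aligned_alloc() needs the size to be a multiple of the alignment. */
	if (obj_size % obj_alignment != 0)
		return -1;

	p->base = aligned_alloc(obj_alignment, nb_objs * obj_size);
	p->free_objs = malloc(nb_objs * sizeof(*p->free_objs));
	if (p->base == NULL || p->free_objs == NULL) {
		free(p->base);
		free(p->free_objs);
		return -1;
	}

	/* Objects are packed back to back; each one starts aligned. */
	for (size_t i = 0; i < nb_objs; i++)
		p->free_objs[i] = p->base + i * obj_size;

	p->nb_free = nb_objs;
	p->nb_objs = nb_objs;
	p->obj_size = obj_size;
	p->obj_alignment = obj_alignment;
	return 0;
}

/* Pop a free object, or return NULL if the pool is empty. */
static void *aligned_pool_get(struct aligned_pool *p)
{
	return p->nb_free > 0 ? p->free_objs[--p->nb_free] : NULL;
}

/* Return an object previously obtained from aligned_pool_get(). */
static void aligned_pool_put(struct aligned_pool *p, void *obj)
{
	assert(p->nb_free < p->nb_objs);
	p->free_objs[p->nb_free++] = obj;
}

static void aligned_pool_fini(struct aligned_pool *p)
{
	free(p->base);
	free(p->free_objs);
}
```

Because nothing is stored beside the object, there is no way to get from an object back to its pool or to walk all objects through per-object links, which matches the restrictions listed above.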

Would the community welcome a new rte_mempool_aligned struct? If you don't feel this is the way to go, are there other options in DPDK for creating a pool of pre-allocated aligned objects?

Thank you,

Seth Howell

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [dpdk-dev] Aligned rte_mempool for storage applications
  2019-03-25 21:06 [dpdk-dev] Aligned rte_mempool for storage applications Howell, Seth
  2019-03-25 21:06 ` Howell, Seth
@ 2019-03-25 21:13 ` Harris, James R
  2019-03-25 21:13   ` Harris, James R
  2019-03-26  2:52   ` Varghese, Vipin
  1 sibling, 2 replies; 14+ messages in thread
From: Harris, James R @ 2019-03-25 21:13 UTC (permalink / raw)
  To: Howell, Seth, dev



On 3/25/19, 2:06 PM, "Howell, Seth" <seth.howell@intel.com> wrote:

    Hello,
    
    In SPDK, we use the rte_mempool struct for many internal structure collections. The per-thread cache and ease of allocation of mempools are very useful features.
    Some of the collections we store in SPDK are pools of I/O buffers. Typically, these pools contain elements of at least 4096 bytes, and we would like them to be aligned to 4k for performance reasons.

[Jim] Just to clarify Seth's point - the performance reasons are specifically to avoid wasteful memcopies.  The vast majority of NVMe SSDs on the market today do not have full scatter/gather support - rather, they only support something called PRP (Physical Region Pages), which requires all scatter/gather elements except the first to be 4KB aligned.  There are other storage interfaces, such as Linux AIO, that also impose alignment restrictions.

-Jim
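
The alignment rule described above can be sketched as a simple check over a scatter/gather list. This is only an illustration: struct sg_elem and sgl_is_prp_aligned are hypothetical names, not SPDK or DPDK types, and the function tests only the one restriction mentioned here (every element except the first starts on a 4KB boundary). Real PRP lists carry further rules that the sketch ignores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PRP_PAGE_SIZE 4096u

/* Hypothetical scatter/gather element; not an SPDK or DPDK type. */
struct sg_elem {
	uintptr_t addr;
	size_t len;
};

/*
 * Return true if every element except the first starts on a 4 KiB
 * boundary. A list that fails this check would need a bounce buffer
 * and a memcpy before it could be handed to a PRP-only device.
 */
static bool sgl_is_prp_aligned(const struct sg_elem *sgl, size_t n)
{
	assert(sgl != NULL || n == 0);
	for (size_t i = 1; i < n; i++) {
		if (sgl[i].addr % PRP_PAGE_SIZE != 0)
			return false;
	}
	return true;
}
```

If buffers come out of the pool already 4KB aligned, a list built from whole buffers passes this check without any copying.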


    Currently, the rte_mempool API doesn't support aligned mempool objects. This means that when we allocate a 4k buffer and want it aligned to 4k, we actually need to allocate an 8k buffer and calculate an offset into it each time we want to use it.
    We recently did a proof of concept using the rte_mempool_ops hook where we allocated a mempool and populated it with aligned entries. This allowed us to retrieve aligned addresses directly from rte_mempool_get(), but didn't help with the allocation size.
    Because the rte_mempool struct assumes that each element has a header attached to it, we still need to live up to that assumption for each object we create in a mempool. This means that the actual size of a buffer becomes 4k + 24 bytes. In order to get to our next aligned address, we need to add about 4k of padding to each element.
    Modifying the current rte_mempool struct to allow entries without headers seems impossible since it would break rte_mempool_for_obj_iter and rte_mempool_from_obj. However I still think there is a lot of benefit to be gained from a mempool structure that supports aligned objects without headers.
    I am wondering if DPDK would be open to us introducing an rte_mempool_aligned structure. This structure would essentially be a wrapper around a regular mempool struct. However, it would not require headers or trailers for each object in the pool.
    
    This structure would only be applicable to a subset of mempools with the following characteristics:
    	1. mempools for which the following flags were set: MEMPOOL_F_NO_CACHE_ALIGNED, MEMPOOL_F_NO_IOVA_CONTIG , MEMPOOL_F_NO_SPREAD
    	2. mempools that do not require the use of the following functions rte_mempool_from_obj (requires a pointer to the mp in the header of each obj), rte_mempool_for_obj_iter.
    	3. Any attempt to create this object when RTE_LIBRTE_MEMPOOL_DEBUG was enabled would necessarily fail since we can't check the header cookies.
    
    My thought would be that we could implement this data structure in a header and it would look something like this:
    
    struct rte_mempool_aligned {
    	struct rte_mempool mp;
    	size_t obj_alignment;
    };
    
    The rest of the functions in the header would primarily be wrappers around the original functions. Most functions (rte_mempool_alloc, rte_mempool_free, rte_mempool_enqueue/dequeue, rte_mempool_get_count, etc.) could be implemented directly as wrappers, and others such as rte_mempool_create and the populate functions would have to be re-implemented to some degree in the new header. The remaining functions (check_cookies, for_obj_iter) would not be implemented in the rte_mempool_aligned.h file. 
    
    Would the community be welcoming of a new rte_mempool_aligned struct? If you don't feel like this would be the way to go, are there other options in DPDK for creating a pool of pre-allocated aligned objects? 
    
    Thank you,
    
    Seth Howell
    
    
    


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [dpdk-dev] Aligned rte_mempool for storage applications
  2019-03-25 21:13 ` Harris, James R
  2019-03-25 21:13   ` Harris, James R
@ 2019-03-26  2:52   ` Varghese, Vipin
  2019-03-26  2:52     ` Varghese, Vipin
  2019-03-26 18:34     ` Howell, Seth
  1 sibling, 2 replies; 14+ messages in thread
From: Varghese, Vipin @ 2019-03-26  2:52 UTC (permalink / raw)
  To: Harris, James R, Howell, Seth, dev

Hi Seth,

If I may, I would like to offer a suggestion and ask a few questions about the mempool alignment details. Please find them inline below.

Snipped
> 
>     In SPDK, we use the rte_mempool struct for many internal structure
> collections. The per-thread cache and ease of allocation of mempools are very
> useful features.
>     Some of the collections we store in SPDK are pools of I/O buffers. Typically,
> these pools contain elements of at least 4096 bytes, and we would like them to be
> aligned to 4k for performance reasons.
Query-1> Is the memory required per element exactly 4096 bytes (data portion only)?

> 
> [Jim] Just to clarify Seth's point - the performance reasons are specifically to avoid
> wasteful memcopies.  The vast majority of NVMe SSDs in the market today do not
> have full scatter/gather support - rather they only support something called PRP
> (Physical Region Pages) which require all scatter gather elements except the first
> to be 4KB aligned.  There are other storage interfaces such as Linux AIO that also
> impose alignment restrictions.
> 
> -Jim
> 
> 
>     Currently, the rte_mempool API doesn't support aligned mempool objects. This
> means that when we allocate a 4k buffer and want it aligned to 4k, we actually
> need to allocate an 8k buffer and calculate an offset into it each time we want to
> use it.
Query-2> Why not create contiguous 4K-aligned memory with rte_malloc?

>     We recently did a proof of concept using the rte_mempool_ops hook where
> we allocated a mempool and populated it with aligned entries. This allowed us to
> retrieve aligned addresses directly from rte_mempool_get(), but didn't help with
> the allocation size.
>     Because the rte_mempool struct assumes that each element has a header
> attached to it, we still need to live up to that assumption for each object we
> create in a mempool. This means that the actual size of a buffer becomes 4k + 24
> bytes. In order to get to our next aligned address, we need to add about 4k of
> padding to each element.
>     Modifying the current rte_mempool struct to allow entries without headers
> seems impossible since it would break rte_mempool_for_obj_iter and
> rte_mempool_from_obj. However I still think there is a lot of benefit to be gained
> from a mempool structure that supports aligned objects without headers.
>     I am wondering if DPDK would be open to us introducing an
> rte_mempool_aligned structure. This structure would essentially be a wrapper
> around a regular mempool struct. However, it would not require headers or
> trailers for each object in the pool.
Query-3> Using a mempool with a zero-size data portion, we can either create an indirect buffer or use an external-buffer mbuf to attach the mbuf to 4K-aligned rte_malloc areas.

Note: we did something similar in the prototype for AF_XDP_ZC_PMD (presented at the BLR summit 2019).

Advantage: no change in the mempool library, mbuf library, or rte_malloc. The application works with zero changes.

> 
>     This structure would only be applicable to a subset of mempools with the
> following characteristics:
>     	1. mempools for which the following flags were set:
> MEMPOOL_F_NO_CACHE_ALIGNED, MEMPOOL_F_NO_IOVA_CONTIG ,
> MEMPOOL_F_NO_SPREAD
>     	2. mempools that do not require the use of the following functions
> rte_mempool_from_obj (requires a pointer to the mp in the header of each obj),
> rte_mempool_for_obj_iter.
>     	3. Any attempt to create this object when
> RTE_LIBRTE_MEMPOOL_DEBUG was enabled would necessarily fail since we
> can't check the header cookies.
> 
>     My thought would be that we could implement this data structure in a header
> and it would look something like this:
> 
>     Struct rte_mempool_aligned {
>     	Struct rte_mempool mp;
>     	Size_t obj_alignment;
>     };
> 
>     The rest of the functions in the header would primarily be wrappers around the
> original functions. Most functions (rte_mempool_alloc, rte_mempool_free,
> rte_mempool_enqueue/dequeue, rte_mempool_get_count, etc.) could be
> implemented directly as wrappers, and others such as rte_mempool_create and
> the populate functions would have to be re-implemented to some degree in the
> new header. The remaining functions (check_cookies, for_obj_iter) would not be
> implemented in the rte_mempool_aligned.h file.
> 
>     Would the community be welcoming of a new rte_mempool_aligned struct? If
> you don't feel like this would be the way to go, are there other options in DPDK
> for creating a pool of pre-allocated aligned objects?
> 
>     Thank you,
> 
>     Seth Howell
> 
> 
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [dpdk-dev] Aligned rte_mempool for storage applications
  2019-03-26  2:52   ` Varghese, Vipin
  2019-03-26  2:52     ` Varghese, Vipin
@ 2019-03-26 18:34     ` Howell, Seth
  2019-03-26 18:34       ` Howell, Seth
  2019-03-26 18:59       ` Harris, James R
  1 sibling, 2 replies; 14+ messages in thread
From: Howell, Seth @ 2019-03-26 18:34 UTC (permalink / raw)
  To: Varghese, Vipin, Harris, James R, dev

Hi Vipin,

Thanks for your quick reply. I will respond to your queries in order.
1. Yes, in at least one case we have buffers of exactly 4096 bytes. Some of our other buffers are much larger (>64KiB).
2. These buffers are used in the I/O path, so performance is very important. Allocating and freeing a buffer each time we use it could be pretty costly.
3. Could you describe the idea of an indirect buffer in more detail? I don't think I quite understand that concept. I know we couldn't use mbufs because we often have buffers that are larger than 64k. I think there are more reasons we don't use the mbuf structure in our use case, but am not familiar with all of them. Maybe Jim can explain those in more detail. 

Thanks,

Seth
-----Original Message-----
From: Varghese, Vipin 
Sent: Monday, March 25, 2019 7:53 PM
To: Harris, James R <james.r.harris@intel.com>; Howell, Seth <seth.howell@intel.com>; dev@dpdk.org
Subject: RE: Aligned rte_mempool for storage applications

Hi Seth,

If I may I would like to suggest and ask a query on the mempool alignment details. Please find my suggestion and query inline to the email.

Snipped
> 
>     In SPDK, we use the rte_mempool struct for many internal structure 
> collections. The per-thread cache and ease of allocation of mempools 
> are very useful features.
>     Some of the collections we store in SPDK are pools of I/O buffers. 
> Typically, these pools contain elements of at least 4096 bytes, and we 
> would like them to be aligned to 4k for performance reasons.
Query-1> is the total memory required to be 4096 only (data portion)?

> 
> [Jim] Just to clarify Seth's point - the performance reasons are 
> specifically to avoid wasteful memcopies.  The vast majority of NVMe 
> SSDs in the market today do not have full scatter/gather support - 
> rather they only support something called PRP (Physical Region Pages) 
> which require all scatter gather elements except the first to be 4KB 
> aligned.  There are other storage interfaces such as Linux AIO that also impose alignment restrictions.
> 
> -Jim
> 
> 
>     Currently, the rte_mempool API doesn't support aligned mempool 
> objects. This means that when we allocate a 4k buffer and want it 
> aligned to 4k, we actually need to allocate an 8k buffer and calculate 
> an offset into it each time we want to use it.
Query-2> why not create contiguous 4K aligned memory with rte_malloc?

>     We recently did a proof of concept using the rte_mempool_ops hook 
> where we allocated a mempool and populated it with aligned entries. 
> This allowed us to retrieve aligned addresses directly from 
> rte_mempool_get(), but didn't help with the allocation size.
>     Because the rte_mempool struct assumes that each element has a 
> header attached to it, we still need to live up to that assumption for 
> each object we create in a mempool. This means that the actual size of 
> a buffer becomes 4k + 24 bytes. In order to get to our next aligned 
> address, we need to add about 4k of padding to each element.
>     Modifying the current rte_mempool struct to allow entries without 
> headers seems impossible since it would break rte_mempool_for_obj_iter 
> and rte_mempool_from_obj. However I still think there is a lot of 
> benefit to be gained from a mempool structure that supports aligned objects without headers.
>     I am wondering if DPDK would be open to us introducing an 
> rte_mempool_aligned structure. This structure would essentially be a 
> wrapper around a regular mempool struct. However, it would not require 
> headers or trailers for each object in the pool.
Query-3> using mempool with 0 size for data portion we can either create a indirect buffer or use external mbuf to attach MBUF to 4K aligned rte_malloc areas. 

Note: we did similar to the prototype for AF_XDP_ZC_PMD (presented in BLR summit 2019). 

Advantage: no change in mempool library, mbuf library, or rte_malloc. Application works with zero change.

> 
>     This structure would only be applicable to a subset of mempools 
> with the following characteristics:
>     	1. mempools for which the following flags were set:
> MEMPOOL_F_NO_CACHE_ALIGNED, MEMPOOL_F_NO_IOVA_CONTIG , 
> MEMPOOL_F_NO_SPREAD
>     	2. mempools that do not require the use of the following 
> functions rte_mempool_from_obj (requires a pointer to the mp in the 
> header of each obj), rte_mempool_for_obj_iter.
>     	3. Any attempt to create this object when 
> RTE_LIBRTE_MEMPOOL_DEBUG was enabled would necessarily fail since we 
> can't check the header cookies.
> 
>     My thought would be that we could implement this data structure in 
> a header and it would look something like this:
> 
>     Struct rte_mempool_aligned {
>     	Struct rte_mempool mp;
>     	Size_t obj_alignment;
>     };
> 
>     The rest of the functions in the header would primarily be 
> wrappers around the original functions. Most functions 
> (rte_mempool_alloc, rte_mempool_free, rte_mempool_enqueue/dequeue, 
> rte_mempool_get_count, etc.) could be implemented directly as 
> wrappers, and others such as rte_mempool_create and the populate 
> functions would have to be re-implemented to some degree in the new 
> header. The remaining functions (check_cookies, for_obj_iter) would not be implemented in the rte_mempool_aligned.h file.
> 
>     Would the community be welcoming of a new rte_mempool_aligned 
> struct? If you don't feel like this would be the way to go, are there 
> other options in DPDK for creating a pool of pre-allocated aligned objects?
> 
>     Thank you,
> 
>     Seth Howell
> 
> 
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [dpdk-dev] Aligned rte_mempool for storage applications
  2019-03-26 18:34     ` Howell, Seth
@ 2019-03-26 18:34       ` Howell, Seth
  2019-03-26 18:59       ` Harris, James R
  1 sibling, 0 replies; 14+ messages in thread
From: Howell, Seth @ 2019-03-26 18:34 UTC (permalink / raw)
  To: Varghese, Vipin, Harris, James R, dev

Hi Vipin,

Thanks for your quick reply. I will respond to your queries in order.
1. Yes, in at least one case we have buffers of size 4096 bytes. Some of our other buffers are much larger (>64KiB).
2. These buffers are used in the I/O path, so performance is very important. Allocating and freeing a buffer each time we use it could be pretty costly.
3. Could you describe the idea of an indirect buffer in more detail? I don't think I quite understand that concept. I know we couldn't use mbufs because we often have buffers that are larger than 64k. I think there are more reasons we don't use the mbuf structure in our use case, but am not familiar with all of them. Maybe Jim can explain those in more detail. 

Thanks,

Seth
-----Original Message-----
From: Varghese, Vipin 
Sent: Monday, March 25, 2019 7:53 PM
To: Harris, James R <james.r.harris@intel.com>; Howell, Seth <seth.howell@intel.com>; dev@dpdk.org
Subject: RE: Aligned rte_mempool for storage applications

Hi Seth,

If I may, I would like to offer a suggestion and ask a few questions about the mempool alignment details. Please find my suggestion and queries inline below.

Snipped
> 
>     In SPDK, we use the rte_mempool struct for many internal structure 
> collections. The per-thread cache and ease of allocation of mempools 
> are very useful features.
>     Some of the collections we store in SPDK are pools of I/O buffers. 
> Typically, these pools contain elements of at least 4096 bytes, and we 
> would like them to be aligned to 4k for performance reasons.
Query-1> Is the total memory required only 4096 bytes (the data portion)?

> 
> [Jim] Just to clarify Seth's point - the performance reasons are 
> specifically to avoid wasteful memcopies.  The vast majority of NVMe 
> SSDs in the market today do not have full scatter/gather support - 
> rather they only support something called PRP (Physical Region Pages) 
> which require all scatter gather elements except the first to be 4KB 
> aligned.  There are other storage interfaces such as Linux AIO that also impose alignment restrictions.
> 
> -Jim
> 
> 
>     Currently, the rte_mempool API doesn't support aligned mempool 
> objects. This means that when we allocate a 4k buffer and want it 
> aligned to 4k, we actually need to allocate an 8k buffer and calculate 
> an offset into it each time we want to use it.
Query-2> why not create contiguous 4K aligned memory with rte_malloc?

>     We recently did a proof of concept using the rte_mempool_ops hook 
> where we allocated a mempool and populated it with aligned entries. 
> This allowed us to retrieve aligned addresses directly from 
> rte_mempool_get(), but didn't help with the allocation size.
>     Because the rte_mempool struct assumes that each element has a 
> header attached to it, we still need to live up to that assumption for 
> each object we create in a mempool. This means that the actual size of 
> a buffer becomes 4k + 24 bytes. In order to get to our next aligned 
> address, we need to add about 4k of padding to each element.
>     Modifying the current rte_mempool struct to allow entries without 
> headers seems impossible since it would break rte_mempool_for_obj_iter 
> and rte_mempool_from_obj. However I still think there is a lot of 
> benefit to be gained from a mempool structure that supports aligned objects without headers.
>     I am wondering if DPDK would be open to us introducing an 
> rte_mempool_aligned structure. This structure would essentially be a 
> wrapper around a regular mempool struct. However, it would not require 
> headers or trailers for each object in the pool.
Query-3> Using a mempool with a zero-size data portion, we can either create an indirect buffer or use an external mbuf to attach the mbuf to 4K-aligned rte_malloc areas.

Note: we did something similar for the AF_XDP_ZC_PMD prototype (presented at the BLR summit 2019).

Advantage: no change in mempool library, mbuf library, or rte_malloc. Application works with zero change.

> 
>     This structure would only be applicable to a subset of mempools 
> with the following characteristics:
>     	1. mempools for which the following flags were set:
> MEMPOOL_F_NO_CACHE_ALIGN, MEMPOOL_F_NO_IOVA_CONTIG,
> MEMPOOL_F_NO_SPREAD
>     	2. mempools that do not require the use of the following 
> functions rte_mempool_from_obj (requires a pointer to the mp in the 
> header of each obj), rte_mempool_for_obj_iter.
>     	3. Any attempt to create this object when 
> RTE_LIBRTE_MEMPOOL_DEBUG was enabled would necessarily fail since we 
> can't check the header cookies.
> 
>     My thought would be that we could implement this data structure in 
> a header and it would look something like this:
> 
>     struct rte_mempool_aligned {
>     	struct rte_mempool mp;
>     	size_t obj_alignment;
>     };
> 
>     The rest of the functions in the header would primarily be 
> wrappers around the original functions. Most functions 
> (rte_mempool_alloc, rte_mempool_free, rte_mempool_enqueue/dequeue, 
> rte_mempool_get_count, etc.) could be implemented directly as 
> wrappers, and others such as rte_mempool_create and the populate 
> functions would have to be re-implemented to some degree in the new 
> header. The remaining functions (check_cookies, for_obj_iter) would not be implemented in the rte_mempool_aligned.h file.
> 
>     Would the community be welcoming of a new rte_mempool_aligned 
> struct? If you don't feel like this would be the way to go, are there 
> other options in DPDK for creating a pool of pre-allocated aligned objects?
> 
>     Thank you,
> 
>     Seth Howell
> 
> 
> 
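
For reference, the element-size arithmetic from the quoted proposal can be checked with a few lines of plain C. This is only a sketch: the 24-byte header is the figure quoted above rather than a value read from the DPDK headers, and align_up, padded_stride, wasted_bytes, and align_ptr_up are hypothetical helper names, not DPDK functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round x up to the next multiple of align (align must be a power of two). */
static inline size_t align_up(size_t x, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/*
 * Per-element footprint when every object must start on an align boundary
 * and carries a per-object header of hdr_size bytes.
 * hdr_size = 24 is the figure quoted in the thread (an assumption here).
 */
static inline size_t padded_stride(size_t hdr_size, size_t obj_size, size_t align)
{
	return align_up(hdr_size + obj_size, align);
}

/* Bytes lost to padding per element, compared with a header-less layout. */
static inline size_t wasted_bytes(size_t hdr_size, size_t obj_size, size_t align)
{
	return padded_stride(hdr_size, obj_size, align) - align_up(obj_size, align);
}

/* Offset calculation used when over-allocating to find an aligned address. */
static inline void *align_ptr_up(void *p, size_t align)
{
	uintptr_t a = (uintptr_t)p;

	assert(align != 0 && (align & (align - 1)) == 0);
	return (void *)((a + align - 1) & ~((uintptr_t)align - 1));
}
```

With a 24-byte header, a 4096-byte object, and 4096-byte alignment, padded_stride gives 8192, so roughly 4 KiB per element goes to padding, which matches the overhead described in the proposal. align_ptr_up is the kind of offset calculation needed today when an 8k buffer is allocated to obtain one aligned 4k buffer.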


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [dpdk-dev] Aligned rte_mempool for storage applications
  2019-03-26 18:34     ` Howell, Seth
  2019-03-26 18:34       ` Howell, Seth
@ 2019-03-26 18:59       ` Harris, James R
  2019-03-26 18:59         ` Harris, James R
                           ` (2 more replies)
  1 sibling, 3 replies; 14+ messages in thread
From: Harris, James R @ 2019-03-26 18:59 UTC (permalink / raw)
  To: Howell, Seth, Varghese, Vipin, dev



On 3/26/19, 11:34 AM, "Howell, Seth" <seth.howell@intel.com> wrote:

    Hi Vipin,
    
    Thanks for your quick reply. I will respond to your queries in order.
    1. Yes, in at least one case we have buffers of size 4096 bytes. Some of our other buffers are much larger (>64KiB).
    2. These buffers are used in the I/O path, so performance is very important. Allocating and freeing a buffer each time we use it could be pretty costly.

I think Vipin may have been suggesting allocating one (or multiple) very large buffers, and then splitting that buffer on 4KB boundaries in SPDK.  If so, that would still require SPDK to develop its own mempool-like feature to hold those buffers.  We'd really like to use the DPDK rte_mempool implementation rather than inventing our own.
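
A minimal sketch of the "one large buffer split on 4KB boundaries" approach, in plain C so that it stands alone. It assumes C11 aligned_alloc as a stand-in for an aligned rte_malloc allocation, and chunk_region, chunk_region_init, chunk_at, and chunk_region_free are hypothetical names. It only carves the region; tracking which chunks are free is the mempool-like part that would still have to be written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CHUNK_SIZE 4096u

/* One large, CHUNK_SIZE-aligned region split into fixed-size chunks. */
struct chunk_region {
	uint8_t *base;   /* start of the region, CHUNK_SIZE-aligned */
	size_t   count;  /* number of chunks in the region */
};

/* Allocate count chunks in one call; returns 0 on success, -1 on failure. */
static int chunk_region_init(struct chunk_region *r, size_t count)
{
	assert(count > 0);
	/* aligned_alloc (C11) needs the size to be a multiple of the alignment. */
	r->base = aligned_alloc(CHUNK_SIZE, count * CHUNK_SIZE);
	r->count = r->base ? count : 0;
	return r->base ? 0 : -1;
}

/* Address of the i-th chunk; every chunk inherits the base alignment. */
static void *chunk_at(const struct chunk_region *r, size_t i)
{
	assert(i < r->count);
	return r->base + i * CHUNK_SIZE;
}

/* Release the whole region at once. */
static void chunk_region_free(struct chunk_region *r)
{
	free(r->base);
	r->base = NULL;
	r->count = 0;
}
```

Because the base is aligned and the stride equals the alignment, no per-chunk header or padding is needed, which is the property the native mempool layout cannot offer.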

    3. Could you describe the idea of an indirect buffer in more detail? I don't think I quite understand that concept. I know we couldn't use mbufs because we often have buffers that are larger than 64k. I think there are more reasons we don't use the mbuf structure in our use case, but am not familiar with all of them. Maybe Jim can explain those in more detail. 

SPDK doesn't use rte_mbufs (except when absolutely required for things like DPDK cryptodev/compressdev).  Most of that data structure is filled with network packet related fields that would never be used for storage.  We could create our own very small data structure and do something similar to Vipin's indirect mbuf suggestion.  And I think this is what Vipin was starting to allude to in query #2.

It would be less optimal than a native aligned mempool because we'd be adding an extra pointer dereference on every get from the mempool - but probably only slightly less optimal.  Seth - let's sync up offline and see if we can quickly collect some benchmarking data to measure the performance impact of this extra dereference.
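
The indirect approach might look roughly like the following sketch. It is plain C, not DPDK: io_buf_desc, desc_pool, and the desc_pool_* functions are hypothetical, the LIFO stack merely stands in for an rte_mempool holding the small descriptors, and C11 aligned_alloc stands in for an aligned rte_malloc. The point is only to show where the extra dereference happens.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BUF_ALIGN 4096u

/* Small descriptor stored in the pool; the payload lives elsewhere. */
struct io_buf_desc {
	void  *buf;  /* BUF_ALIGN-aligned data buffer */
	size_t len;  /* usable bytes in buf */
};

/* Toy LIFO pool of descriptors, standing in for an rte_mempool. */
struct desc_pool {
	struct io_buf_desc  *descs;       /* backing array of descriptors */
	struct io_buf_desc **free_stack;  /* stack of currently free descriptors */
	size_t nfree;
	size_t count;
};

/* Free every data buffer and the bookkeeping arrays. */
static void desc_pool_destroy(struct desc_pool *p)
{
	for (size_t i = 0; p->descs && i < p->count; i++)
		free(p->descs[i].buf);
	free(p->descs);
	free(p->free_stack);
	p->descs = NULL;
	p->free_stack = NULL;
	p->nfree = p->count = 0;
}

/* Create count descriptors, each pointing at its own aligned buffer. */
static int desc_pool_create(struct desc_pool *p, size_t count, size_t buf_len)
{
	assert(count > 0 && buf_len > 0 && buf_len % BUF_ALIGN == 0);
	p->count = count;
	p->nfree = 0;
	p->descs = calloc(count, sizeof(*p->descs));
	p->free_stack = calloc(count, sizeof(*p->free_stack));
	if (!p->descs || !p->free_stack) {
		desc_pool_destroy(p);
		return -1;
	}
	for (size_t i = 0; i < count; i++) {
		p->descs[i].buf = aligned_alloc(BUF_ALIGN, buf_len);
		if (!p->descs[i].buf) {
			desc_pool_destroy(p);
			return -1;
		}
		p->descs[i].len = buf_len;
		p->free_stack[p->nfree++] = &p->descs[i];
	}
	return 0;
}

/* Pop a descriptor, or return NULL when the pool is empty. */
static struct io_buf_desc *desc_pool_get(struct desc_pool *p)
{
	return p->nfree ? p->free_stack[--p->nfree] : NULL;
}

/* Push a descriptor back onto the free stack. */
static void desc_pool_put(struct desc_pool *p, struct io_buf_desc *d)
{
	assert(p->nfree < p->count);
	p->free_stack[p->nfree++] = d;
}
```

A caller would do d = desc_pool_get(&pool) and then use d->buf for I/O; that d->buf load is the extra pointer dereference on every get, and it is the cost the proposed benchmark would measure.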

Thanks Vipin - this definitely gives us an alternative direction to investigate that we hadn't considered.

-Jim


    
    Thanks,
    
    Seth
    -----Original Message-----
    From: Varghese, Vipin 
    Sent: Monday, March 25, 2019 7:53 PM
    To: Harris, James R <james.r.harris@intel.com>; Howell, Seth <seth.howell@intel.com>; dev@dpdk.org
    Subject: RE: Aligned rte_mempool for storage applications
    
    Hi Seth,
    
    If I may, I would like to offer a suggestion and ask a few questions about the mempool alignment details. Please find my suggestion and queries inline below.
    
    Snipped
    > 
    >     In SPDK, we use the rte_mempool struct for many internal structure 
    > collections. The per-thread cache and ease of allocation of mempools 
    > are very useful features.
    >     Some of the collections we store in SPDK are pools of I/O buffers. 
    > Typically, these pools contain elements of at least 4096 bytes, and we 
    > would like them to be aligned to 4k for performance reasons.
    Query-1> Is the total memory required only 4096 bytes (the data portion)?
    
    > 
    > [Jim] Just to clarify Seth's point - the performance reasons are 
    > specifically to avoid wasteful memcopies.  The vast majority of NVMe 
    > SSDs in the market today do not have full scatter/gather support - 
    > rather they only support something called PRP (Physical Region Pages) 
    > which require all scatter gather elements except the first to be 4KB 
    > aligned.  There are other storage interfaces such as Linux AIO that also impose alignment restrictions.
    > 
    > -Jim
    > 
    > 
    >     Currently, the rte_mempool API doesn't support aligned mempool 
    > objects. This means that when we allocate a 4k buffer and want it 
    > aligned to 4k, we actually need to allocate an 8k buffer and calculate 
    > an offset into it each time we want to use it.
    Query-2> why not create contiguous 4K aligned memory with rte_malloc?
    
    >     We recently did a proof of concept using the rte_mempool_ops hook 
    > where we allocated a mempool and populated it with aligned entries. 
    > This allowed us to retrieve aligned addresses directly from 
    > rte_mempool_get(), but didn't help with the allocation size.
    >     Because the rte_mempool struct assumes that each element has a 
    > header attached to it, we still need to live up to that assumption for 
    > each object we create in a mempool. This means that the actual size of 
    > a buffer becomes 4k + 24 bytes. In order to get to our next aligned 
    > address, we need to add about 4k of padding to each element.
    >     Modifying the current rte_mempool struct to allow entries without 
    > headers seems impossible since it would break rte_mempool_for_obj_iter 
    > and rte_mempool_from_obj. However I still think there is a lot of 
    > benefit to be gained from a mempool structure that supports aligned objects without headers.
    >     I am wondering if DPDK would be open to us introducing an 
    > rte_mempool_aligned structure. This structure would essentially be a 
    > wrapper around a regular mempool struct. However, it would not require 
    > headers or trailers for each object in the pool.
    Query-3> Using a mempool with a zero-size data portion, we can either create an indirect buffer or use an external mbuf to attach the mbuf to 4K-aligned rte_malloc areas.
    
    Note: we did something similar for the AF_XDP_ZC_PMD prototype (presented at the BLR summit 2019).
    
    Advantage: no change in mempool library, mbuf library, or rte_malloc. Application works with zero change.
    
    > 
    >     This structure would only be applicable to a subset of mempools 
    > with the following characteristics:
    >     	1. mempools for which the following flags were set:
    > MEMPOOL_F_NO_CACHE_ALIGN, MEMPOOL_F_NO_IOVA_CONTIG,
    > MEMPOOL_F_NO_SPREAD
    >     	2. mempools that do not require the use of the following 
    > functions rte_mempool_from_obj (requires a pointer to the mp in the 
    > header of each obj), rte_mempool_for_obj_iter.
    >     	3. Any attempt to create this object when 
    > RTE_LIBRTE_MEMPOOL_DEBUG was enabled would necessarily fail since we 
    > can't check the header cookies.
    > 
    >     My thought would be that we could implement this data structure in 
    > a header and it would look something like this:
    > 
    >     struct rte_mempool_aligned {
    >     	struct rte_mempool mp;
    >     	size_t obj_alignment;
    >     };
    > 
    >     The rest of the functions in the header would primarily be 
    > wrappers around the original functions. Most functions 
    > (rte_mempool_alloc, rte_mempool_free, rte_mempool_enqueue/dequeue, 
    > rte_mempool_get_count, etc.) could be implemented directly as 
    > wrappers, and others such as rte_mempool_create and the populate 
    > functions would have to be re-implemented to some degree in the new 
    > header. The remaining functions (check_cookies, for_obj_iter) would not be implemented in the rte_mempool_aligned.h file.
    > 
    >     Would the community be welcoming of a new rte_mempool_aligned 
    > struct? If you don't feel like this would be the way to go, are there 
    > other options in DPDK for creating a pool of pre-allocated aligned objects?
    > 
    >     Thank you,
    > 
    >     Seth Howell
    > 
    > 
    > 
    
    


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [dpdk-dev] Aligned rte_mempool for storage applications
  2019-03-26 18:59       ` Harris, James R
  2019-03-26 18:59         ` Harris, James R
@ 2019-03-27  2:33         ` Varghese, Vipin
  2019-03-27  2:33           ` Varghese, Vipin
  2019-03-27  8:28         ` Varghese, Vipin
  2 siblings, 1 reply; 14+ messages in thread
From: Varghese, Vipin @ 2019-03-27  2:33 UTC (permalink / raw)
  To: Harris, James R, Howell, Seth, dev

Thanks, Jim, for the consideration.

I humbly suggested these ideas because we had a similar issue when creating the AF_XDP_ZC PMD. Happy to share more.

Thanks
Vipin Varghese

> -----Original Message-----
> From: Harris, James R
> Sent: Wednesday, March 27, 2019 12:29 AM
> To: Howell, Seth <seth.howell@intel.com>; Varghese, Vipin
> <vipin.varghese@intel.com>; dev@dpdk.org
> Subject: Re: Aligned rte_mempool for storage applications
> 
> 
> 
> On 3/26/19, 11:34 AM, "Howell, Seth" <seth.howell@intel.com> wrote:
> 
>     Hi Vipin,
> 
>     Thanks for your quick reply. I will respond to your queries in order.
>     1. Yes, in at least one case we have buffers of size 4096 bytes. Some of our
> other buffers are much larger (>64KiB).
>     2. These buffers are used in the I/O path, so performance is very important.
> Allocating and freeing a buffer each time we use it could be pretty costly.
> 
> I think Vipin may have been suggesting allocating one (or multiple) very large
> buffers, and then splitting that buffer on 4KB boundaries in SPDK.  If so, that
> would still require SPDK to develop its own mempool-like feature to hold those
> buffers.  We'd really like to use the DPDK rte_mempool implementation rather
> than inventing our own.
> 
>     3. Could you describe the idea of an indirect buffer in more detail? I don't think
> I quite understand that concept. I know we couldn't use mbufs because we often
> have buffers that are larger than 64k. I think there are more reasons we don't use
> the mbuf structure in our use case, but am not familiar with all of them. Maybe
> Jim can explain those in more detail.
> 
> SPDK doesn't use rte_mbufs (except when absolutely required for things like
> DPDK cryptodev/compressdev).  Most of that data structure is filled with network
> packet related fields that would never be used for storage.  We could create our
> own very small data structure and do something similar to Vipin's indirect mbuf
> suggestion.  And I think this is what Vipin was starting to allude to in query #2.
> 
> It would be less optimal than a native aligned mempool because we'd be adding
> an extra pointer dereference on every get from the mempool - but probably only
> slightly less optimal.  Seth - let's sync up offline and see if we can quickly collect
> some benchmarking data to measure the performance impact of this extra
> dereference.
> 
> Thanks Vipin - this definitely gives us an alternative direction to investigate that
> we hadn't considered.
> 
> -Jim
> 
> 
> 
>     Thanks,
> 
>     Seth
>     -----Original Message-----
>     From: Varghese, Vipin
>     Sent: Monday, March 25, 2019 7:53 PM
>     To: Harris, James R <james.r.harris@intel.com>; Howell, Seth
> <seth.howell@intel.com>; dev@dpdk.org
>     Subject: RE: Aligned rte_mempool for storage applications
> 
>     Hi Seth,
> 
>     If I may, I would like to offer a suggestion and ask a few questions about the
> mempool alignment details. Please find my suggestion and queries inline below.
> 
>     Snipped
>     >
>     >     In SPDK, we use the rte_mempool struct for many internal structure
>     > collections. The per-thread cache and ease of allocation of mempools
>     > are very useful features.
>     >     Some of the collections we store in SPDK are pools of I/O buffers.
>     > Typically, these pools contain elements of at least 4096 bytes, and we
>     > would like them to be aligned to 4k for performance reasons.
>     Query-1> Is the total memory required only 4096 bytes (the data portion)?
> 
>     >
>     > [Jim] Just to clarify Seth's point - the performance reasons are
>     > specifically to avoid wasteful memcopies.  The vast majority of NVMe
>     > SSDs in the market today do not have full scatter/gather support -
>     > rather they only support something called PRP (Physical Region Pages)
>     > which require all scatter gather elements except the first to be 4KB
>     > aligned.  There are other storage interfaces such as Linux AIO that also impose
> alignment restrictions.
>     >
>     > -Jim
>     >
>     >
>     >     Currently, the rte_mempool API doesn't support aligned mempool
>     > objects. This means that when we allocate a 4k buffer and want it
>     > aligned to 4k, we actually need to allocate an 8k buffer and calculate
>     > an offset into it each time we want to use it.
>     Query-2> why not create contiguous 4K aligned memory with rte_malloc?
> 
>     >     We recently did a proof of concept using the rte_mempool_ops hook
>     > where we allocated a mempool and populated it with aligned entries.
>     > This allowed us to retrieve aligned addresses directly from
>     > rte_mempool_get(), but didn't help with the allocation size.
>     >     Because the rte_mempool struct assumes that each element has a
>     > header attached to it, we still need to live up to that assumption for
>     > each object we create in a mempool. This means that the actual size of
>     > a buffer becomes 4k + 24 bytes. In order to get to our next aligned
>     > address, we need to add about 4k of padding to each element.
>     >     Modifying the current rte_mempool struct to allow entries without
>     > headers seems impossible since it would break rte_mempool_for_obj_iter
>     > and rte_mempool_from_obj. However I still think there is a lot of
>     > benefit to be gained from a mempool structure that supports aligned objects
> without headers.
>     >     I am wondering if DPDK would be open to us introducing an
>     > rte_mempool_aligned structure. This structure would essentially be a
>     > wrapper around a regular mempool struct. However, it would not require
>     > headers or trailers for each object in the pool.
>     Query-3> Using a mempool with a zero-size data portion, we can either create an
> indirect buffer or use an external mbuf to attach the mbuf to 4K-aligned rte_malloc
> areas.
> 
>     Note: we did something similar for the AF_XDP_ZC_PMD prototype (presented at the
> BLR summit 2019).
> 
>     Advantage: no change in mempool library, mbuf library, or rte_malloc.
> Application works with zero change.
> 
>     >
>     >     This structure would only be applicable to a subset of mempools
>     > with the following characteristics:
>     >     	1. mempools for which the following flags were set:
>     > MEMPOOL_F_NO_CACHE_ALIGN, MEMPOOL_F_NO_IOVA_CONTIG,
>     > MEMPOOL_F_NO_SPREAD
>     >     	2. mempools that do not require the use of the following
>     > functions rte_mempool_from_obj (requires a pointer to the mp in the
>     > header of each obj), rte_mempool_for_obj_iter.
>     >     	3. Any attempt to create this object when
>     > RTE_LIBRTE_MEMPOOL_DEBUG was enabled would necessarily fail since we
>     > can't check the header cookies.
>     >
>     >     My thought would be that we could implement this data structure in
>     > a header and it would look something like this:
>     >
>     >     struct rte_mempool_aligned {
>     >     	struct rte_mempool mp;
>     >     	size_t obj_alignment;
>     >     };
>     >
>     >     The rest of the functions in the header would primarily be
>     > wrappers around the original functions. Most functions
>     > (rte_mempool_alloc, rte_mempool_free, rte_mempool_enqueue/dequeue,
>     > rte_mempool_get_count, etc.) could be implemented directly as
>     > wrappers, and others such as rte_mempool_create and the populate
>     > functions would have to be re-implemented to some degree in the new
>     > header. The remaining functions (check_cookies, for_obj_iter) would not be
> implemented in the rte_mempool_aligned.h file.
>     >
>     >     Would the community be welcoming of a new rte_mempool_aligned
>     > struct? If you don't feel like this would be the way to go, are there
>     > other options in DPDK for creating a pool of pre-allocated aligned objects?
>     >
>     >     Thank you,
>     >
>     >     Seth Howell
>     >
>     >
>     >
> 
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [dpdk-dev] Aligned rte_mempool for storage applications
  2019-03-27  2:33         ` Varghese, Vipin
@ 2019-03-27  2:33           ` Varghese, Vipin
  0 siblings, 0 replies; 14+ messages in thread
From: Varghese, Vipin @ 2019-03-27  2:33 UTC (permalink / raw)
  To: Harris, James R, Howell, Seth, dev

Thanks, Jim, for the consideration.

I humbly suggested these ideas because we had a similar issue when creating the AF_XDP_ZC PMD. Happy to share more.

Thanks
Vipin Varghese

> -----Original Message-----
> From: Harris, James R
> Sent: Wednesday, March 27, 2019 12:29 AM
> To: Howell, Seth <seth.howell@intel.com>; Varghese, Vipin
> <vipin.varghese@intel.com>; dev@dpdk.org
> Subject: Re: Aligned rte_mempool for storage applications
> 
> 
> 
> On 3/26/19, 11:34 AM, "Howell, Seth" <seth.howell@intel.com> wrote:
> 
>     Hi Vipin,
> 
>     Thanks for your quick reply. I will respond to your queries in order.
>     1. Yes, in at least one case we have buffers of size 4096 bytes. Some of our
> other buffers are much larger (>64KiB).
>     2. These buffers are used in the I/O path, so performance is very important.
> Allocating and freeing a buffer each time we use it could be pretty costly.
> 
> I think Vipin may have been suggesting allocating one (or multiple) very large
> buffers, and then splitting that buffer on 4KB boundaries in SPDK.  If so, that
> would still require SPDK to develop its own mempool-like feature to hold those
> buffers.  We'd really like to use the DPDK rte_mempool implementation rather
> than inventing our own.
> 
>     3. Could you describe the idea of an indirect buffer in more detail? I don't think
> I quite understand that concept. I know we couldn't use mbufs because we often
> have buffers that are larger than 64k. I think there are more reasons we don't use
> the mbuf structure in our use case, but am not familiar with all of them. Maybe
> Jim can explain those in more detail.
> 
> SPDK doesn't use rte_mbufs (except when absolutely required for things like
> DPDK cryptodev/compressdev).  Most of that data structure is filled with network
> packet related fields that would never be used for storage.  We could create our
> own very small data structure and do something similar to Vipin's indirect mbuf
> suggestion.  And I think this is what Vipin was starting to allude to in query #2.
> 
> It would be less optimal than a native aligned mempool because we'd be adding
> an extra pointer dereference on every get from the mempool - but probably only
> slightly less optimal.  Seth - let's sync up offline and see if we can quickly collect
> some benchmarking data to measure the performance impact of this extra
> dereference.
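
The "very small data structure" Jim mentions could look roughly like the following. It is only a sketch under assumptions: struct io_buf_desc and both helper names are hypothetical, a plain array stands in for the rte_mempool that would hold the descriptors, and aligned_alloc() stands in for rte_malloc. It shows where the extra pointer dereference comes from: the pool hands back a descriptor, and the aligned data is one load away.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define DATA_ALIGN 4096u

/* Small pool element (hypothetical); the payload itself lives elsewhere. */
struct io_buf_desc {
    void  *data; /* DATA_ALIGN-aligned payload */
    size_t len;  /* payload length in bytes */
};

/*
 * Point each descriptor at one slice of a single aligned region.
 * Returns the region base (release it with free()) or NULL on failure.
 */
static void *desc_array_init(struct io_buf_desc *descs, size_t count, size_t len)
{
    if (count == 0 || len == 0 || len > SIZE_MAX - (DATA_ALIGN - 1))
        return NULL;

    /* Round len up to a multiple of DATA_ALIGN so every slice stays aligned. */
    size_t stride = (len + DATA_ALIGN - 1) / DATA_ALIGN * DATA_ALIGN;
    if (count > SIZE_MAX / stride)
        return NULL;

    uint8_t *base = aligned_alloc(DATA_ALIGN, count * stride);
    if (base == NULL)
        return NULL;

    for (size_t i = 0; i < count; i++) {
        descs[i].data = base + i * stride;
        descs[i].len = len;
        assert(((uintptr_t)descs[i].data % DATA_ALIGN) == 0);
    }
    return base;
}

/* The extra dereference: descriptor -> payload. */
static inline void *desc_data(const struct io_buf_desc *d)
{
    return d->data;
}
```

Because the descriptors are small, the per-object mempool header no longer pushes the payload off its 4KB boundary; the cost is the one additional load in desc_data().
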
> 
> Thanks Vipin - this definitely gives us an alternative direction to investigate that
> we hadn't considered.
> 
> -Jim
> 
> 
> 
>     Thanks,
> 
>     Seth
>     -----Original Message-----
>     From: Varghese, Vipin
>     Sent: Monday, March 25, 2019 7:53 PM
>     To: Harris, James R <james.r.harris@intel.com>; Howell, Seth
> <seth.howell@intel.com>; dev@dpdk.org
>     Subject: RE: Aligned rte_mempool for storage applications
> 
>     Hi Seth,
> 
>     If I may, I would like to offer a suggestion and ask some questions about the
>     mempool alignment details. Please find them inline below.
> 
>     Snipped
>     >
>     >     In SPDK, we use the rte_mempool struct for many internal structure
>     > collections. The per-thread cache and ease of allocation of mempools
>     > are very useful features.
>     >     Some of the collections we store in SPDK are pools of I/O buffers.
>     > Typically, these pools contain elements of at least 4096 bytes, and we
>     > would like them to be aligned to 4k for performance reasons.
>     Query-1> Is the total memory required per buffer only 4096 bytes (the data portion)?
> 
>     >
>     > [Jim] Just to clarify Seth's point - the performance reasons are
>     > specifically to avoid wasteful memcopies.  The vast majority of NVMe
>     > SSDs in the market today do not have full scatter/gather support -
>     > rather they only support something called PRP (Physical Region Page)
>     > lists, which require all scatter/gather elements except the first to
>     > be 4KB aligned.  There are other storage interfaces, such as Linux AIO,
>     > that also impose alignment restrictions.
>     >
>     > -Jim
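
The alignment rule Jim states above can be written as a simple check. This is a sketch under assumptions: struct sg_elem and sg_list_prp_aligned are hypothetical names, and the function tests only the rule as stated in this message (every element after the first starts on a 4KB boundary), not the full set of constraints in the NVMe specification.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PRP_ALIGN 4096u

/* Hypothetical scatter/gather element: start address and length. */
struct sg_elem {
    uintptr_t addr;
    size_t    len;
};

/* Return true if every element after the first starts on a 4 KiB boundary. */
static bool sg_list_prp_aligned(const struct sg_elem *sg, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        if ((sg[i].addr & (PRP_ALIGN - 1)) != 0)
            return false;
    }
    return true;
}
```

A list that fails this check would have to be copied into aligned bounce buffers before submission, which is the memcopy the aligned pool is meant to avoid.
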
>     >
>     >
>     >     Currently, the rte_mempool API doesn't support aligned mempool
>     > objects. This means that when we allocate a 4k buffer and want it
>     > aligned to 4k, we actually need to allocate an 8k buffer and calculate
>     > an offset into it each time we want to use it.
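
The workaround Seth describes, allocating 8k and computing an offset to the aligned 4k inside it, amounts to rounding the pointer up. A minimal sketch, assuming a power-of-two alignment; align_up_ptr is a hypothetical helper name, not a DPDK function:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Round ptr up to the next multiple of align (align must be a power of two). */
static inline void *align_up_ptr(void *ptr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (void *)(((uintptr_t)ptr + align - 1) & ~(align - 1));
}
```

With an 8192-byte element, the rounded-up pointer is at most 4095 bytes past the start, so a full 4096-byte buffer always fits behind it. The cost is that half of every element is wasted and the rounding runs on each use.
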
>     Query-2> Why not create contiguous 4K-aligned memory with rte_malloc?
> 
>     >     We recently did a proof of concept using the rte_mempool_ops hook
>     > where we allocated a mempool and populated it with aligned entries.
>     > This allowed us to retrieve aligned addresses directly from
>     > rte_mempool_get(), but didn't help with the allocation size.
>     >     Because the rte_mempool struct assumes that each element has a
>     > header attached to it, we still need to live up to that assumption for
>     > each object we create in a mempool. This means that the actual size of
>     > a buffer becomes 4k + 24 bytes. In order to get to our next aligned
>     > address, we need to add about 4k of padding to each element.
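
The padding figure follows from rounding the header plus the object up to the alignment. A small worked sketch, using the 24-byte header and 4096-byte object quoted above; the helper names are hypothetical:

```c
#include <stddef.h>

/* Round n up to the next multiple of align (align > 0). */
static size_t round_up(size_t n, size_t align)
{
    return (n + align - 1) / align * align;
}

/* Bytes each pool element must occupy so that consecutive objects stay aligned. */
static size_t aligned_stride(size_t hdr, size_t obj, size_t align)
{
    return round_up(hdr + obj, align);
}
```

For hdr = 24, obj = 4096 and align = 4096 the stride is 8192 bytes, so each element carries 8192 - 4120 = 4072 bytes of padding. Without a header the stride would be exactly 4096.
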
>     >     Modifying the current rte_mempool struct to allow entries without
>     > headers seems impossible since it would break rte_mempool_for_obj_iter
>     > and rte_mempool_from_obj. However, I still think there is a lot of
>     > benefit to be gained from a mempool structure that supports aligned
>     > objects without headers.
>     >     I am wondering if DPDK would be open to us introducing an
>     > rte_mempool_aligned structure. This structure would essentially be a
>     > wrapper around a regular mempool struct. However, it would not require
>     > headers or trailers for each object in the pool.
>     Query-3> Using a mempool with a 0-size data portion, we can either create an
>     indirect buffer or use an external mbuf to attach the mbuf to 4K-aligned
>     rte_malloc areas.
> 
>     Note: we did something similar in the prototype for AF_XDP_ZC_PMD (presented
>     at the BLR summit 2019).
> 
>     Advantage: no change to the mempool library, mbuf library, or rte_malloc.
>     The application works with zero changes.
> 
>     >
>     >     This structure would only be applicable to a subset of mempools
>     > with the following characteristics:
>     >     	1. mempools for which the following flags were set:
>     > MEMPOOL_F_NO_CACHE_ALIGN, MEMPOOL_F_NO_IOVA_CONTIG,
>     > MEMPOOL_F_NO_SPREAD
>     >     	2. mempools that do not require the use of the following
>     > functions rte_mempool_from_obj (requires a pointer to the mp in the
>     > header of each obj), rte_mempool_for_obj_iter.
>     >     	3. Any attempt to create this object when
>     > RTE_LIBRTE_MEMPOOL_DEBUG was enabled would necessarily fail since we
>     > can't check the header cookies.
>     >
>     >     My thought would be that we could implement this data structure in
>     > a header and it would look something like this:
>     >
>     >     struct rte_mempool_aligned {
>     >     	struct rte_mempool mp;	/* wrapped regular mempool */
>     >     	size_t obj_alignment;	/* required object alignment, in bytes */
>     >     };
>     >
>     >     The rest of the functions in the header would primarily be
>     > wrappers around the original functions. Most functions
>     > (rte_mempool_alloc, rte_mempool_free, rte_mempool_enqueue/dequeue,
>     > rte_mempool_get_count, etc.) could be implemented directly as
>     > wrappers, and others such as rte_mempool_create and the populate
>     > functions would have to be re-implemented to some degree in the new
>     > header. The remaining functions (check_cookies, for_obj_iter) would not
>     > be implemented in the rte_mempool_aligned.h file.
>     >
>     >     Would the community be welcoming of a new rte_mempool_aligned
>     > struct? If you don't feel like this would be the way to go, are there
>     > other options in DPDK for creating a pool of pre-allocated aligned objects?
>     >
>     >     Thank you,
>     >
>     >     Seth Howell
>     >
>     >
>     >
> 
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [dpdk-dev] Aligned rte_mempool for storage applications
  2019-03-26 18:59       ` Harris, James R
  2019-03-26 18:59         ` Harris, James R
  2019-03-27  2:33         ` Varghese, Vipin
@ 2019-03-27  8:28         ` Varghese, Vipin
  2019-03-27  8:28           ` Varghese, Vipin
  2 siblings, 1 reply; 14+ messages in thread
From: Varghese, Vipin @ 2019-03-27  8:28 UTC (permalink / raw)
  To: Harris, James R, Howell, Seth, dev

[-- Attachment #1: Type: text/plain, Size: 7988 bytes --]

Sharing the possible design for 4K aligned address for objects.





[-- Attachment #2: spdk.PNG --]
[-- Type: image/png, Size: 50338 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2019-03-27  8:29 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-25 21:06 [dpdk-dev] Aligned rte_mempool for storage applications Howell, Seth
2019-03-25 21:06 ` Howell, Seth
2019-03-25 21:13 ` Harris, James R
2019-03-25 21:13   ` Harris, James R
2019-03-26  2:52   ` Varghese, Vipin
2019-03-26  2:52     ` Varghese, Vipin
2019-03-26 18:34     ` Howell, Seth
2019-03-26 18:34       ` Howell, Seth
2019-03-26 18:59       ` Harris, James R
2019-03-26 18:59         ` Harris, James R
2019-03-27  2:33         ` Varghese, Vipin
2019-03-27  2:33           ` Varghese, Vipin
2019-03-27  8:28         ` Varghese, Vipin
2019-03-27  8:28           ` Varghese, Vipin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).