DPDK patches and discussions
* rte_malloc() and alignment
@ 2024-02-06 16:17 Mattias Rönnblom
  2024-02-07  4:46 ` Stephen Hemminger
  0 siblings, 1 reply; 4+ messages in thread
From: Mattias Rönnblom @ 2024-02-06 16:17 UTC (permalink / raw)
  To: Mattias Rönnblom, dev; +Cc: Mattias Rönnblom

The rte_malloc() API documentation has the following to say about the 
align parameter:

"If 0, the return is a pointer that is suitably aligned for any kind of 
variable (in the same manner as malloc()). Otherwise, the return is a 
pointer that is a multiple of align. In this case, it must be a power of 
two. (Minimum alignment is the cacheline size, i.e. 64-bytes)"

After reading this, one might be left with the impression that the
parenthesis refers only to the "otherwise" (non-zero-align) case, since
surely, cache line alignment should be sufficient for any kind of
variable, and its semantics would be "in the same manner as malloc()".

However, in the actual RTE malloc implementation, any align parameter 
value less than RTE_CACHE_LINE_SIZE results in an alignment of 
RTE_CACHE_LINE_SIZE, unless I'm missing something.
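
To make that concrete, here is a minimal sketch (assuming an initialized 
EAL and RTE_CACHE_LINE_SIZE == 64; not a claim about what the API is 
supposed to do, just what I observe it doing):

#include <stdio.h>
#include <stdint.h>
#include <rte_malloc.h>

/* Minimal sketch: with the current implementation, both align = 0 and
 * align = 8 appear to end up cache line (64-byte) aligned, while an
 * align larger than a cache line is honored as such. */
static void
alignment_demo(void)
{
	void *a = rte_malloc(NULL, 100, 0);   /* "suitably aligned for any kind of variable" */
	void *b = rte_malloc(NULL, 100, 8);   /* asks for 8, apparently gets >= 64 */
	void *c = rte_malloc(NULL, 100, 128); /* larger than a cache line */

	printf("a mod 64  = %zu\n", (size_t)((uintptr_t)a % 64));
	printf("b mod 64  = %zu\n", (size_t)((uintptr_t)b % 64));
	printf("c mod 128 = %zu\n", (size_t)((uintptr_t)c % 128));

	rte_free(a);
	rte_free(b);
	rte_free(c);
}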

Is there any conceivable scenario where passing a non-zero align 
parameter is useful?

Would it be an improvement to rephrase the documentation to:

"The alignment of the allocated memory meets all of the following criteria:
1) able to hold any built-in type.
2) be at least as large as the align parameter.
3) be at least as large as RTE_CACHE_LINE_SIZE.

The align parameter must be a power-of-2 or 0.
"

...so it actually describes what is implemented? And also adds the 
theoretical (?) case of a built-in type requiring > RTE_CACHE_LINE_SIZE 
amount of alignment.


* Re: rte_malloc() and alignment
  2024-02-06 16:17 rte_malloc() and alignment Mattias Rönnblom
@ 2024-02-07  4:46 ` Stephen Hemminger
  2024-02-07  8:05   ` Dmitry Kozlyuk
  2024-02-07  8:23   ` Mattias Rönnblom
  0 siblings, 2 replies; 4+ messages in thread
From: Stephen Hemminger @ 2024-02-07  4:46 UTC (permalink / raw)
  To: Mattias Rönnblom; +Cc: dev, Mattias Rönnblom

On Tue, 6 Feb 2024 17:17:31 +0100
Mattias Rönnblom <hofors@lysator.liu.se> wrote:

> The rte_malloc() API documentation has the following to say about the 
> align parameter:
> 
> "If 0, the return is a pointer that is suitably aligned for any kind of 
> variable (in the same manner as malloc()). Otherwise, the return is a 
> pointer that is a multiple of align. In this case, it must be a power of 
> two. (Minimum alignment is the cacheline size, i.e. 64-bytes)"
> 
> After reading this, one might be left with the impression that the
> parenthesis refers only to the "otherwise" (non-zero-align) case, since
> surely, cache line alignment should be sufficient for any kind of
> variable, and its semantics would be "in the same manner as malloc()".
> 
> However, in the actual RTE malloc implementation, any align parameter 
> value less than RTE_CACHE_LINE_SIZE results in an alignment of 
> RTE_CACHE_LINE_SIZE, unless I'm missing something.
> 
> Is there any conceivable scenario where passing a non-zero align 
> parameter is useful?
> 
> Would it be an improvement to rephrase the documentation to:
> 
> "The alignment of the allocated memory meets all of the following criteria:
> 1) able to hold any built-in type.
> 2) be at least as large as the align parameter.
> 3) be at least as large as RTE_CACHE_LINE_SIZE.
> 
> The align parameter must be a power-of-2 or 0.
> "
> 
> ...so it actually describes what is implemented? And also adds the 
> theoretical (?) case of a built-in type requiring > RTE_CACHE_LINE_SIZE 
> amount of alignment.

My reading is that align of 0 means that rte_malloc() should act
the same as malloc(), and give alignment for the largest type.

Walking through the code, the real work is in malloc_heap_alloc_on_heap_id(),
and at this point align of 0 has been converted to 1.

/*
 * Iterates through the freelist for a heap to find a free element with the
 * biggest size and requested alignment. Will also set size to whatever element
 * size that was found.
 * Returns null on failure, or pointer to element on success.
 */
static struct malloc_elem *
find_biggest_element(struct malloc_heap *heap, size_t *size,
		unsigned int flags, size_t align, bool contig)


Then the elements are examined with:

size_t
malloc_elem_find_max_iova_contig(struct malloc_elem *elem, size_t align)

But I don't see anywhere that an align of 0 gets converted to alignment on
sizeof(double), which is the largest type.
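
Roughly, a hypothetical check (not anything in the tree) of what "same as
malloc()" would have to guarantee:

#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, only to illustrate the point: to be "suitably
 * aligned for any kind of variable" in the malloc() sense, the returned
 * pointer needs at least alignof(max_align_t) alignment (typically 16 on
 * x86_64), which an align of 1 does not by itself guarantee. */
static int
is_malloc_compatible(const void *p)
{
	return (uintptr_t)p % alignof(max_align_t) == 0;
}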

Not sure who has expertise here? The allocator is a bit of a problem child.
It is complex, slow and critical.


* Re: rte_malloc() and alignment
  2024-02-07  4:46 ` Stephen Hemminger
@ 2024-02-07  8:05   ` Dmitry Kozlyuk
  2024-02-07  8:23   ` Mattias Rönnblom
  1 sibling, 0 replies; 4+ messages in thread
From: Dmitry Kozlyuk @ 2024-02-07  8:05 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Mattias Rönnblom, dev, Mattias Rönnblom, Anatoly Burakov

2024-02-06 20:46 (UTC-0800), Stephen Hemminger:
> On Tue, 6 Feb 2024 17:17:31 +0100
> Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> 
> > The rte_malloc() API documentation has the following to say about the 
> > align parameter:
> > 
> > "If 0, the return is a pointer that is suitably aligned for any kind of 
> > variable (in the same manner as malloc()). Otherwise, the return is a 
> > pointer that is a multiple of align. In this case, it must be a power of 
> > two. (Minimum alignment is the cacheline size, i.e. 64-bytes)"
> > 
> > After reading this, one might be left with the impression that the
> > parenthesis refers only to the "otherwise" (non-zero-align) case, since
> > surely, cache line alignment should be sufficient for any kind of
> > variable, and its semantics would be "in the same manner as malloc()".
> > 
> > However, in the actual RTE malloc implementation, any align parameter 
> > value less than RTE_CACHE_LINE_SIZE results in an alignment of 
> > RTE_CACHE_LINE_SIZE, unless I'm missing something.
> > 
> > Is there any conceivable scenario where passing a non-zero align 
> > parameter is useful?
> > 
> > Would it be an improvement to rephrase the documentation to:
> > 
> > "The alignment of the allocated memory meets all of the following criteria:
> > 1) able to hold any built-in type.
> > 2) be at least as large as the align parameter.
> > 3) be at least as large as RTE_CACHE_LINE_SIZE.
> > 
> > The align parameter must be a power-of-2 or 0.
> > "
> > 
> > ...so it actually describes what is implemented? And also adds the 
> > theoretical (?) case of a built-in type requiring > RTE_CACHE_LINE_SIZE 
> > amount of alignment.  
> 
> My reading is that align of 0 means that rte_malloc() should act
> the same as malloc(), and give alignment for the largest type.
> 
> Walking through the code, the real work is in malloc_heap_alloc_on_heap_id(),
> and at this point align of 0 has been converted to 1.
> 
> /*
>  * Iterates through the freelist for a heap to find a free element with the
>  * biggest size and requested alignment. Will also set size to whatever element
>  * size that was found.
>  * Returns null on failure, or pointer to element on success.
>  */
> static struct malloc_elem *
> find_biggest_element(struct malloc_heap *heap, size_t *size,
> 		unsigned int flags, size_t align, bool contig)
> 
> 
> Then the elements are examined with:
> 
> size_t
> malloc_elem_find_max_iova_contig(struct malloc_elem *elem, size_t align)
> 
> But I don't see anywhere that an align of 0 gets converted to alignment on
> sizeof(double), which is the largest type.

One may also read "in the same manner as malloc()" as referring to "suitably
aligned", which means that the alignment is "as suitable as malloc()'s"
and may also be larger. Then comes the assumption that no built-in type has
an alignment larger than a cache line (could vector types be an exception?).

> Not sure who has expertise here?

Added Anatoly.

> The allocator is a bit of a problem child.
> It is complex, slow and critical.


* Re: rte_malloc() and alignment
  2024-02-07  4:46 ` Stephen Hemminger
  2024-02-07  8:05   ` Dmitry Kozlyuk
@ 2024-02-07  8:23   ` Mattias Rönnblom
  1 sibling, 0 replies; 4+ messages in thread
From: Mattias Rönnblom @ 2024-02-07  8:23 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, Mattias Rönnblom

On 2024-02-07 05:46, Stephen Hemminger wrote:
> On Tue, 6 Feb 2024 17:17:31 +0100
> Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> 
>> The rte_malloc() API documentation has the following to say about the
>> align parameter:
>>
>> "If 0, the return is a pointer that is suitably aligned for any kind of
>> variable (in the same manner as malloc()). Otherwise, the return is a
>> pointer that is a multiple of align. In this case, it must be a power of
>> two. (Minimum alignment is the cacheline size, i.e. 64-bytes)"
>>
>> After reading this, one might be left with the impression that the
>> parenthesis refers only to the "otherwise" (non-zero-align) case, since
>> surely, cache line alignment should be sufficient for any kind of
>> variable, and its semantics would be "in the same manner as malloc()".
>>
>> However, in the actual RTE malloc implementation, any align parameter
>> value less than RTE_CACHE_LINE_SIZE results in an alignment of
>> RTE_CACHE_LINE_SIZE, unless I'm missing something.
>>
>> Is there any conceivable scenario where passing a non-zero align
>> parameter is useful?
>>
>> Would it be an improvement to rephrase the documentation to:
>>
>> "The alignment of the allocated memory meets all of the following criteria:
>> 1) able to hold any built-in type.
>> 2) be at least as large as the align parameter.
>> 3) be at least as large as RTE_CACHE_LINE_SIZE.
>>
>> The align parameter must be a power-of-2 or 0.
>> "
>>
>> ...so it actually describes what is implemented? And also adds the
>> theoretical (?) case of a built-in type requiring > RTE_CACHE_LINE_SIZE
>> amount of alignment.
> 
> My reading is that align of 0 means that rte_malloc() should act
> the same as malloc(), and give alignment for the largest type.
> 

That would be mine as well, if my Bayesian prior hadn't been "doesn't 
DPDK cache-align *all* heap allocations?".

> Walking through the code, the real work is in malloc_heap_alloc_on_heap_id(),
> and at this point align of 0 has been converted to 1.
> 
> /*
>   * Iterates through the freelist for a heap to find a free element with the
>   * biggest size and requested alignment. Will also set size to whatever element
>   * size that was found.
>   * Returns null on failure, or pointer to element on success.
>   */
> static struct malloc_elem *
> find_biggest_element(struct malloc_heap *heap, size_t *size,
> 		unsigned int flags, size_t align, bool contig)
> 
> 

I continued to heap_alloc() (malloc_heap.c:239), and there one can find:

align = RTE_CACHE_LINE_ROUNDUP(align);

That's where I stopped, still knowing I was pretty clueless with regard 
to the full picture.
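
For reference, as I read rte_common.h, that macro rounds up to the next 
multiple of RTE_CACHE_LINE_SIZE, something like:

#include <stddef.h>
#include <rte_common.h>

/* Roughly what RTE_CACHE_LINE_ROUNDUP() does, as I read rte_common.h:
 * round v up to the next multiple of RTE_CACHE_LINE_SIZE. So an align
 * of 1 (i.e., the converted 0) becomes 64, which matches the behavior
 * I described in my first mail. */
static size_t
cache_line_roundup(size_t v)
{
	return RTE_CACHE_LINE_SIZE *
		((v + RTE_CACHE_LINE_SIZE - 1) / RTE_CACHE_LINE_SIZE);
}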

There are two reasons to ask for aligned memory. One is to fit a 
certain primitive type into that region, knowing that certain CPUs may 
be slow at, or unable to do, unaligned loads and stores (e.g., for MMX 
registers).

Another is to avoid false sharing between the piece of memory you just 
allocated and adjacent memory blocks (which you know nothing about).

I wonder if we aren't best off keeping those two concerns separate, and 
maybe letting the memory allocator deal with both.

It seems the alignment-for-load/store concern could be solved by just 
having all allocations be naturally aligned, up to a certain, 
ISA-specific, size (16 bytes on x86_64, I think). By naturally aligned I 
mean that a two-byte allocation would be aligned by 2, a four-byte one 
by 4, and so on.
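
Roughly, a hypothetical sketch of such a policy (the cap and the helper 
are made up; nothing like this exists today):

#include <stddef.h>

#define MAX_NATURAL_ALIGN 16 /* hypothetical ISA-specific cap (x86_64) */

/* Hypothetical: natural alignment for an allocation of n bytes - the
 * smallest power of two >= n, capped at MAX_NATURAL_ALIGN. A two-byte
 * allocation gets 2-byte alignment, a four-byte one 4, and anything of
 * 16 bytes or more gets 16. */
static size_t
natural_align(size_t n)
{
	size_t align = 1;

	while (align < n && align < MAX_NATURAL_ALIGN)
		align <<= 1;

	return align;
}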

The false sharing issue is a more difficult one. In a world without 
next-line prefetchers (or more elaborate variants thereof), you could 
just cache-align every distinct allocation (which I'm guessing is the 
rationale for malloc_heap.c:239). In the situation we seem to be in 
today, where not only the line the core loads/stores to is fetched, but 
also the next (few?) line(s), no amount of struct alignment will fix the 
issue - you need guaranteed padding. You would also want a global knob 
to turn all that extra padding off, since disabling the hardware 
prefetchers may well be possible (or impossible, depending on the system).
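
To illustrate what I mean by guaranteed padding (purely hypothetical, 
and ignoring how such a block would later be freed):

#include <stddef.h>
#include <rte_common.h>
#include <rte_malloc.h>

/* Hypothetical illustration only: over-allocate so that there are guard
 * cache lines (enough to cover whatever extra lines the prefetcher may
 * pull in) both before and after the usable region. A matching free
 * would need to recover the original pointer. */
#define GUARD_LINES 2 /* made-up knob: lines of padding on each side */

static void *
malloc_false_sharing_safe(size_t n)
{
	size_t guard = GUARD_LINES * RTE_CACHE_LINE_SIZE;
	char *p = rte_malloc(NULL, n + 2 * guard, RTE_CACHE_LINE_SIZE);

	return p != NULL ? p + guard : NULL;
}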

A scenario you want to take into account is one where you have a large 
amount of relatively rarely accessed data, where you don't need to worry 
about false sharing, and thus you don't want any alignment or padding 
beyond what the ISA requires. Extra padding there would just make the 
whole thing grow, potentially by a lot.

That leads me to something like

void *rte_malloc_socket(size_t n, int socket, unsigned int flags);

Where you get memory which is naturally aligned, and "false 
sharing-protected" by default. Such protection would entail having 
enough padding *between* blocks (both before, and after). With this API, 
libs/apps/PMDs should not use any __rte_cache_aligned or RTE_CACHE_GUARD 
type constructs (except for block-internal padding, which one might 
argue shouldn't be used).

With a flag

#define RTE_MALLOC_FLAG_HINT_RARELY_USED

the application specifies that this data is rarely accessed, so false 
sharing is not a concern -> no padding is required. Or you turn it 
around, so "rarely used" is the default.

You could also have a flag
#define RTE_MALLOC_FLAG_NO_ALIGNMENT
which would turn off natural alignment, potentially saving some space.
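
Hypothetical usage, just to show the intent (the example types are made 
up, and the rte_malloc_socket() signature and the flags are only the 
proposal above, not the existing API):

#include <stdint.h>
#include <rte_lcore.h>
#include <rte_memory.h>

struct port_stats { uint64_t drops; };    /* made-up example types */
struct ring_state { uint64_t enqueued; };

static void
proposed_api_usage(void)
{
	/* rarely accessed: no false sharing protection needed */
	struct port_stats *stats =
		rte_malloc_socket(sizeof(*stats), SOCKET_ID_ANY,
				  RTE_MALLOC_FLAG_HINT_RARELY_USED);

	/* hot data: default gives natural alignment plus inter-block padding */
	struct ring_state *state =
		rte_malloc_socket(sizeof(*state), rte_socket_id(), 0);

	(void)stats;
	(void)state;
}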

Another related thing I think would be very useful is to have per-lcore 
heaps, or something to that effect. Then you could allocate memory that 
you know, with some certainty, will only be *frequently* accessed from a 
particular core. (Still the same MT-safe alloc/free.)

void *rte_malloc_lcore(size_t n, unsigned int lcore_id, unsigned int flags);

In that case, there's no need to cache-align the data (rather the 
opposite: it would just make the effective working set grow and cache 
misses increase).

Just to be clear: per-lcore heaps aren't for making the allocations 
themselves go faster (although that might happen too); they are to allow 
apps, libraries and PMDs to spatially organize data in such a manner 
that items primarily accessed by one lcore are close to each other, 
rather than items allocated by a particular module/driver. Working with, 
not against, the CPU.
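
Again purely hypothetical (rte_malloc_lcore() is only the proposal 
above, and the example type is made up), but usage could look something 
like:

#include <stdint.h>
#include <rte_lcore.h>

struct worker_state { uint64_t n_pkts; }; /* made-up example type */

/* Hypothetical: give each worker lcore its frequently-accessed state
 * from that lcore's own heap, so data ends up grouped by the core that
 * touches it rather than by the module that allocated it. */
static void
setup_per_lcore_state(void)
{
	unsigned int lcore_id;

	RTE_LCORE_FOREACH_WORKER(lcore_id) {
		struct worker_state *ws =
			rte_malloc_lcore(sizeof(*ws), lcore_id, 0);

		/* hand ws over to the worker running on lcore_id */
		(void)ws;
	}
}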

If you combine such a feature with per-lcore static 
(initialization-time) allocations, I don't see why almost all 
__rte_cache_aligned annotations in DPDK and DPDK-based apps wouldn't go 
away. How much padding is that? I wonder how much of the memory working 
set resident in the cache hierarchy in the typical DPDK app is padding.

> Then the elements are examined with:
> 
> size_t
> malloc_elem_find_max_iova_contig(struct malloc_elem *elem, size_t align)
> 
> But I don't see anywhere that an align of 0 gets converted to alignment on
> sizeof(double), which is the largest type.
> 

I also didn't find this.

> Not sure who has expertise here? The allocator is a bit of a problem child.
> It is complex, slow and critical.

Not a great combo. Is anyone planning to attempt to improve upon this 
situation?

