From: "Mattias Rönnblom" <hofors@lysator.liu.se>
To: Stephen Hemminger <stephen@networkplumber.org>
Cc: "dev@dpdk.org" <dev@dpdk.org>,
"Mattias Rönnblom" <mattias.ronnblom@ericsson.com>
Subject: Re: rte_malloc() and alignment
Date: Wed, 7 Feb 2024 09:23:03 +0100
Message-ID: <26057e70-4b67-4684-a956-498351994583@lysator.liu.se>
In-Reply-To: <20240206204622.6fc99cac@hermes.local>
On 2024-02-07 05:46, Stephen Hemminger wrote:
> On Tue, 6 Feb 2024 17:17:31 +0100
> Mattias Rönnblom <hofors@lysator.liu.se> wrote:
>
>> The rte_malloc() API documentation has the following to say about the
>> align parameter:
>>
>> "If 0, the return is a pointer that is suitably aligned for any kind of
>> variable (in the same manner as malloc()). Otherwise, the return is a
>> pointer that is a multiple of align. In this case, it must be a power of
>> two. (Minimum alignment is the cacheline size, i.e. 64-bytes)"
>>
>> After reading this, one might be left with the impression that the
>> parenthesis refers to only the "otherwise" (non-zero-align) case, since
>> surely, cache line alignment should be sufficient for any kind of
>> variable and its semantics would be "in the same manner as malloc()".
>>
>> However, in the actual RTE malloc implementation, any align parameter
>> value less than RTE_CACHE_LINE_SIZE results in an alignment of
>> RTE_CACHE_LINE_SIZE, unless I'm missing something.
>>
>> Is there any conceivable scenario where passing a non-zero align
>> parameter is useful?
>>
>> Would it be an improvement to rephrase the documentation to:
>>
>> "The alignment of the allocated memory meets all of the following criteria:
>> 1) able to hold any built-in type.
>> 2) be at least as large as the align parameter.
>> 3) be at least as large as RTE_CACHE_LINE_SIZE.
>>
>> The align parameter must be a power-of-2 or 0.
>> "
>>
>> ...so it actually describes what is implemented? And also adds the
>> theoretical (?) case of a built-in type requiring > RTE_CACHE_LINE_SIZE
>> amount of alignment.
>
> My reading is that align of 0 means that rte_malloc() should act
> same as malloc(), and give alignment for largest type.
>
That would be mine as well, if my Bayesian prior hadn't been "doesn't
DPDK cache-align *all* heap allocations?".
> Walking through the code, the real work is in
> malloc_heap_alloc_on_heap_id(), and at this point an align of 0 has
> been converted to 1.
>
> /*
> * Iterates through the freelist for a heap to find a free element with the
> * biggest size and requested alignment. Will also set size to whatever element
> * size that was found.
> * Returns null on failure, or pointer to element on success.
> */
> static struct malloc_elem *
> find_biggest_element(struct malloc_heap *heap, size_t *size,
> unsigned int flags, size_t align, bool contig)
>
>
I continued to heap_alloc() (malloc_heap.c:239), and there one can find:

    align = RTE_CACHE_LINE_ROUNDUP(align);

That's where I stopped, still knowing I was pretty clueless with regard
to the full picture.
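
To make the observed behaviour concrete, here is a minimal standalone sketch of what that rounding appears to imply for the effective alignment. It does not use DPDK headers: SKETCH_CACHE_LINE_SIZE, SKETCH_CACHE_LINE_ROUNDUP and effective_align() are hypothetical stand-ins, the 64-byte line size is an assumption, and the 0-to-1 conversion is taken from Stephen's description rather than verified against the source.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache line size; DPDK's RTE_CACHE_LINE_SIZE depends on the target. */
#define SKETCH_CACHE_LINE_SIZE 64u

/* Round v up to the next multiple of the (power-of-two) cache line size. */
#define SKETCH_CACHE_LINE_ROUNDUP(v) \
	(((v) + SKETCH_CACHE_LINE_SIZE - 1) & ~((size_t)SKETCH_CACHE_LINE_SIZE - 1))

static int is_power_of_two(size_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/* Alignment the caller would effectively get for a requested align value. */
static size_t effective_align(size_t align)
{
	assert(align == 0 || is_power_of_two(align));

	if (align == 0)
		align = 1; /* "align of 0 has been converted to 1" */

	return SKETCH_CACHE_LINE_ROUNDUP(align);
}
```

Under these assumptions, every request below the cache line size (0, 8, 16, 32) ends up with 64-byte alignment, and only values above 64 change the result.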
There are two reasons to ask for aligned memory. One is to fit a
certain primitive type into that region, knowing that certain CPUs may
be slow to, or unable to, do unaligned loads and stores (e.g., for MMX
registers).
Another is to avoid false sharing between the piece of memory you just
allocated and adjacent memory blocks (which you know nothing about).
I wonder if we aren't best off keeping those two concerns separate, and
maybe letting the memory allocator deal with both.
It seems the alignment-for-load/store concern can be solved simply by
having all allocations be naturally aligned up to a certain,
ISA-specific, size (16 bytes on x86_64, I think). By naturally aligned I
mean that a two-byte allocation would be aligned by 2, a four-byte by 4,
etc.
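
A small sketch of that rule, for illustration only. natural_align() and SKETCH_MAX_NATURAL_ALIGN are hypothetical names, the 16-byte cap is the x86_64 guess from above, and rounding sizes that are not powers of two (e.g., 3 bytes) up to the next power of two is my assumption, since the text only covers the 2- and 4-byte cases.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed ISA-specific cap on natural alignment (16 bytes guessed for x86_64). */
#define SKETCH_MAX_NATURAL_ALIGN 16u

/*
 * Natural alignment for an allocation of 'size' bytes: the smallest power
 * of two that is >= size, capped at SKETCH_MAX_NATURAL_ALIGN.
 */
static size_t natural_align(size_t size)
{
	size_t align = 1;

	while (align < size && align < SKETCH_MAX_NATURAL_ALIGN)
		align <<= 1;

	assert((align & (align - 1)) == 0); /* always a power of two */
	return align;
}
```

With this rule a 2-byte allocation is aligned by 2, a 4-byte one by 4, and anything of 16 bytes or more by 16.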
The false sharing issue is a more difficult one. In a world without
next-line prefetchers (or more elaborate variants thereof), you could
just cache-align every distinct allocation (which I'm guessing is the
rationale for malloc_heap.c:239). In the situation we seem to be in
today, where not only the line the core loads/stores to is fetched, but
also the next (few?) line(s), no amount of struct alignment will fix
the issue - you need guaranteed padding. You would also want a global
knob to turn all that extra padding off, since disabling hardware
prefetchers may well be possible (as well as impossible).
A scenario you want to take into account is one where you have a large
amount of relatively rarely accessed data, where you don't need to worry
about false sharing, and thus you don't want any alignment or padding
beyond what the ISA requires. That would just make the whole thing grow,
potentially by a lot.
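
To put rough numbers on "grow", here is a back-of-the-envelope sketch comparing the per-object footprint of a compact layout with a cache-aligned, padded one. All names are hypothetical, the 64-byte line size is an assumption, and the model (padding counted once per block, as if shared with the neighbouring block) is a simplification, not how any existing allocator accounts for space.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_CACHE_LINE_SIZE 64u /* assumed cache line size */

/* Round v up to the next multiple of 'multiple'. */
static size_t round_up(size_t v, size_t multiple)
{
	assert(multiple != 0);
	return (v + multiple - 1) / multiple * multiple;
}

/* Footprint when only the ISA-required alignment is honoured. */
static size_t compact_footprint(size_t size, size_t isa_align)
{
	return round_up(size, isa_align);
}

/*
 * Footprint when the block is cache-line aligned and followed by
 * 'guard_lines' cache lines of padding against next-line prefetching.
 */
static size_t guarded_footprint(size_t size, unsigned int guard_lines)
{
	return round_up(size, SKETCH_CACHE_LINE_SIZE) +
		(size_t)guard_lines * SKETCH_CACHE_LINE_SIZE;
}
```

In this model a 24-byte object takes 32 bytes with 16-byte alignment, but 128 bytes with cache-line alignment and one guard line, i.e. four times as much.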
That leads me to something like

    void *rte_malloc_socket(size_t n, int socket, unsigned int flags);

Where you get memory which is naturally aligned, and "false
sharing-protected" by default. Such protection would entail having
enough padding *between* blocks (both before, and after). With this API,
libs/apps/PMDs should not use any __rte_cache_aligned or RTE_CACHE_GUARD
type constructs (except for block-internal padding, which one might
argue shouldn't be used).
With a flag

    #define RTE_MALLOC_FLAG_HINT_RARELY_USED

the application specifies that this data is rarely accessed, so false
sharing is not a concern -> no padding is required. Or you turn it
around, so "rarely used" is the default.
You could also have a flag

    #define RTE_MALLOC_FLAG_NO_ALIGNMENT

which would turn off natural alignment, potentially saving some space.
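
A sketch of how such flags might steer the allocator's layout decisions. The two flag names are the ones proposed above; their bit values, struct sketch_layout, layout_for() and the SKETCH_* constants are all assumptions made up for this illustration, and none of this exists in DPDK.

```c
#include <assert.h>
#include <stddef.h>

/* Flag names as proposed above; the bit values are assumptions. */
#define RTE_MALLOC_FLAG_HINT_RARELY_USED (1u << 0)
#define RTE_MALLOC_FLAG_NO_ALIGNMENT     (1u << 1)

#define SKETCH_CACHE_LINE_SIZE   64u /* assumed cache line size */
#define SKETCH_MAX_NATURAL_ALIGN 16u /* assumed ISA-specific cap */
#define SKETCH_GUARD_LINES       1u  /* assumed global padding knob */

/* Hypothetical description of how a block would be laid out. */
struct sketch_layout {
	size_t align;   /* alignment of the returned pointer */
	size_t padding; /* guard bytes before and after the block */
};

static struct sketch_layout layout_for(size_t size, unsigned int flags)
{
	struct sketch_layout l = { .align = 1, .padding = 0 };

	assert((flags & ~(RTE_MALLOC_FLAG_HINT_RARELY_USED |
			  RTE_MALLOC_FLAG_NO_ALIGNMENT)) == 0);

	/* Default: natural alignment, capped at the ISA-specific size. */
	if (!(flags & RTE_MALLOC_FLAG_NO_ALIGNMENT)) {
		while (l.align < size && l.align < SKETCH_MAX_NATURAL_ALIGN)
			l.align <<= 1;
	}

	/* Default: false sharing protection through guaranteed padding. */
	if (!(flags & RTE_MALLOC_FLAG_HINT_RARELY_USED))
		l.padding = SKETCH_GUARD_LINES * SKETCH_CACHE_LINE_SIZE;

	return l;
}
```

Setting SKETCH_GUARD_LINES to 0 would play the role of the global knob for systems where the hardware prefetchers are disabled.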
Another related thing I think would be very useful is to have per-lcore
heaps, or something to that effect. Then you could allocate memory that
you know, with some certainty, will only be *frequently* accessed from a
particular core. (Same MT safe alloc/free, still.)

    void *rte_malloc_lcore(size_t n, unsigned int lcore_id, unsigned int flags);

In that case, there's no need to cache-align the data (rather the
opposite, it will just make the effective working set grow and cache
misses increase).
Just to be clear: per-lcore heaps aren't for making the allocations go
faster (although that might happen too), they're to allow apps,
libraries and PMDs to spatially organize data in such a manner that
items of data primarily accessed by one lcore are close to each other,
rather than items allocated by a particular module/driver. Working
with, not against, the CPU.
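
A toy sketch of the spatial-organization idea, using one bump arena per lcore. Everything here (sketch_malloc_lcore(), struct sketch_arena, the sizes) is hypothetical; it has no free(), no flags argument, and is not MT safe, all of which a real per-lcore heap would need.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_LCORE  4u    /* assumed number of lcores */
#define SKETCH_ARENA_SIZE 4096u /* assumed per-lcore arena size */
#define SKETCH_ALIGN      16u   /* assumed ISA-level alignment */

/* One arena per lcore; only arena boundaries are cache-line aligned. */
struct sketch_arena {
	_Alignas(64) unsigned char mem[SKETCH_ARENA_SIZE];
	size_t used;
};

static struct sketch_arena arenas[SKETCH_MAX_LCORE];

/*
 * Bump-allocate n bytes from the arena of the lcore expected to access
 * the data most frequently. Blocks for the same lcore are packed next to
 * each other, with no per-block cache-line alignment or padding.
 */
static void *sketch_malloc_lcore(size_t n, unsigned int lcore_id)
{
	assert(lcore_id < SKETCH_MAX_LCORE);

	struct sketch_arena *a = &arenas[lcore_id];
	size_t start = (a->used + SKETCH_ALIGN - 1) & ~((size_t)SKETCH_ALIGN - 1);

	if (n > SKETCH_ARENA_SIZE || start > SKETCH_ARENA_SIZE - n)
		return NULL; /* arena exhausted */

	a->used = start + n;
	return &a->mem[start];
}
```

Two 24-byte allocations for the same lcore end up 32 bytes apart, likely sharing a cache line, regardless of which module requested them, while data for another lcore lives in a different arena.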
If you combine such a feature with per-lcore static
(initialization-time) allocations, I don't see why almost all
__rte_cache_aligned in DPDK and DPDK-based apps wouldn't go away. How
much padding is that? I wonder how much of the memory working set
resident in the cache hierarchy in the typical DPDK app is padding.
> Then the elements are examined with:
>
> size_t
> malloc_elem_find_max_iova_contig(struct malloc_elem *elem, size_t align)
>
> But I don't see anywhere that 0 converts to being aligned on sizeof(double)
> which is the largest type.
>
I also didn't find this.
> Not sure who has expertise here? The allocator is a bit of a problem child.
> It is complex, slow and critical.
Not a great combo. Is anyone planning to attempt to improve upon this
situation?