DPDK patches and discussions
From: "Mattias Rönnblom" <hofors@lysator.liu.se>
To: Stephen Hemminger <stephen@networkplumber.org>
Cc: "dev@dpdk.org" <dev@dpdk.org>,
	"Mattias Rönnblom" <mattias.ronnblom@ericsson.com>
Subject: Re: rte_malloc() and alignment
Date: Wed, 7 Feb 2024 09:23:03 +0100
Message-ID: <26057e70-4b67-4684-a956-498351994583@lysator.liu.se>
In-Reply-To: <20240206204622.6fc99cac@hermes.local>

On 2024-02-07 05:46, Stephen Hemminger wrote:
> On Tue, 6 Feb 2024 17:17:31 +0100
> Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> 
>> The rte_malloc() API documentation has the following to say about the
>> align parameter:
>>
>> "If 0, the return is a pointer that is suitably aligned for any kind of
>> variable (in the same manner as malloc()). Otherwise, the return is a
>> pointer that is a multiple of align. In this case, it must be a power of
>> two. (Minimum alignment is the cacheline size, i.e. 64-bytes)"
>>
>> After reading this, one might be left with the impression that the
>> parenthesis refers to only the "otherwise" (non-zero-align) case, since
>> surely, cache line alignment should be sufficient for any kind of
>> variable and its semantics would be "in the same manner as malloc()".
>>
>> However, in the actual RTE malloc implementation, any align parameter
>> value less than RTE_CACHE_LINE_SIZE results in an alignment of
>> RTE_CACHE_LINE_SIZE, unless I'm missing something.
>>
>> Is there any conceivable scenario where passing a non-zero align
>> parameter is useful?
>>
>> Would it be an improvement to rephrase the documentation to:
>>
>> "The alignment of the allocated memory meets all of the following criteria:
>> 1) able to hold any built-in type.
>> 2) be at least as large as the align parameter.
>> 3) be at least as large as RTE_CACHE_LINE_SIZE.
>>
>> The align parameter must be a power-of-2 or 0.
>> "
>>
>> ...so it actually describes what is implemented? And also adds the
>> theoretical (?) case of a built-in type requiring > RTE_CACHE_LINE_SIZE
>> amount of alignment.
> 
> My reading is that an align of 0 means that rte_malloc() should act
> the same as malloc(), and give alignment for the largest type.
> 

That would be mine as well, if my Bayesian prior hadn't been "doesn't
DPDK cache-align *all* heap allocations?".

> Walking through the code, the real work is in
> malloc_heap_alloc_on_heap_id(), and at this point an align of 0 has
> been converted to 1.
> 
> /*
>   * Iterates through the freelist for a heap to find a free element with the
>   * biggest size and requested alignment. Will also set size to whatever element
>   * size that was found.
>   * Returns null on failure, or pointer to element on success.
>   */
> static struct malloc_elem *
> find_biggest_element(struct malloc_heap *heap, size_t *size,
> 		unsigned int flags, size_t align, bool contig)
> 
> 

I continued to heap_alloc() (malloc_heap.c:239), and there one can find:

align = RTE_CACHE_LINE_ROUNDUP(align);

That's where I stopped, still knowing I was pretty clueless with regard
to the full picture.
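
To make the effect of that round-up concrete, here is a minimal sketch.
It does not use the DPDK headers: the macro and function names are my
own stand-ins, and the 64-byte cache line is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines; stand-in for RTE_CACHE_LINE_SIZE. */
#define SKETCH_CACHE_LINE_SIZE 64

/* Stand-in for RTE_CACHE_LINE_ROUNDUP(): round up to a whole number of lines. */
static size_t
sketch_cache_line_roundup(size_t v)
{
	return (v + SKETCH_CACHE_LINE_SIZE - 1) / SKETCH_CACHE_LINE_SIZE *
		SKETCH_CACHE_LINE_SIZE;
}

/* Effective alignment for a caller-supplied align (0 or a power of two). */
static size_t
sketch_effective_align(size_t align)
{
	/* The API documentation requires 0 or a power of two. */
	assert((align & (align - 1)) == 0);
	if (align == 0)
		align = 1; /* per the walk-through quoted above, 0 has become 1 here */
	return sketch_cache_line_roundup(align);
}
```

With these definitions, an align of 0, 8 or 64 all end up as 64, and
only values above the cache line size make a difference.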

There are two reasons to ask for aligned memory. One is to fit a
certain primitive type into that region, knowing that certain CPUs may
be slow at, or incapable of, unaligned loads and stores (e.g., for MMX
registers).

Another is to avoid false sharing between the piece of memory you just 
allocated and adjacent memory blocks (which you know nothing about).

I wonder if we aren't best off keeping those two concerns separate, and
maybe letting the memory allocator deal with both.

It seems you can solve alignment-for-load/store simply by having all
allocations be naturally aligned up to a certain, ISA-specific, size (16
bytes on x86_64, I think). By naturally aligned I mean that a two-byte
allocation would be aligned by 2, a four-byte by 4 etc.
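
A minimal sketch of that rule follows. The function name is
hypothetical, the 16-byte cap is the x86_64 guess from above, and
rounding sizes that are not a power of two up to the next power of two
is my assumption, since the rule as stated only covers 2, 4, etc.

```c
#include <stddef.h>

/* Assumption: no load/store needs more than 16-byte alignment. */
#define SKETCH_MAX_NATURAL_ALIGN 16

/* Natural alignment for an allocation of the given size. */
static size_t
sketch_natural_align(size_t size)
{
	size_t align = 1;

	/* Smallest power of two >= size, capped at the ISA-specific limit. */
	while (align < size && align < SKETCH_MAX_NATURAL_ALIGN)
		align *= 2;
	return align;
}
```

A 2-byte request gets alignment 2, a 4-byte request gets 4, and
anything of 16 bytes or more gets 16.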

The false sharing issue is a more difficult one. In a world without
next-line prefetchers (or more elaborate variants thereof), you could
just cache-align every distinct allocation (which I'm guessing is the
rationale for malloc_heap.c:239). In the situation we seem to be in
today, where not only the line the core loads from or stores to is
fetched but also the next (few?) line(s), no amount of struct alignment
will fix the issue - you need guaranteed padding. You would also want a
global knob to turn all that extra padding off, since disabling the
hardware prefetchers may or may not be possible on a given system.
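
As a rough sketch of what guaranteed padding plus a global knob could
cost per block: all names below are hypothetical, the 64-byte line size
is an assumption, and I assume one guard region after each block, which
then also serves as the guard before the next block.

```c
#include <stddef.h>

#define SKETCH_CACHE_LINE_SIZE 64 /* assumption */

/*
 * Hypothetical global knob: number of extra lines a prefetcher may pull
 * in beyond the accessed one. 0 means prefetchers are disabled.
 */
static unsigned int sketch_guard_lines = 1;

/* Bytes that separate the end of one block from the start of the next. */
static size_t
sketch_guard_bytes(void)
{
	return (size_t)sketch_guard_lines * SKETCH_CACHE_LINE_SIZE;
}

/* Footprint of an n-byte block: whole cache lines plus the guard. */
static size_t
sketch_block_footprint(size_t n)
{
	size_t lines = (n + SKETCH_CACHE_LINE_SIZE - 1) / SKETCH_CACHE_LINE_SIZE;

	return lines * SKETCH_CACHE_LINE_SIZE + sketch_guard_bytes();
}
```

With one guard line, an 8-byte allocation occupies 128 bytes; with the
knob set to 0 it falls back to 64.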

A scenario you want to take into account is one where you have a large
amount of relatively rarely accessed data, where you don't need to worry
about false sharing, and thus you don't want any alignment or padding
beyond what the ISA requires. Extra alignment or padding would just make
the whole thing grow, potentially by a lot.

That leads me to something like

void *rte_malloc_socket(size_t n, int socket, unsigned int flags);

Where you get memory which is naturally aligned, and "false 
sharing-protected" by default. Such protection would entail having 
enough padding *between* blocks (both before, and after). With this API, 
libs/apps/PMDs should not use any __rte_cache_aligned or RTE_CACHE_GUARD 
type constructs (except for block-internal padding, which one might 
argue shouldn't be used).

With a flag

#define RTE_MALLOC_FLAG_HINT_RARELY_USED

the application specifies that this data is rarely accessed, so false 
sharing is not a concern -> no padding is required. Or you turn it 
around, so "rarely used" is the default.

You could also have a flag
#define RTE_MALLOC_FLAG_NO_ALIGNMENT
which would turn off natural alignment, potentially saving some space.
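
To illustrate how the two flags could drive the allocator's decisions,
here is a sketch that only computes the layout policy and allocates
nothing. The SKETCH_ names, the flag values, the 64-byte line size and
the 16-byte natural alignment cap are all assumptions of mine, not
anything DPDK defines.

```c
#include <stddef.h>

/* Hypothetical flag values; the proposal does not assign any. */
#define SKETCH_FLAG_HINT_RARELY_USED (1u << 0)
#define SKETCH_FLAG_NO_ALIGNMENT     (1u << 1)

#define SKETCH_CACHE_LINE_SIZE   64 /* assumption */
#define SKETCH_MAX_NATURAL_ALIGN 16 /* assumption */

struct sketch_layout {
	size_t align;   /* required alignment of the block start */
	size_t padding; /* guard bytes kept before and after the block */
};

/* Derive alignment and padding for an n-byte request from the flags. */
static struct sketch_layout
sketch_layout_for(size_t n, unsigned int flags)
{
	struct sketch_layout l = { .align = 1, .padding = SKETCH_CACHE_LINE_SIZE };

	/* Default: natural alignment, capped at the ISA-specific limit. */
	if (!(flags & SKETCH_FLAG_NO_ALIGNMENT)) {
		while (l.align < n && l.align < SKETCH_MAX_NATURAL_ALIGN)
			l.align *= 2;
	}
	/* Rarely used data: false sharing is not a concern, so no padding. */
	if (flags & SKETCH_FLAG_HINT_RARELY_USED)
		l.padding = 0;
	return l;
}
```

By default a 4-byte request gets alignment 4 and a cache line of
padding on each side; the flags strip one or both.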

Another related thing I think would be very useful is to have per-lcore
heaps, or something to that effect. Then you could allocate memory that
you know, with some certainty, will only be *frequently* accessed from a
particular core. (Same MT safe alloc/free, still.)

void *rte_malloc_lcore(size_t n, unsigned int lcore_id, unsigned int flags);

In that case, there's no need to cache-align the data (rather the
opposite, it will just make the effective working set grow and cache
misses increase).

Just to be clear: per-lcore heaps aren't for making the allocations go
faster (although that might happen too), they're to allow apps, libraries
and PMDs to spatially organize data in such a manner that items of data
primarily accessed by one lcore are close to each other, rather than
items allocated by a particular module/driver. Working with, not
against, the CPU.
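
A toy sketch of the spatial effect, using one bump arena per lcore. It
is not MT safe, never frees, and every name and size in it is made up
for illustration; a real implementation would look nothing like this.

```c
#include <stddef.h>

#define SKETCH_MAX_LCORE  4    /* illustration only */
#define SKETCH_ARENA_SIZE 4096 /* illustration only */

/* One contiguous arena per lcore, plus a bump offset for each. */
static _Alignas(16) unsigned char sketch_arena[SKETCH_MAX_LCORE][SKETCH_ARENA_SIZE];
static size_t sketch_used[SKETCH_MAX_LCORE];

/* Allocate n bytes from the arena of the given lcore, or return NULL. */
static void *
sketch_malloc_lcore(size_t n, unsigned int lcore_id)
{
	size_t start;

	if (lcore_id >= SKETCH_MAX_LCORE)
		return NULL;
	/* 8-byte alignment only; no cache-line alignment and no padding. */
	start = (sketch_used[lcore_id] + 7) & ~(size_t)7;
	if (start > SKETCH_ARENA_SIZE || n > SKETCH_ARENA_SIZE - start)
		return NULL;
	sketch_used[lcore_id] = start + n;
	return &sketch_arena[lcore_id][start];
}
```

Two 8-byte allocations for lcore 0 end up back to back on the same
cache line, regardless of which module asked for them, while an
allocation for lcore 1 lands in a different arena.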

If you combine such a feature with per-lcore static 
(initialization-time) allocations, I don't see why almost all 
__rte_cache_aligned in DPDK and DPDK-based apps wouldn't go away. How 
much padding is that? I wonder how much of the memory working set 
resident in the cache hierarchy in the typical DPDK app is padding.
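
For a feel of the numbers, a small sketch with a hypothetical two-counter
statistics struct, cache-aligned in the style of __rte_cache_aligned
using plain C11 alignas (64-byte lines assumed):

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CACHE_LINE_SIZE 64 /* assumption */

/* Hypothetical per-lcore statistics; only 16 bytes carry data. */
struct sketch_stats {
	alignas(SKETCH_CACHE_LINE_SIZE) uint64_t rx_packets;
	uint64_t tx_packets;
};

/* Bytes of each struct instance that are pure padding. */
static size_t
sketch_padding_bytes(void)
{
	return sizeof(struct sketch_stats) - 2 * sizeof(uint64_t);
}
```

Here sizeof(struct sketch_stats) is 64 and 48 of those bytes are
padding, so three quarters of every such element in the cache is unused.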

> Then the elements are examined with:
> 
> size_t
> malloc_elem_find_max_iova_contig(struct malloc_elem *elem, size_t align)
> 
> But I don't see anywhere that 0 converts to being aligned on sizeof(double)
> which is the largest type.
> 

I also didn't find this.

> Not sure who has expertise here? The allocator is a bit of a problem child.
> It is complex, slow and critical.

Not a great combo. Is anyone planning to attempt to improve upon this 
situation?

Thread overview: 4+ messages
2024-02-06 16:17 Mattias Rönnblom
2024-02-07  4:46 ` Stephen Hemminger
2024-02-07  8:05   ` Dmitry Kozlyuk
2024-02-07  8:23   ` Mattias Rönnblom [this message]
