Subject: Re: rte_malloc() and alignment
From: Mattias Rönnblom
To: Stephen Hemminger
Cc: dev@dpdk.org, Mattias Rönnblom
Date: Wed, 7 Feb 2024 09:23:03 +0100
Message-ID: <26057e70-4b67-4684-a956-498351994583@lysator.liu.se>
In-Reply-To: <20240206204622.6fc99cac@hermes.local>

On 2024-02-07 05:46, Stephen Hemminger wrote:
> On Tue, 6 Feb 2024 17:17:31 +0100
> Mattias Rönnblom wrote:
>
>> The rte_malloc() API documentation has the following to say about the
>> align parameter:
>>
>> "If 0, the return is a pointer that is suitably aligned for any kind of
>> variable (in the same manner as malloc()). Otherwise, the return is a
>> pointer that is a multiple of align. In this case, it must be a power of
>> two. (Minimum alignment is the cacheline size, i.e. 64-bytes)"
>>
>> After reading this, one might be left with the impression that the
>> parenthesis refers only to the "otherwise" (non-zero-align) case, since
>> surely, cache line alignment should be sufficient for any kind of
>> variable, and its semantics would be "in the same manner as malloc()".
>>
>> However, in the actual RTE malloc implementation, any align parameter
>> value less than RTE_CACHE_LINE_SIZE results in an alignment of
>> RTE_CACHE_LINE_SIZE, unless I'm missing something.
>>
>> Is there any conceivable scenario where passing a non-zero align
>> parameter is useful?
>>
>> Would it be an improvement to rephrase the documentation to:
>>
>> "The alignment of the allocated memory meets all of the following
>> criteria:
>> 1) able to hold any built-in type.
>> 2) be at least as large as the align parameter.
>> 3) be at least as large as RTE_CACHE_LINE_SIZE.
>>
>> The align parameter must be a power-of-2 or 0.
>> "
>>
>> ...so it actually describes what is implemented? And also adds the
>> theoretical (?) case of a built-in type requiring > RTE_CACHE_LINE_SIZE
>> amount of alignment.
>
> My reading is that align of 0 means that rte_malloc() should act the
> same as malloc(), and give alignment for the largest type.
>

That would be mine as well, if my Bayesian prior hadn't been "doesn't
DPDK cache-align *all* heap allocations?".

> Walking through the code, the real work is in
> malloc_heap_alloc_on_heap_id(), and at this point align of 0 has been
> converted to 1.
>
> /*
>  * Iterates through the freelist for a heap to find a free element with the
>  * biggest size and requested alignment. Will also set size to whatever element
>  * size that was found.
>  * Returns null on failure, or pointer to element on success.
>  */
> static struct malloc_elem *
> find_biggest_element(struct malloc_heap *heap, size_t *size,
>		unsigned int flags, size_t align, bool contig)
>

I continued to heap_alloc() (malloc_heap.c:239), and there one can find:

align = RTE_CACHE_LINE_ROUNDUP(align);

That's where I stopped, still knowing I was pretty clueless with regard
to the full picture.
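Just to make the observed behavior concrete, here is a minimal
(untested) sketch of the kind of check I have in mind. It only uses
existing API (rte_malloc(), rte_free(), RTE_CACHE_LINE_SIZE); if the
RTE_CACHE_LINE_ROUNDUP() above applies, both pointers should come
back cache-line aligned, align = 0 or not:

#include <stdint.h>
#include <stdio.h>

#include <rte_common.h>
#include <rte_eal.h>
#include <rte_malloc.h>

int
main(int argc, char **argv)
{
	if (rte_eal_init(argc, argv) < 0)
		return 1;

	/* align = 0: "suitably aligned for any kind of variable" */
	void *a = rte_malloc(NULL, 24, 0);
	/* align = 8: explicitly asking for less than a cache line */
	void *b = rte_malloc(NULL, 24, 8);

	printf("align=0: %s\n", (uintptr_t)a % RTE_CACHE_LINE_SIZE == 0 ?
	       "cache-line aligned" : "not cache-line aligned");
	printf("align=8: %s\n", (uintptr_t)b % RTE_CACHE_LINE_SIZE == 0 ?
	       "cache-line aligned" : "not cache-line aligned");

	rte_free(a);
	rte_free(b);

	return rte_eal_cleanup();
}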
There are two reasons to ask for aligned memory. One is to fit a
certain primitive type into that region, knowing that certain CPUs may
be slow at, or unable to do, unaligned loads and stores (e.g., for MMX
registers). Another is to avoid false sharing between the piece of
memory you just allocated and adjacent memory blocks (which you know
nothing about).

I wonder if we aren't best off keeping those two concerns separate, and
maybe letting the memory allocator deal with both.

The alignment-for-load/store concern you can solve just by having all
allocations be naturally aligned, up to a certain, ISA-specific, size
(16 bytes on x86_64, I think). By naturally aligned I mean that a
two-byte allocation would be aligned by 2, a four-byte one by 4, etc.

The false sharing issue is a more difficult one. In a world without
next-line prefetchers (or more elaborate variants thereof), you could
just cache-align every distinct allocation (which I'm guessing is the
rationale for malloc_heap.c:239). In the situation we seem to be in
today, where not only the line the core loads/stores to is fetched, but
also the next (few?) line(s) as well, no amount of struct alignment
will fix the issue - you need guaranteed padding. You would also want a
global knob to turn all that extra padding off, since disabling the
hardware prefetchers may well be possible (or impossible, depending on
the system).

A scenario you want to take into account is one where you have a large
amount of relatively rarely accessed data, where you don't need to
worry about false sharing, and thus you don't want any alignment or
padding beyond what the ISA requires. Extra padding would just make the
whole thing grow, potentially by a lot.

That leads me to something like

void *rte_malloc_socket(size_t n, int socket, unsigned int flags);

where you get memory which is naturally aligned, and "false
sharing-protected", by default. Such protection would entail having
enough padding *between* blocks (both before, and after). With this
API, libs/apps/PMDs should not use any __rte_cache_aligned or
RTE_CACHE_GUARD type constructs (except for block-internal padding,
which one might argue shouldn't be used).

With a flag

#define RTE_MALLOC_FLAG_HINT_RARELY_USED

the application specifies that this data is rarely accessed, so false
sharing is not a concern -> no padding is required. Or you turn it
around, so "rarely used" is the default.

You could also have a flag

#define RTE_MALLOC_FLAG_NO_ALIGNMENT

which would turn off natural alignment, potentially saving some space.
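To make that a bit more concrete, here is a rough sketch of the
interface I have in mind. Everything below - the signature, the flag
values, the padding policy - is hypothetical and for discussion only;
in particular, it is not the rte_malloc_socket() DPDK has today:

/* Data is rarely accessed; skip the anti-false-sharing padding. */
#define RTE_MALLOC_FLAG_HINT_RARELY_USED (1U << 0)

/* Skip natural alignment as well, potentially saving space. */
#define RTE_MALLOC_FLAG_NO_ALIGNMENT (1U << 1)

/*
 * Allocate n bytes on the given NUMA socket. By default the block is
 * naturally aligned (up to some ISA-specific size) and padded, before
 * and after, so that it does not share cache lines - or next-line
 * prefetched lines - with adjacent blocks.
 */
void *rte_malloc_socket(size_t n, int socket, unsigned int flags);

A big, rarely-touched table (route_entry is just a made-up example)
would then get neither padding nor extra alignment:

	struct route_entry *tbl;

	tbl = rte_malloc_socket(1000000 * sizeof(*tbl), rte_socket_id(),
				RTE_MALLOC_FLAG_HINT_RARELY_USED);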
Another related thing I think would be very useful is to have per-lcore
heaps, or something to that effect. Then you could allocate memory that
you know, with some certainty, will only be *frequently* accessed from
a particular core. (Still with MT-safe alloc/free.)

void *rte_malloc_lcore(size_t n, unsigned int lcore_id, unsigned int flags);

In that case, there's no need to cache-align the data (rather the
opposite; that would just make the effective working set grow and cache
misses increase).

Just to be clear: per-lcore heaps aren't for making the allocations go
faster (although that might happen too), they're to allow apps,
libraries and PMDs to spatially organize data in such a manner that
items of data primarily accessed by one lcore are close to each other,
rather than items allocated by a particular module/driver. Working
with, not against, the CPU.

If you combine such a feature with per-lcore static
(initialization-time) allocations, I don't see why almost all
__rte_cache_aligned in DPDK and DPDK-based apps wouldn't go away. How
much padding is that? I wonder how much of the memory working set
resident in the cache hierarchy of the typical DPDK app is padding.

> Then the elements are examined with:
>
> size_t
> malloc_elem_find_max_iova_contig(struct malloc_elem *elem, size_t align)
>
> But I don't see anywhere that 0 converts to being aligned on
> sizeof(double), which is the largest type.
>

I also didn't find this.

> Not sure who has expertise here? The allocator is a bit of a problem
> child. It is complex, slow and critical. Not a great combo.

Is anyone planning to attempt to improve upon this situation?
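P.S. To make the per-lcore heap idea above a little more concrete, a
rough sketch of how it might look from the caller's side.
rte_malloc_lcore() and struct my_lcore_state are made up for the sake
of the example; nothing like this exists in DPDK today.

#include <stdint.h>

#include <rte_lcore.h>

/* Hypothetical API, as proposed above - not existing DPDK. */
void *rte_malloc_lcore(size_t n, unsigned int lcore_id, unsigned int flags);

/* Made-up per-lcore state, for the sake of the example. */
struct my_lcore_state {
	uint64_t pkts_processed;
};

static struct my_lcore_state *lcore_state[RTE_MAX_LCORE];

/*
 * Allocate each worker lcore's state from that lcore's own heap, so
 * the allocator - not __rte_cache_aligned on the struct - keeps
 * different lcores' frequently-accessed data apart.
 */
static int
init_lcore_state(void)
{
	unsigned int lcore_id;

	RTE_LCORE_FOREACH_WORKER(lcore_id) {
		struct my_lcore_state *state =
			rte_malloc_lcore(sizeof(*state), lcore_id, 0);

		if (state == NULL)
			return -1;

		lcore_state[lcore_id] = state;
	}

	return 0;
}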