Subject: Re: [PATCH v2 1/6] eal: add static per-lcore memory allocation facility
From: Mattias Rönnblom
To: Morten Brørup, Mattias Rönnblom, Jerin Jacob, thomas@monjalon.net
Cc: dev@dpdk.org, Chengwen Feng, Stephen Hemminger, Konstantin Ananyev, David Marchand, Anatoly Burakov
Date: Tue, 15 Oct 2024 08:29:19 +0200
Message-ID: <07e55111-2ba0-4465-b866-8af8ad5d7cd1@lysator.liu.se>
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35E9F7D0@smartserver.smartshare.dk>

On 2024-10-14 09:56, Morten Brørup wrote:
>> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
>> Sent: Wednesday, 18 September 2024 12.12
>>
>> On Thu, Sep 12, 2024 at 8:52 PM Jerin Jacob wrote:
>>>
>>> On Thu, Sep 12, 2024 at 7:11 PM Morten Brørup wrote:
>>>>
>>>>> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
>>>>> Sent: Thursday, 12 September 2024 15.17
>>>>>
>>>>> On Thu, Sep 12, 2024 at 2:40 PM Morten Brørup wrote:
>>>>>>
>>>>>>> +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
>>>>>>
>>>>>> Considering hugepages...
>>>>>>
>>>>>> Lcore variables may be allocated before DPDK's memory allocator
>>>>>> (rte_malloc()) is ready, so rte_malloc() cannot be used for lcore
>>>>>> variables.
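
To make the ordering issue concrete, here is a toy example (nothing
from the patch; all names are invented). RTE_INIT() constructors run
before main(), and thus before rte_eal_init(), so rte_malloc() has
nothing to allocate from at that point and only the libc heap (or
static storage) is usable:

#include <stdlib.h>

#include <rte_common.h>
#include <rte_malloc.h>

/* Hypothetical module-level per-lcore state, for illustration only. */
static void *example_lcore_state;

RTE_INIT(example_lib_init)
{
	/* Runs as a constructor, i.e. before rte_eal_init(), so
	 * rte_malloc() returns NULL here. */
	example_lcore_state = rte_malloc(NULL, 1024, 0);

	if (example_lcore_state == NULL)
		example_lcore_state = malloc(1024); /* libc heap fallback */
}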
>>>>>>
>>>>>> And lcore variables are not usable (shared) for DPDK multi-process,
>>>>>> so the lcore_buffer could be allocated through the O/S APIs as
>>>>>> anonymous hugepages, instead of using rte_malloc().
>>>>>>
>>>>>> The alternative, using rte_malloc(), would disallow allocating
>>>>>> lcore variables before DPDK's memory allocator has been
>>>>>> initialized, which I think is too late.
>>>>>
>>>>> I thought it is not. A lot of the subsystems are initialized after
>>>>> the memory subsystem is initialized.
>>>>> [1] is an example given in the documentation. I thought RTE_INIT
>>>>> needs to be replaced if the subsystem is initialized after the
>>>>> memory subsystem (which is the case for most of the libraries).
>>>>
>>>> The list of RTE_INIT functions is called before main(). It is not
>>>> very useful.
>>>>
>>>> Yes, it would be good to replace (or supplement) RTE_INIT_PRIO with
>>>> something similar, which calls the list of "INIT" functions at the
>>>> appropriate time during EAL initialization.
>>>>
>>>> DPDK should then use this "INIT" list for all its initialization, so
>>>> the init functions of new features (such as this, and trace) can be
>>>> inserted at the correct location in the list.
>>>>
>>>>> The trace library had a similar situation. It is managed like [2].
>>>>
>>>> Yes, if we insist on using rte_malloc() for lcore variables, the
>>>> alternative is to prohibit establishing lcore variables in functions
>>>> called through RTE_INIT.
>>>
>>> I was not insisting on using ONLY rte_malloc(), since rte_malloc()
>>> can be called before rte_eal_init() (it will return NULL). The alloc
>>> routine can first check whether rte_malloc() is available and, if
>>> not, switch over to glibc.
>>
>> @Mattias Rönnblom This comment is not addressed in v7. Could you check?
>
> Mattias, following up on Jerin's suggestion:
>
> When allocating an lcore variable and the buffer holding lcore
> variables is out of space (or was never allocated), a new buffer is
> allocated.
>
> Here's the twist I think Jerin is asking for:
> You could check if rte_malloc() is available, and use that (instead of
> the heap) when allocating a new buffer holding lcore variables.
> This check can be performed (aggressively) every time a new lcore
> variable is allocated, or (conservatively) only when a new buffer is
> allocated.
>
>
> Now, if using hugepages, the value of RTE_MAX_LCORE_VAR (the maximum
> size of one lcore variable instance) becomes more important.
>
> Let's consider systems with 2 MB hugepages:
>
> If a system supports two lcores (RTE_MAX_LCORE is 2), the current
> RTE_MAX_LCORE_VAR default of 1 MB is a perfect match; it will use 2 MB
> of RAM as one 2 MB hugepage.
>
> If it supports 128 lcores, the current RTE_MAX_LCORE_VAR default of
> 1 MB will use 128 MB of RAM.
>
> If we scale it back so it only uses one 2 MB hugepage, RTE_MAX_LCORE_VAR
> will have to be 2 MB / 128 lcores = 16 KB.
> 16 KB might be too small. E.g. a mempool cache uses 2 * 512 *
> sizeof(void *) = 8 KB, plus a few bytes of information about the
> cache. So I can easily point at one example where 16 KB gets very
> close to the edge.
>
> So, as you already asked, what is a reasonable default minimum value
> of RTE_MAX_LCORE_VAR?
>
> Maybe we should just stick with your initial suggestion (1 MB) and see
> how it goes.
>

Sure. Let's stick with 1 MB. I'm guessing that if/when someone takes a
closer look at how to do per-lcore *dynamic* allocations, this API and
its implementation will be revisited as well.
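
As for the rte_malloc()-or-heap fallback Jerin suggested, a minimal
sketch of the buffer allocation could look like the following. This is
an illustration of the idea only, not the patch's current code, and a
real version would have to tell "EAL not yet initialized" apart from a
genuine allocation failure:

#include <stdlib.h>

#include <rte_common.h>
#include <rte_malloc.h>

#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)

static void *
lcore_buffer_alloc(void)
{
	/* Prefer the EAL allocator (hugepage-backed) when it is up... */
	void *buf = rte_malloc(NULL, LCORE_BUFFER_SIZE,
			       RTE_CACHE_LINE_SIZE);

	/* ...and fall back to the libc heap before rte_eal_init(),
	 * where rte_malloc() just returns NULL. */
	if (buf == NULL)
		buf = aligned_alloc(RTE_CACHE_LINE_SIZE, LCORE_BUFFER_SIZE);

	return buf;
}

The fallback could also mmap() anonymous hugepages directly, along the
lines suggested earlier in the thread.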

>
> At the recent DPDK Summit, we discussed memory consumption in one of
> the workshops.
> One of the possible means of reducing memory consumption is making
> RTE_MAX_LCORE dynamic, so an application using only a few cores will
> scale its per-lcore tables to the actual number of lcores, instead of
> scaling to some hardcoded maximum.
>
> With this in mind, I'm less worried about the RTE_MAX_LCORE multiplier.
>
> An interesting hack would be to disable hugepage usage, set up a swap
> file on a zram device, and then MADV_PAGEOUT the DPDK process after
> startup. I wonder how much smaller the DPDK process RSS would be once
> it had paged back in all the pages that were actually required.
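
The page-out part of that experiment could be done from inside the
process with something like the sketch below (untested; MADV_PAGEOUT
requires Linux 5.4 or later, and since hugetlb-backed memory is not
swappable, the "disable hugepage usage" part is what makes it
meaningful). An external tool could achieve the same via
process_madvise() on newer kernels.

#define _GNU_SOURCE
#include <sys/mman.h>

/* Ask the kernel to reclaim the given range; with swap on a zram
 * device, pages that are never touched again stay compressed there,
 * while the rest are simply demand-paged back in. */
static int
page_out_region(void *addr, size_t len)
{
	return madvise(addr, len, MADV_PAGEOUT);
}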