Date: Wed, 25 Apr 2018 09:12:34 -0700
From: Stephen Hemminger
To: "Burakov, Anatoly"
Cc: Thomas Monjalon, dev@dpdk.org, andras.kovacs@ericsson.com,
 laszlo.vadkeri@ericsson.com, keith.wiles@intel.com, benjamin.walker@intel.com,
 bruce.richardson@intel.com, Yongseok Koh, nelio.laranjeiro@6wind.com,
 olivier.matz@6wind.com, rahul.lakkireddy@chelsio.com, jerin.jacob@cavium.com,
 hemant.agrawal@nxp.com, alejandro.lucero@netronome.com,
 arybchenko@solarflare.com, ferruh.yigit@intel.com, Srinath Mannam
Message-ID: <20180425091234.5565aafb@xeon-e3>
In-Reply-To: <53e192ed-15d8-5fa1-3048-964d92b917b1@intel.com>
References: <1667872.djnhp43hg1@xps> <38b1d748-f815-5cf6-acea-e58d291be40d@intel.com> <53e192ed-15d8-5fa1-3048-964d92b917b1@intel.com>
Subject: Re: [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK

On Wed, 25 Apr 2018 17:02:48 +0100
"Burakov, Anatoly" wrote:

> On 14-Feb-18 10:07 AM, Burakov, Anatoly wrote:
> > On 14-Feb-18 8:04 AM, Thomas Monjalon wrote:
> >> Hi Anatoly,
> >>
> >> 19/12/2017 12:14, Anatoly Burakov:
> >>>   * Memory tagging. This is related to the previous item. Right now,
> >>>     we can only ask malloc to allocate memory by page size, but one
> >>>     could potentially have different memory regions backed by pages
> >>>     of similar sizes (for example, locked 1G pages, to completely
> >>>     avoid TLB misses, alongside regular 1G pages), and it would be
> >>>     good to have that kind of mechanism to distinguish between
> >>>     different memory types available to a DPDK application. One
> >>>     could, for example, tag memory by "purpose" (i.e. "fast",
> >>>     "slow"), or in other ways.
> >>
> >> How do you imagine memory tagging?
> >> Should it be a parameter when requesting some memory from rte_malloc
> >> or rte_mempool?
> >
> > We can't make it a parameter for mempool without making it a parameter
> > for rte_malloc, as every memory allocation in DPDK works through
> > rte_malloc. So at the very least, rte_malloc will have it. And as long
> > as rte_malloc has it, there's no reason why memzones and mempools
> > couldn't - it's not much code to add.
> >
> >> Could it be a bit-field allowing to combine some properties?
> >> Does it make sense to have "DMA" as one of the purposes?
> >
> > Something like a bitfield would be my preference, yes. That way we
> > could classify memory in certain ways and allocate based on that.
> > Which "certain ways" these are, I'm not sure. For example, in addition
> > to tagging memory as "DMA-capable" (which I think is a given), one
> > might tag certain memory as "non-default", as in, never allocate from
> > this chunk of memory unless explicitly asked to do so - this could be
> > useful for types of memory that are a precious resource.
> >
> > Then again, it is likely that we won't have many types of memory in
> > DPDK, and any other type would be implementation-specific, so maybe
> > just stringly-typing it is OK (maybe we can finally make use of the
> > "type" parameter in rte_malloc!).
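
Such a bitfield of properties might look roughly like the sketch below.
The flag names and the malloc_tagged() helper are made up purely for
illustration - nothing like this exists in DPDK - and the helper simply
falls back to malloc() so the snippet stands on its own:

#include <stdint.h>
#include <stdlib.h>

/* Hypothetical property bits - not an existing DPDK API. */
#define MEMTAG_DMA_CAPABLE  (1u << 0)  /* safe to hand to a device */
#define MEMTAG_NON_DEFAULT  (1u << 1)  /* only used when explicitly requested */
#define MEMTAG_FAST         (1u << 2)  /* e.g. locked 1G pages */

/* A real implementation would pick a heap whose properties match 'tags';
 * this stand-in just forwards to malloc() to keep the sketch complete. */
static void *malloc_tagged(size_t size, uint32_t tags)
{
    (void)tags;  /* heap selection logic omitted */
    return malloc(size);
}

/* e.g. DMA-capable memory that is kept out of the default pool:
 *   buf = malloc_tagged(4096, MEMTAG_DMA_CAPABLE | MEMTAG_NON_DEFAULT); */
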
> >
> >> How to transparently allocate the best memory for the NIC?
> >> You take care of the NUMA socket property, but there can be more
> >> requirements, like getting memory from the NIC itself.
> >
> > I would think that we can't make it generic enough to cover all cases,
> > so it's best to expose some APIs and let PMDs handle this themselves.
> >
> >> +Cc more people (6WIND, Cavium, Chelsio, Mellanox, Netronome, NXP,
> >> Solarflare) in order to trigger a discussion about the ideal
> >> requirements.
>
> Hi all,
>
> I would like to restart this discussion, again :) I would like to hear
> some feedback on my thoughts below.
>
> I've done some more thinking about it, and while I have lots of use
> cases in mind, I suspect covering them all while keeping a sane API is
> unrealistic.
>
> So, first things first.
>
> The main issue we have is the 1:1 correspondence between a malloc heap
> and a socket ID. This has led to various attempts to hijack socket IDs
> to do something else - I've seen this approach a few times before, most
> recently in a patch by Srinath/Broadcom [1]. We need to break this
> dependency somehow, and have a unique heap identifier.
>
> Also, since memory allocators are expected to behave roughly like
> drivers (e.g. have a driver API and provide hooks for init/alloc/free
> functions, etc.), a request to allocate memory may not just go to the
> heap itself (which is handled internally by rte_malloc), but also to
> its respective allocator. This is roughly similar to what happens
> currently, except that which allocator functions to call will then
> depend on which driver allocated that heap.
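
As a rough sketch of that idea, an allocator "driver" could register a
small table of hooks, and each heap would record which table it belongs
to. The struct and field names below are invented for illustration and
are not an existing DPDK interface:

#include <stddef.h>

/* Hooks an allocator driver might register (illustrative only). */
struct malloc_driver_ops {
    int   (*init)(void *priv);                   /* set up backing memory */
    void *(*alloc)(void *priv, size_t size, size_t align);
    void  (*free)(void *priv, void *addr);
};

/* Each heap would then carry a reference to its owning allocator, so
 * rte_malloc could dispatch alloc/free to the right hooks instead of
 * assuming the default hugepage allocator. */
struct heap_sketch {
    const char *name;                     /* unique heap identifier */
    const struct malloc_driver_ops *ops;  /* heap => allocator dependency */
    void *priv;                           /* allocator-private state */
};
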
> So, we arrive at a dependency - heap => allocator. Each heap must know
> which allocator it belongs to - so we also need some kind of way to
> identify not just the heap, but the allocator as well.
>
> In the above quotes from previous mails I suggested categorizing memory
> by "types", but now that I think of it, the API would have been too
> complex, as we would ideally have had to cover use cases such as
> "allocate memory of this type, no matter which allocator it comes
> from", "allocate memory from this particular heap", "allocate memory
> from this particular allocator"... It gets complicated pretty fast.
>
> What I propose instead is this. 99% of the time, the user wants our
> hugepage allocator. So, by default, all allocations will come through
> that. In the event that the user needs memory from a specific heap, we
> need to provide a new set of APIs to request memory from that heap.
>
> Do we expect situations where the user might *not* want the default
> allocator, but also *not* know which exact heap he wants? If the answer
> is no (which I'm counting on :) ), then allocating from a specific
> malloc driver becomes as simple as something like this:
>
> mem = rte_malloc_from_heap("my_very_special_heap");
>
> (the stringly-typed heap ID is just an example)
>
> So, the old APIs remain intact, and are always passed through to the
> default allocator, while the new APIs grant access to other allocators.
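
Spelled out a little further, the split between the unchanged default
path and the proposed per-heap path could look like the following. The
rte_malloc_from_heap() prototype is hypothetical (it is only proposed
above), and whether rte_free() would transparently find the owning heap
is an open design question:

#include <rte_malloc.h>   /* existing rte_malloc()/rte_free() */

/* Hypothetical prototype for the proposed per-heap call. */
void *rte_malloc_from_heap(const char *heap, const char *type,
                           size_t size, unsigned int align);

static void allocation_example(void)
{
    /* Unchanged path: served by the default hugepage allocator. */
    void *a = rte_malloc("ring", 4096, 64);

    /* New path: memory explicitly taken from a named heap. */
    void *b = rte_malloc_from_heap("my_very_special_heap", "ring",
                                   4096, 64);

    /* Assumes rte_free() can locate the owning heap for both. */
    rte_free(b);
    rte_free(a);
}
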
> Heap ID alone, however, may not provide enough flexibility. For
> example, if a malloc driver allocates a specific kind of memory that is
> NUMA-aware, it would perhaps be awkward to use different heap IDs when
> the memory being allocated is arguably the same, just subdivided into
> several blocks. Moreover, figuring out situations like this would
> likely require some cooperation from the allocator itself (possibly
> some allocator-specific APIs), but should we add malloc heap arguments,
> those would have to be generic. I'm not sure if we want to go that far,
> though.
>
> Does that sound reasonable?
>
> Another tangentially related issue, raised by Olivier [2], is that of
> allocating memory in blocks, rather than through rte_malloc. The
> current implementation has rte_malloc store its metadata right in the
> allocated memory - this leads to unnecessary memory fragmentation in
> certain cases, such as allocating memory page by page, and in general
> pollutes memory we might not want to pollute with malloc metadata.
>
> To fix this, the memory allocator would have to store malloc metadata
> externally, which comes with a few caveats (reverse mapping of pointers
> to malloc elements; storing, looking up and accounting for said
> elements, etc.). It's not currently planned to work on this, but it's
> certainly something to think about :)
>
> [1] http://dpdk.org/dev/patchwork/patch/36596/
> [2] http://dpdk.org/ml/archives/dev/2018-March/093212.html

Maybe the existing rte_malloc, which tries to always work like malloc,
is not the best API for applications? I always thought the Samba talloc
API was less error prone, since it supports reference counting and
hierarchical allocation.
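
For anyone who has not used it, a minimal talloc sketch is below
(Samba's libtalloc, linked with -ltalloc). It shows only the
hierarchical side; reference counting via talloc_reference() and
talloc_unlink(), as well as error handling, are left out:

#include <talloc.h>

struct conn {
    char *peer_name;
    unsigned char *rx_buf;
};

static struct conn *conn_create(TALLOC_CTX *parent)
{
    struct conn *c = talloc_zero(parent, struct conn);
    if (c == NULL)
        return NULL;

    /* Children are attached to 'c', not to a global heap... */
    c->peer_name = talloc_strdup(c, "10.0.0.1");
    c->rx_buf    = talloc_array(c, unsigned char, 2048);
    return c;
}

/* ...so a single talloc_free(c) later releases the connection and
 * everything allocated under it, children before parent. */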