From: "Burakov, Anatoly"
To: dev@dpdk.org
Cc: andras.kovacs@ericsson.com, laszlo.vadkeri@ericsson.com,
 keith.wiles@intel.com, benjamin.walker@intel.com, bruce.richardson@intel.com,
 thomas@monjalon.net, techboard@dpdk.org, jerin.jacob@caviumnetworks.com,
 rosenbaumalex@gmail.com, "Ananyev, Konstantin", ferruh.yigit@intel.com
Subject: Re: [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK
Date: Sat, 13 Jan 2018 14:13:53 +0000
Message-ID: <5ea96aa4-1cbb-a509-b582-418e8bd71552@intel.com>

On 19-Dec-17 11:14 AM, Anatoly Burakov wrote:
> This patchset introduces a prototype implementation of dynamic memory
> allocation for DPDK. It is intended to start a conversation and build
> consensus on the best way to implement this functionality. The patchset
> works well enough to pass all unit tests and to work with traffic
> forwarding, provided the device drivers are adjusted to ensure contiguous
> memory allocation where it matters.
>
> The vast majority of changes are in the EAL and malloc; the external API
> disruption is minimal. A new set of APIs is added for contiguous memory
> allocation (for rte_malloc and rte_memzone), along with a few API
> additions in rte_memory. Every other API change is internal to EAL, and
> all memory allocation/freeing is handled through rte_malloc, with no
> externally visible API changes, aside from the call to get the physmem
> layout, which no longer makes sense given that there are multiple memseg
> lists.
>
> Quick outline of all changes done as part of this patchset:
>
> * Malloc heap adjusted to handle holes in address space
> * Single memseg list replaced by multiple expandable memseg lists
> * VA space for hugepages is preallocated in advance
> * Added dynamic alloc/free for pages, happening as needed on malloc/free
> * Added contiguous memory allocation APIs for rte_malloc and rte_memzone
> * Integrated Pawel Wodkowski's patch [1] for registering/unregistering
>   memory with VFIO
>
> The biggest difference is that a "memseg" now represents a single page
> (as opposed to being a big contiguous block of pages). As a consequence,
> both memzones and malloc elements are no longer guaranteed to be
> physically contiguous unless the user asks for it. To preserve whatever
> functionality was dependent on the previous behavior, a legacy memory
> option is also provided; however, it is expected to be a temporary
> solution.
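To illustrate the point above: with per-page memsegs, plain
rte_malloc()/rte_memzone memory is only guaranteed to be VA-contiguous, so
a driver that relies on physical contiguity would either use the new
contiguous allocation APIs from the patchset, or check a buffer page by
page with the existing rte_mem_virt2iova() helper. The sketch below is
illustrative only and not part of the patchset; the function name and the
pg_sz parameter (the backing page size, assumed known to the caller) are
made up for the example.

#include <stdbool.h>
#include <stddef.h>
#include <rte_common.h>
#include <rte_memory.h>

/* Sketch: return true if [buf, buf + len) is backed by IOVA-contiguous
 * pages of size pg_sz. Hypothetical helper, only illustrating the
 * "no physical contiguity unless asked for" point made above. */
static bool
buf_is_iova_contig(const void *buf, size_t len, size_t pg_sz)
{
        const char *va = RTE_PTR_ALIGN_FLOOR(buf, pg_sz);
        const char *end = (const char *)buf + len;
        rte_iova_t expected = rte_mem_virt2iova(va);

        for (; va < end; va += pg_sz, expected += pg_sz) {
                /* each backing page must follow the previous one in IOVA space */
                if (rte_mem_virt2iova(va) != expected)
                        return false;
        }
        return true;
}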
> The drivers weren't adjusted in this patchset, and it is expected that
> whoever tests the drivers with this patchset will modify the relevant
> drivers to support the new set of APIs. Basic testing with forwarding
> traffic was performed, both with UIO and VFIO, and no performance
> degradation was observed.
>
> Why multiple memseg lists instead of one? It makes things easier on a
> number of fronts. Since a memseg is now a single page, the list will get
> quite big, and we need to locate pages somehow when we allocate and free
> them. We could of course just walk the list and allocate one contiguous
> chunk of VA space for memsegs, but I chose to use separate lists instead,
> to speed up many operations on the list.
>
> It would be great to see the following discussions within the community
> regarding both the current implementation and future work:
>
> * Any suggestions to improve the current implementation. The whole system
>   with multiple memseg lists is kind of unwieldy, so maybe there are
>   better ways to do the same thing. Maybe use a single list after all?
>   We're not expecting malloc/free on the hot path, so maybe it doesn't
>   matter that we have to walk a list of potentially thousands of pages?
> * Pluggable memory allocators. Right now, allocators are hardcoded, but
>   down the line it would be great to have custom allocators (e.g. for
>   externally allocated memory). I've tried to keep the memalloc API
>   minimal and generic enough to be able to easily change it down the
>   line, but suggestions are welcome. Memory drivers, with ops for
>   alloc/free etc.? (A rough sketch of such an ops table follows below
>   this list.)
> * Memory tagging. This is related to the previous item. Right now, we can
>   only ask malloc to allocate memory by page size, but one could
>   potentially have different memory regions backed by pages of similar
>   sizes (for example, locked 1G pages, to completely avoid TLB misses,
>   alongside regular 1G pages), and it would be good to have a mechanism
>   to distinguish between the different memory types available to a DPDK
>   application. One could, for example, tag memory by "purpose"
>   (e.g. "fast", "slow"), or in other ways.
> * Secondary process implementation, in particular when it comes to
>   allocating/freeing new memory. The current plan is to make use of the
>   RPC mechanism proposed by Jianfeng [2] to communicate between primary
>   and secondary processes, but other suggestions are welcome.
> * Support for non-hugepage memory. This work is planned down the line.
>   Aside from obvious concerns about physical addresses, 4K pages are
>   small and will eat up enormous amounts of memseg list space, so my
>   proposal would be to allocate 4K pages in bigger blocks (say, 2MB).
> * 32-bit support. The current implementation lacks it, and I don't see a
>   trivial way to make it work if we are to preallocate huge chunks of VA
>   space in advance. We could limit it to 1G per page size, but even that,
>   on multiple sockets, won't work that well, and we can't know in advance
>   what kind of memory the user will try to allocate. Drop it? Leave it in
>   legacy mode only?
> * Preallocation. Right now, malloc will free any and all memory that it
>   can, which could lead to a (perhaps counterintuitive) situation where a
>   user calls DPDK with --socket-mem=1024,1024, does a single "rte_free"
>   and loses all of the preallocated memory in the process. Would
>   preallocating memory *and keeping it no matter what* be a valid use
>   case? E.g. if DPDK was run without any memory requirements specified,
>   grow and shrink as needed, but if DPDK was asked to preallocate memory,
>   we can grow but cannot shrink past the preallocated amount?
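On the pluggable-allocators item in the list above, one possible shape for
such a "memory driver" ops table is sketched below. Every name here (the
struct, its members, the registration call) is hypothetical and only
illustrates the idea; the RFC does not define this interface.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical "memory driver" ops table for pluggable page allocators.
 * None of these names exist in the RFC; they only show the rough shape
 * such an interface could take. */
struct memalloc_ops {
        const char *name;       /* e.g. "hugepage", "external" */

        /* allocate n_pages pages of page_sz bytes on the given socket,
         * returning the VA of the first page, or NULL on failure */
        void *(*alloc)(size_t page_sz, unsigned int n_pages, int socket_id);

        /* return pages previously handed out by this driver */
        int (*free)(void *addr, size_t page_sz, unsigned int n_pages);

        /* report the IOVA of a page, so DMA/VFIO mappings can be set up */
        uint64_t (*get_iova)(const void *addr);
};

/* hypothetical registration hook: malloc/memzone would route page
 * requests for a given memory type through the registered ops */
int memalloc_driver_register(const struct memalloc_ops *ops);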
>
> Any other feedback about things I didn't think of or missed is greatly
> appreciated.
>
> [1] http://dpdk.org/dev/patchwork/patch/24484/
> [2] http://dpdk.org/dev/patchwork/patch/31838/

Hi all,

Could this proposal be discussed at the next tech board meeting?

-- 
Thanks,
Anatoly