From: Anatoly Burakov
To: dev@dpdk.org
Cc: andras.kovacs@ericsson.com, laszlo.vadkeri@ericsson.com,
    keith.wiles@intel.com, benjamin.walker@intel.com,
    bruce.richardson@intel.com, thomas@monjalon.net
Date: Tue, 19 Dec 2017 11:14:27 +0000
Subject: [dpdk-dev] [RFC v2 00/23] Dynamic memory allocation for DPDK
List-Id: DPDK patches and discussions

This patchset introduces a prototype implementation of dynamic memory
allocation for DPDK. It is intended to start a conversation and build
consensus on the best way to implement this functionality.
The patchset works well enough to pass all unit tests, and to work with
traffic forwarding, provided the device drivers are adjusted to ensure
contiguous memory allocation where it matters.

The vast majority of changes are in the EAL and malloc; the external API
disruption is minimal: a new set of APIs is added for contiguous memory
allocation (for rte_malloc and rte_memzone), along with a few API
additions in rte_memory. Every other API change is internal to EAL, and
all of the memory allocation/freeing is handled through rte_malloc, with
no externally visible API changes, aside from a call to get physmem
layout, which no longer makes sense given that there are multiple memseg
lists.

Quick outline of all changes done as part of this patchset:

* Malloc heap adjusted to handle holes in address space
* Single memseg list replaced by multiple expandable memseg lists
* VA space for hugepages is preallocated in advance
* Added dynamic alloc/free for pages, happening as needed on malloc/free
* Added contiguous memory allocation APIs for rte_malloc and rte_memzone
* Integrated Pawel Wodkowski's patch [1] for registering/unregistering
  memory with VFIO

The biggest difference is that a "memseg" now represents a single page
(as opposed to being a big contiguous block of pages). As a consequence,
both memzones and malloc elements are no longer guaranteed to be
physically contiguous, unless the user asks for it. To preserve whatever
functionality was dependent on the previous behavior, a legacy memory
option is also provided; however, it is expected to be a temporary
solution. The drivers weren't adjusted in this patchset, and it is
expected that whoever tests the drivers with this patchset will modify
their relevant drivers to support the new set of APIs. Basic testing
with forwarding traffic was performed, both with UIO and VFIO, and no
performance degradation was observed.

Why multiple memseg lists instead of one? It makes things easier on a
number of fronts.
Since memseg is a single page now, the list will get quite big, and we
need to locate pages somehow when we allocate and free them. We could of
course just walk the list and allocate one contiguous chunk of VA space
for memsegs, but I chose to use separate lists instead, to speed up many
operations with the list.

It would be great to see the following discussions within the community
regarding both current implementation and future work:

* Any suggestions to improve current implementation. The whole system
  with multiple memseg lists is kind of unwieldy, so maybe there are
  better ways to do the same thing. Maybe use a single list after all?
  We're not expecting malloc/free on hot path, so maybe it doesn't
  matter that we have to walk the list of potentially thousands of
  pages?
* Pluggable memory allocators. Right now, allocators are hardcoded, but
  down the line it would be great to have custom allocators (e.g. for
  externally allocated memory). I've tried to keep the memalloc API
  minimal and generic enough to be able to easily change it down the
  line, but suggestions are welcome. Memory drivers, with ops for
  alloc/free etc.?
* Memory tagging. This is related to the previous item. Right now, we
  can only ask malloc to allocate memory by page size, but one could
  potentially have different memory regions backed by pages of similar
  sizes (for example, locked 1G pages, to completely avoid TLB misses,
  alongside regular 1G pages), and it would be good to have that kind of
  mechanism to distinguish between different memory types available to a
  DPDK application. One could, for example, tag memory by "purpose"
  (i.e. "fast", "slow"), or in other ways.
* Secondary process implementation, in particular when it comes to
  allocating/freeing new memory. Current plan is to make use of the RPC
  mechanism proposed by Jianfeng [2] to communicate between primary and
  secondary processes; however, other suggestions are welcome.
* Support for non-hugepage memory. This work is planned down the line.
  Aside from obvious concerns about physical addresses, 4K pages are
  small and will eat up enormous amounts of memseg list space, so my
  proposal would be to allocate 4K pages in bigger blocks (say, 2MB).
* 32-bit support. Current implementation lacks it, and I don't see a
  trivial way to make it work if we are to preallocate huge chunks of VA
  space in advance. We could limit it to 1G per page size, but even
  that, on multiple sockets, won't work that well, and we can't know in
  advance what kind of memory the user will try to allocate. Drop it?
  Leave it in legacy mode only?
* Preallocation. Right now, malloc will free any and all memory that it
  can, which could lead to a (perhaps counterintuitive) situation where
  a user calls DPDK with --socket-mem=1024,1024, does a single
  "rte_free" and loses all of the preallocated memory in the process.
  Would preallocating memory *and keeping it no matter what* be a valid
  use case? E.g. if DPDK was run without any memory requirements
  specified, grow and shrink as needed, but if DPDK was asked to
  preallocate memory, we can grow but we can't shrink past the
  preallocated amount?

Any other feedback about things I didn't think of or missed is greatly
appreciated.
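
To make the preallocation question above more concrete, here is a
minimal sketch of the "grow freely, but never shrink below the
preallocated amount" policy. It is an illustration only: struct
heap_acct, heap_grow() and heap_shrink() are hypothetical names made up
for this example and are not part of this patchset or of the EAL, and
the sketch tracks byte counts only, without touching any real pages.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-socket accounting; not a structure from the patchset. */
struct heap_acct {
	size_t prealloc_bytes; /* floor requested at startup (e.g. --socket-mem) */
	size_t mapped_bytes;   /* bytes currently backed by pages */
};

/* Growing is always allowed in this sketch. */
static void
heap_grow(struct heap_acct *h, size_t len)
{
	assert(h != NULL);
	h->mapped_bytes += len;
}

/*
 * Try to release up to len bytes, but never drop below the preallocated
 * floor. Returns the number of bytes actually released.
 */
static size_t
heap_shrink(struct heap_acct *h, size_t len)
{
	size_t spare, released;

	assert(h != NULL);

	/* Only memory above the floor is eligible for release. */
	spare = h->mapped_bytes > h->prealloc_bytes ?
			h->mapped_bytes - h->prealloc_bytes : 0;
	released = len < spare ? len : spare;

	h->mapped_bytes -= released;
	return released;
}
```

With prealloc_bytes and mapped_bytes both set to 1G, a heap_shrink() call
releases nothing, so a single free can no longer give back preallocated
memory. After heap_grow() adds 4M, a request to shrink by 16M releases
only those 4M. Setting prealloc_bytes to 0 gives the current behavior,
where everything that can be freed is freed.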

[1] http://dpdk.org/dev/patchwork/patch/24484/
[2] http://dpdk.org/dev/patchwork/patch/31838/

Anatoly Burakov (23):
  eal: move get_virtual_area out of linuxapp eal_memory.c
  eal: add function to report number of detected sockets
  eal: add rte_fbarray
  eal: move all locking to heap
  eal: protect malloc heap stats with a lock
  eal: make malloc a doubly-linked list
  eal: make malloc_elem_join_adjacent_free public
  eal: add "single file segments" command-line option
  eal: add "legacy memory" option
  eal: read hugepage counts from node-specific sysfs path
  eal: replace memseg with memseg lists
  eal: add support for dynamic memory allocation
  eal: make use of dynamic memory allocation for init
  eal: add support for dynamic unmapping of pages
  eal: add API to check if memory is physically contiguous
  eal: enable dynamic memory allocation/free on malloc/free
  eal: add backend support for contiguous memory allocation
  eal: add rte_malloc support for allocating contiguous memory
  eal: enable reserving physically contiguous memzones
  eal: make memzones use rte_fbarray
  mempool: add support for the new memory allocation methods
  vfio: allow to map other memory regions
  eal: map/unmap memory with VFIO when alloc/free pages

 config/common_base                                |   5 +-
 drivers/bus/pci/linux/pci.c                       |  29 +-
 drivers/net/ena/ena_ethdev.c                      |  10 +-
 drivers/net/virtio/virtio_user/vhost_kernel.c     | 106 ++--
 lib/librte_eal/common/Makefile                    |   2 +-
 lib/librte_eal/common/eal_common_fbarray.c        | 585 ++++++++++++++++++++++
 lib/librte_eal/common/eal_common_lcore.c          |  11 +
 lib/librte_eal/common/eal_common_memalloc.c       |  79 +++
 lib/librte_eal/common/eal_common_memory.c         | 315 +++++++++++-
 lib/librte_eal/common/eal_common_memzone.c        | 250 ++++++---
 lib/librte_eal/common/eal_common_options.c        |   8 +
 lib/librte_eal/common/eal_filesystem.h            |  13 +
 lib/librte_eal/common/eal_hugepages.h             |   1 +
 lib/librte_eal/common/eal_internal_cfg.h          |   6 +
 lib/librte_eal/common/eal_memalloc.h              |  55 ++
 lib/librte_eal/common/eal_options.h               |   4 +
 lib/librte_eal/common/eal_private.h               |  29 ++
 lib/librte_eal/common/include/rte_eal.h           |   1 +
 lib/librte_eal/common/include/rte_eal_memconfig.h |  26 +-
 lib/librte_eal/common/include/rte_fbarray.h       |  98 ++++
 lib/librte_eal/common/include/rte_lcore.h         |   8 +
 lib/librte_eal/common/include/rte_malloc.h        | 181 +++++++
 lib/librte_eal/common/include/rte_malloc_heap.h   |   6 +
 lib/librte_eal/common/include/rte_memory.h        |  16 +
 lib/librte_eal/common/include/rte_memzone.h       | 158 ++++++
 lib/librte_eal/common/malloc_elem.c               | 411 ++++++++++++---
 lib/librte_eal/common/malloc_elem.h               |  30 +-
 lib/librte_eal/common/malloc_heap.c               | 433 ++++++++++++++--
 lib/librte_eal/common/malloc_heap.h               |  14 +-
 lib/librte_eal/common/rte_malloc.c                | 139 +++--
 lib/librte_eal/linuxapp/eal/Makefile              |   4 +
 lib/librte_eal/linuxapp/eal/eal.c                 |  23 +-
 lib/librte_eal/linuxapp/eal/eal_hugepage_info.c   |  73 ++-
 lib/librte_eal/linuxapp/eal/eal_memalloc.c        | 556 ++++++++++++++++++++
 lib/librte_eal/linuxapp/eal/eal_memory.c          | 452 ++++++++++-------
 lib/librte_eal/linuxapp/eal/eal_vfio.c            | 280 ++++++++---
 lib/librte_eal/linuxapp/eal/eal_vfio.h            |  11 +
 lib/librte_mempool/rte_mempool.c                  |  84 +++-
 test/test/test_malloc.c                           |  29 +-
 test/test/test_memory.c                           |  44 +-
 test/test/test_memzone.c                          |  17 +-
 41 files changed, 3999 insertions(+), 603 deletions(-)
 create mode 100755 lib/librte_eal/common/eal_common_fbarray.c
 create mode 100755 lib/librte_eal/common/eal_common_memalloc.c
 create mode 100755 lib/librte_eal/common/eal_memalloc.h
 create mode 100755 lib/librte_eal/common/include/rte_fbarray.h
 create mode 100755 lib/librte_eal/linuxapp/eal/eal_memalloc.c

-- 
2.7.4