From: "Burakov, Anatoly"
To: Jim Harris, dev@dpdk.org
Cc: Thomas Monjalon, "Richardson, Bruce", "Yigit, Ferruh",
 Yongseok Koh, Maxime Coquelin, Shahaf Shuler, Hemant Agrawal,
 "jerin.jacob@caviumnetworks.com", "Ananyev, Konstantin",
 Olivier Matz, Stephen Hemminger
Subject: Re: [dpdk-dev] [RFC] mem: do not merge elems from different heap allocs
Date: Tue, 4 Dec 2018 16:49:34 +0000
Message-ID: <65d5d7f7-ca22-972e-2991-1f638253c4ba@intel.com>
In-Reply-To: <20181129191330.31792-1-james.r.harris@intel.com>
References: <20181129191330.31792-1-james.r.harris@intel.com>
List-Id: DPDK patches and discussions

On 29-Nov-18 7:13 PM, Jim Harris wrote:
> SPDK uses the rte_mem_event_callback_register API to
> create RDMA memory regions (MRs) for newly allocated regions
> of memory. This is used in both the SPDK NVMe-oF target
> and the NVMe-oF host driver.
>
> DPDK creates internal malloc_elem structures for these
> allocated regions. As users malloc and free memory, DPDK
> will sometimes merge malloc_elems that originated from
> different allocations that were notified through the
> registered mem_event callback routine. This results
> in subsequent allocations that can span across multiple
> RDMA MRs. This requires SPDK to check each DPDK buffer to
> see if it crosses an MR boundary, and if so, to add
> considerable logic and complexity to describe that
> buffer before it can be accessed by the RNIC. It is somewhat
> analogous to rte_malloc returning a buffer that is not
> IOVA-contiguous.
>
> As a malloc_elem gets split and some of these elements
> get freed, it can also result in DPDK sending an
> RTE_MEM_EVENT_FREE notification for a subset of the
> original RTE_MEM_EVENT_ALLOC notification. This is also
> problematic for RDMA memory regions, since unregistering
> the memory region is all-or-nothing. It is not possible
> to unregister part of a memory region.
>
> SPDK is currently working around this latter issue by
> setting the RTE_MEMSEG_FLAG_DO_NOT_FREE flag on each of the
> memory segments associated with an RTE_MEM_EVENT_ALLOC
> notification. But the first issue with merged elements
> crossing MR boundaries is still possible, and we have
> some rather simple allocation patterns that can
> trigger it. So we would like to propose disabling the
> merging of malloc_elems that originate from different
> RTE_MEM_EVENT_ALLOC events.
>
> This patch demonstrates how this merging could be
> avoided. There are some allocation patterns (especially
> frequent multi-huge-page rte_mallocs and rte_frees) that
> could result in higher memory usage, which needs to be
> evaluated.
> It's possible this behavior would need to be
> explicitly enabled through an init flag or a separate API,
> or implicitly based on whether any memory registration
> callbacks have been registered - implementation of one
> of those is deferred pending feedback on this RFC.
>
> Signed-off-by: Jim Harris

Hi Jim,

I can see the use case, and i don't think there's a better way to do
this than to fix it at the allocator stage. You're correct in pointing
out that some amount of memory will be wasted with this approach due
to free space fragmentation. However, if we make this behavior
non-default, that is fine.

One thing i'm a bit concerned about is that this change will push
malloc element size into two cache lines on any arch that is 64-bit
and has a 64-byte cache line. Malloc is not a performance API, so
structure size is not important as long as it's cache-line aligned,
but wasting an additional 64 bytes per allocation on some
architectures (including ours!) can be a concern.

Moreover, this extra overhead will be there regardless of whether this
functionality is enabled or disabled (unless we're talking #ifdef-ery,
which i'd rather not do). This overhead will even be there on FreeBSD,
which doesn't support dynamic memory allocation in the first place. I
don't see a good way to avoid this other than #ifdef-magic, but as i
said, i'd really like to avoid having another config option (or worse,
OS-specific code in common files).

I personally think an additional cache line per allocation on x86_64
and similar archs isn't *that* big of a price to pay for this
functionality (which i think is useful for certain use cases like
yours!), but i'm adding Techboard folks to this thread in case i'm
mistaken on that front :)

Provided we go forward with the proposal, i believe the best way to
implement this would be through an EAL flag.

Also, some code comments below.
> ---
> lib/librte_eal/common/malloc_elem.c | 14 ++++++++++----
> lib/librte_eal/common/malloc_elem.h | 6 +++++-
> lib/librte_eal/common/malloc_heap.c | 7 ++++++-
> 3 files changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/lib/librte_eal/common/malloc_elem.c b/lib/librte_eal/common/malloc_elem.c
> index 9d3dcb6a9..785ecdadd 100644
> --- a/lib/librte_eal/common/malloc_elem.c
> +++ b/lib/librte_eal/common/malloc_elem.c
> @@ -110,7 +110,8 @@ malloc_elem_find_max_iova_contig(struct malloc_elem *elem, size_t align)
>   */
>  void
>  malloc_elem_init(struct malloc_elem *elem, struct malloc_heap *heap,
> -		struct rte_memseg_list *msl, size_t size)
> +		struct rte_memseg_list *msl, size_t size,
> +		struct malloc_elem *orig_elem, size_t orig_size)
>  {
>  	elem->heap = heap;
>  	elem->msl = msl;
> @@ -120,6 +121,8 @@ malloc_elem_init(struct malloc_elem *elem, struct malloc_heap *heap,
>  	elem->state = ELEM_FREE;
>  	elem->size = size;
>  	elem->pad = 0;
> +	elem->orig_elem = orig_elem;
> +	elem->orig_size = orig_size;
>  	set_header(elem);
>  	set_trailer(elem);
>  }
> @@ -278,7 +281,8 @@ split_elem(struct malloc_elem *elem, struct malloc_elem *split_pt)
>  	const size_t old_elem_size = (uintptr_t)split_pt - (uintptr_t)elem;
>  	const size_t new_elem_size = elem->size - old_elem_size;
>
> -	malloc_elem_init(split_pt, elem->heap, elem->msl, new_elem_size);
> +	malloc_elem_init(split_pt, elem->heap, elem->msl, new_elem_size,
> +			 elem->orig_elem, elem->orig_size);
>  	split_pt->prev = elem;
>  	split_pt->next = next_elem;
>  	if (next_elem)
> @@ -469,7 +473,8 @@ malloc_elem_join_adjacent_free(struct malloc_elem *elem)
>  	 * with it, need to remove from free list.
>  	 */
>  	if (elem->next != NULL && elem->next->state == ELEM_FREE &&
> -			next_elem_is_adjacent(elem)) {
> +			next_elem_is_adjacent(elem) &&
> +			elem->orig_elem == elem->next->orig_elem) {

Why not make this check part of the adjacency test? IMO it would be
much clearer that way.
>  		void *erase;
>  		size_t erase_len;
>
> @@ -490,7 +495,8 @@ malloc_elem_join_adjacent_free(struct malloc_elem *elem)
>  	 * with it, need to remove from free list.
>  	 */
>  	if (elem->prev != NULL && elem->prev->state == ELEM_FREE &&
> -			prev_elem_is_adjacent(elem)) {
> +			prev_elem_is_adjacent(elem) &&
> +			elem->orig_elem == elem->prev->orig_elem) {

Same as above.

>  		struct malloc_elem *new_elem;
>  		void *erase;
>  		size_t erase_len;
> diff --git a/lib/librte_eal/common/malloc_elem.h b/lib/librte_eal/common/malloc_elem.h
> index e2bda4c02..207767c94 100644
> --- a/lib/librte_eal/common/malloc_elem.h
> +++ b/lib/librte_eal/common/malloc_elem.h
> @@ -32,6 +32,8 @@ struct malloc_elem {
>  	volatile enum elem_state state;
>  	uint32_t pad;
>  	size_t size;
> +	struct malloc_elem *orig_elem;
> +	size_t orig_size;
>  #ifdef RTE_MALLOC_DEBUG
>  	uint64_t header_cookie;		/* Cookie marking start of data */
>  	/* trailer cookie at start + size */

This changes struct size to two cache lines on x86_64, which breaks
the malloc autotest. That's not the fault of this patchset (malloc
autotest was kind of stupid in how it handled overhead anyway), but
this can be fixed by detecting malloc overhead size at runtime.
Something like the following (with the missing declarations filled in
and the computed overhead returned, so it actually compiles):

static int
test_multi_alloc_statistics(void)
{
	struct rte_malloc_socket_stats pre_stats, post_stats;
	int socket = 0; /* socket under test */
	void *dummy;
	int overhead = 0;

	rte_malloc_get_socket_stats(socket, &pre_stats);

	/* allocate one cacheline */
	dummy = rte_zmalloc_socket(NULL, RTE_CACHE_LINE_SIZE, 0, socket);
	if (dummy == NULL)
		return -1;

	rte_malloc_get_socket_stats(socket, &post_stats);

	/* after subtracting cache line, remainder is overhead */
	overhead = post_stats.heap_allocsz_bytes -
			pre_stats.heap_allocsz_bytes;
	overhead -= RTE_CACHE_LINE_SIZE;

	rte_free(dummy);
	return overhead;
}

> @@ -116,7 +118,9 @@ void
>  malloc_elem_init(struct malloc_elem *elem,
>  		struct malloc_heap *heap,
>  		struct rte_memseg_list *msl,
> -		size_t size);
> +		size_t size,
> +		struct malloc_elem *orig_elem,
> +		size_t orig_size);
>
>  void
>  malloc_elem_insert(struct malloc_elem *elem);
> diff --git a/lib/librte_eal/common/malloc_heap.c b/lib/librte_eal/common/malloc_heap.c
> index c6a6d4f6b..602528299 100644
> --- a/lib/librte_eal/common/malloc_heap.c
> +++ b/lib/librte_eal/common/malloc_heap.c
> @@ -94,7 +94,7 @@ malloc_heap_add_memory(struct malloc_heap *heap, struct rte_memseg_list *msl,
>  {
>  	struct malloc_elem *elem = start;
>
> -	malloc_elem_init(elem, heap, msl, len);
> +	malloc_elem_init(elem, heap, msl, len, elem, len);
>
>  	malloc_elem_insert(elem);
>
> @@ -857,6 +857,11 @@ malloc_heap_free(struct malloc_elem *elem)
>  	if (elem->size < page_sz)
>  		goto free_unlock;
>
> +	/* we can only free memory back to the system as it was
> +	 * originally allocated */
> +	if (elem->size != elem->orig_size)
> +		goto free_unlock;
> +
>  	/* probably, but let's make sure, as we may not be using up full page */
>  	start = elem;
>  	len = elem->size;
> --

Thanks,
Anatoly