From: Jim Harris
To: dev@dpdk.org
Cc: anatoly.burakov@intel.com
Date: Thu, 29 Nov 2018 12:13:30 -0700
Message-Id: <20181129191330.31792-1-james.r.harris@intel.com>
Subject: [dpdk-dev] [RFC] mem: do not merge elems from different heap allocs
List-Id: DPDK patches and discussions

SPDK uses the rte_mem_event_callback_register API to create RDMA memory
regions (MRs) for newly allocated regions of memory. This is used in both
the SPDK NVMe-oF target and the NVMe-oF host driver.

DPDK creates internal malloc_elem structures for these allocated regions.
As users malloc and free memory, DPDK will sometimes merge malloc_elems
that originated from different allocations that were notified through the
registered mem_event callback routine. This results in subsequent
allocations that can span multiple RDMA MRs. SPDK must then check each
DPDK buffer to see if it crosses an MR boundary and, if it does, add
considerable logic and complexity to describe that buffer before it can
be accessed by the RNIC. It is somewhat analogous to rte_malloc returning
a buffer that is not IOVA-contiguous.
As a malloc_elem gets split and some of the resulting elements get freed,
DPDK can also end up sending an RTE_MEM_EVENT_FREE notification for a
subset of the original RTE_MEM_EVENT_ALLOC notification. This is also
problematic for RDMA memory regions, since unregistering a memory region
is all-or-nothing: it is not possible to unregister part of one.

SPDK currently works around this latter issue by setting
RTE_MEMSEG_FLAG_DO_NOT_FREE on each of the memory segments associated
with an RTE_MEM_EVENT_ALLOC notification. But the first issue, merged
elements crossing MR boundaries, is still possible, and some rather
simple allocation patterns can trigger it.

We therefore propose disabling the merging of malloc_elems that originate
from different RTE_MEM_EVENT_ALLOC events. This patch demonstrates how
that merging could be avoided. Some allocation patterns (especially
frequent multi-huge-page rte_mallocs and rte_frees) could result in
higher memory usage and need to be evaluated. It's possible this behavior
would need to be enabled explicitly through an init flag or a separate
API, or implicitly based on whether any memory registration callbacks
have been registered; implementation of one of those options is deferred
pending feedback on this RFC.
Signed-off-by: Jim Harris
---
 lib/librte_eal/common/malloc_elem.c | 14 ++++++++++----
 lib/librte_eal/common/malloc_elem.h |  6 +++++-
 lib/librte_eal/common/malloc_heap.c |  7 ++++++-
 3 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/lib/librte_eal/common/malloc_elem.c b/lib/librte_eal/common/malloc_elem.c
index 9d3dcb6a9..785ecdadd 100644
--- a/lib/librte_eal/common/malloc_elem.c
+++ b/lib/librte_eal/common/malloc_elem.c
@@ -110,7 +110,8 @@ malloc_elem_find_max_iova_contig(struct malloc_elem *elem, size_t align)
  */
 void
 malloc_elem_init(struct malloc_elem *elem, struct malloc_heap *heap,
-		struct rte_memseg_list *msl, size_t size)
+		struct rte_memseg_list *msl, size_t size,
+		struct malloc_elem *orig_elem, size_t orig_size)
 {
 	elem->heap = heap;
 	elem->msl = msl;
@@ -120,6 +121,8 @@ malloc_elem_init(struct malloc_elem *elem, struct malloc_heap *heap,
 	elem->state = ELEM_FREE;
 	elem->size = size;
 	elem->pad = 0;
+	elem->orig_elem = orig_elem;
+	elem->orig_size = orig_size;
 	set_header(elem);
 	set_trailer(elem);
 }
@@ -278,7 +281,8 @@ split_elem(struct malloc_elem *elem, struct malloc_elem *split_pt)
 	const size_t old_elem_size = (uintptr_t)split_pt - (uintptr_t)elem;
 	const size_t new_elem_size = elem->size - old_elem_size;
 
-	malloc_elem_init(split_pt, elem->heap, elem->msl, new_elem_size);
+	malloc_elem_init(split_pt, elem->heap, elem->msl, new_elem_size,
+			elem->orig_elem, elem->orig_size);
 	split_pt->prev = elem;
 	split_pt->next = next_elem;
 	if (next_elem)
@@ -469,7 +473,8 @@ malloc_elem_join_adjacent_free(struct malloc_elem *elem)
 	 * with it, need to remove from free list.
 	 */
 	if (elem->next != NULL && elem->next->state == ELEM_FREE &&
-			next_elem_is_adjacent(elem)) {
+			next_elem_is_adjacent(elem) &&
+			elem->orig_elem == elem->next->orig_elem) {
 		void *erase;
 		size_t erase_len;
 
@@ -490,7 +495,8 @@ malloc_elem_join_adjacent_free(struct malloc_elem *elem)
 	 * with it, need to remove from free list.
 	 */
 	if (elem->prev != NULL && elem->prev->state == ELEM_FREE &&
-			prev_elem_is_adjacent(elem)) {
+			prev_elem_is_adjacent(elem) &&
+			elem->orig_elem == elem->prev->orig_elem) {
 		struct malloc_elem *new_elem;
 		void *erase;
 		size_t erase_len;
diff --git a/lib/librte_eal/common/malloc_elem.h b/lib/librte_eal/common/malloc_elem.h
index e2bda4c02..207767c94 100644
--- a/lib/librte_eal/common/malloc_elem.h
+++ b/lib/librte_eal/common/malloc_elem.h
@@ -32,6 +32,8 @@ struct malloc_elem {
 	volatile enum elem_state state;
 	uint32_t pad;
 	size_t size;
+	struct malloc_elem *orig_elem;
+	size_t orig_size;
 #ifdef RTE_MALLOC_DEBUG
 	uint64_t header_cookie;         /* Cookie marking start of data */
 					/* trailer cookie at start + size */
@@ -116,7 +118,9 @@ void
 malloc_elem_init(struct malloc_elem *elem,
 		struct malloc_heap *heap,
 		struct rte_memseg_list *msl,
-		size_t size);
+		size_t size,
+		struct malloc_elem *orig_elem,
+		size_t orig_size);
 
 void
 malloc_elem_insert(struct malloc_elem *elem);
diff --git a/lib/librte_eal/common/malloc_heap.c b/lib/librte_eal/common/malloc_heap.c
index c6a6d4f6b..602528299 100644
--- a/lib/librte_eal/common/malloc_heap.c
+++ b/lib/librte_eal/common/malloc_heap.c
@@ -94,7 +94,7 @@ malloc_heap_add_memory(struct malloc_heap *heap, struct rte_memseg_list *msl,
 {
 	struct malloc_elem *elem = start;
 
-	malloc_elem_init(elem, heap, msl, len);
+	malloc_elem_init(elem, heap, msl, len, elem, len);
 
 	malloc_elem_insert(elem);
 
@@ -857,6 +857,11 @@ malloc_heap_free(struct malloc_elem *elem)
 	if (elem->size < page_sz)
 		goto free_unlock;
 
+	/* we can only free memory back to the system as it was
+	 * originally allocated */
+	if (elem->size != elem->orig_size)
+		goto free_unlock;
+
 	/* probably, but let's make sure, as we may not be using up full page */
 	start = elem;
 	len = elem->size;
-- 
2.12.2