From: Jim Harris <james.r.harris@intel.com>
To: dev@dpdk.org
Cc: anatoly.burakov@intel.com
Subject: [dpdk-dev] [RFC] mem: do not merge elems from different heap allocs
Date: Thu, 29 Nov 2018 12:13:30 -0700 [thread overview]
Message-ID: <20181129191330.31792-1-james.r.harris@intel.com> (raw)
SPDK uses the rte_mem_event_callback_register API to
create RDMA memory regions (MRs) for newly allocated regions
of memory. This is used in both the SPDK NVMe-oF target
and the NVMe-oF host driver.
DPDK creates internal malloc_elem structures for these
allocated regions. As users malloc and free memory, DPDK
will sometimes merge malloc_elems that originated from
different allocations that were notified through the
registered mem_event callback routine. This results
in subsequent allocations that can span across multiple
RDMA MRs. This requires SPDK to check each DPDK buffer to
see if it crosses an MR boundary, and if so, would have to
add considerable logic and complexity to describe that
buffer before it can be accessed by the RNIC. It is somewhat
analagous to rte_malloc returning a buffer that is not
IOVA-contiguous.
As a malloc_elem gets split and some of these elements
get freed, it can also result in DPDK sending an
RTE_MEM_EVENT_FREE notification for a subset of the
original RTE_MEM_EVENT_ALLOC notification. This is also
problematic for RDMA memory regions, since unregistering
the memory region is all-or-nothing. It is not possible
to unregister part of a memory region.
SPDK is currently working around this latter issue by
setting the RTE_MEMSEG_FLAG_DO_NOT_FREE on each of the
memory segments associated with an RTE_MEM_EVENT_ALLOC
notification. But the first issue with merged elements
crossing MR boundaries is still possible and we have
some rather simple allocation patterns that can
trigger it. So we would like to propose disabling the
merging of malloc_elems that originate from different
RTE_MEM_EVENT_ALLOC events.
This patch demonstrates how this merging could be
avoided. There are some allocation patterns (especially
frequent multi-huge-page rte_mallocs and rte_frees) that
could result in higher memory usage which need to be
evaluated. It's possible this behavior would need to be
explicitly enabled through an init flag or a separate API,
or implicitly based on whether any memory registration
callbacks have been registered - implementation of one
of those is deferred pending feedback on this RFC.
Signed-off-by: Jim Harris <james.r.harris@intel.com>
---
lib/librte_eal/common/malloc_elem.c | 14 ++++++++++----
lib/librte_eal/common/malloc_elem.h | 6 +++++-
lib/librte_eal/common/malloc_heap.c | 7 ++++++-
3 files changed, 21 insertions(+), 6 deletions(-)
diff --git a/lib/librte_eal/common/malloc_elem.c b/lib/librte_eal/common/malloc_elem.c
index 9d3dcb6a9..785ecdadd 100644
--- a/lib/librte_eal/common/malloc_elem.c
+++ b/lib/librte_eal/common/malloc_elem.c
@@ -110,7 +110,8 @@ malloc_elem_find_max_iova_contig(struct malloc_elem *elem, size_t align)
*/
void
malloc_elem_init(struct malloc_elem *elem, struct malloc_heap *heap,
- struct rte_memseg_list *msl, size_t size)
+ struct rte_memseg_list *msl, size_t size,
+ struct malloc_elem *orig_elem, size_t orig_size)
{
elem->heap = heap;
elem->msl = msl;
@@ -120,6 +121,8 @@ malloc_elem_init(struct malloc_elem *elem, struct malloc_heap *heap,
elem->state = ELEM_FREE;
elem->size = size;
elem->pad = 0;
+ elem->orig_elem = orig_elem;
+ elem->orig_size = orig_size;
set_header(elem);
set_trailer(elem);
}
@@ -278,7 +281,8 @@ split_elem(struct malloc_elem *elem, struct malloc_elem *split_pt)
const size_t old_elem_size = (uintptr_t)split_pt - (uintptr_t)elem;
const size_t new_elem_size = elem->size - old_elem_size;
- malloc_elem_init(split_pt, elem->heap, elem->msl, new_elem_size);
+ malloc_elem_init(split_pt, elem->heap, elem->msl, new_elem_size,
+ elem->orig_elem, elem->orig_size);
split_pt->prev = elem;
split_pt->next = next_elem;
if (next_elem)
@@ -469,7 +473,8 @@ malloc_elem_join_adjacent_free(struct malloc_elem *elem)
* with it, need to remove from free list.
*/
if (elem->next != NULL && elem->next->state == ELEM_FREE &&
- next_elem_is_adjacent(elem)) {
+ next_elem_is_adjacent(elem) &&
+ elem->orig_elem == elem->next->orig_elem) {
void *erase;
size_t erase_len;
@@ -490,7 +495,8 @@ malloc_elem_join_adjacent_free(struct malloc_elem *elem)
* with it, need to remove from free list.
*/
if (elem->prev != NULL && elem->prev->state == ELEM_FREE &&
- prev_elem_is_adjacent(elem)) {
+ prev_elem_is_adjacent(elem) &&
+ elem->orig_elem == elem->prev->orig_elem) {
struct malloc_elem *new_elem;
void *erase;
size_t erase_len;
diff --git a/lib/librte_eal/common/malloc_elem.h b/lib/librte_eal/common/malloc_elem.h
index e2bda4c02..207767c94 100644
--- a/lib/librte_eal/common/malloc_elem.h
+++ b/lib/librte_eal/common/malloc_elem.h
@@ -32,6 +32,8 @@ struct malloc_elem {
volatile enum elem_state state;
uint32_t pad;
size_t size;
+ struct malloc_elem *orig_elem;
+ size_t orig_size;
#ifdef RTE_MALLOC_DEBUG
uint64_t header_cookie; /* Cookie marking start of data */
/* trailer cookie at start + size */
@@ -116,7 +118,9 @@ void
malloc_elem_init(struct malloc_elem *elem,
struct malloc_heap *heap,
struct rte_memseg_list *msl,
- size_t size);
+ size_t size,
+ struct malloc_elem *orig_elem,
+ size_t orig_size);
void
malloc_elem_insert(struct malloc_elem *elem);
diff --git a/lib/librte_eal/common/malloc_heap.c b/lib/librte_eal/common/malloc_heap.c
index c6a6d4f6b..602528299 100644
--- a/lib/librte_eal/common/malloc_heap.c
+++ b/lib/librte_eal/common/malloc_heap.c
@@ -94,7 +94,7 @@ malloc_heap_add_memory(struct malloc_heap *heap, struct rte_memseg_list *msl,
{
struct malloc_elem *elem = start;
- malloc_elem_init(elem, heap, msl, len);
+ malloc_elem_init(elem, heap, msl, len, elem, len);
malloc_elem_insert(elem);
@@ -857,6 +857,11 @@ malloc_heap_free(struct malloc_elem *elem)
if (elem->size < page_sz)
goto free_unlock;
+ /* we can only free memory back to the system as it was
+ * originally allocated */
+ if (elem->size != elem->orig_size)
+ goto free_unlock;
+
/* probably, but let's make sure, as we may not be using up full page */
start = elem;
len = elem->size;
--
2.12.2
next reply other threads:[~2018-11-29 19:07 UTC|newest]
Thread overview: 2+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-11-29 19:13 Jim Harris [this message]
2018-12-04 16:49 ` Burakov, Anatoly
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20181129191330.31792-1-james.r.harris@intel.com \
--to=james.r.harris@intel.com \
--cc=anatoly.burakov@intel.com \
--cc=dev@dpdk.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).