DPDK patches and discussions
* [PATCH] malloc: fix malloc performance degradation as the number of mallocs increases
@ 2023-02-10  6:30 Fengnan Chang
  2023-02-15 10:10 ` Thomas Monjalon
  2023-02-20 10:59 ` David Marchand
  0 siblings, 2 replies; 9+ messages in thread
From: Fengnan Chang @ 2023-02-10  6:30 UTC (permalink / raw)
  To: anatoly.burakov; +Cc: rsanford, dev, Fengnan Chang

Here is a simple test case:
"
uint64_t entry_time, time;
size_t size = 4096;
unsigned align = 4096;
for (int j = 0; j < 10; j++) {
	entry_time = rte_get_timer_cycles();
	for (int i = 0; i < 2000; i++) {
		rte_malloc(NULL, size, align);
	}
	time = (rte_get_timer_cycles()-entry_time) * 1000000 /
		rte_get_timer_hz();
	printf("total open time %lu avg time %lu\n", time, time/2000);
}
"

The cost of a single rte_malloc call may become worse as the number of
mallocs increases. In my environment, the first round's average time is
15us, the second is 44us, the third 77us, the fourth 168us...

The reason is that, during allocation, malloc_elem_alloc may split new_elem
if there is too much free space after new_elem, and insert the trailer
into the free list. When allocating 4k with 4k alignment, the trailer is
very likely inserted into free_head[2] again, which makes free_head[2]
longer. The next 4k allocation searches free_head[2] from the beginning,
so as the number of mallocs increases, searching free_head[2] takes more
time and performance becomes worse. The same problem occurs when
allocating 64k with 64k alignment, but allocating 4k with 64-byte
alignment does not have this problem.

Fix this by adjusting the free_head list size ranges: make free_head[3]
hold elements greater than or equal to 4k, and free_head[4] hold elements
greater than or equal to 16k.
In terms of probabilities, when allocating 4k or 16k, the chance of
finding a suitable element in a larger-size list is greater than in a
smaller-size list.
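
As an illustration, here is a minimal standalone sketch of the index
computation before and after this patch. It assumes MALLOC_MINSIZE_LOG2 = 8
and MALLOC_LOG2_INCREMENT = 2 (the values implied by the example ranges in
the code comment) and omits the early return for sizes <= 2^8:
"
#include <stdio.h>

#define MALLOC_MINSIZE_LOG2	8
#define MALLOC_LOG2_INCREMENT	2

/* old behaviour: log2 of the next power of 2 >= size */
static size_t index_old(size_t size)
{
	size_t log2 = sizeof(size) * 8 - __builtin_clzl(size - 1);
	return (log2 - MALLOC_MINSIZE_LOG2 + MALLOC_LOG2_INCREMENT - 1) /
		MALLOC_LOG2_INCREMENT;
}

/* new behaviour: log2 of the next power of 2 > size */
static size_t index_new(size_t size)
{
	size_t log2 = sizeof(size) * 8 - __builtin_clzl(size);
	return (log2 - MALLOC_MINSIZE_LOG2 + MALLOC_LOG2_INCREMENT - 1) /
		MALLOC_LOG2_INCREMENT;
}

int main(void)
{
	/* prints "4096: old 2 new 3": 4k elements move to free_head[3] */
	printf("4096: old %zu new %zu\n", index_old(4096), index_new(4096));
	/* prints "16384: old 3 new 4": 16k elements move to free_head[4] */
	printf("16384: old %zu new %zu\n", index_old(16384), index_new(16384));
	return 0;
}
"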

Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
---
 lib/eal/common/malloc_elem.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/lib/eal/common/malloc_elem.c b/lib/eal/common/malloc_elem.c
index 83f05497cc..35a2313d04 100644
--- a/lib/eal/common/malloc_elem.c
+++ b/lib/eal/common/malloc_elem.c
@@ -367,11 +367,11 @@ prev_elem_is_adjacent(struct malloc_elem *elem)
  * containing larger elements.
  *
  * Example element size ranges for a heap with five free lists:
- *   heap->free_head[0] - (0   , 2^8]
- *   heap->free_head[1] - (2^8 , 2^10]
- *   heap->free_head[2] - (2^10 ,2^12]
- *   heap->free_head[3] - (2^12, 2^14]
- *   heap->free_head[4] - (2^14, MAX_SIZE]
+ *   heap->free_head[0] - (0   , 2^8)
+ *   heap->free_head[1] - [2^8 , 2^10)
+ *   heap->free_head[2] - [2^10 ,2^12)
+ *   heap->free_head[3] - [2^12, 2^14)
+ *   heap->free_head[4] - [2^14, MAX_SIZE]
  */
 size_t
 malloc_elem_free_list_index(size_t size)
@@ -385,8 +385,8 @@ malloc_elem_free_list_index(size_t size)
 	if (size <= (1UL << MALLOC_MINSIZE_LOG2))
 		return 0;
 
-	/* Find next power of 2 >= size. */
-	log2 = sizeof(size) * 8 - __builtin_clzl(size - 1);
+	/* Find next power of 2 > size. */
+	log2 = sizeof(size) * 8 - __builtin_clzl(size);
 
 	/* Compute freelist index, based on log2(size). */
 	index = (log2 - MALLOC_MINSIZE_LOG2 + MALLOC_LOG2_INCREMENT - 1) /
-- 
2.37.0 (Apple Git-136)



* Re: [PATCH] malloc: fix malloc performance degradation as the number of mallocs increases
  2023-02-10  6:30 [PATCH] malloc: fix malloc performance degradation as the number of mallocs increases Fengnan Chang
@ 2023-02-15 10:10 ` Thomas Monjalon
  2023-02-15 11:10   ` Morten Brørup
  2023-02-16 10:40   ` Liang Ma
  2023-02-20 10:59 ` David Marchand
  1 sibling, 2 replies; 9+ messages in thread
From: Thomas Monjalon @ 2023-02-15 10:10 UTC (permalink / raw)
  To: anatoly.burakov, Fengnan Chang
  Cc: dev, rsanford, bruce.richardson, Morten Brørup,
	maxime.coquelin, david.marchand, jerinj, honnappa.nagarahalli

Looking for reviewers please.

10/02/2023 07:30, Fengnan Chang:
> Here is a simple test case:
> "
> uint64_t entry_time, time;
> size_t size = 4096;
> unsigned align = 4096;
> for (int j = 0; j < 10; j++) {
> 	entry_time = rte_get_timer_cycles();
> 	for (int i = 0; i < 2000; i++) {
> 		rte_malloc(NULL, size, align);
> 	}
> 	time = (rte_get_timer_cycles()-entry_time) * 1000000 /
> 		rte_get_timer_hz();
> 	printf("total open time %lu avg time %lu\n", time, time/2000);
> }
> "
> 
> The cost of a single rte_malloc call may become worse as the number of
> mallocs increases. In my environment, the first round's average time is
> 15us, the second is 44us, the third 77us, the fourth 168us...
> 
> The reason is that, during allocation, malloc_elem_alloc may split new_elem
> if there is too much free space after new_elem, and insert the trailer
> into the free list. When allocating 4k with 4k alignment, the trailer is
> very likely inserted into free_head[2] again, which makes free_head[2]
> longer. The next 4k allocation searches free_head[2] from the beginning,
> so as the number of mallocs increases, searching free_head[2] takes more
> time and performance becomes worse. The same problem occurs when
> allocating 64k with 64k alignment, but allocating 4k with 64-byte
> alignment does not have this problem.
> 
> Fix this by adjusting the free_head list size ranges: make free_head[3]
> hold elements greater than or equal to 4k, and free_head[4] hold elements
> greater than or equal to 16k.
> In terms of probabilities, when allocating 4k or 16k, the chance of
> finding a suitable element in a larger-size list is greater than in a
> smaller-size list.
> 
> Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
> ---
>  lib/eal/common/malloc_elem.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/lib/eal/common/malloc_elem.c b/lib/eal/common/malloc_elem.c
> index 83f05497cc..35a2313d04 100644
> --- a/lib/eal/common/malloc_elem.c
> +++ b/lib/eal/common/malloc_elem.c
> @@ -367,11 +367,11 @@ prev_elem_is_adjacent(struct malloc_elem *elem)
>   * containing larger elements.
>   *
>   * Example element size ranges for a heap with five free lists:
> - *   heap->free_head[0] - (0   , 2^8]
> - *   heap->free_head[1] - (2^8 , 2^10]
> - *   heap->free_head[2] - (2^10 ,2^12]
> - *   heap->free_head[3] - (2^12, 2^14]
> - *   heap->free_head[4] - (2^14, MAX_SIZE]
> + *   heap->free_head[0] - (0   , 2^8)
> + *   heap->free_head[1] - [2^8 , 2^10)
> + *   heap->free_head[2] - [2^10 ,2^12)
> + *   heap->free_head[3] - [2^12, 2^14)
> + *   heap->free_head[4] - [2^14, MAX_SIZE]
>   */
>  size_t
>  malloc_elem_free_list_index(size_t size)
> @@ -385,8 +385,8 @@ malloc_elem_free_list_index(size_t size)
>  	if (size <= (1UL << MALLOC_MINSIZE_LOG2))
>  		return 0;
>  
> -	/* Find next power of 2 >= size. */
> -	log2 = sizeof(size) * 8 - __builtin_clzl(size - 1);
> +	/* Find next power of 2 > size. */
> +	log2 = sizeof(size) * 8 - __builtin_clzl(size);
>  
>  	/* Compute freelist index, based on log2(size). */
>  	index = (log2 - MALLOC_MINSIZE_LOG2 + MALLOC_LOG2_INCREMENT - 1) /
> 


* RE: [PATCH] malloc: fix malloc performance degradation as the number of mallocs increases
  2023-02-15 10:10 ` Thomas Monjalon
@ 2023-02-15 11:10   ` Morten Brørup
  2023-02-15 17:16     ` Stephen Hemminger
  2023-02-16 14:02     ` Liang Ma
  2023-02-16 10:40   ` Liang Ma
  1 sibling, 2 replies; 9+ messages in thread
From: Morten Brørup @ 2023-02-15 11:10 UTC (permalink / raw)
  To: Thomas Monjalon, anatoly.burakov, Fengnan Chang
  Cc: dev, rsanford, bruce.richardson, maxime.coquelin, david.marchand,
	jerinj, honnappa.nagarahalli, Fidaullah Noonari

+CC: Fidaullah Noonari <fidaullah.noonari@emumba.com>, your name also shows up in the git log; perhaps you can help review this patch.

> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Wednesday, 15 February 2023 11.10
> 
> Looking for reviewers please.
> 
> 10/02/2023 07:30, Fengnan Chang:
> > Here is a simple test case:
> > "
> > uint64_t entry_time, time;
> > size_t size = 4096;
> > unsigned align = 4096;
> > for (int j = 0; j < 10; j++) {
> > 	entry_time = rte_get_timer_cycles();
> > 	for (int i = 0; i < 2000; i++) {
> > 		rte_malloc(NULL, size, align);
> > 	}
> > 	time = (rte_get_timer_cycles()-entry_time) * 1000000 /
> > 		rte_get_timer_hz();
> > 	printf("total open time %lu avg time %lu\n", time, time/2000);
> > }
> > "
> >
> > The cost of a single rte_malloc call may become worse as the number
> > of mallocs increases. In my environment, the first round's average
> > time is 15us, the second is 44us, the third 77us, the fourth 168us...
> >
> > The reason is that, during allocation, malloc_elem_alloc may split
> > new_elem if there is too much free space after new_elem, and insert
> > the trailer into the free list. When allocating 4k with 4k alignment,
> > the trailer is very likely inserted into free_head[2] again, which
> > makes free_head[2] longer. The next 4k allocation searches
> > free_head[2] from the beginning, so as the number of mallocs
> > increases, searching free_head[2] takes more time and performance
> > becomes worse. The same problem occurs when allocating 64k with 64k
> > alignment, but allocating 4k with 64-byte alignment does not have
> > this problem.
> >
> > Fix this by adjusting the free_head list size ranges: make
> > free_head[3] hold elements greater than or equal to 4k, and
> > free_head[4] hold elements greater than or equal to 16k.
> > In terms of probabilities, when allocating 4k or 16k, the chance of
> > finding a suitable element in a larger-size list is greater than in
> > a smaller-size list.
> >
> > Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
> > ---
> >  lib/eal/common/malloc_elem.c | 14 +++++++-------
> >  1 file changed, 7 insertions(+), 7 deletions(-)
> >
> > diff --git a/lib/eal/common/malloc_elem.c
> b/lib/eal/common/malloc_elem.c
> > index 83f05497cc..35a2313d04 100644
> > --- a/lib/eal/common/malloc_elem.c
> > +++ b/lib/eal/common/malloc_elem.c
> > @@ -367,11 +367,11 @@ prev_elem_is_adjacent(struct malloc_elem *elem)
> >   * containing larger elements.
> >   *
> >   * Example element size ranges for a heap with five free lists:
> > - *   heap->free_head[0] - (0   , 2^8]
> > - *   heap->free_head[1] - (2^8 , 2^10]
> > - *   heap->free_head[2] - (2^10 ,2^12]
> > - *   heap->free_head[3] - (2^12, 2^14]
> > - *   heap->free_head[4] - (2^14, MAX_SIZE]
> > + *   heap->free_head[0] - (0   , 2^8)
> > + *   heap->free_head[1] - [2^8 , 2^10)
> > + *   heap->free_head[2] - [2^10 ,2^12)
> > + *   heap->free_head[3] - [2^12, 2^14)
> > + *   heap->free_head[4] - [2^14, MAX_SIZE]
> >   */
> >  size_t
> >  malloc_elem_free_list_index(size_t size)
> > @@ -385,8 +385,8 @@ malloc_elem_free_list_index(size_t size)
> >  	if (size <= (1UL << MALLOC_MINSIZE_LOG2))
> >  		return 0;
> >
> > -	/* Find next power of 2 >= size. */
> > -	log2 = sizeof(size) * 8 - __builtin_clzl(size - 1);
> > +	/* Find next power of 2 > size. */
> > +	log2 = sizeof(size) * 8 - __builtin_clzl(size);
> >
> >  	/* Compute freelist index, based on log2(size). */
> >  	index = (log2 - MALLOC_MINSIZE_LOG2 + MALLOC_LOG2_INCREMENT - 1)
> /
> >

I gave up reviewing in depth, because the existing code is not easy to understand quickly, and it would take too long for me. E.g. the malloc_elem->size field is undocumented, and find_suitable_element() calls malloc_elem_free_list_index() with the raw size (as passed to the function), but malloc_elem_free_list_insert() calls malloc_elem_free_list_index() with malloc_elem->size - MALLOC_ELEM_HEADER_LEN.
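
In code form, the asymmetry looks roughly like this (paraphrased from the two call sites as described above; not verbatim DPDK code):

	/* lookup side: index computed from the raw requested size */
	idx = malloc_elem_free_list_index(size);

	/* insert side: index computed from the element size minus the header */
	idx = malloc_elem_free_list_index(elem->size - MALLOC_ELEM_HEADER_LEN);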

Looking isolated at the patch itself...

I agree with the way the patch modifies the ranges of the free list, and the consequential removal of the "- 1" from the calculation of log2.

Intuitively, the lists should cover ranges such as [0x100..0x3FF], which this patch does, not [0x101..0x400] as it was previously... The ranges with this patch make much more sense.

So if the existing code is otherwise correct, i.e. handles the size with/without MALLOC_ELEM_HEADER_LEN correctly, my gut feeling says this patch is an improvement.

Acked-by: Morten Brørup <mb@smartsharesystems.com>



* Re: [PATCH] malloc: fix malloc performance degradation as the number of mallocs increases
  2023-02-15 11:10   ` Morten Brørup
@ 2023-02-15 17:16     ` Stephen Hemminger
  2023-02-16  2:54       ` [External] " Fengnan Chang
  2023-02-16 14:02     ` Liang Ma
  1 sibling, 1 reply; 9+ messages in thread
From: Stephen Hemminger @ 2023-02-15 17:16 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Thomas Monjalon, anatoly.burakov, Fengnan Chang, dev, rsanford,
	bruce.richardson, maxime.coquelin, david.marchand, jerinj,
	honnappa.nagarahalli, Fidaullah Noonari

On Wed, 15 Feb 2023 12:10:23 +0100
Morten Brørup <mb@smartsharesystems.com> wrote:

> Looking isolated at the patch itself...
> 
> I agree with the way the patch modifies the ranges of the free list, and the consequential removal of the "- 1" from the calculation of log2.
> 
> Intuitively, the lists should cover ranges such as [0x100..0x3FF], which this patch does, not [0x101..0x400] as it was previously... The ranges with this patch make much more sense.
> 
> So if the existing code is otherwise correct, i.e. handles the size with/without MALLOC_ELEM_HEADER_LEN correctly, my gut feeling says this patch is an improvement.
> 
> Acked-by: Morten Brørup <mb@smartsharesystems.com>

It would be good to have a malloc performance test.
Possibly something reused from some other project.


* Re: [External] Re: [PATCH] malloc: fix malloc performance degradation as the number of mallocs increases
  2023-02-15 17:16     ` Stephen Hemminger
@ 2023-02-16  2:54       ` Fengnan Chang
  0 siblings, 0 replies; 9+ messages in thread
From: Fengnan Chang @ 2023-02-16  2:54 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Morten Brørup, Thomas Monjalon, anatoly.burakov, dev,
	rsanford, bruce.richardson, maxime.coquelin, david.marchand,
	jerinj, honnappa.nagarahalli, Fidaullah Noonari

Stephen Hemminger <stephen@networkplumber.org> wrote on Thu, 16 Feb 2023 at 01:16:
>
> On Wed, 15 Feb 2023 12:10:23 +0100
> Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > Looking isolated at the patch itself...
> >
> > I agree with the way the patch modifies the ranges of the free list, and the consequential removal of the "- 1" from the calculation of log2.
> >
> > Intuitively, the lists should cover ranges such as [0x100..0x3FF], which this patch does, not [0x101..0x400] as it was previously... The ranges with this patch make much more sense.
> >
> > So if the existing code is otherwise correct, i.e. handles the size with/without MALLOC_ELEM_HEADER_LEN correctly, my gut feeling says this patch is an improvement.
> >
> > Acked-by: Morten Brørup <mb@smartsharesystems.com>
>
> It would be good to have a malloc performance test.
> Possibly something reused from some other project.

I have done some performance tests in SPDK before; maybe they are
available for your reference:

https://bytedance.feishu.cn/wiki/wikcnKWXQmN4qhyxLcWJqP4vrOs

In short, the performance of 4k malloc has improved, and the other
cases are almost the same.


* Re: [PATCH] malloc: fix malloc performance degradation as the number of mallocs increases
  2023-02-15 10:10 ` Thomas Monjalon
  2023-02-15 11:10   ` Morten Brørup
@ 2023-02-16 10:40   ` Liang Ma
  1 sibling, 0 replies; 9+ messages in thread
From: Liang Ma @ 2023-02-16 10:40 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: anatoly.burakov, Fengnan Chang, dev, rsanford, bruce.richardson,
	Morten Brørup, maxime.coquelin, david.marchand, jerinj,
	honnappa.nagarahalli

On Wed, Feb 15, 2023 at 11:10:25AM +0100, Thomas Monjalon wrote:
> Looking for reviewers please.
 I will help with this.
> 
> 10/02/2023 07:30, Fengnan Chang:
> > Here is a simple test case:
> > "
> > uint64_t entry_time, time;
> > size_t size = 4096;
> > unsigned align = 4096;
> > for (int j = 0; j < 10; j++) {
> > 	entry_time = rte_get_timer_cycles();
> > 	for (int i = 0; i < 2000; i++) {
> > 		rte_malloc(NULL, size, align);
> > 	}
> > 	time = (rte_get_timer_cycles()-entry_time) * 1000000 /
> > 		rte_get_timer_hz();
> > 	printf("total open time %lu avg time %lu\n", time, time/2000);
> > }
> > "
> > 
> > The cost of a single rte_malloc call may become worse as the number of
> > mallocs increases. In my environment, the first round's average time is
> > 15us, the second is 44us, the third 77us, the fourth 168us...
> > 
> > The reason is that, during allocation, malloc_elem_alloc may split new_elem
> > if there is too much free space after new_elem, and insert the trailer
> > into the free list. When allocating 4k with 4k alignment, the trailer is
> > very likely inserted into free_head[2] again, which makes free_head[2]
> > longer. The next 4k allocation searches free_head[2] from the beginning,
> > so as the number of mallocs increases, searching free_head[2] takes more
> > time and performance becomes worse. The same problem occurs when
> > allocating 64k with 64k alignment, but allocating 4k with 64-byte
> > alignment does not have this problem.
> > 
> > Fix this by adjusting the free_head list size ranges: make free_head[3]
> > hold elements greater than or equal to 4k, and free_head[4] hold elements
> > greater than or equal to 16k.
> > In terms of probabilities, when allocating 4k or 16k, the chance of
> > finding a suitable element in a larger-size list is greater than in a
> > smaller-size list.
> > 
> > Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
> > ---
> >  lib/eal/common/malloc_elem.c | 14 +++++++-------
> >  1 file changed, 7 insertions(+), 7 deletions(-)
> > 
> > diff --git a/lib/eal/common/malloc_elem.c b/lib/eal/common/malloc_elem.c
> > index 83f05497cc..35a2313d04 100644
> > --- a/lib/eal/common/malloc_elem.c
> > +++ b/lib/eal/common/malloc_elem.c
> > @@ -367,11 +367,11 @@ prev_elem_is_adjacent(struct malloc_elem *elem)
> >   * containing larger elements.
> >   *
> >   * Example element size ranges for a heap with five free lists:
> > - *   heap->free_head[0] - (0   , 2^8]
> > - *   heap->free_head[1] - (2^8 , 2^10]
> > - *   heap->free_head[2] - (2^10 ,2^12]
> > - *   heap->free_head[3] - (2^12, 2^14]
> > - *   heap->free_head[4] - (2^14, MAX_SIZE]
> > + *   heap->free_head[0] - (0   , 2^8)
> > + *   heap->free_head[1] - [2^8 , 2^10)
> > + *   heap->free_head[2] - [2^10 ,2^12)
> > + *   heap->free_head[3] - [2^12, 2^14)
> > + *   heap->free_head[4] - [2^14, MAX_SIZE]
> >   */
> >  size_t
> >  malloc_elem_free_list_index(size_t size)
> > @@ -385,8 +385,8 @@ malloc_elem_free_list_index(size_t size)
> >  	if (size <= (1UL << MALLOC_MINSIZE_LOG2))
> >  		return 0;
> >  
> > -	/* Find next power of 2 >= size. */
> > -	log2 = sizeof(size) * 8 - __builtin_clzl(size - 1);
> > +	/* Find next power of 2 > size. */
> > +	log2 = sizeof(size) * 8 - __builtin_clzl(size);
> >  
> >  	/* Compute freelist index, based on log2(size). */
> >  	index = (log2 - MALLOC_MINSIZE_LOG2 + MALLOC_LOG2_INCREMENT - 1) /
> > 

* Re: [PATCH] malloc: fix malloc performance degradation as the number of mallocs increases
  2023-02-15 11:10   ` Morten Brørup
  2023-02-15 17:16     ` Stephen Hemminger
@ 2023-02-16 14:02     ` Liang Ma
  2023-02-17  2:14       ` [External] " Fengnan Chang
  1 sibling, 1 reply; 9+ messages in thread
From: Liang Ma @ 2023-02-16 14:02 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Thomas Monjalon, anatoly.burakov, Fengnan Chang, dev, rsanford,
	bruce.richardson, maxime.coquelin, david.marchand, jerinj,
	honnappa.nagarahalli, Fidaullah Noonari

On Wed, Feb 15, 2023 at 12:10:23PM +0100, Morten Brørup wrote:
> +CC: Fidaullah Noonari <fidaullah.noonari@emumba.com>, your name also shows up in the git log; perhaps you can help review this patch.
> 
> 
> I gave up reviewing in depth, because the existing code is not easy to understand quickly, and it would take too long for me. E.g. the malloc_elem->size field is undocumented, and find_suitable_element() calls malloc_elem_free_list_index() with the raw size (as passed to the function), but malloc_elem_free_list_insert() calls malloc_elem_free_list_index() with malloc_elem->size - MALLOC_ELEM_HEADER_LEN.
> 
> Looking isolated at the patch itself...
> 
> I agree with the way the patch modifies the ranges of the free list, and the consequential removal of the "- 1" from the calculation of log2.
> 
> Intuitively, the lists should cover ranges such as [0x100..0x3FF], which this patch does, not [0x101..0x400] as it was previously... The ranges with this patch make much more sense.
> 
> So if the existing code is otherwise correct, i.e. handles the size with/without MALLOC_ELEM_HEADER_LEN correctly, my gut feeling says this patch is an improvement.
> 
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
I ran the test with the DPDK malloc perf test. The issue is replicated.
IMO, the whole process is: if the application uses rte_malloc with a
relatively large alignment, e.g. 4K, the current implementation will
generate lots of "fragments" in the related mem element free list,
because of the large alignment. In the test code, a 4K malloc size plus
4K alignment means the space actually needed has to be taken from an
upper-level mem element free list. The consequence is that this
generates lots of fragment elements, and those elements are inserted
into the current-level free list. However, when rte_malloc selects
which free list to start scanning, it doesn't take the alignment into
account, so it ends up wasting time in the current-level free list. If
the scan logic took alignment into account, it might "smartly" skip the
current level and jump to the upper level directly.
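
A rough sketch of that idea, purely hypothetical and not the current
DPDK code (first_list_index is an invented helper;
malloc_elem_free_list_index is the real function touched by this patch):

	/*
	 * Pad the lookup size by the worst-case alignment slack, so a
	 * 4k-size/4k-align request starts scanning one level up instead
	 * of walking the crowded lower list.
	 */
	static size_t
	first_list_index(size_t size, unsigned int align)
	{
		size_t lookup = size;

		/* an element serving "size" bytes at "align" alignment
		 * may need up to align - 1 extra bytes of slack */
		if (align > 1)
			lookup += align - 1;

		return malloc_elem_free_list_index(lookup);
	}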

> 


* Re: [External] Re: [PATCH] malloc: fix malloc performance degradation as the number of mallocs increases
  2023-02-16 14:02     ` Liang Ma
@ 2023-02-17  2:14       ` Fengnan Chang
  0 siblings, 0 replies; 9+ messages in thread
From: Fengnan Chang @ 2023-02-17  2:14 UTC (permalink / raw)
  To: Liang Ma
  Cc: Morten Brørup, Thomas Monjalon, anatoly.burakov, dev,
	rsanford, bruce.richardson, maxime.coquelin, david.marchand,
	jerinj, honnappa.nagarahalli, Fidaullah Noonari

Liang Ma <liangma@liangbit.com> wrote on Thu, 16 Feb 2023 at 22:04:
>
> On Wed, Feb 15, 2023 at 12:10:23PM +0100, Morten Brørup wrote:
> > +CC: Fidaullah Noonari <fidaullah.noonari@emumba.com>, your name also shows up in the git log; perhaps you can help review this patch.
> >
> >
> > I gave up reviewing in depth, because the existing code is not easy to understand quickly, and it would take too long for me. E.g. the malloc_elem->size field is undocumented, and find_suitable_element() calls malloc_elem_free_list_index() with the raw size (as passed to the function), but malloc_elem_free_list_insert() calls malloc_elem_free_list_index() with malloc_elem->size - MALLOC_ELEM_HEADER_LEN.
> >
> > Looking isolated at the patch itself...
> >
> > I agree with the way the patch modifies the ranges of the free list, and the consequential removal of the "- 1" from the calculation of log2.
> >
> > Intuitively, the lists should cover ranges such as [0x100..0x3FF], which this patch does, not [0x101..0x400] as it was previously... The ranges with this patch make much more sense.
> >
> > So if the existing code is otherwise correct, i.e. handles the size with/without MALLOC_ELEM_HEADER_LEN correctly, my gut feeling says this patch is an improvement.
> >
> > Acked-by: Morten Brørup <mb@smartsharesystems.com>
> I ran the test with the DPDK malloc perf test. The issue is replicated.
> IMO, the whole process is: if the application uses rte_malloc with a
> relatively large alignment, e.g. 4K, the current implementation will
> generate lots of "fragments" in the related mem element free list,
> because of the large alignment. In the test code, a 4K malloc size plus
> 4K alignment means the space actually needed has to be taken from an
> upper-level mem element free list. The consequence is that this
> generates lots of fragment elements, and those elements are inserted
> into the current-level free list. However, when rte_malloc selects
> which free list to start scanning, it doesn't take the alignment into
> account, so it ends up wasting time in the current-level free list. If
> the scan logic took alignment into account, it might "smartly" skip the
> current level and jump to the upper level directly.
>

Thank you for the detailed explanation!
You may have already noticed that this patch only solves the
fragmentation problem for the 4k/16k/64k cases; it does not work for
3k/15k/63k. For example, allocating 3k aligned to 1k also has the same
problem.

Since this patch can't handle all similar scenarios, we have already
started working on a new solution which aims to solve all similar
problems.

BTW, this patch is still useful, because even without fragmentation
problems, the probability of finding a suitable element in a
larger-size list is greater than in a smaller-size list.
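
To make the 3k example concrete, here is a back-of-the-envelope check
against the patched index function, with the same assumed constants as
in the sketch earlier in the thread:

	/* size 3k = 3072: log2 = 64 - __builtin_clzl(3072) = 12,
	 * index = (12 - 8 + 2 - 1) / 2 = 2,
	 * so a 3k request still scans free_head[2], the same list its
	 * 1k..4k fragments are inserted into, and that list keeps
	 * growing just like in the original 4k/4k case. */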


> >
>


* Re: [PATCH] malloc: fix malloc performance degradation as the number of mallocs increases
  2023-02-10  6:30 [PATCH] malloc: fix malloc performance degradation as the number of mallocs increases Fengnan Chang
  2023-02-15 10:10 ` Thomas Monjalon
@ 2023-02-20 10:59 ` David Marchand
  1 sibling, 0 replies; 9+ messages in thread
From: David Marchand @ 2023-02-20 10:59 UTC (permalink / raw)
  To: Fengnan Chang
  Cc: anatoly.burakov, rsanford, dev, Morten Brørup, Liang Ma,
	Stephen Hemminger

On Fri, Feb 10, 2023 at 7:30 AM Fengnan Chang
<changfengnan@bytedance.com> wrote:
>
> Here is a simple test case:
> "
> uint64_t entry_time, time;
> size_t size = 4096;
> unsigned align = 4096;
> for (int j = 0; j < 10; j++) {
>         entry_time = rte_get_timer_cycles();
>         for (int i = 0; i < 2000; i++) {
>                 rte_malloc(NULL, size, align);
>         }
>         time = (rte_get_timer_cycles()-entry_time) * 1000000 /
>                 rte_get_timer_hz();
>         printf("total open time %lu avg time %lu\n", time, time/2000);
> }
> "
>
> The cost of a single rte_malloc call may become worse as the number of
> mallocs increases. In my environment, the first round's average time is
> 15us, the second is 44us, the third 77us, the fourth 168us...
>
> The reason is that, during allocation, malloc_elem_alloc may split new_elem
> if there is too much free space after new_elem, and insert the trailer
> into the free list. When allocating 4k with 4k alignment, the trailer is
> very likely inserted into free_head[2] again, which makes free_head[2]
> longer. The next 4k allocation searches free_head[2] from the beginning,
> so as the number of mallocs increases, searching free_head[2] takes more
> time and performance becomes worse. The same problem occurs when
> allocating 64k with 64k alignment, but allocating 4k with 64-byte
> alignment does not have this problem.
>
> Fix this by adjusting the free_head list size ranges: make free_head[3]
> hold elements greater than or equal to 4k, and free_head[4] hold elements
> greater than or equal to 16k.
> In terms of probabilities, when allocating 4k or 16k, the chance of
> finding a suitable element in a larger-size list is greater than in a
> smaller-size list.
>
> Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

The change looks simple enough.
I see an improvement with the (verbatim) malloc_perf_autotest unit
tests for 1M allocations too.

Let's take this change now and see how it goes with -rc1 testing.


Applied, thanks.

-- 
David Marchand



Thread overview: 9+ messages
2023-02-10  6:30 [PATCH] malloc: fix malloc performance degradation as the number of mallocs increases Fengnan Chang
2023-02-15 10:10 ` Thomas Monjalon
2023-02-15 11:10   ` Morten Brørup
2023-02-15 17:16     ` Stephen Hemminger
2023-02-16  2:54       ` [External] " Fengnan Chang
2023-02-16 14:02     ` Liang Ma
2023-02-17  2:14       ` [External] " Fengnan Chang
2023-02-16 10:40   ` Liang Ma
2023-02-20 10:59 ` David Marchand
