* [RFC] mbuf: performance optimization
@ 2024-01-21 5:32 Morten Brørup
2024-01-21 17:07 ` Stephen Hemminger
2024-01-22 14:27 ` [PATCH v2] mempool: test performance with larger bursts Morten Brørup
0 siblings, 2 replies; 5+ messages in thread
From: Morten Brørup @ 2024-01-21 5:32 UTC (permalink / raw)
To: dev; +Cc: andrew.rybchenko
What is the largest realistic value of mbuf->priv_size (the size of an mbuf's application private data area) in any use case?
I am wondering if its size could be reduced from 16 to 8 bits. If a max value of 255 isn't enough, then perhaps, knowing that the private data area must be aligned to 8 bytes (RTE_MBUF_PRIV_ALIGN), the field could hold the size divided by 8. That would raise the max value to nearly 2 KB (2040 bytes).
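To make the divide-by-8 encoding concrete, here is a minimal sketch. The helper names are hypothetical, not the actual rte_mbuf API, and PRIV_ALIGN stands in for RTE_MBUF_PRIV_ALIGN:

```c
#include <assert.h>
#include <stdint.h>

#define PRIV_ALIGN 8 /* stand-in for RTE_MBUF_PRIV_ALIGN */

/* Store priv_size in units of 8 bytes: an 8-bit field then covers
 * 0..2040 bytes instead of 0..255. */
static inline uint8_t priv_size_encode(uint16_t priv_size)
{
	assert(priv_size % PRIV_ALIGN == 0);         /* must be 8-byte aligned */
	assert(priv_size / PRIV_ALIGN <= UINT8_MAX); /* must fit in 8 bits */
	return (uint8_t)(priv_size / PRIV_ALIGN);
}

static inline uint16_t priv_size_decode(uint8_t encoded)
{
	return (uint16_t)encoded * PRIV_ALIGN;
}
```

With this encoding, priv_size_encode(2040) yields 255, the largest value an 8-bit field can hold.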
I suppose that reducing mbuf->nb_segs from 16 to 8 bits is realistic, considering that a maximum-size IP packet (64 KB) is unlikely to use much more than 64 segments. Does anyone know of any use case with more than 255 segments in an mbuf?
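As a back-of-envelope check of the 64 KB reasoning (the segment sizes below are illustrative, not taken from any particular driver):

```c
/* Worst-case number of segments needed for a packet of pkt_len bytes,
 * given a per-segment data room, by ceiling division. */
static inline unsigned int worst_case_segs(unsigned int pkt_len,
					   unsigned int seg_data_room)
{
	return (pkt_len + seg_data_room - 1) / seg_data_room;
}

/* A maximum-size IP packet (65535 bytes) in 1024-byte segments needs 64
 * segments; in 2048-byte segments, 32. Both fit comfortably in 8 bits. */
```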
These two changes would allow us to move mbuf->priv_size from the second to the first cache line.
Furthermore, mbuf->timesync should be a dynamic field. This, along with the above changes, would give us one more available 32-bit dynamic field.
Med venlig hilsen / Kind regards,
-Morten Brørup
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RFC] mbuf: performance optimization
2024-01-21 5:32 [RFC] mbuf: performance optimization Morten Brørup
@ 2024-01-21 17:07 ` Stephen Hemminger
2024-01-21 17:19 ` Morten Brørup
2024-01-22 14:27 ` [PATCH v2] mempool: test performance with larger bursts Morten Brørup
1 sibling, 1 reply; 5+ messages in thread
From: Stephen Hemminger @ 2024-01-21 17:07 UTC (permalink / raw)
To: Morten Brørup; +Cc: dev, andrew.rybchenko
On Sun, 21 Jan 2024 06:32:42 +0100
Morten Brørup <mb@smartsharesystems.com> wrote:
> I suppose that reducing mbuf->nb_segs from 16 to 8 bits is realistic, considering that a maximum-size IP packet (64 KB) is unlikely to use much more than 64 segments. Does anyone know of any use case with more than 255 segments in an mbuf?
There is the case of Linux internally using super large IPv6 (and now IPv4) frames.
See RFC 2675 IPv6 jumbograms
https://netdevconf.info/0x15/slides/35/BIG%20TCP.pdf
* RE: [RFC] mbuf: performance optimization
2024-01-21 17:07 ` Stephen Hemminger
@ 2024-01-21 17:19 ` Morten Brørup
0 siblings, 0 replies; 5+ messages in thread
From: Morten Brørup @ 2024-01-21 17:19 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: dev, andrew.rybchenko
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Sunday, 21 January 2024 18.08
>
> On Sun, 21 Jan 2024 06:32:42 +0100
> Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > I suppose that reducing mbuf->nb_segs from 16 to 8 bits is realistic,
> > considering that a maximum-size IP packet (64 KB) is unlikely to use
> > much more than 64 segments. Does anyone know of any use case with
> > more than 255 segments in an mbuf?
>
> There is the case of Linux internally using super large IPv6 (and now
> IPv4) frames.
> See RFC 2675 IPv6 jumbograms
>
> https://netdevconf.info/0x15/slides/35/BIG%20TCP.pdf
Just took a brief look at it... I suppose something similar could eventually grow into DPDK, so we are probably better prepared by leaving nb_segs at 16 bits.
So the proposed optimization falls apart. :-(
Thanks for the valuable feedback, Stephen. :-)
* [PATCH v2] mempool: test performance with larger bursts
2024-01-21 5:32 [RFC] mbuf: performance optimization Morten Brørup
2024-01-21 17:07 ` Stephen Hemminger
@ 2024-01-22 14:27 ` Morten Brørup
2024-01-22 14:39 ` Morten Brørup
1 sibling, 1 reply; 5+ messages in thread
From: Morten Brørup @ 2024-01-22 14:27 UTC (permalink / raw)
To: andrew.rybchenko, fengchengwen; +Cc: dev, Morten Brørup
Bursts of up to 64 or 128 packets are not uncommon, so increase the
maximum tested get and put burst sizes from 32 to 128.
Some applications keep more than 512 objects, so increase the maximum
number of kept objects from 512 to 8192, still in steps of a factor of four.
This exceeds the typical mempool cache size of 512 objects, so the test
also exercises the mempool driver.
Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
v2: Addressed feedback by Chengwen Feng
* Added get and put burst sizes of 64 packets, which are probably also not
uncommon.
* Fixed the list of kept-object counts so the list remains in steps of a
factor of four.
* Added three derivative test cases, for faster testing.
---
app/test/test_mempool_perf.c | 107 ++++++++++++++++++++---------------
1 file changed, 62 insertions(+), 45 deletions(-)
diff --git a/app/test/test_mempool_perf.c b/app/test/test_mempool_perf.c
index 96de347f04..a5a7d43608 100644
--- a/app/test/test_mempool_perf.c
+++ b/app/test/test_mempool_perf.c
@@ -1,6 +1,6 @@
/* SPDX-License-Identifier: BSD-3-Clause
* Copyright(c) 2010-2014 Intel Corporation
- * Copyright(c) 2022 SmartShare Systems
+ * Copyright(c) 2022-2024 SmartShare Systems
*/
#include <string.h>
@@ -54,22 +54,24 @@
*
* - Bulk size (*n_get_bulk*, *n_put_bulk*)
*
- * - Bulk get from 1 to 32
- * - Bulk put from 1 to 32
- * - Bulk get and put from 1 to 32, compile time constant
+ * - Bulk get from 1 to 128
+ * - Bulk put from 1 to 128
+ * - Bulk get and put from 1 to 128, compile time constant
*
* - Number of kept objects (*n_keep*)
*
* - 32
* - 128
* - 512
+ * - 2048
+ * - 8192
*/
#define N 65536
#define TIME_S 5
#define MEMPOOL_ELT_SIZE 2048
-#define MAX_KEEP 512
-#define MEMPOOL_SIZE ((rte_lcore_count()*(MAX_KEEP+RTE_MEMPOOL_CACHE_MAX_SIZE))-1)
+#define MAX_KEEP 8192
+#define MEMPOOL_SIZE ((rte_lcore_count()*(MAX_KEEP+RTE_MEMPOOL_CACHE_MAX_SIZE*2))-1)
/* Number of pointers fitting into one cache line. */
#define CACHE_LINE_BURST (RTE_CACHE_LINE_SIZE / sizeof(uintptr_t))
@@ -204,6 +206,10 @@ per_lcore_mempool_test(void *arg)
CACHE_LINE_BURST, CACHE_LINE_BURST);
else if (n_get_bulk == 32)
ret = test_loop(mp, cache, n_keep, 32, 32);
+ else if (n_get_bulk == 64)
+ ret = test_loop(mp, cache, n_keep, 64, 64);
+ else if (n_get_bulk == 128)
+ ret = test_loop(mp, cache, n_keep, 128, 128);
else
ret = -1;
@@ -289,9 +295,9 @@ launch_cores(struct rte_mempool *mp, unsigned int cores)
static int
do_one_mempool_test(struct rte_mempool *mp, unsigned int cores)
{
- unsigned int bulk_tab_get[] = { 1, 4, CACHE_LINE_BURST, 32, 0 };
- unsigned int bulk_tab_put[] = { 1, 4, CACHE_LINE_BURST, 32, 0 };
- unsigned int keep_tab[] = { 32, 128, 512, 0 };
+ unsigned int bulk_tab_get[] = { 1, 4, CACHE_LINE_BURST, 32, 64, 128, 0 };
+ unsigned int bulk_tab_put[] = { 1, 4, CACHE_LINE_BURST, 32, 64, 128, 0 };
+ unsigned int keep_tab[] = { 32, 128, 512, 2048, 8192, 0 };
unsigned *get_bulk_ptr;
unsigned *put_bulk_ptr;
unsigned *keep_ptr;
@@ -301,6 +307,9 @@ do_one_mempool_test(struct rte_mempool *mp, unsigned int cores)
for (put_bulk_ptr = bulk_tab_put; *put_bulk_ptr; put_bulk_ptr++) {
for (keep_ptr = keep_tab; *keep_ptr; keep_ptr++) {
+ if (*keep_ptr < *get_bulk_ptr || *keep_ptr < *put_bulk_ptr)
+ continue;
+
use_constant_values = 0;
n_get_bulk = *get_bulk_ptr;
n_put_bulk = *put_bulk_ptr;
@@ -323,7 +332,7 @@ do_one_mempool_test(struct rte_mempool *mp, unsigned int cores)
}
static int
-test_mempool_perf(void)
+do_all_mempool_perf_tests(unsigned int cores)
{
struct rte_mempool *mp_cache = NULL;
struct rte_mempool *mp_nocache = NULL;
@@ -376,65 +385,73 @@ test_mempool_perf(void)
rte_mempool_obj_iter(default_pool, my_obj_init, NULL);
- /* performance test with 1, 2 and max cores */
printf("start performance test (without cache)\n");
-
- if (do_one_mempool_test(mp_nocache, 1) < 0)
+ if (do_one_mempool_test(mp_nocache, cores) < 0)
goto err;
- if (do_one_mempool_test(mp_nocache, 2) < 0)
- goto err;
-
- if (do_one_mempool_test(mp_nocache, rte_lcore_count()) < 0)
- goto err;
-
- /* performance test with 1, 2 and max cores */
printf("start performance test for %s (without cache)\n",
default_pool_ops);
-
- if (do_one_mempool_test(default_pool, 1) < 0)
+ if (do_one_mempool_test(default_pool, cores) < 0)
goto err;
- if (do_one_mempool_test(default_pool, 2) < 0)
+ printf("start performance test (with cache)\n");
+ if (do_one_mempool_test(mp_cache, cores) < 0)
goto err;
- if (do_one_mempool_test(default_pool, rte_lcore_count()) < 0)
+ printf("start performance test (with user-owned cache)\n");
+ use_external_cache = 1;
+ if (do_one_mempool_test(mp_nocache, cores) < 0)
goto err;
- /* performance test with 1, 2 and max cores */
- printf("start performance test (with cache)\n");
+ rte_mempool_list_dump(stdout);
- if (do_one_mempool_test(mp_cache, 1) < 0)
- goto err;
+ ret = 0;
- if (do_one_mempool_test(mp_cache, 2) < 0)
- goto err;
+err:
+ rte_mempool_free(mp_cache);
+ rte_mempool_free(mp_nocache);
+ rte_mempool_free(default_pool);
+ return ret;
+}
- if (do_one_mempool_test(mp_cache, rte_lcore_count()) < 0)
- goto err;
+static int
+test_mempool_perf_1core(void)
+{
+ return do_all_mempool_perf_tests(1);
+}
- /* performance test with 1, 2 and max cores */
- printf("start performance test (with user-owned cache)\n");
- use_external_cache = 1;
+static int
+test_mempool_perf_2cores(void)
+{
+ return do_all_mempool_perf_tests(2);
+}
- if (do_one_mempool_test(mp_nocache, 1) < 0)
- goto err;
+static int
+test_mempool_perf_allcores(void)
+{
+ return do_all_mempool_perf_tests(rte_lcore_count());
+}
- if (do_one_mempool_test(mp_nocache, 2) < 0)
- goto err;
+static int
+test_mempool_perf(void)
+{
+ int ret = -1;
- if (do_one_mempool_test(mp_nocache, rte_lcore_count()) < 0)
+ /* performance test with 1, 2 and max cores */
+ if (do_all_mempool_perf_tests(1) < 0)
+ goto err;
+ if (do_all_mempool_perf_tests(2) < 0)
+ goto err;
+ if (do_all_mempool_perf_tests(rte_lcore_count()) < 0)
goto err;
-
- rte_mempool_list_dump(stdout);
ret = 0;
err:
- rte_mempool_free(mp_cache);
- rte_mempool_free(mp_nocache);
- rte_mempool_free(default_pool);
return ret;
}
REGISTER_PERF_TEST(mempool_perf_autotest, test_mempool_perf);
+REGISTER_PERF_TEST(mempool_perf_autotest_1core, test_mempool_perf_1core);
+REGISTER_PERF_TEST(mempool_perf_autotest_2cores, test_mempool_perf_2cores);
+REGISTER_PERF_TEST(mempool_perf_autotest_allcores, test_mempool_perf_allcores);
--
2.17.1
* RE: [PATCH v2] mempool: test performance with larger bursts
2024-01-22 14:27 ` [PATCH v2] mempool: test performance with larger bursts Morten Brørup
@ 2024-01-22 14:39 ` Morten Brørup
0 siblings, 0 replies; 5+ messages in thread
From: Morten Brørup @ 2024-01-22 14:39 UTC (permalink / raw)
To: andrew.rybchenko, fengchengwen; +Cc: dev
Replied on wrong thread. Sorry.
Resubmitted with correct in-reply-to.
-Morten
end of thread, other threads:[~2024-01-22 14:39 UTC | newest]
Thread overview: 5+ messages
2024-01-21 5:32 [RFC] mbuf: performance optimization Morten Brørup
2024-01-21 17:07 ` Stephen Hemminger
2024-01-21 17:19 ` Morten Brørup
2024-01-22 14:27 ` [PATCH v2] mempool: test performance with larger bursts Morten Brørup
2024-01-22 14:39 ` Morten Brørup