From: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>
To: Paul Emmerich <emmericp@net.in.tum.de>
Cc: "dev@dpdk.org" <dev@dpdk.org>
Subject: Re: [dpdk-dev] TX performance regression caused by the mbuf cachline split
Date: Wed, 13 May 2015 09:03:18 +0000 [thread overview]
Message-ID: <2601191342CEEE43887BDE71AB9772582142EB46@irsmsx105.ger.corp.intel.com>
In-Reply-To: <555138C7.5010002@net.in.tum.de>
Hi Paul,
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Paul Emmerich
> Sent: Tuesday, May 12, 2015 12:19 AM
> To: dev@dpdk.org
> Subject: Re: [dpdk-dev] TX performance regression caused by the mbuf cachline split
>
> Found a really simple solution that almost restores the original
> performance: just add a prefetch on alloc. For some reason, I assumed
> that this was already done since the troublesome commit I investigated
> mentioned something about prefetching... I guess the commit referred to
> the hardware prefetcher in the CPU.
>
> Adding an explicit prefetch command in the mbuf alloc function gives a
> throughput of 12.7/10.35 Mpps in my benchmark with the
> simple/full-featured tx path.
>
> DPDK 1.7.1 was at 14.1/10.7 Mpps. I guess I can live with that, since
> I'm primarily interested in the full-featured path and the drop from
> 10.7 to ~10.4 was due to another change.
>
> Patch: https://github.com/dpdk-org/dpdk/pull/2
> I also sent an email to the mailing list.
>
> I think that the rx path could also benefit from prefetching somewhere.
>
Before we discuss your findings, one thing in your test app looks strange to me:
you use BATCH_SIZE==64 for TX packets, but your mempool cache_size==32.
That is not a good choice, as it means that on each iteration your mempool cache is exhausted,
and you end up doing a ring_dequeue().
I'd suggest you use something like '2 * BATCH_SIZE' for the mempool cache size;
that should improve your numbers (at least it did for me).
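Just to illustrate what happens underneath, here is a simplified sketch of the mempool get path (not the actual rte_mempool code; mempool_get_sketch() and mempool_get_cache() are made-up names, the branching is what matters):

static inline int
mempool_get_sketch(struct rte_mempool *mp, void **objs, unsigned n)
{
	/* hypothetical accessor for the per-lcore cache */
	struct rte_mempool_cache *cache = mempool_get_cache(mp);

	if (n <= mp->cache_size && n <= cache->len) {
		/* fast path: plain copies from the per-lcore cache */
		rte_memcpy(objs, &cache->objs[cache->len - n],
			sizeof(objs[0]) * n);
		cache->len -= n;
		return 0;
	}

	/* slow path: MC dequeue from the shared ring,
	 * an atomic cmpset plus shared cache-line traffic */
	return rte_ring_mc_dequeue_bulk(mp->ring, objs, n);
}

With BATCH_SIZE==64 and cache_size==32 the first condition can never hold, so every batch takes the slow path.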
About the patch:
So from what you are saying, the reason for the drop is not actually the TX path,
but rte_pktmbuf_alloc()->rte_pktmbuf_reset().
That makes sense: pktmbuf_reset() now has to update 2 cache lines instead of one.
On the other hand, rte_pktmbuf_alloc() was never considered a fast path
(our RX/TX routines don't use it), so we never put much effort into optimising it.
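Just to show where the second cache line comes from, a rough sketch of the post-split layout (field placement quoted from memory, treat it as approximate):

struct rte_mbuf {
	MARKER cacheline0;
	void *buf_addr;            /* 1st line: buffer info, data_off,  */
	phys_addr_t buf_physaddr;  /* refcnt, nb_segs, port, ol_flags,  */
	/* ... pkt_len, data_len, vlan_tci, rss hash, ...               */

	/* fields only used in the slow path or on TX */
	MARKER cacheline1 __rte_cache_aligned;
	struct rte_mempool *pool;
	struct rte_mbuf *next;     /* pktmbuf_reset() writes next = NULL,
	                            * so it now dirties this line too.  */
	/* ... tx_offload, ...                                          */
};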
Though, I am really not a big fan of manual prefetching.
Its behaviour may vary from one CPU to another,
its real effect is rather hard to predict,
and in some cases it can even cause a performance degradation.
For example, on my IVB box your patch didn't show any difference at all.
So I think 'prefetch' should be used only when it really gives a great performance boost
and the same result can't be achieved by other methods.
In this particular case, that 'prefetch' should at least be moved from __rte_mbuf_raw_alloc()
to rte_pktmbuf_alloc(), to avoid any negative impact on the RX path.
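Something along these lines, just a sketch of the idea, not your actual patch:

static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
{
	struct rte_mbuf *m;

	if ((m = __rte_mbuf_raw_alloc(mp)) != NULL) {
		/* prefetch only on the pktmbuf path, so RX routines
		 * that call __rte_mbuf_raw_alloc() directly are not
		 * affected */
		rte_prefetch0(m);
		rte_pktmbuf_reset(m);
	}
	return m;
}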
That said, I suppose this scenario can be improved without a manual 'prefetch', by reordering the code a bit.
Below are 2 small patches that introduce rte_pktmbuf_bulk_alloc() and modify your test app to use it.
Could you give them a try and see whether they help to close the gap between 1.7.1 and 2.0?
I don't have a similar box at hand, but on my IVB box the results are quite promising:
at 1.2 GHz there is practically no difference in results for simple_tx (-0.33%),
and for full_tx the drop is reduced to 2%.
That's comparing DPDK 1.7.1 + test app with cache_size=2*BATCH_SIZE vs
the latest DPDK + test app with cache_size=2*BATCH_SIZE plus bulk_alloc.
Thanks
Konstantin
patch1:
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index ab6de67..23d79ca 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -810,6 +810,48 @@ static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
 	return (m);
 }
 
+static inline int
+rte_pktmbuf_bulk_alloc(struct rte_mempool *mp, struct rte_mbuf **m, uint32_t n)
+{
+	int32_t rc;
+	uint32_t i;
+
+	rc = rte_mempool_get_bulk(mp, (void **)m, n);
+
+	if (rc == 0) {
+		i = 0;
+		/* Duff's device: the switch jumps into the middle of the
+		 * unrolled loop, so the loop branch is paid once per four
+		 * mbufs; note that n is assumed to be non-zero here. */
+		switch (n % 4) {
+		while (i != n) {
+		case 0:
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(m[i]) == 0);
+			rte_mbuf_refcnt_set(m[i], 1);
+			rte_pktmbuf_reset(m[i]);
+			i++;
+		case 3:
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(m[i]) == 0);
+			rte_mbuf_refcnt_set(m[i], 1);
+			rte_pktmbuf_reset(m[i]);
+			i++;
+		case 2:
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(m[i]) == 0);
+			rte_mbuf_refcnt_set(m[i], 1);
+			rte_pktmbuf_reset(m[i]);
+			i++;
+		case 1:
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(m[i]) == 0);
+			rte_mbuf_refcnt_set(m[i], 1);
+			rte_pktmbuf_reset(m[i]);
+			i++;
+		}
+		}
+	}
+
+	return rc;
+}
+
 /**
  * Attach packet mbuf to another packet mbuf.
  *
patch2:
diff --git a/main.c b/main.c
index 2aa9fcf..749c52c 100644
--- a/main.c
+++ b/main.c
@@ -71,7 +71,7 @@ static struct rte_mempool* make_mempool() {
 	static int pool_id = 0;
 	char pool_name[32];
 	sprintf(pool_name, "pool%d", __sync_fetch_and_add(&pool_id, 1));
-	return rte_mempool_create(pool_name, NB_MBUF, MBUF_SIZE, 32,
+	return rte_mempool_create(pool_name, NB_MBUF, MBUF_SIZE, 2 * BATCH_SIZE,
 		sizeof(struct rte_pktmbuf_pool_private),
 		rte_pktmbuf_pool_init, NULL,
 		rte_pktmbuf_init, NULL,
@@ -113,13 +113,21 @@ static uint32_t send_pkts(uint8_t port, struct rte_mempool* pool) {
 	// alloc bufs
 	struct rte_mbuf* bufs[BATCH_SIZE];
 	uint32_t i;
+	int32_t rc;
+
+	rc = rte_pktmbuf_bulk_alloc(pool, bufs, RTE_DIM(bufs));
+	if (rc < 0) {
+		RTE_LOG(ERR, USER1,
+			"%s: rte_pktmbuf_bulk_alloc(%zu) returns error code: %d\n",
+			__func__, RTE_DIM(bufs), rc);
+		return 0;
+	}
+
 	for (i = 0; i < BATCH_SIZE; i++) {
-		struct rte_mbuf* buf = rte_pktmbuf_alloc(pool);
-		rte_pktmbuf_data_len(buf) = 60;
-		rte_pktmbuf_pkt_len(buf) = 60;
-		bufs[i] = buf;
+		rte_pktmbuf_data_len(bufs[i]) = 60;
+		rte_pktmbuf_pkt_len(bufs[i]) = 60;
 		// write seq number
-		uint64_t* pkt = rte_pktmbuf_mtod(buf, uint64_t*);
+		uint64_t* pkt = rte_pktmbuf_mtod(bufs[i], uint64_t*);
 		pkt[0] = seq++;
 	}
 	// send pkts
Thread overview: 10+ messages
2015-05-11 0:14 Paul Emmerich
2015-05-11 9:13 ` Luke Gorrie
2015-05-11 10:16 ` Paul Emmerich
2015-05-11 22:32 ` Paul Emmerich
2015-05-11 23:18 ` Paul Emmerich
2015-05-12 0:28 ` Marc Sune
2015-05-12 0:38 ` Marc Sune
2015-05-13 9:03 ` Ananyev, Konstantin [this message]
2016-02-15 19:15 ` Paul Emmerich
2016-02-19 12:31 ` Olivier MATZ