* [dpdk-dev] TX performance regression caused by the mbuf cacheline split

From: Paul Emmerich @ 2015-05-11 0:14 UTC (permalink / raw)
To: dev

Hi,

this is a follow-up to my post from 3 weeks ago [1]. I'm starting a new
thread here since I now have a completely new test setup for improved
reproducibility.

Background for anyone who didn't catch my last post: I'm investigating a
performance regression in my packet generator [2] that occurs when
upgrading from DPDK 1.7.1 to 1.8 or 2.0. DPDK 1.7.1 is about 25% faster
than 2.0 in my application. I suspected that this is due to the new
2-cacheline mbufs, which I have now confirmed with a bisect.

My old test setup was based on the l2fwd example, required an external
packet generator, and was kind of hard to reproduce. I built a simple tx
benchmark application that simply sends nonsensical packets with a
sequence number as fast as possible on two ports with a single core. You
can download the benchmark app at [3].

Hardware setup:

  CPU: E5-2620 v3 underclocked to 1.2 GHz
  RAM: 4x 8 GB 1866 MHz DDR4 memory
  NIC: X540-T2

Baseline test results:

  DPDK     simple tx    full-featured tx
  1.7.1    14.1 Mpps    10.7 Mpps
  2.0.0    11.0 Mpps     9.3 Mpps

DPDK 1.7.1 is 28%/15% faster than 2.0 with simple/full-featured tx in
this benchmark.

I then did a few runs of git bisect to identify commits that caused a
significant drop in performance. You can find the script that I used to
quickly test the performance of a version at [4].

  Commit                                     simple    full-featured
  7869536f3f8edace05043be6f322b835702b201c   13.9      10.4
  (mbuf: flatten struct vlan_macip)

The commit log explains that there is a perf regression and that it
cannot be avoided in order to stay future-compatible. The log claims
< 5%, which is consistent with my test results (the old code is 4%
faster). I guess that is okay and cannot be avoided.

  Commit                                     simple    full-featured
  08b563ffb19d8baf59dd84200f25bc85031d18a7   12.8      10.4
  (mbuf: replace data pointer by an offset)

This affects the simple tx path significantly. This performance
regression is probably simply caused by the (temporarily) disabled
vector tx code that is mentioned in the commit log. Not investigated
further.

  Commit                                     simple    full-featured
  f867492346bd271742dd34974e9cf8ac55ddb869   10.7       9.1
  (mbuf: split mbuf across two cache lines)

This one is the real culprit. The commit log does not mention any
performance evaluation, and a quick scan of the mailing list also
doesn't reveal any evaluation of the impact of this change. It looks
like the main problem for tx is that the mempool pointer is in the
second cacheline.

I think the new mbuf structure is too bloated. It forces you to pay for
features that you don't need or don't want. I understand that it needs
to support all possible filters and offload features. But it's kind of
hard to justify a 25% difference in performance for a framework that
sets performance above everything (does it? I picked that up from the
discussion in the "Beyond DPDK 2.0" thread).

I've counted 56 bytes in use in the first cacheline in v2.0.0. Would it
be possible to move the pool pointer and tx offload fields into the
first cacheline? We would just need to free up 8 bytes. One candidate
would be the seqn field: does it really have to be in the first cache
line? Another candidate is the size of the ol_flags field: do we really
need 64 flags? Sharing bits between rx and tx worked fine.
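For orientation, here is a simplified sketch of the layout in question;
the field order and widths below are approximated from v2.0.0, not the
exact definition in rte_mbuf.h:

#include <stdint.h>

struct rte_mempool;  /* opaque here */

/* Simplified sketch of the v2.0.0 split layout (approximation). */
struct mbuf_sketch {
    /* ---- first cache line: mostly rx-oriented fields ---- */
    void     *buf_addr;
    uint64_t  buf_physaddr;
    uint16_t  buf_len;
    uint16_t  data_off;
    uint16_t  refcnt;
    uint8_t   nb_segs;
    uint8_t   port;
    uint64_t  ol_flags;      /* 64 offload flag bits */
    uint32_t  packet_type;
    uint32_t  pkt_len;
    uint16_t  data_len;
    uint16_t  vlan_tci;
    uint32_t  hash_rss;
    uint32_t  seqn;          /* candidate to move out? */
    /* ---- second cache line: what tx and free need ---- */
    struct rte_mempool *pool __attribute__((aligned(64)));
    struct mbuf_sketch *next;     /* touched even for 1-segment mbufs */
    uint64_t  tx_offload;         /* l2/l3/l4 lengths for checksums */
};

The point of the sketch: everything the tx/free path has to touch
(pool, next, tx_offload) sits behind the cache line boundary.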
I naively tried to move the pool pointer into the first cache line in
the v2.0.0 tag and the performance actually decreased; I'm not yet sure
why this happens. There are probably assumptions about the cacheline
locations and prefetching in the code that would need to be adjusted.

Another possible solution would be a more dynamic approach to mbufs:
the mbuf struct could be made configurable to fit the requirements of
the application. This would probably require code generation or a lot
of ugly preprocessor hacks and would add a lot of complexity to the
code. The question is whether DPDK really values performance above
everything else.

Paul

P.S.: I'm kind of disappointed by the lack of regression tests for
performance. I think that such tests should be an integral part of a
framework with the explicit goal to be fast. For example, the main page
at dpdk.org claims a performance of "usually less than 80 cycles" for
an rx or tx operation. This claim is no longer true :( Touching the
layout of a core data structure like the mbuf shouldn't be done without
carefully evaluating the performance impacts. But this discussion
probably belongs in the "Beyond DPDK 2.0" thread.

P.P.S.: Benchmarking an rx-only application (e.g. traffic analysis)
would also be interesting, but that's not really on my todo list right
now. Mixed rx/tx like forwarding is also affected, as discussed in my
last thread [1].

[1] http://dpdk.org/ml/archives/dev/2015-April/016921.html
[2] https://github.com/emmericp/MoonGen
[3] https://github.com/emmericp/dpdk-tx-performance
[4] https://gist.github.com/emmericp/02c5885908c3cb5ac5b7
* Re: [dpdk-dev] TX performance regression caused by the mbuf cacheline split

From: Luke Gorrie @ 2015-05-11 9:13 UTC (permalink / raw)
To: Paul Emmerich; +Cc: dev

Hi Paul,

On 11 May 2015 at 02:14, Paul Emmerich <emmericp@net.in.tum.de> wrote:
> Another possible solution would be a more dynamic approach to mbufs:

Let me suggest a slightly more extreme idea for your consideration.
This method can easily do > 100 Mpps with one very lightly loaded core.
I don't know if it works for your application or not, but I share it
just in case.

Background: load generators are specialist applications and can benefit
from specialist transmit mechanisms.

You can instruct the NIC to send up to 32K packets with one operation:
load the address of a descriptor list into the TDBA register (Transmit
Descriptor Base Address). The descriptor list is a simple series of
64-bit values: addr0, flags0, addr1, flags1, ... etc. It is easy to
construct by hand.

The NIC can also be made to replay the packets in a loop. You just have
to periodically reset the DMA cursor to make all the packets valid
again. That is a simple register poke: TDT = TDH-1.

We do this routinely when we want to generate a large amount of traffic
with few resources, typically when generating load using spare capacity
of a device under test. (I have sample code, but it is not based on
DPDK; a rough sketch of the register pokes follows below.)

If you want all of your packets to be unique then you have to be a bit
more clever. For example, you could poll to see the DMA progress: let
half the packets be sent, then rewrite those while the other half are
sent, and so on. Kind of like the way video games tracked the progress
of the display scan beam to update parts of the frame buffer that were
not being DMA'd.

This method may impose other limitations that are not acceptable for
your application, of course. But if not, it can drastically reduce the
number of instructions and cache footprint required to generate load.
You don't have to touch mbufs or descriptors at all. You just update
the payload and update the DMA register every millisecond or so.

Cheers,
-Luke
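A rough sketch of the register-level technique described above, for an
82599/X540-class NIC. The register offsets are taken from the datasheet
for TX queue 0, but the BAR mapping helper and all setup details are
illustrative assumptions, not a real driver or DPDK API:

#include <stdint.h>

/* Assumed to point at the NIC's memory-mapped BAR0 (hypothetical). */
extern volatile uint8_t *bar0;

#define REG32(off) (*(volatile uint32_t *)(bar0 + (off)))

/* 82599/X540 TX queue 0 register offsets (verify against datasheet). */
enum { TDBAL = 0x6000, TDBAH = 0x6004, TDLEN = 0x6008,
       TDH   = 0x6010, TDT   = 0x6018 };

/* Legacy-format descriptor: a 64-bit buffer address plus 64 bits of
 * packed length/command/status flags, i.e. the addr0, flags0, ... pairs. */
struct tx_desc {
    uint64_t addr;
    uint64_t flags;
};

/* Point the NIC at a pre-built ring of n descriptors. */
static void arm_static_ring(uint64_t ring_phys, uint32_t n)
{
    REG32(TDBAL) = (uint32_t)ring_phys;
    REG32(TDBAH) = (uint32_t)(ring_phys >> 32);
    REG32(TDLEN) = n * sizeof(struct tx_desc);
    REG32(TDT)   = n - 1;   /* descriptors up to the tail are valid */
}

/* Periodically re-arm so the NIC replays the ring: TDT = TDH - 1. */
static void rearm_ring(uint32_t n)
{
    uint32_t tdh = REG32(TDH);
    REG32(TDT) = (tdh + n - 1) % n;
}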
* Re: [dpdk-dev] TX performance regression caused by the mbuf cacheline split

From: Paul Emmerich @ 2015-05-11 10:16 UTC (permalink / raw)
To: Luke Gorrie; +Cc: dev

Hi Luke,

thanks for your suggestion. I actually looked at how your packet
generator in SnabbSwitch works before, and it's quite clever. But
unfortunately that's not what I'm looking for.

I'm looking for a generic solution that works with whatever NIC is
supported by DPDK, and I don't want to write NIC-specific transmit
logic. I don't want to maintain, test, or debug drivers. That's why I
chose DPDK in the first place.

The DPDK drivers (used to) hit a sweet spot for performance. I can
usually load about two 10 Gbit/s ports from a reasonably sized CPU core
without worrying about writing my own device drivers*. This allows for
packet generation at interesting packet rates on low-end servers (e.g.
servers with Xeon E3 1230 v2 CPUs and dual-port NICs). Servers with
more ports usually also have the necessary CPU power to handle them.

I also don't want to be limited to packet generation in the long run.
For example, I have a student who is working on an IPsec offloading
application and another student working on a proof-of-concept router.

Paul

*) Yes, I still need some NIC-specific low-level code (timestamping)
and a small patch in the DPDK drivers (a flag to disable CRC offloading
on a per-packet basis) for some features of my packet generator.
* Re: [dpdk-dev] TX performance regression caused by the mbuf cacheline split

From: Paul Emmerich @ 2015-05-11 22:32 UTC (permalink / raw)
To: dev

Paul Emmerich:
> I naively tried to move the pool pointer into the first cache line in
> the v2.0.0 tag and the performance actually decreased, I'm not yet sure
> why this happens. There are probably assumptions about the cacheline
> locations and prefetching in the code that would need to be adjusted.

This happens because the next pointer in the mbuf is touched almost
everywhere, even for mbufs with only one segment, because it is used to
determine whether there is another segment (instead of using the
nb_segs field); a sketch of the pattern follows below.

I guess a solution for me would be to use a custom layout that is
optimized for tx. I can shrink ol_flags to 32 bits and move the seqn
and hash fields to the second cache line. A quick-and-dirty test shows
that this even gives me slightly higher performance than DPDK 1.7 in
the full-featured tx path.

This is probably going to break the vector rx/tx path, but I can't use
that anyway since I always need offloading features (timestamping and
checksums).

I'll have to see how this affects the rx path. But I value tx
performance over rx performance; my rx logic is usually very simple.

This solution is kind of ugly. I would prefer to be able to use an
unmodified version of DPDK :/

By the way, I think there is something wrong with this assumption in
commit f867492346bd271742dd34974e9cf8ac55ddb869:

> The general approach that we are looking to take is to focus the first
> cache line on fields that are updated on RX, so that receive only deals
> with one cache line.

I think this might be wrong due to the next pointer. I'll probably
build a simple rx-only benchmark in a few weeks or so. I suspect that
it will also be significantly slower, but that should be fixable.

Paul
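The pattern in question, as an illustrative sketch (not the actual DPDK
free path): a cleanup loop that walks the chain via next reads the
second cache line for every mbuf, while a loop keyed on nb_segs can stay
in the first cache line for single-segment packets.

#include <rte_mbuf.h>

/* Walks via 'next': always reads cache line 1, even for one segment. */
static inline void free_via_next(struct rte_mbuf *m)
{
    while (m != NULL) {
        struct rte_mbuf *next = m->next;   /* cache line 1 */
        rte_pktmbuf_free_seg(m);
        m = next;
    }
}

/* Keyed on 'nb_segs' (cache line 0): only touches 'next' when a
 * second segment actually exists. */
static inline void free_via_nb_segs(struct rte_mbuf *m)
{
    uint8_t segs = m->nb_segs;             /* cache line 0 */
    while (segs-- > 0) {
        struct rte_mbuf *next = (segs > 0) ? m->next : NULL;
        rte_pktmbuf_free_seg(m);
        m = next;
    }
}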
* Re: [dpdk-dev] TX performance regression caused by the mbuf cacheline split

From: Paul Emmerich @ 2015-05-11 23:18 UTC (permalink / raw)
To: dev

Found a really simple solution that almost restores the original
performance: just add a prefetch on alloc. For some reason, I assumed
that this was already done, since the troublesome commit I investigated
mentioned something about prefetching... I guess the commit referred to
the hardware prefetcher in the CPU.

Adding an explicit prefetch command in the mbuf alloc function (the
idea is sketched below) gives a throughput of 12.7/10.35 Mpps in my
benchmark with the simple/full-featured tx path.

DPDK 1.7.1 was at 14.1/10.7 Mpps. I guess I can live with that, since
I'm primarily interested in the full-featured path and the drop from
10.7 to ~10.4 was due to another change.

Patch: https://github.com/dpdk-org/dpdk/pull/2
I also sent an email to the mailing list.

I think that the rx path could also benefit from prefetching somewhere.

Paul
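The idea behind the patch, as a sketch modelled on the v2.0.0 alloc
helper (the actual diff in the pull request above may differ):

#include <rte_mbuf.h>
#include <rte_prefetch.h>

/* Sketch: start pulling in the mbuf's second cache line right after
 * dequeuing from the pool, so the later writes in rte_pktmbuf_reset()
 * (next and tx_offload live there) don't stall on a cache miss. */
static inline struct rte_mbuf *raw_alloc_with_prefetch(struct rte_mempool *mp)
{
    void *mb = NULL;

    if (rte_mempool_get(mp, &mb) < 0)
        return NULL;
    rte_prefetch0((char *)mb + RTE_CACHE_LINE_SIZE);
    return (struct rte_mbuf *)mb;
}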
* Re: [dpdk-dev] TX performance regression caused by the mbuf cacheline split

From: Marc Sune @ 2015-05-12 0:28 UTC (permalink / raw)
To: dev

On 12/05/15 01:18, Paul Emmerich wrote:
> Found a really simple solution that almost restores the original
> performance: just add a prefetch on alloc. For some reason, I assumed
> that this was already done since the troublesome commit I investigated
> mentioned something about prefetching... I guess the commit referred
> to the hardware prefetcher in the CPU.
>
> Adding an explicit prefetch command in the mbuf alloc function gives a
> throughput of 12.7/10.35 Mpps in my benchmark with the
> simple/full-featured tx path.
>
> DPDK 1.7.1 was at 14.1/10.7 Mpps. I guess I can live with that, since
> I'm primarily interested in the full-featured path and the drop from
> 10.7 to ~10.4 was due to another change.

Maybe a stupid question: does the performance of v1.7.1 also improve if
you backport this patch to it?

Marc

> Patch: https://github.com/dpdk-org/dpdk/pull/2
> I also sent an email to the mailing list.
>
> I also think that the rx-path could also benefit from prefetching
> somewhere.
>
> Paul
* Re: [dpdk-dev] TX performance regression caused by the mbuf cacheline split

From: Marc Sune @ 2015-05-12 0:38 UTC (permalink / raw)
To: dev

On 12/05/15 02:28, Marc Sune wrote:
> On 12/05/15 01:18, Paul Emmerich wrote:
>> Found a really simple solution that almost restores the original
>> performance: just add a prefetch on alloc. [...]
>>
>> DPDK 1.7.1 was at 14.1/10.7 Mpps. I guess I can live with that, since
>> I'm primarily interested in the full-featured path and the drop from
>> 10.7 to ~10.4 was due to another change.
>
> Maybe a stupid question: does the performance of v1.7.1 also improve
> if you backport this patch to it?

Self-answered: the cacheline split was done in 1.8, so it is indeed a
stupid question.

Marc
* Re: [dpdk-dev] TX performance regression caused by the mbuf cacheline split

From: Ananyev, Konstantin @ 2015-05-13 9:03 UTC (permalink / raw)
To: Paul Emmerich; +Cc: dev

Hi Paul,

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Paul Emmerich
> Sent: Tuesday, May 12, 2015 12:19 AM
> To: dev@dpdk.org
> Subject: Re: [dpdk-dev] TX performance regression caused by the mbuf
> cacheline split
>
> Found a really simple solution that almost restores the original
> performance: just add a prefetch on alloc. For some reason, I assumed
> that this was already done since the troublesome commit I investigated
> mentioned something about prefetching... I guess the commit referred to
> the hardware prefetcher in the CPU.
>
> Adding an explicit prefetch command in the mbuf alloc function gives a
> throughput of 12.7/10.35 Mpps in my benchmark with the
> simple/full-featured tx path.
>
> DPDK 1.7.1 was at 14.1/10.7 Mpps. I guess I can live with that, since
> I'm primarily interested in the full-featured path and the drop from
> 10.7 to ~10.4 was due to another change.
>
> Patch: https://github.com/dpdk-org/dpdk/pull/2
> I also sent an email to the mailing list.
>
> I also think that the rx-path could also benefit from prefetching
> somewhere.

Before starting to discuss your findings, there is one thing in your
test app that looks strange to me: you use BATCH_SIZE==64 for TX
packets, but your mempool cache_size==32. This is not really a good
choice, as it means that on each iteration your mempool cache will be
exhausted and you'll end up doing ring_dequeue(). I'd suggest you use
something like '2 * BATCH_SIZE' for the mempool cache size; that should
improve your numbers (at least it did for me).

About the patch: so from what you are saying, the reason for the drop
is not actually the TX path, but rte_pktmbuf_alloc() ->
rte_pktmbuf_reset(). That makes sense: pktmbuf_reset() now has to
update 2 cache lines instead of one. On the other hand,
rte_pktmbuf_alloc() was never considered a fast path (our RX/TX
routines don't use it), so we never put a big effort into optimising
it.

Though, I am really not a big fan of manual prefetching. Its particular
behaviour may vary from one CPU to another, and its real effect is sort
of hard to predict; in some cases it can even cause a performance
degradation. For example, on my IVB box your patch didn't show any
difference at all. So I think that 'prefetch' should be used only when
it really gives a great performance boost and the same results can't be
achieved by other methods. For that particular case, at the very least
that 'prefetch' should be moved from __rte_mbuf_raw_alloc() to
rte_pktmbuf_alloc(), to avoid any negative impact on the RX path.

Though, I suppose that scenario might be improved without manual
'prefetch', by reordering the code a bit. Below are 2 small patches
that introduce rte_pktmbuf_bulk_alloc() and modify your test app to use
it. Could you give it a try and see whether it helps to close the gap
between 1.7.1 and 2.0?

I don't have a box with the same CPU off-hand, but on my IVB box the
results are quite promising: at 1.2 GHz, for simple_tx there is
practically no difference in results (-0.33%); for full_tx the drop is
reduced to 2%. That's comparing DPDK 1.7.1 + test app with
cache_size=2*batch_size vs latest DPDK + test app with
cache_size=2*batch_size and bulk_alloc.
Thanks
Konstantin

patch1:

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index ab6de67..23d79ca 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -810,6 +810,45 @@ static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
 	return (m);
 }
 
+static inline int
+rte_pktmbuf_bulk_alloc(struct rte_mempool *mp, struct rte_mbuf **m, uint32_t n)
+{
+	int32_t rc;
+	uint32_t i;
+
+	rc = rte_mempool_get_bulk(mp, (void **)m, n);
+
+	if (rc == 0) {
+		i = 0;
+		switch (n % 4) {
+		while (i != n) {
+		case 0:
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(m[i]) == 0);
+			rte_mbuf_refcnt_set(m[i], 1);
+			rte_pktmbuf_reset(m[i]);
+			i++;
+		case 3:
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(m[i]) == 0);
+			rte_mbuf_refcnt_set(m[i], 1);
+			rte_pktmbuf_reset(m[i]);
+			i++;
+		case 2:
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(m[i]) == 0);
+			rte_mbuf_refcnt_set(m[i], 1);
+			rte_pktmbuf_reset(m[i]);
+			i++;
+		case 1:
+			RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(m[i]) == 0);
+			rte_mbuf_refcnt_set(m[i], 1);
+			rte_pktmbuf_reset(m[i]);
+			i++;
+		}
+		}
+	}
+
+	return rc;
+}
+
 /**
  * Attach packet mbuf to another packet mbuf.
  *

patch2:

diff --git a/main.c b/main.c
index 2aa9fcf..749c52c 100644
--- a/main.c
+++ b/main.c
@@ -71,7 +71,7 @@ static struct rte_mempool* make_mempool() {
 	static int pool_id = 0;
 	char pool_name[32];
 	sprintf(pool_name, "pool%d", __sync_fetch_and_add(&pool_id, 1));
-	return rte_mempool_create(pool_name, NB_MBUF, MBUF_SIZE, 32,
+	return rte_mempool_create(pool_name, NB_MBUF, MBUF_SIZE, 2 * BATCH_SIZE,
 		sizeof(struct rte_pktmbuf_pool_private),
 		rte_pktmbuf_pool_init, NULL,
 		rte_pktmbuf_init, NULL,
@@ -113,13 +113,21 @@ static uint32_t send_pkts(uint8_t port, struct rte_mempool* pool) {
 	// alloc bufs
 	struct rte_mbuf* bufs[BATCH_SIZE];
 	uint32_t i;
+	int32_t rc;
+
+	rc = rte_pktmbuf_bulk_alloc(pool, bufs, RTE_DIM(bufs));
+	if (rc < 0) {
+		RTE_LOG(ERR, USER1,
+			"%s: rte_pktmbuf_alloc(%zu) returns error code: %d\n",
+			__func__, RTE_DIM(bufs), rc);
+		return 0;
+	}
+
 	for (i = 0; i < BATCH_SIZE; i++) {
-		struct rte_mbuf* buf = rte_pktmbuf_alloc(pool);
-		rte_pktmbuf_data_len(buf) = 60;
-		rte_pktmbuf_pkt_len(buf) = 60;
-		bufs[i] = buf;
+		rte_pktmbuf_data_len(bufs[i]) = 60;
+		rte_pktmbuf_pkt_len(bufs[i]) = 60;
 		// write seq number
-		uint64_t* pkt = rte_pktmbuf_mtod(buf, uint64_t*);
+		uint64_t* pkt = rte_pktmbuf_mtod(bufs[i], uint64_t*);
 		pkt[0] = seq++;
 	}
 	// send pkts
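For readers unfamiliar with the construct in patch1 above: the
interleaved switch/while is a Duff's-device-style unrolling that handles
the n % 4 remainder by jumping into the middle of the loop body. A
plain, unoptimized equivalent, as a sketch for clarity (not part of the
submitted patch):

#include <rte_mbuf.h>

/* Functionally equivalent to patch1's rte_pktmbuf_bulk_alloc(),
 * without the manual 4x unrolling. */
static inline int
bulk_alloc_plain(struct rte_mempool *mp, struct rte_mbuf **m, uint32_t n)
{
    uint32_t i;
    int rc = rte_mempool_get_bulk(mp, (void **)m, n);

    if (rc != 0)
        return rc;
    for (i = 0; i != n; i++) {
        RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(m[i]) == 0);
        rte_mbuf_refcnt_set(m[i], 1);
        rte_pktmbuf_reset(m[i]);
    }
    return 0;
}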
* Re: [dpdk-dev] TX performance regression caused by the mbuf cacheline split

From: Paul Emmerich @ 2016-02-15 19:15 UTC (permalink / raw)
To: Ananyev, Konstantin; +Cc: dev

Hi,

here's a kind of late follow-up. I've only recently found the need to
seriously address DPDK 2.x support in MoonGen, mostly for better
support of XL710 NICs (which I still dislike, but people are using
them...).

On 13.05.15 11:03, Ananyev, Konstantin wrote:
> Before starting to discuss your findings, there is one thing in your
> test app that looks strange to me: you use BATCH_SIZE==64 for TX
> packets, but your mempool cache_size==32. This is not really a good
> choice, as it means that on each iteration your mempool cache will be
> exhausted and you'll end up doing ring_dequeue(). I'd suggest you use
> something like '2 * BATCH_SIZE' for the mempool cache size; that
> should improve your numbers (at least it did for me).

Thanks for pointing that out. However, my real app did not have this
bug, and I also saw the performance improvement there.

> Though, I suppose that scenario might be improved without manual
> 'prefetch', by reordering the code a bit. Below are 2 small patches
> that introduce rte_pktmbuf_bulk_alloc() and modify your test app to
> use it. Could you give it a try and see whether it helps to close the
> gap between 1.7.1 and 2.0?

The bulk_alloc patch is great and helps. I'd love to see such a
function in DPDK. I agree that this is a better solution than
prefetching. I also can't see a difference with/without prefetching
when using bulk alloc.

Paul
* Re: [dpdk-dev] TX performance regression caused by the mbuf cacheline split

From: Olivier MATZ @ 2016-02-19 12:31 UTC (permalink / raw)
To: Paul Emmerich, Ananyev, Konstantin; +Cc: dev

Hi Paul,

On 02/15/2016 08:15 PM, Paul Emmerich wrote:
> The bulk_alloc patch is great and helps. I'd love to see such a
> function in DPDK.

A patch has been submitted by Huawei. I guess it will be integrated
soon. See http://dpdk.org/dev/patchwork/patch/10122/

Regards,
Olivier