DPDK patches and discussions
* [dpdk-dev] LLC miss in librte_distributor
@ 2014-11-11 15:37 jigsaw
  2014-11-12  8:37 ` jigsaw
  0 siblings, 1 reply; 5+ messages in thread
From: jigsaw @ 2014-11-11 15:37 UTC (permalink / raw)
  To: Bruce Richardson, dev

Hi Bruce,

I noticed that librte_distributor has quite a severe LLC miss problem when
running on 16 cores, while on 8 cores there is no such problem.
The test runs on an Intel(R) Xeon(R) CPU E5-2670, a Sandy Bridge with 32
logical cores on 2 sockets.

The test case is distributor_perf_autotest, i.e.
app/test/test_distributor_perf.c.
The test results are collected with the command:

perf stat -e LLC-load-misses,LLC-loads,LLC-store-misses,LLC-stores ./test
-cff -n2 --no-huge

Note that the results show the LLC miss rate remains the same with or
without hugepages, so I will only show the --no-huge configuration.

With 8 cores, the LLC miss rate is OK:

LLC-load-misses  26750
LLC-loads  93979233
LLC-store-misses  432263
LLC-stores  69954746

That is a 0.028% load miss rate and a 0.62% store miss rate.

With 16 cores, the LLC miss rate is very high:

LLC-load-misses  70263520
LLC-loads  143807657
LLC-store-misses  23115990
LLC-stores  63692854

That is a 48.9% load miss rate and a 36.3% store miss rate.

Most of the load misses happen at the first line of rte_distributor_poll_pkt.
Where most of the store misses happen, I don't know, because perf record on
LLC-store-misses brings down my machine.

It's not obvious to me how this can happen: 8 cores are fine, but 16 cores
are very bad.
My guess is that 16 cores bring in more QPI transactions between the sockets,
or that 16 cores produce a different LLC access pattern?

So I tried reducing the padding inside union rte_distributor_buffer from 3
cachelines to 1 cacheline:

-     char pad[CACHE_LINE_SIZE*3];
+    char pad[CACHE_LINE_SIZE];
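
(For context, the union in question looks roughly like this in this
generation of librte_distributor - a sketch from memory, not an exact
copy of the header:)

/* One buffer per worker; the distributor core and the worker core both
 * poll bufptr64, and pad stretches the union so that adjacent workers'
 * buffers never share a cacheline. */
union rte_distributor_buffer {
        volatile int64_t bufptr64;       /* the word both cores poll */
        char pad[CACHE_LINE_SIZE*3];     /* 3 cachelines -> 1 in my patch */
} __rte_cache_aligned;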

And it does have an obvious effect:

LLC-load-misses  53159968
LLC-loads  167756282
LLC-store-misses  29012799
LLC-stores  63352541

Now it is a 31.69% load miss rate and a 45.79% store miss rate.

It lowers the load miss rate but raises the store miss rate.
Both numbers are still very high, sadly.
But the bright side is that it decreases the time per burst and the time
per packet.

The original version has:
=== Performance test of distributor ===
Time per burst:  8013
Time per packet: 250

And the patched version has:
=== Performance test of distributor ===
Time per burst:  6834
Time per packet: 213


I tried a couple of other tricks, such as adding more idle loops in
rte_distributor_get_pkt and making the rte_distributor_buffer thread-local
to each worker core, but none of these tricks had any noticeable effect.
These failures make me tend to believe the high LLC miss rate is related
to QPI or NUMA, but my machine is not able to perf the uncore QPI events,
so this cannot be proved.


I cannot draw a conclusion or reveal the root cause after all, but I
suggest a further study of this performance bottleneck so as to find a good
solution.

thx &
rgds,
-qinglai

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [dpdk-dev] LLC miss in librte_distributor
  2014-11-11 15:37 [dpdk-dev] LLC miss in librte_distributor jigsaw
@ 2014-11-12  8:37 ` jigsaw
  2014-11-12 16:07   ` Bruce Richardson
  0 siblings, 1 reply; 5+ messages in thread
From: jigsaw @ 2014-11-12  8:37 UTC (permalink / raw)
  To: Bruce Richardson, dev

Hi,

OK, it is now very clear that this is due to memory transactions between
different NUMA nodes.

The test program is here:
https://gist.github.com/jigsawecho/6a2e78d65f0fe67adf1b

The test machine topology is:

NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31

Change the 3rd param from 0 to 1 at line 135, and the LLC load miss rate
jumps from 0.09% to 33.45%, while the LLC store miss rate jumps from
0.027% to 50.695%.

Clearly the root cause is transactions crossing the node boundary.

But then how to resolve this problem is another topic...
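
(For anyone reproducing this: the experiment boils down to pinning the two
polling threads either to the same node or to different nodes. A minimal
sketch of that pinning, using a hypothetical helper not taken from the gist:)

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin a thread to one core: pass a node0 core (e.g. 0) for one side and
 * either another node0 core (e.g. 1) or a node1 core (e.g. 8) for the
 * other side - that is all the 0 -> 1 parameter flip amounts to. */
static int
pin_to_core(pthread_t t, int core_id)
{
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core_id, &set);
        return pthread_setaffinity_np(t, sizeof(set), &set);
}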

thx &
rgds,
-ql




^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [dpdk-dev] LLC miss in librte_distributor
  2014-11-12  8:37 ` jigsaw
@ 2014-11-12 16:07   ` Bruce Richardson
  2014-11-12 17:11     ` jigsaw
  0 siblings, 1 reply; 5+ messages in thread
From: Bruce Richardson @ 2014-11-12 16:07 UTC (permalink / raw)
  To: jigsaw; +Cc: dev

Having traffic cross QPI is always a problem, and there could be a number of ways
to solve it. Probably the best solution is to have multiple NICs, with some
directly connected to each socket, and the packets from each NIC processed locally
on the socket that the NIC is connected to.

If that is not possible, then other solutions need to be looked at. For an app
wanting to use a distributor, I would suggest investigating whether two
distributors could be used - one on each socket. Then use a ring to burst-transfer
large groups of packets from one socket to the other and run the distributor
locally. This would involve far less QPI traffic than using a distributor with
remote workers.
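
(A rough sketch of that two-distributor scheme, assuming the DPDK 1.x ring
and distributor APIs; the ring name, core function, and burst size are
illustrative:)

#include <rte_ring.h>
#include <rte_mbuf.h>
#include <rte_distributor.h>

#define XSOCK_BURST 32

/* Core on socket 1: drain bursts shipped over from socket 0 and feed
 * them to the distributor that lives on socket 1. One ring dequeue per
 * burst crosses QPI once, instead of one cacheline ping-pong per packet
 * between the distributor and a remote worker. */
static void
drain_and_distribute(struct rte_ring *xsock_ring,
                     struct rte_distributor *local_dist)
{
        struct rte_mbuf *bufs[XSOCK_BURST];
        for (;;) {
                unsigned n = rte_ring_dequeue_burst(xsock_ring,
                                (void **)bufs, XSOCK_BURST);
                if (n > 0)
                        rte_distributor_process(local_dist, bufs, n);
        }
}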

Regards,
/Bruce


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [dpdk-dev] LLC miss in librte_distributor
  2014-11-12 16:07   ` Bruce Richardson
@ 2014-11-12 17:11     ` jigsaw
  2014-11-13 15:17       ` jigsaw
  0 siblings, 1 reply; 5+ messages in thread
From: jigsaw @ 2014-11-12 17:11 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

Hi Bruce,

Thanks for your reply.

I agree that logically dividing the distributor functionality is the best
solution.

Meanwhile I tried some tricks and the results look good: for the same amount
of pkts (1M), the LLC stores and loads decrease by 90%, and the miss rates
for both decrease to 25%.
The L1 miss rate increases a bit, though.
The combined result is that the time spent decreases by 50%.
The main change I made is to use a FIFO to transfer the pkts from the
distributor to the workers, while the current buf is used only as a signalling
channel. This change has a very obvious effect on saving LLC accesses.
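
(A rough sketch of this split, with hypothetical names: the shared cacheline
carries only a doorbell, while the mbuf pointers travel through a per-worker
FIFO:)

#include <rte_memory.h>
#include <rte_mbuf.h>
#include <rte_ring.h>

/* Hypothetical per-worker channel: the hot shared cacheline is written
 * only to signal "data ready", so it ping-pongs far less often; the
 * packets themselves flow through the FIFO. */
struct worker_channel {
        volatile int64_t doorbell __rte_cache_aligned; /* signalling only */
        struct rte_ring *fifo;                         /* carries the mbufs */
};

static unsigned
worker_poll(struct worker_channel *ch, struct rte_mbuf **bufs, unsigned n)
{
        if (ch->doorbell == 0)
                return 0;        /* nothing signalled yet */
        ch->doorbell = 0;        /* ack; the ring holds the actual pkts */
        return rte_ring_dequeue_burst(ch->fifo, (void **)bufs, n);
}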

However, the test is based on the simple test program rather than on a DPDK
application, so I will try the same tricks on DPDK and see whether they have
the same effect.
Besides, I need more time to read a few more papers to get it right.

I will try to propose a patch if I manage to get a positive result. It will
take several days because I'm not fully dedicated to this issue.

I will come back with more details.

BTW, I have another user story: a worker can ask the distributor to
reschedule a pkt.
It arises in this condition: after processing a pkt with tag value 1, the
worker changes its tag to 2, so the distributor has to be asked to deliver
the pkt with the new tag value to the proper worker.
I already have the patch ready, but I will hold it back until the previous
patch is committed.
I would also like your comments on this user story.

thx &
rgds,
-ql


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [dpdk-dev] LLC miss in librte_distributor
  2014-11-12 17:11     ` jigsaw
@ 2014-11-13 15:17       ` jigsaw
  0 siblings, 0 replies; 5+ messages in thread
From: jigsaw @ 2014-11-13 15:17 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

Hi,

Well, I am giving up on the idea of optimizing away the QPI-induced LLC
misses.
The queue-based messaging has even worse performance than polling the same
buf from both cores.
It is in the nature of the busy-polling model.
I guess we have to accept it as a fact, unless the programming model can be
changed to a biased-locking model, which favors one lock-owner core. But
unfortunately the biased-locking model doesn't seem to be applicable to the
distributor.

thx &
rgds,
-ql


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2014-11-13 15:07 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-11 15:37 [dpdk-dev] LLC miss in librte_distributor jigsaw
2014-11-12  8:37 ` jigsaw
2014-11-12 16:07   ` Bruce Richardson
2014-11-12 17:11     ` jigsaw
2014-11-13 15:17       ` jigsaw
