* [dpdk-dev] LLC miss in librte_distributor
@ 2014-11-11 15:37 jigsaw
  2014-11-12  8:37 ` jigsaw
  0 siblings, 1 reply; 5+ messages in thread
From: jigsaw @ 2014-11-11 15:37 UTC (permalink / raw)
To: Bruce Richardson, dev

Hi Bruce,

I noticed that librte_distributor has quite a severe LLC miss problem when running on 16 cores, while on 8 cores there is no such problem. The test runs on an Intel(R) Xeon(R) CPU E5-2670, a Sandy Bridge with 32 logical cores on 2 sockets.

The test case is distributor_perf_autotest, i.e. app/test/test_distributor_perf.c. The test result is collected by the command:

perf stat -e LLC-load-misses,LLC-loads,LLC-store-misses,LLC-stores ./test -cff -n2 --no-huge

Note that the results show the LLC miss rate is the same with or without hugepages, so I will only show the --no-huge config.

With 8 cores, the LLC miss rate is OK:

LLC-load-misses    26750
LLC-loads          93979233
LLC-store-misses   432263
LLC-stores         69954746

That is a 0.028% load miss rate and a 0.62% store miss rate.

With 16 cores, the LLC miss rate is very high:

LLC-load-misses    70263520
LLC-loads          143807657
LLC-store-misses   23115990
LLC-stores         63692854

That is a 48.9% load miss rate and a 36.3% store miss rate.

Most of the load misses happen at the first line of rte_distributor_poll_pkt. Where most of the store misses happen I don't know, because perf record on LLC-store-misses brings down my machine.

It's not obvious to me how this can happen: 8 cores are fine, but 16 cores are very bad. My guess is that 16 cores generate more QPI transactions between the sockets, or that 16 cores produce a different LLC access pattern.

So I tried to reduce the padding inside union rte_distributor_buffer from 3 cache lines to 1:

- char pad[CACHE_LINE_SIZE*3];
+ char pad[CACHE_LINE_SIZE];

And it has an obvious result:

LLC-load-misses    53159968
LLC-loads          167756282
LLC-store-misses   29012799
LLC-stores         63352541

Now it is a 31.69% load miss rate and a 45.79% store miss rate. It lowers the load miss rate but raises the store miss rate; sadly, both numbers are still very high. The bright side is that it decreases the time per burst and the time per packet.

The original version gives:
=== Performance test of distributor ===
Time per burst: 8013
Time per packet: 250

And the patched version gives:
=== Performance test of distributor ===
Time per burst: 6834
Time per packet: 213

I tried a couple of other tricks, such as adding more idle loops in rte_distributor_get_pkt and making the rte_distributor_buffer thread-local to each worker core, but none of these tricks had any noticeable effect. These failures make me tend to believe the high LLC miss rate is related to QPI or NUMA, but my machine cannot perf the uncore QPI events, so this cannot be proved.

I cannot draw a conclusion or reveal the root cause after all, but I suggest a further study of this performance bottleneck so as to find a good solution.

thx &
rgds,
-qinglai

^ permalink raw reply	[flat|nested] 5+ messages in thread
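[For reference, the buffer in question is the per-worker, cache-line-aligned union through which the distributor and each worker exchange a single request/response word. A sketch reconstructed from the DPDK sources of that era (simplified, not a verbatim copy):]

/* One such buffer exists per worker. The distributor core and the worker
 * core both poll and write bufptr64, so this cache line ping-pongs between
 * their private caches -- and across QPI when the two cores sit on
 * different sockets, which is what the 16-core numbers above suggest. */
#define CACHE_LINE_SIZE 64              /* value DPDK used at the time */

union rte_distributor_buffer {
    volatile int64_t bufptr64;          /* request/response word; low bits carry flags */
    char pad[CACHE_LINE_SIZE * 3];      /* the padding the diff above shrinks to one line */
} __attribute__((aligned(CACHE_LINE_SIZE)));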
* Re: [dpdk-dev] LLC miss in librte_distributor
  2014-11-11 15:37 [dpdk-dev] LLC miss in librte_distributor jigsaw
@ 2014-11-12  8:37 ` jigsaw
  2014-11-12 16:07   ` Bruce Richardson
  0 siblings, 1 reply; 5+ messages in thread
From: jigsaw @ 2014-11-12 8:37 UTC (permalink / raw)
To: Bruce Richardson, dev

Hi,

OK, it is now very clear that it is due to memory transactions between different NUMA nodes.

The test program is here:
https://gist.github.com/jigsawecho/6a2e78d65f0fe67adf1b

The test machine topology is:

NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31

Changing the 3rd param from 0 to 1 at line 135 makes the LLC load miss rate jump from 0.09% to 33.45%, and the LLC store miss rate from 0.027% to 50.695%.

Clearly the root cause is transactions crossing the node boundary.

But then how to resolve this problem is another topic...

thx &
rgds,
-ql

On Tue, Nov 11, 2014 at 5:37 PM, jigsaw <jigsaw@gmail.com> wrote:
> [first message quoted in full - snipped]
^ permalink raw reply	[flat|nested] 5+ messages in thread
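[In case the gist link goes stale: the experiment it implements can be sketched roughly as below -- two threads ping-ponging one cache-line-aligned flag, pinned either to the same NUMA node or to different ones. Core numbers follow the topology above; this is an illustration, not the original program.]

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERS 1000000L

static volatile int64_t flag __attribute__((aligned(64)));  /* one shared cache line */

static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *ponger(void *arg)
{
    pin_to_cpu((int)(intptr_t)arg);
    for (int64_t i = 0; i < ITERS; i++) {
        while (flag != 2 * i)       /* spin until the main thread writes */
            ;
        flag = 2 * i + 1;           /* hand the cache line back */
    }
    return NULL;
}

int main(int argc, char **argv)
{
    /* CPU 0 is on node0; CPU 1 = same node, CPU 8 = other node (see above) */
    int peer = (argc > 1) ? atoi(argv[1]) : 8;
    pthread_t t;

    pin_to_cpu(0);
    pthread_create(&t, NULL, ponger, (void *)(intptr_t)peer);
    for (int64_t i = 0; i < ITERS; i++) {
        flag = 2 * i;               /* ping */
        while (flag != 2 * i + 1)   /* wait for pong */
            ;
    }
    pthread_join(t, NULL);
    printf("done\n");
    return 0;
}

[Built with gcc -O2 -pthread and run under perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses with peer 1 versus peer 8, this should reproduce the same-node/cross-node contrast described above.]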
* Re: [dpdk-dev] LLC miss in librte_distributor
  2014-11-12  8:37 ` jigsaw
@ 2014-11-12 16:07   ` Bruce Richardson
  2014-11-12 17:11     ` jigsaw
  0 siblings, 1 reply; 5+ messages in thread
From: Bruce Richardson @ 2014-11-12 16:07 UTC (permalink / raw)
To: jigsaw; +Cc: dev

On Wed, Nov 12, 2014 at 10:37:33AM +0200, jigsaw wrote:
> OK, it is now very clear that it is due to memory transactions between
> different NUMA nodes.
> [...]
> Clearly the root cause is transactions crossing the node boundary.
>
> But then how to resolve this problem is another topic...

Having traffic cross QPI is always a problem, and there could be a number of ways to solve it. Probably the best solution is to have multiple NICs, with some directly connected to each socket, and the packets from each NIC processed locally on the socket that NIC is connected to.

If that is not possible, then other solutions need to be looked at. E.g. for an app wanting to use a distributor, I would suggest investigating whether two distributors could be used - one on each socket. Then use a ring to burst-transfer large groups of packets from one socket to the other, and use the distributor locally. This would involve far less QPI traffic than using a distributor with remote workers.

Regards,
/Bruce

> [remainder of quoted thread snipped]
^ permalink raw reply	[flat|nested] 5+ messages in thread
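[A rough shape of the arrangement Bruce describes, sketched against the rte_ring and rte_distributor APIs of the DPDK releases of that time. Burst size, ring setup, and the packet-classification helper are illustrative assumptions, not code from the thread.]

#include <rte_ring.h>
#include <rte_distributor.h>
#include <rte_mbuf.h>

#define XFER_BURST 32

struct rte_distributor *dist0;   /* created on socket-0 memory */
struct rte_distributor *dist1;   /* created on socket-1 memory */
struct rte_ring *xfer_ring;      /* socket0 -> socket1 transfer ring */

/* Hypothetical app-defined test: should this packet be processed locally? */
extern int pkt_belongs_to_socket0(struct rte_mbuf *pkt);

/* Socket-0 I/O core: distribute local packets on the local distributor,
 * batch the rest so that only whole bursts -- not per-packet cache
 * lines -- cross QPI. (Overflow/drop handling omitted.) */
static void
rx_burst_socket0(struct rte_mbuf **pkts, unsigned n)
{
    struct rte_mbuf *local[XFER_BURST], *remote[XFER_BURST];
    unsigned i, n_local = 0, n_remote = 0;

    for (i = 0; i < n && i < XFER_BURST; i++) {
        if (pkt_belongs_to_socket0(pkts[i]))
            local[n_local++] = pkts[i];
        else
            remote[n_remote++] = pkts[i];
    }
    if (n_local > 0)
        rte_distributor_process(dist0, local, n_local);
    if (n_remote > 0)
        rte_ring_enqueue_burst(xfer_ring, (void **)remote, n_remote);
}

/* A socket-1 core: drain the transfer ring, then distribute locally. */
static void
xfer_poll_socket1(void)
{
    struct rte_mbuf *burst[XFER_BURST];
    unsigned n = rte_ring_dequeue_burst(xfer_ring, (void **)burst,
                                        XFER_BURST);
    if (n > 0)
        rte_distributor_process(dist1, burst, n);
}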
* Re: [dpdk-dev] LLC miss in librte_distributor
  2014-11-12 16:07   ` Bruce Richardson
@ 2014-11-12 17:11     ` jigsaw
  2014-11-13 15:17       ` jigsaw
  0 siblings, 1 reply; 5+ messages in thread
From: jigsaw @ 2014-11-12 17:11 UTC (permalink / raw)
To: Bruce Richardson; +Cc: dev

Hi Bruce,

Thanks for your reply.

I agree that logically dividing the distributor functionality is the best solution.

Meanwhile I tried some tricks and the results look good: for the same number of packets (1M), the LLC stores and loads decrease by 90%, and the miss rates for both drop to 25%. The L1 miss rate increases a bit, though. The combined result is that the time spent decreases by 50%.

The main change I made is to use a FIFO to transfer the packets from distributor to worker, while the current buf is used only as a signalling channel. This change has a very obvious effect on saving LLC accesses.

However, the test is based on the simple test program rather than on a DPDK application, so I will try the same tricks on DPDK and see whether they have the same effect. Besides, I need more time to read a few more papers to get it right.

I will try to propose a patch if I manage to get a positive result. It will take several days because I'm not fully dedicated to this issue. I will come back with more details.

BTW, I have another user story: a worker can ask the distributor to reschedule a packet. It arises in this condition: after processing a packet with tag value 1, the worker changes its tag to 2, so the distributor has to be asked to deliver the packet with the new tag value to the proper worker. I already have the patch ready, but I will hold it back until the previous patch is committed. I'd also like your comments on this user story.

thx &
rgds,
-ql

On Wed, Nov 12, 2014 at 6:07 PM, Bruce Richardson <bruce.richardson@intel.com> wrote:
> [Bruce's reply and earlier messages quoted in full - snipped]
^ permalink raw reply	[flat|nested] 5+ messages in thread
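[The thread does not include jigsaw's actual patch, so the following is only a guess at its shape, for illustration: a per-worker single-producer/single-consumer FIFO carries the packet pointers, leaving the old shared word as a pure signalling channel. Index handling is x86-oriented; real code would need proper atomics.]

#include <stdint.h>

#define FIFO_SIZE 1024   /* power of two; illustrative */

/* Producer (distributor) and consumer (worker) indexes live on separate
 * cache lines, so each side mostly touches lines it owns; packet
 * pointers flow through the slots instead of through the signalling word. */
struct worker_fifo {
    void *slots[FIFO_SIZE];
    volatile uint32_t head __attribute__((aligned(64)));  /* written by distributor */
    volatile uint32_t tail __attribute__((aligned(64)));  /* written by worker */
};

/* Distributor side: stage the packet, then publish the new head. */
static inline int
fifo_push(struct worker_fifo *f, void *pkt)
{
    uint32_t h = f->head;
    if (h - f->tail == FIFO_SIZE)             /* full */
        return -1;
    f->slots[h & (FIFO_SIZE - 1)] = pkt;
    __sync_synchronize();                     /* order slot write before index update */
    f->head = h + 1;
    return 0;
}

/* Worker side: consume without ever writing the producer's line. */
static inline void *
fifo_pop(struct worker_fifo *f)
{
    uint32_t t = f->tail;
    if (t == f->head)                         /* empty */
        return NULL;
    void *pkt = f->slots[t & (FIFO_SIZE - 1)];
    f->tail = t + 1;
    return pkt;
}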
* Re: [dpdk-dev] LLC miss in librte_distributor
  2014-11-12 17:11     ` jigsaw
@ 2014-11-13 15:17       ` jigsaw
  0 siblings, 0 replies; 5+ messages in thread
From: jigsaw @ 2014-11-13 15:17 UTC (permalink / raw)
To: Bruce Richardson; +Cc: dev

Hi,

Well, I give up on the idea of optimizing the QPI-caused LLC misses. The queue-based messaging has even worse performance than polling the same buf from both cores. It is the nature of the busy-polling model.

I guess we have to accept it as a fact, unless the programming model can be changed to a biased-locking model, which favours one lock-owner core. But unfortunately biased locking doesn't seem applicable to the distributor.

thx &
rgds,
-ql

On Wed, Nov 12, 2014 at 7:11 PM, jigsaw <jigsaw@gmail.com> wrote:
> [previous message and earlier thread quoted in full - snipped]
^ permalink raw reply	[flat|nested] 5+ messages in thread
end of thread, other threads:[~2014-11-13 15:07 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-11 15:37 [dpdk-dev] LLC miss in librte_distributor jigsaw
2014-11-12  8:37 ` jigsaw
2014-11-12 16:07   ` Bruce Richardson
2014-11-12 17:11     ` jigsaw
2014-11-13 15:17       ` jigsaw