From mboxrd@z Thu Jan 1 00:00:00 1970
From: jigsaw
To: Bruce Richardson
Cc: "dev@dpdk.org"
Date: Thu, 13 Nov 2014 17:17:41 +0200
Subject: Re: [dpdk-dev] LLC miss in librte_distributor

Hi,

Well, I have given up on the idea of optimizing away the QPI-caused LLC
misses. The queue-based messaging performs even worse than polling the same
buf from both cores. That is the nature of the busy-polling model. I guess
we have to accept it as a fact, unless the programming model can be changed
to a biased-locking model, which favours one lock-owner core. But
unfortunately the biased-locking model doesn't seem applicable to the
distributor.

thx &
rgds,
-ql

On Wed, Nov 12, 2014 at 7:11 PM, jigsaw wrote:

> Hi Bruce,
>
> Thanks for your reply.
>
> I agree that logically dividing the distributor functionality is the best
> solution.
>
> Meanwhile I tried some tricks and the results look good: for the same
> number of pkts (1M), the LLC stores and loads decrease by 90%, and the
> miss rates for both drop to 25%.
> The L1 miss rate increases a bit, though.
> The combined result is that the time spent decreases by 50%.
> The main change I made is to use a FIFO to transfer the pkts from the
> distributor to the worker, while the current buf is used only as a
> signalling channel. This change has a very obvious effect on reducing
> LLC accesses.
>
> However, the test is based on the simple test program rather than on a
> DPDK application, so I will try the same tricks on DPDK and see whether
> they have the same effect.
> Besides, I need more time to read a few more papers to get it right.
>
> I will try to propose a patch if I manage to get a positive result. It
> will take several days because I'm not fully dedicated to this issue.
>
> I will come back with more details.
>
> BTW, I have another user story: a worker can ask the distributor to
> re-schedule a pkt.
> It arises in the following situation: after processing a pkt with tag
> value 1, the worker changes its tag to 2, so the distributor has to be
> asked to deliver the pkt with the new tag value to the proper worker.
> I already have the patch ready, but I will hold it back until the
> previous patch is committed.
> I also need your comments on this user story.
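
(Editorial note: the patch for this "re-schedule with a new tag" story is not
included in the thread. The sketch below is only a hypothetical illustration
of the flow described above, built from the public rte_ring / rte_distributor
API of that era; the ring, the helper names and the burst size are invented,
and error handling is minimal.)

#include <rte_ring.h>
#include <rte_mbuf.h>
#include <rte_distributor.h>

/* Ring carrying "please deliver this mbuf again under its new tag" requests
 * from the workers back to the core that runs rte_distributor_process().
 * Created elsewhere with rte_ring_create(); name and size are illustrative. */
extern struct rte_ring *resched_ring;

/* Worker side: finish the tag-1 work, change the tag, hand the mbuf back. */
static void
worker_retag_and_resched(struct rte_mbuf *m, uint32_t new_tag)
{
        /* The field the distributor matches workers on; in this era of DPDK
         * it is hash.usr (older releases: pkt.hash.usr). */
        m->hash.usr = new_tag;
        if (rte_ring_enqueue(resched_ring, m) < 0)
                rte_pktmbuf_free(m); /* ring full: drop; a real patch would do better */
}

/* Distributor-core side: drain the re-schedule requests and push them
 * through rte_distributor_process() again, so each one reaches the worker
 * that owns its new tag. */
static void
distributor_handle_resched(struct rte_distributor *d)
{
        struct rte_mbuf *mbufs[32];
        unsigned n = rte_ring_dequeue_burst(resched_ring, (void **)mbufs, 32);

        if (n > 0)
                rte_distributor_process(d, mbufs, n);
}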
>
> thx &
> rgds,
> -ql
>
> On Wed, Nov 12, 2014 at 6:07 PM, Bruce Richardson <
> bruce.richardson@intel.com> wrote:
>
>> On Wed, Nov 12, 2014 at 10:37:33AM +0200, jigsaw wrote:
>> > Hi,
>> >
>> > OK, it is now very clear that it is due to memory transactions between
>> > different nodes.
>> >
>> > The test program is here:
>> > https://gist.github.com/jigsawecho/6a2e78d65f0fe67adf1b
>> >
>> > The test machine topology is:
>> >
>> > NUMA node0 CPU(s): 0-7,16-23
>> > NUMA node1 CPU(s): 8-15,24-31
>> >
>> > Change the 3rd param from 0 to 1 at line 135, and the LLC load miss
>> > rate jumps from 0.09% to 33.45%.
>> > The LLC store miss rate jumps from 0.027% to 50.695%.
>> >
>> > Clearly the root cause is transactions crossing the node boundary.
>> >
>> > But then how to resolve this problem is another topic...
>> >
>> > thx &
>> > rgds,
>> > -ql
>> >
>>
>> Having traffic cross QPI is always a problem, and there could be a number
>> of ways to solve it. Probably the best solution is to have multiple NICs,
>> with some directly connected to each socket, and the packets from each
>> NIC processed locally on the socket that NIC is connected to.
>>
>> If that is not possible, then other solutions need to be looked at. E.g.
>> for an app wanting to use a distributor, I would suggest investigating
>> whether two distributors could be used - one on each socket. Then use a
>> ring to burst-transfer large groups of packets from one socket to another
>> and then use the distributor locally.
>> This would involve far less QPI traffic than using a distributor with
>> remote workers.
>>
>> Regards,
>> /Bruce
>>
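
(Editorial note: a minimal sketch of the per-socket arrangement suggested
above - one distributor per socket, plus a ring that carries bursts of
packets across the QPI link so that only the burst transfer itself is
remote. All names, the ring size and the pre-sorted local/remote split are
invented for illustration, and error handling is kept to a minimum.)

#include <rte_ring.h>
#include <rte_mbuf.h>
#include <rte_distributor.h>

#define XSOCK_BURST 64

/* One distributor per socket; workers only ever talk to the distributor
 * running on their own socket. */
static struct rte_distributor *dist[2];
static struct rte_ring *xsock_ring; /* socket 0 -> socket 1 hand-over */

static int
setup(unsigned workers_per_socket)
{
        dist[0] = rte_distributor_create("dist_s0", 0, workers_per_socket);
        dist[1] = rte_distributor_create("dist_s1", 1, workers_per_socket);
        /* Allocate the ring on the receiving socket so the polling
         * (dequeue) side stays local and only the enqueued bursts cross QPI. */
        xsock_ring = rte_ring_create("xsock", 1024, 1,
                        RING_F_SP_ENQ | RING_F_SC_DEQ);
        return (dist[0] && dist[1] && xsock_ring) ? 0 : -1;
}

/* Socket-0 distributor core: feed local packets to the local distributor
 * and burst the rest across the socket boundary in one go.
 * (How packets are split into local/remote is up to the application.) */
static void
socket0_distribute(struct rte_mbuf **local, unsigned n_local,
                   struct rte_mbuf **remote, unsigned n_remote)
{
        rte_distributor_process(dist[0], local, n_local);
        if (n_remote > 0) {
                unsigned sent = rte_ring_enqueue_burst(xsock_ring,
                                (void * const *)remote, n_remote);
                /* A real application must deal with the unsent remainder;
                 * here it is simply dropped. */
                while (sent < n_remote)
                        rte_pktmbuf_free(remote[sent++]);
        }
}

/* Socket-1 distributor core: pull a whole burst over QPI at once, then
 * distribute it to the workers that are local to socket 1. */
static void
socket1_distribute(void)
{
        struct rte_mbuf *burst[XSOCK_BURST];
        unsigned n = rte_ring_dequeue_burst(xsock_ring, (void **)burst,
                        XSOCK_BURST);

        if (n > 0)
                rte_distributor_process(dist[1], burst, n);
}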
>> > On Tue, Nov 11, 2014 at 5:37 PM, jigsaw wrote:
>> >
>> > > Hi Bruce,
>> > >
>> > > I noticed that librte_distributor has quite a severe LLC miss problem
>> > > when running on 16 cores, while on 8 cores there is no such problem.
>> > > The test runs on an Intel(R) Xeon(R) CPU E5-2670, a Sandy Bridge with
>> > > 32 logical cores on 2 sockets.
>> > >
>> > > The test case is the distributor_perf_autotest, i.e.
>> > > in app/test/test_distributor_perf.c.
>> > > The test result is collected by the command:
>> > >
>> > > perf stat -e LLC-load-misses,LLC-loads,LLC-store-misses,LLC-stores
>> > > ./test -cff -n2 --no-huge
>> > >
>> > > Note that the test results show that the LLC miss rate remains the
>> > > same with or without hugepages, so I will just show the --no-huge
>> > > config.
>> > >
>> > > With 8 cores, the LLC miss rate is OK:
>> > >
>> > > LLC-load-misses   26750
>> > > LLC-loads         93979233
>> > > LLC-store-misses  432263
>> > > LLC-stores        69954746
>> > >
>> > > That is a 0.028% load miss rate and a 0.62% store miss rate.
>> > >
>> > > With 16 cores, the LLC miss rate is very high:
>> > >
>> > > LLC-load-misses   70263520
>> > > LLC-loads         143807657
>> > > LLC-store-misses  23115990
>> > > LLC-stores        63692854
>> > >
>> > > That is a 48.9% load miss rate and a 36.3% store miss rate.
>> > >
>> > > Most of the load misses happen at the first line of
>> > > rte_distributor_poll_pkt.
>> > > Where most of the store misses happen I don't know, because perf
>> > > record on LLC-store-misses brings down my machine.
>> > >
>> > > It's not obvious to me how this can happen: 8 cores are fine, but
>> > > 16 cores are very bad.
>> > > My guess is that 16 cores bring in more QPI transactions between the
>> > > sockets, or perhaps a different LLC access pattern?
>> > >
>> > > So I tried to reduce the padding inside union rte_distributor_buffer
>> > > from 3 cachelines to 1 cacheline:
>> > >
>> > > - char pad[CACHE_LINE_SIZE*3];
>> > > + char pad[CACHE_LINE_SIZE];
>> > >
>> > > And it does have an obvious effect:
>> > >
>> > > LLC-load-misses   53159968
>> > > LLC-loads         167756282
>> > > LLC-store-misses  29012799
>> > > LLC-stores        63352541
>> > >
>> > > Now it is a 31.69% load miss rate and a 45.79% store miss rate.
>> > >
>> > > It lowers the load miss rate but raises the store miss rate. Both
>> > > numbers are still very high, sadly.
>> > > On the bright side, it decreases the time per burst and the time per
>> > > packet.
>> > >
>> > > The original version has:
>> > > === Performance test of distributor ===
>> > > Time per burst: 8013
>> > > Time per packet: 250
>> > >
>> > > And the patched version has:
>> > > === Performance test of distributor ===
>> > > Time per burst: 6834
>> > > Time per packet: 213
>> > >
>> > > I tried a couple of other tricks, such as adding more idle loops
>> > > in rte_distributor_get_pkt and making the rte_distributor_buffer
>> > > thread-local to each worker core, but none of these tricks had any
>> > > noticeable effect. These failures make me tend to believe that the
>> > > high LLC miss rate is related to QPI or NUMA, but my machine is not
>> > > able to perf on uncore QPI events, so this cannot be proved.
>> > >
>> > > In the end I cannot draw any conclusion or pinpoint the root cause,
>> > > but I suggest further study of the performance bottleneck so as to
>> > > find a good solution.
>> > >
>> > > thx &
>> > > rgds,
>> > > -qinglai
>> > >
>> >
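
(Editorial postscript: the "FIFO plus signalling channel" change mentioned in
the two most recent messages at the top of this thread is not spelled out
anywhere above. The sketch below is one hypothetical shape such a per-worker
channel could take: the mbuf pointers travel through a ring in bursts, while
the single cache line shared by the two cores is reduced to a doorbell word.
All structure and function names are invented, return values are ignored for
brevity, and memory-ordering details are glossed over.)

#include <rte_ring.h>
#include <rte_mbuf.h>
#include <rte_memory.h>   /* __rte_cache_aligned */
#include <rte_cycles.h>   /* rte_pause(); rte_pause.h in newer DPDK */

/* Hypothetical per-worker channel: payload pointers go through a FIFO,
 * while the only cache line touched by both cores is a small doorbell. */
struct worker_channel {
        struct rte_ring *fifo;                          /* payload         */
        volatile uint64_t doorbell __rte_cache_aligned; /* signalling only */
};

enum { DB_IDLE = 0, DB_DATA = 1 };

/* Distributor side: push a whole burst, then ring the doorbell once, so the
 * shared cache line bounces once per burst instead of once per packet. */
static inline void
channel_send(struct worker_channel *ch, struct rte_mbuf **pkts, unsigned n)
{
        rte_ring_sp_enqueue_burst(ch->fifo, (void * const *)pkts, n);
        ch->doorbell = DB_DATA;
}

/* Worker side: spin on the doorbell (mostly hitting the local cache copy),
 * clear it, then drain the FIFO. */
static inline unsigned
channel_recv(struct worker_channel *ch, struct rte_mbuf **pkts, unsigned max)
{
        while (ch->doorbell != DB_DATA)
                rte_pause();
        ch->doorbell = DB_IDLE;  /* cleared before draining, so no lost wakeup */
        return rte_ring_sc_dequeue_burst(ch->fifo, (void **)pkts, max);
}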