From: jigsaw
To: Bruce Richardson
Cc: "dev@dpdk.org"
Date: Wed, 12 Nov 2014 19:11:08 +0200
Subject: Re: [dpdk-dev] LLC miss in librte_distributor

Hi Bruce,

Thanks for your reply. I agree that logically dividing the distributor
functionality is the best solution.

Meanwhile I tried some tricks, and the result looks good: for the same
amount of packets (1M), the LLC stores and loads decrease by 90%, and the
miss rates for both drop to 25%. The L1 miss rate increases a bit, though.
The combined result is that the time spent decreases by 50%.

The main change I made is to use a FIFO to transfer the packets from the
distributor to the workers, while the existing buffer is used only as a
signalling channel. This change has a very obvious effect on saving LLC
accesses.
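Roughly, the idea looks like the sketch below. This is only a simplified
illustration against the DPDK 1.7-era ring API, not the actual code I am
testing; the struct and helper names are made up for this mail, and error
handling is omitted.

/*
 * One data FIFO (rte_ring) per worker carries the mbuf pointers, while the
 * per-worker cache line is kept purely as a request/notify flag, so packet
 * pointers no longer bounce through the shared line itself.
 */
#include <stdint.h>
#include <rte_memory.h>
#include <rte_ring.h>
#include <rte_mbuf.h>

#define WORKER_BURST 8

struct worker_channel {
	struct rte_ring *fifo;      /* sp/sc ring: distributor -> worker */
	volatile int64_t notify __rte_cache_aligned; /* signalling only */
};

/* distributor side: push a burst of packets to one worker */
static inline void
dist_send_to_worker(struct worker_channel *ch, struct rte_mbuf **pkts,
		unsigned n)
{
	/* the packet pointers travel through the FIFO ...          */
	rte_ring_sp_enqueue_burst(ch->fifo, (void **)pkts, n);
	/* ... and the shared cache line only says "work available" */
	ch->notify = 1;
}

/* worker side: wait for the signal, then drain the FIFO */
static inline unsigned
worker_recv_from_dist(struct worker_channel *ch, struct rte_mbuf **pkts)
{
	while (ch->notify == 0)
		;                   /* poll the signalling line */
	ch->notify = 0;
	return rte_ring_sc_dequeue_burst(ch->fifo, (void **)pkts, WORKER_BURST);
}

The handshake still goes through one shared cache line per worker, but that
line no longer carries the packet pointers themselves.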
However, the test is based on the simple test program rather than on a real
DPDK application, so I will try the same tricks on DPDK and see if they have
the same effect. Besides, I need more time to read a few more papers to get
it right. I will try to propose a patch if I manage to get a positive
result. It will take several days because I'm not fully dedicated to this
issue. I will come back with more details.

BTW, I have another user story: a worker can ask the distributor to
re-schedule a packet. It arises in the following situation: after processing
a packet with tag value 1, the worker changes its tag to 2, so the
distributor has to be asked to deliver the packet with the new tag value to
the proper worker.
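To make the idea concrete, the worker side could look roughly like the
sketch below. rte_distributor_reschedule_pkt() is only a hypothetical name
standing in for the call such a patch would add; rte_distributor_get_pkt()
is the existing DPDK 1.7-era worker API.

#include <rte_mbuf.h>
#include <rte_distributor.h>

/*
 * Hypothetical new call (illustration only): hand 'pkt' back so that the
 * distributor matches it against its updated tag and delivers it to
 * whichever worker owns that tag.
 */
int rte_distributor_reschedule_pkt(struct rte_distributor *d,
		unsigned worker_id, struct rte_mbuf *pkt);

static void
worker_loop(struct rte_distributor *d, unsigned worker_id)
{
	struct rte_mbuf *pkt = NULL;

	for (;;) {
		pkt = rte_distributor_get_pkt(d, worker_id, pkt);

		/* ... process the packet under its current tag (say, 1) ... */

		/*
		 * Processing decided the packet now belongs to tag 2: update
		 * the tag stored in the mbuf, then ask the distributor to
		 * deliver it to the worker that owns the new tag instead of
		 * completing it here.
		 */
		rte_distributor_reschedule_pkt(d, worker_id, pkt);
		pkt = NULL;	/* nothing to hand back on the next get_pkt() */
	}
}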
I already have the patch ready, but I will hold it back until the previous
patch is committed. I would also like your comments on this user story.

thx & rgds,
-ql

On Wed, Nov 12, 2014 at 6:07 PM, Bruce Richardson
<bruce.richardson@intel.com> wrote:

> On Wed, Nov 12, 2014 at 10:37:33AM +0200, jigsaw wrote:
> > Hi,
> >
> > OK, it is now very clear that it is due to memory transactions between
> > different nodes.
> >
> > The test program is here:
> > https://gist.github.com/jigsawecho/6a2e78d65f0fe67adf1b
> >
> > The test machine topology is:
> >
> > NUMA node0 CPU(s): 0-7,16-23
> > NUMA node1 CPU(s): 8-15,24-31
> >
> > Change the 3rd param from 0 to 1 at line 135, and the LLC load miss
> > rate jumps from 0.09% to 33.45%, while the LLC store miss rate jumps
> > from 0.027% to 50.695%.
> >
> > Clearly the root cause is transactions crossing the node boundary.
> >
> > But then how to resolve this problem is another topic...
> >
> > thx &
> > rgds,
> > -ql
> >
> Having traffic cross QPI is always a problem, and there could be a number
> of ways to solve it. Probably the best solution is to have multiple NICs,
> with some directly connected to each socket, and the packets from each
> NIC processed locally on the socket that NIC is connected to.
>
> If that is not possible, then other solutions need to be looked at. E.g.
> for an app wanting to use a distributor, I would suggest investigating
> whether two distributors could be used - one on each socket. Then use a
> ring to burst-transfer large groups of packets from one socket to another
> and use the distributor locally. This would involve far less QPI traffic
> than using a distributor with remote workers.
>
> Regards,
> /Bruce
>
> > On Tue, Nov 11, 2014 at 5:37 PM, jigsaw wrote:
> >
> > > Hi Bruce,
> > >
> > > I noticed that librte_distributor has quite a severe LLC miss problem
> > > when running on 16 cores, while on 8 cores there is no such problem.
> > > The test runs on an Intel(R) Xeon(R) CPU E5-2670, a Sandy Bridge with
> > > 32 cores on 2 sockets.
> > >
> > > The test case is the distributor_perf_autotest, i.e.
> > > app/test/test_distributor_perf.c. The test result is collected by:
> > >
> > > perf stat -e LLC-load-misses,LLC-loads,LLC-store-misses,LLC-stores
> > > ./test -cff -n2 --no-huge
> > >
> > > Note that the test results show that the LLC miss rate remains the
> > > same with or without hugepages, so I will just show the --no-huge
> > > config.
> > >
> > > With 8 cores, the LLC miss rate is OK:
> > >
> > > LLC-load-misses 26750
> > > LLC-loads 93979233
> > > LLC-store-misses 432263
> > > LLC-stores 69954746
> > >
> > > That is a 0.028% load miss rate and a 0.62% store miss rate.
> > >
> > > With 16 cores, the LLC miss rate is very high:
> > >
> > > LLC-load-misses 70263520
> > > LLC-loads 143807657
> > > LLC-store-misses 23115990
> > > LLC-stores 63692854
> > >
> > > That is a 48.9% load miss rate and a 36.3% store miss rate.
> > >
> > > Most of the load misses happen at the first line of
> > > rte_distributor_poll_pkt. Where most of the store misses happen I
> > > don't know, because perf record on LLC-store-misses brings down my
> > > machine.
> > >
> > > It's not so straightforward to me how this could happen: 8 cores are
> > > fine, but 16 cores are very bad. My guess is that 16 cores bring in
> > > more QPI transactions between the sockets? Or that 16 cores produce a
> > > different LLC access pattern?
> > >
> > > So I tried to reduce the padding inside union rte_distributor_buffer
> > > from 3 cache lines to 1 cache line:
> > >
> > > - char pad[CACHE_LINE_SIZE*3];
> > > + char pad[CACHE_LINE_SIZE];
> > >
> > > And it does have an obvious result:
> > >
> > > LLC-load-misses 53159968
> > > LLC-loads 167756282
> > > LLC-store-misses 29012799
> > > LLC-stores 63352541
> > >
> > > Now it is a 31.69% load miss rate and a 45.79% store miss rate.
> > >
> > > It lowers the load miss rate but raises the store miss rate. Both
> > > numbers are still very high, sadly. The bright side is that it
> > > decreases the time per burst and the time per packet.
> > >
> > > The original version has:
> > > === Performance test of distributor ===
> > > Time per burst: 8013
> > > Time per packet: 250
> > >
> > > And the patched version has:
> > > === Performance test of distributor ===
> > > Time per burst: 6834
> > > Time per packet: 213
> > >
> > > I tried a couple of other tricks, such as adding more idle loops in
> > > rte_distributor_get_pkt and making the rte_distributor_buffer
> > > thread-local to each worker core, but none of these tricks had any
> > > noticeable outcome. These failures make me tend to believe that the
> > > high LLC miss rate is related to QPI or NUMA, but my machine cannot
> > > perf on uncore QPI events, so this cannot be proven.
> > >
> > > I cannot draw any conclusion or reveal the root cause after all, but
> > > I suggest a further study on the performance bottleneck so as to find
> > > a good solution.
> > >
> > > thx &
> > > rgds,
> > > -qinglai
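For reference, the two-distributor layout described above could be wired up
roughly as in this sketch: one distributor per socket, plus a single
producer / single consumer ring that burst-transfers packets arriving on
socket 0 over to socket 1, where they are then distributed locally. The
names, the ring size and the socket assignment are illustrative only,
against the DPDK 1.7-era APIs.

#include <rte_ring.h>
#include <rte_distributor.h>
#include <rte_mbuf.h>

#define XFER_BURST 64

struct rte_distributor *dist[2];   /* one distributor per socket         */
struct rte_ring *xsock_ring;       /* socket 0 -> socket 1 transfer ring */

static void
setup(unsigned nb_workers_per_socket)
{
	dist[0] = rte_distributor_create("dist_s0", 0, nb_workers_per_socket);
	dist[1] = rte_distributor_create("dist_s1", 1, nb_workers_per_socket);
	/*
	 * Single producer (an rx core on socket 0), single consumer (the
	 * distributor core on socket 1); sized large so that transfers can
	 * stay in big bursts.
	 */
	xsock_ring = rte_ring_create("xsock", 4096, 1,
			RING_F_SP_ENQ | RING_F_SC_DEQ);
}

/* socket-0 side: packets destined for socket 1 are only enqueued */
static void
socket0_send_to_socket1(struct rte_mbuf **pkts, unsigned n)
{
	rte_ring_sp_enqueue_burst(xsock_ring, (void **)pkts, n);
}

/*
 * Socket-1 side: drain the ring in large bursts and distribute locally,
 * so the transfer ring is the only structure generating QPI traffic.
 */
static void
socket1_distributor_loop(void)
{
	struct rte_mbuf *pkts[XFER_BURST];
	unsigned n;

	for (;;) {
		n = rte_ring_sc_dequeue_burst(xsock_ring, (void **)pkts,
				XFER_BURST);
		if (n > 0)
			rte_distributor_process(dist[1], pkts, n);
	}
}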