From: jigsaw
To: Bruce Richardson, dev@dpdk.org
Date: Wed, 12 Nov 2014 10:37:33 +0200
Subject: Re: [dpdk-dev] LLC miss in librte_distributor

Hi,

OK, it is now very clear that it is due to memory transactions between
different NUMA nodes.

The test program is here:
https://gist.github.com/jigsawecho/6a2e78d65f0fe67adf1b

The test machine topology is:

NUMA node0 CPU(s):   0-7,16-23
NUMA node1 CPU(s):   8-15,24-31

Change the 3rd param from 0 to 1 at line 135, and the LLC load miss rate
jumps from 0.09% to 33.45%, while the LLC store miss rate jumps from
0.027% to 50.695%.

Clearly the root cause is transactions crossing the node boundary.
But how to resolve this problem is another topic...

thx &
rgds,
-ql
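
For reference, here is a minimal sketch (not part of the gist above) that
prints the lcore-to-socket mapping via the standard DPDK lcore API; running
it with the same coremask as the test shows which of the enabled cores land
on the remote node:

/*
 * Minimal sketch, assuming a stock DPDK build environment: print which
 * socket each enabled lcore belongs to, so distributor/worker cores that
 * end up on different NUMA nodes are easy to spot.
 */
#include <stdio.h>

#include <rte_eal.h>
#include <rte_lcore.h>

int
main(int argc, char **argv)
{
	unsigned lcore_id;

	if (rte_eal_init(argc, argv) < 0) {
		printf("EAL init failed\n");
		return -1;
	}

	RTE_LCORE_FOREACH(lcore_id) {
		printf("lcore %u -> socket %u\n",
		       lcore_id, rte_lcore_to_socket_id(lcore_id));
	}
	return 0;
}
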
On Tue, Nov 11, 2014 at 5:37 PM, jigsaw wrote:

> Hi Bruce,
>
> I noticed that librte_distributor has quite a severe LLC miss problem
> when running on 16 cores, while on 8 cores there is no such problem.
> The test runs on an Intel(R) Xeon(R) CPU E5-2670, a Sandy Bridge with 32
> cores on 2 sockets.
>
> The test case is distributor_perf_autotest, i.e.
> app/test/test_distributor_perf.c.
> The test result is collected by the command:
>
> perf stat -e LLC-load-misses,LLC-loads,LLC-store-misses,LLC-stores ./test
> -cff -n2 --no-huge
>
> Note that the results show that the LLC miss rate remains the same with
> or without hugepages, so I will just show the --no-huge config.
>
> With 8 cores, the LLC miss rate is OK:
>
> LLC-load-misses     26750
> LLC-loads           93979233
> LLC-store-misses    432263
> LLC-stores          69954746
>
> That is 0.028% load misses and 0.62% store misses.
>
> With 16 cores, the LLC miss rate is very high:
>
> LLC-load-misses     70263520
> LLC-loads           143807657
> LLC-store-misses    23115990
> LLC-stores          63692854
>
> That is 48.9% load misses and 36.3% store misses.
>
> Most of the load misses happen at the first line of
> rte_distributor_poll_pkt. Where most of the store misses happen I don't
> know, because perf record on LLC-store-misses brings down my machine.
>
> It's not so straightforward to me how this could happen: 8 cores are
> fine, but 16 cores are very bad.
> My guess is that 16 cores bring in more QPI transactions between sockets,
> or that 16 cores produce a different LLC access pattern.
>
> So I tried to reduce the padding inside union rte_distributor_buffer from
> 3 cachelines to 1 cacheline:
>
> - char pad[CACHE_LINE_SIZE*3];
> + char pad[CACHE_LINE_SIZE];
>
> And it does have an obvious result:
>
> LLC-load-misses     53159968
> LLC-loads           167756282
> LLC-store-misses    29012799
> LLC-stores          63352541
>
> Now it is 31.69% load misses and 45.79% store misses.
>
> It lowers the load miss rate but raises the store miss rate.
> Both numbers are still very high, sadly.
> The bright side is that it decreases the time per burst and the time per
> packet.
>
> The original version has:
> === Performance test of distributor ===
> Time per burst: 8013
> Time per packet: 250
>
> And the patched version has:
> === Performance test of distributor ===
> Time per burst: 6834
> Time per packet: 213
>
> I tried a couple of other tricks, such as adding more idle loops in
> rte_distributor_get_pkt, and making the rte_distributor_buffer
> thread_local to each worker core. But none of these tricks had any
> noticeable effect. These failures make me tend to believe the high LLC
> miss rate is related to QPI or NUMA, but my machine is not able to perf
> uncore QPI events, so this cannot be confirmed.
>
> I cannot draw any conclusion or reveal the root cause after all, but I
> suggest a further study of this performance bottleneck so as to find a
> good solution.
>
> thx &
> rgds,
> -qinglai
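
For context on the padding change quoted above, the buffer it touches is
defined in lib/librte_distributor/rte_distributor.c roughly as follows
(paraphrased, so the exact layout may differ slightly):

/* Paraphrased (approximate) per-worker buffer from librte_distributor.
 * The single volatile 64-bit word carries the mbuf pointer plus flag bits
 * and is written from both the distributor core and the worker core; the
 * worker polls it in rte_distributor_poll_pkt. The pad only spaces the
 * per-worker buffers apart in memory. */
union rte_distributor_buffer {
	volatile int64_t bufptr64;     /* word shared by distributor and worker */
	char pad[CACHE_LINE_SIZE*3];   /* the diff above shrinks this to one line */
} __rte_cache_aligned;

Whatever the pad size, the distributor and each worker still share that one
cache line, so when they sit on different sockets every poll turns into
cross-socket coherence traffic, which would be consistent with the NUMA
finding above.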