From: jigsaw
To: Bruce Richardson, dev@dpdk.org
Date: Wed, 12 Nov 2014 10:37:33 +0200
Subject: Re: [dpdk-dev] LLC miss in librte_distributor

Hi,

OK, it is now very clear that it is due to memory transactions between
different NUMA nodes.

The test program is here:
https://gist.github.com/jigsawecho/6a2e78d65f0fe67adf1b

The test machine topology is:

NUMA node0 CPU(s):   0-7,16-23
NUMA node1 CPU(s):   8-15,24-31

Change the 3rd param from 0 to 1 at line 135, and the LLC load miss rate
jumps from 0.09% to 33.45%, while the LLC store miss rate jumps from
0.027% to 50.695%.

Clearly the root cause is transactions crossing the node boundary.
But how to resolve this problem is another topic...

thx &
rgds,
-ql
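
For reference, here is a minimal sketch (not part of the gist above) that
prints the lcore-to-socket mapping via the standard DPDK lcore API; running
it with the same coremask as the test shows which of the enabled cores land
on the remote node:

/*
 * Minimal sketch, assuming a stock DPDK build environment: print which
 * socket each enabled lcore belongs to, so distributor/worker cores that
 * end up on different NUMA nodes are easy to spot.
 */
#include <stdio.h>

#include <rte_eal.h>
#include <rte_lcore.h>

int
main(int argc, char **argv)
{
	unsigned lcore_id;

	if (rte_eal_init(argc, argv) < 0) {
		printf("EAL init failed\n");
		return -1;
	}

	RTE_LCORE_FOREACH(lcore_id) {
		printf("lcore %u -> socket %u\n",
		       lcore_id, rte_lcore_to_socket_id(lcore_id));
	}
	return 0;
}
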
On Tue, Nov 11, 2014 at 5:37 PM, jigsaw wrote:

> Hi Bruce,
>
> I noticed that librte_distributor has quite a severe LLC miss problem
> when running on 16 cores, while on 8 cores there is no such problem.
> The test runs on an Intel(R) Xeon(R) CPU E5-2670, a Sandy Bridge with 32
> cores on 2 sockets.
>
> The test case is distributor_perf_autotest, i.e.
> app/test/test_distributor_perf.c.
> The test result is collected by the command:
>
> perf stat -e LLC-load-misses,LLC-loads,LLC-store-misses,LLC-stores ./test
> -cff -n2 --no-huge
>
> Note that the results show that the LLC miss rate remains the same with
> or without hugepages, so I will just show the --no-huge config.
>
> With 8 cores, the LLC miss rate is OK:
>
> LLC-load-misses     26750
> LLC-loads           93979233
> LLC-store-misses    432263
> LLC-stores          69954746
>
> That is 0.028% load misses and 0.62% store misses.
>
> With 16 cores, the LLC miss rate is very high:
>
> LLC-load-misses     70263520
> LLC-loads           143807657
> LLC-store-misses    23115990
> LLC-stores          63692854
>
> That is 48.9% load misses and 36.3% store misses.
>
> Most of the load misses happen at the first line of
> rte_distributor_poll_pkt. Where most of the store misses happen I don't
> know, because perf record on LLC-store-misses brings down my machine.
>
> It's not so straightforward to me how this could happen: 8 cores are
> fine, but 16 cores are very bad.
> My guess is that 16 cores bring in more QPI transactions between sockets,
> or that 16 cores produce a different LLC access pattern.
>
> So I tried to reduce the padding inside union rte_distributor_buffer from
> 3 cachelines to 1 cacheline:
>
> - char pad[CACHE_LINE_SIZE*3];
> + char pad[CACHE_LINE_SIZE];
>
> And it does have an obvious result:
>
> LLC-load-misses     53159968
> LLC-loads           167756282
> LLC-store-misses    29012799
> LLC-stores          63352541
>
> Now it is 31.69% load misses and 45.79% store misses.
>
> It lowers the load miss rate but raises the store miss rate.
> Both numbers are still very high, sadly.
> The bright side is that it decreases the time per burst and the time per
> packet.
>
> The original version has:
> === Performance test of distributor ===
> Time per burst: 8013
> Time per packet: 250
>
> And the patched version has:
> === Performance test of distributor ===
> Time per burst: 6834
> Time per packet: 213
>
> I tried a couple of other tricks, such as adding more idle loops in
> rte_distributor_get_pkt, and making the rte_distributor_buffer
> thread_local to each worker core. But none of these tricks had any
> noticeable effect. These failures make me tend to believe the high LLC
> miss rate is related to QPI or NUMA, but my machine is not able to perf
> uncore QPI events, so this cannot be confirmed.
>
> I cannot draw any conclusion or reveal the root cause after all, but I
> suggest a further study of this performance bottleneck so as to find a
> good solution.
>
> thx &
> rgds,
> -qinglai
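
For context on the padding change quoted above, the buffer it touches is
defined in lib/librte_distributor/rte_distributor.c roughly as follows
(paraphrased, so the exact layout may differ slightly):

/* Paraphrased (approximate) per-worker buffer from librte_distributor.
 * The single volatile 64-bit word carries the mbuf pointer plus flag bits
 * and is written from both the distributor core and the worker core; the
 * worker polls it in rte_distributor_poll_pkt. The pad only spaces the
 * per-worker buffers apart in memory. */
union rte_distributor_buffer {
	volatile int64_t bufptr64;     /* word shared by distributor and worker */
	char pad[CACHE_LINE_SIZE*3];   /* the diff above shrinks this to one line */
} __rte_cache_aligned;

Whatever the pad size, the distributor and each worker still share that one
cache line, so when they sit on different sockets every poll turns into
cross-socket coherence traffic, which would be consistent with the NUMA
finding above.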