From mboxrd@z Thu Jan 1 00:00:00 1970
From: jigsaw
To: Bruce Richardson
Cc: "dev@dpdk.org"
Date: Thu, 13 Nov 2014 17:17:41 +0200
Subject: Re: [dpdk-dev] LLC miss in librte_distributor

Hi,

Well, I have given up on the idea of optimizing away the QPI-caused LLC
misses. The queue-based messaging performs even worse than polling the same
buf from both cores. That is the nature of the busy-polling model. I guess
we have to accept it as a fact, unless the programming model can be changed
to a biased-locking model, which favours one lock-owner core. But
unfortunately the biased-locking model doesn't seem applicable to the
distributor.

thx &
rgds,
-ql

On Wed, Nov 12, 2014 at 7:11 PM, jigsaw wrote:

> Hi Bruce,
>
> Thanks for your reply.
>
> I agree that logically dividing the distributor functionality is the best
> solution.
>
> Meanwhile I tried some tricks and the results look good: for the same
> number of pkts (1M), the LLC stores and loads decrease by 90%, and the
> miss rates for both drop to 25%.
> The L1 miss rate increases a bit, though.
> The combined result is that the time spent decreases by 50%.
> The main change I made is to use a FIFO to transfer the pkts from the
> distributor to the worker, while the current buf is used only as a
> signalling channel. This change has a very obvious effect on reducing
> LLC accesses.
>
> However, the test is based on the simple test program rather than on a
> DPDK application, so I will try the same tricks on DPDK and see whether
> they have the same effect.
> Besides, I need more time to read a few more papers to get it right.
>
> I will try to propose a patch if I manage to get a positive result. It
> will take several days because I'm not fully dedicated to this issue.
>
> I will come back with more details.
>
> BTW, I have another user story: a worker can ask the distributor to
> re-schedule a pkt.
> It arises in the following situation: after processing a pkt with tag
> value 1, the worker changes its tag to 2, so the distributor has to be
> asked to deliver the pkt with the new tag value to the proper worker.
> I already have the patch ready, but I will hold it back until the
> previous patch is committed.
> I also need your comments on this user story.
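
(Editorial note: the patch for this "re-schedule with a new tag" story is not
included in the thread. The sketch below is only a hypothetical illustration
of the flow described above, built from the public rte_ring / rte_distributor
API of that era; the ring, the helper names and the burst size are invented,
and error handling is minimal.)

#include <rte_ring.h>
#include <rte_mbuf.h>
#include <rte_distributor.h>

/* Ring carrying "please deliver this mbuf again under its new tag" requests
 * from the workers back to the core that runs rte_distributor_process().
 * Created elsewhere with rte_ring_create(); name and size are illustrative. */
extern struct rte_ring *resched_ring;

/* Worker side: finish the tag-1 work, change the tag, hand the mbuf back. */
static void
worker_retag_and_resched(struct rte_mbuf *m, uint32_t new_tag)
{
        /* The field the distributor matches workers on; in this era of DPDK
         * it is hash.usr (older releases: pkt.hash.usr). */
        m->hash.usr = new_tag;
        if (rte_ring_enqueue(resched_ring, m) < 0)
                rte_pktmbuf_free(m); /* ring full: drop; a real patch would do better */
}

/* Distributor-core side: drain the re-schedule requests and push them
 * through rte_distributor_process() again, so each one reaches the worker
 * that owns its new tag. */
static void
distributor_handle_resched(struct rte_distributor *d)
{
        struct rte_mbuf *mbufs[32];
        unsigned n = rte_ring_dequeue_burst(resched_ring, (void **)mbufs, 32);

        if (n > 0)
                rte_distributor_process(d, mbufs, n);
}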
>
> thx &
> rgds,
> -ql
>
> On Wed, Nov 12, 2014 at 6:07 PM, Bruce Richardson <
> bruce.richardson@intel.com> wrote:
>
>> On Wed, Nov 12, 2014 at 10:37:33AM +0200, jigsaw wrote:
>> > Hi,
>> >
>> > OK, it is now very clear that it is due to memory transactions between
>> > different nodes.
>> >
>> > The test program is here:
>> > https://gist.github.com/jigsawecho/6a2e78d65f0fe67adf1b
>> >
>> > The test machine topology is:
>> >
>> > NUMA node0 CPU(s): 0-7,16-23
>> > NUMA node1 CPU(s): 8-15,24-31
>> >
>> > Change the 3rd param from 0 to 1 at line 135, and the LLC load miss
>> > rate jumps from 0.09% to 33.45%.
>> > The LLC store miss rate jumps from 0.027% to 50.695%.
>> >
>> > Clearly the root cause is transactions crossing the node boundary.
>> >
>> > But then how to resolve this problem is another topic...
>> >
>> > thx &
>> > rgds,
>> > -ql
>> >
>>
>> Having traffic cross QPI is always a problem, and there could be a number
>> of ways to solve it. Probably the best solution is to have multiple NICs,
>> with some directly connected to each socket, and the packets from each
>> NIC processed locally on the socket that NIC is connected to.
>>
>> If that is not possible, then other solutions need to be looked at. E.g.
>> for an app wanting to use a distributor, I would suggest investigating
>> whether two distributors could be used - one on each socket. Then use a
>> ring to burst-transfer large groups of packets from one socket to another
>> and then use the distributor locally.
>> This would involve far less QPI traffic than using a distributor with
>> remote workers.
>>
>> Regards,
>> /Bruce
>>
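
(Editorial note: a minimal sketch of the per-socket arrangement suggested
above - one distributor per socket, plus a ring that carries bursts of
packets across the QPI link so that only the burst transfer itself is
remote. All names, the ring size and the pre-sorted local/remote split are
invented for illustration, and error handling is kept to a minimum.)

#include <rte_ring.h>
#include <rte_mbuf.h>
#include <rte_distributor.h>

#define XSOCK_BURST 64

/* One distributor per socket; workers only ever talk to the distributor
 * running on their own socket. */
static struct rte_distributor *dist[2];
static struct rte_ring *xsock_ring; /* socket 0 -> socket 1 hand-over */

static int
setup(unsigned workers_per_socket)
{
        dist[0] = rte_distributor_create("dist_s0", 0, workers_per_socket);
        dist[1] = rte_distributor_create("dist_s1", 1, workers_per_socket);
        /* Allocate the ring on the receiving socket so the polling
         * (dequeue) side stays local and only the enqueued bursts cross QPI. */
        xsock_ring = rte_ring_create("xsock", 1024, 1,
                        RING_F_SP_ENQ | RING_F_SC_DEQ);
        return (dist[0] && dist[1] && xsock_ring) ? 0 : -1;
}

/* Socket-0 distributor core: feed local packets to the local distributor
 * and burst the rest across the socket boundary in one go.
 * (How packets are split into local/remote is up to the application.) */
static void
socket0_distribute(struct rte_mbuf **local, unsigned n_local,
                   struct rte_mbuf **remote, unsigned n_remote)
{
        rte_distributor_process(dist[0], local, n_local);
        if (n_remote > 0) {
                unsigned sent = rte_ring_enqueue_burst(xsock_ring,
                                (void * const *)remote, n_remote);
                /* A real application must deal with the unsent remainder;
                 * here it is simply dropped. */
                while (sent < n_remote)
                        rte_pktmbuf_free(remote[sent++]);
        }
}

/* Socket-1 distributor core: pull a whole burst over QPI at once, then
 * distribute it to the workers that are local to socket 1. */
static void
socket1_distribute(void)
{
        struct rte_mbuf *burst[XSOCK_BURST];
        unsigned n = rte_ring_dequeue_burst(xsock_ring, (void **)burst,
                        XSOCK_BURST);

        if (n > 0)
                rte_distributor_process(dist[1], burst, n);
}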
>> > On Tue, Nov 11, 2014 at 5:37 PM, jigsaw wrote:
>> >
>> > > Hi Bruce,
>> > >
>> > > I noticed that librte_distributor has quite a severe LLC miss problem
>> > > when running on 16 cores, while on 8 cores there is no such problem.
>> > > The test runs on an Intel(R) Xeon(R) CPU E5-2670, a Sandy Bridge with
>> > > 32 logical cores on 2 sockets.
>> > >
>> > > The test case is the distributor_perf_autotest, i.e.
>> > > in app/test/test_distributor_perf.c.
>> > > The test result is collected by the command:
>> > >
>> > > perf stat -e LLC-load-misses,LLC-loads,LLC-store-misses,LLC-stores
>> > > ./test -cff -n2 --no-huge
>> > >
>> > > Note that the test results show that the LLC miss rate remains the
>> > > same with or without hugepages, so I will just show the --no-huge
>> > > config.
>> > >
>> > > With 8 cores, the LLC miss rate is OK:
>> > >
>> > > LLC-load-misses   26750
>> > > LLC-loads         93979233
>> > > LLC-store-misses  432263
>> > > LLC-stores        69954746
>> > >
>> > > That is a 0.028% load miss rate and a 0.62% store miss rate.
>> > >
>> > > With 16 cores, the LLC miss rate is very high:
>> > >
>> > > LLC-load-misses   70263520
>> > > LLC-loads         143807657
>> > > LLC-store-misses  23115990
>> > > LLC-stores        63692854
>> > >
>> > > That is a 48.9% load miss rate and a 36.3% store miss rate.
>> > >
>> > > Most of the load misses happen at the first line of
>> > > rte_distributor_poll_pkt.
>> > > Where most of the store misses happen I don't know, because perf
>> > > record on LLC-store-misses brings down my machine.
>> > >
>> > > It's not obvious to me how this can happen: 8 cores are fine, but
>> > > 16 cores are very bad.
>> > > My guess is that 16 cores bring in more QPI transactions between the
>> > > sockets, or perhaps a different LLC access pattern?
>> > >
>> > > So I tried to reduce the padding inside union rte_distributor_buffer
>> > > from 3 cachelines to 1 cacheline:
>> > >
>> > > - char pad[CACHE_LINE_SIZE*3];
>> > > + char pad[CACHE_LINE_SIZE];
>> > >
>> > > And it does have an obvious effect:
>> > >
>> > > LLC-load-misses   53159968
>> > > LLC-loads         167756282
>> > > LLC-store-misses  29012799
>> > > LLC-stores        63352541
>> > >
>> > > Now it is a 31.69% load miss rate and a 45.79% store miss rate.
>> > >
>> > > It lowers the load miss rate but raises the store miss rate. Both
>> > > numbers are still very high, sadly.
>> > > On the bright side, it decreases the time per burst and the time per
>> > > packet.
>> > >
>> > > The original version has:
>> > > === Performance test of distributor ===
>> > > Time per burst: 8013
>> > > Time per packet: 250
>> > >
>> > > And the patched version has:
>> > > === Performance test of distributor ===
>> > > Time per burst: 6834
>> > > Time per packet: 213
>> > >
>> > > I tried a couple of other tricks, such as adding more idle loops
>> > > in rte_distributor_get_pkt and making the rte_distributor_buffer
>> > > thread-local to each worker core, but none of these tricks had any
>> > > noticeable effect. These failures make me tend to believe that the
>> > > high LLC miss rate is related to QPI or NUMA, but my machine is not
>> > > able to perf on uncore QPI events, so this cannot be proved.
>> > >
>> > > In the end I cannot draw any conclusion or pinpoint the root cause,
>> > > but I suggest further study of the performance bottleneck so as to
>> > > find a good solution.
>> > >
>> > > thx &
>> > > rgds,
>> > > -qinglai
>> > >
>> >
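
(Editorial postscript: the "FIFO plus signalling channel" change mentioned in
the two most recent messages at the top of this thread is not spelled out
anywhere above. The sketch below is one hypothetical shape such a per-worker
channel could take: the mbuf pointers travel through a ring in bursts, while
the single cache line shared by the two cores is reduced to a doorbell word.
All structure and function names are invented, return values are ignored for
brevity, and memory-ordering details are glossed over.)

#include <rte_ring.h>
#include <rte_mbuf.h>
#include <rte_memory.h>   /* __rte_cache_aligned */
#include <rte_cycles.h>   /* rte_pause(); rte_pause.h in newer DPDK */

/* Hypothetical per-worker channel: payload pointers go through a FIFO,
 * while the only cache line touched by both cores is a small doorbell. */
struct worker_channel {
        struct rte_ring *fifo;                          /* payload         */
        volatile uint64_t doorbell __rte_cache_aligned; /* signalling only */
};

enum { DB_IDLE = 0, DB_DATA = 1 };

/* Distributor side: push a whole burst, then ring the doorbell once, so the
 * shared cache line bounces once per burst instead of once per packet. */
static inline void
channel_send(struct worker_channel *ch, struct rte_mbuf **pkts, unsigned n)
{
        rte_ring_sp_enqueue_burst(ch->fifo, (void * const *)pkts, n);
        ch->doorbell = DB_DATA;
}

/* Worker side: spin on the doorbell (mostly hitting the local cache copy),
 * clear it, then drain the FIFO. */
static inline unsigned
channel_recv(struct worker_channel *ch, struct rte_mbuf **pkts, unsigned max)
{
        while (ch->doorbell != DB_DATA)
                rte_pause();
        ch->doorbell = DB_IDLE;  /* cleared before draining, so no lost wakeup */
        return rte_ring_sc_dequeue_burst(ch->fifo, (void **)pkts, max);
}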