From: Morten Brørup
To: Bruce Richardson
Subject: RE: cache thrashing question
Date: Fri, 25 Aug 2023 11:06:01 +0200
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D87B3A@smartserver.smartshare.dk>
References: <98CBD80474FA8B44BF855DF32C47DC35D87B39@smartserver.smartshare.dk>
List-Id: DPDK patches and discussions <dev@dpdk.org>

+CC mempool maintainers

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Friday, 25 August 2023 10.23
>
> On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
> > Bruce,
> >
> > With this patch [1], it is noted that the ring producer and consumer
> > data should not be on adjacent cache lines, for performance reasons.
> >
> > [1]:
> > https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1ffd4b66e75485cc8b63b9aedfbdfe8b0
> >
> > (It's obvious that they cannot share the same cache line, because
> > they are accessed by two different threads.)
> >
> > Intuitively, I would think that having them on different cache lines
> > would suffice. Why does having an empty cache line between them make
> > a difference?
> >
> > And does it need to be an empty cache line? Or does it suffice having
> > the second structure start two cache lines after the start of the
> > first structure (e.g. if the size of the first structure is two cache
> > lines)?
> >
> > I'm asking because the same principle might apply to other code too.
> >
> Hi Morten,
>
> This was something we discovered when working on the distributor
> library. If we have cachelines per core where there is heavy access,
> having some cachelines as a gap between the content cachelines can
> help performance. We believe this helps due to avoiding issues with
> the HW prefetchers (e.g. the adjacent cacheline prefetcher) bringing
> in the second cacheline speculatively when an operation is done on the
> first line.

I guessed that it had something to do with speculative prefetching, but wasn't sure. Good to get confirmation, and that it has a measurable effect somewhere. Very interesting!

NB: More comments in the ring lib about stuff like this would be nice.
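To make the technique concrete for the archives: here is a minimal sketch of what I understand Bruce to mean. The struct and field names are mine for illustration only; I am assuming RTE_CACHE_LINE_SIZE, RTE_MAX_LCORE and __rte_cache_aligned as provided by DPDK's rte_common.h and rte_config.h.

#include <stdint.h>
#include <rte_common.h> /* RTE_CACHE_LINE_SIZE, __rte_cache_aligned */

/*
 * Per-lcore hot data: one cache line of payload followed by one
 * deliberately unused cache line. When the HW adjacent cacheline
 * prefetcher on one core speculatively fetches the line following the
 * payload, it gets the harmless guard line instead of the next lcore's
 * payload line, so the cores do not steal each other's hot lines.
 */
struct percore_hot_data {
	uint64_t counters[RTE_CACHE_LINE_SIZE / sizeof(uint64_t)];
	char guard[RTE_CACHE_LINE_SIZE]; /* never read or written */
} __rte_cache_aligned;

/* Entries now start two cache lines apart instead of one. */
static struct percore_hot_data hot_data[RTE_MAX_LCORE];

The cost is one extra cache line of memory per entry, which seems a cheap trade for avoiding ownership ping-pong between cores.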
So, for the mempool lib, what do you think about applying the same technique to the rte_mempool_debug_stats structure (which is an array indexed per lcore)... Two adjacent lcores heavily accessing their local mempool caches seems likely to me. But how heavy does the access need to be for this technique to be relevant?

For the rte_mempool_cache structure (also an array indexed per lcore), the last entries of the "objs" array at the end of the structure are unlikely to be used, so they already serve as a gap, and an additional gap seems irrelevant here.
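For the stats, I picture something along these lines. This is a sketch only: the fields shown are illustrative, not the actual rte_mempool_debug_stats layout.

/*
 * Sketch only, not the real rte_mempool_debug_stats definition.
 * The hot counters fit in the entry's first cache line; the guard
 * array plus the struct's alignment padding leave the entry's second
 * cache line untouched, so stats[n] and stats[n + 1] never occupy
 * adjacent cache lines.
 */
struct mempool_debug_stats_sketch {
	uint64_t put_bulk;
	uint64_t put_objs;
	uint64_t get_success_bulk;
	uint64_t get_success_objs;
	/* ... remaining counters ... */
	char guard[RTE_CACHE_LINE_SIZE]; /* intentionally unused */
} __rte_cache_aligned;

struct mempool_debug_stats_sketch stats[RTE_MAX_LCORE];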