Date: Thu, 27 May 2021 17:49:19 +0200
From: Morten Brørup
To: Bruce Richardson, Manish Sharma
Subject: Re: [dpdk-dev] rte_memcpy - fence and stream

> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> Sent: Tuesday, 25 May 2021 11.20
>
> On Mon, May 24, 2021 at 11:43:24PM +0530, Manish Sharma wrote:
> > I am looking at the source for rte_memcpy (this is a discussion only
> > for x86-64)
> >
> > For one of the cases, when aligned correctly, it uses
> >
> > /**
> >  * Copy 64 bytes from one location to another,
> >  * locations should not overlap.
> >  */
> > static __rte_always_inline void
> > rte_mov64(uint8_t *dst, const uint8_t *src)
> > {
> > 	__m512i zmm0;
> >
> > 	zmm0 = _mm512_loadu_si512((const void *)src);
> > 	_mm512_storeu_si512((void *)dst, zmm0);
> > }
> >
> > I had some questions about this:
> >
> [snip]
> > 3. Why isn't the code using stream variants,
> >    _mm512_stream_load_si512 and friends?
> >    It would not pollute the cache, so should be better - unless
> >    the required fence instructions cause a drop in performance?
> >
> Whether the stream variants perform better really depends on what you
> are trying to measure. However, in the vast majority of cases a core is
> making a copy of data to work on it, in which case having the data not
> in cache is going to cause massive stalls on the next access to that
> data as it's fetched from DRAM. Therefore, the best option is to use
> regular instructions meaning that the data is in local cache afterwards,
> giving far better performance when the data is actually used.
>

Good response, Bruce. And you are probably right that most use cases look like the one you describe.

I'm certainly no expert on deep x86-64 optimization, but please let me think out loud here...

I can come up with a different scenario: One core is analyzing packet headers and decides to copy some of the packets in their entirety (e.g. using rte_pktmbuf_copy) for another core to process later, so it enqueues the copies to a ring, which the other core dequeues from.

The first core doesn't care about the packet contents, and the second core will read the copy of the packet much later, because it needs to process the packets in front of the queue first.
Wouldn't this scenario benefit from an rte_memcpy variant that doesn't pollute the cache?

I know that rte_pktmbuf_clone might be better to use in the described scenario, but rte_pktmbuf_copy must be there for a reason - and I guess that some uses of rte_pktmbuf_copy could benefit from a non-cache polluting variant of rte_memcpy.

-Morten
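
(For illustration only, a minimal sketch of the kind of non-temporal copy being discussed. This is not existing DPDK code: the names copy64_nt and copy_nt_publish_fence are made up, it assumes AVX-512F and a 64-byte aligned destination, and the fence is assumed to be issued once per copied packet, just before the copy is published to the other core.)

#include <immintrin.h>
#include <stdint.h>

/*
 * Illustrative sketch only - not an existing DPDK API. Copies 64 bytes
 * using a non-temporal (streaming) store, so the destination line is
 * not pulled into the copying core's cache. Assumes AVX-512F and a
 * 64-byte aligned destination.
 */
static inline void
copy64_nt(uint8_t *dst, const uint8_t *src)
{
	/*
	 * Plain load: the source is ordinary write-back memory, where the
	 * non-temporal load hint (_mm512_stream_load_si512 / vmovntdqa)
	 * has little effect anyway - it mainly helps on WC memory.
	 */
	__m512i zmm0 = _mm512_loadu_si512((const void *)src);

	/* Non-temporal store (vmovntdq): bypasses the cache hierarchy. */
	_mm512_stream_si512((__m512i *)dst, zmm0);
}

/*
 * Non-temporal stores are weakly ordered, so after the last streaming
 * store of a packet, issue one sfence before publishing the copy to the
 * consuming core (e.g. before the ring enqueue).
 */
static inline void
copy_nt_publish_fence(void)
{
	_mm_sfence();
}

With this kind of scheme the fence cost would be paid once per packet (or per burst), not once per 64-byte block, which is presumably what the question about the fence overhead comes down to.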