Date: Thu, 27 May 2021 17:49:19 +0200
From: Morten Brørup
To: Bruce Richardson, Manish Sharma
Subject: Re: [dpdk-dev] rte_memcpy - fence and stream

> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> Sent: Tuesday, 25 May 2021 11.20
>
> On Mon, May 24, 2021 at 11:43:24PM +0530, Manish Sharma wrote:
> > I am looking at the source for rte_memcpy (this is a discussion only
> > for x86-64)
> >
> > For one of the cases, when aligned correctly, it uses
> >
> > /**
> >  * Copy 64 bytes from one location to another,
> >  * locations should not overlap.
> >  */
> > static __rte_always_inline void
> > rte_mov64(uint8_t *dst, const uint8_t *src)
> > {
> > 	__m512i zmm0;
> >
> > 	zmm0 = _mm512_loadu_si512((const void *)src);
> > 	_mm512_storeu_si512((void *)dst, zmm0);
> > }
> >
> > I had some questions about this:
> >
> [snip]
> > 3. Why isn't the code using stream variants,
> >    _mm512_stream_load_si512 and friends?
> >    It would not pollute the cache, so should be better - unless
> >    the required fence instructions cause a drop in performance?
> >
> Whether the stream variants perform better really depends on what you
> are trying to measure. However, in the vast majority of cases a core is
> making a copy of data to work on it, in which case having the data not
> in cache is going to cause massive stalls on the next access to that
> data as it's fetched from DRAM. Therefore, the best option is to use
> regular instructions meaning that the data is in local cache afterwards,
> giving far better performance when the data is actually used.
>

Good response, Bruce. And you are probably right that most use cases look like the one you describe.

I'm certainly no expert on deep x86-64 optimization, but please let me think out loud here...

I can come up with a different scenario: One core is analyzing packet headers and decides to copy some of the packets in their entirety (e.g. using rte_pktmbuf_copy) for another core to process later, so it enqueues the copies to a ring, which the other core dequeues from.

The first core doesn't care about the packet contents, and the second core will read the copy of the packet much later, because it needs to process the packets in front of the queue first.
Wouldn't this scenario benefit from an rte_memcpy variant that doesn't pollute the cache?

I know that rte_pktmbuf_clone might be better to use in the described scenario, but rte_pktmbuf_copy must be there for a reason - and I guess that some uses of rte_pktmbuf_copy could benefit from a non-cache polluting variant of rte_memcpy.

-Morten
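
(For illustration only, a minimal sketch of the kind of non-temporal copy being discussed. This is not existing DPDK code: the names copy64_nt and copy_nt_publish_fence are made up, it assumes AVX-512F and a 64-byte aligned destination, and the fence is assumed to be issued once per copied packet, just before the copy is published to the other core.)

#include <immintrin.h>
#include <stdint.h>

/*
 * Illustrative sketch only - not an existing DPDK API. Copies 64 bytes
 * using a non-temporal (streaming) store, so the destination line is
 * not pulled into the copying core's cache. Assumes AVX-512F and a
 * 64-byte aligned destination.
 */
static inline void
copy64_nt(uint8_t *dst, const uint8_t *src)
{
	/*
	 * Plain load: the source is ordinary write-back memory, where the
	 * non-temporal load hint (_mm512_stream_load_si512 / vmovntdqa)
	 * has little effect anyway - it mainly helps on WC memory.
	 */
	__m512i zmm0 = _mm512_loadu_si512((const void *)src);

	/* Non-temporal store (vmovntdq): bypasses the cache hierarchy. */
	_mm512_stream_si512((__m512i *)dst, zmm0);
}

/*
 * Non-temporal stores are weakly ordered, so after the last streaming
 * store of a packet, issue one sfence before publishing the copy to the
 * consuming core (e.g. before the ring enqueue).
 */
static inline void
copy_nt_publish_fence(void)
{
	_mm_sfence();
}

With this kind of scheme the fence cost would be paid once per packet (or per burst), not once per 64-byte block, which is presumably what the question about the fence overhead comes down to.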