From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 27 May 2021 17:25:19 +0100
From: Bruce Richardson
To: Morten Brørup
Cc: Manish Sharma, dev@dpdk.org
Subject: Re: [dpdk-dev] rte_memcpy - fence and stream
List-Id: DPDK patches and discussions

On Thu, May 27, 2021 at 05:49:19PM +0200,
Morten Brørup wrote:
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> > Sent: Tuesday, 25 May 2021 11.20
> >
> > On Mon, May 24, 2021 at 11:43:24PM +0530, Manish Sharma wrote:
> > > I am looking at the source for rte_memcpy (this is a discussion only
> > > for x86-64)
> > >
> > > For one of the cases, when aligned correctly, it uses
> > >
> > > /**
> > >  * Copy 64 bytes from one location to another,
> > >  * locations should not overlap.
> > >  */
> > > static __rte_always_inline void
> > > rte_mov64(uint8_t *dst, const uint8_t *src)
> > > {
> > > 	__m512i zmm0;
> > >
> > > 	zmm0 = _mm512_loadu_si512((const void *)src);
> > > 	_mm512_storeu_si512((void *)dst, zmm0);
> > > }
> > >
> > > I had some questions about this:
> > >
> > [snip]
> >
> > > 3. Why isn't the code using stream variants,
> > > _mm512_stream_load_si512 and friends?
> > > It would not pollute the cache, so should be better - unless
> > > the required fence instructions cause a drop in performance?
> > >
> > Whether the stream variants perform better really depends on what you
> > are trying to measure. However, in the vast majority of cases a core
> > is making a copy of data to work on it, in which case having the data
> > not in cache is going to cause massive stalls on the next access to
> > that data as it's fetched from DRAM. Therefore, the best option is to
> > use regular instructions, meaning that the data is in local cache
> > afterwards, giving far better performance when the data is actually
> > used.
> >
> Good response, Bruce. And you are probably right about most use cases
> looking like you describe.
>
> I'm certainly no expert on deep x86-64 optimization, but please let me
> think out loud here...
>
> I can come up with a different scenario: One core is analyzing packet
> headers, and determines to copy some of the packets in their entirety
> (e.g. using rte_pktmbuf_copy) for another core to process later, so it
> enqueues the copies to a ring, which the other core dequeues from.
>
> The first core doesn't care about the packet contents, and the second
> core will read the copy of the packet much later, because it needs to
> process the packets in front of the queue first.
>
> Wouldn't this scenario benefit from a rte_memcpy variant that doesn't
> pollute the cache?
>
> I know that rte_pktmbuf_clone might be better to use in the described
> scenario, but rte_pktmbuf_copy must be there for a reason - and I guess
> that some uses of rte_pktmbuf_copy could benefit from a non-cache
> polluting variant of rte_memcpy.
>
That is indeed a possible scenario, but in that case we would probably
want to differentiate between different levels of cache. While we would
not want the copy to be polluting the local L1 or L2 cache of the core
doing the copy (unless the other core was a hyperthread), we probably
would want any copy to be present in any shared caches, rather than all
the way out in DRAM.

For Intel platforms and a scenario such as you describe, I would
actually recommend using the "ioat" copy accelerator driver if cache
pollution is a concern. With the copies being done in HW, the local
cache of a core would not be polluted, but the copied data could still
end up in the LLC due to DDIO.

In terms of memcpy functions, given the number of possible scenarios,
and in the absence of compelling data showing a meaningful benefit for a
common scenario, I'd be wary about trying to provide specialized
variants, since we could end up with a lot of them to maintain and tune.

/Bruce