From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id 6BD1CA0546; Tue, 25 May 2021 11:20:34 +0200 (CEST) Received: from [217.70.189.124] (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id DF31940150; Tue, 25 May 2021 11:20:33 +0200 (CEST) Received: from mga04.intel.com (mga04.intel.com [192.55.52.120]) by mails.dpdk.org (Postfix) with ESMTP id 92F794003F for ; Tue, 25 May 2021 11:20:31 +0200 (CEST) IronPort-SDR: ST3nvE4WhnYFILqtFyuC1wdIqru9JyBOLOxncUTDLAfnEZ06SysmN5NZmLDulwnIlNFl9m0I6I MdEfg2a1gPWQ== X-IronPort-AV: E=McAfee;i="6200,9189,9994"; a="200249717" X-IronPort-AV: E=Sophos;i="5.82,328,1613462400"; d="scan'208";a="200249717" Received: from orsmga006.jf.intel.com ([10.7.209.51]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 25 May 2021 02:20:30 -0700 IronPort-SDR: N0GGVCnkio+GliIUi0wo/KMo8b9SUZYAMeE2HnADuBuI3IJz8x1q6sd3q+VBggnkGavCubFVDh Rd3aas8WaVxQ== X-IronPort-AV: E=Sophos;i="5.82,328,1613462400"; d="scan'208";a="397287129" Received: from bricha3-mobl.ger.corp.intel.com ([10.252.6.111]) by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-SHA; 25 May 2021 02:20:28 -0700 Date: Tue, 25 May 2021 10:20:24 +0100 From: Bruce Richardson To: Manish Sharma Cc: dev@dpdk.org Message-ID: References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Subject: Re: [dpdk-dev] rte_memcpy - fence and stream X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" On Mon, May 24, 2021 at 11:43:24PM +0530, Manish Sharma wrote: > I am looking at the source for rte_memcpy (this is a discussion only for > x86-64) > > For one of the cases, when aligned correctly, it uses > > /** > * Copy 64 bytes from one location to another, > * locations should not overlap. > */ > static __rte_always_inline void > rte_mov64(uint8_t *dst, const uint8_t *src) > { > __m512i zmm0; > > zmm0 = _mm512_loadu_si512((const void *)src); > _mm512_storeu_si512((void *)dst, zmm0); > } > > I had some questions about this: > > 1. What I dont see is any use of x86 fence(rmb,wmb) instructions. Is that > not required in this case and if not, why isnt it needed? > > 2. Are the mm512_loadu_si512 and _mm512_storeu_si512 non temporal? > They are not non-temporal, so don't need fences. > 3. Why isn't the code using stream variants, _mm512_stream_load_si512 and > friends? It would not pollute the cache, so should be better - unless the > required fence instructions cause a drop in performance? > Whether the stream variants perform better really depends on what you are trying to measure. However, in the vast majority of cases a core is making a copy of data to work on it, in which case having the data not in cache is going to cause massive stalls on the next access to that data as it's fetched from DRAM. Therefore, the best option is to use regular instructions meaning that the data is in local cache afterwards, giving far better performance when the data is actually used. > 4. Do the _mm512_stream_load_si512 need fence instructions? Based on my > reading of the spec, the answer is yes - but wanted to confirm. > I believe a fence would be necessary for safety, yes.