From: Manish Sharma
Date: Thu, 27 May 2021 22:39:59 +0530
To: Bruce Richardson
Cc: Morten Brørup, dev@dpdk.org
Subject: Re: [dpdk-dev] rte_memcpy - fence and stream

For the case I have, hardly 2% of the data buffers being copied ever get
looked at - mostly it's for DMA. Having a version of the DPDK memcpy that
does non-temporal copies would definitely be good.

If, in my case, I have a lot of CPUs doing the copies in parallel, would
the I/OAT copy accelerator driver still help?
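
To make the question concrete, the kind of non-temporal variant I have in
mind would look roughly like the sketch below. This is only an illustration,
not an existing DPDK function - it assumes AVX-512 and that both pointers
are 64-byte aligned, as the aligned load and the non-temporal store require:

#include <immintrin.h>
#include <stdint.h>

/*
 * Hypothetical non-temporal counterpart of rte_mov64 (not part of DPDK).
 * Regular load of the source, streaming store of the destination: on
 * ordinary write-back memory the stored line bypasses the cache
 * hierarchy instead of displacing useful data.
 */
static inline void
mov64_nt(uint8_t *dst, const uint8_t *src)
{
        __m512i zmm0;

        zmm0 = _mm512_load_si512((const void *)src);
        _mm512_stream_si512((__m512i *)dst, zmm0);
}

/*
 * Streaming stores are weakly ordered, so after the last chunk of a
 * buffer has been copied, a store fence is needed before anything that
 * signals the data is ready (a ring enqueue, a descriptor write, ...).
 */
static inline void
mov_nt_barrier(void)
{
        _mm_sfence();
}

Note that the stream *load* (_mm512_stream_load_si512) only behaves
non-temporally on write-combining memory; on normal write-back memory it
acts like a regular load, so the store side is where the cache pollution
is actually avoided.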
On Thu, May 27, 2021 at 9:55 PM Bruce Richardson wrote:
> On Thu, May 27, 2021 at 05:49:19PM +0200, Morten Brørup wrote:
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> > > Sent: Tuesday, 25 May 2021 11.20
> > >
> > > On Mon, May 24, 2021 at 11:43:24PM +0530, Manish Sharma wrote:
> > > > I am looking at the source for rte_memcpy (this is a discussion only
> > > > for x86-64).
> > > >
> > > > For one of the cases, when aligned correctly, it uses
> > > >
> > > > /**
> > > >  * Copy 64 bytes from one location to another,
> > > >  * locations should not overlap.
> > > >  */
> > > > static __rte_always_inline void
> > > > rte_mov64(uint8_t *dst, const uint8_t *src)
> > > > {
> > > >         __m512i zmm0;
> > > >
> > > >         zmm0 = _mm512_loadu_si512((const void *)src);
> > > >         _mm512_storeu_si512((void *)dst, zmm0);
> > > > }
> > > >
> > > > I had some questions about this:
> > > >
> > > > [snip]
> > > >
> > > > 3. Why isn't the code using the stream variants,
> > > >    _mm512_stream_load_si512 and friends?
> > > >    It would not pollute the cache, so should be better - unless
> > > >    the required fence instructions cause a drop in performance?
> > > >
> > > Whether the stream variants perform better really depends on what you
> > > are trying to measure. However, in the vast majority of cases a core
> > > is making a copy of data to work on it, in which case having the data
> > > not in cache is going to cause massive stalls on the next access to
> > > that data as it's fetched from DRAM. Therefore, the best option is to
> > > use regular instructions, meaning that the data is in local cache
> > > afterwards, giving far better performance when the data is actually
> > > used.
> > >
> > Good response, Bruce. And you are probably right about most use cases
> > looking like you describe.
> >
> > I'm certainly no expert on deep x86-64 optimization, but please let me
> > think out loud here...
> >
> > I can come up with a different scenario: One core is analyzing packet
> > headers, and decides to copy some of the packets in their entirety
> > (e.g. using rte_pktmbuf_copy) for another core to process later, so it
> > enqueues the copies to a ring, which the other core dequeues from.
> >
> > The first core doesn't care about the packet contents, and the second
> > core will read the copy of the packet much later, because it needs to
> > process the packets in front of the queue first.
> >
> > Wouldn't this scenario benefit from an rte_memcpy variant that doesn't
> > pollute the cache?
> >
> > I know that rte_pktmbuf_clone might be better to use in the described
> > scenario, but rte_pktmbuf_copy must be there for a reason - and I guess
> > that some uses of rte_pktmbuf_copy could benefit from a non-cache
> > polluting variant of rte_memcpy.
> >
> That is indeed a possible scenario, but in that case we would probably
> want to differentiate between different levels of cache. While we would
> not want the copy to pollute the local L1 or L2 cache of the core doing
> the copy (unless the other core is a hyperthread), we probably would want
> any copy to be present in the shared caches, rather than all the way out
> in DRAM. For Intel platforms and a scenario such as you describe, I would
> actually recommend using the "ioat" copy accelerator driver if cache
> pollution is a concern. With the copies being done in HW, the local cache
> of a core would not be polluted, but the copied data could still end up
> in the LLC due to DDIO.
>
> In terms of memcpy functions, given the number of possible scenarios, in
> the absence of compelling data showing a meaningful benefit for a common
> scenario, I'd be wary about trying to provide specialized variants, since
> we could end up with a lot of them to maintain and tune.
>
> /Bruce
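
(For reference, the scenario Morten describes would map onto the existing
APIs roughly as in the sketch below - again just an illustration, with
made-up pool and ring names. The deep copy inside rte_pktmbuf_copy is what
currently goes through the regular rte_memcpy path, and is where a
non-temporal variant or the ioat accelerator would slot in.)

#include <stdint.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>
#include <rte_ring.h>

/*
 * Core A: classify a packet and, for the few packets that must be kept,
 * take a full copy and hand it to core B through a ring. Core A never
 * touches the copied payload again, so any cache lines the copy brings
 * into core A's cache are pure pollution.
 */
static int
hand_off_packet(const struct rte_mbuf *pkt, struct rte_mempool *copy_pool,
                struct rte_ring *to_core_b)
{
        struct rte_mbuf *dup;

        /* Deep copy of the whole packet (UINT32_MAX = copy to the end). */
        dup = rte_pktmbuf_copy(pkt, copy_pool, 0, UINT32_MAX);
        if (dup == NULL)
                return -1;

        /* Core B dequeues and processes the copy much later. */
        if (rte_ring_enqueue(to_core_b, dup) != 0) {
                rte_pktmbuf_free(dup);
                return -1;
        }

        return 0;
}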