From: Morten Brørup
To: "Manish Sharma"
Cc: "Bruce Richardson"
Subject: Re: [dpdk-dev] rte_memcpy - fence and stream
Date: Tue, 22 Jun 2021 23:55:55 +0200
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35C61880@smartserver.smartshare.dk>
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35C617E4@smartserver.smartshare.dk>
References: <98CBD80474FA8B44BF855DF32C47DC35C617E1@smartserver.smartshare.dk> <98CBD80474FA8B44BF855DF32C47DC35C617E4@smartserver.smartshare.dk>

> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Morten Brørup
> Sent: Thursday, 27 May 2021 20.15
>
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> > Sent: Thursday, 27 May 2021 19.22
> >
> > On Thu, May 27, 2021 at 10:39:59PM +0530, Manish Sharma wrote:
> > > For the case I have, hardly 2% of the data buffers which are being
> > > copied get looked at - mostly its for DMA.

Which data buffers are you not looking at, Manish? The original data buffers, or the copies, or both?

> > > Having a version of DPDK memcopy that does non temporal copies
> > > would definitely be good.
> > > If in my case, I have a lot of CPUs doing the copy in parallel, would
> > > I/OAT driver copy accelerator still help?
> > >
> > It will depend upon the size of the copies being done. For bigger packets
> > the accelerator can help free up CPU cycles for other things.
> >
> > However, if only 2% of the data which is being copied gets looked at, why
> > does it need to be copied? Can the original buffers not be used in that
> > case?
>
> I can only speak for myself here...
>
> Our firmware has a packet capture feature with a filter.
>
> If a packet matches the capture filter, a metadata header and the
> relevant part of the packet contents ("snap length" in tcpdump
> terminology) is appended to a large memory area (the "capture buffer")
> using rte_pktmbuf_read/rte_memcpy. This capture buffer is only read
> through the GUI or management API by the network administrator, i.e. it
> will only be read minutes or hours later, so there is no need to put
> any of it in any CPU cache.
>
> It does not make sense to clone and hold on to many thousands of mbufs
> when we only need some of their contents. So we copy the contents
> instead of increasing the mbuf refcount.
>
> We currently only use our packet capture feature for R&D purposes, so
> we have not optimized it yet. However, we will need to optimize it for
> production use at some point. So I find this discussion initiated by
> Manish very interesting.
>
> -Morten

Here's some code for inspiration. I haven't tested it yet. And it can be further optimized.

/**
 * Copy 16 bytes from one location to another, using non-temporal storage
 * at the destination.
 * The locations must not overlap.
 *
 * @param dst
 *   Pointer to the destination of the data.
 *   Must be aligned on a 16-byte boundary.
 * @param src
 *   Pointer to the source data.
 *   Does not need to be aligned on any particular boundary.
 */
static __rte_always_inline void
rte_mov16_aligned16_non_temporal(uint8_t *dst, const uint8_t *src)
{
	__m128i xmm0;

	/* Unaligned load from the source, non-temporal store to the destination. */
	xmm0 = _mm_loadu_si128((const __m128i *)src);
	_mm_stream_si128((__m128i *)dst, xmm0);
}

/**
 * Copy bytes from one location to another, using non-temporal storage
 * at the destination.
 * The locations must not overlap.
 *
 * @param dst
 *   Pointer to the destination of the data.
 *   Must be aligned on a 16-byte boundary.
 * @param src
 *   Pointer to the source data.
 *   Does not need to be aligned on any particular boundary.
 * @param n
 *   Number of bytes to copy.
 *   Must be divisible by 4.
 * @return
 *   Pointer to the destination data.
 */
static __rte_always_inline void *
rte_memcpy_aligned16_non_temporal(void *dst, const void *src, size_t n)
{
	void * const ret = dst;

	RTE_ASSERT(!((uintptr_t)dst & 0xF));
	RTE_ASSERT(!(n & 3));

	/* Copy the bulk in 16-byte blocks. */
	while (n >= 16) {
		rte_mov16_aligned16_non_temporal(dst, src);
		src = (const uint8_t *)src + 16;
		dst = (uint8_t *)dst + 16;
		n -= 16;
	}
	/* Copy the remaining 8 and/or 4 bytes. The source may be unaligned,
	 * so load it with memcpy() (needs <string.h>) instead of a typed
	 * pointer dereference. */
	if (n & 8) {
		int64_t a;

		memcpy(&a, src, sizeof(a));
		_mm_stream_si64((long long int *)dst, a);
		src = (const uint8_t *)src + 8;
		dst = (uint8_t *)dst + 8;
		n -= 8;
	}
	if (n & 4) {
		int32_t a;

		memcpy(&a, src, sizeof(a));
		_mm_stream_si32((int32_t *)dst, a);
		src = (const uint8_t *)src + 4;
		dst = (uint8_t *)dst + 4;
		n -= 4;
	}

	return ret;
}
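
On the "fence" part of the subject: non-temporal stores are weakly ordered on x86, so a caller would presumably want an _mm_sfence() after the copy and before telling another core or a device that the buffer is ready. Below is a rough standalone sketch of that idea, assuming x86-64 with SSE2 and no DPDK headers. The name copy_nt() is hypothetical and not a DPDK API, and for brevity the tail is copied 4 bytes at a time instead of using the 8-byte step.

```c
#include <assert.h>
#include <emmintrin.h>	/* SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_stream_si32, _mm_sfence */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical standalone variant of the copy routine (not a DPDK API).
 * dst must be 16-byte aligned; n must be divisible by 4.
 * src may have any alignment. The buffers must not overlap.
 */
static void *
copy_nt(void *dst, const void *src, size_t n)
{
	uint8_t *d = dst;
	const uint8_t *s = src;

	assert(((uintptr_t)d & 0xF) == 0);
	assert((n & 3) == 0);

	/* Bulk: unaligned 16-byte load, non-temporal 16-byte store. */
	for (; n >= 16; n -= 16, s += 16, d += 16)
		_mm_stream_si128((__m128i *)d,
				_mm_loadu_si128((const __m128i *)s));

	/* Tail: 4 bytes at a time; memcpy() handles an unaligned source. */
	for (; n >= 4; n -= 4, s += 4, d += 4) {
		int32_t a;

		memcpy(&a, s, sizeof(a));
		_mm_stream_si32((int *)d, a);
	}

	/* Order the non-temporal stores before any later store that
	 * publishes the buffer (e.g. a ready flag or a doorbell write). */
	_mm_sfence();

	return dst;
}
```

Whether the fence belongs inside the copy function or is left to the caller (one fence after a batch of copies) is a design choice; the batch variant is probably cheaper when many small captures are appended back to back.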