From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id 58828A00C4; Thu, 28 Jul 2022 00:02:29 +0200 (CEST) Received: from [217.70.189.124] (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id F0ADC4021F; Thu, 28 Jul 2022 00:02:28 +0200 (CEST) Received: from mail-ua1-f42.google.com (mail-ua1-f42.google.com [209.85.222.42]) by mails.dpdk.org (Postfix) with ESMTP id A940240141 for ; Thu, 28 Jul 2022 00:02:27 +0200 (CEST) Received: by mail-ua1-f42.google.com with SMTP id v10so75025uap.11 for ; Wed, 27 Jul 2022 15:02:27 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=semihalf.com; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=GMgooqcJK8/ueOomzwA6GsWD9sCjhq8FBeF7V7g7w14=; b=FLsDS1JUvJxPlxPmscmsf61QKUgN7cUI3N+6kSpzEA+fXe1k0SD40PIMWcYWzNHwWO 6VV5pCjAxZgGk3UEN62lSGextrcBz+hgQ3eV9x8A/WljRsgsanV9zDcl5zT8rMxCK3CB 4LTyx6QMMfFcfjPRjmTQl3FtKFw1OHpr4UgHv5VyZsk2uS1CY8upg4x+ZjTBkjpatqV1 RbAGL9TdWb2zruS6t/G8lRAKXj3z9QKSL30q19+Pclxi3J1ECHQHyb9erliTOCO+d6/w otzcXuNEtIwBDm8FQUccyiNDa0v3HkmOUMuPLRMe4RR2oWDFLthh98gCAv6BztfjyExH xgrg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=GMgooqcJK8/ueOomzwA6GsWD9sCjhq8FBeF7V7g7w14=; b=FJrh+P3I1qCNw6hxF+JQWz2u1L015gP9bbGIGB+PwwgWACHTWBCY+4H6lAhKcy4q0c hSOWYcbk3QZ53NtAJyofehE6pu9vPqLjY6Z9ficXzmrdQ9kcf7BXl+UzaAL9QfAGoB+S Gg+5DdM+FdQg4VsfztMuNtzkMUaWEDdbIC1z25Rwmm6BynIvldkvCuNb2Pw2YB/icHC3 9/HI/kRrn6oBmpPCiD6c4u6ixxiwUuLGg7RJ2jwXVB0wYFGNF5IzWVpXnTJiOsuFFGrz vBXN6ivVEDwYeiLNfeTbgetuScjZFfuFb9lhLjzh95LO5CMTKUzr6G8f6mei3ZKWVPf9 HCyg== X-Gm-Message-State: AJIora+gJf9gJBLXhjlBq4l1HwxWfQId7Nb7Do4XhYvWHYWi+0qnZNqr bTnsYezLzKOT00XKDkVMGwFQsNDqSbCTl4ixp1azSw== X-Google-Smtp-Source: 
AGRyM1sSOxnsBsJocmNEzlPoaodThI46A04j/UZHRUEqwGEaGDVurNqIRpBzBdvPxWY1xZJOEox3NKeU3WiNNCSIwrk= X-Received: by 2002:ab0:1427:0:b0:384:df94:4ed9 with SMTP id b36-20020ab01427000000b00384df944ed9mr1277512uae.6.1658959347009; Wed, 27 Jul 2022 15:02:27 -0700 (PDT) MIME-Version: 1.0 References: <98CBD80474FA8B44BF855DF32C47DC35D871D4@smartserver.smartshare.dk> <98CBD80474FA8B44BF855DF32C47DC35D871DB@smartserver.smartshare.dk> <262c214b-7870-a221-2621-6684dce42823@yandex.ru> <98CBD80474FA8B44BF855DF32C47DC35D871F7@smartserver.smartshare.dk> <98CBD80474FA8B44BF855DF32C47DC35D871FD@smartserver.smartshare.dk> In-Reply-To: From: =?UTF-8?Q?Stanis=C5=82aw_Kardach?= Date: Thu, 28 Jul 2022 00:02:16 +0200 Message-ID: Subject: Re: [RFC v2] non-temporal memcpy To: Honnappa Nagarahalli Cc: =?UTF-8?Q?Morten_Br=C3=B8rup?= , Konstantin Ananyev , dev , Bruce Richardson , Jan Viktorin , Ruifeng Wang , David Christensen , nd Content-Type: multipart/alternative; boundary="00000000000093622405e4d092a6" X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org --00000000000093622405e4d092a6 Content-Type: text/plain; charset="UTF-8" On Wed, 27 Jul 2022, 21:53 Honnappa Nagarahalli, < Honnappa.Nagarahalli@arm.com> wrote: > > > > > > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com] > > > Sent: Wednesday, 27 July 2022 19.38 > > > > > > > [...] > > > > > > > > > > > > Yes, x86 needs 16B alignment for NT load/stores But that's > > > supposed > > > > > to be arch > > > > > > specific limitation, that we probably want to hide, no? > > > > > > > > Correct. However, optional hints for optimization purposes will be > > > available. > > > > And it is up to the architecture specific implementation to make the > > > best use > > > > of these hints, or just ignore them. 
> > > > > > > > > > Inside the function can check alignment of both src and dst and > > > > > decide should it > > > > > > use NT load/store instructions or just do normal copy. > > > > > IMO, the normal copy should not be done by this API under any > > > > > conditions. Why not let the application call memcpy/rte_memcpy > > > > > when the NT copy is not applicable? It helps the programmer to > > > understand > > > > > and debug the issues much easier. > > > > > > > > Yes, the programmer must choose between normal memcpy() and non- > > > > temporal rte_memcpy_nt(). I am offering new functions, not modifying > > > > memcpy() or rte_memcpy(). > > > > > > > > And rte_memcpy_nt() will silently fall back to normal memcpy() if > > > non- > > > > temporal copying is unavailable, e.g. on POWER and RISC-V > > > architectures, > > > > which don't have NT load/store instructions. > > > I am talking about a scenario where the application is being ported > > > between architectures. Not everyone knows about the capabilities of > > > the architecture. It is better to indicate upfront (ex: compilation > > > failures) that a certain feature is not supported on the target > > > architecture rather than the user having to discover through painful > > > debugging. > > > > I'm considering rte_memcpy_nt() a performance optimized variant of > > memcpy(), where the performance gain is less cache pollution. Thus, > silent > > fallback to memcpy() should suffice. > > > > Other architecture differences also affect DPDK performance; the > inability to > > perform non-temporal load/store just one more to the (undocumented) list. > > > > Failing at build time if NT load/store is unavailable by the > architecture would > > prevent the function from being used by other DPDK libraries, e.g. by the > > rte_pktmbuf_copy() function used by the pdump library. > The other libraries in DPDK need to provide NT versions as the libraries > need to cater for not-NT use cases as well. i.e. 
we cannot hide a NT copy
> under rte_pktmbuf_copy() API, we need to have rte_pktmbuf_copy_nt()
>
> >
> > I don't oppose to your idea, I just don't have any idea how to reasonably
> > implement it. So I'm trying to defend why it is not important.
> I am suggesting that the applications could implement #ifdef depending on
> the architecture.
>

I assume this would be a preprocessor flag that DPDK defines (or leaves undefined), with the application using #ifdef on it?

Another way to achieve this would be to use the #warning directive (see [1]) inside DPDK whenever the generic fallback is taken.

Also, isn't the question of a memcpy_nt capability query really a more general one: how should an application query DPDK's capabilities, at run time or at compile time?

[1] https://gcc.gnu.org/onlinedocs/cpp/Diagnostics.html

>

--00000000000093622405e4d092a6
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable


On Wed, 27 Jul 2022, 21:53 Honnappa Nagarahalli, <Honnappa.Nagarahalli@arm.com> wrote:
<snip>
>
> > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > Sent: Wednesday, 27 July 2022 19.38
> >
>
> [...]
>
> > >
> > > > > > Yes, x86 needs 16B alignment for NT load/stores But that's supposed
> > > > > > to be arch specific limitation, that we probably want to hide, no?
> > >
> > > Correct. However, optional hints for optimization purposes will be available.
> > > And it is up to the architecture specific implementation to make the best use
> > > of these hints, or just ignore them.
> > >
> > > > > > Inside the function can check alignment of both src and dst and decide should it
> > > > > > use NT load/store instructions or just do normal copy.
> > > > IMO, the normal copy should not be done by this API under any
> > > > conditions. Why not let the application call memcpy/rte_memcpy
> > > > when the NT copy is not applicable? It helps the programmer to understand
> > > > and debug the issues much easier.
> > >
> > > Yes, the programmer must choose between normal memcpy() and
> > > non-temporal rte_memcpy_nt(). I am offering new functions, not modifying
> > > memcpy() or rte_memcpy().
> > >
> > > And rte_memcpy_nt() will silently fall back to normal memcpy() if
> > > non-temporal copying is unavailable, e.g. on POWER and RISC-V architectures,
> > > which don't have NT load/store instructions.
> > I am talking about a scenario where the application is being ported
> > between architectures. Not everyone knows about the capabilities of
> > the architecture. It is better to indicate upfront (ex: compilation
> > failures) that a certain feature is not supported on the target
> > architecture rather than the user having to discover through painful
> > debugging.
>
> I'm considering rte_memcpy_nt() a performance optimized variant of
> memcpy(), where the performance gain is less cache pollution. Thus, silent
> fallback to memcpy() should suffice.
>
> Other architecture differences also affect DPDK performance; the inability to
> perform non-temporal load/store just one more to the (undocumented) list.
>
> Failing at build time if NT load/store is unavailable by the architecture would
> prevent the function from being used by other DPDK libraries, e.g. by the
> rte_pktmbuf_copy() function used by the pdump library.
The other libraries in DPDK need to provide NT versions as the libraries need to cater for not-NT use cases as well. i.e. we cannot hide a NT copy under rte_pktmbuf_copy() API, we need to have rte_pktmbuf_copy_nt()

>
> I don't oppose to your idea, I just don't have any idea how to reasonably
> implement it. So I'm trying to defend why it is not important.
I am suggesting that the applications could implement #ifdef depending on the architecture.

I assume this would be a preprocessor flag that DPDK defines (or leaves undefined), with the application using #ifdef on it?
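
A rough sketch of what the application side might look like under that assumption. RTE_MEMCPY_NT_NATIVE, memcpy_nt_stub() and app_copy() are made-up names for illustration, not existing DPDK symbols, and the stub simply calls memcpy() because the real rte_memcpy_nt() signature is still under discussion in this thread:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical flag: assume a DPDK header would define this only on
 * architectures with native NT load/store. It is NOT an existing DPDK
 * macro. It is left undefined here, so the plain path gets compiled.
 */
/* #define RTE_MEMCPY_NT_NATIVE 1 */

#ifdef RTE_MEMCPY_NT_NATIVE
/* Stand-in for the proposed rte_memcpy_nt(); only does a plain copy. */
static void *memcpy_nt_stub(void *dst, const void *src, size_t len)
{
    return memcpy(dst, src, len);
}
#endif

/*
 * Application-side copy. Returns 1 if the NT path was compiled in,
 * 0 if the application chose plain memcpy() itself.
 */
static int app_copy(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);
#ifdef RTE_MEMCPY_NT_NATIVE
    memcpy_nt_stub(dst, src, len);
    return 1;
#else
    memcpy(dst, src, len);
    return 0;
#endif
}
```

With the flag undefined, app_copy() takes the memcpy() branch, so the choice is visible in the application source at compile time instead of being hidden inside the library.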

Another way to achieve this would be to use the #warning directive (see [1]) inside DPDK whenever the generic fallback is taken.
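
Roughly like this on the library side. This is only a sketch: SKETCH_NT_NATIVE and sketch_memcpy_nt() are invented names, the architecture check is an illustrative assumption and not how DPDK selects implementations, and the function body always does a plain copy so that it runs anywhere:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustrative arch check only. Which architectures count as having
 * native NT load/store is an assumption made for this sketch.
 */
#if defined(__x86_64__) || defined(__aarch64__)
#define SKETCH_NT_NATIVE 1
#else
#define SKETCH_NT_NATIVE 0
/* Emitted at build time whenever the generic fallback is selected. */
#warning "memcpy_nt sketch: no NT load/store here, falling back to memcpy()"
#endif

/* Hypothetical stand-in for the proposed rte_memcpy_nt(). */
static void *sketch_memcpy_nt(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    /*
     * A real implementation would use NT instructions when
     * SKETCH_NT_NATIVE is 1. This sketch always copies normally.
     */
    return memcpy(dst, src, len);
}
```

The build still succeeds on every architecture, so other DPDK libraries could keep calling the function, but anyone porting an application would see the diagnostic in the build log. Note that #warning is a GCC/Clang extension before C23, and a build using -Werror would turn it into a failure.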

Also, isn't the question of a memcpy_nt capability query really a more general one: how should an application query DPDK's capabilities, at run time or at compile time?

[1] https://gcc.gnu.org/onlinedocs/cpp/Diagnostics.html

--00000000000093622405e4d092a6--