DPDK patches and discussions
From: <chen.qiguo@zte.com.cn>
To: <sunyuechi@iscas.ac.cn>
Cc: <stanislaw.kardach@gmail.com>, <stephen@networkplumber.org>,
	<dev@dpdk.org>, <bruce.richardson@intel.com>
Subject: Re: [PATCH v1 1/2] riscv support rte_memcpy in vector
Date: Fri, 17 Oct 2025 18:10:07 +0800 (CST)	[thread overview]
Message-ID: <20251017181007533dzuJEVmzs5aH510aD5O1S@zte.com.cn> (raw)
In-Reply-To: <3f36237c.34043.199f0a4c352.Coremail.sunyuechi@iscas.ac.cn>



>     16  0 -  0( 57.49%)   1 -  1(  7.30%)   2 -  2(  0.19%)   3 -  3(  3.19%)
>     17  0 -  0( 53.78%)   3 -  2( 51.65%)   4 -  3( 37.35%)   4 -  3( 23.94%)
>     31  0 -  0( 27.02%)   3 -  2( 51.99%)   4 -  3( 37.34%)   4 -  3( 24.09%)
>     32  0 -  0( 56.82%)   3 -  2( 50.42%)   4 -  3( 39.73%)   4 -  3( 25.04%)
>     33  0 -  0( 30.60%)   3 -  3( 30.94%)   6 -  4( 46.89%)   6 -  5( 26.21%)
>     63  0 -  0( 16.84%)   4 -  3( 21.57%)   6 -  5( 31.74%)   7 -  6( 18.01%)
>     64  0 -  0( 21.98%)   4 -  3( 21.35%)   6 -  5( 36.13%)   7 -  6( 20.05%)
> It looks like there's a performance degradation in the 0-128 range, can you fix it?


For small-size copies, we can use memcpy directly. It seems the dispatch condition in rte_memcpy causes this result.
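As a sketch of the idea (the names, the threshold value, and the placeholder body are illustrative, not the patch's exact code), small sizes would bypass the vector path entirely so they never pay the extra checks:

```c
#include <stddef.h>
#include <string.h>

/* Illustrative threshold: sizes below this go straight to memcpy,
 * avoiding the branch/setup cost seen in the 0-128 range above. */
#define VEC_COPY_MIN 128

/* Stand-in for the RVV path; the real patch uses inline vector asm. */
static void *vector_copy(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n); /* placeholder for this sketch */
}

static void *dispatch_memcpy(void *dst, const void *src, size_t n)
{
    if (n < VEC_COPY_MIN)   /* small copies: plain memcpy, no extra logic */
        return memcpy(dst, src, n);
    return vector_copy(dst, src, n);
}
```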



Original


From: sunyuechi@iscas.ac.cn <sunyuechi@iscas.ac.cn>
To: Qiguo Chen <chen.qiguo@zte.com.cn>;
Cc: stanislaw.kardach@gmail.com <stanislaw.kardach@gmail.com>; stephen@networkplumber.org <stephen@networkplumber.org>; dev@dpdk.org <dev@dpdk.org>; bruce.richardson@intel.com <bruce.richardson@intel.com>;
Date: October 17, 2025, 13:29
Subject: Re: [PATCH v1 1/2] riscv support rte_memcpy in vector

> riscv support rte_memcpy in vector
 > This patch implements RISC-V vector intrinsics
 
 
 Please adjust the title and commit message to mention that zicbop has been introduced, and that intrinsics are not currently being used
 
 
 config/riscv/meson.build
 
 
 > # detect extensions
 > # Requires intrinsics available in GCC 14.1.0+ and Clang 18.1.0+
 > if (riscv_extension_macros and
 >     (cc.get_define('__riscv_zicbop', args: machine_args) != ''))
 >   if ((cc.get_id() == 'gcc' and cc.version().version_compare('>=14.1.0'))
 >       or (cc.get_id() == 'clang' and cc.version().version_compare('>=18.1.0')))
 >       message('Compiling with the zicbop extension')
 >       machine_args += ['-DRTE_RISCV_FEATURE_PREFETCH']
 >   else
 >     warning('Detected zicbop extension but cannot use because intrinsics are not available (present in GCC 14.1.0+ and Clang 18.1.0+)')
 >   endif
 > endif
 
 
 The implementation does not involve intrinsics
 
 
 >     16  0 -  0( 57.49%)   1 -  1(  7.30%)   2 -  2(  0.19%)   3 -  3(  3.19%) 
 >     17  0 -  0( 53.78%)   3 -  2( 51.65%)   4 -  3( 37.35%)   4 -  3( 23.94%) 
 >     31  0 -  0( 27.02%)   3 -  2( 51.99%)   4 -  3( 37.34%)   4 -  3( 24.09%) 
 >     32  0 -  0( 56.82%)   3 -  2( 50.42%)   4 -  3( 39.73%)   4 -  3( 25.04%) 
 >     33  0 -  0( 30.60%)   3 -  3( 30.94%)   6 -  4( 46.89%)   6 -  5( 26.21%) 
 >     63  0 -  0( 16.84%)   4 -  3( 21.57%)   6 -  5( 31.74%)   7 -  6( 18.01%) 
 >     64  0 -  0( 21.98%)   4 -  3( 21.35%)   6 -  5( 36.13%)   7 -  6( 20.05%) 
 
 
 It looks like there's a performance degradation in the 0-128 range, can you fix it?
 
 
 eal/riscv/include/rte_memcpy.h
 
 
 > #define ALIGNMENT_MASK_16    0xF
 
 
 This macro is unused
 
 
 >/*else*/
 
 
 Please remove /*else*/
 
 
 > static __rte_always_inline void *
 > _rte_memcpy(void *dst, const void *src, size_t n)
 > {
 > 	return _rte_memcpy_generic((uint8_t *)dst, (const uint8_t *)src, n);
 > }
 
 
 No need for an extra function call; you can write the implementation directly in the function
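A sketch of the suggested shape (the memcpy body here is only a stand-in for the copy logic currently in _rte_memcpy_generic):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold the body of _rte_memcpy_generic() straight into _rte_memcpy(),
 * removing the pass-through wrapper and its casts. */
static inline void *
_rte_memcpy(void *dst, const void *src, size_t n)
{
    uint8_t *d = (uint8_t *)dst;
    const uint8_t *s = (const uint8_t *)src;

    /* ... copy logic formerly in _rte_memcpy_generic() goes here ... */
    memcpy(d, s, n); /* placeholder body for this sketch */
    return dst;
}
```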
 	
 
 
 
 	-----Original Message-----
 From: "Qiguo Chen" <chen.qiguo@zte.com.cn>
 Sent: 2025-10-16 17:09:33 (Thursday)
 To: stanislaw.kardach@gmail.com, sunyuechi@iscas.ac.cn, stephen@networkplumber.org
 Cc: dev@dpdk.org, bruce.richardson@intel.com, "Qiguo Chen" <chen.qiguo@zte.com.cn>
 Subject: [PATCH v1 1/2] riscv support rte_memcpy in vector
 
 This patch implements RISC-V vector intrinsics
 to accelerate memory copy operations for byte range (129~1600).
 
 Signed-off-by: Qiguo Chen <chen.qiguo@zte.com.cn> 
 ---
  .mailmap                           |   1 +
  config/riscv/meson.build           |  14 ++
  lib/eal/riscv/include/rte_memcpy.h | 310 ++++++++++++++++++++++++++++-
  3 files changed, 323 insertions(+), 2 deletions(-)
 
 diff --git a/.mailmap b/.mailmap
 index 08e5ec8560..178c5f44f4 100644
 --- a/.mailmap
 +++ b/.mailmap
 @@ -1285,6 +1285,7 @@ Qian Hao <qi_an_hao@126.com> 
  Qian Xu <qian.q.xu@intel.com> 
  Qiao Liu <qiao.liu@intel.com> 
  Qi Fu <qi.fu@intel.com> 
 +Qiguo Chen <chen.qiguo@zte.com.cn> 
  Qimai Xiao <qimaix.xiao@intel.com> 
  Qiming Chen <chenqiming_huawei@163.com> 
  Qiming Yang <qiming.yang@intel.com> 
 diff --git a/config/riscv/meson.build b/config/riscv/meson.build
 index f3daea0c0e..abba474b5e 100644
 --- a/config/riscv/meson.build
 +++ b/config/riscv/meson.build
 @@ -146,6 +146,20 @@ if (riscv_extension_macros and
      endif
  endif
   
 +# detect extensions
 +# Requires intrinsics available in GCC 14.1.0+ and Clang 18.1.0+
 +if (riscv_extension_macros and
 +    (cc.get_define('__riscv_zicbop', args: machine_args) != ''))
 +  if ((cc.get_id() == 'gcc' and cc.version().version_compare('>=14.1.0'))
 +      or (cc.get_id() == 'clang' and cc.version().version_compare('>=18.1.0')))
 +      message('Compiling with the zicbop extension')
 +      machine_args += ['-DRTE_RISCV_FEATURE_PREFETCH']
 +  else
 +    warning('Detected zicbop extension but cannot use because intrinsics are not available (present in GCC 14.1.0+ and Clang 18.1.0+)')
 +  endif
 +endif
 +
 +
  # apply flags
  foreach flag: dpdk_flags
      if flag.length() > 0
 diff --git a/lib/eal/riscv/include/rte_memcpy.h b/lib/eal/riscv/include/rte_memcpy.h
 index d8a942c5d2..6f8cb0d4a4 100644
 --- a/lib/eal/riscv/include/rte_memcpy.h
 +++ b/lib/eal/riscv/include/rte_memcpy.h
 @@ -11,6 +11,7 @@
  #include <string.h> 
   
  #include "rte_common.h" 
 +#include <rte_branch_prediction.h> 
   
  #include "generic/rte_memcpy.h" 
   
 @@ -18,6 +19,290 @@
  extern "C" {
  #endif
   
 +
 +#if defined(RTE_RISCV_FEATURE_V) && !(defined(RTE_RISCV_FEATURE_PREFETCH))
 +#undef RTE_RISCV_FEATURE_V
 +#endif
 +
 +
 +#if defined(RTE_RISCV_FEATURE_V)
 +
 +#include "rte_cpuflags.h" 
 +
 +#define RISCV_VLENB   16
 +#define MEMCPY_GLIBC       (1U << 0)
 +#define MEMCPY_RISCV       (1U << 1)
 +#define ALIGNMENT_MASK_128   0x7F
 +#define ALIGNMENT_MASK_64    0x3F
 +#define ALIGNMENT_MASK_16    0xF
 +
 +static uint8_t memcpy_alg = MEMCPY_GLIBC;
 +
 +
 +static __rte_always_inline void
 +memcpy_prefetch64_1(const uint8_t *src, uint8_t *dst)
 +{
 +    __asm__ (
 +        "prefetch.r 64(%0)\n" 
 +        "prefetch.w 64(%1)" 
 +        :: "r"(src), "r"(dst)
 +    );
 +}
 +
 +static __rte_always_inline void
 +memcpy_prefetch128_1(const uint8_t *src, uint8_t *dst)
 +{
 +    __asm__ (
 +        "prefetch.r 128(%0)\n" 
 +        "prefetch.w 128(%1)" 
 +        :: "r"(src), "r"(dst)
 +    );
 +}
 +
 +static __rte_always_inline void
 +memcpy_prefetch128_2(const uint8_t *src, uint8_t *dst)
 +{
 +    __asm__ (
 +        "prefetch.r 128(%0);" 
 +        "prefetch.w 128(%1);" 
 +        "prefetch.r 192(%0);" 
 +        "prefetch.w 192(%1)" 
 +        :: "r"(src), "r"(dst)
 +    );
 +}
 +
 +
 +static __rte_always_inline void
 +_rte_mov32(uint8_t *dst, const uint8_t *src)
 +{
 +    uint32_t n = 32;
 +    asm volatile (
 +         "vsetvli t1, %2, e8, m2, ta, ma\n" 
 +         "vle8.v v2, (%1)\n" 
 +         "vse8.v v2, (%0)" 
 +         :: "r"(dst), "r"(src), "r"(n)
 +         : "v2", "v3", "t1", "memory" 
 +     );
 +}
 +
 +static __rte_always_inline void
 +_rte_mov64(uint8_t *dst, const uint8_t *src)
 +{
 +    uint32_t n = 64;
 +    asm volatile (
 +        "vsetvli t3, %2, e8, m4, ta, ma\n" 
 +        "vle8.v v8, (%1)\n" 
 +        "vse8.v v8, (%0)" 
 +        :: "r"(dst), "r"(src), "r"(n)
 +        :  "v8", "v9", "v10", "v11", "t3", "memory" 
 +     );
 +}
 +
 +static __rte_always_inline void
 +_rte_mov128(uint8_t *dst, const uint8_t *src)
 +{
 +    uint32_t n = 128;
 +    asm volatile (
 +        "vsetvli t4, %2, e8, m8, ta, ma\n" 
 +        "vle8.v v16, (%1)\n" 
 +        "vse8.v v16, (%0)" 
 +        :: "r"(dst), "r"(src), "r"(n)
 +        : "v16", "v17", "v18", "v19", "v20", "v21", "v22", "v23", "t4", "memory" 
 +     );
 +}
 +
 +static __rte_always_inline void
 +_rte_mov256(uint8_t *dst, const uint8_t *src)
 +{
 +    memcpy_prefetch128_2(src, dst);
 +    _rte_mov128(dst, src);
 +    _rte_mov128(dst + 128, src + 128);
 +}
 +
 +static __rte_always_inline void
 +_rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
 +{
 +    asm volatile (
 +        "prefetch.r 64(%1)\n" 
 +        "prefetch.w 64(%0)\n" 
 +        "prefetch.r 128(%1)\n" 
 +        "prefetch.w 128(%0)\n" 
 +        "prefetch.r 192(%1)\n" 
 +        "prefetch.w 192(%0)\n" 
 +        "prefetch.r 256(%1)\n" 
 +        "prefetch.w 256(%0)\n" 
 +        "prefetch.r 320(%1)\n" 
 +        "prefetch.w 320(%0)\n" 
 +        "prefetch.r 384(%1)\n" 
 +        "prefetch.w 384(%0)\n" 
 +        "prefetch.r 448(%1)\n" 
 +        "prefetch.w 448(%0)\n" 
 +        "prefetch.r 512(%1)\n" 
 +        "li t6, 512\n" 
 +        "3:\n" 
 +        "li t5, 128;" 
 +        "vsetvli zero, t5, e8, m8, ta, ma\n" 
 +        "1:;" 
 +        "bgt %2, t6, 4f\n" 
 +        "j 2f\n" 
 +        "4:\n" 
 +        "prefetch.r 576(%1)\n" 
 +        "prefetch.r 640(%1)\n" 
 +        "2:\n" 
 +        "vle8.v   v16, (%1)\n" 
 +        "add      %1, %1, t5\n" 
 +        "vse8.v   v16, (%0)\n" 
 +        "add      %0, %0, t5\n" 
 +        "sub      %2, %2, t5\n" 
 +        "bnez     %2, 1b" 
 +        : "+r"(dst), "+r"(src), "+r"(n)
 +        :
 +        : "v16", "v17", "v18", "v19", "v20", "v21", "v22", "v23", "t5", "t6", "memory" 
 +    );
 +}
 +
 +static __rte_always_inline void
 +_rte_mov(uint8_t *dst, const uint8_t *src, uint32_t n)
 +{
 +    asm volatile (
 +        "1:\n" 
 +        "vsetvli t4, %2, e8, m8, ta, ma\n" 
 +        "vle8.v v16, (%1)\n" 
 +        "add %1, %1, t4\n" 
 +        "vse8.v v16, (%0)\n" 
 +        "add %0, %0, t4\n" 
 +        "sub %2, %2, t4\n" 
 +        "bnez %2, 1b" 
 +        : "+r"(dst), "+r"(src), "+r"(n)
 +        :
 +        : "v16", "v17", "v18", "v19", "v20", "v21", "v22", "v23", "t4", "memory" 
 +     );
 +}
 +
 +static __rte_always_inline void
 +_rte_mov_aligned(uint8_t *dst, const uint8_t *src, uint32_t n)
 +{
 +    asm volatile (
 +        "prefetch.r 128(%1)\n" 
 +        "prefetch.r 192(%1)\n" 
 +        "prefetch.r 256(%1)\n" 
 +        "prefetch.r 320(%1)\n" 
 +        "prefetch.r 384(%1)\n" 
 +        "prefetch.r 448(%1)\n" 
 +        "prefetch.r 512(%1)\n" 
 +        "prefetch.r 576(%1)\n" 
 +        "li t6, 640\n" 
 +        "1:\n" 
 +        "vsetvli t4, %2, e8, m8, ta, ma\n" 
 +        "vle8.v v16, (%1)\n" 
 +        "add %1, %1, t4\n" 
 +        "vse8.v v16, (%0)\n" 
 +        "add %0, %0, t4\n" 
 +        "sub %2, %2, t4\n" 
 +        "blt %2, t6, 3f\n" 
 +        "prefetch.r 512(%1)\n" 
 +        "prefetch.r 576(%1)\n" 
 +        "3:\n" 
 +        "bnez %2, 1b" 
 +        : "+r"(dst), "+r"(src), "+r"(n)
 +        :
 +        : "v16", "v17", "v18", "v19", "v20", "v21", "v22", "v23", "t4", "t6", "memory" 
 +     );
 +}
 +
 +static __rte_always_inline void *
 +_rte_memcpy_generic(uint8_t       *dst, const uint8_t *src, size_t n)
 +{
 +    void *ret = dst;
 +    size_t dstofss;
 +    uint32_t bn;
 +
 +    if (n <= 384) {
 +        if (n >= 256) {
 +            memcpy_prefetch128_2(src, dst);
 +            n -= 256;
 +            _rte_mov128(dst, src);
 +            _rte_mov128((uint8_t *)dst + 128, (const uint8_t *)src + 128);
 +            src = (const uint8_t *)src + 256;
 +            dst = (uint8_t *)dst + 256;
 +        }
 +        if (n >= 128) {
 +            memcpy_prefetch128_1(src, dst);
 +            n -= 128;
 +            _rte_mov128(dst, src);
 +            src = (const uint8_t *)src + 128;
 +            dst = (uint8_t *)dst + 128;
 +        }
 +
 +        if (n >= 64) {
 +            memcpy_prefetch64_1(src, dst);
 +            n -= 64;
 +            _rte_mov64(dst, src);
 +            src = (const uint8_t *)src + 64;
 +            dst = (uint8_t *)dst + 64;
 +        }
 +
 +        if (n > 32) {
 +            _rte_mov32(dst, src);
 +            _rte_mov32((uint8_t *)dst - 32 + n,
 +                    (const uint8_t *)src - 32 + n);
 +            return ret;
 +        }
 +
 +        if (n > 0) {
 +            _rte_mov32((uint8_t *)dst - 32 + n,
 +                    (const uint8_t *)src - 32 + n);
 +        }
 +        return ret;
 +    }
 +
 +    /**
 +     * Make store aligned when copy size exceeds 256 bytes.
 +     */
 +    dstofss = (uintptr_t)dst & ALIGNMENT_MASK_64;
 +    if (dstofss > 0) {
 +        dstofss = 64 - dstofss;
 +        n -= dstofss;
 +        _rte_mov64(dst, src);
 +        src = (const uint8_t *)src + dstofss;
 +        dst = (uint8_t *)dst + dstofss;
 +    }
 +
 +    /**
 +     * Copy 128-byte blocks
 +     */
 +    if ((uintptr_t)src & ALIGNMENT_MASK_64)    {
 +        bn = n - (n & ALIGNMENT_MASK_128);
 +        _rte_mov128blocks(dst, src, bn);
 +        n = n & ALIGNMENT_MASK_128;
 +        src = (const uint8_t *)src + bn;
 +        dst = (uint8_t *)dst + bn;
 +        _rte_mov(dst, src, n);
 +    } else
 +        _rte_mov_aligned(dst, src, n);
 +
 +    return ret;
 +}
 +
 +static __rte_always_inline void *
 +_rte_memcpy(void *dst, const void *src, size_t n)
 +{
 +    return _rte_memcpy_generic((uint8_t *)dst, (const uint8_t *)src, n);
 +}
 +#endif
 +
 +/*----------------------api---------------------------------------------------*/
 +static __rte_always_inline void *
 +rte_memcpy(void *dst, const void *src, size_t n)
 +{
 +#if defined(RTE_RISCV_FEATURE_V)
 +    if (likely((memcpy_alg == MEMCPY_RISCV) && (n >= 128) && (n < 2048)))
 +        return _rte_memcpy(dst, src, n);
 +    /*else*/
 +#endif
 +        return memcpy(dst, src, n);
 +}
 +
  static inline void
  rte_mov16(uint8_t *dst, const uint8_t *src)
  {
 @@ -51,10 +336,31 @@ rte_mov128(uint8_t *dst, const uint8_t *src)
  static inline void
  rte_mov256(uint8_t *dst, const uint8_t *src)
  {
 -    memcpy(dst, src, 256);
 +#if defined(RTE_RISCV_FEATURE_V)
 +    if (likely(memcpy_alg == MEMCPY_RISCV))
 +        _rte_mov256(dst, src);
 +    else
 +#endif
 +        memcpy(dst, src, 256);
 +}
 +/*----------------------------------------------------------------------------*/
 +#if defined(RTE_RISCV_FEATURE_V)
 +static inline long
 +riscv_vlenb(void)
 +{
 +    long vlenb;
 +    asm ("csrr %0, 0xc22" : "=r"(vlenb));
 +    return vlenb;
  }
   
 -#define rte_memcpy(d, s, n)    memcpy((d), (s), (n))
 +RTE_INIT(rte_vect_memcpy_init)
 +{
 +    long vlenb = riscv_vlenb();
 +    if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_RISCV_ISA_V) && (vlenb >= RISCV_VLENB))
 +        memcpy_alg = MEMCPY_RISCV;
 +}
 +#endif
 +
   
  #ifdef __cplusplus
  }
 --  
 2.21.0.windows.1



Thread overview: 7+ messages
2025-10-16  9:09 [PATCH v1 0/2] Optimization Summary for RISC-V rte_memcpy Qiguo Chen
2025-10-16  9:09 ` [PATCH v1 1/2] riscv support rte_memcpy in vector Qiguo Chen
2025-10-17  5:29   ` sunyuechi
2025-10-17 10:10     ` chen.qiguo [this message]
2025-10-17  9:36   ` [PATCH v2 0/1] Optimization Summary for RISC-V rte_memcpy Qiguo Chen
2025-10-17  9:36     ` [PATCH v2 1/1] riscv support rte_memcpy in vector Qiguo Chen
2025-10-16  9:09 ` [PATCH v1 2/2] benchmark report for rte_memcpy Qiguo Chen
