Date: Thu, 4 Apr 2024 11:07:03 +0100
From: Bruce Richardson
To: Morten Brørup
Subject: Re: [PATCH v2] eal/x86: improve rte_memcpy const size 16 performance
References: <20240302234812.9137-1-mb@smartsharesystems.com> <20240303094621.16404-1-mb@smartsharesystems.com>
In-Reply-To: <20240303094621.16404-1-mb@smartsharesystems.com>
List-Id: DPDK patches and discussions

On Sun, Mar 03, 2024 at 10:46:21AM +0100, Morten Brørup wrote:
> When the rte_memcpy() size is 16, the same 16 bytes are copied twice.
> In the case where the size is known to be 16 at build time, omit the
> duplicate copy.
>
> Reduced the amount of effectively copy-pasted code by using #ifdef
> inside functions instead of outside functions.
>
> Suggested-by: Stephen Hemminger
> Signed-off-by: Morten Brørup

Changes in general look good to me. Comments inline below.

/Bruce

> ---
> v2:
> * For GCC, version 11 is required for proper AVX handling;
>   if older GCC version, treat AVX as SSE.
>   Clang does not have this issue.
>   Note: Original code always treated AVX as SSE, regardless of compiler.
> * Do not add copyright. (Stephen Hemminger)
> ---
>  lib/eal/x86/include/rte_memcpy.h | 231 ++++++++-----------------------
>  1 file changed, 56 insertions(+), 175 deletions(-)
>
> diff --git a/lib/eal/x86/include/rte_memcpy.h b/lib/eal/x86/include/rte_memcpy.h
> index 72a92290e0..d1df841f5e 100644
> --- a/lib/eal/x86/include/rte_memcpy.h
> +++ b/lib/eal/x86/include/rte_memcpy.h
> @@ -91,14 +91,6 @@ rte_mov15_or_less(void *dst, const void *src, size_t n)
>  	return ret;
>  }
>
> -#if defined __AVX512F__ && defined RTE_MEMCPY_AVX512
> -
> -#define ALIGNMENT_MASK 0x3F
> -
> -/**
> - * AVX512 implementation below
> - */
> -
>  /**
>   * Copy 16 bytes from one location to another,
>   * locations should not overlap.
> @@ -119,10 +111,16 @@ rte_mov16(uint8_t *dst, const uint8_t *src)
>  static __rte_always_inline void
>  rte_mov32(uint8_t *dst, const uint8_t *src)
>  {
> +#if (defined __AVX512F__ && defined RTE_MEMCPY_AVX512) || defined __AVX2__ || \
> +	(defined __AVX__ && !(defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION < 110000)))

I think we can drop the AVX512 checks here, since I'm not aware of any
system where we'd have AVX512 but not AVX2 available, so just checking for
AVX2 support should be sufficient.

On the final compiler-based check, I don't strongly object to it, but I
just wonder as to its real value. AVX2 was first introduced by Intel over 10
years ago, and (from what I find in wikipedia), it's been in AMD CPUs since
~2015. While we did have CPUs still being produced without AVX2 since that
time, they generally didn't have AVX1 either, only having SSE instructions.
Therefore the number of systems which require this additional check is
likely very small at this stage.

That said, I'm ok to either keep or omit it at your choice. If you do keep
it, how about putting the check once at the top of the file and using a
single short define instead for the multiple places it's used e.g.

#if (defined __AVX__ && !(defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION < 110000)))
#define RTE_MEMCPY_AVX2
#endif

>  	__m256i ymm0;
>
>  	ymm0 = _mm256_loadu_si256((const __m256i *)src);
>  	_mm256_storeu_si256((__m256i *)dst, ymm0);
> +#else /* SSE implementation */
> +	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> +	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> +#endif
>  }
>
>  /**
> @@ -132,10 +130,15 @@ rte_mov32(uint8_t *dst, const uint8_t *src)
>  static __rte_always_inline void
>  rte_mov64(uint8_t *dst, const uint8_t *src)
>  {
> +#if defined __AVX512F__ && defined RTE_MEMCPY_AVX512
>  	__m512i zmm0;
>
>  	zmm0 = _mm512_loadu_si512((const void *)src);
>  	_mm512_storeu_si512((void *)dst, zmm0);
> +#else /* AVX2, AVX & SSE implementation */
> +	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
> +	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
> +#endif
>  }
>
>  /**
> @@ -156,12 +159,18 @@ rte_mov128(uint8_t *dst, const uint8_t *src)
>  static __rte_always_inline void
>  rte_mov256(uint8_t *dst, const uint8_t *src)
>  {
> -	rte_mov64(dst + 0 * 64, src + 0 * 64);
> -	rte_mov64(dst + 1 * 64, src + 1 * 64);
> -	rte_mov64(dst + 2 * 64, src + 2 * 64);
> -	rte_mov64(dst + 3 * 64, src + 3 * 64);
> +	rte_mov128(dst + 0 * 128, src + 0 * 128);
> +	rte_mov128(dst + 1 * 128, src + 1 * 128);
>  }
>
> +#if defined __AVX512F__ && defined RTE_MEMCPY_AVX512
> +
> +/**
> + * AVX512 implementation below
> + */
> +
> +#define ALIGNMENT_MASK 0x3F
> +
>  /**
>   * Copy 128-byte blocks from one location to another,
>   * locations should not overlap.
> @@ -231,12 +240,22 @@ rte_memcpy_generic(void *dst, const void *src, size_t n)
>  	/**
>  	 * Fast way when copy size doesn't exceed 512 bytes
>  	 */
> +	if (__builtin_constant_p(n) && n == 32) {
> +		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> +		return ret;
> +	}

There's an outstanding patchset from Stephen to replace all use of
rte_memcpy with a constant parameter with an actual call to regular memcpy.
On a wider scale, should we not look to do something similar in this file,
and have calls to rte_memcpy with a constant parameter always turn into a
call to regular memcpy? We used to have such a macro in older DPDK, e.g.
from DPDK 1.8:
http://git.dpdk.org/dpdk/tree/lib/librte_eal/common/include/arch/x86/rte_memcpy.h?h=v1.8.0#n171

This would eliminate the need to put in constant_p checks all through the
code.

>  	if (n <= 32) {
>  		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> +		if (__builtin_constant_p(n) && n == 16)
> +			return ret; /* avoid (harmless) duplicate copy */
>  		rte_mov16((uint8_t *)dst - 16 + n,
>  				(const uint8_t *)src - 16 + n);
>  		return ret;
>  	}
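
For illustration only, the constant-size dispatch discussed above (sending
rte_memcpy calls with a build-time constant size to regular memcpy) might be
sketched roughly as below. This is a minimal sketch under stated assumptions,
not the DPDK 1.8 macro verbatim: the names sketch_memcpy and
sketch_memcpy_func are hypothetical, and the byte loop merely stands in for
the hand-written run-time-size path in rte_memcpy.h. It relies on
__builtin_constant_p(), so it assumes GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for the run-time-size copy path. The real header
 * would use its vectorised implementation here; a byte loop keeps this
 * sketch portable and self-contained.
 */
static inline void *
sketch_memcpy_func(void *dst, const void *src, size_t n)
{
	uint8_t *d = dst;
	const uint8_t *s = src;

	while (n--)
		*d++ = *s++;
	return dst;
}

/*
 * Hypothetical dispatch macro: if the size is a compile-time constant,
 * call regular memcpy(), which the compiler can expand inline for small
 * fixed sizes; otherwise take the run-time-size path.
 * __builtin_constant_p() does not evaluate its argument, so n is
 * evaluated exactly once, in whichever branch is selected.
 */
#define sketch_memcpy(dst, src, n)                  \
	(__builtin_constant_p(n) ?                  \
	 memcpy((dst), (src), (n)) :                \
	 sketch_memcpy_func((dst), (src), (n)))
```

With a wrapper of this shape, a call such as sketch_memcpy(dst, src, 16)
would resolve to memcpy at build time, so per-size __builtin_constant_p
checks inside the generic copy function would not be needed for that case.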