From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mga04.intel.com (mga04.intel.com [192.55.52.120]) by dpdk.org (Postfix) with ESMTP id AFDB51B320 for ; Tue, 3 Oct 2017 13:39:47 +0200 (CEST) Received: from fmsmga004.fm.intel.com ([10.253.24.48]) by fmsmga104.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 03 Oct 2017 04:39:46 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.42,474,1500966000"; d="scan'208";a="318957755" Received: from fmsmsx108.amr.corp.intel.com ([10.18.124.206]) by fmsmga004.fm.intel.com with ESMTP; 03 Oct 2017 04:39:46 -0700 Received: from fmsmsx158.amr.corp.intel.com (10.18.116.75) by FMSMSX108.amr.corp.intel.com (10.18.124.206) with Microsoft SMTP Server (TLS) id 14.3.319.2; Tue, 3 Oct 2017 04:39:46 -0700 Received: from shsmsx152.ccr.corp.intel.com (10.239.6.52) by fmsmsx158.amr.corp.intel.com (10.18.116.75) with Microsoft SMTP Server (TLS) id 14.3.319.2; Tue, 3 Oct 2017 04:39:46 -0700 Received: from shsmsx101.ccr.corp.intel.com ([169.254.1.159]) by SHSMSX152.ccr.corp.intel.com ([169.254.6.93]) with mapi id 14.03.0319.002; Tue, 3 Oct 2017 19:39:44 +0800 From: "Li, Xiaoyun" To: "Ananyev, Konstantin" , "Richardson, Bruce" CC: "Lu, Wenzhuo" , "Zhang, Helin" , "dev@dpdk.org" Thread-Topic: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy Thread-Index: AQHTO5m58EawVo0vnkiZ+SCua4XO0qLQPJeAgADxDECAAEcFgIAAirkQ Date: Tue, 3 Oct 2017 11:39:43 +0000 Message-ID: References: <1506411689-94690-1-git-send-email-xiaoyun.li@intel.com> <1506960796-71620-1-git-send-email-xiaoyun.li@intel.com> <1506960796-71620-2-git-send-email-xiaoyun.li@intel.com> <2601191342CEEE43887BDE71AB9772585FAA2F94@IRSMSX103.ger.corp.intel.com> <2601191342CEEE43887BDE71AB9772585FAA35AA@IRSMSX103.ger.corp.intel.com> In-Reply-To: <2601191342CEEE43887BDE71AB9772585FAA35AA@IRSMSX103.ger.corp.intel.com> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: dlp-product: dlpe-windows dlp-version: 11.0.0.116 dlp-reaction: no-action x-ctpclassification: CTP_IC x-titus-metadata-40: eyJDYXRlZ29yeUxhYmVscyI6IiIsIk1ldGFkYXRhIjp7Im5zIjoiaHR0cDpcL1wvd3d3LnRpdHVzLmNvbVwvbnNcL0ludGVsMyIsImlkIjoiYjBkMzkyYzgtZDVjOS00OTkwLTg3YjItMjU2MGFlYjNhNzRmIiwicHJvcHMiOlt7Im4iOiJDVFBDbGFzc2lmaWNhdGlvbiIsInZhbHMiOlt7InZhbHVlIjoiQ1RQX0lDIn1dfV19LCJTdWJqZWN0TGFiZWxzIjpbXSwiVE1DVmVyc2lvbiI6IjE2LjUuOS4zIiwiVHJ1c3RlZExhYmVsSGFzaCI6ImRBZ25yQmdWQWl5YnFpcjNQazJaOXRxVTNhTGJCUXcxRUtGMUZtWHpxc0U9In0= x-originating-ip: [10.239.127.40] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Subject: Re: [dpdk-dev] [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 03 Oct 2017 11:39:48 -0000 Hi You mean just use rte_memcpy_internal in rte_memcpy_avx2, rte_memcpy_avx512= ? But if RTE_MACHINE_CPUFLAGS_AVX2 means only whether the compiler supports a= vx2, then internal would only compiled With avx2 codes, then cannot choose other code path. What if the HW cannot = support avx2? If RTE_MACHINE_CPUFLAGS_AVX2 means as before, suggests whether both compile= r and HW supports avx2. Then the function has no difference right now. The mocro is determined at compilation time. But selection is hoped to be a= t runtime. Did I consider something wrong? Best Regards, Xiaoyun Li > -----Original Message----- > From: Ananyev, Konstantin > Sent: Tuesday, October 3, 2017 19:16 > To: Li, Xiaoyun ; Richardson, Bruce > > Cc: Lu, Wenzhuo ; Zhang, Helin > ; dev@dpdk.org > Subject: RE: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy >=20 > Hi, >=20 > > > > Hi > > > > > -----Original Message----- > > > From: Ananyev, Konstantin > > > Sent: Tuesday, October 3, 2017 00:39 > > > To: Li, Xiaoyun ; Richardson, Bruce > > > > > > Cc: Lu, Wenzhuo ; Zhang, Helin > > > ; dev@dpdk.org > > > Subject: RE: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy > > > > > > > > > > > > > -----Original Message----- > > > > From: Li, Xiaoyun > > > > Sent: Monday, October 2, 2017 5:13 PM > > > > To: Ananyev, Konstantin ; > > > > Richardson, > > > Bruce > > > > Cc: Lu, Wenzhuo ; Zhang, Helin > > > ; dev@dpdk.org; Li, Xiaoyun > > > > > > > Subject: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy > > > > > > > > This patch dynamically selects functions of memcpy at run-time > > > > based on CPU flags that current machine supports. This patch uses > > > > function pointers which are bind to the relative functions at const= rctor > time. > > > > In addition, AVX512 instructions set would be compiled only if > > > > users config it enabled and the compiler supports it. > > > > > > > > Signed-off-by: Xiaoyun Li > > > > --- > > > > v2 > > > > * Use gcc function multi-versioning to avoid compilation issues. > > > > * Add macros for AVX512 and AVX2. Only if users enable AVX512 and > > > > the compiler supports it, the AVX512 codes would be compiled. Only > > > > if the compiler supports AVX2, the AVX2 codes would be compiled. > > > > > > > > v3 > > > > * Reduce function calls via only keep rte_memcpy_xxx. > > > > * Add conditions that when copy size is small, use inline code path= . > > > > Otherwise, use dynamic code path. > > > > * To support attribute target, clang version must be greater than 3= .7. > > > > Otherwise, would choose SSE/AVX code path, the same as before. > > > > * Move two mocro functions to the top of the code since they would > > > > be used in inline SSE/AVX and dynamic SSE/AVX codes. > > > > > > > > v4 > > > > * Modify rte_memcpy.h to several .c files and modify makefiles to > > > > compile > > > > AVX2 and AVX512 files. > > > > > > Could you explain to me why instead of reusing existing rte_memcpy() > > > code to generate _sse/_avx2/ax512f flavors you keep pushing changes > > > with 3 separate implementations? > > > Obviously that is much more expensive in terms of maintenance and > > > doesn't look like feasible solution to me. > > > Is existing rte_memcpy() implementation is not good enough in terms > > > of functionality and/or performance? > > > If so, can you outline these problems and try to fix them first. > > > Konstantin > > > > > > > I just change many small functions to one function in those 3 separate > functions. >=20 > Yes, so with what you suggest we'll have 4 implementations for rte_memc= py > to support. > That's very expensive terms of maintenance and I believe totally unnecess= ary. >=20 > > Because the existing codes are totally inline, including rte_memcpy() > > itself. So the compilation will change all rte_memcpy() calls into the = basic > codes like xmm0=3Dxxx. > > > > The existing codes in this way are OK. >=20 > Good. >=20 > >But when run-time, it will bring lots of function calls and cause perf > >drop. >=20 > I believe it wouldn't if we do it properly. > All internal functions (mov16, mov32, etc.) will still be unlined by the > compiler for each flavor (sse/avx2/etc.) - have a look at the patch I sen= t. >=20 > Konstantin