From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mga14.intel.com (mga14.intel.com [192.55.52.115]) by dpdk.org (Postfix) with ESMTP id 6ED681B34A for ; Tue, 3 Oct 2017 14:12:47 +0200 (CEST) Received: from orsmga001.jf.intel.com ([10.7.209.18]) by fmsmga103.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 03 Oct 2017 05:12:22 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.42,474,1500966000"; d="scan'208";a="1178124649" Received: from irsmsx154.ger.corp.intel.com ([163.33.192.96]) by orsmga001.jf.intel.com with ESMTP; 03 Oct 2017 05:12:21 -0700 Received: from irsmsx103.ger.corp.intel.com ([169.254.3.49]) by IRSMSX154.ger.corp.intel.com ([169.254.12.83]) with mapi id 14.03.0319.002; Tue, 3 Oct 2017 13:12:20 +0100 From: "Ananyev, Konstantin" To: "Li, Xiaoyun" , "Richardson, Bruce" CC: "Lu, Wenzhuo" , "Zhang, Helin" , "dev@dpdk.org" Thread-Topic: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy Thread-Index: AQHTO5m/ev7m6EWN7Eqz5p6v1EkKxKLQwURAgABeHICAANm2MP//94+AgAAWbcA= Date: Tue, 3 Oct 2017 12:12:19 +0000 Message-ID: <2601191342CEEE43887BDE71AB9772585FAA362F@IRSMSX103.ger.corp.intel.com> References: <1506411689-94690-1-git-send-email-xiaoyun.li@intel.com> <1506960796-71620-1-git-send-email-xiaoyun.li@intel.com> <1506960796-71620-2-git-send-email-xiaoyun.li@intel.com> <2601191342CEEE43887BDE71AB9772585FAA2F94@IRSMSX103.ger.corp.intel.com> <2601191342CEEE43887BDE71AB9772585FAA35AA@IRSMSX103.ger.corp.intel.com> In-Reply-To: Accept-Language: en-IE, en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-titus-metadata-40: eyJDYXRlZ29yeUxhYmVscyI6IiIsIk1ldGFkYXRhIjp7Im5zIjoiaHR0cDpcL1wvd3d3LnRpdHVzLmNvbVwvbnNcL0ludGVsMyIsImlkIjoiYjBkMzkyYzgtZDVjOS00OTkwLTg3YjItMjU2MGFlYjNhNzRmIiwicHJvcHMiOlt7Im4iOiJDVFBDbGFzc2lmaWNhdGlvbiIsInZhbHMiOlt7InZhbHVlIjoiQ1RQX0lDIn1dfV19LCJTdWJqZWN0TGFiZWxzIjpbXSwiVE1DVmVyc2lvbiI6IjE2LjUuOS4zIiwiVHJ1c3RlZExhYmVsSGFzaCI6ImRBZ25yQmdWQWl5YnFpcjNQazJaOXRxVTNhTGJCUXcxRUtGMUZtWHpxc0U9In0= x-ctpclassification: CTP_IC dlp-product: dlpe-windows dlp-version: 11.0.0.116 dlp-reaction: no-action x-originating-ip: [163.33.239.182] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Subject: Re: [dpdk-dev] [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 03 Oct 2017 12:12:48 -0000 >=20 > Hi > You mean just use rte_memcpy_internal in rte_memcpy_avx2, rte_memcpy_avx5= 12? Yes, exactly and for rte_memcpy_sse() too. Basically we for rte_memcpy_avx512() we force compiler to use AVX512F path = inside rte_memcpy_iternal(), for rte_memcpy_avx2() we use AVX2 path inside rte_memcpy_internal(), etc. To do that we setup: CFLAGS_rte_memcpy_avx512f.o +=3D -mavx512f CFLAGS_rte_memcpy_avx512f.o +=3D -DRTE_MACHINE_CPUFLAG_AVX512F inside the Makefile. For rte_memcpy_avx2() we force compiler to use AVX2 path inside rte_memcpy_= internal(), etc. > But if RTE_MACHINE_CPUFLAGS_AVX2 means only whether the compiler supports= avx2, then internal would only compiled > With avx2 codes, then cannot choose other code path. What if the HW canno= t support avx2? If the HW can't support AVX2 then rte_memcpy_init() just wouldn't select rt= e_memcpy_avx2(), it would select rte_memcpy_sse() instead: if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2)) {...} - that is a runtime check that underlying HW does support AVX2. Konstantin > If RTE_MACHINE_CPUFLAGS_AVX2 means as before, suggests whether both compi= ler and HW supports avx2. Then the function > has no difference right now. > The mocro is determined at compilation time. But selection is hoped to be= at runtime. > Did I consider something wrong? >=20 > Best Regards, > Xiaoyun Li >=20 >=20 >=20 >=20 > > -----Original Message----- > > From: Ananyev, Konstantin > > Sent: Tuesday, October 3, 2017 19:16 > > To: Li, Xiaoyun ; Richardson, Bruce > > > > Cc: Lu, Wenzhuo ; Zhang, Helin > > ; dev@dpdk.org > > Subject: RE: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy > > > > Hi, > > > > > > > > Hi > > > > > > > -----Original Message----- > > > > From: Ananyev, Konstantin > > > > Sent: Tuesday, October 3, 2017 00:39 > > > > To: Li, Xiaoyun ; Richardson, Bruce > > > > > > > > Cc: Lu, Wenzhuo ; Zhang, Helin > > > > ; dev@dpdk.org > > > > Subject: RE: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > From: Li, Xiaoyun > > > > > Sent: Monday, October 2, 2017 5:13 PM > > > > > To: Ananyev, Konstantin ; > > > > > Richardson, > > > > Bruce > > > > > Cc: Lu, Wenzhuo ; Zhang, Helin > > > > ; dev@dpdk.org; Li, Xiaoyun > > > > > > > > > Subject: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy > > > > > > > > > > This patch dynamically selects functions of memcpy at run-time > > > > > based on CPU flags that current machine supports. This patch uses > > > > > function pointers which are bind to the relative functions at con= strctor > > time. > > > > > In addition, AVX512 instructions set would be compiled only if > > > > > users config it enabled and the compiler supports it. > > > > > > > > > > Signed-off-by: Xiaoyun Li > > > > > --- > > > > > v2 > > > > > * Use gcc function multi-versioning to avoid compilation issues. > > > > > * Add macros for AVX512 and AVX2. Only if users enable AVX512 and > > > > > the compiler supports it, the AVX512 codes would be compiled. Onl= y > > > > > if the compiler supports AVX2, the AVX2 codes would be compiled. > > > > > > > > > > v3 > > > > > * Reduce function calls via only keep rte_memcpy_xxx. > > > > > * Add conditions that when copy size is small, use inline code pa= th. > > > > > Otherwise, use dynamic code path. > > > > > * To support attribute target, clang version must be greater than= 3.7. > > > > > Otherwise, would choose SSE/AVX code path, the same as before. > > > > > * Move two mocro functions to the top of the code since they woul= d > > > > > be used in inline SSE/AVX and dynamic SSE/AVX codes. > > > > > > > > > > v4 > > > > > * Modify rte_memcpy.h to several .c files and modify makefiles to > > > > > compile > > > > > AVX2 and AVX512 files. > > > > > > > > Could you explain to me why instead of reusing existing rte_memcpy(= ) > > > > code to generate _sse/_avx2/ax512f flavors you keep pushing changes > > > > with 3 separate implementations? > > > > Obviously that is much more expensive in terms of maintenance and > > > > doesn't look like feasible solution to me. > > > > Is existing rte_memcpy() implementation is not good enough in terms > > > > of functionality and/or performance? > > > > If so, can you outline these problems and try to fix them first. > > > > Konstantin > > > > > > > > > > I just change many small functions to one function in those 3 separat= e > > functions. > > > > Yes, so with what you suggest we'll have 4 implementations for rte_me= mcpy > > to support. > > That's very expensive terms of maintenance and I believe totally unnece= ssary. > > > > > Because the existing codes are totally inline, including rte_memcpy() > > > itself. So the compilation will change all rte_memcpy() calls into th= e basic > > codes like xmm0=3Dxxx. > > > > > > The existing codes in this way are OK. > > > > Good. > > > > >But when run-time, it will bring lots of function calls and cause per= f > > >drop. > > > > I believe it wouldn't if we do it properly. > > All internal functions (mov16, mov32, etc.) will still be unlined by th= e > > compiler for each flavor (sse/avx2/etc.) - have a look at the patch I s= ent. > > > > Konstantin