From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mga14.intel.com (mga14.intel.com [192.55.52.115]) by dpdk.org (Postfix) with ESMTP id 6823F1B34B for ; Tue, 3 Oct 2017 14:23:05 +0200 (CEST) Received: from orsmga004.jf.intel.com ([10.7.209.38]) by fmsmga103.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 03 Oct 2017 05:23:04 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.42,474,1500966000"; d="scan'208";a="134670259" Received: from fmsmsx104.amr.corp.intel.com ([10.18.124.202]) by orsmga004.jf.intel.com with ESMTP; 03 Oct 2017 05:23:03 -0700 Received: from fmsmsx119.amr.corp.intel.com (10.18.124.207) by fmsmsx104.amr.corp.intel.com (10.18.124.202) with Microsoft SMTP Server (TLS) id 14.3.319.2; Tue, 3 Oct 2017 05:23:03 -0700 Received: from shsmsx102.ccr.corp.intel.com (10.239.4.154) by FMSMSX119.amr.corp.intel.com (10.18.124.207) with Microsoft SMTP Server (TLS) id 14.3.319.2; Tue, 3 Oct 2017 05:23:03 -0700 Received: from shsmsx101.ccr.corp.intel.com ([169.254.1.159]) by shsmsx102.ccr.corp.intel.com ([169.254.2.175]) with mapi id 14.03.0319.002; Tue, 3 Oct 2017 20:23:01 +0800 From: "Li, Xiaoyun" To: "Ananyev, Konstantin" , "Richardson, Bruce" CC: "Lu, Wenzhuo" , "Zhang, Helin" , "dev@dpdk.org" Thread-Topic: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy Thread-Index: AQHTO5m58EawVo0vnkiZ+SCua4XO0qLQPJeAgADxDECAAEcFgIAAirkQ//+FB4CAAIjxgA== Date: Tue, 3 Oct 2017 12:23:00 +0000 Message-ID: References: <1506411689-94690-1-git-send-email-xiaoyun.li@intel.com> <1506960796-71620-1-git-send-email-xiaoyun.li@intel.com> <1506960796-71620-2-git-send-email-xiaoyun.li@intel.com> <2601191342CEEE43887BDE71AB9772585FAA2F94@IRSMSX103.ger.corp.intel.com> <2601191342CEEE43887BDE71AB9772585FAA35AA@IRSMSX103.ger.corp.intel.com> <2601191342CEEE43887BDE71AB9772585FAA362F@IRSMSX103.ger.corp.intel.com> In-Reply-To: <2601191342CEEE43887BDE71AB9772585FAA362F@IRSMSX103.ger.corp.intel.com> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: dlp-product: dlpe-windows dlp-version: 11.0.0.116 dlp-reaction: no-action x-ctpclassification: CTP_IC x-titus-metadata-40: eyJDYXRlZ29yeUxhYmVscyI6IiIsIk1ldGFkYXRhIjp7Im5zIjoiaHR0cDpcL1wvd3d3LnRpdHVzLmNvbVwvbnNcL0ludGVsMyIsImlkIjoiYjBkMzkyYzgtZDVjOS00OTkwLTg3YjItMjU2MGFlYjNhNzRmIiwicHJvcHMiOlt7Im4iOiJDVFBDbGFzc2lmaWNhdGlvbiIsInZhbHMiOlt7InZhbHVlIjoiQ1RQX0lDIn1dfV19LCJTdWJqZWN0TGFiZWxzIjpbXSwiVE1DVmVyc2lvbiI6IjE2LjUuOS4zIiwiVHJ1c3RlZExhYmVsSGFzaCI6ImRBZ25yQmdWQWl5YnFpcjNQazJaOXRxVTNhTGJCUXcxRUtGMUZtWHpxc0U9In0= x-originating-ip: [10.239.127.40] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Subject: Re: [dpdk-dev] [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 03 Oct 2017 12:23:07 -0000 OK. Got it. Thanks! > -----Original Message----- > From: Ananyev, Konstantin > Sent: Tuesday, October 3, 2017 20:12 > To: Li, Xiaoyun ; Richardson, Bruce > > Cc: Lu, Wenzhuo ; Zhang, Helin > ; dev@dpdk.org > Subject: RE: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy >=20 >=20 >=20 > > > > Hi > > You mean just use rte_memcpy_internal in rte_memcpy_avx2, > rte_memcpy_avx512? >=20 > Yes, exactly and for rte_memcpy_sse() too. > Basically we for rte_memcpy_avx512() we force compiler to use AVX512F > path inside rte_memcpy_iternal(), for rte_memcpy_avx2() we use AVX2 path > inside rte_memcpy_internal(), etc. > To do that we setup: > CFLAGS_rte_memcpy_avx512f.o +=3D -mavx512f > CFLAGS_rte_memcpy_avx512f.o +=3D -DRTE_MACHINE_CPUFLAG_AVX512F > inside the Makefile. >=20 > For rte_memcpy_avx2() we force compiler to use AVX2 path inside > rte_memcpy_internal(), etc. >=20 > > But if RTE_MACHINE_CPUFLAGS_AVX2 means only whether the compiler > > supports avx2, then internal would only compiled With avx2 codes, then > cannot choose other code path. What if the HW cannot support avx2? >=20 > If the HW can't support AVX2 then rte_memcpy_init() just wouldn't select > rte_memcpy_avx2(), it would select rte_memcpy_sse() instead: >=20 > if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2)) {...} - that is a runtime > check that underlying HW does support AVX2. >=20 > Konstantin >=20 > > If RTE_MACHINE_CPUFLAGS_AVX2 means as before, suggests whether > both > > compiler and HW supports avx2. Then the function has no difference righ= t > now. > > The mocro is determined at compilation time. But selection is hoped to = be > at runtime. > > Did I consider something wrong? > > > > Best Regards, > > Xiaoyun Li > > > > > > > > > > > -----Original Message----- > > > From: Ananyev, Konstantin > > > Sent: Tuesday, October 3, 2017 19:16 > > > To: Li, Xiaoyun ; Richardson, Bruce > > > > > > Cc: Lu, Wenzhuo ; Zhang, Helin > > > ; dev@dpdk.org > > > Subject: RE: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy > > > > > > Hi, > > > > > > > > > > > Hi > > > > > > > > > -----Original Message----- > > > > > From: Ananyev, Konstantin > > > > > Sent: Tuesday, October 3, 2017 00:39 > > > > > To: Li, Xiaoyun ; Richardson, Bruce > > > > > > > > > > Cc: Lu, Wenzhuo ; Zhang, Helin > > > > > ; dev@dpdk.org > > > > > Subject: RE: [PATCH v4 1/3] eal/x86: run-time dispatch over > > > > > memcpy > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > From: Li, Xiaoyun > > > > > > Sent: Monday, October 2, 2017 5:13 PM > > > > > > To: Ananyev, Konstantin ; > > > > > > Richardson, > > > > > Bruce > > > > > > Cc: Lu, Wenzhuo ; Zhang, Helin > > > > > ; dev@dpdk.org; Li, Xiaoyun > > > > > > > > > > > Subject: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy > > > > > > > > > > > > This patch dynamically selects functions of memcpy at run-time > > > > > > based on CPU flags that current machine supports. This patch > > > > > > uses function pointers which are bind to the relative > > > > > > functions at constrctor > > > time. > > > > > > In addition, AVX512 instructions set would be compiled only if > > > > > > users config it enabled and the compiler supports it. > > > > > > > > > > > > Signed-off-by: Xiaoyun Li > > > > > > --- > > > > > > v2 > > > > > > * Use gcc function multi-versioning to avoid compilation issues= . > > > > > > * Add macros for AVX512 and AVX2. Only if users enable AVX512 > > > > > > and the compiler supports it, the AVX512 codes would be > > > > > > compiled. Only if the compiler supports AVX2, the AVX2 codes > would be compiled. > > > > > > > > > > > > v3 > > > > > > * Reduce function calls via only keep rte_memcpy_xxx. > > > > > > * Add conditions that when copy size is small, use inline code = path. > > > > > > Otherwise, use dynamic code path. > > > > > > * To support attribute target, clang version must be greater th= an 3.7. > > > > > > Otherwise, would choose SSE/AVX code path, the same as before. > > > > > > * Move two mocro functions to the top of the code since they > > > > > > would be used in inline SSE/AVX and dynamic SSE/AVX codes. > > > > > > > > > > > > v4 > > > > > > * Modify rte_memcpy.h to several .c files and modify makefiles > > > > > > to compile > > > > > > AVX2 and AVX512 files. > > > > > > > > > > Could you explain to me why instead of reusing existing > > > > > rte_memcpy() code to generate _sse/_avx2/ax512f flavors you keep > > > > > pushing changes with 3 separate implementations? > > > > > Obviously that is much more expensive in terms of maintenance > > > > > and doesn't look like feasible solution to me. > > > > > Is existing rte_memcpy() implementation is not good enough in > > > > > terms of functionality and/or performance? > > > > > If so, can you outline these problems and try to fix them first. > > > > > Konstantin > > > > > > > > > > > > > I just change many small functions to one function in those 3 > > > > separate > > > functions. > > > > > > Yes, so with what you suggest we'll have 4 implementations for > > > rte_memcpy to support. > > > That's very expensive terms of maintenance and I believe totally > unnecessary. > > > > > > > Because the existing codes are totally inline, including > > > > rte_memcpy() itself. So the compilation will change all > > > > rte_memcpy() calls into the basic > > > codes like xmm0=3Dxxx. > > > > > > > > The existing codes in this way are OK. > > > > > > Good. > > > > > > >But when run-time, it will bring lots of function calls and cause > > > >perf drop. > > > > > > I believe it wouldn't if we do it properly. > > > All internal functions (mov16, mov32, etc.) will still be unlined by > > > the compiler for each flavor (sse/avx2/etc.) - have a look at the pat= ch I > sent. > > > > > > Konstantin