From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by dpdk.org (Postfix) with ESMTP id 42FA61B1B1 for ; Mon, 2 Oct 2017 02:12:30 +0200 (CEST) Received: from orsmga001.jf.intel.com ([10.7.209.18]) by orsmga101.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 01 Oct 2017 17:12:29 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.42,466,1500966000"; d="scan'208";a="1177598932" Received: from fmsmsx103.amr.corp.intel.com ([10.18.124.201]) by orsmga001.jf.intel.com with ESMTP; 01 Oct 2017 17:12:28 -0700 Received: from fmsmsx152.amr.corp.intel.com (10.18.125.5) by FMSMSX103.amr.corp.intel.com (10.18.124.201) with Microsoft SMTP Server (TLS) id 14.3.319.2; Sun, 1 Oct 2017 17:12:28 -0700 Received: from shsmsx152.ccr.corp.intel.com (10.239.6.52) by FMSMSX152.amr.corp.intel.com (10.18.125.5) with Microsoft SMTP Server (TLS) id 14.3.319.2; Sun, 1 Oct 2017 17:12:28 -0700 Received: from shsmsx101.ccr.corp.intel.com ([169.254.1.159]) by SHSMSX152.ccr.corp.intel.com ([169.254.6.93]) with mapi id 14.03.0319.002; Mon, 2 Oct 2017 08:12:26 +0800 From: "Li, Xiaoyun" To: "Ananyev, Konstantin" , "Richardson, Bruce" CC: "Lu, Wenzhuo" , "Zhang, Helin" , "dev@dpdk.org" Thread-Topic: [PATCH v3 1/3] eal/x86: run-time dispatch over memcpy Thread-Index: AQHTNps4d1S4LinEtUS+0otIwfUeF6LPKmYAgACLkxA= Date: Mon, 2 Oct 2017 00:12:25 +0000 Message-ID: References: <1506411689-94690-1-git-send-email-xiaoyun.li@intel.com> <1506411689-94690-2-git-send-email-xiaoyun.li@intel.com> <2601191342CEEE43887BDE71AB9772585FAA2BD2@IRSMSX103.ger.corp.intel.com> In-Reply-To: <2601191342CEEE43887BDE71AB9772585FAA2BD2@IRSMSX103.ger.corp.intel.com> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: dlp-product: dlpe-windows dlp-version: 11.0.0.116 dlp-reaction: no-action x-ctpclassification: CTP_IC x-titus-metadata-40: eyJDYXRlZ29yeUxhYmVscyI6IiIsIk1ldGFkYXRhIjp7Im5zIjoiaHR0cDpcL1wvd3d3LnRpdHVzLmNvbVwvbnNcL0ludGVsMyIsImlkIjoiYWFiOTY0ZGMtZDVmYS00MzI1LTk0YTctMWEzODU4ZmY5MDE0IiwicHJvcHMiOlt7Im4iOiJDVFBDbGFzc2lmaWNhdGlvbiIsInZhbHMiOlt7InZhbHVlIjoiQ1RQX0lDIn1dfV19LCJTdWJqZWN0TGFiZWxzIjpbXSwiVE1DVmVyc2lvbiI6IjE2LjUuOS4zIiwiVHJ1c3RlZExhYmVsSGFzaCI6IndyakxKcE9ZeVdJejdSNFBlOFc0blRuWDRTdkJWdkUrN28zS2h2NmFENTg9In0= x-originating-ip: [10.239.127.40] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Subject: Re: [dpdk-dev] [PATCH v3 1/3] eal/x86: run-time dispatch over memcpy X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 02 Oct 2017 00:12:30 -0000 Hi > That means that each file with '#include will have its own > copy > of that function: > $ objdump -d x86_64-native-linuxapp-gcc/app/testpmd | grep > ':' | sort -u | wc -l > 233 > Same story for rte_memcpy_ptr and rte_memcpy_DEFAULT, etc... > Obviously we need (and want) only one copy of that stuff per binary. >=20 > > +#ifdef CC_SUPPORT_AVX2 >=20 > Why do you assume this macro will be defined? > By whom? > There is no such macro with gcc: > $ gcc -march=3Dnative -dM -E - &1 | grep AVX2 > #define __AVX2__ 1 > , and you don't define it yourself. > When building with '-march=3Dnative' on BDW only rte_memcpy_DEFAULT get > compiled. >=20 I defined it myself. But when I sort the patch, I forgot to modify the file= in this version. Sorry about that. It should be like this. To check whether the compiler supports AVX2 or AVX5= 12. diff --git a/mk/rte.cpuflags.mk b/mk/rte.cpuflags.mk index a813c91..92399ec 100644 --- a/mk/rte.cpuflags.mk +++ b/mk/rte.cpuflags.mk @@ -141,3 +141,17 @@ space:=3D $(empty) $(empty) CPUFLAGSTMP1 :=3D $(addprefix RTE_CPUFLAG_,$(CPUFLAGS)) CPUFLAGSTMP2 :=3D $(subst $(space),$(comma),$(CPUFLAGSTMP1)) CPUFLAGS_LIST :=3D -DRTE_COMPILE_TIME_CPUFLAGS=3D$(CPUFLAGSTMP2) + +# Check if the compiler supports AVX512. +CC_SUPPORT_AVX512 :=3D $(shell $(CC) -march=3Dskylake-avx512 -dM -E - < /d= ev/null 2>&1 | grep -q AVX512 && echo 1) +ifeq ($(CC_SUPPORT_AVX512),1) +ifeq ($(CONFIG_RTE_ENABLE_AVX512),y) +MACHINE_CFLAGS +=3D -DCC_SUPPORT_AVX512 +endif +endif + +# Check if the compiler supports AVX2. +CC_SUPPORT_AVX2 :=3D $(shell $(CC) -march=3Dcore-avx2 -dM -E - < /dev/null= 2>&1 | grep -q AVX2 && echo 1) +ifeq ($(CC_SUPPORT_AVX2),1) +MACHINE_CFLAGS +=3D -DCC_SUPPORT_AVX2 +endif > To summarize: as I understand the goal of that patch was > (assuming that our current rte_memcpy() implementation is good in terms o= f > both performance and functionality): > 1. Based on current rte_memcpy() implementation define 3 x86 arch specifi= c > rte_memcpy flavors: > a) rte_memcpy_SSE > b) rte_memcpy_AVX2 > c) rte_memcpy_AVX512 > 2. Select appropriate flavor based on current HW at runtime, > i.e. both 3 flavors should be present in the binary and selection sho= uld be > made > at program startup. >=20 > As I can see none of the goals was achieved with the current patch, > instead a lot of redundant code was introduced. > So I think it is NACK for the current version. > What I think need to be done instead: >=20 > 1. mv lib/librte_eal/common/include/arch/x86/rte_memcpy.h > lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h > 2. inside rte_memcpy_internal.h rename rte_memcpy() into > rte_memcpy_internal(). > 3. create 3 files: > rte_memcpy_sse.c > rte_memcpy_avx2.c > rte_memcpy_avx512.c >=20 > Inside each of these files we define corresponding rte_memcpy_xxx() > function. > I.E: > rte_memcpy_avx2.c: > .... > #ifndef RTE_MACHINE_CPUFLAG_AVX2 > #error "no avx2 support" > endif >=20 > #include "rte_memcpy_internal.h" > ... >=20 > void * > rte_memcpy_avx2(void *dst, const void *src, size_t n) > { > return rte_memcpy_internal(dst, src, n); > } >=20 > 4. Make changes inside lib/librte_eal/Makefile to ensure that each of > rte_memcpy_xxx() > get build with appropriate -march flags (I.E: avx2 with -mavx2, etc.) > You can use librte_acl/Makefile as a reference. >=20 > 5. Create rte_memcpy.c and put rte_memcpy_ptr/rte_memcpy_init() > definitions in that file. > 6. Create new rte_memcpy.h and define rte_memcpy() in it: >=20 > ... > #include > ... >=20 > +#define RTE_X86_MEMCPY_THRESH 128 > static inline void * > rte_memcpy(void *dst, const void *src, size_t n) > { > if (n <=3D RTE_X86_MEMCPY_THRESH) > return rte_memcpy_internal(dst, src, n); > else > return (*rte_memcpy_ptr)(dst, src, n); > } >=20 > 7. Test it properly - i.e. build dpdk with default target and make sure e= ach of > 3 flavors > could be selected properly at runtime based on underlying arch. >=20 > 8. As a possible future improvement - with such changes we don't need a > generic inline > implementation. We can think about creating a faster version that need to > copy > <=3D 128B. >=20 > Konstantin >=20 Will modify it in next version. Thanks.