From: "Yang, Zhiyong"
To: "Ananyev, Konstantin", Thomas Monjalon
CC: dev@dpdk.org, yuanhan.liu@linux.intel.com, "Richardson, Bruce", "De Lara Guarch, Pablo"
Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform
Date: Tue, 20 Dec 2016 09:31:48 +0000
In-Reply-To: <2601191342CEEE43887BDE71AB9772583F0F069A@irsmsx105.ger.corp.intel.com>
List-Id: DPDK patches and discussions <dev@dpdk.org>

Hi, Konstantin:

> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Friday, December 16, 2016 7:48 PM
> To: Yang, Zhiyong; Thomas Monjalon
> Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce; De Lara Guarch, Pablo
> Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform
>
> Hi Zhiyong,
>
> > > > > > extern void *(*__rte_memset_vector)(void *s, int c, size_t n);
> > > > > >
> > > > > > static inline void *
> > > > > > rte_memset_huge(void *s, int c, size_t n)
> > > > > > {
> > > > > > 	return __rte_memset_vector(s, c, n);
> > > > > > }
> > > > > >
> > > > > > static inline void *
> > > > > > rte_memset(void *s, int c, size_t n)
> > > > > > {
> > > > > > 	if (n < XXX)
> > > > > > 		return rte_memset_scalar(s, c, n);
> > > > > > 	else
> > > > > > 		return rte_memset_huge(s, c, n);
> > > > > > }
> > > > > >
> > > > > > XXX could be either a define, or could also be a variable, so
> > > > > > it can be set up at startup, depending on the architecture.
> > > > > >
> > > > > > Would that work?
> > > > > > Konstantin
> > > > >
> > > > I have implemented the code for choosing the function at run time.
> > > > rte_memcpy is used more frequently, so I tested with it at run time:
> > > >
> > > > typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src, size_t n);
> > > > extern rte_memcpy_vector_t rte_memcpy_vector;
> > > >
> > > > static inline void *
> > > > rte_memcpy(void *dst, const void *src, size_t n)
> > > > {
> > > > 	return rte_memcpy_vector(dst, src, n);
> > > > }
> > > >
> > > > In order to reduce the overhead at run time, I assign the function
> > > > address to rte_memcpy_vector in a constructor, before main() starts:
> > > >
> > > > static void __attribute__((constructor))
> > > > rte_memcpy_init(void)
> > > > {
> > > > 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> > > > 		rte_memcpy_vector = rte_memcpy_avx2;
> > > > 	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
> > > > 		rte_memcpy_vector = rte_memcpy_sse;
> > > > 	else
> > > > 		rte_memcpy_vector = memcpy;
> > > > }
> > >
> > > I thought we discussed a bit different approach,
> > > in which rte_memcpy_vector() (rte_memset_vector()) would be called
> > > only after some cutoff point, i.e.:
> > >
> > > void
> > > rte_memcpy(void *dst, const void *src, size_t len)
> > > {
> > > 	if (len < N)
> > > 		memcpy(dst, src, len);
> > > 	else
> > > 		rte_memcpy_vector(dst, src, len);
> > > }
> > >
> > > If you just always call rte_memcpy_vector() for every len, then it
> > > means the compiler most likely always has to generate a proper call
> > > (no inlining happening).
> > >
> > > For small lengths the price of the extra function call would probably
> > > outweigh any potential gain from the SSE/AVX2 implementation.
> > >
> > > Konstantin
> >
> > Yes, in fact, from my tests, for small lengths rte_memset is far
> > better than glibc memset; for large lengths, rte_memset is only a bit
> > better than memset, because memset uses AVX2/SSE, too. Of course, it
> > will use AVX512 on future machines.
>
> Ok, thanks for the clarification.
> From previous mails I got the wrong impression that on big lengths
> rte_memset_vector() is significantly faster than memset().
>
> > > For small lengths the price of the extra function call would probably
> > > outweigh any potential gain.
> >
> > This is the key point. I think it should include the scalar optimization,
> > not only the vector optimization.
> >
> > rte_memset is always inlined, so for small lengths it will be better,
> > whereas in some cases we are not sure that memset is inlined by the
> > compiler.
>
> Ok, so do you know in what cases memset() does not get inlined?
> Is it when the len parameter can't be precomputed by the compiler (is not
> a constant)?
>
> So to me it sounds like:
> - We don't need an optimized version of rte_memset() for big sizes.
> - Which probably means we don't need arch-specific versions of
>   rte_memset_vector() at all -
>   for small sizes (<= 32B) a scalar version would be good enough.
> - For big sizes we can just rely on memset().
> Is that so?
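Putting those points together, the shape being discussed would look roughly
like the sketch below. This is only an illustration, not the patch code: the
RTE_MEMSET_SCALAR_MAX cutoff and the byte-wise scalar loop are placeholders.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Illustrative cutoff: below this the inline scalar path is used.
 * The real value would be tuned per architecture (32B was mentioned above).
 */
#define RTE_MEMSET_SCALAR_MAX 32

/* Byte-wise fill; trivially inlined, no call overhead for tiny sizes. */
static inline void *
rte_memset_scalar(void *s, int c, size_t n)
{
	uint8_t *p = (uint8_t *)s;

	while (n--)
		*p++ = (uint8_t)c;
	return s;
}

static inline void *
rte_memset(void *s, int c, size_t n)
{
	if (n <= RTE_MEMSET_SCALAR_MAX)
		return rte_memset_scalar(s, c, n);
	/* Big sizes: fall back to libc memset(), which already uses SSE/AVX2. */
	return memset(s, c, n);
}

With this shape the small-size path never leaves the inline function, and no
arch-specific vector variant or CPU-flag dispatch is needed.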
Using memset has actually run into trouble in some cases, such as
http://dpdk.org/ml/archives/dev/2016-October/048628.html

>
> > It seems that choosing the function at run time would lose the gains.
> > The following was tested on Haswell with the patch code.
>
> Not sure what columns 2 and 3 in the table below mean?
> Konstantin

Column 1 shows the size in bytes.
Column 2 shows the rte_memset vs. memset results (ticks) in cache.
Column 3 shows the rte_memset vs. memset results (ticks) in memory.
The data was collected using rte_rdtsc(); the test can be run using
[PATCH 3/4] app/test: add performance autotest for rte_memset.
A simplified sketch of such a timing loop follows the quoted table below.

Thanks
Zhiyong

>
> > ** rte_memset() vs. memset() perf tests (C = compile-time constant) **
> >
> > ======= ======================= =======================
> > Size    rte_memset - memset     rte_memset - memset
> > (bytes) in cache (ticks)        in mem (ticks)
> > ------- ----------------------- -----------------------
> >                    32B aligned
> > ------- ----------------------- -----------------------
> > 3          3 - 8                  19 - 128
> > 4          4 - 8                  13 - 128
> > 8          2 - 7                  19 - 128
> > 9          2 - 7                  19 - 127
> > 12         2 - 7                  19 - 127
> > 17         3 - 8                  19 - 132
> > 64         3 - 8                  28 - 168
> > 128        7 - 13                 54 - 200
> > 255        8 - 20                100 - 223
> > 511       14 - 20                187 - 314
> > 1024      24 - 29                328 - 379
> > 8192     198 - 225              1829 - 2193
> >
> > Thanks
> > Zhiyong
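For reference, the measurement is roughly of the following shape. This is a
simplified sketch, not the actual autotest code from PATCH 3/4: the buffer
size, fill value and iteration count are arbitrary, and the in-cache vs.
in-memory distinction of the table above is not reproduced here. The
rte_memset() variant would be timed with the same loop, and the two averages
give one "X - Y" pair of the table.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#include <rte_cycles.h>		/* rte_rdtsc() */

#define ITERATIONS 1000

static uint8_t buf[8192];
static volatile uint8_t sink;	/* keeps the compiler from dropping the fills */

/* Average TSC ticks per memset() call of 'len' bytes. */
static uint64_t
memset_ticks(size_t len)
{
	uint64_t start, end;
	int i;

	start = rte_rdtsc();
	for (i = 0; i < ITERATIONS; i++) {
		memset(buf, i & 0xff, len);
		sink += buf[len - 1];	/* touch the result */
	}
	end = rte_rdtsc();

	return (end - start) / ITERATIONS;
}

int
main(void)
{
	static const size_t sizes[] = { 3, 4, 8, 9, 12, 17, 64, 128,
					255, 511, 1024, 8192 };
	unsigned int i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("%4zu bytes: %" PRIu64 " ticks/call\n",
		       sizes[i], memset_ticks(sizes[i]));
	return 0;
}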