From: "Ananyev, Konstantin"
To: "Yang, Zhiyong", Thomas Monjalon
Cc: "dev@dpdk.org", "yuanhan.liu@linux.intel.com", "Richardson, Bruce", "De Lara Guarch, Pablo"
Date: Fri, 16 Dec 2016 11:47:37 +0000
Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform

Hi Zhiyong,

> > > > > extern void *(*__rte_memset_vector)(void *s, int c, size_t n);
> > > > >
> > > > > static inline void *
> > > > > rte_memset_huge(void *s, int c, size_t n)
> > > > > {
> > > > > 	return __rte_memset_vector(s, c, n);
> > > > > }
> > > > >
> > > > > static inline void *
> > > > > rte_memset(void *s, int c, size_t n)
> > > > > {
> > > > > 	if (n < XXX)
> > > > > 		return rte_memset_scalar(s, c, n);
> > > > > 	else
> > > > > 		return rte_memset_huge(s, c, n);
> > > > > }
> > > > >
> > > > > XXX could be either a define, or could also be a variable, so it
> > > > > can be set up at startup, depending on the architecture.
> > > > >
> > > > > Would that work?
> > > > > Konstantin
> > >
> > > I have implemented the code for choosing the functions at run time.
> > > rte_memcpy is used more frequently, so I tested it at run time.
> > >
> > > typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src, size_t n);
> > > extern rte_memcpy_vector_t rte_memcpy_vector;
> > >
> > > static inline void *
> > > rte_memcpy(void *dst, const void *src, size_t n)
> > > {
> > > 	return rte_memcpy_vector(dst, src, n);
> > > }
> > >
> > > In order to reduce the overhead at run time, I assign the function
> > > address to the variable rte_memcpy_vector before main() starts, to
> > > initialize it.
> > >
> > > static void __attribute__((constructor))
> > > rte_memcpy_init(void)
> > > {
> > > 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2)) {
> > > 		rte_memcpy_vector = rte_memcpy_avx2;
> > > 	} else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) {
> > > 		rte_memcpy_vector = rte_memcpy_sse;
> > > 	} else {
> > > 		rte_memcpy_vector = memcpy;
> > > 	}
> > > }
> >
> > I thought we discussed a bit different approach,
> > in which rte_memcpy_vector() (rte_memset_vector) would be called only
> > after some cutoff point, i.e.:
> >
> > void
> > rte_memcpy(void *dst, const void *src, size_t len)
> > {
> > 	if (len < N)
> > 		memcpy(dst, src, len);
> > 	else
> > 		rte_memcpy_vector(dst, src, len);
> > }
> >
> > If you just always call rte_memcpy_vector() for every len, then it means
> > that the compiler most likely always has to generate a proper call (no
> > inlining happening).
> >
> > For small lengths the price of the extra function call would probably
> > outweigh any potential gain from an SSE/AVX2 implementation.
> >
> > Konstantin
>
> Yes, in fact, from my tests, for small lengths rte_memset is far better
> than glibc memset; for large lengths, rte_memset is only a bit better than
> memset, because memset uses AVX2/SSE too. Of course, it will use AVX512 on
> future machines.

Ok, thanks for the clarification.
From previous mails I got a wrong impression that on big lengths
rte_memset_vector() is significantly faster than memset().

> > For small lengths the price of the extra function call would probably
> > outweigh any potential gain.
>
> This is the key point. I think it should include the scalar optimization,
> not only the vector optimization.
>
> The value of rte_memset is that it is always inlined, and for small
> lengths it will be better, since in some cases we are not sure that
> memset is always inlined by the compiler.

Ok, so do you know in what cases memset() does not get inlined?
Is it when the len parameter can't be precomputed by the compiler
(is not a constant)?

So to me it sounds like:
- We don't need an optimized version of rte_memset() for big sizes.
- Which probably means we don't need arch-specific versions of
  rte_memset_vector() at all.
- For small sizes (<= 32B) a scalar version would be good enough.
- For big sizes we can just rely on memset().
Is that so? (A sketch of this combined approach is appended below.)

> It seems that choosing the function at run time will lose the gains.
> The following is tested on Haswell with the patch code.

Not sure what columns 2 and 3 in the table below mean?
Konstantin

> ** rte_memset() - memset perf tests
>    (C = compile-time constant) **
>
> ========  =================  =================
>    Size    memset in cache      memset in mem
> (bytes)            (ticks)            (ticks)
> --------  -----------------  -----------------
> ================ 32B aligned =================
>       3             3 -   8         19 -  128
>       4             4 -   8         13 -  128
>       8             2 -   7         19 -  128
>       9             2 -   7         19 -  127
>      12             2 -   7         19 -  127
>      17             3 -   8         19 -  132
>      64             3 -   8         28 -  168
>     128             7 -  13         54 -  200
>     255             8 -  20        100 -  223
>     511            14 -  20        187 -  314
>    1024            24 -  29        328 -  379
>    8192           198 - 225       1829 - 2193
>
> Thanks
> Zhiyong
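
[Editor's sketch] For illustration, here is a minimal sketch of the combined
approach summarized above: an always-inlined rte_memset() that handles small
sizes with a plain scalar loop and falls back to libc memset() for everything
else. The 32-byte cutoff and the helper name rte_memset_scalar() are
assumptions made for this sketch, not part of the posted patch.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Cutoff below which the scalar path is used; 32 bytes is an assumption. */
#define RTE_MEMSET_SCALAR_MAX 32

/* Simple byte-at-a-time fill; the compiler is free to unroll/vectorize it. */
static inline void *
rte_memset_scalar(void *s, int c, size_t n)
{
	uint8_t *p = (uint8_t *)s;
	size_t i;

	for (i = 0; i != n; i++)
		p[i] = (uint8_t)c;
	return s;
}

/* Always-inlined entry point: scalar loop for small sizes,
 * plain libc memset() for everything else. */
static inline void *
rte_memset(void *s, int c, size_t n)
{
	if (n <= RTE_MEMSET_SCALAR_MAX)
		return rte_memset_scalar(s, c, n);
	return memset(s, c, n);
}

Since both functions are static inline and the cutoff is a compile-time
constant, a caller with a constant small size should see the branch and the
loop folded away, while large sizes keep whatever SSE/AVX2/AVX512 path the
libc memset() already provides.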