From: "Ananyev, Konstantin"
To: "Yang, Zhiyong", Thomas Monjalon
CC: "dev@dpdk.org", "yuanhan.liu@linux.intel.com", "Richardson, Bruce", "De Lara Guarch, Pablo"
Date: Thu, 15 Dec 2016 10:53:40 +0000
Message-ID: <2601191342CEEE43887BDE71AB9772583F0EFF66@irsmsx105.ger.corp.intel.com>
Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform
List-Id: DPDK patches and discussions

Hi Zhiyong,

> -----Original Message-----
> From: Yang, Zhiyong
> Sent: Thursday, December 15, 2016 6:51 AM
> To: Yang, Zhiyong; Ananyev, Konstantin; Thomas Monjalon
> Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce; De Lara Guarch, Pablo
> Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform
>
> Hi, Thomas, Konstantin:
>
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yang, Zhiyong
> > Sent: Sunday, December 11, 2016 8:33 PM
> > To: Ananyev, Konstantin; Thomas Monjalon
> > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce; De Lara Guarch, Pablo
> > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform
> >
> > Hi, Konstantin, Bruce:
> >
> > > -----Original Message-----
> > > From: Ananyev, Konstantin
> > > Sent: Thursday, December 8, 2016 6:31 PM
> > > To: Yang, Zhiyong; Thomas Monjalon
> > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce; De Lara Guarch, Pablo
> > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform
> > >
> > > > -----Original Message-----
> > > > From: Yang, Zhiyong
> > > > Sent: Thursday, December 8, 2016 9:53 AM
> > > > To: Ananyev, Konstantin; Thomas Monjalon
> > > > Cc: dev@dpdk.org; yuanhan.liu@linux.intel.com; Richardson, Bruce; De Lara Guarch, Pablo
> > > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform
> > > >
> > > extern void *(*__rte_memset_vector)(void *s, int c, size_t n);
> > >
> > > static inline void *
> > > rte_memset_huge(void *s, int c, size_t n)
> > > {
> > >     return __rte_memset_vector(s, c, n);
> > > }
> > >
> > > static inline void *
> > > rte_memset(void *s, int c, size_t n)
> > > {
> > >     if (n < XXX)
> > >         return rte_memset_scalar(s, c, n);
> > >     else
> > >         return rte_memset_huge(s, c, n);
> > > }
> > >
> > > XXX could be either a define, or could also be a variable, so it can
> > > be set up at startup, depending on the architecture.
> > >
> > > Would that work?
> > > Konstantin
> > >
> I have implemented the code for choosing the functions at run time.
> rte_memcpy is used more frequently, so I tested it at run time.
>
> typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src, size_t n);
> extern rte_memcpy_vector_t rte_memcpy_vector;
> static inline void *
> rte_memcpy(void *dst, const void *src, size_t n)
> {
>     return rte_memcpy_vector(dst, src, n);
> }
> In order to reduce the overhead at run time,
> I assign the function address to the variable rte_memcpy_vector before main() starts.
>
> static void __attribute__((constructor))
> rte_memcpy_init(void)
> {
>     if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
>     {
>         rte_memcpy_vector = rte_memcpy_avx2;
>     }
>     else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
>     {
>         rte_memcpy_vector = rte_memcpy_sse;
>     }
>     else
>     {
>         rte_memcpy_vector = memcpy;
>     }
>
> }

I thought we had discussed a slightly different approach, in which
rte_memcpy_vector() (or rte_memset_vector()) would be called only after
some cutoff point, i.e.:

void
rte_memcpy(void *dst, const void *src, size_t len)
{
    if (len < N)
        memcpy(dst, src, len);
    else
        rte_memcpy_vector(dst, src, len);
}

If you always call rte_memcpy_vector() for every len,
the compiler will most likely always have to generate a real call
(no inlining can happen).
For small lengths, the price of the extra function call would probably
outweigh any potential gain from the SSE/AVX2 implementation.

Konstantin

> I ran the same virtio/vhost loopback tests without a NIC.
> I can see a throughput drop when choosing functions at run time,
> compared to the original code, as follows on the same platform (my machine is Haswell):
>
>   Packet size   Perf drop
>   64            -4%
>   256           -5.4%
>   1024          -5%
>   1500          -2.5%
>
> Another thing: I ran memcpy_perf_autotest. When N <= 128,
> the rte_memcpy perf gains almost disappear
> when choosing functions at run time. For other values of N, the perf gains narrow.
>
> Thanks
> Zhiyong
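
For reference, the cutoff-based dispatch described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the patch itself: every `demo_*` name and the `DEMO_MEMCPY_CUTOFF` value of 128 are hypothetical, and `demo_memcpy_wide()` is a stand-in for a real AVX2/SSE implementation. The constructor selects the stand-in unconditionally, where a real version would check CPU flags as in the quoted `rte_memcpy_init()`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cutoff in bytes; a real value would need tuning per platform. */
#define DEMO_MEMCPY_CUTOFF 128

typedef void *(*demo_memcpy_vector_t)(void *dst, const void *src, size_t n);

/* Counts calls that went through the function pointer (demo only). */
static unsigned long demo_vector_calls;

/* Stand-in for a vectorised copy such as an AVX2 or SSE variant. */
static void *
demo_memcpy_wide(void *dst, const void *src, size_t n)
{
    demo_vector_calls++;
    return memcpy(dst, src, n);
}

/* Defaults to libc memcpy until the constructor picks an implementation. */
static demo_memcpy_vector_t demo_memcpy_vector = memcpy;

/*
 * Runs before main(). A real version would check CPU flags here;
 * this sketch selects the stand-in unconditionally.
 */
static void __attribute__((constructor))
demo_memcpy_init(void)
{
    demo_memcpy_vector = demo_memcpy_wide;
}

static inline void *
demo_memcpy(void *dst, const void *src, size_t n)
{
    /* Small copies: direct memcpy() that the compiler is free to expand inline. */
    if (n < DEMO_MEMCPY_CUTOFF)
        return memcpy(dst, src, n);

    /* Large copies: one indirect call, amortised over many bytes. */
    assert(demo_memcpy_vector != NULL);
    return demo_memcpy_vector(dst, src, n);
}
```

With this shape, only copies at or above the cutoff pay for the indirect call, so the small-length path can still be inlined by the compiler.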