From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Wang, Zhihong"
To: Neil Horman, Stephen Hemminger
Cc: "dev@dpdk.org"
Date: Wed, 21 Jan 2015 03:18:40 +0000
Subject: Re: [dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in
	arch/x86/rte_memcpy.h for both SSE and AVX platforms
In-Reply-To: <20150120191624.GJ18449@hmsreliant.think-freely.org>
References: <1421632414-10027-1-git-send-email-zhihong.wang@intel.com>
	<1421632414-10027-5-git-send-email-zhihong.wang@intel.com>
	<20150120091538.4c3a1363@urahara>
	<20150120191624.GJ18449@hmsreliant.think-freely.org>

> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Wednesday, January 21, 2015 3:16 AM
> To: Stephen Hemminger
> Cc: Wang, Zhihong; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in
> arch/x86/rte_memcpy.h for both SSE and AVX platforms
>
> On Tue, Jan 20, 2015 at 09:15:38AM -0800, Stephen Hemminger wrote:
> > On Mon, 19 Jan 2015 09:53:34 +0800
> > zhihong.wang@intel.com wrote:
> >
> > > Main code changes:
> > >
> > > 1. Differentiate architectural features based on CPU flags
> > >
> > >    a. Implement separated move functions for SSE/AVX/AVX2 to make
> > >       full utilization of cache bandwidth
> > >
> > >    b. Implement separated copy flow specifically optimized for
> > >       target architecture
> > >
> > > 2. Rewrite the memcpy function "rte_memcpy"
> > >
> > >    a. Add store aligning
> > >
> > >    b. Add load aligning based on architectural features
> > >
> > >    c. Put block copy loop into inline move functions for better
> > >       control of instruction order
> > >
> > >    d. Eliminate unnecessary MOVs
> > >
> > > 3. Rewrite the inline move functions
> > >
> > >    a. Add move functions for unaligned load cases
> > >
> > >    b. Change instruction order in copy loops for better pipeline
> > >       utilization
> > >
> > >    c. Use intrinsics instead of assembly code
> > >
> > > 4. Remove slow glibc call for constant copies
> > >
> > > Signed-off-by: Zhihong Wang
> >
> > Dumb question: why not fix glibc memcpy instead?
> > What is special about rte_memcpy?
> >
> >
> Fair point. Though, does glibc implement optimized memcpys per arch? Or
> do they just rely on the __builtin's from gcc to get optimized variants?
>
> Neil

Neil, Stephen,

Glibc has per-arch implementations, but they are tuned for the general
case, while rte_memcpy targets small-size, in-cache copies, which is the
DPDK case. This leads to different trade-offs and optimization techniques.

Also, glibc's updates from version to version are based on general
judgments. Roughly speaking, glibc 2.18 is tuned for Ivy Bridge and 2.20
for Haswell, though that is not fully accurate. But we need an
implementation that performs well on both Sandy Bridge and Haswell.

For instance, glibc 2.18 has a load-aligning optimization for unaligned
memcpy but doesn't use 256-bit moves, while glibc 2.20 adds 256-bit moves
but removes the load-aligning optimization. This hurts unaligned memcpy
performance a lot on architectures like Ivy Bridge. Glibc's reasoning is
that load aligning doesn't help when src/dst isn't in cache, which may be
the general case, but it is not the DPDK case.

Zhihong (John)
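
To illustrate the store/load aligning idea discussed above, here is a
minimal sketch using SSE intrinsics. It is only an illustration under
simplified assumptions (the function name and structure are made up for
the example), not the actual rte_memcpy implementation:

#include <stdint.h>
#include <string.h>
#include <emmintrin.h>

/* Sketch: align the destination (stores) to 16 bytes first, then copy
 * with unaligned 16-byte loads and aligned 16-byte stores. */
static inline void *
copy_align_sketch(void *dst, const void *src, size_t n)
{
	uint8_t *d = dst;
	const uint8_t *s = src;

	/* Head: bring the store pointer to a 16-byte boundary. */
	size_t head = (16 - ((uintptr_t)d & 15)) & 15;
	if (head > n)
		head = n;
	memcpy(d, s, head);
	d += head;
	s += head;
	n -= head;

	/* Main loop: unaligned loads, aligned stores. */
	while (n >= 16) {
		__m128i x = _mm_loadu_si128((const __m128i *)(const void *)s);
		_mm_store_si128((__m128i *)(void *)d, x);
		s += 16;
		d += 16;
		n -= 16;
	}

	/* Tail. */
	memcpy(d, s, n);
	return dst;
}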