From: "Wang, Zhihong"
To: "EDMISON, Kelvin (Kelvin)", "Stephen Hemminger", Neil Horman
Cc: "dev@dpdk.org"
Date: Thu, 29 Jan 2015 01:53:32 +0000
Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

> -----Original Message-----
> From: EDMISON, Kelvin (Kelvin) [mailto:kelvin.edmison@alcatel-lucent.com]
> Sent: Thursday, January 29, 2015 5:48 AM
> To: Wang, Zhihong; Stephen Hemminger; Neil Horman
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
>
> On 2015-01-27, 3:22 AM, "Wang, Zhihong" wrote:
>
> >> -----Original Message-----
> >> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of EDMISON, Kelvin
> >> (Kelvin)
> >> Sent: Friday, January 23, 2015 2:22 AM
> >> To: dev@dpdk.org
> >> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >>
> >> On 2015-01-21, 3:54 PM, "Neil Horman" wrote:
> >>
> >> >On Wed, Jan 21, 2015 at 11:49:47AM -0800, Stephen Hemminger wrote:
> >> >> On Wed, 21 Jan 2015 13:26:20 +0000 Bruce Richardson wrote:
> >> >> > [..trim...]
> >> >> One issue I have is that as a vendor we need to ship one binary,
> >> >> not different distributions for each Intel chip variant. There is
> >> >> some support for multi-chip version functions but only in the
> >> >> latest GCC, which isn't in Debian stable. And the multi-chip
> >> >> version of functions is going to be more expensive than inlining.
> >> >> For some cases, I have seen that the overhead of fancy
> >> >> instructions looks good but has nasty side effects like CPU stall
> >> >> and/or increased power consumption which turns off turbo boost.
> >> >>
> >> >> Distros in general have the same problem with special case
> >> >> optimizations.
> >> >>
> >> >What we really need is to do something like borrow the alternatives
> >> >mechanism from the kernel so that we can dynamically replace
> >> >instructions at run time based on cpu flags.
> >> >That way we could make the choice at run time, and wouldn't have to
> >> >do a lot of special case jumping about.
> >> >Neil
> >>
> >> +1.
> >>
> >> I think it should be an anti-requirement that the build machine be
> >> the exact same chip as the deployment platform.
> >>
> >> I like the cpu flag inspection approach. It would help in the case
> >> where DPDK is in a VM and an odd set of CPU flags has been exposed.
> >>
> >> If that approach doesn't work though, then perhaps DPDK memcpy could
> >> go through a benchmarking at app startup time and select the most
> >> performant option out of a set, like mdraid's raid6 implementation
> >> does. To give an example, this is what my systems print out at boot
> >> time re: raid6 algorithm selection.
> >> raid6: sse2x1 3171 MB/s
> >> raid6: sse2x2 3925 MB/s
> >> raid6: sse2x4 4523 MB/s
> >> raid6: using algorithm sse2x4 (4523 MB/s)
> >>
> >> Regards,
> >> Kelvin
> >>
> >
> >Thanks for the proposal!
> >
> >For DPDK, performance is always the most important concern. We need to
> >utilize new architecture features to achieve that, so a solution per
> >arch is necessary.
> >Even a few extra cycles can lead to bad performance if they're in a hot
> >loop.
> >For instance, let's assume DPDK takes 60 cycles to process a packet on
> >average, then 3 more cycles here means a 5% performance drop.
> >
> >The dynamic solution is doable but with performance penalties, even if
> >they could be small. Also it may bring extra complexity, which can lead
> >to unpredictable behaviors and side effects.
> >For example, the dynamic solution won't have inline unrolling, which
> >can bring significant performance benefit for small copies with
> >constant length, like eth_addr.
> >
> >We can investigate the VM scenario more.
> >
> >Zhihong (John)
>
> John,
>
> Thanks for taking the time to answer my newbie question. I deeply
> appreciate the attention paid to performance in DPDK. I have a follow-up
> though.
>
> I'm trying to figure out what requirements this approach creates for the
> software build environment. If we want to build optimized versions for
> Haswell, Ivy Bridge, Sandy Bridge, etc, does this mean that we must have
> one of each micro-architecture available for running the builds, or is
> there a way of cross-compiling for all micro-architectures from just one
> build environment?
>
> Thanks,
> Kelvin
>

I'm not an expert in this, but here are some facts based on my tests:

The compile process depends on the compiler and the library versions, not
on the CPU of the build machine. So even on a machine that doesn't support
the necessary ISA, the code should still compile as long as gcc, glibc,
etc. have the support. You'll only get "Illegal instruction" when you try
to launch the compiled binary on that machine.

Therefore, if there's a way (worst case scenario: change the flags
manually) to make the DPDK build process think that it's on a Haswell
machine, it will produce Haswell binaries.

Zhihong (John)
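
To illustrate the two points in this thread (a binary built for a newer ISA compiles anywhere but faults on an older CPU, and a run-time check avoids that at the price of an indirect call), here is a minimal sketch. It assumes GCC or Clang on x86, where the `__builtin_cpu_init` and `__builtin_cpu_supports` builtins are available. The names `copy_fn`, `copy_generic`, `copy_avx2`, `cpu_has_avx2`, `select_copy` and `copy_works` are hypothetical and are not DPDK APIs. Both copy variants simply call `memcpy` as placeholders; a real build would compile the AVX2 variant in a separate file with an appropriate `-march=` flag.

```c
#include <stddef.h>
#include <string.h>

/* Signature shared by every copy variant in this sketch (hypothetical). */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Placeholder for a baseline copy that runs on any x86 CPU. */
static void *copy_generic(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Placeholder for an AVX2-tuned copy. In a real build this would live in
 * a file compiled with AVX2 enabled; here it only marks the dispatch slot. */
static void *copy_avx2(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Ask the running CPU (not the build machine) whether AVX2 is usable.
 * Returns 0 on compilers or architectures without the x86 builtins. */
static int cpu_has_avx2(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? 1 : 0;
#else
    return 0;
#endif
}

/* Pick a variant once, at startup, from the detected CPU flag. */
static copy_fn select_copy(int has_avx2)
{
    return has_avx2 ? copy_avx2 : copy_generic;
}

/* Small self-check: copy a MAC-address-sized string and compare. */
static int copy_works(copy_fn fn)
{
    const char src[] = "00:11:22:33:44:55";
    char dst[sizeof(src)] = {0};

    fn(dst, src, sizeof(src));
    return memcmp(dst, src, sizeof(src)) == 0;
}
```

A caller would resolve the pointer once, for example `copy_fn copy = select_copy(cpu_has_avx2());`, and use `copy(dst, src, n)` afterwards. Because the call goes through a pointer, the compiler cannot inline or unroll it for small constant-length copies, which is the performance cost described earlier in the thread.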