From: "Wang, Zhihong"
To: "EDMISON, Kelvin (Kelvin)", "Stephen Hemminger", Neil Horman
Cc: "dev@dpdk.org"
Date: Tue, 27 Jan 2015 08:22:12 +0000
Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of EDMISON, Kelvin (Kelvin)
> Sent: Friday, January 23, 2015 2:22 AM
> To: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
>
>
> On 2015-01-21, 3:54 PM, "Neil Horman" wrote:
>
> >On Wed, Jan 21, 2015 at 11:49:47AM -0800, Stephen Hemminger wrote:
> >> On Wed, 21 Jan 2015 13:26:20 +0000
> >> Bruce Richardson wrote:
> >>
> >> > On Wed, Jan 21, 2015 at 02:21:25PM +0100, Marc Sune wrote:
> >> > >
> >> > > On 21/01/15 14:02, Bruce Richardson wrote:
> >> > > >On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
> >> > > >>On 21/01/15 04:44, Wang, Zhihong wrote:
> >> > > >>>>-----Original Message-----
> >> > > >>>>From: Richardson, Bruce
> >> > > >>>>Sent: Wednesday, January 21, 2015 12:15 AM
> >> > > >>>>To: Neil Horman
> >> > > >>>>Cc: Wang, Zhihong; dev@dpdk.org
> >> > > >>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >> > > >>>>
> >> > > >>>>On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> >> > > >>>>>On Tue, Jan 20, 2015 at 03:01:44AM +0000, Wang, Zhihong wrote:
> >> > > >>>>>>>-----Original Message-----
> >> > > >>>>>>>From: Neil Horman [mailto:nhorman@tuxdriver.com]
> >> > > >>>>>>>Sent: Monday, January 19, 2015 9:02 PM
> >> > > >>>>>>>To: Wang, Zhihong
> >> > > >>>>>>>Cc: dev@dpdk.org
> >> > > >>>>>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >> > > >>>>>>>
> >> > > >>>>>>>On Mon, Jan 19, 2015 at 09:53:30AM +0800, zhihong.wang@intel.com wrote:
> >> > > >>>>>>>>This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
> >> > > >>>>>>>>It also extends memcpy test coverage with unaligned cases and more test points.
> >> > > >>>>>>>>Optimization techniques are summarized below:
> >> > > >>>>>>>>
> >> > > >>>>>>>>1. Utilize full cache bandwidth
> >> > > >>>>>>>>
> >> > > >>>>>>>>2. Enforce aligned stores
> >> > > >>>>>>>>
> >> > > >>>>>>>>3. Apply load address alignment based on architecture features
> >> > > >>>>>>>>
> >> > > >>>>>>>>4. Make load/store address available as early as possible
> >> > > >>>>>>>>
> >> > > >>>>>>>>5. General optimization techniques like inlining, branch reducing, prefetch pattern access
> >> > > >>>>>>>>
> >> > > >>>>>>>>Zhihong Wang (4):
> >> > > >>>>>>>>  Disabled VTA for memcpy test in app/test/Makefile
> >> > > >>>>>>>>  Removed unnecessary test cases in test_memcpy.c
> >> > > >>>>>>>>  Extended test coverage in test_memcpy_perf.c
> >> > > >>>>>>>>  Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
> >> > > >>>>>>>>
> >> > > >>>>>>>> app/test/Makefile                        |   6 +
> >> > > >>>>>>>> app/test/test_memcpy.c                   |  52 +-
> >> > > >>>>>>>> app/test/test_memcpy_perf.c              | 238 +++++---
> >> > > >>>>>>>> .../common/include/arch/x86/rte_memcpy.h | 664 +++++++++++++++------
> >> > > >>>>>>>> 4 files changed, 656 insertions(+), 304 deletions(-)
> >> > > >>>>>>>>
> >> > > >>>>>>>>--
> >> > > >>>>>>>>1.9.3
> >> > > >>>>>>>>
> >> > > >>>>>>>>
> >> > > >>>>>>>Are you able to compile this with gcc 4.9.2? The compilation of
> >> > > >>>>>>>test_memcpy_perf is taking forever for me. It appears hung.
> >> > > >>>>>>>Neil
> >> > > >>>>>>Neil,
> >> > > >>>>>>
> >> > > >>>>>>Thanks for reporting this!
> >> > > >>>>>>It should compile but will take quite some time if the CPU doesn't support
> >> > > >>>>>>AVX2; the reason is that:
> >> > > >>>>>>1. The SSE & AVX memcpy implementation is more complicated than the AVX2
> >> > > >>>>>>version, thus the compiler takes more time to compile and optimize
> >> > > >>>>>>2. The new test_memcpy_perf.c contains 126 constant memcpy calls for
> >> > > >>>>>>better test case coverage, that's quite a lot
> >> > > >>>>>>
> >> > > >>>>>>I've just tested this patch on an Ivy Bridge machine with GCC 4.9.2:
> >> > > >>>>>>1. The whole compile process takes 9'41" with the original
> >> > > >>>>>>test_memcpy_perf.c (63 + 63 = 126 constant memcpy calls)
> >> > > >>>>>>2. It takes only 2'41" after I reduce the constant memcpy call number
> >> > > >>>>>>to 12 + 12 = 24
> >> > > >>>>>>
> >> > > >>>>>>I'll reduce the memcpy calls in the next version of the patch.
> >> > > >>>>>>
> >> > > >>>>>ok, thank you. I'm all for optimization, but I think a compile that
> >> > > >>>>>takes almost 10 minutes for a single file is going to generate some
> >> > > >>>>>raised eyebrows when end users start tinkering with it
> >> > > >>>>>
> >> > > >>>>>Neil
> >> > > >>>>>
> >> > > >>>>>>Zhihong (John)
> >> > > >>>>>>
> >> > > >>>>Even two minutes is a very long time to compile, IMHO. The whole of DPDK
> >> > > >>>>doesn't take that long to compile right now, and that's with a couple of huge
> >> > > >>>>header files with routing tables in it. Any chance you could cut compile time
> >> > > >>>>down to a few seconds while still having reasonable tests?
> >> > > >>>>Also, when there is AVX2 present on the system, what is the compile time
> >> > > >>>>like for that code?
> >> > > >>>>
> >> > > >>>> /Bruce
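As an illustration of what "2. Enforce aligned stores" in the quoted patch
summary means in practice, here is a minimal sketch using SSE2 intrinsics.
It is not the code from the patch; the function name and the size/alignment
preconditions are assumptions made for brevity.

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: copy n bytes, where n is a multiple of 16 and dst is
 * 16-byte aligned. Loads may be unaligned, but every store is aligned, so
 * the store stream never straddles a cache line. */
static void copy_aligned_stores(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i;

    for (i = 0; i + 16 <= n; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_store_si128((__m128i *)(dst + i), chunk); /* aligned store */
    }
}

A real implementation would also have to peel off an unaligned head and tail
and specialize many constant sizes, which is part of why the code, and the
many constant-size test calls that exercise it, take so long to compile.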
> >> > > >>>Neil, Bruce,
> >> > > >>>
> >> > > >>>Some data first.
> >> > > >>>
> >> > > >>>Sandy Bridge without AVX2:
> >> > > >>>1. original w/ 10 constant memcpy: 2'25"
> >> > > >>>2. patch w/ 12 constant memcpy: 2'41"
> >> > > >>>3. patch w/ 63 constant memcpy: 9'41"
> >> > > >>>
> >> > > >>>Haswell with AVX2:
> >> > > >>>1. original w/ 10 constant memcpy: 1'57"
> >> > > >>>2. patch w/ 12 constant memcpy: 1'56"
> >> > > >>>3. patch w/ 63 constant memcpy: 3'16"
> >> > > >>>
> >> > > >>>Also, to address Bruce's question, we have to reduce the test cases to
> >> > > >>>cut down compile time, because we use:
> >> > > >>>1. intrinsics instead of assembly for better flexibility and can utilize
> >> > > >>>more compiler optimization
> >> > > >>>2. complex function body for better performance
> >> > > >>>3. inlining
> >> > > >>>This increases compile time.
> >> > > >>>But I think it'd be okay to do that as long as we can select a fair set
> >> > > >>>of test points.
> >> > > >>>
> >> > > >>>It'd be great if you could give some suggestions, say, 12 points.
> >> > > >>>
> >> > > >>>Zhihong (John)
> >> > > >>>
> >> > > >>>
> >> > > >>While I agree that in the general case these long compilation times are
> >> > > >>painful for the users, having a factor of 2-8x in memcpy operations is
> >> > > >>quite an improvement, especially in DPDK applications which (unfortunately)
> >> > > >>need to rely heavily on them -- e.g. IP fragmentation and reassembly.
> >> > > >>
> >> > > >>Why not have a fast compilation by default, and a tunable config flag to
> >> > > >>enable a highly optimized version of rte_memcpy (e.g. RTE_EAL_OPT_MEMCPY)?
> >> > > >>
> >> > > >>Marc
> >> > > >>
> >> > > >Out of interest, are these 2-8x improvements something you have benchmarked
> >> > > >in these app scenarios? [i.e. not just in micro-benchmarks].
> >> > >
> >> > > How much that micro-speedup will end up affecting the performance of the
> >> > > entire application is something I cannot say, so I agree that we should
> >> > > probably have some additional benchmarks before deciding whether it pays
> >> > > off to maintain 2 versions of rte_memcpy.
> >> > >
> >> > > There are however a bunch of possible DPDK applications that could
> >> > > potentially benefit: IP fragmentation, tunneling and specialized DPI
> >> > > applications, among others, since they involve a reasonable amount of
> >> > > memcpys per pkt. My point was, *if* it proves to be beneficial enough, why
> >> > > not have it optionally?
> >> > >
> >> > > Marc
> >> >
> >> > I agree, if it provides the speedups then we need to have it in - and quite
> >> > possibly on by default, even.
> >> >
> >> > /Bruce
> >>
> >> One issue I have is that as a vendor we need to ship one binary, not different
> >> distributions for each Intel chip variant. There is some support for multi-chip
> >> version functions but only in the latest gcc, which isn't in Debian stable. And
> >> the multi-chip version of functions is going to be more expensive than inlining.
> >> For some cases, I have seen that the overhead of fancy instructions looks good
> >> but has nasty side effects like CPU stalls and/or increased power consumption
> >> which turns off turbo boost.
> >>
> >>
> >> Distros in general have the same problem with special case optimizations.
> >>
> >What we really need is to do something like borrow the alternatives mechanism
> >from the kernel so that we can dynamically replace instructions at run time
> >based on cpu flags. That way we could make the choice at run time, and wouldn't
> >have to do a lot of special case jumping about.
> >Neil
>
> +1.
>
> I think it should be an anti-requirement that the build machine be the
> exact same chip as the deployment platform.
>
> I like the cpu flag inspection approach. It would help in the case where
> DPDK is in a VM and an odd set of CPU flags has been exposed.
>
> If that approach doesn't work though, then perhaps DPDK memcpy could go
> through benchmarking at app startup time and select the most performant
> option out of a set, like mdraid's raid6 implementation does. To give an
> example, this is what my systems print out at boot time re: raid6
> algorithm selection.
> raid6: sse2x1    3171 MB/s
> raid6: sse2x2    3925 MB/s
> raid6: sse2x4    4523 MB/s
> raid6: using algorithm sse2x4 (4523 MB/s)
>
> Regards,
>   Kelvin
>

Thanks for the proposal!

For DPDK, performance is always the most important concern. We need to utilize
new architecture features to achieve that, so a per-arch solution is necessary.
Even a few extra cycles can lead to bad performance if they're in a hot loop.
For instance, if DPDK takes 60 cycles to process a packet on average, then 3
more cycles here mean a 5% performance drop.

The dynamic solution is doable, but it comes with performance penalties, even
if they're small. It may also bring extra complexity, which can lead to
unpredictable behaviors and side effects.
For example, the dynamic solution won't have inline unrolling, which can bring
a significant performance benefit for small copies with constant length, like
eth_addr.

We can investigate the VM scenario more.

Zhihong (John)
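To illustrate the run-time selection idea discussed above (the kernel-style
alternatives mechanism / cpu flag inspection), here is a minimal sketch. It
assumes a GCC new enough to provide __builtin_cpu_supports(); the function
names are made up and the per-ISA bodies are placeholders, not DPDK's actual
rte_memcpy code.

#include <stddef.h>
#include <string.h>

/* Placeholder per-ISA implementations; real versions would contain the
 * SSE/AVX2 intrinsic bodies from the patch under discussion. */
static void *memcpy_sse(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

static void *memcpy_avx2(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Pointer patched once at startup, similar in spirit to the kernel's
 * "alternatives" mechanism mentioned in the thread. */
static void *(*memcpy_impl)(void *, const void *, size_t) = memcpy_sse;

__attribute__((constructor))
static void memcpy_select(void)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        memcpy_impl = memcpy_avx2;
}

void *dispatched_memcpy(void *dst, const void *src, size_t n)
{
    /* Every call now goes through an indirect branch; that per-call cost,
     * plus the loss of inlining, is the penalty referred to above. */
    return memcpy_impl(dst, src, n);
}

A variant of memcpy_select() could instead time each candidate on a scratch
buffer at startup and keep the fastest one, along the lines of the raid6 boot
messages quoted above.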
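Similarly, a small sketch of the inline-unrolling point about constant-length
copies (the struct and function names are illustrative, not the actual DPDK
definitions): when the copy helper is inlined and the length is a compile-time
constant, the compiler can reduce the whole call to a couple of moves, which a
run-time-dispatched function cannot match.

#include <stdint.h>
#include <string.h>

struct mac_addr {            /* illustrative 6-byte address, like eth_addr */
    uint8_t bytes[6];
};

/* Forced inlining lets the constant size reach the compiler's memcpy
 * expander; GCC/Clang then emit a 4-byte plus a 2-byte move, with no call. */
static inline __attribute__((always_inline))
void small_copy(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

void copy_mac(struct mac_addr *dst, const struct mac_addr *src)
{
    small_copy(dst->bytes, src->bytes, sizeof(src->bytes));
}

Routing the same 6-byte copy through the function pointer above would add an
indirect call and branch that dwarf the copy itself, which is the concern
about losing inline unrolling.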