From: "EDMISON, Kelvin (Kelvin)"
To: "dev@dpdk.org"
Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
Date: Thu, 22 Jan 2015 18:21:42 +0000
In-Reply-To: <20150121205404.GB32617@hmsreliant.think-freely.org>

On 2015-01-21, 3:54 PM, "Neil Horman" wrote:

>On Wed, Jan 21, 2015 at 11:49:47AM -0800, Stephen Hemminger wrote:
>> On Wed, 21 Jan 2015 13:26:20 +0000
>> Bruce Richardson wrote:
>>
>> > On Wed, Jan 21, 2015 at 02:21:25PM +0100, Marc Sune wrote:
>> > >
>> > > On 21/01/15 14:02, Bruce Richardson wrote:
>> > > >On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
>> > > >>On 21/01/15 04:44, Wang, Zhihong wrote:
>> > > >>>>-----Original Message-----
>> > > >>>>From: Richardson, Bruce
>> > > >>>>Sent: Wednesday, January 21, 2015 12:15 AM
>> > > >>>>To: Neil Horman
>> > > >>>>Cc: Wang, Zhihong; dev@dpdk.org
>> > > >>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
>> > > >>>>
>> > > >>>>On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
>> > > >>>>>On Tue, Jan 20, 2015 at 03:01:44AM +0000, Wang, Zhihong wrote:
>> > > >>>>>>>-----Original Message-----
>> > > >>>>>>>From: Neil Horman [mailto:nhorman@tuxdriver.com]
>> > > >>>>>>>Sent: Monday, January 19, 2015 9:02 PM
>> > > >>>>>>>To: Wang, Zhihong
>> > > >>>>>>>Cc: dev@dpdk.org
>> > > >>>>>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
>> > > >>>>>>>
>> > > >>>>>>>On Mon, Jan 19, 2015 at 09:53:30AM +0800, zhihong.wang@intel.com wrote:
>> > > >>>>>>>>This patch set optimizes memcpy for DPDK for both SSE and AVX
>> > > >>>>>>>>platforms.
>> > > >>>>>>>>It also extends memcpy test coverage with unaligned cases and
>> > > >>>>>>>>more test points.
>> > > >>>>>>>>Optimization techniques are summarized below:
>> > > >>>>>>>>
>> > > >>>>>>>>1. Utilize full cache bandwidth
>> > > >>>>>>>>
>> > > >>>>>>>>2. Enforce aligned stores
>> > > >>>>>>>>
>> > > >>>>>>>>3. Apply load address alignment based on architecture features
>> > > >>>>>>>>
>> > > >>>>>>>>4. Make load/store address available as early as possible
>> > > >>>>>>>>
>> > > >>>>>>>>5. General optimization techniques like inlining, branch
>> > > >>>>>>>>reducing, prefetch pattern access
>> > > >>>>>>>>
>> > > >>>>>>>>Zhihong Wang (4):
>> > > >>>>>>>>  Disabled VTA for memcpy test in app/test/Makefile
>> > > >>>>>>>>  Removed unnecessary test cases in test_memcpy.c
>> > > >>>>>>>>  Extended test coverage in test_memcpy_perf.c
>> > > >>>>>>>>  Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX
>> > > >>>>>>>>  platforms
>> > > >>>>>>>>
>> > > >>>>>>>> app/test/Makefile                        |   6 +
>> > > >>>>>>>> app/test/test_memcpy.c                   |  52 +-
>> > > >>>>>>>> app/test/test_memcpy_perf.c              | 238 +++++---
>> > > >>>>>>>> .../common/include/arch/x86/rte_memcpy.h | 664 +++++++++++++++------
>> > > >>>>>>>> 4 files changed, 656 insertions(+), 304 deletions(-)
>> > > >>>>>>>>
>> > > >>>>>>>>--
>> > > >>>>>>>>1.9.3
>> > > >>>>>>>>
>> > > >>>>>>>Are you able to compile this with gcc 4.9.2? The compilation of
>> > > >>>>>>>test_memcpy_perf is taking forever for me. It appears hung.
>> > > >>>>>>>Neil
>> > > >>>>>>Neil,
>> > > >>>>>>
>> > > >>>>>>Thanks for reporting this!
>> > > >>>>>>It should compile, but it will take quite some time if the CPU
>> > > >>>>>>doesn't support AVX2. The reason is that:
>> > > >>>>>>1. The SSE & AVX memcpy implementation is more complicated than the
>> > > >>>>>>AVX2 version, so the compiler takes more time to compile and optimize
>> > > >>>>>>2. The new test_memcpy_perf.c contains 126 constant memcpy calls for
>> > > >>>>>>better test case coverage, that's quite a lot
>> > > >>>>>>
>> > > >>>>>>I've just tested this patch on an Ivy Bridge machine with GCC 4.9.2:
>> > > >>>>>>1. The whole compile process takes 9'41" with the original
>> > > >>>>>>test_memcpy_perf.c (63 + 63 = 126 constant memcpy calls)
>> > > >>>>>>2. It takes only 2'41" after I reduce the constant memcpy call number
>> > > >>>>>>to 12 + 12 = 24
>> > > >>>>>>
>> > > >>>>>>I'll reduce the memcpy calls in the next version of the patch.
>> > > >>>>>>
>> > > >>>>>ok, thank you. I'm all for optimization, but I think a compile that
>> > > >>>>>takes almost 10 minutes for a single file is going to generate some
>> > > >>>>>raised eyebrows when end users start tinkering with it
>> > > >>>>>
>> > > >>>>>Neil
>> > > >>>>>
>> > > >>>>>>Zhihong (John)
>> > > >>>>>>
>> > > >>>>Even two minutes is a very long time to compile, IMHO. The whole of
>> > > >>>>DPDK doesn't take that long to compile right now, and that's with a
>> > > >>>>couple of huge header files with routing tables in them. Any chance you
>> > > >>>>could cut compile time down to a few seconds while still having
>> > > >>>>reasonable tests?
>> > > >>>>Also, when there is AVX2 present on the system, what is the compile
>> > > >>>>time like for that code?
>> > > >>>>
>> > > >>>> /Bruce
>> > > >>>Neil, Bruce,
>> > > >>>
>> > > >>>Some data first.
>> > > >>>
>> > > >>>Sandy Bridge without AVX2:
>> > > >>>1. original w/ 10 constant memcpy: 2'25"
>> > > >>>2. patch w/ 12 constant memcpy: 2'41"
>> > > >>>3. patch w/ 63 constant memcpy: 9'41"
>> > > >>>
>> > > >>>Haswell with AVX2:
>> > > >>>1. original w/ 10 constant memcpy: 1'57"
>> > > >>>2. patch w/ 12 constant memcpy: 1'56"
>> > > >>>3. patch w/ 63 constant memcpy: 3'16"
>> > > >>>
>> > > >>>Also, to address Bruce's question: we have to reduce the test cases to
>> > > >>>cut down compile time, because we use:
>> > > >>>1. intrinsics instead of assembly, for better flexibility and so the
>> > > >>>compiler can optimize more
>> > > >>>2. a complex function body for better performance
>> > > >>>3. inlining
>> > > >>>This increases compile time.
>> > > >>>But I think it'd be okay to do that as long as we can select a fair set
>> > > >>>of test points.
>> > > >>>
>> > > >>>It'd be great if you could give some suggestions, say, 12 points.
>> > > >>>
>> > > >>>Zhihong (John)
>> > > >>>
>> > > >>While I agree that in the general case these long compilation times are
>> > > >>painful for the users, having a factor of 2-8x speedup in memcpy
>> > > >>operations is quite an improvement, especially in DPDK applications
>> > > >>which (unfortunately) need to rely heavily on them -- e.g. IP
>> > > >>fragmentation and reassembly.
>> > > >>
>> > > >>Why not have fast compilation by default, and a tunable config flag to
>> > > >>enable a highly optimized version of rte_memcpy (e.g. RTE_EAL_OPT_MEMCPY)?
>> > > >>
>> > > >>Marc
>> > > >>
>> > > >Out of interest, are these 2-8x improvements something you have
>> > > >benchmarked in these app scenarios? [i.e. not just in micro-benchmarks].
>> > >
>> > > How much that micro-speedup will end up affecting the performance of the
>> > > entire application is something I cannot say, so I agree that we should
>> > > probably have some additional benchmarks before deciding that it pays off
>> > > to maintain 2 versions of rte_memcpy.
>> > >
>> > > There are however a bunch of possible DPDK applications that could
>> > > potentially benefit: IP fragmentation, tunneling and specialized DPI
>> > > applications, among others, since they involve a reasonable amount of
>> > > memcpys per pkt. My point was, *if* it proves beneficial enough, why not
>> > > have it as an option?
>> > >
>> > > Marc
>>
>> > I agree, if it provides the speedups then we need to have it in - and
>> > quite possibly on by default, even.
>>
>> > /Bruce
>>
>> One issue I have is that as a vendor we need to ship one binary, not
>> different distributions for each Intel chip variant. There is some support
>> for multi-chip versions of functions, but only in the latest gcc, which
>> isn't in Debian stable. And the multi-chip version of functions is going to
>> be more expensive than inlining. In some cases I have seen fancy
>> instructions that look good on paper but have nasty side effects like CPU
>> stalls and/or increased power consumption which turns off turbo boost.
>>
>> Distros in general have the same problem with special case optimizations.
>>
>What we really need is to do something like borrow the alternatives mechanism
>from the kernel so that we can dynamically replace instructions at run time
>based on cpu flags. That way we could make the choice at run time, and
>wouldn't have to do a lot of special case jumping about.
>Neil

+1.

I think it should be an anti-requirement that the build machine be the exact
same chip as the deployment platform.

I like the cpu flag inspection approach.
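
As a rough sketch (not code from this patch set), run-time selection could
look something like the following, assuming DPDK's existing
rte_cpu_get_flag_enabled() API; rte_memcpy_sse() and rte_memcpy_avx2() are
hypothetical stand-ins for the SSE and AVX2 bodies:

/*
 * Hypothetical sketch: pick a memcpy body once at startup based on
 * run-time CPU flags, so one binary can serve all chip variants.
 */
#include <stddef.h>
#include <rte_cpuflags.h>

void *rte_memcpy_sse(void *dst, const void *src, size_t n);  /* hypothetical */
void *rte_memcpy_avx2(void *dst, const void *src, size_t n); /* hypothetical */

typedef void *(*rte_memcpy_fn)(void *dst, const void *src, size_t n);

/* Resolved once at init; all later calls go through the pointer. */
static rte_memcpy_fn rte_memcpy_ptr;

static void
rte_memcpy_init(void)
{
	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
		rte_memcpy_ptr = rte_memcpy_avx2;
	else
		rte_memcpy_ptr = rte_memcpy_sse;
}

The trade-off Stephen mentions still applies, of course: the indirect call
cannot be inlined, so the dispatch overhead shows up exactly where rte_memcpy
is hottest, i.e. small copies.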
That approach would also help in the case where DPDK is in a VM and an odd
set of CPU flags has been exposed.

If that approach doesn't work, though, then perhaps DPDK memcpy could go
through a benchmarking pass at app startup time and select the most
performant option out of a set, like mdraid's raid6 implementation does. To
give an example, this is what my systems print out at boot time re: raid6
algorithm selection:

raid6: sse2x1    3171 MB/s
raid6: sse2x2    3925 MB/s
raid6: sse2x4    4523 MB/s
raid6: using algorithm sse2x4 (4523 MB/s)

Regards,
  Kelvin
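
A rough sketch of that raid6-style startup benchmark (again hypothetical, not
existing DPDK code; rte_rdtsc() is DPDK's TSC read, and the candidate bodies
are the same hypothetical stand-ins as above; a real version would also need
warm-up runs, several sizes/alignments, and would have to skip candidates
whose ISA the CPU doesn't support):

#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <rte_cycles.h>

void *rte_memcpy_sse(void *dst, const void *src, size_t n);  /* hypothetical */
void *rte_memcpy_avx2(void *dst, const void *src, size_t n); /* hypothetical */

typedef void *(*rte_memcpy_fn)(void *dst, const void *src, size_t n);

struct memcpy_candidate {
	const char *name;
	rte_memcpy_fn fn;
};

/* Time each candidate copying a fixed buffer; return the fastest. */
static rte_memcpy_fn
rte_memcpy_pick_fastest(void)
{
	static const struct memcpy_candidate cand[] = {
		{ "sse",  rte_memcpy_sse },
		{ "avx2", rte_memcpy_avx2 },
	};
	static uint8_t src[8192], dst[8192];
	uint64_t start, cycles, best_cycles = UINT64_MAX;
	rte_memcpy_fn best_fn = NULL;
	unsigned int i;
	int rep;

	for (i = 0; i < sizeof(cand) / sizeof(cand[0]); i++) {
		start = rte_rdtsc();
		for (rep = 0; rep < 1000; rep++)
			cand[i].fn(dst, src, sizeof(src));
		cycles = rte_rdtsc() - start;

		/* Report each candidate, raid6-style. */
		printf("memcpy: %-6s %" PRIu64 " cycles\n",
		       cand[i].name, cycles);

		if (cycles < best_cycles) {
			best_cycles = cycles;
			best_fn = cand[i].fn;
		}
	}
	return best_fn;
}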