From: Neil Horman
To: Bruce Richardson
Cc: dev@dpdk.org
Date: Thu, 31 Jul 2014 15:01:17 -0400
Subject: Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
Message-ID: <20140731190117.GD20718@hmsreliant.think-freely.org>
In-Reply-To: <20140731183631.GC6420@localhost.localdomain>
References: <1406665466-29654-1-git-send-email-nhorman@tuxdriver.com>
 <20140730210920.GB6420@localhost.localdomain>
 <20140731131351.GA20718@hmsreliant.think-freely.org>
 <5766264.li3nkTmgY6@xps13>
 <20140731143228.GB20718@hmsreliant.think-freely.org>
 <20140731181032.GC20718@hmsreliant.think-freely.org>
 <20140731183631.GC6420@localhost.localdomain>
List-Id: patches and discussions about DPDK

On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote:
> Thu, Jul 31, 2014 at 02:10:32PM -0400, Neil Horman wrote:
> > On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote:
> > > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote:
> > > > 2014-07-31 09:13, Neil Horman:
> > > > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote:
> > > > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote:
> > > > > > > On Wed,
> > > > > > > Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote:
> > > > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote:
> > > > > > > > > Hey all-
> > > > > > > > > I've been trying to update the fedora dpdk package to support VFIO
> > > > > > > > > enabled drivers and ran into a problem in which ixgbe didn't compile because the
> > > > > > > > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the
> > > > > > > > > default config I have. I tried to remedy this by replacing the intrinsics with
> > > > > > > > > the __builtin macros, but it was pointed out (correctly), that this doesn't work
> > > > > > > > > properly. So this is my second attempt, which I actually like a bit better. I
> > > > > > > > > noted that code that uses intrinsics (ixgbe and the acl library), don't need to
> > > > > > > > > have those instructions turned on build-wide. Rather, we can just enable the
> > > > > > > > > instructions in the specific code we want to build with support for that, and
> > > > > > > > > test for instruction support dynamically at run time. This allows me to build
> > > > > > > > > the dpdk for a generic platform, but in such a way that some optimizations can
> > > > > > > > > be used if the executing cpu supports them at run time.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Neil Horman
> > > > > > > > > CC: Thomas Monjalon
> > > > > > > > >
> > > > > > > > I'd prefer if a solution could be found based off your original patch
> > > > > > > > set, as it gives us more chance to deprecate the older code paths in
> > > > > > > > future. Looking at the Intel Intrinsics Guide site online, it shows that
> > > > > > > > the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x,
> > > > > > > > and so should be available on all 64-bit systems, I believe. The
> > > > > > > > popcount intrinsic is newer, but it's a much more basic instruction so
> > > > > > > > hopefully the __builtin should work for that.
> > > > > > > >
> > > > > > > Yes, but as I look at it, that's somewhat counter to my goal, which is to offer
> > > > > > > accelerated code paths on systems that can make use of it at run time. If we
> > > > > > > use the __builtin compiler functions, we will either:
> > > > > > >
> > > > > > > 1) Build those code paths with advanced instructions that won't work on older
> > > > > > > systems (i.e. crash)
> > > > > > >
> > > > > > > 2) Build those code paths with less advanced instructions, meaning that we won't
> > > > > > > speed up execution on systems that are capable of using the more advanced
> > > > > > > instructions.
> > > > > > >
> > > > > > > Using this run time check, we can, at least in these situations, make use of the
> > > > > > > accelerated paths when the instructions are available, and ignore them when
> > > > > > > they're not, at run time.
> > > > > > >
> > > > > > > What would be ideal, would be an alternative type macro, like the linux kernel
> > > > > > > employs, but implementing that would require some pretty significant work and
> > > > > > > testing. This seems like a much simpler approach.
> > > >
> > > > [...]
> > > >
> > > > > Now, a macro that selected an instruction optimized or generic path is fine, as
> > > > > long as it can happen at run time. The Linux kernel has such a feature, called
> > > > > alternatives. But it's a complex subsystem that does run time replacement of
> > > > > instructions based on cpu feature flags. It would be great to have in the DPDK,
> > > > > but it's a significant code base and difficult to maintain, which goes against
> > > > > your desire to reduce code.
> > > >
> > > > [...]
> > > >
> > > > > > Even though the code is written using intrinsics which correspond to SSE
> > > > > > operations, the compiler is free to use AVX instructions where necessary
> > > > > Not if you use the default machine target.
> > > > >
> > > > > > to improve performance.
> > > > > > Therefore, if we go down this road, we need to
> > > > > > look to compile up the code for all microarchitectures, rather than just
> > > > > > assuming that we will get equivalent performance to "native" by turning
> > > > > > on the instruction set indicated by the primitives in the code. This is
> > > > > No, you compile for the least common denominator system, and enable more
> > > > > performant paths opportunistically as run time checks allow.
> > > > >
> > > > > > where having one codepath recompiled multiple times will work far better
> > > > > > than having multiple code paths.
> > > > > Only if your only concern is performance. As noted above, my goal is more
> > > > > than just performance, it's compatibility across systems. Multiple builds for
> > > > > multiple cpu flag availability is simply a non-starter for a generic
> > > > > distribution.
> > > >
> > > > Neil, we are mixing 2 different problems here.
> > > > 1) we have to fix default build (without SSE-4.2)
> > > That's nothing to fix, that's a configuration issue. Just build for a lesser
> > > machine. I've already done that in the fedora build, using the default machine
> > > target. What exactly is missing from that?
> > >
> > Re-reading this, I'm wondering if I missed what you were trying to say, if so I
> > apologize. Were you trying to assert that the right thing to do here is to
> > adjust the ixgbe and acl code paths to not use the sse4.2 intrinsics so that
> > they are buildable on the default platform? If so, I agree, that's a nice idea,
> > and am supportive of it, though I don't think that fully solves the problem. In
> > the case of the ixgbe pmd, what we have is 2 code paths, a generic code path,
> > and an optimized code path using sse4.2 intrinsics. In this case, I don't think
> > there's anything to fix, in that I'm fine with the optimized path needing sse4.2
> > to execute.
> > There I just want to be able to do a run time check and use the
> > optimized path if the cpu supports it, and just use the default path otherwise.
> > In effect we already have exactly what you are looking for there.
> >
> > As far as the ACL library goes, yes, that's more complex. The use of sse4.2
> > intrinsics there is done throughout the code, so there's no easy way to select a
> > path. We're just left with either using the code or returning an error at run
> > time, as my patch does. Certainly we can build some macros that either use the
> > intrinsics for sse4.2 or code up some C-level variants of those instructions
> > based on generic code, and build for the least common denominator, or compile the
> > code twice (once without sse4.2 support, and once with), and do a runtime
> > selection between the two. Either way, that's going to be a useful, though
> > significant effort.
>
> I think a good first step here that I can't see anyone objecting to is
> to enable the ixgbe driver to use the vector code path for a generic
> x86_64 build. I've run a quick test here, and changing "_mm_popcnt_u64"
> to "__builtin_popcountll" [and the include from nmmintrin to tmmintrin]
> allows a compile for machine type default, and testpmd can still forward
> packets at a good rate (roughly perf down about 10% vs native compile on
> SNB).
> The ACL is a tougher nut to crack, but anyone see any issues with that
> two-line change to ixgbe_rxtx_vec.c? [Neil, since you started the patch
> set thread, do you want to submit an official patch here, or would you prefer I
> do so?]
>
I'm happy to do so, though a 10% performance degradation vs. using the sse4.2
instructions in that path seems significant, doesn't it? Given that performance
delta, it seems like it would still be preferable to have a path that used the
sse4.2 instructions when they're available.
Or am I misreading what you mean when you say down 10%?

Neil
> >
> > > > 2) we could try to have performance with default build
> > > >
> > > Yes, we can, that's what this patch does. It doesn't address every code path,
> > > no, but it addresses two paths that are low hanging fruit for doing so, and we
> > > can incrementally build on that
> > >
> > > > Please, let's focus on the first item and we could discuss about performance
> > > > later. Having some different code path chosen at runtime is a big rework and
> > > > implies changing the compilation model (RFC welcome).
> > > >
> > Even if I misinterpreted your statement above, I'm still not sure why you're
> > asserting this. Fixing the build to work with the default target machine is
> > good, and should be undertaken, and I'll happily do so, but why reject the
> > solution in front of you to wait for it? Even if I write macros to fix up the
> > ACL library, I'd still like to be able to do a run time check and select the
> > optimized version or the generic version based on cpu support. Just doing a
> > compile time check to determine if sse4.2 is available really isn't going to cut
> > it for me, as I don't want the fedora dpdk to have pessimal performance if it
> > doesn't have to.
> >
> > Regards
> > Neil
> >
>
> With regards to the general approach for runtime detection of software
> functions, I wonder if something like this can be handled by the
> packaging system? Is it possible to ship out a set of shared libs
> compiled up for different instruction sets, and then at rpm install
> time, symlink the appropriate library? This would push the whole issue
> of detection of code paths outside of code, work across all our
> libraries and ensure each user got the best performance they could get
> from a binary?
> Has something like this been done before?
> The building of all the
> libraries could be scripted easily enough, just do multiple builds using
> different EXTRA_CFLAGS each time, and move and rename the .so's after
> each run.
>
> /Bruce
>
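
For reference, the run-time selection argued for in this thread can be sketched
in a few lines of C. This is only an illustration, not code from the patch set:
the names popcount_generic, popcount_hw and select_popcount are hypothetical,
and it assumes GCC (or a compatible compiler) on x86_64, relying on the
per-function target attribute and __builtin_cpu_supports(). The file as a whole
is built for the default machine, only popcount_hw is compiled with POPCNT
enabled, and the choice between the two is made once at run time.

```c
#include <stdint.h>

/* Portable bit count: builds and runs on any baseline target. */
static unsigned popcount_generic(uint64_t v)
{
    unsigned n = 0;

    while (v) {
        v &= v - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

#if defined(__GNUC__) && defined(__x86_64__)
/*
 * Assumption: GCC-style target attribute. Only this function is compiled
 * with POPCNT enabled; the rest of the file keeps the default machine flags.
 */
__attribute__((target("popcnt")))
static unsigned popcount_hw(uint64_t v)
{
    return (unsigned)__builtin_popcountll(v);
}
#endif

typedef unsigned (*popcount_fn)(uint64_t);

/* Hypothetical helper: pick an implementation based on the executing cpu. */
static popcount_fn select_popcount(void)
{
#if defined(__GNUC__) && defined(__x86_64__)
    __builtin_cpu_init(); /* populate the cpu feature data before querying */
    if (__builtin_cpu_supports("popcnt"))
        return popcount_hw;
#endif
    return popcount_generic; /* safe fallback for older cpus */
}
```

A caller would resolve the pointer once at init time, e.g.
"popcount_fn pc = select_popcount();", and then call pc(mask) in the fast path.
The indirect call is not free, so whether this recovers the roughly 10%
mentioned above would have to be measured.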