From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mga14.intel.com (mga14.intel.com [192.55.52.115]) by dpdk.org (Postfix) with ESMTP id 24E3B959 for ; Wed, 6 Aug 2014 18:57:34 +0200 (CEST) Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga103.fm.intel.com with ESMTP; 06 Aug 2014 09:52:52 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.97,862,1389772800"; d="scan'208";a="368789970" Received: from irsmsx101.ger.corp.intel.com ([163.33.3.153]) by FMSMGA003.fm.intel.com with ESMTP; 06 Aug 2014 09:57:06 -0700 Received: from irsmsx103.ger.corp.intel.com ([169.254.3.131]) by IRSMSX101.ger.corp.intel.com ([169.254.1.92]) with mapi id 14.03.0195.001; Wed, 6 Aug 2014 18:00:00 +0100 From: "Richardson, Bruce" To: Neil Horman , "Ananyev, Konstantin" Thread-Topic: [PATCH] acl: If build does not support sse4.2, emulate missing instructions with C code Thread-Index: AQHPr/nfCJnYOGX98UykIKMkJWxWu5vCEoeAgAAwp4CAASI8AIAACxKAgABc2OA= Date: Wed, 6 Aug 2014 16:59:59 +0000 Message-ID: <59AF69C657FD0841A61C55336867B5B0343D666D@IRSMSX103.ger.corp.intel.com> References: <1407166558-9532-1-git-send-email-nhorman@tuxdriver.com> <2601191342CEEE43887BDE71AB9772582134F98D@IRSMSX105.ger.corp.intel.com> <20140805182035.GB20550@hmsreliant.think-freely.org> <2601191342CEEE43887BDE71AB9772582134FC92@IRSMSX105.ger.corp.intel.com> <20140806121859.GB26562@localhost.localdomain> In-Reply-To: <20140806121859.GB26562@localhost.localdomain> Accept-Language: en-GB, en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [163.33.239.181] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Cc: "dev@dpdk.org" Subject: Re: [dpdk-dev] [PATCH] acl: If build does not support sse4.2, emulate missing instructions with C code X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: patches and discussions about DPDK List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 06 Aug 2014 16:57:35 -0000 > -----Original Message----- > From: Neil Horman [mailto:nhorman@tuxdriver.com] > Sent: Wednesday, August 06, 2014 5:19 AM > To: Ananyev, Konstantin > Cc: dev@dpdk.org; Thomas Monjalon; Richardson, Bruce > Subject: Re: [PATCH] acl: If build does not support sse4.2, emulate missi= ng > instructions with C code >=20 > On Wed, Aug 06, 2014 at 11:39:22AM +0000, Ananyev, Konstantin wrote: > > > > > > > -----Original Message----- > > > From: Neil Horman [mailto:nhorman@tuxdriver.com] > > > Sent: Tuesday, August 05, 2014 7:21 PM > > > To: Ananyev, Konstantin > > > Cc: dev@dpdk.org; Thomas Monjalon; Richardson, Bruce > > > Subject: Re: [PATCH] acl: If build does not support sse4.2, emulate m= issing > instructions with C code > > > > > > On Tue, Aug 05, 2014 at 03:26:27PM +0000, Ananyev, Konstantin wrote: > > > > Hi Neil, > > > > > > > > > From: Neil Horman [mailto:nhorman@tuxdriver.com] > > > > > Sent: Monday, August 04, 2014 4:36 PM > > > > > To: dev@dpdk.org > > > > > Cc: Neil Horman; Thomas Monjalon; Ananyev, Konstantin; Richardson= , > Bruce > > > > > Subject: [PATCH] acl: If build does not support sse4.2, emulate m= issing > instructions with C code > > > > > > > > > > The ACL library makes extensive use of some SSE4.2 instructions, = which > means the > > > > > default build can't compile this library. Work around the proble= m by > testing > > > > > the __SSE42__ definition in the acl_vects.h file and defining the= macros > there > > > > > as intrinsics or c-level equivalants. Note this is a minimal pat= ch, adjusting > > > > > only the definitions that are currently used in the ACL library. > > > > > > > > > > > > > My comments about actual implementations of c-level equivalents bel= ow. > > > > None of them look correct to me. > > > > Of course it could be fixed. > > > > Though I am not sure that is a right way to proceed: > > > > At first I really doubt that these equivalents will provide similar > performance. > > > > As you probably note - we do have a scalar version of rte_acl_clas= sify(): > rte_acl_classify_scalar(). > > > > So I think it might be faster than vector one with 'emulated' instr= incts. > > > > Unfortunately it is all mixed right now into one file and 'scalar' = version > could use few sse4 instrincts through resolve_priority(). > > > > Another thing - we consider to add another version of rte_acl_class= ify() > that will use avx2 instrincts. > > > > If we go the way you suggest - I am afraid will soon have to provid= e scalar > equivalents for several AVX2 instrincts too. > > > > So in summary that way (providing our own scalar equivalents of SIM= D > instrincts) seems to me slow, hard to maintain and error > > > prone. > > > > > > > > What porbably can be done instead: rework acl_run.c a bit, so it pr= ovide: > > > > rte_acl_classify_scalar() - could be build and used on all systems. > > > > rte_acl_classify_sse() - could be build and used only on systems wi= th sse4.2 > and upper, return ENOTSUP on lower arch. > > > > In future: rte_acl_classify_avx2 - could be build and used only on = systems > with avx2 and upper, return ENOTSUP on lower arch. > > > > > > > > I am looking at rte_acl right now anyway. > > > > So will try to come up with something workable. > > > > > > > So, this is exactly the opposite of what Bruce and I just spent sever= al days > and > > > a huge email thread that you clearly are aware of discussing run time= versus > > > compile time selection of paths. At this point I'm done ping ponging > between > > > your opposing viewpoints. If you want to implement something that do= es > run time > > > checking, I'm fine with it, but I'm not going back and forth until yo= u two > come > > > to an agreement on this. > > > > Right now, I am not talking about 'run time vs compile time selection'. > But you are talking about exactly that, allbeit implicitly. To implement= what > you recommend above (that being multiple functional paths that return a n= ot > supported error code at run time), we need to make run time tests for wha= t the > cpu supports. While I'm actually ok with doing that (I think it makes al= ot of > sense), Bruce and I just spent several days and dozens of emails debating= that, > so you can understand why I don't want to write yet another version of th= is > patch that requires doing the exact thing we just argued about, especiall= y if it > means he's going to pipe back up and say no, driving me back to a common > single > implementation that compiles and runs for all platforms. I'm not going t= o keep > re-writing this boucing back and forth between your opposing viewpoints. = We > need to agree on a direction before I make another pass at this. >=20 > > Whatever way we choose, I think the implementation need to be: > > 1) correct > Obviously. >=20 > > 2) allow easily add(/modify) code path optimised for particular archite= cture. > > Without need to modify/re-test what you call 'least common denominator= ' > code path. > > And visa-versa, if someone find a way to optimise common code path - no > need to > > touch/retest architecture specific ones. > So I'm fine with this, but it is anathema to what Bruce advocated for whe= n I did > this latest iteration. Bruce advocated for a single common path that com= piled > in all cases. Bruce, do you want to comment here? I'd really like to ge= t this > settled before I go try this again. In our previous discussion I was primarily concerned with the ixgbe driver,= which already had a number of scalar code paths as well as the vector one,= so I was very keen there not to see more code paths created. However, whil= e I hate seeing more code paths created that need to be maintained, I am ok= with having them created if the benefit is big enough. Up till now code pa= th selection would have been done at compile-time, but you've convinced me = that if we have the code paths there, selecting them at runtime makes more = sense for a packaged build.=20 For ACL specifically, I generally defer to Konstantin as the expert in this= area. If we need separate code paths for scalar, SSE and AVX, and each giv= es considerable performance improvement over the other, then I'm ok with th= at. /Bruce