From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Ananyev, Konstantin"
To: Neil Horman
Cc: "dev@dpdk.org"
Subject: Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
Date: Fri, 8 Aug 2014 13:09:34 +0000
Message-ID: <2601191342CEEE43887BDE71AB977258213522BE@IRSMSX105.ger.corp.intel.com>
In-Reply-To: <20140808122503.GA11413@hmsreliant.think-freely.org>
References: <1407436263-9360-1-git-send-email-konstantin.ananyev@intel.com>
 <20140807201134.GA1632@hmsreliant.think-freely.org>
 <2601191342CEEE43887BDE71AB97725821352285@IRSMSX105.ger.corp.intel.com>
 <20140808122503.GA11413@hmsreliant.think-freely.org>
List-Id: patches and discussions about DPDK

> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Friday, August 08, 2014 1:25 PM
> To: Ananyev, Konstantin
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
>
> On Fri, Aug 08, 2014 at 11:49:58AM +0000, Ananyev, Konstantin wrote:
> > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > Sent: Thursday, August 07, 2014 9:12 PM
> > > To: Ananyev, Konstantin
> > > Cc: dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
> > >
> > > On Thu, Aug 07, 2014 at 07:31:03PM +0100, Konstantin Ananyev wrote:
> > > > Make the ACL library build/work on the 'default' architecture:
> > > > - make rte_acl_classify_scalar really scalar
> > > >   (make sure it wouldn't use SSE4 intrinsics through resolve_priority()).
> > > > - Provide two versions of the rte_acl_classify code path:
> > > >   rte_acl_classify_sse() - can be built and used only on systems with SSE4.2
> > > >   and above; returns -ENOTSUP on a lower arch.
> > > >   rte_acl_classify_scalar() - a slower version, but can be built and used
> > > >   on all systems.
> > > > - keep common code shared between these two code paths.
> > > >
> > > > v2 changes:
> > > >   run-time selection of the most appropriate code path for the given ISA.
> > > >   By default the highest supported one is selected.
> > > >   The user can still override that selection by manually assigning a new value to
> > > >   the global function pointer rte_acl_default_classify.
> > > >   rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
> > > >   points to.
> > > >
> > > >
> > > > Signed-off-by: Konstantin Ananyev
> > >
> > > This is a lot better, thank you.  A few remaining issues.
> >
> > My comments inline too.
> > Thanks
> > Konstantin
> >
> > >
> > > > ---
> > > >  app/test-acl/main.c                |  13 +-
> > > >  lib/librte_acl/Makefile            |   5 +-
> > > >  lib/librte_acl/acl_bld.c           |   5 +-
> > > >  lib/librte_acl/acl_match_check.def |  92 ++++
> > > >  lib/librte_acl/acl_run.c           | 944 --------------------------------
> > > >  lib/librte_acl/acl_run.h           | 220 +++++++++
> > > >  lib/librte_acl/acl_run_scalar.c    | 197 ++++++++
> > > >  lib/librte_acl/acl_run_sse.c       | 630 +++++++++++++++++++++++++
> > > >  lib/librte_acl/rte_acl.c           |  15 +
> > > >  lib/librte_acl/rte_acl.h           |  24 +-
> > > >  10 files changed, 1189 insertions(+), 956 deletions(-)
> > > >  create mode 100644 lib/librte_acl/acl_match_check.def
> > > >  delete mode 100644 lib/librte_acl/acl_run.c
> > > >  create mode 100644 lib/librte_acl/acl_run.h
> > > >  create mode 100644 lib/librte_acl/acl_run_scalar.c
> > > >  create mode 100644 lib/librte_acl/acl_run_sse.c
> > > >
> > > > diff --git a/app/test-acl/main.c b/app/test-acl/main.c
> > > > index d654409..45c6fa6 100644
> > > > --- a/app/test-acl/main.c
> > > > +++ b/app/test-acl/main.c
> > > > @@ -787,6 +787,10 @@ acx_init(void)
> > > > 	/* perform build. */
> > > > 	ret = rte_acl_build(config.acx, &cfg);
> > > >
> > > > +	/* setup default rte_acl_classify */
> > > > +	if (config.scalar)
> > > > +		rte_acl_default_classify = rte_acl_classify_scalar;
> > > > +
> > > Exporting this variable as part of the ABI is a bad idea.  If the prototype of
> > > the function changes you have to update all your applications.
> >
> > If the prototype of rte_acl_classify changes, most likely you'll have to update the code that uses it anyway.
> >
> Why?  If you hide this from the application, changes to the internal
> implementation will also be invisible.  When building as a DSO, an application
> will be able to transition between libraries without the need for a rebuild.

Because rte_acl_classify() is part of the ACL API that users call.
If we add or modify its parameters and/or return value, users will have to change their apps anyway.

> > > Make the pointer
> > > an internal symbol and set it using a get/set routine with an enum to represent
> > > the path to choose.  That will help isolate the ABI from the internal
> > > implementation.
> >
> > That was my first intention too.
> > But then I realised that if we make it internal, we'll need to make rte_acl_classify() a proper function,
> > and that will cost us an extra call (or jump).
>
> That's true, but I don't see that as a problem.  We're not talking about a hot
> code path here, it's a setup function.

I am not talking about rte_acl_select_classify() (the setup path) but about rte_acl_classify() itself.
If I make rte_acl_default_classify static, then rte_acl_classify() would need to become a real function,
and each call would look something like:

->call rte_acl_classify
---> load rte_acl_default_classify value into a register
---> jmp (*reg)

> Or do you think that an application will
> be switching between classification functions on every classify operation?

God no.
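(For illustration only: a rough sketch of the get/set alternative being discussed,
with made-up enum and setter names, not code from this patch. The dispatch pointer
stays internal to the library, the rte_acl_classify() macro in rte_acl.h would go
away, and every lookup pays the extra call plus indirect jump mentioned above.)

#include <errno.h>
#include <rte_acl.h>
#include <rte_cpuflags.h>

/* hypothetical selector enum and setter -- not part of the patch */
enum rte_acl_classify_method {
	RTE_ACL_CLASSIFY_SCALAR,
	RTE_ACL_CLASSIFY_SSE,
};

/* hidden from applications; no ABI exposure of the pointer itself */
static rte_acl_classify_t classify_fn = rte_acl_classify_scalar;

int
rte_acl_set_classify_method(enum rte_acl_classify_method m)
{
	switch (m) {
	case RTE_ACL_CLASSIFY_SCALAR:
		classify_fn = rte_acl_classify_scalar;
		return 0;
	case RTE_ACL_CLASSIFY_SSE:
		/* refuse a path the running CPU cannot execute */
		if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
			return -ENOTSUP;
		classify_fn = rte_acl_classify_sse;
		return 0;
	default:
		return -EINVAL;
	}
}

/* rte_acl_classify() becomes a real function: one extra call plus an
 * indirect jump per invocation -- the overhead discussed above. */
int
rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories)
{
	return classify_fn(ctx, data, results, num, categories);
}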
> > Also, I think the user should have the ability to change the default classify code path without modifying/rebuilding the acl library.
>
> I agree, but both the methods we are advocating for allow that.  It's really just
> a question of exposing the mechanism as data or text in the binary.  Exposing it
> as data comes with implicit ABI constraints that are less prevalent when done
> as code entry points.
>
> > For example: a bug in an optimised code path is discovered, or a user may want to implement and use his own version of classify().
>
> In the case of a bug in the optimized path, you just fix the bug.

It is not about me. It is about a user who gets librte_acl as part of a binary distribution.
Of course, he will probably report it and we will probably fix it sooner or later.
But with such an ability he can switch to the safe implementation immediately,
without touching the library, and then wait for the fix.

> If you want
> to provide your own classification function, that's fine I suppose, but that
> seems completely outside the scope of what we're trying to do here.  It's not
> advantageous to just throw that in there.  If you want to be able to provide
> your own classifier function, let's at least take some time to make sure that the
> function prototype is sufficiently capable to accept all the data you might want
> to pass it in the future, before we go exposing it.  Otherwise you'll have to
> break the ABI in future versions, which is something we've been discussing
> trying to avoid.

rte_acl_classify() is already exposed (part of the API), same as rte_acl_classify_scalar().
If in the future we change these functions' prototypes, that will break the ABI anyway.

>
> > > It will also let you prevent things like selecting a run time
> > > path that is incompatible with the running system
> >
> > If the user is going to update rte_acl_default_classify, he is probably smart enough to know what he is doing.
>
> That really seems like poor design to me.  I don't see why you wouldn't at least
> want to warn the developer of an application if, at run time, they were to assign
> a default classifier method that was incompatible with the running system.  Yes,
> they're likely smart enough to know what they're doing, but smart people make
> mistakes, and appreciate being told when they're doing so, especially if the
> method of telling is something a bit more civil than a machine check that
> might occur well after the application has been initialized.

I have no problem providing rte_acl_check_classify(flags_required, classify_ptr) that would do the checking and emit the warning.
Though, as I said above, I'd prefer not to hide rte_acl_default_classify: it would cause extra overhead for rte_acl_classify().

>
> > On the other hand, a user can hit the same problem by simply calling rte_acl_classify_sse() directly.
>
> Not if the function is statically declared and not exposed to the application
> they can't :)

I don't really want to hide rte_acl_classify_sse()/rte_acl_classify_scalar().
They should be available directly, I think.
In the future we might introduce new versions for more sophisticated ISAs (rte_acl_classify_avx() or something).
Users should have the ability to downgrade their classify() function if they like.

> >
> > > and prevent path switching
> > > during searches, which may produce unexpected results.
> > Not that I am advertising it, but it should be safe to update rte_acl_default_classify during searches:
> > all versions of classify should produce exactly the same result for each input packet and treat the acl context as read-only.
> >
> Fair enough.
>
> > >
> > > >
> > > > diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
> > > > deleted file mode 100644
> > > > index e3d9fc1..0000000
> > > > --- a/lib/librte_acl/acl_run.c
> > > > +++ /dev/null
> > > > @@ -1,944 +0,0 @@
> > > > -/*-
> > > > - *  BSD LICENSE
> > > > - *
> > > > - *  Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > - *  All rights reserved.
> > > > - *
> > > > - *  Redistribution and use in source and binary forms, with or without
> > > > - *  modification, are permitted provided that the following conditions
> > > >
> > > > +
> > > > +#define __func_resolve_priority__	resolve_priority_scalar
> > > > +#define __func_match_check__		acl_match_check_scalar
> > > > +#include "acl_match_check.def"
> > > > +
> > > I get that this lets you make some more code common, but it's just unpleasant to trace
> > > through.  Looking at the definition of __func_match_check__ I don't see anything
> > > particularly performance sensitive there.  What if instead you simply redefined
> > > __func_match_check__ in a common internal header as acl_match_check (a generic
> > > function), and had it accept the priority resolution function as an argument?  That
> > > would still give you all the performance enhancements without having to include
> > > C files in the middle of other C files, and would make the code a bit more
> > > parseable.
> >
> > Yes, that way it would look much better.
> > And it seems that with '-findirect-inlining' gcc is able to inline them via pointers properly.
> > Will change as you suggested.
> >
> Thank you
> Neil
>
> > > > +/*
> > > > + * When processing the transition, rather than using an if/else
> > > > + * construct, the offset is calculated for DFA and QRANGE and
> > > > + * then conditionally added to the address based on node type.
> > > > + * This is done to avoid branch mis-predictions. Since the
> > > > + * offset is a rather simple calculation, it is more efficient
> > > > + * to do the calculation and a conditional move rather than
> > > > + * a conditional branch to determine which calculation to do.
> > > > + */
> > > > +static inline uint32_t
> > > > +scan_forward(uint32_t input, uint32_t max)
> > > > +{
> > > > +	return (input == 0) ? max : rte_bsf32(input);
> > > > +}
> > > > + }
> > > > +}
> > > >
> > > > +
> > > > +#define __func_resolve_priority__	resolve_priority_sse
> > > > +#define __func_match_check__		acl_match_check_sse
> > > > +#include "acl_match_check.def"
> > > > +
> > > Same deal as above.
> > >
> > > > +/*
> > > > + * Extract transitions from an XMM register and check for any matches
> > > > + */
> > > > +static void
> > > > +acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
> > > > +	struct parms *parms, struct acl_flow_data *flows)
> > > > +{
> > > > +	uint64_t transition1, transition2;
> > > > +
> > > > +	/* extract transition from low 64 bits. */
> > > > +	transition1 = MM_CVT64(*indicies);
> > > > +
> > > > +	/* extract transition from high 64 bits.
*/ > > > > + *indicies =3D MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64); > > > > + transition2 =3D MM_CVT64(*indicies); > > > > + > > > > + transition1 =3D acl_match_check_sse(transition1, slot, ctx, > > > > + parms, flows); > > > > + transition2 =3D acl_match_check_sse(transition2, slot + 1, ctx, > > > > + parms, flows); > > > > + > > > > + /* update indicies with new transitions. */ > > > > + *indicies =3D MM_SET64(transition2, transition1); > > > > +} > > > > + > > > > +/* > > > > + * Check for a match in 2 transitions (contained in SSE register) > > > > + */ > > > > +static inline void > > > > +acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct= parms *parms, > > > > + struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask) > > > > +{ > > > > + xmm_t temp; > > > > + > > > > + temp =3D MM_AND(match_mask, *indicies); > > > > + while (!MM_TESTZ(temp, temp)) { > > > > + acl_process_matches(indicies, slot, ctx, parms, flows); > > > > + temp =3D MM_AND(match_mask, *indicies); > > > > + } > > > > +} > > > > + > > > > +/* > > > > + * Check for any match in 4 transitions (contained in 2 SSE regist= ers) > > > > + */ > > > > +static inline void > > > > +acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct= parms *parms, > > > > + struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2, > > > > + xmm_t match_mask) > > > > +{ > > > > + xmm_t temp; > > > > + > > > > + /* put low 32 bits of each transition into one register */ > > > > + temp =3D (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicie= s2, > > > > + 0x88); > > > > + /* test for match node */ > > > > + temp =3D MM_AND(match_mask, temp); > > > > + > > > > + while (!MM_TESTZ(temp, temp)) { > > > > + acl_process_matches(indicies1, slot, ctx, parms, flows); > > > > + acl_process_matches(indicies2, slot + 2, ctx, parms, flows); > > > > + > > > > + temp =3D (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, > > > > + (__m128)*indicies2, > > > > + 0x88); > > > > + temp =3D MM_AND(match_mask, temp); > > > > + } > > > > +} > > > > + > > > > +/* > > > > + * Calculate the address of the next transition for > > > > + * all types of nodes. Note that only DFA nodes and range > > > > + * nodes actually transition to another node. Match > > > > + * nodes don't move. > > > > + */ > > > > +static inline xmm_t > > > > +acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_in= put, > > > > + xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range, > > > > + xmm_t *indicies1, xmm_t *indicies2) > > > > +{ > > > > + xmm_t addr, node_types, temp; > > > > + > > > > + /* > > > > + * Note that no transition is done for a match > > > > + * node and therefore a stream freezes when > > > > + * it reaches a match. 
> > > > + */ > > > > + > > > > + /* Shuffle low 32 into temp and high 32 into indicies2 */ > > > > + temp =3D (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicie= s2, > > > > + 0x88); > > > > + *indicies2 =3D (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, > > > > + (__m128)*indicies2, 0xdd); > > > > + > > > > + /* Calc node type and node addr */ > > > > + node_types =3D MM_ANDNOT(index_mask, temp); > > > > + addr =3D MM_AND(index_mask, temp); > > > > + > > > > + /* > > > > + * Calc addr for DFAs - addr =3D dfa_index + input_byte > > > > + */ > > > > + > > > > + /* mask for DFA type (0) nodes */ > > > > + temp =3D MM_CMPEQ32(node_types, MM_XOR(node_types, node_types)); > > > > + > > > > + /* add input byte to DFA position */ > > > > + temp =3D MM_AND(temp, bytes); > > > > + temp =3D MM_AND(temp, next_input); > > > > + addr =3D MM_ADD32(addr, temp); > > > > + > > > > + /* > > > > + * Calc addr for Range nodes -> range_index + range(input) > > > > + */ > > > > + node_types =3D MM_CMPEQ32(node_types, type_quad_range); > > > > + > > > > + /* > > > > + * Calculate number of range boundaries that are less than the > > > > + * input value. Range boundaries for each node are in signed 8 bi= t, > > > > + * ordered from -128 to 127 in the indicies2 register. > > > > + * This is effectively a popcnt of bytes that are greater than th= e > > > > + * input byte. > > > > + */ > > > > + > > > > + /* shuffle input byte to all 4 positions of 32 bit value */ > > > > + temp =3D MM_SHUFFLE8(next_input, shuffle_input); > > > > + > > > > + /* check ranges */ > > > > + temp =3D MM_CMPGT8(temp, *indicies2); > > > > + > > > > + /* convert -1 to 1 (bytes greater than input byte */ > > > > + temp =3D MM_SIGN8(temp, temp); > > > > + > > > > + /* horizontal add pairs of bytes into words */ > > > > + temp =3D MM_MADD8(temp, temp); > > > > + > > > > + /* horizontal add pairs of words into dwords */ > > > > + temp =3D MM_MADD16(temp, ones_16); > > > > + > > > > + /* mask to range type nodes */ > > > > + temp =3D MM_AND(temp, node_types); > > > > + > > > > + /* add index into node position */ > > > > + return MM_ADD32(addr, temp); > > > > +} > > > > + > > > > +/* > > > > + * Process 4 transitions (in 2 SIMD registers) in parallel > > > > + */ > > > > +static inline xmm_t > > > > +transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_inpu= t, > > > > + xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range, > > > > + const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2) > > > > +{ > > > > + xmm_t addr; > > > > + uint64_t trans0, trans2; > > > > + > > > > + /* Calculate the address (array index) for all 4 transitions. */ > > > > + > > > > + addr =3D acl_calc_addr(index_mask, next_input, shuffle_input, one= s_16, > > > > + bytes, type_quad_range, indicies1, indicies2); > > > > + > > > > + /* Gather 64 bit transitions and pack back into 2 registers. 
*/ > > > > + > > > > + trans0 =3D trans[MM_CVT32(addr)]; > > > > + > > > > + /* get slot 2 */ > > > > + > > > > + /* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */ > > > > + addr =3D MM_SHUFFLE32(addr, SHUFFLE32_SLOT2); > > > > + trans2 =3D trans[MM_CVT32(addr)]; > > > > + > > > > + /* get slot 1 */ > > > > + > > > > + /* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */ > > > > + addr =3D MM_SHUFFLE32(addr, SHUFFLE32_SLOT1); > > > > + *indicies1 =3D MM_SET64(trans[MM_CVT32(addr)], trans0); > > > > + > > > > + /* get slot 3 */ > > > > + > > > > + /* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */ > > > > + addr =3D MM_SHUFFLE32(addr, SHUFFLE32_SLOT3); > > > > + *indicies2 =3D MM_SET64(trans[MM_CVT32(addr)], trans2); > > > > + > > > > + return MM_SRL32(next_input, 8); > > > > +} > > > > + > > > > +/* > > > > + * Execute trie traversal with 8 traversals in parallel > > > > + */ > > > > +static inline int > > > > +search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data, > > > > + uint32_t *results, uint32_t total_packets, uint32_t categories) > > > > +{ > > > > + int n; > > > > + struct acl_flow_data flows; > > > > + uint64_t index_array[MAX_SEARCHES_SSE8]; > > > > + struct completion cmplt[MAX_SEARCHES_SSE8]; > > > > + struct parms parms[MAX_SEARCHES_SSE8]; > > > > + xmm_t input0, input1; > > > > + xmm_t indicies1, indicies2, indicies3, indicies4; > > > > + > > > > + acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, > > > > + total_packets, categories, ctx->trans_table); > > > > + > > > > + for (n =3D 0; n < MAX_SEARCHES_SSE8; n++) { > > > > + cmplt[n].count =3D 0; > > > > + index_array[n] =3D acl_start_next_trie(&flows, parms, n, ctx); > > > > + } > > > > + > > > > + /* > > > > + * indicies1 contains index_array[0,1] > > > > + * indicies2 contains index_array[2,3] > > > > + * indicies3 contains index_array[4,5] > > > > + * indicies4 contains index_array[6,7] > > > > + */ > > > > + > > > > + indicies1 =3D MM_LOADU((xmm_t *) &index_array[0]); > > > > + indicies2 =3D MM_LOADU((xmm_t *) &index_array[2]); > > > > + > > > > + indicies3 =3D MM_LOADU((xmm_t *) &index_array[4]); > > > > + indicies4 =3D MM_LOADU((xmm_t *) &index_array[6]); > > > > + > > > > + /* Check for any matches. */ > > > > + acl_match_check_x4(0, ctx, parms, &flows, > > > > + &indicies1, &indicies2, mm_match_mask.m); > > > > + acl_match_check_x4(4, ctx, parms, &flows, > > > > + &indicies3, &indicies4, mm_match_mask.m); > > > > + > > > > + while (flows.started > 0) { > > > > + > > > > + /* Gather 4 bytes of input data for each stream. */ > > > > + input0 =3D MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), > > > > + 0); > > > > + input1 =3D MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4), > > > > + 0); > > > > + > > > > + input0 =3D MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1); > > > > + input1 =3D MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1); > > > > + > > > > + input0 =3D MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2); > > > > + input1 =3D MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2); > > > > + > > > > + input0 =3D MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3); > > > > + input1 =3D MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3); > > > > + > > > > + /* Process the 4 bytes of input on each stream. 
*/ > > > > + > > > > + input0 =3D transition4(mm_index_mask.m, input0, > > > > + mm_shuffle_input.m, mm_ones_16.m, > > > > + mm_bytes.m, mm_type_quad_range.m, > > > > + flows.trans, &indicies1, &indicies2); > > > > + > > > > + input1 =3D transition4(mm_index_mask.m, input1, > > > > + mm_shuffle_input.m, mm_ones_16.m, > > > > + mm_bytes.m, mm_type_quad_range.m, > > > > + flows.trans, &indicies3, &indicies4); > > > > + > > > > + input0 =3D transition4(mm_index_mask.m, input0, > > > > + mm_shuffle_input.m, mm_ones_16.m, > > > > + mm_bytes.m, mm_type_quad_range.m, > > > > + flows.trans, &indicies1, &indicies2); > > > > + > > > > + input1 =3D transition4(mm_index_mask.m, input1, > > > > + mm_shuffle_input.m, mm_ones_16.m, > > > > + mm_bytes.m, mm_type_quad_range.m, > > > > + flows.trans, &indicies3, &indicies4); > > > > + > > > > + input0 =3D transition4(mm_index_mask.m, input0, > > > > + mm_shuffle_input.m, mm_ones_16.m, > > > > + mm_bytes.m, mm_type_quad_range.m, > > > > + flows.trans, &indicies1, &indicies2); > > > > + > > > > + input1 =3D transition4(mm_index_mask.m, input1, > > > > + mm_shuffle_input.m, mm_ones_16.m, > > > > + mm_bytes.m, mm_type_quad_range.m, > > > > + flows.trans, &indicies3, &indicies4); > > > > + > > > > + input0 =3D transition4(mm_index_mask.m, input0, > > > > + mm_shuffle_input.m, mm_ones_16.m, > > > > + mm_bytes.m, mm_type_quad_range.m, > > > > + flows.trans, &indicies1, &indicies2); > > > > + > > > > + input1 =3D transition4(mm_index_mask.m, input1, > > > > + mm_shuffle_input.m, mm_ones_16.m, > > > > + mm_bytes.m, mm_type_quad_range.m, > > > > + flows.trans, &indicies3, &indicies4); > > > > + > > > > + /* Check for any matches. */ > > > > + acl_match_check_x4(0, ctx, parms, &flows, > > > > + &indicies1, &indicies2, mm_match_mask.m); > > > > + acl_match_check_x4(4, ctx, parms, &flows, > > > > + &indicies3, &indicies4, mm_match_mask.m); > > > > + } > > > > + > > > > + return 0; > > > > +} > > > > + > > > > +/* > > > > + * Execute trie traversal with 4 traversals in parallel > > > > + */ > > > > +static inline int > > > > +search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data, > > > > + uint32_t *results, int total_packets, uint32_t categories) > > > > +{ > > > > + int n; > > > > + struct acl_flow_data flows; > > > > + uint64_t index_array[MAX_SEARCHES_SSE4]; > > > > + struct completion cmplt[MAX_SEARCHES_SSE4]; > > > > + struct parms parms[MAX_SEARCHES_SSE4]; > > > > + xmm_t input, indicies1, indicies2; > > > > + > > > > + acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, > > > > + total_packets, categories, ctx->trans_table); > > > > + > > > > + for (n =3D 0; n < MAX_SEARCHES_SSE4; n++) { > > > > + cmplt[n].count =3D 0; > > > > + index_array[n] =3D acl_start_next_trie(&flows, parms, n, ctx); > > > > + } > > > > + > > > > + indicies1 =3D MM_LOADU((xmm_t *) &index_array[0]); > > > > + indicies2 =3D MM_LOADU((xmm_t *) &index_array[2]); > > > > + > > > > + /* Check for any matches. */ > > > > + acl_match_check_x4(0, ctx, parms, &flows, > > > > + &indicies1, &indicies2, mm_match_mask.m); > > > > + > > > > + while (flows.started > 0) { > > > > + > > > > + /* Gather 4 bytes of input data for each stream. 
*/ > > > > + input =3D MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0= ); > > > > + input =3D MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1); > > > > + input =3D MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2); > > > > + input =3D MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3); > > > > + > > > > + /* Process the 4 bytes of input on each stream. */ > > > > + input =3D transition4(mm_index_mask.m, input, > > > > + mm_shuffle_input.m, mm_ones_16.m, > > > > + mm_bytes.m, mm_type_quad_range.m, > > > > + flows.trans, &indicies1, &indicies2); > > > > + > > > > + input =3D transition4(mm_index_mask.m, input, > > > > + mm_shuffle_input.m, mm_ones_16.m, > > > > + mm_bytes.m, mm_type_quad_range.m, > > > > + flows.trans, &indicies1, &indicies2); > > > > + > > > > + input =3D transition4(mm_index_mask.m, input, > > > > + mm_shuffle_input.m, mm_ones_16.m, > > > > + mm_bytes.m, mm_type_quad_range.m, > > > > + flows.trans, &indicies1, &indicies2); > > > > + > > > > + input =3D transition4(mm_index_mask.m, input, > > > > + mm_shuffle_input.m, mm_ones_16.m, > > > > + mm_bytes.m, mm_type_quad_range.m, > > > > + flows.trans, &indicies1, &indicies2); > > > > + > > > > + /* Check for any matches. */ > > > > + acl_match_check_x4(0, ctx, parms, &flows, > > > > + &indicies1, &indicies2, mm_match_mask.m); > > > > + } > > > > + > > > > + return 0; > > > > +} > > > > + > > > > +static inline xmm_t > > > > +transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_inpu= t, > > > > + xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range, > > > > + const uint64_t *trans, xmm_t *indicies1) > > > > +{ > > > > + uint64_t t; > > > > + xmm_t addr, indicies2; > > > > + > > > > + indicies2 =3D MM_XOR(ones_16, ones_16); > > > > + > > > > + addr =3D acl_calc_addr(index_mask, next_input, shuffle_input, one= s_16, > > > > + bytes, type_quad_range, indicies1, &indicies2); > > > > + > > > > + /* Gather 64 bit transitions and pack 2 per register. */ > > > > + > > > > + t =3D trans[MM_CVT32(addr)]; > > > > + > > > > + /* get slot 1 */ > > > > + addr =3D MM_SHUFFLE32(addr, SHUFFLE32_SLOT1); > > > > + *indicies1 =3D MM_SET64(trans[MM_CVT32(addr)], t); > > > > + > > > > + return MM_SRL32(next_input, 8); > > > > +} > > > > + > > > > +/* > > > > + * Execute trie traversal with 2 traversals in parallel. > > > > + */ > > > > +static inline int > > > > +search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data, > > > > + uint32_t *results, uint32_t total_packets, uint32_t categories) > > > > +{ > > > > + int n; > > > > + struct acl_flow_data flows; > > > > + uint64_t index_array[MAX_SEARCHES_SSE2]; > > > > + struct completion cmplt[MAX_SEARCHES_SSE2]; > > > > + struct parms parms[MAX_SEARCHES_SSE2]; > > > > + xmm_t input, indicies; > > > > + > > > > + acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, > > > > + total_packets, categories, ctx->trans_table); > > > > + > > > > + for (n =3D 0; n < MAX_SEARCHES_SSE2; n++) { > > > > + cmplt[n].count =3D 0; > > > > + index_array[n] =3D acl_start_next_trie(&flows, parms, n, ctx); > > > > + } > > > > + > > > > + indicies =3D MM_LOADU((xmm_t *) &index_array[0]); > > > > + > > > > + /* Check for any matches. */ > > > > + acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mas= k64.m); > > > > + > > > > + while (flows.started > 0) { > > > > + > > > > + /* Gather 4 bytes of input data for each stream. 
*/
> > > > +		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> > > > +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> > > > +
> > > > +		/* Process the 4 bytes of input on each stream. */
> > > > +
> > > > +		input = transition2(mm_index_mask64.m, input,
> > > > +			mm_shuffle_input64.m, mm_ones_16.m,
> > > > +			mm_bytes64.m, mm_type_quad_range64.m,
> > > > +			flows.trans, &indicies);
> > > > +
> > > > +		input = transition2(mm_index_mask64.m, input,
> > > > +			mm_shuffle_input64.m, mm_ones_16.m,
> > > > +			mm_bytes64.m, mm_type_quad_range64.m,
> > > > +			flows.trans, &indicies);
> > > > +
> > > > +		input = transition2(mm_index_mask64.m, input,
> > > > +			mm_shuffle_input64.m, mm_ones_16.m,
> > > > +			mm_bytes64.m, mm_type_quad_range64.m,
> > > > +			flows.trans, &indicies);
> > > > +
> > > > +		input = transition2(mm_index_mask64.m, input,
> > > > +			mm_shuffle_input64.m, mm_ones_16.m,
> > > > +			mm_bytes64.m, mm_type_quad_range64.m,
> > > > +			flows.trans, &indicies);
> > > > +
> > > > +		/* Check for any matches. */
> > > > +		acl_match_check_x2(0, ctx, parms, &flows, &indicies,
> > > > +			mm_match_mask64.m);
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +int
> > > > +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > +	uint32_t *results, uint32_t num, uint32_t categories)
> > > > +{
> > > > +	if (categories != 1 &&
> > > > +		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> > > > +		return -EINVAL;
> > > > +
> > > > +	if (likely(num >= MAX_SEARCHES_SSE8))
> > > > +		return search_sse_8(ctx, data, results, num, categories);
> > > > +	else if (num >= MAX_SEARCHES_SSE4)
> > > > +		return search_sse_4(ctx, data, results, num, categories);
> > > > +	else
> > > > +		return search_sse_2(ctx, data, results, num, categories);
> > > > +}
> > > > diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
> > > > index 7c288bd..0cde07e 100644
> > > > --- a/lib/librte_acl/rte_acl.c
> > > > +++ b/lib/librte_acl/rte_acl.c
> > > > @@ -38,6 +38,21 @@
> > > >
> > > >  TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
> > > >
> > > > +/* by default, use the always available scalar code path. */
> > > > +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
> > > > +
> > > make this static, the outside world shouldn't need to see it.
> >
> > As I said above, I think it is more plausible to keep it globally visible.
> >
> > >
> > > > +void __attribute__((constructor(INT16_MAX)))
> > > > +rte_acl_select_classify(void)
> > > Make it static.  The outside world doesn't need to call this.
> >
> > See above; I would like the user to have the ability to call it manually if needed.
> >
> > >
> > > > +{
> > > > +	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) {
> > > > +		/* SSE version requires SSE4.1 */
> > > > +		rte_acl_default_classify = rte_acl_classify_sse;
> > > > +	} else {
> > > > +		/* reset to scalar version. */
> > > > +		rte_acl_default_classify = rte_acl_classify_scalar;
> > > Don't need the else clause here, the static initializer has you covered.
> >
> > I think we'd better keep it like that, in case the user calls it manually.
> > We always reset rte_acl_default_classify to the 'best proper' value.
> >
> > > > +	}
> > > > +}
> > > > +
> > > > +
> > > > +/**
> > > > + * Invokes default rte_acl_classify function.
> > > > + */
> > > > +extern rte_acl_classify_t rte_acl_default_classify;
> > > > +
> > > Doesn't need to be extern.
> > > > +#define rte_acl_classify(ctx, data, results, num, categories)	\
> > > > +	(*rte_acl_default_classify)(ctx, data, results, num, categories)
> > > > +
> > > Not sure why you need this either.  The rte_acl_classify_t should be enough, no?
> >
> > We preserve the existing rte_acl_classify() API, so users don't need to modify their code.
> >
> This would be a great candidate for versioning (Bruce and I have been discussing
> this).
>
> Neil
>
> >
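(For reference, a minimal usage sketch -- illustrative only. The helper names
app_setup_classify() and app_lookup() are made up here, but rte_acl_default_classify,
rte_acl_classify_scalar and the rte_acl_classify() macro are the symbols this patch
exposes. It shows how existing call sites stay unchanged while an application can
still force the scalar path, as app/test-acl does with its --scalar option.)

#include <stdint.h>
#include <rte_acl.h>

/* Illustrative only: force the portable path before any lookups, e.g. when a
 * bug is suspected in the SSE implementation. Otherwise the library constructor
 * rte_acl_select_classify() has already picked the best supported path. */
static void
app_setup_classify(int force_scalar)
{
	if (force_scalar)
		rte_acl_default_classify = rte_acl_classify_scalar;
}

/* Existing call sites stay as-is: rte_acl_classify() is now a macro that
 * dispatches through rte_acl_default_classify. */
static int
app_lookup(const struct rte_acl_ctx *acx, const uint8_t **data,
	uint32_t *results, uint32_t num)
{
	return rte_acl_classify(acx, data, results, num, 1);
}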