Date: Fri, 8 Aug 2014 08:25:03 -0400
From: Neil Horman
To: "Ananyev, Konstantin"
Cc: "dev@dpdk.org"
Subject: Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
Message-ID: <20140808122503.GA11413@hmsreliant.think-freely.org>
In-Reply-To: <2601191342CEEE43887BDE71AB97725821352285@IRSMSX105.ger.corp.intel.com>
References: <1407436263-9360-1-git-send-email-konstantin.ananyev@intel.com>
 <20140807201134.GA1632@hmsreliant.think-freely.org>
 <2601191342CEEE43887BDE71AB97725821352285@IRSMSX105.ger.corp.intel.com>
User-Agent: Mutt/1.5.23 (2014-03-12)

On Fri, Aug 08, 2014 at 11:49:58AM +0000, Ananyev, Konstantin wrote:
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Thursday, August 07, 2014 9:12 PM
> > To: Ananyev, Konstantin
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
> >
> > On Thu, Aug 07, 2014 at 07:31:03PM +0100, Konstantin Ananyev wrote:
> > > Make the ACL library build/work on the 'default' architecture:
> > > - make rte_acl_classify_scalar really scalar
> > >   (make sure it wouldn't use sse4 intrinsics through resolve_priority()).
> > > - Provide two versions of the rte_acl_classify code path:
> > >   rte_acl_classify_sse() - can be built and used only on systems with sse4.2
> > >   and above; returns -ENOTSUP on lower archs.
> > >   rte_acl_classify_scalar() - a slower version, but can be built and used
> > >   on all systems.
> > > - keep common code shared between these two code paths.
> > >
> > > v2 changes:
> > > run-time selection of the most appropriate code path for the given ISA.
> > > By default the highest supported one is selected.
> > > The user can still override that selection by manually assigning a new value
> > > to the global function pointer rte_acl_default_classify.
> > > rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
> > > points to.
> > >
> > > Signed-off-by: Konstantin Ananyev
> >
> > This is a lot better, thank you.  A few remaining issues.
> My comments inline too.
> Thanks
> Konstantin
>
> > > ---
> > >  app/test-acl/main.c                |  13 +-
> > >  lib/librte_acl/Makefile            |   5 +-
> > >  lib/librte_acl/acl_bld.c           |   5 +-
> > >  lib/librte_acl/acl_match_check.def |  92 ++++
> > >  lib/librte_acl/acl_run.c           | 944 -------------------------------------
> > >  lib/librte_acl/acl_run.h           | 220 +++++++++
> > >  lib/librte_acl/acl_run_scalar.c    | 197 ++++++++
> > >  lib/librte_acl/acl_run_sse.c       | 630 +++++++++++++++++++++++++
> > >  lib/librte_acl/rte_acl.c           |  15 +
> > >  lib/librte_acl/rte_acl.h           |  24 +-
> > >  10 files changed, 1189 insertions(+), 956 deletions(-)
> > >  create mode 100644 lib/librte_acl/acl_match_check.def
> > >  delete mode 100644 lib/librte_acl/acl_run.c
> > >  create mode 100644 lib/librte_acl/acl_run.h
> > >  create mode 100644 lib/librte_acl/acl_run_scalar.c
> > >  create mode 100644 lib/librte_acl/acl_run_sse.c
> > >
> > > diff --git a/app/test-acl/main.c b/app/test-acl/main.c
> > > index d654409..45c6fa6 100644
> > > --- a/app/test-acl/main.c
> > > +++ b/app/test-acl/main.c
> > > @@ -787,6 +787,10 @@ acx_init(void)
> > >  	/* perform build. */
> > >  	ret = rte_acl_build(config.acx, &cfg);
> > >
> > > +	/* setup default rte_acl_classify */
> > > +	if (config.scalar)
> > > +		rte_acl_default_classify = rte_acl_classify_scalar;
> > > +
> > Exporting this variable as part of the ABI is a bad idea.  If the prototype of
> > the function changes, you have to update all your applications.
> If the prototype of rte_acl_classify will change, most likely you'll have to update code that uses it anyway.
>
Why?  If you hide this from the application, changes to the internal
implementation will also be invisible.  When building as a DSO, an application
will be able to transition between libraries without the need for a rebuild.

> > Make the pointer
> > an internal symbol and set it using a get/set routine with an enum to represent
> > the path to choose.  That will help isolate the ABI from the internal
> > implementation.
> That was my first intention too.
> But then I realised that if we make it internal, then we'll need to make rte_acl_classify() a proper function
> and it will cost us an extra call (or jump).
That's true, but I don't see that as a problem.  We're not talking about a hot
code path here; it's a setup function.  Or do you think that an application
will be switching between classification functions on every classify
operation?

> Also I think the user should have the ability to change the default classify code path without modifying/rebuilding the acl library.
I agree, but both of the methods we are advocating allow that.  It's really
just a question of exposing the mechanism as data or text in the binary.
Exposing it as data comes with implicit ABI constraints that are less
prevalent when it is done as code entry points.

> For example: a bug in an optimised code path is discovered, or the user may want to implement and use his own version of classify().
In the case of a bug in the optimized path, you just fix the bug.  If you want
to provide your own classification function, that's fine, I suppose, but that
seems completely outside the scope of what we're trying to do here.  It's not
advantageous to just throw that in there.  If you want to be able to provide
your own classifier function, let's at least take some time to make sure that
the function prototype is sufficiently capable of accepting all the data you
might want to pass it in the future, before we go exposing it.  Otherwise
you'll have to break the ABI in future versions, which is something we've been
discussing trying to avoid.
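
To make the get/set suggestion concrete, here is roughly the shape I have in
mind.  This is only a sketch: the enum and the rte_acl_set_classify_alg() name
are made up for illustration, not existing API, and the cpu-flag check simply
reuses what the patch already does in rte_acl_select_classify():

/* Illustrative sketch only; the names below are hypothetical, not existing API. */
enum rte_acl_classify_alg {
	RTE_ACL_CLASSIFY_DEFAULT = 0,
	RTE_ACL_CLASSIFY_SCALAR,
	RTE_ACL_CLASSIFY_SSE,
};

/* stays internal to librte_acl -- never exported through the ABI */
static rte_acl_classify_t acl_classify_fn = rte_acl_classify_scalar;

int
rte_acl_set_classify_alg(enum rte_acl_classify_alg alg)
{
	switch (alg) {
	case RTE_ACL_CLASSIFY_SSE:
		/* refuse a path the running cpu can't execute */
		if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
			return -ENOTSUP;
		acl_classify_fn = rte_acl_classify_sse;
		return 0;
	case RTE_ACL_CLASSIFY_DEFAULT:
	case RTE_ACL_CLASSIFY_SCALAR:
		acl_classify_fn = rte_acl_classify_scalar;
		return 0;
	default:
		return -EINVAL;
	}
}

That keeps the function pointer out of the application's sight, gives us one
obvious place to warn about (or reject) an incompatible selection, and leaves
the classify entry point itself free to evolve internally.
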
> > It will also let you prevent things like selecting a run time
> > path that is incompatible with the running system
> If the user is going to update rte_acl_default_classify, he is probably smart enough to know what he is doing.
That really seems like poor design to me.  I don't see why you wouldn't at
least want to warn the developer of an application if they were to assign, at
run time, a default classifier method that was incompatible with the running
system.  Yes, they're likely smart enough to know what they're doing, but
smart people make mistakes, and they appreciate being told when they're making
one, especially if the method of telling is something a bit more civil than a
machine check that might occur well after the application has been
initialized.

> On the other hand, the user can hit the same problem by simply calling rte_acl_classify_sse() directly.
Not if the function is statically declared and not exposed to the application
they can't :)

> > and prevent path switching
> > during searches, which may produce unexpected results.
> Not that I am advertising it, but it should be safe to update rte_acl_default_classify during searches:
> All versions of classify should produce exactly the same result for each input packet and treat the acl context as read-only.
>
Fair enough.

> > >
> > > diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
> > > deleted file mode 100644
> > > index e3d9fc1..0000000
> > > --- a/lib/librte_acl/acl_run.c
> > > +++ /dev/null
> > > @@ -1,944 +0,0 @@
> > > -/*-
> > > - *   BSD LICENSE
> > > - *
> > > - *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > - *   All rights reserved.
> > > - *
> > > - *   Redistribution and use in source and binary forms, with or without
> > > - *   modification, are permitted provided that the following conditions
> > >
> > > +
> > > +#define	__func_resolve_priority__	resolve_priority_scalar
> > > +#define	__func_match_check__		acl_match_check_scalar
> > > +#include "acl_match_check.def"
> > > +
> > I get that this lets you make some more code common, but it's just unpleasant to trace
> > through.  Looking at the definition of __func_match_check__ I don't see anything
> > particularly performance sensitive there.  What if instead you simply redefined
> > __func_match_check__ in a common internal header as acl_match_check (a generic
> > function), and had it accept a priority resolution function as an argument?  That
> > would still give you all the performance enhancements without having to include
> > c files in the middle of other c files, and would make the code a bit more
> > parseable.
> Yes, that way it would look much better.
> And it seems that with '-findirect-inlining' gcc is able to inline them via pointers properly.
> Will change as you suggested.
>
Thank you
Neil

> > > +/*
> > > + * When processing the transition, rather than using if/else
> > > + * construct, the offset is calculated for DFA and QRANGE and
> > > + * then conditionally added to the address based on node type.
> > > + * This is done to avoid branch mis-predictions. Since the
> > > + * offset is rather simple calculation it is more efficient
> > > + * to do the calculation and do a condition move rather than
> > > + * a conditional branch to determine which calculation to do.
> > > + */
> > > +static inline uint32_t
> > > +scan_forward(uint32_t input, uint32_t max)
> > > +{
> > > +	return (input == 0) ?
max : rte_bsf32(input); > > > +} > > > + } > > > +} > > > > > > + > > > +#define __func_resolve_priority__ resolve_priority_sse > > > +#define __func_match_check__ acl_match_check_sse > > > +#include "acl_match_check.def" > > > + > > Same deal as above. > > > > > +/* > > > + * Extract transitions from an XMM register and check for any matches > > > + */ > > > +static void > > > +acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx, > > > + struct parms *parms, struct acl_flow_data *flows) > > > +{ > > > + uint64_t transition1, transition2; > > > + > > > + /* extract transition from low 64 bits. */ > > > + transition1 = MM_CVT64(*indicies); > > > + > > > + /* extract transition from high 64 bits. */ > > > + *indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64); > > > + transition2 = MM_CVT64(*indicies); > > > + > > > + transition1 = acl_match_check_sse(transition1, slot, ctx, > > > + parms, flows); > > > + transition2 = acl_match_check_sse(transition2, slot + 1, ctx, > > > + parms, flows); > > > + > > > + /* update indicies with new transitions. */ > > > + *indicies = MM_SET64(transition2, transition1); > > > +} > > > + > > > +/* > > > + * Check for a match in 2 transitions (contained in SSE register) > > > + */ > > > +static inline void > > > +acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms, > > > + struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask) > > > +{ > > > + xmm_t temp; > > > + > > > + temp = MM_AND(match_mask, *indicies); > > > + while (!MM_TESTZ(temp, temp)) { > > > + acl_process_matches(indicies, slot, ctx, parms, flows); > > > + temp = MM_AND(match_mask, *indicies); > > > + } > > > +} > > > + > > > +/* > > > + * Check for any match in 4 transitions (contained in 2 SSE registers) > > > + */ > > > +static inline void > > > +acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms, > > > + struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2, > > > + xmm_t match_mask) > > > +{ > > > + xmm_t temp; > > > + > > > + /* put low 32 bits of each transition into one register */ > > > + temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2, > > > + 0x88); > > > + /* test for match node */ > > > + temp = MM_AND(match_mask, temp); > > > + > > > + while (!MM_TESTZ(temp, temp)) { > > > + acl_process_matches(indicies1, slot, ctx, parms, flows); > > > + acl_process_matches(indicies2, slot + 2, ctx, parms, flows); > > > + > > > + temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, > > > + (__m128)*indicies2, > > > + 0x88); > > > + temp = MM_AND(match_mask, temp); > > > + } > > > +} > > > + > > > +/* > > > + * Calculate the address of the next transition for > > > + * all types of nodes. Note that only DFA nodes and range > > > + * nodes actually transition to another node. Match > > > + * nodes don't move. > > > + */ > > > +static inline xmm_t > > > +acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input, > > > + xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range, > > > + xmm_t *indicies1, xmm_t *indicies2) > > > +{ > > > + xmm_t addr, node_types, temp; > > > + > > > + /* > > > + * Note that no transition is done for a match > > > + * node and therefore a stream freezes when > > > + * it reaches a match. 
> > > + */ > > > + > > > + /* Shuffle low 32 into temp and high 32 into indicies2 */ > > > + temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2, > > > + 0x88); > > > + *indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, > > > + (__m128)*indicies2, 0xdd); > > > + > > > + /* Calc node type and node addr */ > > > + node_types = MM_ANDNOT(index_mask, temp); > > > + addr = MM_AND(index_mask, temp); > > > + > > > + /* > > > + * Calc addr for DFAs - addr = dfa_index + input_byte > > > + */ > > > + > > > + /* mask for DFA type (0) nodes */ > > > + temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types)); > > > + > > > + /* add input byte to DFA position */ > > > + temp = MM_AND(temp, bytes); > > > + temp = MM_AND(temp, next_input); > > > + addr = MM_ADD32(addr, temp); > > > + > > > + /* > > > + * Calc addr for Range nodes -> range_index + range(input) > > > + */ > > > + node_types = MM_CMPEQ32(node_types, type_quad_range); > > > + > > > + /* > > > + * Calculate number of range boundaries that are less than the > > > + * input value. Range boundaries for each node are in signed 8 bit, > > > + * ordered from -128 to 127 in the indicies2 register. > > > + * This is effectively a popcnt of bytes that are greater than the > > > + * input byte. > > > + */ > > > + > > > + /* shuffle input byte to all 4 positions of 32 bit value */ > > > + temp = MM_SHUFFLE8(next_input, shuffle_input); > > > + > > > + /* check ranges */ > > > + temp = MM_CMPGT8(temp, *indicies2); > > > + > > > + /* convert -1 to 1 (bytes greater than input byte */ > > > + temp = MM_SIGN8(temp, temp); > > > + > > > + /* horizontal add pairs of bytes into words */ > > > + temp = MM_MADD8(temp, temp); > > > + > > > + /* horizontal add pairs of words into dwords */ > > > + temp = MM_MADD16(temp, ones_16); > > > + > > > + /* mask to range type nodes */ > > > + temp = MM_AND(temp, node_types); > > > + > > > + /* add index into node position */ > > > + return MM_ADD32(addr, temp); > > > +} > > > + > > > +/* > > > + * Process 4 transitions (in 2 SIMD registers) in parallel > > > + */ > > > +static inline xmm_t > > > +transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input, > > > + xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range, > > > + const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2) > > > +{ > > > + xmm_t addr; > > > + uint64_t trans0, trans2; > > > + > > > + /* Calculate the address (array index) for all 4 transitions. */ > > > + > > > + addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16, > > > + bytes, type_quad_range, indicies1, indicies2); > > > + > > > + /* Gather 64 bit transitions and pack back into 2 registers. 
*/ > > > + > > > + trans0 = trans[MM_CVT32(addr)]; > > > + > > > + /* get slot 2 */ > > > + > > > + /* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */ > > > + addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2); > > > + trans2 = trans[MM_CVT32(addr)]; > > > + > > > + /* get slot 1 */ > > > + > > > + /* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */ > > > + addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1); > > > + *indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0); > > > + > > > + /* get slot 3 */ > > > + > > > + /* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */ > > > + addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3); > > > + *indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2); > > > + > > > + return MM_SRL32(next_input, 8); > > > +} > > > + > > > +/* > > > + * Execute trie traversal with 8 traversals in parallel > > > + */ > > > +static inline int > > > +search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data, > > > + uint32_t *results, uint32_t total_packets, uint32_t categories) > > > +{ > > > + int n; > > > + struct acl_flow_data flows; > > > + uint64_t index_array[MAX_SEARCHES_SSE8]; > > > + struct completion cmplt[MAX_SEARCHES_SSE8]; > > > + struct parms parms[MAX_SEARCHES_SSE8]; > > > + xmm_t input0, input1; > > > + xmm_t indicies1, indicies2, indicies3, indicies4; > > > + > > > + acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, > > > + total_packets, categories, ctx->trans_table); > > > + > > > + for (n = 0; n < MAX_SEARCHES_SSE8; n++) { > > > + cmplt[n].count = 0; > > > + index_array[n] = acl_start_next_trie(&flows, parms, n, ctx); > > > + } > > > + > > > + /* > > > + * indicies1 contains index_array[0,1] > > > + * indicies2 contains index_array[2,3] > > > + * indicies3 contains index_array[4,5] > > > + * indicies4 contains index_array[6,7] > > > + */ > > > + > > > + indicies1 = MM_LOADU((xmm_t *) &index_array[0]); > > > + indicies2 = MM_LOADU((xmm_t *) &index_array[2]); > > > + > > > + indicies3 = MM_LOADU((xmm_t *) &index_array[4]); > > > + indicies4 = MM_LOADU((xmm_t *) &index_array[6]); > > > + > > > + /* Check for any matches. */ > > > + acl_match_check_x4(0, ctx, parms, &flows, > > > + &indicies1, &indicies2, mm_match_mask.m); > > > + acl_match_check_x4(4, ctx, parms, &flows, > > > + &indicies3, &indicies4, mm_match_mask.m); > > > + > > > + while (flows.started > 0) { > > > + > > > + /* Gather 4 bytes of input data for each stream. */ > > > + input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), > > > + 0); > > > + input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4), > > > + 0); > > > + > > > + input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1); > > > + input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1); > > > + > > > + input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2); > > > + input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2); > > > + > > > + input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3); > > > + input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3); > > > + > > > + /* Process the 4 bytes of input on each stream. 
*/ > > > + > > > + input0 = transition4(mm_index_mask.m, input0, > > > + mm_shuffle_input.m, mm_ones_16.m, > > > + mm_bytes.m, mm_type_quad_range.m, > > > + flows.trans, &indicies1, &indicies2); > > > + > > > + input1 = transition4(mm_index_mask.m, input1, > > > + mm_shuffle_input.m, mm_ones_16.m, > > > + mm_bytes.m, mm_type_quad_range.m, > > > + flows.trans, &indicies3, &indicies4); > > > + > > > + input0 = transition4(mm_index_mask.m, input0, > > > + mm_shuffle_input.m, mm_ones_16.m, > > > + mm_bytes.m, mm_type_quad_range.m, > > > + flows.trans, &indicies1, &indicies2); > > > + > > > + input1 = transition4(mm_index_mask.m, input1, > > > + mm_shuffle_input.m, mm_ones_16.m, > > > + mm_bytes.m, mm_type_quad_range.m, > > > + flows.trans, &indicies3, &indicies4); > > > + > > > + input0 = transition4(mm_index_mask.m, input0, > > > + mm_shuffle_input.m, mm_ones_16.m, > > > + mm_bytes.m, mm_type_quad_range.m, > > > + flows.trans, &indicies1, &indicies2); > > > + > > > + input1 = transition4(mm_index_mask.m, input1, > > > + mm_shuffle_input.m, mm_ones_16.m, > > > + mm_bytes.m, mm_type_quad_range.m, > > > + flows.trans, &indicies3, &indicies4); > > > + > > > + input0 = transition4(mm_index_mask.m, input0, > > > + mm_shuffle_input.m, mm_ones_16.m, > > > + mm_bytes.m, mm_type_quad_range.m, > > > + flows.trans, &indicies1, &indicies2); > > > + > > > + input1 = transition4(mm_index_mask.m, input1, > > > + mm_shuffle_input.m, mm_ones_16.m, > > > + mm_bytes.m, mm_type_quad_range.m, > > > + flows.trans, &indicies3, &indicies4); > > > + > > > + /* Check for any matches. */ > > > + acl_match_check_x4(0, ctx, parms, &flows, > > > + &indicies1, &indicies2, mm_match_mask.m); > > > + acl_match_check_x4(4, ctx, parms, &flows, > > > + &indicies3, &indicies4, mm_match_mask.m); > > > + } > > > + > > > + return 0; > > > +} > > > + > > > +/* > > > + * Execute trie traversal with 4 traversals in parallel > > > + */ > > > +static inline int > > > +search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data, > > > + uint32_t *results, int total_packets, uint32_t categories) > > > +{ > > > + int n; > > > + struct acl_flow_data flows; > > > + uint64_t index_array[MAX_SEARCHES_SSE4]; > > > + struct completion cmplt[MAX_SEARCHES_SSE4]; > > > + struct parms parms[MAX_SEARCHES_SSE4]; > > > + xmm_t input, indicies1, indicies2; > > > + > > > + acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, > > > + total_packets, categories, ctx->trans_table); > > > + > > > + for (n = 0; n < MAX_SEARCHES_SSE4; n++) { > > > + cmplt[n].count = 0; > > > + index_array[n] = acl_start_next_trie(&flows, parms, n, ctx); > > > + } > > > + > > > + indicies1 = MM_LOADU((xmm_t *) &index_array[0]); > > > + indicies2 = MM_LOADU((xmm_t *) &index_array[2]); > > > + > > > + /* Check for any matches. */ > > > + acl_match_check_x4(0, ctx, parms, &flows, > > > + &indicies1, &indicies2, mm_match_mask.m); > > > + > > > + while (flows.started > 0) { > > > + > > > + /* Gather 4 bytes of input data for each stream. */ > > > + input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0); > > > + input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1); > > > + input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2); > > > + input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3); > > > + > > > + /* Process the 4 bytes of input on each stream. 
*/ > > > + input = transition4(mm_index_mask.m, input, > > > + mm_shuffle_input.m, mm_ones_16.m, > > > + mm_bytes.m, mm_type_quad_range.m, > > > + flows.trans, &indicies1, &indicies2); > > > + > > > + input = transition4(mm_index_mask.m, input, > > > + mm_shuffle_input.m, mm_ones_16.m, > > > + mm_bytes.m, mm_type_quad_range.m, > > > + flows.trans, &indicies1, &indicies2); > > > + > > > + input = transition4(mm_index_mask.m, input, > > > + mm_shuffle_input.m, mm_ones_16.m, > > > + mm_bytes.m, mm_type_quad_range.m, > > > + flows.trans, &indicies1, &indicies2); > > > + > > > + input = transition4(mm_index_mask.m, input, > > > + mm_shuffle_input.m, mm_ones_16.m, > > > + mm_bytes.m, mm_type_quad_range.m, > > > + flows.trans, &indicies1, &indicies2); > > > + > > > + /* Check for any matches. */ > > > + acl_match_check_x4(0, ctx, parms, &flows, > > > + &indicies1, &indicies2, mm_match_mask.m); > > > + } > > > + > > > + return 0; > > > +} > > > + > > > +static inline xmm_t > > > +transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input, > > > + xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range, > > > + const uint64_t *trans, xmm_t *indicies1) > > > +{ > > > + uint64_t t; > > > + xmm_t addr, indicies2; > > > + > > > + indicies2 = MM_XOR(ones_16, ones_16); > > > + > > > + addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16, > > > + bytes, type_quad_range, indicies1, &indicies2); > > > + > > > + /* Gather 64 bit transitions and pack 2 per register. */ > > > + > > > + t = trans[MM_CVT32(addr)]; > > > + > > > + /* get slot 1 */ > > > + addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1); > > > + *indicies1 = MM_SET64(trans[MM_CVT32(addr)], t); > > > + > > > + return MM_SRL32(next_input, 8); > > > +} > > > + > > > +/* > > > + * Execute trie traversal with 2 traversals in parallel. > > > + */ > > > +static inline int > > > +search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data, > > > + uint32_t *results, uint32_t total_packets, uint32_t categories) > > > +{ > > > + int n; > > > + struct acl_flow_data flows; > > > + uint64_t index_array[MAX_SEARCHES_SSE2]; > > > + struct completion cmplt[MAX_SEARCHES_SSE2]; > > > + struct parms parms[MAX_SEARCHES_SSE2]; > > > + xmm_t input, indicies; > > > + > > > + acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, > > > + total_packets, categories, ctx->trans_table); > > > + > > > + for (n = 0; n < MAX_SEARCHES_SSE2; n++) { > > > + cmplt[n].count = 0; > > > + index_array[n] = acl_start_next_trie(&flows, parms, n, ctx); > > > + } > > > + > > > + indicies = MM_LOADU((xmm_t *) &index_array[0]); > > > + > > > + /* Check for any matches. */ > > > + acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m); > > > + > > > + while (flows.started > 0) { > > > + > > > + /* Gather 4 bytes of input data for each stream. */ > > > + input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0); > > > + input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1); > > > + > > > + /* Process the 4 bytes of input on each stream. 
*/ > > > + > > > + input = transition2(mm_index_mask64.m, input, > > > + mm_shuffle_input64.m, mm_ones_16.m, > > > + mm_bytes64.m, mm_type_quad_range64.m, > > > + flows.trans, &indicies); > > > + > > > + input = transition2(mm_index_mask64.m, input, > > > + mm_shuffle_input64.m, mm_ones_16.m, > > > + mm_bytes64.m, mm_type_quad_range64.m, > > > + flows.trans, &indicies); > > > + > > > + input = transition2(mm_index_mask64.m, input, > > > + mm_shuffle_input64.m, mm_ones_16.m, > > > + mm_bytes64.m, mm_type_quad_range64.m, > > > + flows.trans, &indicies); > > > + > > > + input = transition2(mm_index_mask64.m, input, > > > + mm_shuffle_input64.m, mm_ones_16.m, > > > + mm_bytes64.m, mm_type_quad_range64.m, > > > + flows.trans, &indicies); > > > + > > > + /* Check for any matches. */ > > > + acl_match_check_x2(0, ctx, parms, &flows, &indicies, > > > + mm_match_mask64.m); > > > + } > > > + > > > + return 0; > > > +} > > > + > > > +int > > > +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data, > > > + uint32_t *results, uint32_t num, uint32_t categories) > > > +{ > > > + if (categories != 1 && > > > + ((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0) > > > + return -EINVAL; > > > + > > > + if (likely(num >= MAX_SEARCHES_SSE8)) > > > + return search_sse_8(ctx, data, results, num, categories); > > > + else if (num >= MAX_SEARCHES_SSE4) > > > + return search_sse_4(ctx, data, results, num, categories); > > > + else > > > + return search_sse_2(ctx, data, results, num, categories); > > > +} > > > diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c > > > index 7c288bd..0cde07e 100644 > > > --- a/lib/librte_acl/rte_acl.c > > > +++ b/lib/librte_acl/rte_acl.c > > > @@ -38,6 +38,21 @@ > > > > > > TAILQ_HEAD(rte_acl_list, rte_tailq_entry); > > > > > > +/* by default, use always avaialbe scalar code path. */ > > > +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar; > > > + > > make this static, the outside world shouldn't need to see it. > > As I said above, I think it more plausible to keep it globally visible. > > > > > > +void __attribute__((constructor(INT16_MAX))) > > > +rte_acl_select_classify(void) > > Make it static, The outside world doesn't need to call this. > > See above, would like user to have an ability to call it manually if needed. > > > > > > +{ > > > + if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) { > > > + /* SSE version requires SSE4.1 */ > > > + rte_acl_default_classify = rte_acl_classify_sse; > > > + } else { > > > + /* reset to scalar version. */ > > > + rte_acl_default_classify = rte_acl_classify_scalar; > > Don't need the else clause here, the static initalizer has you covered. > > I think we better keep it like that - in case user calls it manually. > We always reset rte_acl_default_classify to the 'best proper' value. > > > > + } > > > +} > > > + > > > + > > > +/** > > > + * Invokes default rte_acl_classify function. > > > + */ > > > +extern rte_acl_classify_t rte_acl_default_classify; > > > + > > Doesn't need to be extern. > > > +#define rte_acl_classify(ctx, data, results, num, categories) \ > > > + (*rte_acl_default_classify)(ctx, data, results, num, categories) > > > + > > Not sure why you need this either. The rte_acl_classify_t should be enough, no? > > We preserve existing rte_acl_classify() API, so users don't need to modify their code. > This would be a great candidate for versioning (Bruce and have been discussing this). Neil >
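
P.S.  Since we agreed above on turning acl_match_check into a generic
function, here is the rough shape I was picturing -- just a sketch, the real
signature and body live in acl_match_check.def / acl_run.h today and will
differ in detail:

/* Hypothetical prototype for the priority-resolution callback; the real one
 * should match whatever resolve_priority_scalar/resolve_priority_sse
 * currently take. */
typedef void (*resolve_priority_t)(uint64_t transition, int slot,
	const struct rte_acl_ctx *ctx, struct parms *parms,
	struct acl_flow_data *flows);

/* One shared implementation in a common internal header, instead of
 * #including acl_match_check.def into each .c file. */
static inline uint64_t
acl_match_check(uint64_t transition, int slot,
	const struct rte_acl_ctx *ctx, struct parms *parms,
	struct acl_flow_data *flows, resolve_priority_t resolve_priority)
{
	/*
	 * ... the common match-handling code that currently sits in
	 * acl_match_check.def goes here unchanged, except that priorities
	 * are resolved through the supplied callback rather than through a
	 * textually substituted macro:
	 */
	resolve_priority(transition, slot, ctx, parms, flows);

	return transition;
}

/* The per-ISA wrappers then just pass their own resolver: */
static inline uint64_t
acl_match_check_sse(uint64_t transition, int slot,
	const struct rte_acl_ctx *ctx, struct parms *parms,
	struct acl_flow_data *flows)
{
	return acl_match_check(transition, slot, ctx, parms, flows,
		resolve_priority_sse);
}

As you say, with -findirect-inlining gcc should still be able to collapse the
indirect call in the scalar and sse callers, so we shouldn't lose anything
over the textual-substitution version.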