Date: Thu, 7 Aug 2014 16:11:34 -0400
From: Neil Horman
To: Konstantin Ananyev
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
Message-ID: <20140807201134.GA1632@hmsreliant.think-freely.org>
In-Reply-To: <1407436263-9360-1-git-send-email-konstantin.ananyev@intel.com>
References: <1407436263-9360-1-git-send-email-konstantin.ananyev@intel.com>

On Thu, Aug 07, 2014 at 07:31:03PM +0100, Konstantin Ananyev wrote:
> Make ACL library to build/work on 'default' architecture:
> - make rte_acl_classify_scalar really scalar
>   (make sure it wouldn't use sse4 instrincts through resolve_priority()).
> - Provide two versions of rte_acl_classify code path:
>   rte_acl_classify_sse() - could be build and used only on systems with sse4.2
>   and upper, return -ENOTSUP on lower arch.
>   rte_acl_classify_scalar() - a slower version, but could be build and used
>   on all systems.
> - keep common code shared between these two codepaths.
>
> v2 chages:
>   run-time selection of most appropriate code-path for given ISA.
>   By default the highest supprted one is selected.
>   User can still override that selection by manually assigning new value to
>   the global function pointer rte_acl_default_classify.
>   rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
>   points to.
>
>
> Signed-off-by: Konstantin Ananyev

This is a lot better, thank you. A few remaining issues below.

> ---
>  app/test-acl/main.c                |  13 +-
>  lib/librte_acl/Makefile            |   5 +-
>  lib/librte_acl/acl_bld.c           |   5 +-
>  lib/librte_acl/acl_match_check.def |  92 ++++
>  lib/librte_acl/acl_run.c           | 944 ------------------------------------
>  lib/librte_acl/acl_run.h           | 220 +++++
>  lib/librte_acl/acl_run_scalar.c    | 197 ++++
>  lib/librte_acl/acl_run_sse.c       | 630 +++++++++++++++++++++++++
>  lib/librte_acl/rte_acl.c           |  15 +
>  lib/librte_acl/rte_acl.h           |  24 +-
>  10 files changed, 1189 insertions(+), 956 deletions(-)
>  create mode 100644 lib/librte_acl/acl_match_check.def
>  delete mode 100644 lib/librte_acl/acl_run.c
>  create mode 100644 lib/librte_acl/acl_run.h
>  create mode 100644 lib/librte_acl/acl_run_scalar.c
>  create mode 100644 lib/librte_acl/acl_run_sse.c
>
> diff --git a/app/test-acl/main.c b/app/test-acl/main.c
> index d654409..45c6fa6 100644
> --- a/app/test-acl/main.c
> +++ b/app/test-acl/main.c
> @@ -787,6 +787,10 @@ acx_init(void)
> 	/* perform build. */
> 	ret = rte_acl_build(config.acx, &cfg);
>
> +	/* setup default rte_acl_classify */
> +	if (config.scalar)
> +		rte_acl_default_classify = rte_acl_classify_scalar;
> +

Exporting this variable as part of the ABI is a bad idea.
If the prototype of the function changes, you have to update all your
applications. Make the pointer an internal symbol instead, and set it via a
get/set routine that takes an enum representing the path to choose (rough
sketch at the end of this mail). That will help isolate the ABI from the
internal implementation. It will also let you prevent things like selecting a
run-time path that is incompatible with the running system, and prevent path
switching during searches, which may produce unexpected results.

>
> diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
> deleted file mode 100644
> index e3d9fc1..0000000
> --- a/lib/librte_acl/acl_run.c
> +++ /dev/null
> @@ -1,944 +0,0 @@
> -/*-
> - *   BSD LICENSE
> - *
> - *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> - *   All rights reserved.
> - *
> - *   Redistribution and use in source and binary forms, with or without
> - *   modification, are permitted provided that the following conditions
>
> +
> +#define	__func_resolve_priority__	resolve_priority_scalar
> +#define	__func_match_check__	acl_match_check_scalar
> +#include "acl_match_check.def"
> +

I get that this lets you make some more code common, but it's just unpleasant
to trace through. Looking at the definition of __func_match_check__, I don't
see anything particularly performance-sensitive there. What if instead you
defined acl_match_check as a generic function in a common internal header and
had it accept the priority-resolution function as an argument (also sketched
at the end of this mail)? That would still give you all the performance
enhancements without having to include C files in the middle of other C
files, and would make the code a bit easier to follow.

> +/*
> + * When processing the transition, rather than using if/else
> + * construct, the offset is calculated for DFA and QRANGE and
> + * then conditionally added to the address based on node type.
> + * This is done to avoid branch mis-predictions. Since the
> + * offset is rather simple calculation it is more efficient
> + * to do the calculation and do a condition move rather than
> + * a conditional branch to determine which calculation to do.
> + */
> +static inline uint32_t
> +scan_forward(uint32_t input, uint32_t max)
> +{
> +	return (input == 0) ? max : rte_bsf32(input);
> +}
> + }
> +}
>
> +
> +#define	__func_resolve_priority__	resolve_priority_sse
> +#define	__func_match_check__	acl_match_check_sse
> +#include "acl_match_check.def"
> +

Same deal as above.

> +/*
> + * Extract transitions from an XMM register and check for any matches
> + */
> +static void
> +acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
> +	struct parms *parms, struct acl_flow_data *flows)
> +{
> +	uint64_t transition1, transition2;
> +
> +	/* extract transition from low 64 bits. */
> +	transition1 = MM_CVT64(*indicies);
> +
> +	/* extract transition from high 64 bits. */
> +	*indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
> +	transition2 = MM_CVT64(*indicies);
> +
> +	transition1 = acl_match_check_sse(transition1, slot, ctx,
> +		parms, flows);
> +	transition2 = acl_match_check_sse(transition2, slot + 1, ctx,
> +		parms, flows);
> +
> +	/* update indicies with new transitions.
*/ > + *indicies = MM_SET64(transition2, transition1); > +} > + > +/* > + * Check for a match in 2 transitions (contained in SSE register) > + */ > +static inline void > +acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms, > + struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask) > +{ > + xmm_t temp; > + > + temp = MM_AND(match_mask, *indicies); > + while (!MM_TESTZ(temp, temp)) { > + acl_process_matches(indicies, slot, ctx, parms, flows); > + temp = MM_AND(match_mask, *indicies); > + } > +} > + > +/* > + * Check for any match in 4 transitions (contained in 2 SSE registers) > + */ > +static inline void > +acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms, > + struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2, > + xmm_t match_mask) > +{ > + xmm_t temp; > + > + /* put low 32 bits of each transition into one register */ > + temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2, > + 0x88); > + /* test for match node */ > + temp = MM_AND(match_mask, temp); > + > + while (!MM_TESTZ(temp, temp)) { > + acl_process_matches(indicies1, slot, ctx, parms, flows); > + acl_process_matches(indicies2, slot + 2, ctx, parms, flows); > + > + temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, > + (__m128)*indicies2, > + 0x88); > + temp = MM_AND(match_mask, temp); > + } > +} > + > +/* > + * Calculate the address of the next transition for > + * all types of nodes. Note that only DFA nodes and range > + * nodes actually transition to another node. Match > + * nodes don't move. > + */ > +static inline xmm_t > +acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input, > + xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range, > + xmm_t *indicies1, xmm_t *indicies2) > +{ > + xmm_t addr, node_types, temp; > + > + /* > + * Note that no transition is done for a match > + * node and therefore a stream freezes when > + * it reaches a match. > + */ > + > + /* Shuffle low 32 into temp and high 32 into indicies2 */ > + temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2, > + 0x88); > + *indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, > + (__m128)*indicies2, 0xdd); > + > + /* Calc node type and node addr */ > + node_types = MM_ANDNOT(index_mask, temp); > + addr = MM_AND(index_mask, temp); > + > + /* > + * Calc addr for DFAs - addr = dfa_index + input_byte > + */ > + > + /* mask for DFA type (0) nodes */ > + temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types)); > + > + /* add input byte to DFA position */ > + temp = MM_AND(temp, bytes); > + temp = MM_AND(temp, next_input); > + addr = MM_ADD32(addr, temp); > + > + /* > + * Calc addr for Range nodes -> range_index + range(input) > + */ > + node_types = MM_CMPEQ32(node_types, type_quad_range); > + > + /* > + * Calculate number of range boundaries that are less than the > + * input value. Range boundaries for each node are in signed 8 bit, > + * ordered from -128 to 127 in the indicies2 register. > + * This is effectively a popcnt of bytes that are greater than the > + * input byte. 
> + */ > + > + /* shuffle input byte to all 4 positions of 32 bit value */ > + temp = MM_SHUFFLE8(next_input, shuffle_input); > + > + /* check ranges */ > + temp = MM_CMPGT8(temp, *indicies2); > + > + /* convert -1 to 1 (bytes greater than input byte */ > + temp = MM_SIGN8(temp, temp); > + > + /* horizontal add pairs of bytes into words */ > + temp = MM_MADD8(temp, temp); > + > + /* horizontal add pairs of words into dwords */ > + temp = MM_MADD16(temp, ones_16); > + > + /* mask to range type nodes */ > + temp = MM_AND(temp, node_types); > + > + /* add index into node position */ > + return MM_ADD32(addr, temp); > +} > + > +/* > + * Process 4 transitions (in 2 SIMD registers) in parallel > + */ > +static inline xmm_t > +transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input, > + xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range, > + const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2) > +{ > + xmm_t addr; > + uint64_t trans0, trans2; > + > + /* Calculate the address (array index) for all 4 transitions. */ > + > + addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16, > + bytes, type_quad_range, indicies1, indicies2); > + > + /* Gather 64 bit transitions and pack back into 2 registers. */ > + > + trans0 = trans[MM_CVT32(addr)]; > + > + /* get slot 2 */ > + > + /* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */ > + addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2); > + trans2 = trans[MM_CVT32(addr)]; > + > + /* get slot 1 */ > + > + /* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */ > + addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1); > + *indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0); > + > + /* get slot 3 */ > + > + /* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */ > + addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3); > + *indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2); > + > + return MM_SRL32(next_input, 8); > +} > + > +/* > + * Execute trie traversal with 8 traversals in parallel > + */ > +static inline int > +search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data, > + uint32_t *results, uint32_t total_packets, uint32_t categories) > +{ > + int n; > + struct acl_flow_data flows; > + uint64_t index_array[MAX_SEARCHES_SSE8]; > + struct completion cmplt[MAX_SEARCHES_SSE8]; > + struct parms parms[MAX_SEARCHES_SSE8]; > + xmm_t input0, input1; > + xmm_t indicies1, indicies2, indicies3, indicies4; > + > + acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, > + total_packets, categories, ctx->trans_table); > + > + for (n = 0; n < MAX_SEARCHES_SSE8; n++) { > + cmplt[n].count = 0; > + index_array[n] = acl_start_next_trie(&flows, parms, n, ctx); > + } > + > + /* > + * indicies1 contains index_array[0,1] > + * indicies2 contains index_array[2,3] > + * indicies3 contains index_array[4,5] > + * indicies4 contains index_array[6,7] > + */ > + > + indicies1 = MM_LOADU((xmm_t *) &index_array[0]); > + indicies2 = MM_LOADU((xmm_t *) &index_array[2]); > + > + indicies3 = MM_LOADU((xmm_t *) &index_array[4]); > + indicies4 = MM_LOADU((xmm_t *) &index_array[6]); > + > + /* Check for any matches. */ > + acl_match_check_x4(0, ctx, parms, &flows, > + &indicies1, &indicies2, mm_match_mask.m); > + acl_match_check_x4(4, ctx, parms, &flows, > + &indicies3, &indicies4, mm_match_mask.m); > + > + while (flows.started > 0) { > + > + /* Gather 4 bytes of input data for each stream. 
*/ > + input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), > + 0); > + input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4), > + 0); > + > + input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1); > + input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1); > + > + input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2); > + input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2); > + > + input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3); > + input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3); > + > + /* Process the 4 bytes of input on each stream. */ > + > + input0 = transition4(mm_index_mask.m, input0, > + mm_shuffle_input.m, mm_ones_16.m, > + mm_bytes.m, mm_type_quad_range.m, > + flows.trans, &indicies1, &indicies2); > + > + input1 = transition4(mm_index_mask.m, input1, > + mm_shuffle_input.m, mm_ones_16.m, > + mm_bytes.m, mm_type_quad_range.m, > + flows.trans, &indicies3, &indicies4); > + > + input0 = transition4(mm_index_mask.m, input0, > + mm_shuffle_input.m, mm_ones_16.m, > + mm_bytes.m, mm_type_quad_range.m, > + flows.trans, &indicies1, &indicies2); > + > + input1 = transition4(mm_index_mask.m, input1, > + mm_shuffle_input.m, mm_ones_16.m, > + mm_bytes.m, mm_type_quad_range.m, > + flows.trans, &indicies3, &indicies4); > + > + input0 = transition4(mm_index_mask.m, input0, > + mm_shuffle_input.m, mm_ones_16.m, > + mm_bytes.m, mm_type_quad_range.m, > + flows.trans, &indicies1, &indicies2); > + > + input1 = transition4(mm_index_mask.m, input1, > + mm_shuffle_input.m, mm_ones_16.m, > + mm_bytes.m, mm_type_quad_range.m, > + flows.trans, &indicies3, &indicies4); > + > + input0 = transition4(mm_index_mask.m, input0, > + mm_shuffle_input.m, mm_ones_16.m, > + mm_bytes.m, mm_type_quad_range.m, > + flows.trans, &indicies1, &indicies2); > + > + input1 = transition4(mm_index_mask.m, input1, > + mm_shuffle_input.m, mm_ones_16.m, > + mm_bytes.m, mm_type_quad_range.m, > + flows.trans, &indicies3, &indicies4); > + > + /* Check for any matches. */ > + acl_match_check_x4(0, ctx, parms, &flows, > + &indicies1, &indicies2, mm_match_mask.m); > + acl_match_check_x4(4, ctx, parms, &flows, > + &indicies3, &indicies4, mm_match_mask.m); > + } > + > + return 0; > +} > + > +/* > + * Execute trie traversal with 4 traversals in parallel > + */ > +static inline int > +search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data, > + uint32_t *results, int total_packets, uint32_t categories) > +{ > + int n; > + struct acl_flow_data flows; > + uint64_t index_array[MAX_SEARCHES_SSE4]; > + struct completion cmplt[MAX_SEARCHES_SSE4]; > + struct parms parms[MAX_SEARCHES_SSE4]; > + xmm_t input, indicies1, indicies2; > + > + acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, > + total_packets, categories, ctx->trans_table); > + > + for (n = 0; n < MAX_SEARCHES_SSE4; n++) { > + cmplt[n].count = 0; > + index_array[n] = acl_start_next_trie(&flows, parms, n, ctx); > + } > + > + indicies1 = MM_LOADU((xmm_t *) &index_array[0]); > + indicies2 = MM_LOADU((xmm_t *) &index_array[2]); > + > + /* Check for any matches. */ > + acl_match_check_x4(0, ctx, parms, &flows, > + &indicies1, &indicies2, mm_match_mask.m); > + > + while (flows.started > 0) { > + > + /* Gather 4 bytes of input data for each stream. 
*/ > + input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0); > + input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1); > + input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2); > + input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3); > + > + /* Process the 4 bytes of input on each stream. */ > + input = transition4(mm_index_mask.m, input, > + mm_shuffle_input.m, mm_ones_16.m, > + mm_bytes.m, mm_type_quad_range.m, > + flows.trans, &indicies1, &indicies2); > + > + input = transition4(mm_index_mask.m, input, > + mm_shuffle_input.m, mm_ones_16.m, > + mm_bytes.m, mm_type_quad_range.m, > + flows.trans, &indicies1, &indicies2); > + > + input = transition4(mm_index_mask.m, input, > + mm_shuffle_input.m, mm_ones_16.m, > + mm_bytes.m, mm_type_quad_range.m, > + flows.trans, &indicies1, &indicies2); > + > + input = transition4(mm_index_mask.m, input, > + mm_shuffle_input.m, mm_ones_16.m, > + mm_bytes.m, mm_type_quad_range.m, > + flows.trans, &indicies1, &indicies2); > + > + /* Check for any matches. */ > + acl_match_check_x4(0, ctx, parms, &flows, > + &indicies1, &indicies2, mm_match_mask.m); > + } > + > + return 0; > +} > + > +static inline xmm_t > +transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input, > + xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range, > + const uint64_t *trans, xmm_t *indicies1) > +{ > + uint64_t t; > + xmm_t addr, indicies2; > + > + indicies2 = MM_XOR(ones_16, ones_16); > + > + addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16, > + bytes, type_quad_range, indicies1, &indicies2); > + > + /* Gather 64 bit transitions and pack 2 per register. */ > + > + t = trans[MM_CVT32(addr)]; > + > + /* get slot 1 */ > + addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1); > + *indicies1 = MM_SET64(trans[MM_CVT32(addr)], t); > + > + return MM_SRL32(next_input, 8); > +} > + > +/* > + * Execute trie traversal with 2 traversals in parallel. > + */ > +static inline int > +search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data, > + uint32_t *results, uint32_t total_packets, uint32_t categories) > +{ > + int n; > + struct acl_flow_data flows; > + uint64_t index_array[MAX_SEARCHES_SSE2]; > + struct completion cmplt[MAX_SEARCHES_SSE2]; > + struct parms parms[MAX_SEARCHES_SSE2]; > + xmm_t input, indicies; > + > + acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, > + total_packets, categories, ctx->trans_table); > + > + for (n = 0; n < MAX_SEARCHES_SSE2; n++) { > + cmplt[n].count = 0; > + index_array[n] = acl_start_next_trie(&flows, parms, n, ctx); > + } > + > + indicies = MM_LOADU((xmm_t *) &index_array[0]); > + > + /* Check for any matches. */ > + acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m); > + > + while (flows.started > 0) { > + > + /* Gather 4 bytes of input data for each stream. */ > + input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0); > + input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1); > + > + /* Process the 4 bytes of input on each stream. 
> +	 */
> +
> +	input = transition2(mm_index_mask64.m, input,
> +		mm_shuffle_input64.m, mm_ones_16.m,
> +		mm_bytes64.m, mm_type_quad_range64.m,
> +		flows.trans, &indicies);
> +
> +	input = transition2(mm_index_mask64.m, input,
> +		mm_shuffle_input64.m, mm_ones_16.m,
> +		mm_bytes64.m, mm_type_quad_range64.m,
> +		flows.trans, &indicies);
> +
> +	input = transition2(mm_index_mask64.m, input,
> +		mm_shuffle_input64.m, mm_ones_16.m,
> +		mm_bytes64.m, mm_type_quad_range64.m,
> +		flows.trans, &indicies);
> +
> +	input = transition2(mm_index_mask64.m, input,
> +		mm_shuffle_input64.m, mm_ones_16.m,
> +		mm_bytes64.m, mm_type_quad_range64.m,
> +		flows.trans, &indicies);
> +
> +	/* Check for any matches. */
> +	acl_match_check_x2(0, ctx, parms, &flows, &indicies,
> +		mm_match_mask64.m);
> +	}
> +
> +	return 0;
> +}
> +
> +int
> +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +	uint32_t *results, uint32_t num, uint32_t categories)
> +{
> +	if (categories != 1 &&
> +		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> +		return -EINVAL;
> +
> +	if (likely(num >= MAX_SEARCHES_SSE8))
> +		return search_sse_8(ctx, data, results, num, categories);
> +	else if (num >= MAX_SEARCHES_SSE4)
> +		return search_sse_4(ctx, data, results, num, categories);
> +	else
> +		return search_sse_2(ctx, data, results, num, categories);
> +}
> diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
> index 7c288bd..0cde07e 100644
> --- a/lib/librte_acl/rte_acl.c
> +++ b/lib/librte_acl/rte_acl.c
> @@ -38,6 +38,21 @@
>
>  TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
>
> +/* by default, use always avaialbe scalar code path. */
> +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
> +

Make this static; the outside world shouldn't need to see it.

> +void __attribute__((constructor(INT16_MAX)))
> +rte_acl_select_classify(void)

Make this static too; the outside world doesn't need to call it.

> +{
> +	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) {
> +		/* SSE version requires SSE4.1 */
> +		rte_acl_default_classify = rte_acl_classify_sse;
> +	} else {
> +		/* reset to scalar version. */
> +		rte_acl_default_classify = rte_acl_classify_scalar;

You don't need the else clause here; the static initializer has you covered.

> +	}
> +}
> +
> +
> +/**
> + * Invokes default rte_acl_classify function.
> + */
> +extern rte_acl_classify_t rte_acl_default_classify;
> +

This doesn't need to be extern.

> +#define	rte_acl_classify(ctx, data, results, num, categories)	\
> +	(*rte_acl_default_classify)(ctx, data, results, num, categories)
> +

I'm not sure why you need this macro either; the rte_acl_classify_t type
should be enough, no? If you go with the get/set approach above,
rte_acl_classify() can simply stay a real exported function (sketches below
my signature).

Regards,
Neil
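
P.S. To make the get/set suggestion concrete, here is a rough, untested
sketch. The enum and function names (rte_acl_classify_alg,
rte_acl_set_classify_alg) are placeholders I just made up, not anything in
your patch; the only point is that the function pointer stays private to
rte_acl.c and the ABI exposes an enum-based setter plus a plain
rte_acl_classify() function:

/* rte_acl.h: export only the enum, the setter and rte_acl_classify() */
enum rte_acl_classify_alg {
	RTE_ACL_CLASSIFY_DEFAULT = 0,
	RTE_ACL_CLASSIFY_SCALAR,
	RTE_ACL_CLASSIFY_SSE,
};

int rte_acl_set_classify_alg(enum rte_acl_classify_alg alg);

int rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories);

/* rte_acl.c: the pointer never leaves the library */
static rte_acl_classify_t acl_classify_fn = rte_acl_classify_scalar;

int
rte_acl_set_classify_alg(enum rte_acl_classify_alg alg)
{
	switch (alg) {
	case RTE_ACL_CLASSIFY_SSE:
		/* refuse a path the running cpu cannot execute */
		if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
			return -ENOTSUP;
		acl_classify_fn = rte_acl_classify_sse;
		return 0;
	case RTE_ACL_CLASSIFY_DEFAULT:
	case RTE_ACL_CLASSIFY_SCALAR:
		acl_classify_fn = rte_acl_classify_scalar;
		return 0;
	default:
		return -EINVAL;
	}
}

int
rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories)
{
	return acl_classify_fn(ctx, data, results, num, categories);
}

app/test-acl would then call rte_acl_set_classify_alg(RTE_ACL_CLASSIFY_SCALAR)
when config.scalar is set, and your constructor can keep picking the best
default internally without any of it being visible to applications.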
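
As for the acl_match_check.def point, the shape I have in mind is roughly the
following. This is only a sketch: the return type and the first five
parameters are taken from the acl_match_check_sse() call sites quoted above,
but the resolve_priority_t prototype is a guess on my part, and I've elided
the body (it would just be whatever the .def expands to today, written once):

/* acl_run.h (shared header), replacing acl_match_check.def */
typedef void (*resolve_priority_t)(uint64_t transition, int n,
	const struct rte_acl_ctx *ctx, struct parms *parms,
	const struct rte_acl_match_results *p, uint32_t categories);

static inline uint64_t
acl_match_check(uint64_t transition, int slot,
	const struct rte_acl_ctx *ctx, struct parms *parms,
	struct acl_flow_data *flows, resolve_priority_t resolve_priority)
{
	/*
	 * ... the body acl_match_check.def generates today, with the
	 * __func_resolve_priority__() call replaced by resolve_priority() ...
	 */
}

acl_run_sse.c would then call acl_match_check(transition1, slot, ctx, parms,
flows, resolve_priority_sse) directly. Since everything stays static inline
within one translation unit, the compiler can still inline the resolver, so
you shouldn't lose anything performance-wise.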