From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Ananyev, Konstantin"
To: Neil Horman
Cc: "dev@dpdk.org"
Subject: Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
Date: Fri, 8 Aug 2014 13:09:34 +0000
Message-ID: <2601191342CEEE43887BDE71AB977258213522BE@IRSMSX105.ger.corp.intel.com>
In-Reply-To: <20140808122503.GA11413@hmsreliant.think-freely.org>
References: <1407436263-9360-1-git-send-email-konstantin.ananyev@intel.com>
 <20140807201134.GA1632@hmsreliant.think-freely.org>
 <2601191342CEEE43887BDE71AB97725821352285@IRSMSX105.ger.corp.intel.com>
 <20140808122503.GA11413@hmsreliant.think-freely.org>
List-Id: patches and discussions about DPDK

> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Friday, August 08, 2014 1:25 PM
> To: Ananyev, Konstantin
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
>
> On Fri, Aug 08, 2014 at 11:49:58AM +0000, Ananyev, Konstantin wrote:
> > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > Sent: Thursday, August 07, 2014 9:12 PM
> > > To: Ananyev, Konstantin
> > > Cc: dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
> > >
> > > On Thu, Aug 07, 2014 at 07:31:03PM +0100, Konstantin Ananyev wrote:
> > > > Make the ACL library build/work on the 'default' architecture:
> > > > - make rte_acl_classify_scalar really scalar
> > > >   (make sure it wouldn't use SSE4 intrinsics through resolve_priority()).
> > > > - Provide two versions of the rte_acl_classify code path:
> > > >   rte_acl_classify_sse() - can be built and used only on systems with SSE4.2
> > > >   and above; returns -ENOTSUP on a lower arch.
> > > >   rte_acl_classify_scalar() - a slower version, but can be built and used
> > > >   on all systems.
> > > > - keep common code shared between these two code paths.
> > > >
> > > > v2 changes:
> > > >   run-time selection of the most appropriate code path for the given ISA.
> > > >   By default the highest supported one is selected.
> > > >   The user can still override that selection by manually assigning a new value to
> > > >   the global function pointer rte_acl_default_classify.
> > > >   rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
> > > >   points to.
> > > >
> > > >
> > > > Signed-off-by: Konstantin Ananyev
> > >
> > > This is a lot better, thank you.  A few remaining issues.
> >
> > My comments inline too.
> > Thanks
> > Konstantin
> >
> > >
> > > > ---
> > > >  app/test-acl/main.c                |  13 +-
> > > >  lib/librte_acl/Makefile            |   5 +-
> > > >  lib/librte_acl/acl_bld.c           |   5 +-
> > > >  lib/librte_acl/acl_match_check.def |  92 ++++
> > > >  lib/librte_acl/acl_run.c           | 944 --------------------------------
> > > >  lib/librte_acl/acl_run.h           | 220 +++++++++
> > > >  lib/librte_acl/acl_run_scalar.c    | 197 ++++++++
> > > >  lib/librte_acl/acl_run_sse.c       | 630 +++++++++++++++++++++++++
> > > >  lib/librte_acl/rte_acl.c           |  15 +
> > > >  lib/librte_acl/rte_acl.h           |  24 +-
> > > >  10 files changed, 1189 insertions(+), 956 deletions(-)
> > > >  create mode 100644 lib/librte_acl/acl_match_check.def
> > > >  delete mode 100644 lib/librte_acl/acl_run.c
> > > >  create mode 100644 lib/librte_acl/acl_run.h
> > > >  create mode 100644 lib/librte_acl/acl_run_scalar.c
> > > >  create mode 100644 lib/librte_acl/acl_run_sse.c
> > > >
> > > > diff --git a/app/test-acl/main.c b/app/test-acl/main.c
> > > > index d654409..45c6fa6 100644
> > > > --- a/app/test-acl/main.c
> > > > +++ b/app/test-acl/main.c
> > > > @@ -787,6 +787,10 @@ acx_init(void)
> > > > 	/* perform build. */
> > > > 	ret = rte_acl_build(config.acx, &cfg);
> > > >
> > > > +	/* setup default rte_acl_classify */
> > > > +	if (config.scalar)
> > > > +		rte_acl_default_classify = rte_acl_classify_scalar;
> > > > +
> > > Exporting this variable as part of the ABI is a bad idea.  If the prototype of
> > > the function changes you have to update all your applications.
> >
> > If the prototype of rte_acl_classify changes, most likely you'll have to update the code that uses it anyway.
> >
> Why?  If you hide this from the application, changes to the internal
> implementation will also be invisible.  When building as a DSO, an application
> will be able to transition between libraries without the need for a rebuild.

Because rte_acl_classify() is part of the ACL API that users call.
If we add or modify its parameters and/or return value, users will have to change their apps anyway.

> > > Make the pointer
> > > an internal symbol and set it using a get/set routine with an enum to represent
> > > the path to choose.  That will help isolate the ABI from the internal
> > > implementation.
> >
> > That was my first intention too.
> > But then I realised that if we make it internal, we'll need to make rte_acl_classify() a proper function,
> > and that will cost us an extra call (or jump).
>
> That's true, but I don't see that as a problem.  We're not talking about a hot
> code path here, it's a setup function.

I am not talking about rte_acl_select_classify() (the setup path) but about rte_acl_classify() itself.
If I make rte_acl_default_classify static, then rte_acl_classify() would need to become a real function,
and each call would look something like:

->call rte_acl_classify
---> load rte_acl_default_classify value into a register
---> jmp (*reg)

> Or do you think that an application will
> be switching between classification functions on every classify operation?

God no.
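(For illustration only: a rough sketch of the get/set alternative being discussed,
with made-up enum and setter names, not code from this patch. The dispatch pointer
stays internal to the library, the rte_acl_classify() macro in rte_acl.h would go
away, and every lookup pays the extra call plus indirect jump mentioned above.)

#include <errno.h>
#include <rte_acl.h>
#include <rte_cpuflags.h>

/* hypothetical selector enum and setter -- not part of the patch */
enum rte_acl_classify_method {
	RTE_ACL_CLASSIFY_SCALAR,
	RTE_ACL_CLASSIFY_SSE,
};

/* hidden from applications; no ABI exposure of the pointer itself */
static rte_acl_classify_t classify_fn = rte_acl_classify_scalar;

int
rte_acl_set_classify_method(enum rte_acl_classify_method m)
{
	switch (m) {
	case RTE_ACL_CLASSIFY_SCALAR:
		classify_fn = rte_acl_classify_scalar;
		return 0;
	case RTE_ACL_CLASSIFY_SSE:
		/* refuse a path the running CPU cannot execute */
		if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
			return -ENOTSUP;
		classify_fn = rte_acl_classify_sse;
		return 0;
	default:
		return -EINVAL;
	}
}

/* rte_acl_classify() becomes a real function: one extra call plus an
 * indirect jump per invocation -- the overhead discussed above. */
int
rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories)
{
	return classify_fn(ctx, data, results, num, categories);
}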
> > Also, I think the user should have the ability to change the default classify code path without modifying/rebuilding the acl library.
>
> I agree, but both the methods we are advocating for allow that.  It's really just
> a question of exposing the mechanism as data or text in the binary.  Exposing it
> as data comes with implicit ABI constraints that are less prevalent when done
> as code entry points.
>
> > For example: a bug in an optimised code path is discovered, or a user may want to implement and use his own version of classify().
>
> In the case of a bug in the optimized path, you just fix the bug.

It is not about me. It is about a user who gets librte_acl as part of a binary distribution.
Of course, he will probably report it and we will probably fix it sooner or later.
But with such an ability he can switch to the safe implementation immediately,
without touching the library, and then wait for the fix.

> If you want
> to provide your own classification function, that's fine I suppose, but that
> seems completely outside the scope of what we're trying to do here.  It's not
> advantageous to just throw that in there.  If you want to be able to provide
> your own classifier function, let's at least take some time to make sure that the
> function prototype is sufficiently capable to accept all the data you might want
> to pass it in the future, before we go exposing it.  Otherwise you'll have to
> break the ABI in future versions, which is something we've been discussing
> trying to avoid.

rte_acl_classify() is already exposed (part of the API), same as rte_acl_classify_scalar().
If in the future we change these functions' prototypes, that will break the ABI anyway.

>
> > > It will also let you prevent things like selecting a run time
> > > path that is incompatible with the running system
> >
> > If the user is going to update rte_acl_default_classify, he is probably smart enough to know what he is doing.
>
> That really seems like poor design to me.  I don't see why you wouldn't at least
> want to warn the developer of an application if, at run time, they were to assign
> a default classifier method that was incompatible with the running system.  Yes,
> they're likely smart enough to know what they're doing, but smart people make
> mistakes, and appreciate being told when they're doing so, especially if the
> method of telling is something a bit more civil than a machine check that
> might occur well after the application has been initialized.

I have no problem providing rte_acl_check_classify(flags_required, classify_ptr) that would do the checking and emit the warning.
Though, as I said above, I'd prefer not to hide rte_acl_default_classify: it would cause extra overhead for rte_acl_classify().

>
> > On the other hand, a user can hit the same problem by simply calling rte_acl_classify_sse() directly.
>
> Not if the function is statically declared and not exposed to the application
> they can't :)

I don't really want to hide rte_acl_classify_sse()/rte_acl_classify_scalar().
They should be available directly, I think.
In the future we might introduce new versions for more sophisticated ISAs (rte_acl_classify_avx() or something).
Users should have the ability to downgrade their classify() function if they like.

> >
> > > and prevent path switching
> > > during searches, which may produce unexpected results.
> > Not that I am advertising it, but it should be safe to update rte_acl_default_classify during searches:
> > all versions of classify should produce exactly the same result for each input packet and treat the acl context as read-only.
> >
> Fair enough.
>
> > >
> > > >
> > > > diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
> > > > deleted file mode 100644
> > > > index e3d9fc1..0000000
> > > > --- a/lib/librte_acl/acl_run.c
> > > > +++ /dev/null
> > > > @@ -1,944 +0,0 @@
> > > > -/*-
> > > > - *  BSD LICENSE
> > > > - *
> > > > - *  Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > - *  All rights reserved.
> > > > - *
> > > > - *  Redistribution and use in source and binary forms, with or without
> > > > - *  modification, are permitted provided that the following conditions
> > > >
> > > > +
> > > > +#define __func_resolve_priority__	resolve_priority_scalar
> > > > +#define __func_match_check__		acl_match_check_scalar
> > > > +#include "acl_match_check.def"
> > > > +
> > > I get that this lets you make some more code common, but it's just unpleasant to trace
> > > through.  Looking at the definition of __func_match_check__ I don't see anything
> > > particularly performance sensitive there.  What if instead you simply redefined
> > > __func_match_check__ in a common internal header as acl_match_check (a generic
> > > function), and had it accept the priority resolution function as an argument?  That
> > > would still give you all the performance enhancements without having to include
> > > C files in the middle of other C files, and would make the code a bit more
> > > parseable.
> >
> > Yes, that way it would look much better.
> > And it seems that with '-findirect-inlining' gcc is able to inline them via pointers properly.
> > Will change as you suggested.
> >
> Thank you
> Neil
>
> > > > +/*
> > > > + * When processing the transition, rather than using an if/else
> > > > + * construct, the offset is calculated for DFA and QRANGE and
> > > > + * then conditionally added to the address based on node type.
> > > > + * This is done to avoid branch mis-predictions. Since the
> > > > + * offset is a rather simple calculation, it is more efficient
> > > > + * to do the calculation and a conditional move rather than
> > > > + * a conditional branch to determine which calculation to do.
> > > > + */
> > > > +static inline uint32_t
> > > > +scan_forward(uint32_t input, uint32_t max)
> > > > +{
> > > > +	return (input == 0) ? max : rte_bsf32(input);
> > > > +}
> > > > + }
> > > > +}
> > > >
> > > > +
> > > > +#define __func_resolve_priority__	resolve_priority_sse
> > > > +#define __func_match_check__		acl_match_check_sse
> > > > +#include "acl_match_check.def"
> > > > +
> > > Same deal as above.
> > >
> > > > +/*
> > > > + * Extract transitions from an XMM register and check for any matches
> > > > + */
> > > > +static void
> > > > +acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
> > > > +	struct parms *parms, struct acl_flow_data *flows)
> > > > +{
> > > > +	uint64_t transition1, transition2;
> > > > +
> > > > +	/* extract transition from low 64 bits. */
> > > > +	transition1 = MM_CVT64(*indicies);
> > > > +
> > > > +	/* extract transition from high 64 bits.
*/ > > > > + *indicies =3D MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64); > > > > + transition2 =3D MM_CVT64(*indicies); > > > > + > > > > + transition1 =3D acl_match_check_sse(transition1, slot, ctx, > > > > + parms, flows); > > > > + transition2 =3D acl_match_check_sse(transition2, slot + 1, ctx, > > > > + parms, flows); > > > > + > > > > + /* update indicies with new transitions. */ > > > > + *indicies =3D MM_SET64(transition2, transition1); > > > > +} > > > > + > > > > +/* > > > > + * Check for a match in 2 transitions (contained in SSE register) > > > > + */ > > > > +static inline void > > > > +acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct= parms *parms, > > > > + struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask) > > > > +{ > > > > + xmm_t temp; > > > > + > > > > + temp =3D MM_AND(match_mask, *indicies); > > > > + while (!MM_TESTZ(temp, temp)) { > > > > + acl_process_matches(indicies, slot, ctx, parms, flows); > > > > + temp =3D MM_AND(match_mask, *indicies); > > > > + } > > > > +} > > > > + > > > > +/* > > > > + * Check for any match in 4 transitions (contained in 2 SSE regist= ers) > > > > + */ > > > > +static inline void > > > > +acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct= parms *parms, > > > > + struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2, > > > > + xmm_t match_mask) > > > > +{ > > > > + xmm_t temp; > > > > + > > > > + /* put low 32 bits of each transition into one register */ > > > > + temp =3D (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicie= s2, > > > > + 0x88); > > > > + /* test for match node */ > > > > + temp =3D MM_AND(match_mask, temp); > > > > + > > > > + while (!MM_TESTZ(temp, temp)) { > > > > + acl_process_matches(indicies1, slot, ctx, parms, flows); > > > > + acl_process_matches(indicies2, slot + 2, ctx, parms, flows); > > > > + > > > > + temp =3D (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, > > > > + (__m128)*indicies2, > > > > + 0x88); > > > > + temp =3D MM_AND(match_mask, temp); > > > > + } > > > > +} > > > > + > > > > +/* > > > > + * Calculate the address of the next transition for > > > > + * all types of nodes. Note that only DFA nodes and range > > > > + * nodes actually transition to another node. Match > > > > + * nodes don't move. > > > > + */ > > > > +static inline xmm_t > > > > +acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_in= put, > > > > + xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range, > > > > + xmm_t *indicies1, xmm_t *indicies2) > > > > +{ > > > > + xmm_t addr, node_types, temp; > > > > + > > > > + /* > > > > + * Note that no transition is done for a match > > > > + * node and therefore a stream freezes when > > > > + * it reaches a match. 
> > > > + */ > > > > + > > > > + /* Shuffle low 32 into temp and high 32 into indicies2 */ > > > > + temp =3D (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicie= s2, > > > > + 0x88); > > > > + *indicies2 =3D (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, > > > > + (__m128)*indicies2, 0xdd); > > > > + > > > > + /* Calc node type and node addr */ > > > > + node_types =3D MM_ANDNOT(index_mask, temp); > > > > + addr =3D MM_AND(index_mask, temp); > > > > + > > > > + /* > > > > + * Calc addr for DFAs - addr =3D dfa_index + input_byte > > > > + */ > > > > + > > > > + /* mask for DFA type (0) nodes */ > > > > + temp =3D MM_CMPEQ32(node_types, MM_XOR(node_types, node_types)); > > > > + > > > > + /* add input byte to DFA position */ > > > > + temp =3D MM_AND(temp, bytes); > > > > + temp =3D MM_AND(temp, next_input); > > > > + addr =3D MM_ADD32(addr, temp); > > > > + > > > > + /* > > > > + * Calc addr for Range nodes -> range_index + range(input) > > > > + */ > > > > + node_types =3D MM_CMPEQ32(node_types, type_quad_range); > > > > + > > > > + /* > > > > + * Calculate number of range boundaries that are less than the > > > > + * input value. Range boundaries for each node are in signed 8 bi= t, > > > > + * ordered from -128 to 127 in the indicies2 register. > > > > + * This is effectively a popcnt of bytes that are greater than th= e > > > > + * input byte. > > > > + */ > > > > + > > > > + /* shuffle input byte to all 4 positions of 32 bit value */ > > > > + temp =3D MM_SHUFFLE8(next_input, shuffle_input); > > > > + > > > > + /* check ranges */ > > > > + temp =3D MM_CMPGT8(temp, *indicies2); > > > > + > > > > + /* convert -1 to 1 (bytes greater than input byte */ > > > > + temp =3D MM_SIGN8(temp, temp); > > > > + > > > > + /* horizontal add pairs of bytes into words */ > > > > + temp =3D MM_MADD8(temp, temp); > > > > + > > > > + /* horizontal add pairs of words into dwords */ > > > > + temp =3D MM_MADD16(temp, ones_16); > > > > + > > > > + /* mask to range type nodes */ > > > > + temp =3D MM_AND(temp, node_types); > > > > + > > > > + /* add index into node position */ > > > > + return MM_ADD32(addr, temp); > > > > +} > > > > + > > > > +/* > > > > + * Process 4 transitions (in 2 SIMD registers) in parallel > > > > + */ > > > > +static inline xmm_t > > > > +transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_inpu= t, > > > > + xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range, > > > > + const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2) > > > > +{ > > > > + xmm_t addr; > > > > + uint64_t trans0, trans2; > > > > + > > > > + /* Calculate the address (array index) for all 4 transitions. */ > > > > + > > > > + addr =3D acl_calc_addr(index_mask, next_input, shuffle_input, one= s_16, > > > > + bytes, type_quad_range, indicies1, indicies2); > > > > + > > > > + /* Gather 64 bit transitions and pack back into 2 registers. 
*/ > > > > + > > > > + trans0 =3D trans[MM_CVT32(addr)]; > > > > + > > > > + /* get slot 2 */ > > > > + > > > > + /* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */ > > > > + addr =3D MM_SHUFFLE32(addr, SHUFFLE32_SLOT2); > > > > + trans2 =3D trans[MM_CVT32(addr)]; > > > > + > > > > + /* get slot 1 */ > > > > + > > > > + /* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */ > > > > + addr =3D MM_SHUFFLE32(addr, SHUFFLE32_SLOT1); > > > > + *indicies1 =3D MM_SET64(trans[MM_CVT32(addr)], trans0); > > > > + > > > > + /* get slot 3 */ > > > > + > > > > + /* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */ > > > > + addr =3D MM_SHUFFLE32(addr, SHUFFLE32_SLOT3); > > > > + *indicies2 =3D MM_SET64(trans[MM_CVT32(addr)], trans2); > > > > + > > > > + return MM_SRL32(next_input, 8); > > > > +} > > > > + > > > > +/* > > > > + * Execute trie traversal with 8 traversals in parallel > > > > + */ > > > > +static inline int > > > > +search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data, > > > > + uint32_t *results, uint32_t total_packets, uint32_t categories) > > > > +{ > > > > + int n; > > > > + struct acl_flow_data flows; > > > > + uint64_t index_array[MAX_SEARCHES_SSE8]; > > > > + struct completion cmplt[MAX_SEARCHES_SSE8]; > > > > + struct parms parms[MAX_SEARCHES_SSE8]; > > > > + xmm_t input0, input1; > > > > + xmm_t indicies1, indicies2, indicies3, indicies4; > > > > + > > > > + acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, > > > > + total_packets, categories, ctx->trans_table); > > > > + > > > > + for (n =3D 0; n < MAX_SEARCHES_SSE8; n++) { > > > > + cmplt[n].count =3D 0; > > > > + index_array[n] =3D acl_start_next_trie(&flows, parms, n, ctx); > > > > + } > > > > + > > > > + /* > > > > + * indicies1 contains index_array[0,1] > > > > + * indicies2 contains index_array[2,3] > > > > + * indicies3 contains index_array[4,5] > > > > + * indicies4 contains index_array[6,7] > > > > + */ > > > > + > > > > + indicies1 =3D MM_LOADU((xmm_t *) &index_array[0]); > > > > + indicies2 =3D MM_LOADU((xmm_t *) &index_array[2]); > > > > + > > > > + indicies3 =3D MM_LOADU((xmm_t *) &index_array[4]); > > > > + indicies4 =3D MM_LOADU((xmm_t *) &index_array[6]); > > > > + > > > > + /* Check for any matches. */ > > > > + acl_match_check_x4(0, ctx, parms, &flows, > > > > + &indicies1, &indicies2, mm_match_mask.m); > > > > + acl_match_check_x4(4, ctx, parms, &flows, > > > > + &indicies3, &indicies4, mm_match_mask.m); > > > > + > > > > + while (flows.started > 0) { > > > > + > > > > + /* Gather 4 bytes of input data for each stream. */ > > > > + input0 =3D MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), > > > > + 0); > > > > + input1 =3D MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4), > > > > + 0); > > > > + > > > > + input0 =3D MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1); > > > > + input1 =3D MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1); > > > > + > > > > + input0 =3D MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2); > > > > + input1 =3D MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2); > > > > + > > > > + input0 =3D MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3); > > > > + input1 =3D MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3); > > > > + > > > > + /* Process the 4 bytes of input on each stream. 
*/ > > > > + > > > > + input0 =3D transition4(mm_index_mask.m, input0, > > > > + mm_shuffle_input.m, mm_ones_16.m, > > > > + mm_bytes.m, mm_type_quad_range.m, > > > > + flows.trans, &indicies1, &indicies2); > > > > + > > > > + input1 =3D transition4(mm_index_mask.m, input1, > > > > + mm_shuffle_input.m, mm_ones_16.m, > > > > + mm_bytes.m, mm_type_quad_range.m, > > > > + flows.trans, &indicies3, &indicies4); > > > > + > > > > + input0 =3D transition4(mm_index_mask.m, input0, > > > > + mm_shuffle_input.m, mm_ones_16.m, > > > > + mm_bytes.m, mm_type_quad_range.m, > > > > + flows.trans, &indicies1, &indicies2); > > > > + > > > > + input1 =3D transition4(mm_index_mask.m, input1, > > > > + mm_shuffle_input.m, mm_ones_16.m, > > > > + mm_bytes.m, mm_type_quad_range.m, > > > > + flows.trans, &indicies3, &indicies4); > > > > + > > > > + input0 =3D transition4(mm_index_mask.m, input0, > > > > + mm_shuffle_input.m, mm_ones_16.m, > > > > + mm_bytes.m, mm_type_quad_range.m, > > > > + flows.trans, &indicies1, &indicies2); > > > > + > > > > + input1 =3D transition4(mm_index_mask.m, input1, > > > > + mm_shuffle_input.m, mm_ones_16.m, > > > > + mm_bytes.m, mm_type_quad_range.m, > > > > + flows.trans, &indicies3, &indicies4); > > > > + > > > > + input0 =3D transition4(mm_index_mask.m, input0, > > > > + mm_shuffle_input.m, mm_ones_16.m, > > > > + mm_bytes.m, mm_type_quad_range.m, > > > > + flows.trans, &indicies1, &indicies2); > > > > + > > > > + input1 =3D transition4(mm_index_mask.m, input1, > > > > + mm_shuffle_input.m, mm_ones_16.m, > > > > + mm_bytes.m, mm_type_quad_range.m, > > > > + flows.trans, &indicies3, &indicies4); > > > > + > > > > + /* Check for any matches. */ > > > > + acl_match_check_x4(0, ctx, parms, &flows, > > > > + &indicies1, &indicies2, mm_match_mask.m); > > > > + acl_match_check_x4(4, ctx, parms, &flows, > > > > + &indicies3, &indicies4, mm_match_mask.m); > > > > + } > > > > + > > > > + return 0; > > > > +} > > > > + > > > > +/* > > > > + * Execute trie traversal with 4 traversals in parallel > > > > + */ > > > > +static inline int > > > > +search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data, > > > > + uint32_t *results, int total_packets, uint32_t categories) > > > > +{ > > > > + int n; > > > > + struct acl_flow_data flows; > > > > + uint64_t index_array[MAX_SEARCHES_SSE4]; > > > > + struct completion cmplt[MAX_SEARCHES_SSE4]; > > > > + struct parms parms[MAX_SEARCHES_SSE4]; > > > > + xmm_t input, indicies1, indicies2; > > > > + > > > > + acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, > > > > + total_packets, categories, ctx->trans_table); > > > > + > > > > + for (n =3D 0; n < MAX_SEARCHES_SSE4; n++) { > > > > + cmplt[n].count =3D 0; > > > > + index_array[n] =3D acl_start_next_trie(&flows, parms, n, ctx); > > > > + } > > > > + > > > > + indicies1 =3D MM_LOADU((xmm_t *) &index_array[0]); > > > > + indicies2 =3D MM_LOADU((xmm_t *) &index_array[2]); > > > > + > > > > + /* Check for any matches. */ > > > > + acl_match_check_x4(0, ctx, parms, &flows, > > > > + &indicies1, &indicies2, mm_match_mask.m); > > > > + > > > > + while (flows.started > 0) { > > > > + > > > > + /* Gather 4 bytes of input data for each stream. 
*/ > > > > + input =3D MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0= ); > > > > + input =3D MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1); > > > > + input =3D MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2); > > > > + input =3D MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3); > > > > + > > > > + /* Process the 4 bytes of input on each stream. */ > > > > + input =3D transition4(mm_index_mask.m, input, > > > > + mm_shuffle_input.m, mm_ones_16.m, > > > > + mm_bytes.m, mm_type_quad_range.m, > > > > + flows.trans, &indicies1, &indicies2); > > > > + > > > > + input =3D transition4(mm_index_mask.m, input, > > > > + mm_shuffle_input.m, mm_ones_16.m, > > > > + mm_bytes.m, mm_type_quad_range.m, > > > > + flows.trans, &indicies1, &indicies2); > > > > + > > > > + input =3D transition4(mm_index_mask.m, input, > > > > + mm_shuffle_input.m, mm_ones_16.m, > > > > + mm_bytes.m, mm_type_quad_range.m, > > > > + flows.trans, &indicies1, &indicies2); > > > > + > > > > + input =3D transition4(mm_index_mask.m, input, > > > > + mm_shuffle_input.m, mm_ones_16.m, > > > > + mm_bytes.m, mm_type_quad_range.m, > > > > + flows.trans, &indicies1, &indicies2); > > > > + > > > > + /* Check for any matches. */ > > > > + acl_match_check_x4(0, ctx, parms, &flows, > > > > + &indicies1, &indicies2, mm_match_mask.m); > > > > + } > > > > + > > > > + return 0; > > > > +} > > > > + > > > > +static inline xmm_t > > > > +transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_inpu= t, > > > > + xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range, > > > > + const uint64_t *trans, xmm_t *indicies1) > > > > +{ > > > > + uint64_t t; > > > > + xmm_t addr, indicies2; > > > > + > > > > + indicies2 =3D MM_XOR(ones_16, ones_16); > > > > + > > > > + addr =3D acl_calc_addr(index_mask, next_input, shuffle_input, one= s_16, > > > > + bytes, type_quad_range, indicies1, &indicies2); > > > > + > > > > + /* Gather 64 bit transitions and pack 2 per register. */ > > > > + > > > > + t =3D trans[MM_CVT32(addr)]; > > > > + > > > > + /* get slot 1 */ > > > > + addr =3D MM_SHUFFLE32(addr, SHUFFLE32_SLOT1); > > > > + *indicies1 =3D MM_SET64(trans[MM_CVT32(addr)], t); > > > > + > > > > + return MM_SRL32(next_input, 8); > > > > +} > > > > + > > > > +/* > > > > + * Execute trie traversal with 2 traversals in parallel. > > > > + */ > > > > +static inline int > > > > +search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data, > > > > + uint32_t *results, uint32_t total_packets, uint32_t categories) > > > > +{ > > > > + int n; > > > > + struct acl_flow_data flows; > > > > + uint64_t index_array[MAX_SEARCHES_SSE2]; > > > > + struct completion cmplt[MAX_SEARCHES_SSE2]; > > > > + struct parms parms[MAX_SEARCHES_SSE2]; > > > > + xmm_t input, indicies; > > > > + > > > > + acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, > > > > + total_packets, categories, ctx->trans_table); > > > > + > > > > + for (n =3D 0; n < MAX_SEARCHES_SSE2; n++) { > > > > + cmplt[n].count =3D 0; > > > > + index_array[n] =3D acl_start_next_trie(&flows, parms, n, ctx); > > > > + } > > > > + > > > > + indicies =3D MM_LOADU((xmm_t *) &index_array[0]); > > > > + > > > > + /* Check for any matches. */ > > > > + acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mas= k64.m); > > > > + > > > > + while (flows.started > 0) { > > > > + > > > > + /* Gather 4 bytes of input data for each stream. 
*/
> > > > +		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> > > > +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> > > > +
> > > > +		/* Process the 4 bytes of input on each stream. */
> > > > +
> > > > +		input = transition2(mm_index_mask64.m, input,
> > > > +			mm_shuffle_input64.m, mm_ones_16.m,
> > > > +			mm_bytes64.m, mm_type_quad_range64.m,
> > > > +			flows.trans, &indicies);
> > > > +
> > > > +		input = transition2(mm_index_mask64.m, input,
> > > > +			mm_shuffle_input64.m, mm_ones_16.m,
> > > > +			mm_bytes64.m, mm_type_quad_range64.m,
> > > > +			flows.trans, &indicies);
> > > > +
> > > > +		input = transition2(mm_index_mask64.m, input,
> > > > +			mm_shuffle_input64.m, mm_ones_16.m,
> > > > +			mm_bytes64.m, mm_type_quad_range64.m,
> > > > +			flows.trans, &indicies);
> > > > +
> > > > +		input = transition2(mm_index_mask64.m, input,
> > > > +			mm_shuffle_input64.m, mm_ones_16.m,
> > > > +			mm_bytes64.m, mm_type_quad_range64.m,
> > > > +			flows.trans, &indicies);
> > > > +
> > > > +		/* Check for any matches. */
> > > > +		acl_match_check_x2(0, ctx, parms, &flows, &indicies,
> > > > +			mm_match_mask64.m);
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +int
> > > > +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > +	uint32_t *results, uint32_t num, uint32_t categories)
> > > > +{
> > > > +	if (categories != 1 &&
> > > > +		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> > > > +		return -EINVAL;
> > > > +
> > > > +	if (likely(num >= MAX_SEARCHES_SSE8))
> > > > +		return search_sse_8(ctx, data, results, num, categories);
> > > > +	else if (num >= MAX_SEARCHES_SSE4)
> > > > +		return search_sse_4(ctx, data, results, num, categories);
> > > > +	else
> > > > +		return search_sse_2(ctx, data, results, num, categories);
> > > > +}
> > > > diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
> > > > index 7c288bd..0cde07e 100644
> > > > --- a/lib/librte_acl/rte_acl.c
> > > > +++ b/lib/librte_acl/rte_acl.c
> > > > @@ -38,6 +38,21 @@
> > > >
> > > >  TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
> > > >
> > > > +/* by default, use the always available scalar code path. */
> > > > +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
> > > > +
> > > make this static, the outside world shouldn't need to see it.
> >
> > As I said above, I think it is more plausible to keep it globally visible.
> >
> > >
> > > > +void __attribute__((constructor(INT16_MAX)))
> > > > +rte_acl_select_classify(void)
> > > Make it static.  The outside world doesn't need to call this.
> >
> > See above; I would like the user to have the ability to call it manually if needed.
> >
> > >
> > > > +{
> > > > +	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) {
> > > > +		/* SSE version requires SSE4.1 */
> > > > +		rte_acl_default_classify = rte_acl_classify_sse;
> > > > +	} else {
> > > > +		/* reset to scalar version. */
> > > > +		rte_acl_default_classify = rte_acl_classify_scalar;
> > > Don't need the else clause here, the static initializer has you covered.
> >
> > I think we'd better keep it like that, in case the user calls it manually.
> > We always reset rte_acl_default_classify to the 'best proper' value.
> >
> > > > +	}
> > > > +}
> > > > +
> > > > +
> > > > +/**
> > > > + * Invokes default rte_acl_classify function.
> > > > + */
> > > > +extern rte_acl_classify_t rte_acl_default_classify;
> > > > +
> > > Doesn't need to be extern.
> > > > +#define rte_acl_classify(ctx, data, results, num, categories)	\
> > > > +	(*rte_acl_default_classify)(ctx, data, results, num, categories)
> > > > +
> > > Not sure why you need this either.  The rte_acl_classify_t should be enough, no?
> >
> > We preserve the existing rte_acl_classify() API, so users don't need to modify their code.
> >
> This would be a great candidate for versioning (Bruce and I have been discussing
> this).
>
> Neil
>
> >
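(For reference, a minimal usage sketch -- illustrative only. The helper names
app_setup_classify() and app_lookup() are made up here, but rte_acl_default_classify,
rte_acl_classify_scalar and the rte_acl_classify() macro are the symbols this patch
exposes. It shows how existing call sites stay unchanged while an application can
still force the scalar path, as app/test-acl does with its --scalar option.)

#include <stdint.h>
#include <rte_acl.h>

/* Illustrative only: force the portable path before any lookups, e.g. when a
 * bug is suspected in the SSE implementation. Otherwise the library constructor
 * rte_acl_select_classify() has already picked the best supported path. */
static void
app_setup_classify(int force_scalar)
{
	if (force_scalar)
		rte_acl_default_classify = rte_acl_classify_scalar;
}

/* Existing call sites stay as-is: rte_acl_classify() is now a macro that
 * dispatches through rte_acl_default_classify. */
static int
app_lookup(const struct rte_acl_ctx *acx, const uint8_t **data,
	uint32_t *results, uint32_t num)
{
	return rte_acl_classify(acx, data, results, num, 1);
}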