* Re: [dpdk-dev] [PATCH 0/4] Add DSO symbol versioning to support backwards compatibility
  2014-09-26 22:02  4%           ` Stephen Hemminger
@ 2014-09-27  2:22  5%             ` Neil Horman
  0 siblings, 0 replies; 86+ results
From: Neil Horman @ 2014-09-27  2:22 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev
On Fri, Sep 26, 2014 at 03:02:55PM -0700, Stephen Hemminger wrote:
> On Fri, 26 Sep 2014 10:45:49 -0400
> Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > On Fri, Sep 26, 2014 at 12:41:33PM +0200, Thomas Monjalon wrote:
> > > Hi Neil,
> > > 
> > > 2014-09-24 14:19, Neil Horman:
> > > > Ping Thomas. I know you're busy, but I would like this to not fall off anyone's
> > > > radar.  You alluded to concerns regarding what, for lack of a better term,
> > > > ABI/API lock-in.  I had asked you to enumerate/elaborate on specifics, but never
> > > > heard back.  Are there further specifics you wish to discuss, or are you
> > > > satisfied with the above answers?
> > > 
> > > Sorry for not being very reactive on this thread.
> > > All this discussion is very interesting but it's really not the proper
> > > time to apply it. As you said, it requires an extra effort. I'm not saying
> > > it will never be integrated. I'm just saying that we cannot change
> > > everything at the same time.
> > > 
> > > Let me sum up the situation. This community project has been very active
> > > for a few months now. First, we learnt how to make some releases together
> > > and we are improving the process to be able to deliver a new major release
> > > every 4 months while having some good quality process.
> > > But these releases are still not complete because documentation is not
> > > integrated yet. Developers should then have a role in documentation updates.
> > > We also need to integrate and learn how to use more tools to be more
> > > efficient and improve quality.
> > > 
> > > So the question is "when should we care about API compatibility"?
> > > And the answer is: ASAP, but not now. I feel next year is a better target.
> > > Because the most important priority is to move together at a pace which
> > > allows most of us to stay in the race.
> > > 
> > 
> > 
> > I'm sorry Thomas, I don't accept this.  I asked you for details as to your
> > concerns regarding this patch series, and you've provided only more vague comments.
> > I need details to address them.
> > 
> > You say it requires extra effort, you're right it does.  Any feature that you
> > integrate requires some additional effort.  How is this patch any different
> > from adding the acl library or any other new API?  Everything requires
> > maintenance, that's how software works.  What specifically about this patch series
> > makes the effort insurmountable to you?
> > 
> > You say you're improving your process.  Great, this patch aids in that process
> > by ensuring backwards compatibility for a period of time.  Given that the API
> > and ABI can still evolve within this framework, as I've described, how is this
> > patch series not a significant step forward toward your goal of a quality process?
> > 
> > You say documentation isn't integrated.  So, what does getting documentation
> > integrated have to do with this patch set, or any other?  I don't see you
> > holding any other patches based on documentation.  Again, nothing in this series
> > prevents evolution of the API or ABI.  If your hope is to wait until
> > everything is perfect, then apply some control to the public facing API, and get
> > it all documented, none of those things will ever happen, I promise you.
> > 
> > You say you also need to learn to use more tools to be more efficient and
> > improve quality.  Great!  That's exactly what this is. If we mandate even a
> > short-term commitment to ABI stability (a single release's worth of time), we will
> > quickly identify what API's change quickly and where we need to be cautious with
> > our API design.  If you just assume that developers will get better of their own
> > volition, it will never happen.
> > 
> > You say this should go in next year, but not now.  When exactly?  What event do
> > you foresee occurring in the next 12-18 months that will change everything such
> > that we can start supporting an ABI for more than just a few weeks at the head of
> > the tree?  
> > 
> > To this end, I just did a quick search through the git history for dpdk to look
> > at the histories of all the header files that are exposed via the makefile
> > SYMLINK command (given that it provides a list of header files that
> > applications can include, and embodies all the function symbols and data types
> > applications have access to).
> > 
> > There are 179 total commits in that list.
> > Of those, a bit of spot checking suggests that about 10-15% of them actually
> > change ABI, and many of those came from Bruce's rework of the mbuf structure.
> > That's about 17-20 instances over the last 2 years where an ABI update would have
> > been needed.  That seems pretty reasonable to me.  Where exactly is your concern
> > here?
> > 
> > Neil
> 
> Isn't ABI stability a distro responsibility, not a project responsibility?
> I have lots more API/ABI changes, but I've been too busy trying to release a real
> product using DPDK to upstream them all.
> 
No.  How well would glibc or any major library work without ABI stability?
It's definitely a distro's responsibility not to break ABI with backports from
upstream within a major release, but it's upstream's responsibility to maintain
ABI for some period of time to prevent immediate distro breakage with an update.
Often libraries provide this by simply taking lots of care in their ABI
design, but if ABI flexibility is needed, providing some level of backwards
compatibility must fall on the upstream project.
Neil
> 
> 
* Re: [dpdk-dev] [PATCH 0/4] Add DSO symbol versioning to support backwards compatibility
  2014-09-26 14:45  5%         ` Neil Horman
@ 2014-09-26 22:02  4%           ` Stephen Hemminger
  2014-09-27  2:22  5%             ` Neil Horman
  0 siblings, 1 reply; 86+ results
From: Stephen Hemminger @ 2014-09-26 22:02 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev
On Fri, 26 Sep 2014 10:45:49 -0400
Neil Horman <nhorman@tuxdriver.com> wrote:
> On Fri, Sep 26, 2014 at 12:41:33PM +0200, Thomas Monjalon wrote:
> > Hi Neil,
> > 
> > 2014-09-24 14:19, Neil Horman:
> > > Ping Thomas. I know you're busy, but I would like this to not fall off anyone's
> > > radar.  You alluded to concerns regarding what, for lack of a better term,
> > > ABI/API lock-in.  I had asked you to enumerate/elaborate on specifics, but never
> > > heard back.  Are there further specifics you wish to discuss, or are you
> > > satisfied with the above answers?
> > 
> > Sorry for not being very reactive on this thread.
> > All this discussion is very interesting but it's really not the proper
> > time to apply it. As you said, it requires an extra effort. I'm not saying
> > it will never be integrated. I'm just saying that we cannot change
> > everything at the same time.
> > 
> > Let me sum up the situation. This community project has been very active
> > for a few months now. First, we learnt how to make some releases together
> > and we are improving the process to be able to deliver a new major release
> > every 4 months while having some good quality process.
> > But these releases are still not complete because documentation is not
> > integrated yet. Developers should then have a role in documentation updates.
> > We also need to integrate and learn how to use more tools to be more
> > efficient and improve quality.
> > 
> > So the question is "when should we care about API compatibility"?
> > And the answer is: ASAP, but not now. I feel next year is a better target.
> > Because the most important priority is to move together at a pace which
> > allows most of us to stay in the race.
> > 
> 
> 
> I'm sorry Thomas, I don't accept this.  I asked you for details as to your
> concerns regarding this patch series, and you've provided only more vague comments.
> I need details to address them.
> 
> You say it requires extra effort, you're right it does.  Any feature that you
> integrate requires some additional effort.  How is this patch any different
> from adding the acl library or any other new API?  Everything requires
> maintenance, that's how software works.  What specifically about this patch series
> makes the effort insurmountable to you?
> 
> You say you're improving your process.  Great, this patch aids in that process
> by ensuring backwards compatibility for a period of time.  Given that the API
> and ABI can still evolve within this framework, as I've described, how is this
> patch series not a significant step forward toward your goal of a quality process?
> 
> You say documentation isn't integrated.  So, what does getting documentation
> integrated have to do with this patch set, or any other?  I don't see you
> holding any other patches based on documentation.  Again, nothing in this series
> prevents evolution of the API or ABI.  If your hope is to wait until
> everything is perfect, then apply some control to the public facing API, and get
> it all documented, none of those things will ever happen, I promise you.
> 
> You say you also need to learn to use more tools to be more efficient and
> improve quality.  Great!  That's exactly what this is. If we mandate even a
> short-term commitment to ABI stability (a single release's worth of time), we will
> quickly identify what API's change quickly and where we need to be cautious with
> our API design.  If you just assume that developers will get better of their own
> volition, it will never happen.
> 
> You say this should go in next year, but not now.  When exactly?  What event do
> you foresee occurring in the next 12-18 months that will change everything such
> that we can start supporting an ABI for more than just a few weeks at the head of
> the tree?  
> 
> To this end, I just did a quick search through the git history for dpdk to look
> at the histories of all the header files that are exposed via the makefile
> SYMLINK command (given that it provides a list of header files that
> applications can include, and embodies all the function symbols and data types
> applications have access to).
> 
> There are 179 total commits in that list.
> Of those, a bit of spot checking suggests that about 10-15% of them actually
> change ABI, and many of those came from Bruce's rework of the mbuf structure.
> That's about 17-20 instances over the last 2 years where an ABI update would have
> been needed.  That seems pretty reasonable to me.  Where exactly is your concern
> here?
> 
> Neil
Isn't ABI stability a distro responsibility, not a project responsibility?
I have lots more API/ABI changes, but I've been too busy trying to release a real
product using DPDK to upstream them all.
* Re: [dpdk-dev] [PATCH 1/4 v2] compat: Add infrastructure to support symbol versioning
  2014-09-26 16:22  0%           ` Neil Horman
@ 2014-09-26 19:19  0%             ` Neil Horman
  0 siblings, 0 replies; 86+ results
From: Neil Horman @ 2014-09-26 19:19 UTC (permalink / raw)
  To: Sergio Gonzalez Monroy; +Cc: dev
On Fri, Sep 26, 2014 at 12:22:56PM -0400, Neil Horman wrote:
> On Fri, Sep 26, 2014 at 04:33:04PM +0100, Sergio Gonzalez Monroy wrote:
> > On Fri, Sep 26, 2014 at 11:16:30AM -0400, Neil Horman wrote:
> > > On Fri, Sep 26, 2014 at 03:16:08PM +0100, Sergio Gonzalez Monroy wrote:
> > > > On Thu, Sep 25, 2014 at 02:52:32PM -0400, Neil Horman wrote:
> > > > > Add initial pass header files to support symbol versioning.
> > > > > 
> > > > > ---
> > > > > Change notes
> > > > > v2)
> > > > > * Fixed ifdef in rte_compat.h to test for RTE_BUILD_SHARED_LIB instead of the
> > > > > non-existent RTE_SYMBOL_VERSIONING
> > > > > 
> > > > > * Fixed VERSION_SYMBOL macro to add the needed extra @ to make versioning work
> > > > > properly
> > > > > 
> > > > > * Improved/Clarified documentation
> > > > > 
> > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > > > CC: Thomas Monjalon <thomas.monjalon@6wind.com>
> > > > > CC: "Richardson, Bruce" <bruce.richardson@intel.com>
> > > > > CC: "Gonzalez Monroy, Sergio" <sergio.gonzalez.monroy@intel.com>
> > > > > ---
> > > > >  lib/Makefile                   |  1 +
> > > > >  lib/librte_compat/Makefile     | 38 ++++++++++++++++++
> > > > >  lib/librte_compat/rte_compat.h | 87 ++++++++++++++++++++++++++++++++++++++++++
> > > > >  mk/rte.lib.mk                  |  6 +++
> > > > >  4 files changed, 132 insertions(+)
> > > > >  create mode 100644 lib/librte_compat/Makefile
> > > > >  create mode 100644 lib/librte_compat/rte_compat.h
> > > > > 
> > > > > diff --git a/lib/Makefile b/lib/Makefile
> > > > > index 10c5bb3..a85b55b 100644
> > > > > --- a/lib/Makefile
> > > > > +++ b/lib/Makefile
> > > > > @@ -32,6 +32,7 @@
> > > > >  include $(RTE_SDK)/mk/rte.vars.mk
> > > > >  
> > > > >  DIRS-$(CONFIG_RTE_LIBC) += libc
> > > > > +DIRS-y += librte_compat
> > > > >  DIRS-$(CONFIG_RTE_LIBRTE_EAL) += librte_eal
> > > > >  DIRS-$(CONFIG_RTE_LIBRTE_MALLOC) += librte_malloc
> > > > >  DIRS-$(CONFIG_RTE_LIBRTE_RING) += librte_ring
> > > > > diff --git a/lib/librte_compat/Makefile b/lib/librte_compat/Makefile
> > > > > new file mode 100644
> > > > > index 0000000..3415c7b
> > > > > --- /dev/null
> > > > > +++ b/lib/librte_compat/Makefile
> > > > > @@ -0,0 +1,38 @@
> > > > > +#   BSD LICENSE
> > > > > +#
> > > > > +#   Copyright(c) 2010-2014 Neil Horman <nhorman@tuxdriver.com>
> > > > > +#   All rights reserved.
> > > > > +#
> > > > > +#   Redistribution and use in source and binary forms, with or without
> > > > > +#   modification, are permitted provided that the following conditions
> > > > > +#   are met:
> > > > > +#
> > > > > +#     * Redistributions of source code must retain the above copyright
> > > > > +#       notice, this list of conditions and the following disclaimer.
> > > > > +#     * Redistributions in binary form must reproduce the above copyright
> > > > > +#       notice, this list of conditions and the following disclaimer in
> > > > > +#       the documentation and/or other materials provided with the
> > > > > +#       distribution.
> > > > > +#     * Neither the name of Intel Corporation nor the names of its
> > > > > +#       contributors may be used to endorse or promote products derived
> > > > > +#       from this software without specific prior written permission.
> > > > > +#
> > > > > +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > > +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > > +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > > +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > > +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > > +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > > +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > > +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > > +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > > +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > > +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > > +
> > > > > +include $(RTE_SDK)/mk/rte.vars.mk
> > > > > +
> > > > > +
> > > > > +# install includes
> > > > > +SYMLINK-y-include := rte_compat.h
> > > > > +
> > > > > +include $(RTE_SDK)/mk/rte.lib.mk
> > > > > diff --git a/lib/librte_compat/rte_compat.h b/lib/librte_compat/rte_compat.h
> > > > > new file mode 100644
> > > > > index 0000000..cff9aea
> > > > > --- /dev/null
> > > > > +++ b/lib/librte_compat/rte_compat.h
> > > > > @@ -0,0 +1,87 @@
> > > > > +/*-
> > > > > + *   BSD LICENSE
> > > > > + *
> > > > > + *   Copyright(c) 2010-2014 Neil Horman <nhorman@tuxdriver.com>.
> > > > > + *   All rights reserved.
> > > > > + *
> > > > > + *   Redistribution and use in source and binary forms, with or without
> > > > > + *   modification, are permitted provided that the following conditions
> > > > > + *   are met:
> > > > > + *
> > > > > + *     * Redistributions of source code must retain the above copyright
> > > > > + *       notice, this list of conditions and the following disclaimer.
> > > > > + *     * Redistributions in binary form must reproduce the above copyright
> > > > > + *       notice, this list of conditions and the following disclaimer in
> > > > > + *       the documentation and/or other materials provided with the
> > > > > + *       distribution.
> > > > > + *     * Neither the name of Intel Corporation nor the names of its
> > > > > + *       contributors may be used to endorse or promote products derived
> > > > > + *       from this software without specific prior written permission.
> > > > > + *
> > > > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > > + */
> > > > > +
> > > > > +#ifndef _RTE_COMPAT_H_
> > > > > +#define _RTE_COMPAT_H_
> > > > > +
> > > > > +/*
> > > > > + * This is just a stringification macro for use below.
> > > > > + */
> > > > > +#define SA(x) #x
> > > > > +
> > > > > +#ifdef RTE_BUILD_SHARED_LIB
> > > > > +
> > > > > +/*
> > > > > + * Provides backwards compatibility when updating exported functions.
> > > > > + * When a symbol is exported from a library to provide an API, it also provides a
> > > > > + * calling convention (ABI) that is embodied in its name, return type,
> > > > > + * arguments, etc.  On occasion that function may need to change to accommodate
> > > > > + * new functionality, behavior, etc.  When that occurs, it is desirable to
> > > > > + * allow for backwards compatibility for a time with older binaries that are
> > > > > + * dynamically linked to the DPDK.  To support that, the __vsym and
> > > > > + * VERSION_SYMBOL macros are created.  They, in conjunction with the
> > > > > + * <library>_version.map file for a given library allow for multiple versions of
> > > > > + * a symbol to exist in a shared library so that older binaries need not be
> > > > > + * immediately recompiled. Their use is outlined in the following example:
> > > > > + * Assumptions: DPDK 1.(X) contains a function int foo(char *string)
> > > > > + *              DPDK 1.(X+1) needs to change foo to be int foo(int index)
> > > > > + *
> > > > > + * To accomplish this:
> > > > > + * 1) Edit lib/<library>/library_version.map to add a DPDK_1.(X+1) node, in which
> > > > > + * foo is exported as a global symbol.  Note that foo must be removed from the
> > > > > + * DPDK_1.(X) node, or you will see multiple symbol definitions.
> > > > > + *
> > > > 
> > > > By removing the symbol from the previous node in the version map, you make
> > > > it local instead of global and applications linked against DPDK 1.8 will fail
> > > > with the new library.
> > > > 
> > > It sounds like you just did the remove part, and not the add part.  What does
> > > your new version map file look like?
> > > 
> > > Neil
> > > 
> > 
> > $ cat lib/librte_acl/rte_acl_version.map
> > DPDK_1.8 {
> >     global:
> >     rte_acl_find_existing;
> >     rte_acl_free;
> >     rte_acl_add_rules;
> >     rte_acl_reset_rules;
> >     rte_acl_build;
> >     rte_acl_reset;
> >     rte_acl_classify;
> >     rte_acl_dump;
> >     rte_acl_list_dump;
> >     rte_acl_ipv4vlan_add_rules;
> >     rte_acl_ipv4vlan_build;
> >     rte_acl_classify_scalar;
> >     rte_acl_classify_alg;
> >     rte_acl_set_ctx_classify;
> > 
> >     local: *;
> > };
> > 
> > DPDK_1.9 {
> >     global:
> >     rte_acl_create;
> > } DPDK_1.8;
> > 
> > 
> > Anyway, if the DPDK_1.9 node was not defined, the DSO would not have the symbol:
> > rte_acl_create@@DPDK_1.9
> > 
> > Sergio
> > 
> > > > Following the steps you describe, if we create a new version of the function
> > > > rte_acl_create, we would end up with the following DSO:
> > > > 
> > > > $ readelf -s x86_64-native-linuxapp-gcc/lib/librte_acl.so | grep "create\|\.symtab\|\.dynsym"
> > > > Symbol table '.dynsym' contains 42 entries:
> > > >     28: 0000000000001990   627 FUNC    GLOBAL DEFAULT   12 rte_acl_create@@DPDK_1.9
> > > > Symbol table '.symtab' contains 147 entries:
> > > >     94: 0000000000001960    36 FUNC    LOCAL  DEFAULT   12 rte_acl_create_v18
> > > >    105: 0000000000001960    36 FUNC    LOCAL  DEFAULT   12 rte_acl_create@@DPDK_1.8
> > > >    138: 0000000000001990   627 FUNC    GLOBAL DEFAULT   12 rte_acl_create
> > > > 
> > > > You can check that applications linked with the old lib will fail to run.
> > > > Note that to easily check this you should define the environment variable
> > > > LD_BIND_NOW to resolve all symbols at program startup (man ld.so).
> > > > 
> > > > Sergio
> > > > 
> Hm, that's odd.  Using the same changes in my build here, I get both an exported
> global rte_acl_create@@DPDK_1.8 and an @@DPDK_1.9 symbol.  Let me take a closer
> look at it here once I get through the rest of my email.
> Neil
> 
> > > > > + * 2) Rename the existing function int foo(char *string) to
> > > > > + * 	int __vsym foo_v18(char *string)
> > > > > + *
> > > > > + * 3) Add this macro immediately below the function
> > > > > + * 	VERSION_SYMBOL(foo, _v18, 1.8);
> > > > > + *
> > > > > + */
> > > > > +#define VERSION_SYMBOL(b, e, v) __asm__(".symver " SA(b) SA(e) ", "SA(b)"@@DPDK_"SA(v))
> > > > > +#define __vsym __attribute__((used))
> > > > > +
> > > > > +#else
> > > > > +/*
> > > > > + * No symbol versioning in use
> > > > > + */
> > > > > +#define VERSION_SYMBOL(b, e, v)
> > > > > +#define __vsym
> > > > > +
> > > > > +/*
> > > > > + * RTE_BUILD_SHARED_LIB
> > > > > + */
> > > > > +#endif
> > > > > +
> > > > > +
> > > > > +#endif /* _RTE_COMPAT_H_ */
> > > > > diff --git a/mk/rte.lib.mk b/mk/rte.lib.mk
> > > > > index f458258..82ac309 100644
> > > > > --- a/mk/rte.lib.mk
> > > > > +++ b/mk/rte.lib.mk
> > > > > @@ -40,8 +40,12 @@ VPATH += $(SRCDIR)
> > > > >  
> > > > >  ifeq ($(RTE_BUILD_SHARED_LIB),y)
> > > > >  LIB := $(patsubst %.a,%.so,$(LIB))
> > > > > +
> > > > > +CPU_LDFLAGS += --version-script=$(EXPORT_MAP)
> > > > > +
> > > > >  endif
> > > > >  
> > > > > +
> > > > >  _BUILD = $(LIB)
> > > > >  _INSTALL = $(INSTALL-FILES-y) $(SYMLINK-FILES-y) $(RTE_OUTPUT)/lib/$(LIB)
> > > > >  _CLEAN = doclean
> > > > > @@ -160,7 +164,9 @@ endif
> > > > >  $(RTE_OUTPUT)/lib/$(LIB): $(LIB)
> > > > >  	@echo "  INSTALL-LIB $(LIB)"
> > > > >  	@[ -d $(RTE_OUTPUT)/lib ] || mkdir -p $(RTE_OUTPUT)/lib
> > > > > +ifneq ($(LIB),)
> > > > >  	$(Q)cp -f $(LIB) $(RTE_OUTPUT)/lib
> > > > > +endif
> > > > >  
> > > > >  #
> > > > >  # Clean all generated files
> > > > > -- 
> > > > > 1.9.3
> > > > > 
> > > > 
> > 
> 
Well, I have to apologize, Sergio.  Apparently I misread something in the guide
for symbol versioning, and this isn't in fact working.  It appeared to be working
for me because something was messed up in my tree and I wasn't relinking when I
updated the map file.  So self-NAK on this series for now; I'll repost it
shortly.  The good news is I have a bit of sample code working properly, which
I've verified, and I should have a new version of this series (which should by
and large look the same) ready early next week.
Best
Neil
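To make the workflow described in the rte_compat.h comment concrete, the
intended usage would look roughly like the sketch below.  foo and foo_v18 are
the hypothetical names from the patch's own documentation, the function bodies
are placeholders, and (per the self-NAK above) the v2 macros did not yet bind
the old version correctly, so this shows the intent rather than working code:

#include <rte_compat.h>

/* DPDK 1.8 ABI: keep the old implementation, renamed, so that binaries
 * linked against 1.8 can still resolve the versioned foo symbol. */
int __vsym
foo_v18(char *string)
{
        /* ... old behavior ... */
        (void)string;
        return 0;
}
VERSION_SYMBOL(foo, _v18, 1.8);

/* DPDK 1.9 ABI: the new signature, exported through the DPDK_1.9 node
 * of the library's version map as the symbol new links bind to. */
int
foo(int index)
{
        /* ... new behavior ... */
        (void)index;
        return 0;
}

The shared object is then linked with the library's map file via the
--version-script flag that this patch adds to mk/rte.lib.mk.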
* Re: [dpdk-dev] [PATCH 1/4 v2] compat: Add infrastructure to support symbol versioning
  2014-09-26 15:33  0%         ` Sergio Gonzalez Monroy
@ 2014-09-26 16:22  0%           ` Neil Horman
  2014-09-26 19:19  0%             ` Neil Horman
  0 siblings, 1 reply; 86+ results
From: Neil Horman @ 2014-09-26 16:22 UTC (permalink / raw)
  To: Sergio Gonzalez Monroy; +Cc: dev
On Fri, Sep 26, 2014 at 04:33:04PM +0100, Sergio Gonzalez Monroy wrote:
> On Fri, Sep 26, 2014 at 11:16:30AM -0400, Neil Horman wrote:
> > On Fri, Sep 26, 2014 at 03:16:08PM +0100, Sergio Gonzalez Monroy wrote:
> > > On Thu, Sep 25, 2014 at 02:52:32PM -0400, Neil Horman wrote:
> > > > Add initial pass header files to support symbol versioning.
> > > > 
> > > > ---
> > > > Change notes
> > > > v2)
> > > > * Fixed ifdef in rte_compat.h to test for RTE_BUILD_SHARED_LIB instead of the
> > > > non-existent RTE_SYMBOL_VERSIONING
> > > > 
> > > > * Fixed VERSION_SYMBOL macro to add the needed extra @ to make versioning work
> > > > properly
> > > > 
> > > > * Improved/Clarified documentation
> > > > 
> > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > > CC: Thomas Monjalon <thomas.monjalon@6wind.com>
> > > > CC: "Richardson, Bruce" <bruce.richardson@intel.com>
> > > > CC: "Gonzalez Monroy, Sergio" <sergio.gonzalez.monroy@intel.com>
> > > > ---
> > > >  lib/Makefile                   |  1 +
> > > >  lib/librte_compat/Makefile     | 38 ++++++++++++++++++
> > > >  lib/librte_compat/rte_compat.h | 87 ++++++++++++++++++++++++++++++++++++++++++
> > > >  mk/rte.lib.mk                  |  6 +++
> > > >  4 files changed, 132 insertions(+)
> > > >  create mode 100644 lib/librte_compat/Makefile
> > > >  create mode 100644 lib/librte_compat/rte_compat.h
> > > > 
> > > > diff --git a/lib/Makefile b/lib/Makefile
> > > > index 10c5bb3..a85b55b 100644
> > > > --- a/lib/Makefile
> > > > +++ b/lib/Makefile
> > > > @@ -32,6 +32,7 @@
> > > >  include $(RTE_SDK)/mk/rte.vars.mk
> > > >  
> > > >  DIRS-$(CONFIG_RTE_LIBC) += libc
> > > > +DIRS-y += librte_compat
> > > >  DIRS-$(CONFIG_RTE_LIBRTE_EAL) += librte_eal
> > > >  DIRS-$(CONFIG_RTE_LIBRTE_MALLOC) += librte_malloc
> > > >  DIRS-$(CONFIG_RTE_LIBRTE_RING) += librte_ring
> > > > diff --git a/lib/librte_compat/Makefile b/lib/librte_compat/Makefile
> > > > new file mode 100644
> > > > index 0000000..3415c7b
> > > > --- /dev/null
> > > > +++ b/lib/librte_compat/Makefile
> > > > @@ -0,0 +1,38 @@
> > > > +#   BSD LICENSE
> > > > +#
> > > > +#   Copyright(c) 2010-2014 Neil Horman <nhorman@tuxdriver.com>
> > > > +#   All rights reserved.
> > > > +#
> > > > +#   Redistribution and use in source and binary forms, with or without
> > > > +#   modification, are permitted provided that the following conditions
> > > > +#   are met:
> > > > +#
> > > > +#     * Redistributions of source code must retain the above copyright
> > > > +#       notice, this list of conditions and the following disclaimer.
> > > > +#     * Redistributions in binary form must reproduce the above copyright
> > > > +#       notice, this list of conditions and the following disclaimer in
> > > > +#       the documentation and/or other materials provided with the
> > > > +#       distribution.
> > > > +#     * Neither the name of Intel Corporation nor the names of its
> > > > +#       contributors may be used to endorse or promote products derived
> > > > +#       from this software without specific prior written permission.
> > > > +#
> > > > +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > +
> > > > +include $(RTE_SDK)/mk/rte.vars.mk
> > > > +
> > > > +
> > > > +# install includes
> > > > +SYMLINK-y-include := rte_compat.h
> > > > +
> > > > +include $(RTE_SDK)/mk/rte.lib.mk
> > > > diff --git a/lib/librte_compat/rte_compat.h b/lib/librte_compat/rte_compat.h
> > > > new file mode 100644
> > > > index 0000000..cff9aea
> > > > --- /dev/null
> > > > +++ b/lib/librte_compat/rte_compat.h
> > > > @@ -0,0 +1,87 @@
> > > > +/*-
> > > > + *   BSD LICENSE
> > > > + *
> > > > + *   Copyright(c) 2010-2014 Neil Horman <nhorman@tuxdriver.com>.
> > > > + *   All rights reserved.
> > > > + *
> > > > + *   Redistribution and use in source and binary forms, with or without
> > > > + *   modification, are permitted provided that the following conditions
> > > > + *   are met:
> > > > + *
> > > > + *     * Redistributions of source code must retain the above copyright
> > > > + *       notice, this list of conditions and the following disclaimer.
> > > > + *     * Redistributions in binary form must reproduce the above copyright
> > > > + *       notice, this list of conditions and the following disclaimer in
> > > > + *       the documentation and/or other materials provided with the
> > > > + *       distribution.
> > > > + *     * Neither the name of Intel Corporation nor the names of its
> > > > + *       contributors may be used to endorse or promote products derived
> > > > + *       from this software without specific prior written permission.
> > > > + *
> > > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > + */
> > > > +
> > > > +#ifndef _RTE_COMPAT_H_
> > > > +#define _RTE_COMPAT_H_
> > > > +
> > > > +/*
> > > > + * This is just a stringification macro for use below.
> > > > + */
> > > > +#define SA(x) #x
> > > > +
> > > > +#ifdef RTE_BUILD_SHARED_LIB
> > > > +
> > > > +/*
> > > > + * Provides backwards compatibility when updating exported functions.
> > > > + * When a symbol is exported from a library to provide an API, it also provides a
> > > > + * calling convention (ABI) that is embodied in its name, return type,
> > > > + * arguments, etc.  On occasion that function may need to change to accommodate
> > > > + * new functionality, behavior, etc.  When that occurs, it is desirable to
> > > > + * allow for backwards compatibility for a time with older binaries that are
> > > > + * dynamically linked to the DPDK.  To support that, the __vsym and
> > > > + * VERSION_SYMBOL macros are created.  They, in conjunction with the
> > > > + * <library>_version.map file for a given library allow for multiple versions of
> > > > + * a symbol to exist in a shared library so that older binaries need not be
> > > > + * immediately recompiled. Their use is outlined in the following example:
> > > > + * Assumptions: DPDK 1.(X) contains a function int foo(char *string)
> > > > + *              DPDK 1.(X+1) needs to change foo to be int foo(int index)
> > > > + *
> > > > + * To accomplish this:
> > > > + * 1) Edit lib/<library>/library_version.map to add a DPDK_1.(X+1) node, in which
> > > > + * foo is exported as a global symbol.  Note that foo must be removed from the
> > > > + * DPDK_1.(X) node, or you will see multiple symbol definitions.
> > > > + *
> > > 
> > > By removing the symbol from the previous node in the version map, you make
> > > it local instead of global, and applications linked against DPDK 1.8 will fail
> > > with the new library.
> > > 
> > It sounds like you just did the remove part, and not the add part.  What does
> > your new version map file look like?
> > 
> > Neil
> > 
> 
> $ cat lib/librte_acl/rte_acl_version.map
> DPDK_1.8 {
>     global:
>     rte_acl_find_existing;
>     rte_acl_free;
>     rte_acl_add_rules;
>     rte_acl_reset_rules;
>     rte_acl_build;
>     rte_acl_reset;
>     rte_acl_classify;
>     rte_acl_dump;
>     rte_acl_list_dump;
>     rte_acl_ipv4vlan_add_rules;
>     rte_acl_ipv4vlan_build;
>     rte_acl_classify_scalar;
>     rte_acl_classify_alg;
>     rte_acl_set_ctx_classify;
> 
>     local: *;
> };
> 
> DPDK_1.9 {
>     global:
>     rte_acl_create;
> } DPDK_1.8;
> 
> 
> Anyway, if the DPDK_1.9 node was not defined, the DSO would not have the symbol:
> rte_acl_create@@DPDK_1.9
> 
> Sergio
> 
> > > Following the steps you describe, if we create a new version of the function
> > > rte_acl_create, we would end up with the following DSO:
> > > 
> > > $ readelf -s x86_64-native-linuxapp-gcc/lib/librte_acl.so | grep "create\|\.symtab\|\.dynsym"
> > > Symbol table '.dynsym' contains 42 entries:
> > >     28: 0000000000001990   627 FUNC    GLOBAL DEFAULT   12 rte_acl_create@@DPDK_1.9
> > > Symbol table '.symtab' contains 147 entries:
> > >     94: 0000000000001960    36 FUNC    LOCAL  DEFAULT   12 rte_acl_create_v18
> > >    105: 0000000000001960    36 FUNC    LOCAL  DEFAULT   12 rte_acl_create@@DPDK_1.8
> > >    138: 0000000000001990   627 FUNC    GLOBAL DEFAULT   12 rte_acl_create
> > > 
> > > You can check that applications linked with the old lib will fail to run.
> > > Note that to easily check this you should define the environment variable
> > > LD_BIND_NOW to resolve all symbols at program startup (man ld.so).
> > > 
> > > Sergio
> > > 
Hm, that's odd.  Using the same changes in my build here, I get both an exported
global rte_acl_create@@DPDK_1.8 and an @@DPDK_1.9 symbol.  Let me take a closer
look at it here once I get through the rest of my email.
Neil
> > > > + * 2) Rename the existing function int foo(char *string) to
> > > > + * 	int __vsym foo_v18(char *string)
> > > > + *
> > > > + * 3) Add this macro immediately below the function
> > > > + * 	VERSION_SYMBOL(foo, _v18, 1.8);
> > > > + *
> > > > + */
> > > > +#define VERSION_SYMBOL(b, e, v) __asm__(".symver " SA(b) SA(e) ", "SA(b)"@@DPDK_"SA(v))
> > > > +#define __vsym __attribute__((used))
> > > > +
> > > > +#else
> > > > +/*
> > > > + * No symbol versioning in use
> > > > + */
> > > > +#define VERSION_SYMBOL(b, e, v)
> > > > +#define __vsym
> > > > +
> > > > +/*
> > > > + * RTE_BUILD_SHARED_LIB
> > > > + */
> > > > +#endif
> > > > +
> > > > +
> > > > +#endif /* _RTE_COMPAT_H_ */
> > > > diff --git a/mk/rte.lib.mk b/mk/rte.lib.mk
> > > > index f458258..82ac309 100644
> > > > --- a/mk/rte.lib.mk
> > > > +++ b/mk/rte.lib.mk
> > > > @@ -40,8 +40,12 @@ VPATH += $(SRCDIR)
> > > >  
> > > >  ifeq ($(RTE_BUILD_SHARED_LIB),y)
> > > >  LIB := $(patsubst %.a,%.so,$(LIB))
> > > > +
> > > > +CPU_LDFLAGS += --version-script=$(EXPORT_MAP)
> > > > +
> > > >  endif
> > > >  
> > > > +
> > > >  _BUILD = $(LIB)
> > > >  _INSTALL = $(INSTALL-FILES-y) $(SYMLINK-FILES-y) $(RTE_OUTPUT)/lib/$(LIB)
> > > >  _CLEAN = doclean
> > > > @@ -160,7 +164,9 @@ endif
> > > >  $(RTE_OUTPUT)/lib/$(LIB): $(LIB)
> > > >  	@echo "  INSTALL-LIB $(LIB)"
> > > >  	@[ -d $(RTE_OUTPUT)/lib ] || mkdir -p $(RTE_OUTPUT)/lib
> > > > +ifneq ($(LIB),)
> > > >  	$(Q)cp -f $(LIB) $(RTE_OUTPUT)/lib
> > > > +endif
> > > >  
> > > >  #
> > > >  # Clean all generated files
> > > > -- 
> > > > 1.9.3
> > > > 
> > > 
> 
* Re: [dpdk-dev] [PATCH v2] Change alarm cancel function to thread-safe:
  2014-09-26 15:41  0%                   ` Ananyev, Konstantin
@ 2014-09-26 16:21  3%                     ` Neil Horman
  0 siblings, 0 replies; 86+ results
From: Neil Horman @ 2014-09-26 16:21 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev
On Fri, Sep 26, 2014 at 03:41:58PM +0000, Ananyev, Konstantin wrote:
> 
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Neil Horman
> > Sent: Friday, September 26, 2014 4:02 PM
> > To: Wodkowski, PawelX
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v2] Change alarm cancel function to thread-safe:
> > 
> > On Fri, Sep 26, 2014 at 02:01:05PM +0000, Wodkowski, PawelX wrote:
> > > > > > Maybe I don't see something obvious? :)
> > > >
> > > > I think you're missing the fact that your patch doesn't do what you assert above
> > > > either :)
> > >
> > > The issue is not in setting alarms but in canceling them. If you look closer at my patch you
> > > will see that it addresses this issue (look at the added *do { lock(); ....; unlock(); } while( )*
> > > part).
> > >
> > I get where the issue is, and I'm looking at your patch.  I see that you did
> > some locking there.  The issue I'm pointing out is that, if you call
> > rte_eal_alarm_cancel on an alarm callback, you will exit the alarm_cancel
> > function with, by definition, one alarm executing (the one you are currently
> > running).  Your patch works perfectly for the case where another thread calls
> > cancel, in that it waits until the executing alarm is complete, but it doesn't
> > work in the case where you are calling it from within the alarm callback.
> 
> Hm, and why do we need it from an alarm callback?
Because you might not know if you're in an alarm callback or not. Pawel
explained that the point of the patch was to ensure that alarms are canceled and
complete when you call rte_eal_alarm_cancel, and that's not always going to be
the case, even with this patch.
> After cb_func() is finished, the given alarm entry will be removed anyway.
> 
Yes, but that's true with or without this patch.
> > If your goal is to guarantee that all the matching alarms are cancelled and
> > complete, you haven't done that, because the recursive state is still unhandled.
> > 
> > > >
> > > > First, let's address rte_alarm_set.  There is no notion of "re-arming" in this
> > > > alarm implementation, because there's no ability to refer to a specific alarm
> > > > from the caller's perspective.  When you call rte_eal_alarm_set you get a new
> > > > alarm every time.  So I don't really see a race there.  It might not be exactly
> > > > the behavior you want, but it's not a race, because you're not modifying an
> > > > alarm in the middle of execution, you're just creating a new alarm, which is
> > > > safe.
> > >
> > > OK, it is safe, but this is not the case.
> > >
> > I don't know what you mean by this.  We agree it's safe, great.  But it is the
> > case as I've described it, you can see it from the implementation, every call to
> > rte_eal_alarm_set starts with a malloc of a new alarm structure.
> > 
> > > >
> > > > There is a race in what you describe above, insofar as it's possible that you
> > > > might call rte_eal_alarm_cancel and return without having canceled all the
> > > > matching alarms.  I don't see any clear documentation on what the behavior is
> > > > supposed to be, but if you want to ensure that all matching alarms are cancelled
> > > > or complete on return from rte_eal_alarm_cancel, that's perfectly fine (in Linux
> > > > API parlance, that's usually denoted as a cancel_sync operation).
> > >
> > > Again, look at the patch. I changed documentation to inform about this behavior.
> > >
> > 
> > This is the documentation included in the patch:
> > Change alarm cancel function to thread-safe.
> >         It eliminates a race between threads using rte_alarm_cancel and
> >         rte_alarm_set.
> > 
> > Neither have you completely described the race condition (though you now have
> > previously in this thread), nor have you completely addressed it (calling
> > rte_eal_alarm_cancel and rte_eal_alarm_set still behaves exactly as it did
> > previously with a 2nd thread).
> > 
> > > >
> > > > For that race condition, you're correct, my patch doesn't address it, I see that
> > > > now.  Though your patch doesn't either.  If you call rte_eal_alarm_cancel from
> > > > within a callback function, then, by definition, you can't wait on the
> > > > completion of the active alarm, because that's a deadlock.  It's a necessary
> > > > evil, I grant you, but it means that you can't be guaranteed the cancelled and
> > > > complete (cancel_sync) behavior that you want, at least not with the current
> > > > api.  If you want that behavior, you need to do one of two things:
> > >
> > > This patch does not break any API. It only removes undefined behavior.
> > >
> > I never said it did break ABI.  I said that to completely fix it you would have
> > to break ABI.  And it doesn't really remove undefined behavior, because you
> > still have the old behavior in the recursive case (which you may be ok with, I
> > don't know, but if you really want to address the behavior, you should address
> > this aspect of it).
> > 
> > > >
> > > > 1) Modify the api to allow callers to individually reference timer instances, so
> > > > that when cancelling, we can return an appropriate return code to indicate to
> > > > the caller that this alarm is in-progress.  That way you can guarantee the
> > > > caller that the specific alarm that you cancelled is either complete and cancelled
> > > > or currently executing.  Add an API to explicitly wait on a referenced alarm as
> > > > well.  This allows developers to know that, when executing an alarm callback, an
> > > > -ECURRENTLYEXECUTING return code is ok, because they are in the currently
> > > > executing context.
> > >
> > > This would break the API for sure.
> > Yes, it would.  Bruce Richardson just made a major ABI break with his mbuf
> > cleanup set.  If there was a time to change ABI here, now would be the time I
> > think.
> 
> Ok, too many words for me, to be honest :)
Yeah, it's getting a bit verbose :)
> Can I summarise:
> As I remember, the purpose of the patch was to fix the race condition inside the rte_alarm library.
> I believe that the patch provided by Michal & Pawel fixes the issues you discovered.
> If you think that is not the case, could you please provide a list of remaining issues?
> Excluding ones that you just don't like, and where you are not happy with the rte_alarm API in total?
Gladly.  As Pawel explained the race, it's possible that, after calling
rte_eal_alarm_cancel, an in-flight execution of an alarm callback may still be
running.  The problem with that, ostensibly, is that data being accessed by the
callback might then be accessed in parallel from another context, leading to
data corruption or some other problem. The issue I have with his patch is that
it doesn't completely close the race.  While it does close the race for the
condition in which thread B is running the alarm callback while thread A is
executing the cancel operation, it does not close the case in which a single
thread B is running the cancel operation, as the in-flight execution itself is
still active.  If such a cancellation occurs via an intermediary function
(i.e., one which is not aware that it is running inside an alarm callback)
which signals another thread to execute via some other method (IPC
communication, etc.), the same data corruption may occur, because the
cancelled-and-complete guarantee has been violated.
> 
> If you have any concerns about the mbuf reorg/expansion, it would probably be better to contact Bruce and express them,
> rather than using it as an argument for breaking any existing API without a really good reason behind it.
> 
No, no concerns at all, only pointing out that we've already broken ABI in this
release, which requires application writers to rebuild and adjust their
applications, so if we were going to adjust this API, now would be the time,
rather than in a future release, requiring multiple application rebuilds.
Neil
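For the record, the recursive case described above is easy to reproduce with
the existing interface.  A minimal sketch, assuming the standard rte_alarm.h
prototypes (the callback and its body are illustrative only):

#include <rte_alarm.h>

static void
periodic_cb(void *arg)
{
        /* Runs in the alarm thread.  Cancelling from here cannot wait for
         * the in-flight instance to complete -- the in-flight instance is
         * this very invocation, and waiting on it would deadlock.  The
         * cancel therefore returns while one matching alarm is, by
         * definition, still executing. */
        rte_eal_alarm_cancel(periodic_cb, arg);
}

static int
arm_alarm(void *state)
{
        /* 10 seconds, expressed in microseconds */
        return rte_eal_alarm_set(10 * 1000 * 1000, periodic_cb, state);
}

Any path that reaches rte_eal_alarm_cancel from inside a callback, even
indirectly through a helper that does not know it is in alarm context, gets
this weaker guarantee.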
* Re: [dpdk-dev] [PATCH v2] Change alarm cancel function to thread-safe:
  2014-09-26 15:01  5%                 ` Neil Horman
@ 2014-09-26 15:41  0%                   ` Ananyev, Konstantin
  2014-09-26 16:21  3%                     ` Neil Horman
  0 siblings, 1 reply; 86+ results
From: Ananyev, Konstantin @ 2014-09-26 15:41 UTC (permalink / raw)
  To: Neil Horman, Wodkowski, PawelX; +Cc: dev
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Neil Horman
> Sent: Friday, September 26, 2014 4:02 PM
> To: Wodkowski, PawelX
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2] Change alarm cancel function to thread-safe:
> 
> On Fri, Sep 26, 2014 at 02:01:05PM +0000, Wodkowski, PawelX wrote:
> > > > > Maybe I don't see something obvious? :)
> > >
> > > I think you're missing the fact that your patch doesn't do what you assert above
> > > either :)
> >
> > The issue is not in setting alarms but in canceling them. If you look closer at my patch you
> > will see that it addresses this issue (look at the added *do { lock(); ....; unlock(); } while( )*
> > part).
> >
> I get where the issue is, and I'm looking at your patch.  I see that you did
> some locking there.  The issue I'm pointing out is that, if you call
> rte_eal_alarm_cancel on an alarm callback, you will exit the alarm_cancel
> function with, by definition, one alarm executing (the one you are currently
> running).  Your patch works perfectly for the case where another thread calls
> cancel, in that it waits until the executing alarm is complete, but it doesn't
> work in the case where you are calling it from within the alarm callback.
Hm, and why do we need it from an alarm callback?
After cb_func() is finished, the given alarm entry will be removed anyway.
> If your goal is to guarantee that all the matching alarms are cancelled and
> complete, you haven't done that, because the recursive state is still unhandled.
> 
> > >
> > > First, let's address rte_alarm_set.  There is no notion of "re-arming" in this
> > > alarm implementation, because there's no ability to refer to a specific alarm
> > > from the caller's perspective.  When you call rte_eal_alarm_set you get a new
> > > alarm every time.  So I don't really see a race there.  It might not be exactly
> > > the behavior you want, but it's not a race, because you're not modifying an
> > > alarm in the middle of execution, you're just creating a new alarm, which is
> > > safe.
> >
> > OK, it is safe, but this is not the case.
> >
> I don't know what you mean by this.  We agree it's safe, great.  But it is the
> case as I've described it, you can see it from the implementation, every call to
> rte_eal_alarm_set starts with a malloc of a new alarm structure.
> 
> > >
> > > There is a race in what you describe above, insofar as it's possible that you
> > > might call rte_eal_alarm_cancel and return without having canceled all the
> > > matching alarms.  I don't see any clear documentation on what the behavior is
> > > supposed to be, but if you want to ensure that all matching alarms are cancelled
> > > or complete on return from rte_eal_alarm_cancel, that's perfectly fine (in Linux
> > > API parlance, that's usually denoted as a cancel_sync operation).
> >
> > Again, look at the patch. I changed documentation to inform about this behavior.
> >
> 
> This is the documentation included in the patch:
> Change alarm cancel function to thread-safe.
>         It eliminates a race between threads using rte_alarm_cancel and
>         rte_alarm_set.
> 
> Neither have you completely described the race condition (though you now have
> previously in this thread), nor have you completely addressed it (calling
> rte_eal_alarm_cancel and rte_eal_alarm_set still behaves exactly as it did
> previously with a 2nd thread).
> 
> > >
> > > For that race condition, you're correct, my patch doesn't address it, I see that
> > > now.  Though your patch doesn't either.  If you call rte_eal_alarm_cancel from
> > > within a callback function, then, by definition, you can't wait on the
> > > completion of the active alarm, because that's a deadlock.  It's a necessary
> > > evil, I grant you, but it means that you can't be guaranteed the cancelled and
> > > complete (cancel_sync) behavior that you want, at least not with the current
> > > api.  If you want that behavior, you need to do one of two things:
> >
> > This patch does not break any API. It only removes undefined behavior.
> >
> I never said it did break ABI.  I said that to completely fix it you would have
> to break ABI.  And it doesn't really remove undefined behavior, because you
> still have the old behavior in the recursive case (which you may be ok with, I
> don't know, but if you really want to address the behavior, you should address
> this aspect of it).
> 
> > >
> > > 1) Modify the api to allow callers to individually reference timer instances, so
> > > that when cancelling, we can return an appropriate return code to indicate to
> > > the caller that this alarm is in-progress.  That way you can guarantee the
> > > caller that the specific alarm that you cancelled is either complete and cancelled
> > > or currently executing.  Add an API to explicitly wait on a referenced alarm as
> > > well.  This allows developers to know that, when executing an alarm callback, an
> > > -ECURRENTLYEXECUTING return code is ok, because they are in the currently
> > > executing context.
> >
> > This would break the API for sure.
> Yes, it would.  Bruce Richardson just made a major ABI break with his mbuf
> cleanup set.  If there was a time to change ABI here, now would be the time I
> think.
Ok, too many words for me, to be honest :)
Can I summarise:
As I remember, the purpose of the patch was to fix the race condition inside the rte_alarm library.
I believe that the patch provided by Michal & Pawel fixes the issues you discovered.
If you think that is not the case, could you please provide a list of remaining issues?
Excluding ones that you just don't like, and where you are not happy with the rte_alarm API in total?
If you have any concerns about the mbuf reorg/expansion, it would probably be better to contact Bruce and express them,
rather than using it as an argument for breaking any existing API without a really good reason behind it.
Konstantin
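As a purely hypothetical illustration of option 1 in Neil's list above, a
handle-based interface might look like the following.  None of these names
exist in DPDK; only the -ECURRENTLYEXECUTING return code is taken from the
proposal itself:

#include <stdint.h>

struct rte_alarm_handle;        /* opaque reference to one alarm instance */

/* Set an alarm and return a handle identifying that specific instance. */
struct rte_alarm_handle *rte_alarm_set_ref(uint64_t us,
                void (*cb_fn)(void *), void *cb_arg);

/* Return 0 if the alarm was cancelled before it ran, or
 * -ECURRENTLYEXECUTING if its callback is running right now, which a
 * callback cancelling itself can treat as expected. */
int rte_alarm_cancel_ref(struct rte_alarm_handle *alarm);

/* Block until the alarm has completed or been cancelled; must not be
 * called from within the alarm's own callback. */
int rte_alarm_wait(struct rte_alarm_handle *alarm);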
* Re: [dpdk-dev] [PATCH 1/4 v2] compat: Add infrastructure to support symbol versioning
  2014-09-26 15:16  0%       ` Neil Horman
@ 2014-09-26 15:33  0%         ` Sergio Gonzalez Monroy
  2014-09-26 16:22  0%           ` Neil Horman
  0 siblings, 1 reply; 86+ results
From: Sergio Gonzalez Monroy @ 2014-09-26 15:33 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev
On Fri, Sep 26, 2014 at 11:16:30AM -0400, Neil Horman wrote:
> On Fri, Sep 26, 2014 at 03:16:08PM +0100, Sergio Gonzalez Monroy wrote:
> > On Thu, Sep 25, 2014 at 02:52:32PM -0400, Neil Horman wrote:
> > > Add initial pass header files to support symbol versioning.
> > > 
> > > ---
> > > Change notes
> > > v2)
> > > * Fixed ifdef in rte_compat.h to test for RTE_BUILD_SHARED_LIB instead of the
> > > non-existent RTE_SYMBOL_VERSIONING
> > > 
> > > * Fixed VERSION_SYMBOL macro to add the needed extra @ to make versioning work
> > > properly
> > > 
> > > * Improved/Clarified documentation
> > > 
> > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > CC: Thomas Monjalon <thomas.monjalon@6wind.com>
> > > CC: "Richardson, Bruce" <bruce.richardson@intel.com>
> > > CC: "Gonzalez Monroy, Sergio" <sergio.gonzalez.monroy@intel.com>
> > > ---
> > >  lib/Makefile                   |  1 +
> > >  lib/librte_compat/Makefile     | 38 ++++++++++++++++++
> > >  lib/librte_compat/rte_compat.h | 87 ++++++++++++++++++++++++++++++++++++++++++
> > >  mk/rte.lib.mk                  |  6 +++
> > >  4 files changed, 132 insertions(+)
> > >  create mode 100644 lib/librte_compat/Makefile
> > >  create mode 100644 lib/librte_compat/rte_compat.h
> > > 
> > > diff --git a/lib/Makefile b/lib/Makefile
> > > index 10c5bb3..a85b55b 100644
> > > --- a/lib/Makefile
> > > +++ b/lib/Makefile
> > > @@ -32,6 +32,7 @@
> > >  include $(RTE_SDK)/mk/rte.vars.mk
> > >  
> > >  DIRS-$(CONFIG_RTE_LIBC) += libc
> > > +DIRS-y += librte_compat
> > >  DIRS-$(CONFIG_RTE_LIBRTE_EAL) += librte_eal
> > >  DIRS-$(CONFIG_RTE_LIBRTE_MALLOC) += librte_malloc
> > >  DIRS-$(CONFIG_RTE_LIBRTE_RING) += librte_ring
> > > diff --git a/lib/librte_compat/Makefile b/lib/librte_compat/Makefile
> > > new file mode 100644
> > > index 0000000..3415c7b
> > > --- /dev/null
> > > +++ b/lib/librte_compat/Makefile
> > > @@ -0,0 +1,38 @@
> > > +#   BSD LICENSE
> > > +#
> > > +#   Copyright(c) 2010-2014 Neil Horman <nhorman@tuxdriver.com>
> > > +#   All rights reserved.
> > > +#
> > > +#   Redistribution and use in source and binary forms, with or without
> > > +#   modification, are permitted provided that the following conditions
> > > +#   are met:
> > > +#
> > > +#     * Redistributions of source code must retain the above copyright
> > > +#       notice, this list of conditions and the following disclaimer.
> > > +#     * Redistributions in binary form must reproduce the above copyright
> > > +#       notice, this list of conditions and the following disclaimer in
> > > +#       the documentation and/or other materials provided with the
> > > +#       distribution.
> > > +#     * Neither the name of Intel Corporation nor the names of its
> > > +#       contributors may be used to endorse or promote products derived
> > > +#       from this software without specific prior written permission.
> > > +#
> > > +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > +
> > > +include $(RTE_SDK)/mk/rte.vars.mk
> > > +
> > > +
> > > +# install includes
> > > +SYMLINK-y-include := rte_compat.h
> > > +
> > > +include $(RTE_SDK)/mk/rte.lib.mk
> > > diff --git a/lib/librte_compat/rte_compat.h b/lib/librte_compat/rte_compat.h
> > > new file mode 100644
> > > index 0000000..cff9aea
> > > --- /dev/null
> > > +++ b/lib/librte_compat/rte_compat.h
> > > @@ -0,0 +1,87 @@
> > > +/*-
> > > + *   BSD LICENSE
> > > + *
> > > + *   Copyright(c) 2010-2014 Neil Horman <nhorman@tuxdriver.com>.
> > > + *   All rights reserved.
> > > + *
> > > + *   Redistribution and use in source and binary forms, with or without
> > > + *   modification, are permitted provided that the following conditions
> > > + *   are met:
> > > + *
> > > + *     * Redistributions of source code must retain the above copyright
> > > + *       notice, this list of conditions and the following disclaimer.
> > > + *     * Redistributions in binary form must reproduce the above copyright
> > > + *       notice, this list of conditions and the following disclaimer in
> > > + *       the documentation and/or other materials provided with the
> > > + *       distribution.
> > > + *     * Neither the name of Intel Corporation nor the names of its
> > > + *       contributors may be used to endorse or promote products derived
> > > + *       from this software without specific prior written permission.
> > > + *
> > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > + */
> > > +
> > > +#ifndef _RTE_COMPAT_H_
> > > +#define _RTE_COMPAT_H_
> > > +
> > > +/*
> > > + * This is just a stringification macro for use below.
> > > + */
> > > +#define SA(x) #x
> > > +
> > > +#ifdef RTE_BUILD_SHARED_LIB
> > > +
> > > +/*
> > > + * Provides backwards compatibility when updating exported functions.
> > > + * When a symbol is exported from a library to provide an API, it also provides a
> > > + * calling convention (ABI) that is embodied in its name, return type,
> > > + * arguments, etc.  On occasion that function may need to change to accommodate
> > > + * new functionality, behavior, etc.  When that occurs, it is desirable to
> > > + * allow for backwards compatibility for a time with older binaries that are
> > > + * dynamically linked to the DPDK.  To support that, the __vsym and
> > > + * VERSION_SYMBOL macros are created.  They, in conjunction with the
> > > + * <library>_version.map file for a given library allow for multiple versions of
> > > + * a symbol to exist in a shared library so that older binaries need not be
> > > + * immediately recompiled. Their use is outlined in the following example:
> > > + * Assumptions: DPDK 1.(X) contains a function int foo(char *string)
> > > + *              DPDK 1.(X+1) needs to change foo to be int foo(int index)
> > > + *
> > > + * To accomplish this:
> > > + * 1) Edit lib/<library>/library_version.map to add a DPDK_1.(X+1) node, in which
> > > + * foo is exported as a global symbol.  Note that foo must be removed from the
> > > + * DPDK_1.(X) node, or you will see multiple symbol definitions
> > > + *
> > 
> > By removing the symbol from the previous node in the version map, you make
> > it local instead of global and applications linked against DPDK 1.8 will fail
> > with the new library.
> > 
> It sounds like you just did the remove part, and not the add part.  What does
> your new version map file look like?
> 
> Neil
> 
$ cat lib/librte_acl/rte_acl_version.map
DPDK_1.8 {
    global:
    rte_acl_find_existing;
    rte_acl_free;
    rte_acl_add_rules;
    rte_acl_reset_rules;
    rte_acl_build;
    rte_acl_reset;
    rte_acl_classify;
    rte_acl_dump;
    rte_acl_list_dump;
    rte_acl_ipv4vlan_add_rules;
    rte_acl_ipv4vlan_build;
    rte_acl_classify_scalar;
    rte_acl_classify_alg;
    rte_acl_set_ctx_classify;
    local: *;
};
DPDK_1.9 {
    global:
    rte_acl_create;
} DPDK_1.8;
Anyway, if the DPDK_1.9 node was not defined, the dso would not have the symbol:
rte_acl_create@@DPDK_1.9
Sergio
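
For what it's worth, the usual way out of the failure shown above, per the GNU .symver convention described in the dsohowto paper referenced earlier in this thread, is to bind the old implementation with a single '@' (non-default version) and the new one with '@@' (the default). A minimal C sketch, using the thread's foo() example rather than the real rte_acl signatures, and assuming the map file lists foo as global under both the DPDK_1.8 and DPDK_1.9 nodes:

/* old implementation, kept compiled into the library */
int foo_v18(char *string)
{
	return string ? 0 : -1;
}
/* single '@': exported as the non-default foo@DPDK_1.8 */
__asm__(".symver foo_v18, foo@DPDK_1.8");

/* new implementation */
int foo_v19(int index)
{
	return index;
}
/* double '@': exported as the default foo@@DPDK_1.9 */
__asm__(".symver foo_v19, foo@@DPDK_1.9");

With that shape, binaries linked against the 1.8 library keep resolving foo@DPDK_1.8, while newly linked ones pick up foo@@DPDK_1.9.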
> > Following the steps you describe, if we create a new version of the function
> > rte_acl_create we would end up with the following dso:
> > 
> > $ readelf -s x86_64-native-linuxapp-gcc/lib/librte_acl.so | grep "create\|\.symtab\|\.dynsym"
> > Symbol table '.dynsym' contains 42 entries:
> >     28: 0000000000001990   627 FUNC    GLOBAL DEFAULT   12 rte_acl_create@@DPDK_1.9
> > Symbol table '.symtab' contains 147 entries:
> >     94: 0000000000001960    36 FUNC    LOCAL  DEFAULT   12 rte_acl_create_v18
> >    105: 0000000000001960    36 FUNC    LOCAL  DEFAULT   12 rte_acl_create@@DPDK_1.8
> >    138: 0000000000001990   627 FUNC    GLOBAL DEFAULT   12 rte_acl_create
> > 
> > You can check that applications linked with the old lib will fail to run.
> > Note that to easily check this you should define the environment variable
> > LD_BIND_NOW to resolve all symbols at program startup (man ld.so).
> > 
> > Sergio
> > 
> > > + * 2) rename the existing function int foo(char *string) to 
> > > + * 	int __vsym foo_v18(char *string)
> > > + *
> > > + * 3) Add this macro immediately below the function
> > > + * 	VERSION_SYMBOL(foo, _v18, 1.8);
> > > + *
> > > + */
> > > +#define VERSION_SYMBOL(b, e, v) __asm__(".symver " SA(b) SA(e) ", "SA(b)"@@DPDK_"SA(v))
> > > +#define __vsym __attribute__((used))
> > > +
> > > +#else
> > > +/*
> > > + * No symbol versioning in use
> > > + */
> > > +#define VERSION_SYMBOL(b, e, v)
> > > +#define __vsym
> > > +
> > > +/*
> > > + * RTE_BUILD_SHARED_LIB
> > > + */
> > > +#endif
> > > +
> > > +
> > > +#endif /* _RTE_COMPAT_H_ */
> > > diff --git a/mk/rte.lib.mk b/mk/rte.lib.mk
> > > index f458258..82ac309 100644
> > > --- a/mk/rte.lib.mk
> > > +++ b/mk/rte.lib.mk
> > > @@ -40,8 +40,12 @@ VPATH += $(SRCDIR)
> > >  
> > >  ifeq ($(RTE_BUILD_SHARED_LIB),y)
> > >  LIB := $(patsubst %.a,%.so,$(LIB))
> > > +
> > > +CPU_LDFLAGS += --version-script=$(EXPORT_MAP)
> > > +
> > >  endif
> > >  
> > > +
> > >  _BUILD = $(LIB)
> > >  _INSTALL = $(INSTALL-FILES-y) $(SYMLINK-FILES-y) $(RTE_OUTPUT)/lib/$(LIB)
> > >  _CLEAN = doclean
> > > @@ -160,7 +164,9 @@ endif
> > >  $(RTE_OUTPUT)/lib/$(LIB): $(LIB)
> > >  	@echo "  INSTALL-LIB $(LIB)"
> > >  	@[ -d $(RTE_OUTPUT)/lib ] || mkdir -p $(RTE_OUTPUT)/lib
> > > +ifneq ($(LIB),)
> > >  	$(Q)cp -f $(LIB) $(RTE_OUTPUT)/lib
> > > +endif
> > >  
> > >  #
> > >  # Clean all generated files
> > > -- 
> > > 1.9.3
> > > 
> > 
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCH 1/4 v2] compat: Add infrastructure to support symbol versioning
  2014-09-26 14:16  0%     ` Sergio Gonzalez Monroy
@ 2014-09-26 15:16  0%       ` Neil Horman
  2014-09-26 15:33  0%         ` Sergio Gonzalez Monroy
  0 siblings, 1 reply; 86+ results
From: Neil Horman @ 2014-09-26 15:16 UTC (permalink / raw)
  To: Sergio Gonzalez Monroy; +Cc: dev
On Fri, Sep 26, 2014 at 03:16:08PM +0100, Sergio Gonzalez Monroy wrote:
> On Thu, Sep 25, 2014 at 02:52:32PM -0400, Neil Horman wrote:
> > Add initial pass header files to support symbol versioning.
> > 
> > ---
> > Change notes
> > v2)
> > * Fixed ifdef in rte_compat.h to test for RTE_BUILD_SHARED_LIB instead of the
> > non-existent RTE_SYMBOL_VERSIONING
> > 
> > * Fixed VERSION_SYMBOL macro to add the needed extra @ to make versioning work
> > properly
> > 
> > * Improved/Clarified documentation
> > 
> > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > CC: Thomas Monjalon <thomas.monjalon@6wind.com>
> > CC: "Richardson, Bruce" <bruce.richardson@intel.com>
> > CC: "Gonzalez Monroy, Sergio" <sergio.gonzalez.monroy@intel.com>
> > ---
> >  lib/Makefile                   |  1 +
> >  lib/librte_compat/Makefile     | 38 ++++++++++++++++++
> >  lib/librte_compat/rte_compat.h | 87 ++++++++++++++++++++++++++++++++++++++++++
> >  mk/rte.lib.mk                  |  6 +++
> >  4 files changed, 132 insertions(+)
> >  create mode 100644 lib/librte_compat/Makefile
> >  create mode 100644 lib/librte_compat/rte_compat.h
> > 
> > diff --git a/lib/Makefile b/lib/Makefile
> > index 10c5bb3..a85b55b 100644
> > --- a/lib/Makefile
> > +++ b/lib/Makefile
> > @@ -32,6 +32,7 @@
> >  include $(RTE_SDK)/mk/rte.vars.mk
> >  
> >  DIRS-$(CONFIG_RTE_LIBC) += libc
> > +DIRS-y += librte_compat
> >  DIRS-$(CONFIG_RTE_LIBRTE_EAL) += librte_eal
> >  DIRS-$(CONFIG_RTE_LIBRTE_MALLOC) += librte_malloc
> >  DIRS-$(CONFIG_RTE_LIBRTE_RING) += librte_ring
> > diff --git a/lib/librte_compat/Makefile b/lib/librte_compat/Makefile
> > new file mode 100644
> > index 0000000..3415c7b
> > --- /dev/null
> > +++ b/lib/librte_compat/Makefile
> > @@ -0,0 +1,38 @@
> > +#   BSD LICENSE
> > +#
> > +#   Copyright(c) 2010-2014 Neil Horman <nhorman@tuxdriver.com>
> > +#   All rights reserved.
> > +#
> > +#   Redistribution and use in source and binary forms, with or without
> > +#   modification, are permitted provided that the following conditions
> > +#   are met:
> > +#
> > +#     * Redistributions of source code must retain the above copyright
> > +#       notice, this list of conditions and the following disclaimer.
> > +#     * Redistributions in binary form must reproduce the above copyright
> > +#       notice, this list of conditions and the following disclaimer in
> > +#       the documentation and/or other materials provided with the
> > +#       distribution.
> > +#     * Neither the name of Intel Corporation nor the names of its
> > +#       contributors may be used to endorse or promote products derived
> > +#       from this software without specific prior written permission.
> > +#
> > +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > +
> > +include $(RTE_SDK)/mk/rte.vars.mk
> > +
> > +
> > +# install includes
> > +SYMLINK-y-include := rte_compat.h
> > +
> > +include $(RTE_SDK)/mk/rte.lib.mk
> > diff --git a/lib/librte_compat/rte_compat.h b/lib/librte_compat/rte_compat.h
> > new file mode 100644
> > index 0000000..cff9aea
> > --- /dev/null
> > +++ b/lib/librte_compat/rte_compat.h
> > @@ -0,0 +1,87 @@
> > +/*-
> > + *   BSD LICENSE
> > + *
> > + *   Copyright(c) 2010-2014 Neil Horman <nhorman@tuxdriver.com>.
> > + *   All rights reserved.
> > + *
> > + *   Redistribution and use in source and binary forms, with or without
> > + *   modification, are permitted provided that the following conditions
> > + *   are met:
> > + *
> > + *     * Redistributions of source code must retain the above copyright
> > + *       notice, this list of conditions and the following disclaimer.
> > + *     * Redistributions in binary form must reproduce the above copyright
> > + *       notice, this list of conditions and the following disclaimer in
> > + *       the documentation and/or other materials provided with the
> > + *       distribution.
> > + *     * Neither the name of Intel Corporation nor the names of its
> > + *       contributors may be used to endorse or promote products derived
> > + *       from this software without specific prior written permission.
> > + *
> > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > + */
> > +
> > +#ifndef _RTE_COMPAT_H_
> > +#define _RTE_COMPAT_H_
> > +
> > +/*
> > + * This is just a stringification macro for use below.
> > + */
> > +#define SA(x) #x
> > +
> > +#ifdef RTE_BUILD_SHARED_LIB
> > +
> > +/*
> > + * Provides backwards compatibility when updating exported functions.
> > + * When a symbol is exported from a library to provide an API, it also provides a
> > + * calling convention (ABI) that is embodied in its name, return type,
> > + * arguments, etc.  On occasion that function may need to change to accommodate
> > + * new functionality, behavior, etc.  When that occurs, it is desirable to
> > + * allow for backwards compatibility for a time with older binaries that are
> > + * dynamically linked to the DPDK.  To support that, the __vsym and
> > + * VERSION_SYMBOL macros are created.  They, in conjunction with the
> > + * <library>_version.map file for a given library allow for multiple versions of
> > + * a symbol to exist in a shared library so that older binaries need not be
> > + * immediately recompiled. Their use is outlined in the following example:
> > + * Assumptions: DPDK 1.(X) contains a function int foo(char *string)
> > + *              DPDK 1.(X+1) needs to change foo to be int foo(int index)
> > + *
> > + * To accomplish this:
> > + * 1) Edit lib/<library>/library_version.map to add a DPDK_1.(X+1) node, in which
> > + * foo is exported as a global symbol.  Note that foo must be removed from the
> > + * DPDK_1.(X) node, or you will see multiple symbol definitions
> > + *
> 
> By removing the symbol from the previous node in the version map, you make
> it local instead of global and applications linked against DPDK 1.8 will fail
> with the new library.
> 
It sounds like you just did the remove part, and not the add part.  What does
your new version map file look like?
Neil
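
To make the quoted steps below concrete, here is what they assemble into, an illustrative sketch only, not code from the patch:

#include <rte_compat.h>

/* step 2: the old body, renamed */
int __vsym foo_v18(char *string)
{
	return string ? 0 : -1;
}
/* step 3: expands to .symver foo_v18, foo@@DPDK_1.8 */
VERSION_SYMBOL(foo, _v18, 1.8);

/* the changed function, exported through the DPDK_1.(X+1) map node */
int foo(int index)
{
	return index;
}

Note that the v2 macro emits a default-version ('@@') binding for the old symbol here; keep that in mind when reading the readelf output below.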
> Following the steps you describe, if we create a new version of the function
> rte_acl_create we would end up with the following dso:
> 
> $ readelf -s x86_64-native-linuxapp-gcc/lib/librte_acl.so | grep "create\|\.symtab\|\.dynsym"
> Symbol table '.dynsym' contains 42 entries:
>     28: 0000000000001990   627 FUNC    GLOBAL DEFAULT   12 rte_acl_create@@DPDK_1.9
> Symbol table '.symtab' contains 147 entries:
>     94: 0000000000001960    36 FUNC    LOCAL  DEFAULT   12 rte_acl_create_v18
>    105: 0000000000001960    36 FUNC    LOCAL  DEFAULT   12 rte_acl_create@@DPDK_1.8
>    138: 0000000000001990   627 FUNC    GLOBAL DEFAULT   12 rte_acl_create
> 
> You can check that applications linked with the old lib will fail to run.
> Note that to easily check this you should define the environment variable
> LD_BIND_NOW to resolve all symbols at program startup (man ld.so).
> 
> Sergio
> 
> > + * 2) rename the existing function int foo(char *string) to 
> > + * 	int __vsym foo_v18(char *string)
> > + *
> > + * 3) Add this macro immediately below the function
> > + * 	VERSION_SYMBOL(foo, _v18, 1.8);
> > + *
> > + */
> > +#define VERSION_SYMBOL(b, e, v) __asm__(".symver " SA(b) SA(e) ", "SA(b)"@@DPDK_"SA(v))
> > +#define __vsym __attribute__((used))
> > +
> > +#else
> > +/*
> > + * No symbol versioning in use
> > + */
> > +#define VERSION_SYMBOL(b, e, v)
> > +#define __vsym
> > +
> > +/*
> > + * RTE_BUILD_SHARED_LIB
> > + */
> > +#endif
> > +
> > +
> > +#endif /* _RTE_COMPAT_H_ */
> > diff --git a/mk/rte.lib.mk b/mk/rte.lib.mk
> > index f458258..82ac309 100644
> > --- a/mk/rte.lib.mk
> > +++ b/mk/rte.lib.mk
> > @@ -40,8 +40,12 @@ VPATH += $(SRCDIR)
> >  
> >  ifeq ($(RTE_BUILD_SHARED_LIB),y)
> >  LIB := $(patsubst %.a,%.so,$(LIB))
> > +
> > +CPU_LDFLAGS += --version-script=$(EXPORT_MAP)
> > +
> >  endif
> >  
> > +
> >  _BUILD = $(LIB)
> >  _INSTALL = $(INSTALL-FILES-y) $(SYMLINK-FILES-y) $(RTE_OUTPUT)/lib/$(LIB)
> >  _CLEAN = doclean
> > @@ -160,7 +164,9 @@ endif
> >  $(RTE_OUTPUT)/lib/$(LIB): $(LIB)
> >  	@echo "  INSTALL-LIB $(LIB)"
> >  	@[ -d $(RTE_OUTPUT)/lib ] || mkdir -p $(RTE_OUTPUT)/lib
> > +ifneq ($(LIB),)
> >  	$(Q)cp -f $(LIB) $(RTE_OUTPUT)/lib
> > +endif
> >  
> >  #
> >  # Clean all generated files
> > -- 
> > 1.9.3
> > 
> 
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCH v2] Change alarm cancel function to thread-safe:
  @ 2014-09-26 15:01  5%                 ` Neil Horman
  2014-09-26 15:41  0%                   ` Ananyev, Konstantin
  0 siblings, 1 reply; 86+ results
From: Neil Horman @ 2014-09-26 15:01 UTC (permalink / raw)
  To: Wodkowski, PawelX; +Cc: dev
On Fri, Sep 26, 2014 at 02:01:05PM +0000, Wodkowski, PawelX wrote:
> > > > Maybe I don't see something obvious? :)
> > 
> > I think you're missing the fact that your patch doesn't do what you assert above
> > either :)
> 
> The issue is not in setting alarms but in canceling them. If you look closely at my patch you
> will see that it addresses this issue (look at the added *do { lock(); ...; unlock(); } while ()*
> part).
> 
I get where the issue is, and I'm looking at your patch.  I see that you did
some locking there.  The issue I'm pointing out is that, if you call
rte_eal_alarm_cancel from within an alarm callback, you will exit the alarm_cancel
function with, by definition, one alarm executing (the one you are currently
running).  Your patch works perfectly for the case where another thread calls
cancel, in that it waits until the executing alarm is complete, but it doesn't
work in the case where you are calling it from within the alarm callback. If
your goal is to guarantee that all the matching alarms are cancelled and
complete, you haven't done that, because the recursive state is still unhandled.
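A minimal sketch of that recursive case (hypothetical application code, using only the existing alarm API):

#include <rte_alarm.h>

static void my_cb(void *arg)
{
	/* cancelling ourselves: one matching alarm (this one) is, by
	 * definition, still executing when cancel returns, so a
	 * cancel-and-wait semantic cannot hold here without deadlocking */
	rte_eal_alarm_cancel(my_cb, arg);
}

static int arm_alarms(void)
{
	/* two alarms sharing cb/arg; a cancel issued from inside one
	 * callback can reap the other but never itself */
	rte_eal_alarm_set(10000, my_cb, NULL);	/* 10ms */
	return rte_eal_alarm_set(20000, my_cb, NULL);	/* 20ms */
}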
> > 
> > First, lets address rte_alarm_set.  There is no notion of "re-arming" in this
> > alarm implementation, because there's no ability to refer to a specific alarm
> > from the callers perspective.  When you call rte_eal_alarm_set you get a new
> > alarm every time.  So I don't really see a race there.  It might not be exactly
> > the behavior you want, but its not a race, becuase you're not modifying an
> > alarm
> > in the middle of execution, you're just creating a new alarm, which is safe.
> 
> OK, it is safe, but this is not the case.
> 
I don't know what you mean by this.  We agree it's safe, great.  But it is the
case as I've described it: you can see from the implementation that every call to
rte_eal_alarm_set starts with a malloc of a new alarm structure.
> > 
> > There is a race in what you describe above, insofar as its possible that you
> > might call rte_eal_alarm_cancel and return without having canceled all the
> > matching alarms.  I don't see any clear documentation on what the behavior is
> > supposed to be, but if you want to ensure that all matching alarms are cancelled
> > or complete on return from rte_eal_alarm_cancel, thats perfectly fine (in linux
> > API parlance, thats usually denoted as a cancel_sync operation).
> 
> Again, look at the patch. I changed the documentation to describe this behavior.
> 
This is the documentation included in the patch:
Change alarm cancel function to thread-safe.
        It eliminates a race between threads using rte_alarm_cancel and
        rte_alarm_set.
Neither have you completely described the race condition (though you now have
previously in this thread), nor have you completely addressed it (calling
rte_eal_alarm_cancel and rte_eal_alarm_set still behaves exactly as it did
previously with a 2nd thread).
> > 
> > For that race condition, you're correct, my patch doesn't address it, I see that
> > now.  Though your patch doesn't either.  If you call rte_eal_alarm_cancel from
> > within a callback function, then, by definition, you can't wait on the
> > completion of the active alarm, because that's a deadlock.  It's a necessary
> > evil, I grant you, but it means that you can't be guaranteed the cancelled and
> > complete (cancel_sync) behavior that you want, at least not with the current
> > api.  If you want that behavior, you need to do one of two things:
> 
> This patch does not break any API. It only removes undefined behavior.
> 
I never said it did break ABI.  I said that to completely fix it you would have
to break ABI.  And it doesn't really remove undefined behavior, because you
still have the old behavior in the recursive case (which you may be ok with, I
don't know, but if you really want to address the behavior, you should address
this aspect of it).
> > 
> > 1) Modify the api to allow callers to individually reference timer instances, so
> > that when cancelling, we can return an appropriate return code to indicate to
> > the caller that this alarm is in-progress.  That way you can guarantee the
> > caller that the specific alarm that you cancelled is either complete and cancelled
> > or currently executing.  Add an API to explicitly wait on a referenced alarm as
> > well.  This allows developers to know that, when executing an alarm callback, an
> > -ECURRENTLYEXECUTING return code is ok, because they are in the currently
> > executing context.
> 
> This would break API for sure.
Yes, it would.  Bruce Richardson just made a major ABI break with his mbuf
cleanup set.  If there was a time to change ABI here, now would be the time I
think.
Neil
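
For reference, the handle-based shape from option 1 above might look like the following; every name and signature here is invented for illustration, none of it is in the posted patches:

#include <stdint.h>
#include <rte_alarm.h>	/* for rte_eal_alarm_callback */

struct rte_alarm;	/* opaque per-instance handle */

/* arm an alarm and hand back a handle naming this instance */
int rte_alarm_create(struct rte_alarm **hdl, uint64_t us,
		rte_eal_alarm_callback cb_fn, void *cb_arg);

/* cancel one specific instance; returns an -ECURRENTLYEXECUTING-style
 * code when that instance's callback is running (possibly the caller) */
int rte_alarm_cancel_one(struct rte_alarm *hdl);

/* explicitly wait for a referenced alarm to finish executing */
int rte_alarm_wait(struct rte_alarm *hdl);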
> 
> 
^ permalink raw reply	[relevance 5%]
* Re: [dpdk-dev] [PATCH 0/4] Add DSO symbol versioning to support backwards compatibility
  2014-09-26 10:41  0%       ` Thomas Monjalon
@ 2014-09-26 14:45  5%         ` Neil Horman
  2014-09-26 22:02  4%           ` Stephen Hemminger
  0 siblings, 1 reply; 86+ results
From: Neil Horman @ 2014-09-26 14:45 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev
On Fri, Sep 26, 2014 at 12:41:33PM +0200, Thomas Monjalon wrote:
> Hi Neil,
> 
> 2014-09-24 14:19, Neil Horman:
> > Ping Thomas. I know you're busy, but I would like this to not fall off anyones
> > radar.  You alluded to concerns regarding what, for lack of a better term,
> > ABI/API lock-in.  I had asked you to enumerate/elaborate on specifics, but never
> > heard back.  Are there further specifics you wish to discuss, or are you
> > satisfied with the above answers?
> 
> Sorry for not being very reactive on this thread.
> All this discussion is very interesting but it's really not the proper
> time to apply it. As you said, it requires an extra effort. I'm not saying
> it will never be integrated. I'm just saying that we cannot change
> everything at the same time.
> 
> Let me sum up the situation. This community project has been very active
> for a few months now. First, we learnt how to make some releases together
> and we are improving the process to be able to deliver a new major release
> every 4 months while having some good quality process.
> But these releases are still not complete because documentation is not
> integrated yet. Then developers should have a role in documentation updates.
> We also need to integrate and learn how to use more tools to be more
> efficient and improve quality.
> 
> So the question is "when should we care about API compatibility"?
> And the answer is: ASAP, but not now. I feel next year is a better target.
> Because the most important priority is to move together at a pace which
> allows most of us to stay in the race.
> 
I'm sorry Thomas, I don't accept this.  I asked you for details as to your
concerns regarding this patch series, and you've provided more vague comments.
I need details to address them.
You say it requires extra effort, you're right, it does.  Any feature that you
integrate requires some additional effort.  How is this patch any different
from adding the acl library or any other new API?  Everything requires
maintenance, that's how software works.  What specifically about this patch series
makes the effort insurmountable to you?
You say you're improving your process.  Great, this patch aids in that process
by ensuring backwards compatibility for a period of time.  Given that the API
and ABI can still evolve within this framework, as I've described, how is this
patch series not a significant step forward toward your goal of quality process.
You say documentation isn't integrated.  So, what does getting documentation
integrated have to do with this patch set, or any other?  I don't see you
holding any other patches based on documentation.  Again, nothing in this series
prevents evolution of the API or ABI.  If your hope is to wait until
everything is perfect, then apply some control to the public-facing API, and get
it all documented, none of those things will ever happen, I promise you.
You say you also need to learn to use more tools to be more efficient and
improve quality.  Great!  Thats exactly what this is. If we mandate even a short
term commitment to ABI stability (1 single relese worth of time), we will
quickly identify what API's change quickly and where we need to be cautious with
our API design.  If you just assume that developers will get better of their own
volition, it will never happen.
You say this should go in next year, but not now.  When exactly?  What event do
you foresee occurring in the next 12-18 months that will change everything such
that we can start supporting an ABI for more than just a few weeks at the head of
the tree?  
To this end, I just did a quick search through the git history for dpdk to look
at the histories of all the header files that are exposed via the makefile
SYMLINK command (given that that provides a list of header files that
applications can include, and embodies all the function symbols and data types
applications have access to.
There are 179 total commits in that list.
Of those, a bit of spot checking suggests that about 10-15% of them actually
change ABI, and many of those came from Bruce's rework of the mbuf structure.
That's about 17-20 instances over the last 2 years where an ABI update would have
been needed.  That seems pretty reasonable to me.  Where exactly is your concern
here?
Neil
> -- 
> Thomas
> 
^ permalink raw reply	[relevance 5%]
* Re: [dpdk-dev] [PATCH 1/4 v2] compat: Add infrastructure to support symbol versioning
  2014-09-25 18:52  4%   ` [dpdk-dev] [PATCH 1/4 v2] " Neil Horman
@ 2014-09-26 14:16  0%     ` Sergio Gonzalez Monroy
  2014-09-26 15:16  0%       ` Neil Horman
  0 siblings, 1 reply; 86+ results
From: Sergio Gonzalez Monroy @ 2014-09-26 14:16 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev
On Thu, Sep 25, 2014 at 02:52:32PM -0400, Neil Horman wrote:
> Add initial pass header files to support symbol versioning.
> 
> ---
> Change notes
> v2)
> * Fixed ifdef in rte_compat.h to test for RTE_BUILD_SHARED_LIB instead of the
> non-existent RTE_SYMBOL_VERSIONING
> 
> * Fixed VERSION_SYMBOL macro to add the needed extra @ to make versioning work
> properly
> 
> * Improved/Clarified documentation
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> CC: Thomas Monjalon <thomas.monjalon@6wind.com>
> CC: "Richardson, Bruce" <bruce.richardson@intel.com>
> CC: "Gonzalez Monroy, Sergio" <sergio.gonzalez.monroy@intel.com>
> ---
>  lib/Makefile                   |  1 +
>  lib/librte_compat/Makefile     | 38 ++++++++++++++++++
>  lib/librte_compat/rte_compat.h | 87 ++++++++++++++++++++++++++++++++++++++++++
>  mk/rte.lib.mk                  |  6 +++
>  4 files changed, 132 insertions(+)
>  create mode 100644 lib/librte_compat/Makefile
>  create mode 100644 lib/librte_compat/rte_compat.h
> 
> diff --git a/lib/Makefile b/lib/Makefile
> index 10c5bb3..a85b55b 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -32,6 +32,7 @@
>  include $(RTE_SDK)/mk/rte.vars.mk
>  
>  DIRS-$(CONFIG_RTE_LIBC) += libc
> +DIRS-y += librte_compat
>  DIRS-$(CONFIG_RTE_LIBRTE_EAL) += librte_eal
>  DIRS-$(CONFIG_RTE_LIBRTE_MALLOC) += librte_malloc
>  DIRS-$(CONFIG_RTE_LIBRTE_RING) += librte_ring
> diff --git a/lib/librte_compat/Makefile b/lib/librte_compat/Makefile
> new file mode 100644
> index 0000000..3415c7b
> --- /dev/null
> +++ b/lib/librte_compat/Makefile
> @@ -0,0 +1,38 @@
> +#   BSD LICENSE
> +#
> +#   Copyright(c) 2010-2014 Neil Horman <nhorman@tuxdriver.com>
> +#   All rights reserved.
> +#
> +#   Redistribution and use in source and binary forms, with or without
> +#   modification, are permitted provided that the following conditions
> +#   are met:
> +#
> +#     * Redistributions of source code must retain the above copyright
> +#       notice, this list of conditions and the following disclaimer.
> +#     * Redistributions in binary form must reproduce the above copyright
> +#       notice, this list of conditions and the following disclaimer in
> +#       the documentation and/or other materials provided with the
> +#       distribution.
> +#     * Neither the name of Intel Corporation nor the names of its
> +#       contributors may be used to endorse or promote products derived
> +#       from this software without specific prior written permission.
> +#
> +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> +
> +include $(RTE_SDK)/mk/rte.vars.mk
> +
> +
> +# install includes
> +SYMLINK-y-include := rte_compat.h
> +
> +include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/lib/librte_compat/rte_compat.h b/lib/librte_compat/rte_compat.h
> new file mode 100644
> index 0000000..cff9aea
> --- /dev/null
> +++ b/lib/librte_compat/rte_compat.h
> @@ -0,0 +1,87 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2014 Neil Horman <nhorman@tuxdriver.com>.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef _RTE_COMPAT_H_
> +#define _RTE_COMPAT_H_
> +
> +/*
> + * This is just a stringification macro for use below.
> + */
> +#define SA(x) #x
> +
> +#ifdef RTE_BUILD_SHARED_LIB
> +
> +/*
> + * Provides backwards compatibility when updating exported functions.
> > + * When a symbol is exported from a library to provide an API, it also provides a
> + * calling convention (ABI) that is embodied in its name, return type,
> > + * arguments, etc.  On occasion that function may need to change to accommodate
> > + * new functionality, behavior, etc.  When that occurs, it is desirable to
> > + * allow for backwards compatibility for a time with older binaries that are
> > + * dynamically linked to the DPDK.  To support that, the __vsym and
> + * VERSION_SYMBOL macros are created.  They, in conjunction with the
> + * <library>_version.map file for a given library allow for multiple versions of
> + * a symbol to exist in a shared library so that older binaries need not be
> + * immediately recompiled. Their use is outlined in the following example:
> + * Assumptions: DPDK 1.(X) contains a function int foo(char *string)
> + *              DPDK 1.(X+1) needs to change foo to be int foo(int index)
> + *
> + * To accomplish this:
> + * 1) Edit lib/<library>/library_version.map to add a DPDK_1.(X+1) node, in which
> + * foo is exported as a global symbol.  Note that foo must be removed from the
> > + * DPDK_1.(X) node, or you will see multiple symbol definitions
> + *
By removing the symbol from the previous node in the version map, you make
it local instead of global and applications linked against DPDK 1.8 will fail
with the new library.
Following the steps you describe, if we create a new version of the function
rte_acl_create we would end up with the following dso:
$ readelf -s x86_64-native-linuxapp-gcc/lib/librte_acl.so | grep "create\|\.symtab\|\.dynsym"
Symbol table '.dynsym' contains 42 entries:
    28: 0000000000001990   627 FUNC    GLOBAL DEFAULT   12 rte_acl_create@@DPDK_1.9
Symbol table '.symtab' contains 147 entries:
    94: 0000000000001960    36 FUNC    LOCAL  DEFAULT   12 rte_acl_create_v18
   105: 0000000000001960    36 FUNC    LOCAL  DEFAULT   12 rte_acl_create@@DPDK_1.8
   138: 0000000000001990   627 FUNC    GLOBAL DEFAULT   12 rte_acl_create
You can check that applications linked with the old lib will fail to run.
Note that to easily check this you should define the environment variable
LD_BIND_NOW to resolve all symbols at program startup (man ld.so).
Sergio
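
Spelled out, the check amounts to something like this (hypothetical binary name; see man ld.so):

$ LD_BIND_NOW=1 ./app_built_against_1.8

With LD_BIND_NOW set, all symbols resolve at program startup, so the missing versioned symbol fails the program immediately instead of on the first call.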
> + * 2) rename the existing function int foo(char *string) to 
> + * 	int __vsym foo_v18(char *string)
> + *
> + * 3) Add this macro immediately below the function
> + * 	VERSION_SYMBOL(foo, _v18, 1.8);
> + *
> + */
> +#define VERSION_SYMBOL(b, e, v) __asm__(".symver " SA(b) SA(e) ", "SA(b)"@@DPDK_"SA(v))
> +#define __vsym __attribute__((used))
> +
> +#else
> +/*
> + * No symbol versioning in use
> + */
> +#define VERSION_SYMBOL(b, e, v)
> +#define __vsym
> +
> +/*
> + * RTE_BUILD_SHARED_LIB
> + */
> +#endif
> +
> +
> +#endif /* _RTE_COMPAT_H_ */
> diff --git a/mk/rte.lib.mk b/mk/rte.lib.mk
> index f458258..82ac309 100644
> --- a/mk/rte.lib.mk
> +++ b/mk/rte.lib.mk
> @@ -40,8 +40,12 @@ VPATH += $(SRCDIR)
>  
>  ifeq ($(RTE_BUILD_SHARED_LIB),y)
>  LIB := $(patsubst %.a,%.so,$(LIB))
> +
> +CPU_LDFLAGS += --version-script=$(EXPORT_MAP)
> +
>  endif
>  
> +
>  _BUILD = $(LIB)
>  _INSTALL = $(INSTALL-FILES-y) $(SYMLINK-FILES-y) $(RTE_OUTPUT)/lib/$(LIB)
>  _CLEAN = doclean
> @@ -160,7 +164,9 @@ endif
>  $(RTE_OUTPUT)/lib/$(LIB): $(LIB)
>  	@echo "  INSTALL-LIB $(LIB)"
>  	@[ -d $(RTE_OUTPUT)/lib ] || mkdir -p $(RTE_OUTPUT)/lib
> +ifneq ($(LIB),)
>  	$(Q)cp -f $(LIB) $(RTE_OUTPUT)/lib
> +endif
>  
>  #
>  # Clean all generated files
> -- 
> 1.9.3
> 
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCH v2] Change alarm cancel function to thread-safe:
  @ 2014-09-26 13:43  3%         ` Neil Horman
  0 siblings, 0 replies; 86+ results
From: Neil Horman @ 2014-09-26 13:43 UTC (permalink / raw)
  To: Wodkowski, PawelX; +Cc: dev
On Fri, Sep 26, 2014 at 06:33:12AM +0000, Wodkowski, PawelX wrote:
> > Given what you said above, I agree, at least in the current implementation.  It
> > still seems like there's a simpler solution that doesn't require all the
> > comparative gymnastics.
> 
> Yes, there is a simpler solution, but it involves recursive locking.
> DPDK recursive spinlocks are not an option here, so the only option is a POSIX recursive
> mutex, which I think is an even worse option than these gymnastics.
> 
I agree, let's avoid more locking if we can.
> > 
> > What if, instead of testing if you're the callback thread, we turn the executing
> > field of alarm_entry into a bitfield, where bit 0 represents the former
> > "executing" state, and bit 1 is defined as a "cancelled" bit.  Then
> > rte_eal_alarm_cancel becomes a search that, when an alarm is found simply or's
> > in the cancelled bit to the executing bit field.  When the callback thread runs,
> > it skips executing any alarm that is marked as cancelled, but frees all alarm
> > entries that are executed or cancelled.  That gives us a single point at which
> > frees of alarm entires happen?  Something like the patch below (completely
> > untested)?
> > 
> > It also seems like the alarm api as a whole could use some improvement.  The
> > way its written right now, theres no way to refer to a specific alarm (i.e.
> > cancelation relies on the specification of a function and data pointer, which
> > may refer to multiple timers).  Shouldn't rte_eal_alarm_set return an opaque
> > handle to a unique timer instance that can be store by a caller and used to
> > specfically cancel that timer?  Thats how both the bsd and linux timer
> > subsystems model timers.
> > 
> 
> Goal was to not break user applications that use this library.
> 
You break API all the time, why are you worried about it here?  I'm all for
maintaining the API, definitely, but once my ABI versioning code gets into place we
can manage this a lot better.
> > 
> > 
> > diff --git a/lib/librte_eal/linuxapp/eal/eal_alarm.c b/lib/librte_eal/linuxapp/eal/eal_alarm.c
> > index 480f0cb..73b6dc5 100644
> > --- a/lib/librte_eal/linuxapp/eal/eal_alarm.c
> > +++ b/lib/librte_eal/linuxapp/eal/eal_alarm.c
> > @@ -64,6 +64,9 @@
> >  #define MS_PER_S 1000
> >  #define US_PER_S (US_PER_MS * MS_PER_S)
> > 
> > +#define ALARM_EXECUTING (1 << 0)
> > +#define ALARM_CANCELLED (1 << 1)
> > +
> >  struct alarm_entry {
> >  	LIST_ENTRY(alarm_entry) next;
> >  	struct timeval time;
> > @@ -107,12 +110,14 @@ eal_alarm_callback(struct rte_intr_handle *hdl __rte_unused,
> >  			gettimeofday(&now, NULL) == 0 &&
> >  			(ap->time.tv_sec < now.tv_sec || (ap->time.tv_sec == now.tv_sec &&
> >  						ap->time.tv_usec <= now.tv_usec))){
> > -		ap->executing = 1;
> > -		rte_spinlock_unlock(&alarm_list_lk);
> 
> Removing the unlock here introduces a deadlock.
> 
Please look more closely, I've not removed anything, only moved where the lock
occurs.
> > +		ap->executing |= ALARM_EXECUTING;
> > +		if (likely(!(ap->executing & ALARM_CANCELLED)) {
> > +			rte_spinlock_unlock(&alarm_list_lk);
The unlock is now here, conditional on needing to call the callback.
> > 
> > -		ap->cb_fn(ap->cb_arg);
> > +			ap->cb_fn(ap->cb_arg);
> > 
> > -		rte_spinlock_lock(&alarm_list_lk);
> > +			rte_spinlock_lock(&alarm_list_lk);
> > +		}
> >  		LIST_REMOVE(ap, next);
> >  		rte_free(ap);
> >  	}
> > @@ -209,10 +214,9 @@ rte_eal_alarm_cancel(rte_eal_alarm_callback cb_fn, void *cb_arg)
> >  	rte_spinlock_lock(&alarm_list_lk);
> >  	/* remove any matches at the start of the list */
> >  	while ((ap = LIST_FIRST(&alarm_list)) != NULL &&
> > -			cb_fn == ap->cb_fn && ap->executing == 0 &&
> > +			cb_fn == ap->cb_fn &&
> >  			(cb_arg == (void *)-1 || cb_arg == ap->cb_arg)) {
> > -		LIST_REMOVE(ap, next);
> > -		rte_free(ap);
> > +		ap->executing |= ALARM_CANCELLED;
> >  		count++;
> >  	}
> >  	ap_prev = ap;
> > @@ -220,10 +224,9 @@ rte_eal_alarm_cancel(rte_eal_alarm_callback cb_fn, void *cb_arg)
> >  	/* now go through list, removing entries not at start */
> >  	LIST_FOREACH(ap, &alarm_list, next) {
> >  		/* this won't be true first time through */
> > -		if (cb_fn == ap->cb_fn &&  ap->executing == 0 &&
> > +		if (cb_fn == ap->cb_fn &&
> >  				(cb_arg == (void *)-1 || cb_arg == ap->cb_arg)) {
> > -			LIST_REMOVE(ap,next);
> > -			rte_free(ap);
> > +			ap->executing |= ALARM_CANCELLED;
> >  			count++;
> >  			ap = ap_prev;
> >  		}
> 
> Pawel
> 
^ permalink raw reply	[relevance 3%]
* Re: [dpdk-dev] [PATCH 0/4] Add DSO symbol versioning to support backwards compatibility
  2014-09-24 18:19  3%     ` Neil Horman
@ 2014-09-26 10:41  0%       ` Thomas Monjalon
  2014-09-26 14:45  5%         ` Neil Horman
  0 siblings, 1 reply; 86+ results
From: Thomas Monjalon @ 2014-09-26 10:41 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev
Hi Neil,
2014-09-24 14:19, Neil Horman:
> Ping Thomas. I know you're busy, but I would like this to not fall off anyones
> radar.  You alluded to concerns regarding what, for lack of a better term,
> ABI/API lock-in.  I had asked you to enumerate/elaborate on specifics, but never
> heard back.  Are there further specifics you wish to discuss, or are you
> satisfied with the above answers?
Sorry for not being very reactive on this thread.
All this discussion is very interesting but it's really not the proper
time to apply it. As you said, it requires an extra effort. I'm not saying
it will never be integrated. I'm just saying that we cannot change
everything at the same time.
Let me sum up the situation. This community project has been very active
for a few months now. First, we learnt how to make some releases together
and we are improving the process to be able to deliver a new major release
every 4 months while having some good quality process.
But these releases are still not complete because documentation is not
integrated yet. Then developers should have a role in documentation updates.
We also need to integrate and learn how to use more tools to be more
efficient and improve quality.
So the question is "when should we care about API compatibility"?
And the answer is: ASAP, but not now. I feel next year is a better target.
Because the most important priority is to move together at a pace which
allows most of us to stay in the race.
-- 
Thomas
^ permalink raw reply	[relevance 0%]
* [dpdk-dev] [PATCH 1/4 v2] compat: Add infrastructure to support symbol versioning
  2014-09-15 19:23  4% ` [dpdk-dev] [PATCH 1/4] compat: Add infrastructure to support symbol versioning Neil Horman
  2014-09-23 10:39  0%   ` Sergio Gonzalez Monroy
@ 2014-09-25 18:52  4%   ` Neil Horman
  2014-09-26 14:16  0%     ` Sergio Gonzalez Monroy
  1 sibling, 1 reply; 86+ results
From: Neil Horman @ 2014-09-25 18:52 UTC (permalink / raw)
  To: dev
Add initial pass header files to support symbol versioning.
---
Change notes
v2)
* Fixed ifdef in rte_compat.h to test for RTE_BUILD_SHARED_LIB instead of the
non-existent RTE_SYMBOL_VERSIONING
* Fixed VERSION_SYMBOL macro to add the needed extra @ to make versioning work
properly
* Improved/Clarified documentation
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Thomas Monjalon <thomas.monjalon@6wind.com>
CC: "Richardson, Bruce" <bruce.richardson@intel.com>
CC: "Gonzalez Monroy, Sergio" <sergio.gonzalez.monroy@intel.com>
---
 lib/Makefile                   |  1 +
 lib/librte_compat/Makefile     | 38 ++++++++++++++++++
 lib/librte_compat/rte_compat.h | 87 ++++++++++++++++++++++++++++++++++++++++++
 mk/rte.lib.mk                  |  6 +++
 4 files changed, 132 insertions(+)
 create mode 100644 lib/librte_compat/Makefile
 create mode 100644 lib/librte_compat/rte_compat.h
diff --git a/lib/Makefile b/lib/Makefile
index 10c5bb3..a85b55b 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -32,6 +32,7 @@
 include $(RTE_SDK)/mk/rte.vars.mk
 
 DIRS-$(CONFIG_RTE_LIBC) += libc
+DIRS-y += librte_compat
 DIRS-$(CONFIG_RTE_LIBRTE_EAL) += librte_eal
 DIRS-$(CONFIG_RTE_LIBRTE_MALLOC) += librte_malloc
 DIRS-$(CONFIG_RTE_LIBRTE_RING) += librte_ring
diff --git a/lib/librte_compat/Makefile b/lib/librte_compat/Makefile
new file mode 100644
index 0000000..3415c7b
--- /dev/null
+++ b/lib/librte_compat/Makefile
@@ -0,0 +1,38 @@
+#   BSD LICENSE
+#
+#   Copyright(c) 2010-2014 Neil Horman <nhorman@tuxdriver.com>
+#   All rights reserved.
+#
+#   Redistribution and use in source and binary forms, with or without
+#   modification, are permitted provided that the following conditions
+#   are met:
+#
+#     * Redistributions of source code must retain the above copyright
+#       notice, this list of conditions and the following disclaimer.
+#     * Redistributions in binary form must reproduce the above copyright
+#       notice, this list of conditions and the following disclaimer in
+#       the documentation and/or other materials provided with the
+#       distribution.
+#     * Neither the name of Intel Corporation nor the names of its
+#       contributors may be used to endorse or promote products derived
+#       from this software without specific prior written permission.
+#
+#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+
+# install includes
+SYMLINK-y-include := rte_compat.h
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_compat/rte_compat.h b/lib/librte_compat/rte_compat.h
new file mode 100644
index 0000000..cff9aea
--- /dev/null
+++ b/lib/librte_compat/rte_compat.h
@@ -0,0 +1,87 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Neil Horman <nhorman@tuxdriver.com>.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_COMPAT_H_
+#define _RTE_COMPAT_H_
+
+/*
+ * This is just a stringification macro for use below.
+ */
+#define SA(x) #x
+
+#ifdef RTE_BUILD_SHARED_LIB
+
+/*
+ * Provides backwards compatibility when updating exported functions.
+ * When a symbol is exported from a library to provide an API, it also provides a
+ * calling convention (ABI) that is embodied in its name, return type,
+ * arguments, etc.  On occasion that function may need to change to accommodate
+ * new functionality, behavior, etc.  When that occurs, it is desirable to
+ * allow for backwards compatibility for a time with older binaries that are
+ * dynamically linked to the DPDK.  To support that, the __vsym and
+ * VERSION_SYMBOL macros are created.  They, in conjunction with the
+ * <library>_version.map file for a given library allow for multiple versions of
+ * a symbol to exist in a shared library so that older binaries need not be
+ * immediately recompiled. Their use is outlined in the following example:
+ * Assumptions: DPDK 1.(X) contains a function int foo(char *string)
+ *              DPDK 1.(X+1) needs to change foo to be int foo(int index)
+ *
+ * To accomplish this:
+ * 1) Edit lib/<library>/library_version.map to add a DPDK_1.(X+1) node, in which
+ * foo is exported as a global symbol.  Note that foo must be removed from the
+ * DPDK_1.(X) node, or you will see multiple symbol definitions
+ *
+ * 2) rename the existing function int foo(char *string) to 
+ * 	int __vsym foo_v18(char *string)
+ *
+ * 3) Add this macro immediately below the function
+ * 	VERSION_SYMBOL(foo, _v18, 1.8);
+ *
+ */
+#define VERSION_SYMBOL(b, e, v) __asm__(".symver " SA(b) SA(e) ", "SA(b)"@@DPDK_"SA(v))
+#define __vsym __attribute__((used))
+
+#else
+/*
+ * No symbol versioning in use
+ */
+#define VERSION_SYMBOL(b, e, v)
+#define __vsym
+
+/*
+ * RTE_BUILD_SHARED_LIB
+ */
+#endif
+
+
+#endif /* _RTE_COMPAT_H_ */
diff --git a/mk/rte.lib.mk b/mk/rte.lib.mk
index f458258..82ac309 100644
--- a/mk/rte.lib.mk
+++ b/mk/rte.lib.mk
@@ -40,8 +40,12 @@ VPATH += $(SRCDIR)
 
 ifeq ($(RTE_BUILD_SHARED_LIB),y)
 LIB := $(patsubst %.a,%.so,$(LIB))
+
+CPU_LDFLAGS += --version-script=$(EXPORT_MAP)
+
 endif
 
+
 _BUILD = $(LIB)
 _INSTALL = $(INSTALL-FILES-y) $(SYMLINK-FILES-y) $(RTE_OUTPUT)/lib/$(LIB)
 _CLEAN = doclean
@@ -160,7 +164,9 @@ endif
 $(RTE_OUTPUT)/lib/$(LIB): $(LIB)
 	@echo "  INSTALL-LIB $(LIB)"
 	@[ -d $(RTE_OUTPUT)/lib ] || mkdir -p $(RTE_OUTPUT)/lib
+ifneq ($(LIB),)
 	$(Q)cp -f $(LIB) $(RTE_OUTPUT)/lib
+endif
 
 #
 # Clean all generated files
-- 
1.9.3
^ permalink raw reply	[relevance 4%]
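Two usage notes on the macros and makefile change above, offered as sketches rather than statements from the patch: the __attribute__((used)) behind __vsym keeps the renamed old implementation from being discarded as unreferenced, since only the inline asm directive names it, and the EXPORT_MAP hook means each shared library gets linked roughly as:

ld -shared --version-script=<library>_version.map <objects> -o librte_<library>.so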
* Re: [dpdk-dev] [PATCH 0/4] Add DSO symbol versioning to support backwards compatibility
  2014-09-18 19:14  4%   ` Neil Horman
  2014-09-19  8:57  0%     ` Richardson, Bruce
  2014-09-19 14:18  0%     ` Venkatesan, Venky
@ 2014-09-24 18:19  3%     ` Neil Horman
  2014-09-26 10:41  0%       ` Thomas Monjalon
  2 siblings, 1 reply; 86+ results
From: Neil Horman @ 2014-09-24 18:19 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev
On Thu, Sep 18, 2014 at 03:14:01PM -0400, Neil Horman wrote:
> On Thu, Sep 18, 2014 at 08:23:36PM +0200, Thomas Monjalon wrote:
> > Hi Neil,
> > 
> > 2014-09-15 15:23, Neil Horman:
> > > The DPDK ABI develops and changes quickly, which makes it difficult for
> > > applications to keep up with the latest version of the library, especially when
> > > it (the DPDK) is built as a set of shared objects, as applications may be built
> > > against an older version of the library.
> > > 
> > > To mitigate this, this patch series introduces support for library and symbol
> > > versioning when the DPDK is built as a DSO.  Specifically, it does 4 things:
> > > 
> > > 1) Adds initial support for library versioning.  Each library now has a version
> > > map that explicitly calls out what symbols are exported to using applications,
> > > and assigns version(s) to them
> > > 
> > > 2) Adds support macros so that when libraries create incompatible ABI's,
> > > multiple versions may be supported so that applications linked against older
> > > DPDK releases can continue to function
> > > 
> > > 3) Adds library soname versioning suffixes so that when ABI's must be broken in
> > > a fashion that requires a rebuild of older applications, they will break at load
> > > time, rather than cause unexpected issues at run time.
> > > 
> > > 4) Adds documentation for ABI policy, and provides space to document deprecated
> > > ABI versions, so that applications might be warned of impending changes.
> > > 
> > > With these elements in place the DPDK has some support to allow for the extended
> > > maintenance of older API's while still allowing the freedom to develop new and
> > > improved API's.
> > > 
> > > Implementing this feature will require some additional effort on the part of
> > > developers and reviewers.  When reviewing patches, must be checked against
> > > existing exports to ensure that the function prototypes are not changing.  If
> > > they are, the versioning macros must be used, and the library export map should
> > > be updated to reflect the new version of the function.
> > > 
> > > When data structures change, if those structures are application accessible,
> > > apis that accept or return instances of those data structures should have new
> > > versions created so that users of the old data structure version might co-exist
> > > at the same time.
> > 
> > Thanks for your efforts.
> > But I feel this change has too many constraints for the current status of
> > the DPDK. It's probably too early to adopt such policy.
> > 
> I think you may be misunderstanding something.  What constraints do you believe
> that this patch imposes?  Note it doesn't in any way prevent changes to the ABI
> of the DPDK, but rather gives us infrastructure to support multiple ABI
> revisions at the same time, so that applications built against DPDK shared
> libraries can continue to function properly at least for some time until we
> decide to deprecate that ABI level.
> 
> This is all based on the versioning strategy outlined here:
> http://www.akkadia.org/drepper/dsohowto.pdf
> 
> That may help clarify things for you.
> 
> > By the way, this versioning doesn't cover structure changes.
> No, it doesn't.  No link-time mechanism does so.
> 
> > How could it be managed?
> That's a subject that is open to discussion, but my initial thinking is that we
> need to handle it on a case by case basis:
> 
> * For minor updates, where allocation of a structure is done on the heap and new
> fields need to be added, appending them to the end of a structure and providing
> an initial value is sufficient.
> 
> * For major changes, where fields need to be removed, or re-arranged, mostly
> likely the API surfaces which accept or return those structures as
> inputs/outputs will need to have new versions written to accept the new version
> of the structure, and internally we will have to support both formats for a time
> (according to the policy I documented, that is currently a single major
> release).  I.e. if you want to change struct foo, which is accepted as a
> parameter for the function bar(struct foo *ptr), then for a release we would
> need to create struct foo_v2 with the new format, map a new function bar_v2 to
> the exported bar@@DPDK_1.(X+1), and internally make the bar functions understand
> both the original and v2 versions of the structure.  Then in DPDK release
> 1.X+2, we can remove the old version after posting a deprecation notice with
> version 1.(X+1)
> 
> > Don't you think it would be more reliable if managed by packaging?
> Solving this with packaging defeats the purpose of having shared libraries at
> all.  While packaging each version of the dpdk separately is possible stopgap
> solution, in that it allows applications to link to differing versions of the
> library independently, but that negates any expectation of timely bugfixes for
> any given version of the DPDK.  That is to say, if you package things this way,
> and wind up with several parallel versions of the same package, and for any
> bugfix that comes out upstream, the packager then has the responsibility to
> adapt that fix to each package.  Thats an unscalable solution, and not something
> any packager is going to undertake willingly.  I did a hybrid version of this in
> fedora for exactly that reason.  I packaged the dpdk into its own directory, but
> have every intention of changing that directory every major release, so that
> application writers can clearly see when they need to stop updating the dpdk,
> lest their applications stop linking. I'm not going to have multiple dpdk
> packages to maintain in parallel, thats just too much work.
> 
> > 
> > Thank you for opening this discussion with a constructive proposal. 
> > Let's check it later on once structures will be more stable since 
> > performance is the most critical target.
> If I'm being honest, I have to say that's a cop-out answer.  We all know that
> structure stability isn't a priority for the DPDK, nor will it ever be in all
> likelihood.  It will continue to evolve and grow as the hardware does.  And this
> patch set doesn't prevent that from happening.  All it does is provide some
> level of stability in the API for a period of time to let 3rd party application
> writers write and package applications with some allowance of time to keep up
> with upstream changes on their own schedule.
> 
> I grant you that writing a good API for a shared library is difficult, but
> (and feel free to disagree with this), if we don't start enforcing policies that
> require good API design, it's not going to happen on its own.  This patch set
> will highlight those API points which are difficult to maintain across major
> releases, and force us to address and improve them.  To that end I've already
> begun talking to some of the individual library maintainers off list to address
> some of the API aspects that I have concerns about (exporting variables rather
> than accessor functions, structures that don't need to be visible to users,
> etc), and they've started reviewing them.  We can make this better, but we can't
> just say "later", because there's no roadmap that lists structure stability as a
> line item.  As hardware improves, structures will change to operate more
> efficiently or support more features.  Without a hard plan, the initial goals of
> the DPDK (high performance networking) will relegate ABI to such a low priority
> that it will never be addressed.
> 
> To that end, can we discuss specifics?  Can you enumerate direct points that
> you feel make this patch unworkable at this time?  I know you mentioned some
> above, and I think I addressed them (though please ask follow up questions if
> I've been unclear).  What other concerns do you have?
> 
> Neil
>  
> 
Ping Thomas. I know you're busy, but I would like this to not fall off anyone's
radar.  You alluded to concerns regarding what, for lack of a better term,
ABI/API lock-in.  I had asked you to enumerate/elaborate on specifics, but never
heard back.  Are there further specifics you wish to discuss, or are you
satisfied with the above answers?
Best
Neil
^ permalink raw reply	[relevance 3%]
* Re: [dpdk-dev] [PATCH 1/4] compat: Add infrastructure to support symbol versioning
  2014-09-23 16:29  0%       ` Sergio Gonzalez Monroy
@ 2014-09-23 17:31  0%         ` Neil Horman
  0 siblings, 0 replies; 86+ results
From: Neil Horman @ 2014-09-23 17:31 UTC (permalink / raw)
  To: Sergio Gonzalez Monroy; +Cc: dev
On Tue, Sep 23, 2014 at 05:29:48PM +0100, Sergio Gonzalez Monroy wrote:
> On Tue, Sep 23, 2014 at 10:58:29AM -0400, Neil Horman wrote:
> > On Tue, Sep 23, 2014 at 11:39:29AM +0100, Sergio Gonzalez Monroy wrote:
> > > Hi Neil,
> > > 
> > > On Mon, Sep 15, 2014 at 03:23:48PM -0400, Neil Horman wrote:
> > > > Add initial pass header files to support symbol versioning.
> > > > 
> > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > > CC: Thomas Monjalon <thomas.monjalon@6wind.com>
> > > > CC: "Richardson, Bruce" <bruce.richardson@intel.com>
> > > > ---
> > > >  lib/Makefile                   |  1 +
> > > >  lib/librte_compat/Makefile     | 38 +++++++++++++++++++
> > > >  lib/librte_compat/rte_compat.h | 86 ++++++++++++++++++++++++++++++++++++++++++
> > > >  mk/rte.lib.mk                  |  6 +++
> > > >  4 files changed, 131 insertions(+)
> > > >  create mode 100644 lib/librte_compat/Makefile
> > > >  create mode 100644 lib/librte_compat/rte_compat.h
> > > > 
> > > > diff --git a/lib/Makefile b/lib/Makefile
> > > > index 10c5bb3..a85b55b 100644
> > > > --- a/lib/Makefile
> > > > +++ b/lib/Makefile
> > > > @@ -32,6 +32,7 @@
> > > >  include $(RTE_SDK)/mk/rte.vars.mk
> > > >  
> > > >  DIRS-$(CONFIG_RTE_LIBC) += libc
> > > > +DIRS-y += librte_compat
> > > >  DIRS-$(CONFIG_RTE_LIBRTE_EAL) += librte_eal
> > > >  DIRS-$(CONFIG_RTE_LIBRTE_MALLOC) += librte_malloc
> > > >  DIRS-$(CONFIG_RTE_LIBRTE_RING) += librte_ring
> > > > diff --git a/lib/librte_compat/Makefile b/lib/librte_compat/Makefile
> > > > new file mode 100644
> > > > index 0000000..a61511a
> > > > --- /dev/null
> > > > +++ b/lib/librte_compat/Makefile
> > > > @@ -0,0 +1,38 @@
> > > > +#   BSD LICENSE
> > > > +#
> > > > +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > +#   All rights reserved.
> > > > +#
> > > > +#   Redistribution and use in source and binary forms, with or without
> > > > +#   modification, are permitted provided that the following conditions
> > > > +#   are met:
> > > > +#
> > > > +#     * Redistributions of source code must retain the above copyright
> > > > +#       notice, this list of conditions and the following disclaimer.
> > > > +#     * Redistributions in binary form must reproduce the above copyright
> > > > +#       notice, this list of conditions and the following disclaimer in
> > > > +#       the documentation and/or other materials provided with the
> > > > +#       distribution.
> > > > +#     * Neither the name of Intel Corporation nor the names of its
> > > > +#       contributors may be used to endorse or promote products derived
> > > > +#       from this software without specific prior written permission.
> > > > +#
> > > > +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > +
> > > > +include $(RTE_SDK)/mk/rte.vars.mk
> > > > +
> > > > +
> > > > +# install includes
> > > > +SYMLINK-y-include := rte_compat.h
> > > > +
> > > > +include $(RTE_SDK)/mk/rte.lib.mk
> > > > diff --git a/lib/librte_compat/rte_compat.h b/lib/librte_compat/rte_compat.h
> > > > new file mode 100644
> > > > index 0000000..6d65a53
> > > > --- /dev/null
> > > > +++ b/lib/librte_compat/rte_compat.h
> > > > @@ -0,0 +1,86 @@
> > > > +/*-
> > > > + *   BSD LICENSE
> > > > + *
> > > > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > + *   All rights reserved.
> > > > + *
> > > > + *   Redistribution and use in source and binary forms, with or without
> > > > + *   modification, are permitted provided that the following conditions
> > > > + *   are met:
> > > > + *
> > > > + *     * Redistributions of source code must retain the above copyright
> > > > + *       notice, this list of conditions and the following disclaimer.
> > > > + *     * Redistributions in binary form must reproduce the above copyright
> > > > + *       notice, this list of conditions and the following disclaimer in
> > > > + *       the documentation and/or other materials provided with the
> > > > + *       distribution.
> > > > + *     * Neither the name of Intel Corporation nor the names of its
> > > > + *       contributors may be used to endorse or promote products derived
> > > > + *       from this software without specific prior written permission.
> > > > + *
> > > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > > + */
> > > > +
> > > > +#ifndef _RTE_COMPAT_H_
> > > > +#define _RTE_COMPAT_H_
> > > > +
> > > > +/*
> > > > + * This is just a stringification macro for use below.
> > > > + */
> > > > +#define SA(x) #x
> > > > +
> > > > +#ifdef RTE_SYMBOL_VERSIONING
> > > > +
> > > > +/*
> > > > + * Provides backwards compatibility when updating exported functions.
> > > > + * When a symbol is exported from a library to provide an API, it also provides a
> > > > + * calling convention (ABI) that is embodied in its name, return type,
> > > > + * arguments, etc.  On occasion that function may need to change to accommodate
> > > > + * new functionality, behavior, etc.  When that occurs, it is desirable to
> > > > + * allow for backwards compatibility for a time with older binaries that are
> > > > + * dynamically linked to the DPDK.  To support that, the __vsym and
> > > > + * VERSION_SYMBOL macros are created.  They, in conjunction with the
> > > > + * <library>_version.map file for a given library allow for multiple versions of
> > > > + * a symbol to exist in a shared library so that older binaries need not be
> > > > + * immediately recompiled. Their use is outlined in the following example:
> > > > + * Assumptions: DPDK 1.(X) contains a function int foo(char *string)
> > > > + *              DPDK 1.(X+1) needs to change foo to be int foo(int index)
> > > > + *
> > > > + * To accomplish this:
> > > > + * 1) Edit lib/<library>/library_version.map to add a DPDK_1.8 node, in which
> > > > + * foo is exported as a global symbol
> > > > + *
> > > > + * 2) rename the existing function int foo(char *string) to 
> > > > + * 	int __vsym foo_v18(char *string)
> > > > + *
> > > > + * 3) Add this macro immediately below the function
> > > > + * 	VERSION_SYMBOL(foo, _v18, 1.8);
> > > > + *
> > > > + */
> > > > +#define VERSION_SYMBOL(b, e, v) __asm__(".symver " SA(b) SA(e) ", "SA(b)"@DPDK_"SA(v))
> > > > +#define __vsym __attribute__((used))
> > > > +
> > > 
> > > I may be missing something here but would it not be necessary to define a
> > > default symbol?
> > > Otherwise there would be multiple definitions of a given symbol and the linker
> > > won't know which symbol version to bind to.
> > > 
> > > Following your example, something along these lines:
> > >  4) Edit lib/<library>/library_version.map to add a DPDK_1.9 node that is a
> > >    successor to DPDK_1.8, in which foo is exported as a global symbol 
> > >    DPDK_1.9 {
> > >       global: foo;
> > >    } DPDK_1.8;
> > > 
> > >  5) rename new function int foo(int index) to
> > >    int __vsym foo_v19(int index)
> > > 
> > >  6) Add this macro immediately below the function:
> > >    DEFAULT_SYMBOL(foo, _v19, 1.9);
> > > 
> > > #define DEFAULT_SYMBOL(b, e, v) __asm__(".symver " SA(b) SA(e) ", "SA(b)"@@DPDK_"SA(v))
> > > 
> > 
> > You're spot on (though the macro that I created in rte_compat.h is
> > VERSION_SYMBOL).  
> > 
> > When you use a version script to create a DSO, at link time, the appropriate
> > version is appended to the symbol name (you can see it with objdump -t in a
> > linked binary).  If you want to update the symbol to a new version, you do what
> > I documented in the header file (though now that I re-read it, it could be
> > clearer).  How's this for a change to the documentation:
> > 
> > To make a new version of a function foo in a DSO:
> > 
> > 1) Edit lib/<library>/library_version.map to add a DPDK_1.8 node, in which
> >    foo is exported as a global symbol
> > 
> > 2) rename the existing function int foo(char *string) to 
> >    int __vsym foo_v18(char *string)
> > 
> > 3) Add this macro immediately below the function
> >    VERSION_SYMBOL(foo, _v18, 1.8);
> > 
> > 4) Implement the new version of the function foo.
> > 
> > 
> > Those steps above will create two symbols in your export table of the DSO:
> > 
> > foo@DPDK_1.8
> > foo@@DPDK_1.9
> > 
> > Any application linked against this DSO will link against the latest version
> > (DPDK_1.9).  But if you look at the symbols referenced in a binary linked
> > against an older version of the same DSO, you'll note they explicitly look for
> > foo@DPDK_1.8.  That's how we provide backwards compatibility
> > 
> > Does that answer your questions?
> > 
> > Neil
> > 
> Correct me if I am wrong but when we define multiple versions of a symbol we
> need to specify a default one.
You are corrected :).  The "Default" symbol is implicitly the latest version of
the symbol (where the ordinality of the symbol versions is defined by the map
file).
> As an example, if we were to have three versions of foo the export table of the
> DSO should look something like this:
> 
> foo@VER_1.0
> foo@VER_1.1
> foo@@VER_1.2
> 
> In the above example, foo VER_1.2 is the default one and is indicated by
> having double @.
It's the default one based on the fact that it is ordinally the most recent in
the version map file (and that's what the double @'s denote).  When linking,
that is the only symbol visible to the application being linked.
> Effectively we would need two macros VERSION_SYMBOL and DEFAULT_VERSION_SYMBOL
> (maybe this name is more appropriate).
> 
Nope, we don't, because as you note above, the default is implicit by the fact
that it is ordinally the latest, and the latest version of the symbol is the
only version that the linker "sees" when linking new applications.  The
VERSION_SYMBOL macro exists to tie older binary applications to the older
versions of the symbol at _load_ time.
> #define VERSION_SYMBOL(b, e, v)         \
>     __asm__(".symver " SA(b) SA(e) ", "SA(b)"@DPDK_"SA(v))
> #define DEFAULT_VERSION_SYMBOL(b, e, v) \
>     __asm__(".symver " SA(b) SA(e) ", "SA(b)"@@DPDK_"SA(v))
> 
Nope.  Don't need it.
> Following on the example, we should have something like:
> 
>    int __vsym foo_v18(char *string) {...}
>    VERSION_SYMBOL(foo, _v18, 1.8);
> 
>    int __vsym foo_v19(int index) {...}
>    DEFAULT_VERSION_SYMBOL(foo, _v19, 1.9);
> 
Nope.  Let's start a bit further back.  Assume we have the following map file:
DPDK_1.8 {
	global:
	foo;
};
And we have this in a C file:
void foo(int num) {
	<implementation for version 1.8>	
}
Then we want to update the foo function to something that is binary
incompatible.  We would change the version map file as follows:
DPDK_1.8 {
	global:
	foo;
};
DPDK_1.9 {
	global:
	foo;
} DPDK_1.8;
That construct makes the linker see DPDK_1.9 as ordinally "newer" than DPDK_1.8,
so symbols that are global in that version are exported rather than their older
counterparts in the DPDK_1.8 export set.  When the linker links a new
application, it only links to the latest version.
Then in the C file we do the following:
void __vsym foo_v18(int num) {
	<implementation for version 1.8>
}
VERSION_SYMBOL(foo, _v18, 1.8);
int foo(char *name) {
	<implementation for version 1.9>
}
With this change, the new foo function is implicitly matched to version 1.9 in
the map file, and that's what gets linked to new applications.  The
VERSION_SYMBOL macro exports an additional symbol, foo@DPDK_1.8, so that
previously built applications, which were linked when the original version of
foo was the latest, will still find the appropriate symbol as foo@DPDK_1.8.
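For reference, given the SA() stringification macro in rte_compat.h, that
VERSION_SYMBOL(foo, _v18, 1.8) line expands to:
__asm__(".symver foo_v18, foo@DPDK_1.8");
i.e. the single-@ directive that ties the renamed implementation to the old
version node, while the unadorned foo picks up the double-@ default from the
map file at link time.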
We could in fact do something like what you are suggesting, in that we could use
the VERSION_SYMBOL macro on every exported function so that we explicitly tied
every version of every exported symbol to a statically defined version of the
function with a variant name (i.e. we could have a foo_v18, a foo_v19, a foo_vX,
for every supported API version if you wanted), but that creates a lot more work
for us.  For instance, when doing a non-DSO build, you still have to map each
symbol to a specific version, and you have to keep that updated.  By doing it the
way I did above, the actual function name is always the latest version, and you
only have to rename functions if you need to modify the API (I'm working under
the assumption that needing to do this is going to be somewhat rare).
Hope that helps
Neil
> The DSO export table would have the following symbols:
> 
>    foo@DPDK_1.8
>    foo@@DPDK_1.9
> 
> Old binaries linked against DPDK 1.8 would have references to:
> foo@DPDK_1.8
> 
> and new binaries linked against DPDK 1.9 would have references to:
> foo@DPDK_1.9
> 
> Sergio
> 
> > > > +#else
> > > > +/*
> > > > + * No symbol versioning in use
> > > > + */
> > > > +#define VERSION_SYMBOL(b, e, v)
> > > > +#define __vsym
> > > > +
> > > > +/*
> > > > + * RTE_SYMBOL_VERSIONING
> > > > + */
> > > > +#endif
> > > > +
> > > > +
> > > > +#endif /* _RTE_COMPAT_H_ */
> > > > diff --git a/mk/rte.lib.mk b/mk/rte.lib.mk
> > > > index f458258..82ac309 100644
> > > > --- a/mk/rte.lib.mk
> > > > +++ b/mk/rte.lib.mk
> > > > @@ -40,8 +40,12 @@ VPATH += $(SRCDIR)
> > > >  
> > > >  ifeq ($(RTE_BUILD_SHARED_LIB),y)
> > > >  LIB := $(patsubst %.a,%.so,$(LIB))
> > > > +
> > > > +CPU_LDFLAGS += --version-script=$(EXPORT_MAP)
> > > > +
> > > >  endif
> > > >  
> > > > +
> > > >  _BUILD = $(LIB)
> > > >  _INSTALL = $(INSTALL-FILES-y) $(SYMLINK-FILES-y) $(RTE_OUTPUT)/lib/$(LIB)
> > > >  _CLEAN = doclean
> > > > @@ -160,7 +164,9 @@ endif
> > > >  $(RTE_OUTPUT)/lib/$(LIB): $(LIB)
> > > >  	@echo "  INSTALL-LIB $(LIB)"
> > > >  	@[ -d $(RTE_OUTPUT)/lib ] || mkdir -p $(RTE_OUTPUT)/lib
> > > > +ifneq ($(LIB),)
> > > >  	$(Q)cp -f $(LIB) $(RTE_OUTPUT)/lib
> > > > +endif
> > > >  
> > > >  #
> > > >  # Clean all generated files
> > > > -- 
> > > > 1.9.3
> > > > 
> > > 
> > > Sergio
> > > 
> 
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCH 1/4] compat: Add infrastructure to support symbol versioning
  2014-09-23 14:58  0%     ` Neil Horman
@ 2014-09-23 16:29  0%       ` Sergio Gonzalez Monroy
  2014-09-23 17:31  0%         ` Neil Horman
  0 siblings, 1 reply; 86+ results
From: Sergio Gonzalez Monroy @ 2014-09-23 16:29 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev
On Tue, Sep 23, 2014 at 10:58:29AM -0400, Neil Horman wrote:
> On Tue, Sep 23, 2014 at 11:39:29AM +0100, Sergio Gonzalez Monroy wrote:
> > Hi Neil,
> > 
> > On Mon, Sep 15, 2014 at 03:23:48PM -0400, Neil Horman wrote:
> > > Add initial pass header files to support symbol versioning.
> > > 
> > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > CC: Thomas Monjalon <thomas.monjalon@6wind.com>
> > > CC: "Richardson, Bruce" <bruce.richardson@intel.com>
> > > ---
> > >  lib/Makefile                   |  1 +
> > >  lib/librte_compat/Makefile     | 38 +++++++++++++++++++
> > >  lib/librte_compat/rte_compat.h | 86 ++++++++++++++++++++++++++++++++++++++++++
> > >  mk/rte.lib.mk                  |  6 +++
> > >  4 files changed, 131 insertions(+)
> > >  create mode 100644 lib/librte_compat/Makefile
> > >  create mode 100644 lib/librte_compat/rte_compat.h
> > > 
> > > diff --git a/lib/Makefile b/lib/Makefile
> > > index 10c5bb3..a85b55b 100644
> > > --- a/lib/Makefile
> > > +++ b/lib/Makefile
> > > @@ -32,6 +32,7 @@
> > >  include $(RTE_SDK)/mk/rte.vars.mk
> > >  
> > >  DIRS-$(CONFIG_RTE_LIBC) += libc
> > > +DIRS-y += librte_compat
> > >  DIRS-$(CONFIG_RTE_LIBRTE_EAL) += librte_eal
> > >  DIRS-$(CONFIG_RTE_LIBRTE_MALLOC) += librte_malloc
> > >  DIRS-$(CONFIG_RTE_LIBRTE_RING) += librte_ring
> > > diff --git a/lib/librte_compat/Makefile b/lib/librte_compat/Makefile
> > > new file mode 100644
> > > index 0000000..a61511a
> > > --- /dev/null
> > > +++ b/lib/librte_compat/Makefile
> > > @@ -0,0 +1,38 @@
> > > +#   BSD LICENSE
> > > +#
> > > +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > +#   All rights reserved.
> > > +#
> > > +#   Redistribution and use in source and binary forms, with or without
> > > +#   modification, are permitted provided that the following conditions
> > > +#   are met:
> > > +#
> > > +#     * Redistributions of source code must retain the above copyright
> > > +#       notice, this list of conditions and the following disclaimer.
> > > +#     * Redistributions in binary form must reproduce the above copyright
> > > +#       notice, this list of conditions and the following disclaimer in
> > > +#       the documentation and/or other materials provided with the
> > > +#       distribution.
> > > +#     * Neither the name of Intel Corporation nor the names of its
> > > +#       contributors may be used to endorse or promote products derived
> > > +#       from this software without specific prior written permission.
> > > +#
> > > +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > +
> > > +include $(RTE_SDK)/mk/rte.vars.mk
> > > +
> > > +
> > > +# install includes
> > > +SYMLINK-y-include := rte_compat.h
> > > +
> > > +include $(RTE_SDK)/mk/rte.lib.mk
> > > diff --git a/lib/librte_compat/rte_compat.h b/lib/librte_compat/rte_compat.h
> > > new file mode 100644
> > > index 0000000..6d65a53
> > > --- /dev/null
> > > +++ b/lib/librte_compat/rte_compat.h
> > > @@ -0,0 +1,86 @@
> > > +/*-
> > > + *   BSD LICENSE
> > > + *
> > > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > + *   All rights reserved.
> > > + *
> > > + *   Redistribution and use in source and binary forms, with or without
> > > + *   modification, are permitted provided that the following conditions
> > > + *   are met:
> > > + *
> > > + *     * Redistributions of source code must retain the above copyright
> > > + *       notice, this list of conditions and the following disclaimer.
> > > + *     * Redistributions in binary form must reproduce the above copyright
> > > + *       notice, this list of conditions and the following disclaimer in
> > > + *       the documentation and/or other materials provided with the
> > > + *       distribution.
> > > + *     * Neither the name of Intel Corporation nor the names of its
> > > + *       contributors may be used to endorse or promote products derived
> > > + *       from this software without specific prior written permission.
> > > + *
> > > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > > + */
> > > +
> > > +#ifndef _RTE_COMPAT_H_
> > > +#define _RTE_COMPAT_H_
> > > +
> > > +/*
> > > + * This is just a stringification macro for use below.
> > > + */
> > > +#define SA(x) #x
> > > +
> > > +#ifdef RTE_SYMBOL_VERSIONING
> > > +
> > > +/*
> > > + * Provides backwards compatibility when updating exported functions.
> > > + * When a symbol is exported from a library to provide an API, it also provides a
> > > + * calling convention (ABI) that is embodied in its name, return type,
> > > + * arguments, etc.  On occasion that function may need to change to accommodate
> > > + * new functionality, behavior, etc.  When that occurs, it is desirable to
> > > + * allow for backwards compatibility for a time with older binaries that are
> > > + * dynamically linked to the DPDK.  To support that, the __vsym and
> > > + * VERSION_SYMBOL macros are created.  They, in conjunction with the
> > > + * <library>_version.map file for a given library allow for multiple versions of
> > > + * a symbol to exist in a shared library so that older binaries need not be
> > > + * immediately recompiled. Their use is outlined in the following example:
> > > + * Assumptions: DPDK 1.(X) contains a function int foo(char *string)
> > > + *              DPDK 1.(X+1) needs to change foo to be int foo(int index)
> > > + *
> > > + * To accomplish this:
> > > + * 1) Edit lib/<library>/library_version.map to add a DPDK_1.8 node, in which
> > > + * foo is exported as a global symbol
> > > + *
> > > + * 2) rename the existing function int foo(char *string) to 
> > > + * 	int __vsym foo_v18(char *string)
> > > + *
> > > + * 3) Add this macro immediately below the function
> > > + * 	VERSION_SYMBOL(foo, _v18, 1.8);
> > > + *
> > > + */
> > > +#define VERSION_SYMBOL(b, e, v) __asm__(".symver " SA(b) SA(e) ", "SA(b)"@DPDK_"SA(v))
> > > +#define __vsym __attribute__((used))
> > > +
> > 
> > I may be missing something here but would it not be necessary to define a
> > default symbol?
> > Otherwise there would be multiple definitions of a given symbol and the linker
> > won't know which symbol version to bind to.
> > 
> > Following your example, something along these lines:
> >  4) Edit lib/<library>/library_version.map to add a DPDK_1.9 node that is a
> >    successor to DPDK_1.8, in which foo is exported as a global symbol 
> >    DPDK_1.9 {
> >       global: foo;
> >    } DPDK_1.8;
> > 
> >  5) rename new function int foo(int index) to
> >    int __vsym foo_v19(int index)
> > 
> >  6) Add this macro immediately below the function:
> >    DEFAULT_SYMBOL(foo, _v19, 1.9);
> > 
> > #define DEFAULT_SYMBOL(b, e, v) __asm__(".symver " SA(b) SA(e) ", "SA(b)"@@DPDK_"SA(v))
> > 
> 
> You're spot on (though the macro that I created in rte_compat.h is
> VERSION_SYMBOL).  
> 
> When you use a version script to create a DSO, at link time, the appropriate
> version is appended to the symbol name (you can see it with objdump -t in a
> linked binary).  If you want to update the symbol to a new version, you do what
> I documented in the header file (though now that I re-read it, it could be
> clearer).  How's this for a change to the documentation:
> 
> To make a new version of a function foo in a DSO:
> 
> 1) Edit lib/<library>/library_version.map to add a DPDK_1.8 node, in which
>    foo is exported as a global symbol
> 
> 2) rename the existing function int foo(char *string) to 
>    int __vsym foo_v18(char *string)
> 
> 3) Add this macro immediately below the function
>    VERSION_SYMBOL(foo, _v18, 1.8);
> 
> 4) Implement the new version of the function foo.
> 
> 
> Those steps above will create two symbols in your export table of the DSO:
> 
> foo@DPDK_1.8
> foo@@DPDK_1.9
> 
> Any application linked against this DSO will link against the latest version
> (DPDK_1.9).  But if you look at the symbols referenced in a binary linked
> against an older version of the same DSO, you'll note they explicitly look for
> foo@DPDK_1.8.  That's how we provide backwards compatibility
> 
> Does that answer your questions?
> 
> Neil
> 
Correct me if I am wrong but when we define multiple versions of a symbol we
need to specify a default one.
As an example, if we were to have three versions of foo the export table of the
DSO should look something like this:
foo@VER_1.0
foo@VER_1.1
foo@@VER_1.2
In the above example, foo VER_1.2 is the default one and is indicated by
having double @.
Effectively we would need two macros VERSION_SYMBOL and DEFAULT_VERSION_SYMBOL
(maybe this name is more appropriate).
#define VERSION_SYMBOL(b, e, v)         \
    __asm__(".symver " SA(b) SA(e) ", "SA(b)"@DPDK_"SA(v))
#define DEFAULT_VERSION_SYMBOL(b, e, v) \
    __asm__(".symver " SA(b) SA(e) ", "SA(b)"@@DPDK_"SA(v))
Following on the example, we should have something like:
   int __vsym foo_v18(char *string) {...}
   VERSION_SYMBOL(foo, _v18, 1.8);
   int __vsym foo_v19(int index) {...}
   DEFAULT_VERSION_SYMBOL(foo, _v19, 1.9);
The DSO export table would have the following symbols:
   foo@DPDK_1.8
   foo@@DPDK_1.9
Old binaries linked against DPDK 1.8 would have references to:
foo@DPDK_1.8
and new binaries linked against DPDK 1.9 would have references to:
foo@DPDK_1.9
Sergio
> > > +#else
> > > +/*
> > > + * No symbol versioning in use
> > > + */
> > > +#define VERSION_SYMBOL(b, e, v)
> > > +#define __vsym
> > > +
> > > +/*
> > > + * RTE_SYMBOL_VERSIONING
> > > + */
> > > +#endif
> > > +
> > > +
> > > +#endif /* _RTE_COMPAT_H_ */
> > > diff --git a/mk/rte.lib.mk b/mk/rte.lib.mk
> > > index f458258..82ac309 100644
> > > --- a/mk/rte.lib.mk
> > > +++ b/mk/rte.lib.mk
> > > @@ -40,8 +40,12 @@ VPATH += $(SRCDIR)
> > >  
> > >  ifeq ($(RTE_BUILD_SHARED_LIB),y)
> > >  LIB := $(patsubst %.a,%.so,$(LIB))
> > > +
> > > +CPU_LDFLAGS += --version-script=$(EXPORT_MAP)
> > > +
> > >  endif
> > >  
> > > +
> > >  _BUILD = $(LIB)
> > >  _INSTALL = $(INSTALL-FILES-y) $(SYMLINK-FILES-y) $(RTE_OUTPUT)/lib/$(LIB)
> > >  _CLEAN = doclean
> > > @@ -160,7 +164,9 @@ endif
> > >  $(RTE_OUTPUT)/lib/$(LIB): $(LIB)
> > >  	@echo "  INSTALL-LIB $(LIB)"
> > >  	@[ -d $(RTE_OUTPUT)/lib ] || mkdir -p $(RTE_OUTPUT)/lib
> > > +ifneq ($(LIB),)
> > >  	$(Q)cp -f $(LIB) $(RTE_OUTPUT)/lib
> > > +endif
> > >  
> > >  #
> > >  # Clean all generated files
> > > -- 
> > > 1.9.3
> > > 
> > 
> > Sergio
> > 
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCH 1/4] compat: Add infrastructure to support symbol versioning
  2014-09-23 10:39  0%   ` Sergio Gonzalez Monroy
@ 2014-09-23 14:58  0%     ` Neil Horman
  2014-09-23 16:29  0%       ` Sergio Gonzalez Monroy
  0 siblings, 1 reply; 86+ results
From: Neil Horman @ 2014-09-23 14:58 UTC (permalink / raw)
  To: Sergio Gonzalez Monroy; +Cc: dev
On Tue, Sep 23, 2014 at 11:39:29AM +0100, Sergio Gonzalez Monroy wrote:
> Hi Neil,
> 
> On Mon, Sep 15, 2014 at 03:23:48PM -0400, Neil Horman wrote:
> > Add initial pass header files to support symbol versioning.
> > 
> > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > CC: Thomas Monjalon <thomas.monjalon@6wind.com>
> > CC: "Richardson, Bruce" <bruce.richardson@intel.com>
> > ---
> >  lib/Makefile                   |  1 +
> >  lib/librte_compat/Makefile     | 38 +++++++++++++++++++
> >  lib/librte_compat/rte_compat.h | 86 ++++++++++++++++++++++++++++++++++++++++++
> >  mk/rte.lib.mk                  |  6 +++
> >  4 files changed, 131 insertions(+)
> >  create mode 100644 lib/librte_compat/Makefile
> >  create mode 100644 lib/librte_compat/rte_compat.h
> > 
> > diff --git a/lib/Makefile b/lib/Makefile
> > index 10c5bb3..a85b55b 100644
> > --- a/lib/Makefile
> > +++ b/lib/Makefile
> > @@ -32,6 +32,7 @@
> >  include $(RTE_SDK)/mk/rte.vars.mk
> >  
> >  DIRS-$(CONFIG_RTE_LIBC) += libc
> > +DIRS-y += librte_compat
> >  DIRS-$(CONFIG_RTE_LIBRTE_EAL) += librte_eal
> >  DIRS-$(CONFIG_RTE_LIBRTE_MALLOC) += librte_malloc
> >  DIRS-$(CONFIG_RTE_LIBRTE_RING) += librte_ring
> > diff --git a/lib/librte_compat/Makefile b/lib/librte_compat/Makefile
> > new file mode 100644
> > index 0000000..a61511a
> > --- /dev/null
> > +++ b/lib/librte_compat/Makefile
> > @@ -0,0 +1,38 @@
> > +#   BSD LICENSE
> > +#
> > +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > +#   All rights reserved.
> > +#
> > +#   Redistribution and use in source and binary forms, with or without
> > +#   modification, are permitted provided that the following conditions
> > +#   are met:
> > +#
> > +#     * Redistributions of source code must retain the above copyright
> > +#       notice, this list of conditions and the following disclaimer.
> > +#     * Redistributions in binary form must reproduce the above copyright
> > +#       notice, this list of conditions and the following disclaimer in
> > +#       the documentation and/or other materials provided with the
> > +#       distribution.
> > +#     * Neither the name of Intel Corporation nor the names of its
> > +#       contributors may be used to endorse or promote products derived
> > +#       from this software without specific prior written permission.
> > +#
> > +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > +
> > +include $(RTE_SDK)/mk/rte.vars.mk
> > +
> > +
> > +# install includes
> > +SYMLINK-y-include := rte_compat.h
> > +
> > +include $(RTE_SDK)/mk/rte.lib.mk
> > diff --git a/lib/librte_compat/rte_compat.h b/lib/librte_compat/rte_compat.h
> > new file mode 100644
> > index 0000000..6d65a53
> > --- /dev/null
> > +++ b/lib/librte_compat/rte_compat.h
> > @@ -0,0 +1,86 @@
> > +/*-
> > + *   BSD LICENSE
> > + *
> > + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > + *   All rights reserved.
> > + *
> > + *   Redistribution and use in source and binary forms, with or without
> > + *   modification, are permitted provided that the following conditions
> > + *   are met:
> > + *
> > + *     * Redistributions of source code must retain the above copyright
> > + *       notice, this list of conditions and the following disclaimer.
> > + *     * Redistributions in binary form must reproduce the above copyright
> > + *       notice, this list of conditions and the following disclaimer in
> > + *       the documentation and/or other materials provided with the
> > + *       distribution.
> > + *     * Neither the name of Intel Corporation nor the names of its
> > + *       contributors may be used to endorse or promote products derived
> > + *       from this software without specific prior written permission.
> > + *
> > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > + */
> > +
> > +#ifndef _RTE_COMPAT_H_
> > +#define _RTE_COMPAT_H_
> > +
> > +/*
> > + * This is just a stringification macro for use below.
> > + */
> > +#define SA(x) #x
> > +
> > +#ifdef RTE_SYMBOL_VERSIONING
> > +
> > +/*
> > + * Provides backwards compatibility when updating exported functions.
> > + * When a symbol is exported from a library to provide an API, it also provides a
> > + * calling convention (ABI) that is embodied in its name, return type,
> > + * arguments, etc.  On occasion that function may need to change to accommodate
> > + * new functionality, behavior, etc.  When that occurs, it is desirable to
> > + * allow for backwards compatibility for a time with older binaries that are
> > + * dynamically linked to the DPDK.  To support that, the __vsym and
> > + * VERSION_SYMBOL macros are created.  They, in conjunction with the
> > + * <library>_version.map file for a given library allow for multiple versions of
> > + * a symbol to exist in a shared library so that older binaries need not be
> > + * immediately recompiled. Their use is outlined in the following example:
> > + * Assumptions: DPDK 1.(X) contains a function int foo(char *string)
> > + *              DPDK 1.(X+1) needs to change foo to be int foo(int index)
> > + *
> > + * To accomplish this:
> > + * 1) Edit lib/<library>/library_version.map to add a DPDK_1.8 node, in which
> > + * foo is exported as a global symbol
> > + *
> > + * 2) rename the existing function int foo(char *string) to 
> > + * 	int __vsym foo_v18(char *string)
> > + *
> > + * 3) Add this macro immediately below the function
> > + * 	VERSION_SYMBOL(foo, _v18, 1.8);
> > + *
> > + */
> > +#define VERSION_SYMBOL(b, e, v) __asm__(".symver " SA(b) SA(e) ", "SA(b)"@DPDK_"SA(v))
> > +#define __vsym __attribute__((used))
> > +
> 
> I may be missing something here but would it not be necessary to define a
> default symbol?
> Otherwise there would be multiple definitions of a given symbol and the linker
> won't know which symbol version to bind to.
> 
> Following your example, something along these lines:
>  4) Edit lib/<library>/library_version.map to add a DPDK_1.9 node that is a
>    successor to DPDK_1.8, in which foo is exported as a global symbol 
>    DPDK_1.9 {
>       global: foo;
>    } DPDK_1.8;
> 
>  5) rename new function int foo(int index) to
>    int __vsym foo_v19(int index)
> 
>  6) Add this macro immediately below the function:
>    DEFAULT_SYMBOL(foo, _v19, 1.9);
> 
> #define DEFAULT_SYMBOL(b, e, v) __asm__(".symver " SA(b) SA(e) ", "SA(b)"@@DPDK_"SA(v))
> 
You're spot on (though the macro that I created in rte_compat.h is
VERSION_SYMBOL).  
When you use a version script to create a DSO, at link time, the appropriate
version is appended to the symbol name (you can see it with objdump -t in a
linked binary).  If you want to update the symbol to a new version, you do what
I documented in the header file (though now that I re-read it, it could be
clearer).  How's this for a change to the documentation:
To make a new version of a function foo in a DSO:
1) Edit lib/<library>/library_version.map to add a DPDK_1.8 node, in which
   foo is exported as a global symbol
2) rename the existing function int foo(char *string) to 
   int __vsym foo_v18(char *string)
3) Add this macro immediately below the function
   VERSION_SYMBOL(foo, _v18, 1.8);
4) Implement the new version of the function foo.
Those steps above will create two symbols in your export table of the DSO:
foo@DPDK_1.8
foo@@DPDK_1.9
Any application linked against this DSO will link against the latest version
(DPDK_1.9).  But if you look at the symbols referenced in a binary linked
against an older version of the same DSO, you'll note they explicitly look for
foo@DPDK_1.8.  That's how we provide backwards compatibility
Does that answer your questions?
Neil
> > +#else
> > +/*
> > + * No symbol versioning in use
> > + */
> > +#define VERSION_SYMBOL(b, e, v)
> > +#define __vsym
> > +
> > +/*
> > + * RTE_SYMBOL_VERSIONING
> > + */
> > +#endif
> > +
> > +
> > +#endif /* _RTE_COMPAT_H_ */
> > diff --git a/mk/rte.lib.mk b/mk/rte.lib.mk
> > index f458258..82ac309 100644
> > --- a/mk/rte.lib.mk
> > +++ b/mk/rte.lib.mk
> > @@ -40,8 +40,12 @@ VPATH += $(SRCDIR)
> >  
> >  ifeq ($(RTE_BUILD_SHARED_LIB),y)
> >  LIB := $(patsubst %.a,%.so,$(LIB))
> > +
> > +CPU_LDFLAGS += --version-script=$(EXPORT_MAP)
> > +
> >  endif
> >  
> > +
> >  _BUILD = $(LIB)
> >  _INSTALL = $(INSTALL-FILES-y) $(SYMLINK-FILES-y) $(RTE_OUTPUT)/lib/$(LIB)
> >  _CLEAN = doclean
> > @@ -160,7 +164,9 @@ endif
> >  $(RTE_OUTPUT)/lib/$(LIB): $(LIB)
> >  	@echo "  INSTALL-LIB $(LIB)"
> >  	@[ -d $(RTE_OUTPUT)/lib ] || mkdir -p $(RTE_OUTPUT)/lib
> > +ifneq ($(LIB),)
> >  	$(Q)cp -f $(LIB) $(RTE_OUTPUT)/lib
> > +endif
> >  
> >  #
> >  # Clean all generated files
> > -- 
> > 1.9.3
> > 
> 
> Sergio
> 
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCH 1/4] compat: Add infrastructure to support symbol versioning
  2014-09-15 19:23  4% ` [dpdk-dev] [PATCH 1/4] compat: Add infrastructure to support symbol versioning Neil Horman
@ 2014-09-23 10:39  0%   ` Sergio Gonzalez Monroy
  2014-09-23 14:58  0%     ` Neil Horman
  2014-09-25 18:52  4%   ` [dpdk-dev] [PATCH 1/4 v2] " Neil Horman
  1 sibling, 1 reply; 86+ results
From: Sergio Gonzalez Monroy @ 2014-09-23 10:39 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev
Hi Neil,
On Mon, Sep 15, 2014 at 03:23:48PM -0400, Neil Horman wrote:
> Add initial pass header files to support symbol versioning.
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> CC: Thomas Monjalon <thomas.monjalon@6wind.com>
> CC: "Richardson, Bruce" <bruce.richardson@intel.com>
> ---
>  lib/Makefile                   |  1 +
>  lib/librte_compat/Makefile     | 38 +++++++++++++++++++
>  lib/librte_compat/rte_compat.h | 86 ++++++++++++++++++++++++++++++++++++++++++
>  mk/rte.lib.mk                  |  6 +++
>  4 files changed, 131 insertions(+)
>  create mode 100644 lib/librte_compat/Makefile
>  create mode 100644 lib/librte_compat/rte_compat.h
> 
> diff --git a/lib/Makefile b/lib/Makefile
> index 10c5bb3..a85b55b 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -32,6 +32,7 @@
>  include $(RTE_SDK)/mk/rte.vars.mk
>  
>  DIRS-$(CONFIG_RTE_LIBC) += libc
> +DIRS-y += librte_compat
>  DIRS-$(CONFIG_RTE_LIBRTE_EAL) += librte_eal
>  DIRS-$(CONFIG_RTE_LIBRTE_MALLOC) += librte_malloc
>  DIRS-$(CONFIG_RTE_LIBRTE_RING) += librte_ring
> diff --git a/lib/librte_compat/Makefile b/lib/librte_compat/Makefile
> new file mode 100644
> index 0000000..a61511a
> --- /dev/null
> +++ b/lib/librte_compat/Makefile
> @@ -0,0 +1,38 @@
> +#   BSD LICENSE
> +#
> +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> +#   All rights reserved.
> +#
> +#   Redistribution and use in source and binary forms, with or without
> +#   modification, are permitted provided that the following conditions
> +#   are met:
> +#
> +#     * Redistributions of source code must retain the above copyright
> +#       notice, this list of conditions and the following disclaimer.
> +#     * Redistributions in binary form must reproduce the above copyright
> +#       notice, this list of conditions and the following disclaimer in
> +#       the documentation and/or other materials provided with the
> +#       distribution.
> +#     * Neither the name of Intel Corporation nor the names of its
> +#       contributors may be used to endorse or promote products derived
> +#       from this software without specific prior written permission.
> +#
> +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> +
> +include $(RTE_SDK)/mk/rte.vars.mk
> +
> +
> +# install includes
> +SYMLINK-y-include := rte_compat.h
> +
> +include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/lib/librte_compat/rte_compat.h b/lib/librte_compat/rte_compat.h
> new file mode 100644
> index 0000000..6d65a53
> --- /dev/null
> +++ b/lib/librte_compat/rte_compat.h
> @@ -0,0 +1,86 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef _RTE_COMPAT_H_
> +#define _RTE_COMPAT_H_
> +
> +/*
> + * This is just a stringification macro for use below.
> + */
> +#define SA(x) #x
> +
> +#ifdef RTE_SYMBOL_VERSIONING
> +
> +/*
> + * Provides backwards compatibility when updating exported functions.
> + * When a symbol is exported from a library to provide an API, it also provides a
> + * calling convention (ABI) that is embodied in its name, return type,
> + * arguments, etc.  On occasion that function may need to change to accommodate
> + * new functionality, behavior, etc.  When that occurs, it is desirable to
> + * allow for backwards compatibility for a time with older binaries that are
> + * dynamically linked to the DPDK.  To support that, the __vsym and
> + * VERSION_SYMBOL macros are created.  They, in conjunction with the
> + * <library>_version.map file for a given library allow for multiple versions of
> + * a symbol to exist in a shared library so that older binaries need not be
> + * immediately recompiled. Their use is outlined in the following example:
> + * Assumptions: DPDK 1.(X) contains a function int foo(char *string)
> + *              DPDK 1.(X+1) needs to change foo to be int foo(int index)
> + *
> + * To accomplish this:
> + * 1) Edit lib/<library>/library_version.map to add a DPDK_1.8 node, in which
> + * foo is exported as a global symbol
> + *
> + * 2) rename the existing function int foo(char *string) to 
> + * 	int __vsym foo_v18(char *string)
> + *
> + * 3) Add this macro immediately below the function
> + * 	VERSION_SYMBOL(foo, _v18, 1.8);
> + *
> + */
> +#define VERSION_SYMBOL(b, e, v) __asm__(".symver " SA(b) SA(e) ", "SA(b)"@DPDK_"SA(v))
> +#define __vsym __attribute__((used))
> +
I may be missing something here but would it not be necessary to define a
default symbol?
Otherwise there would be multiple definitions of a given symbol and the linker
won't know which symbol version to bind to.
Following your example, something along these lines:
 4) Edit lib/<library>/library_version.map to add a DPDK_1.9 node that is a
   successor to DPDK_1.8, in which foo is exported as a global symbol 
   DPDK_1.9 {
      global: foo;
   } DPDK_1.8;
 5) rename new function int foo(int index) to
   int __vsym foo_v19(int index)
 6) Add this macro immediately below the function:
   DEFAULT_SYMBOL(foo, _v19, 1.9);
#define DEFAULT_SYMBOL(b, e, v) __asm__(".symver " SA(b) SA(e) ", "SA(b)"@@DPDK_"SA(v))
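(For completeness, with the SA() macro above, DEFAULT_SYMBOL(foo, _v19, 1.9)
would expand to __asm__(".symver foo_v19, foo@@DPDK_1.9"), the double-@ form
explicitly marking the default version.)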
> +#else
> +/*
> + * No symbol versioning in use
> + */
> +#define VERSION_SYMBOL(b, e, v)
> +#define __vsym
> +
> +/*
> + * RTE_SYMBOL_VERSIONING
> + */
> +#endif
> +
> +
> +#endif /* _RTE_COMPAT_H_ */
> diff --git a/mk/rte.lib.mk b/mk/rte.lib.mk
> index f458258..82ac309 100644
> --- a/mk/rte.lib.mk
> +++ b/mk/rte.lib.mk
> @@ -40,8 +40,12 @@ VPATH += $(SRCDIR)
>  
>  ifeq ($(RTE_BUILD_SHARED_LIB),y)
>  LIB := $(patsubst %.a,%.so,$(LIB))
> +
> +CPU_LDFLAGS += --version-script=$(EXPORT_MAP)
> +
>  endif
>  
> +
>  _BUILD = $(LIB)
>  _INSTALL = $(INSTALL-FILES-y) $(SYMLINK-FILES-y) $(RTE_OUTPUT)/lib/$(LIB)
>  _CLEAN = doclean
> @@ -160,7 +164,9 @@ endif
>  $(RTE_OUTPUT)/lib/$(LIB): $(LIB)
>  	@echo "  INSTALL-LIB $(LIB)"
>  	@[ -d $(RTE_OUTPUT)/lib ] || mkdir -p $(RTE_OUTPUT)/lib
> +ifneq ($(LIB),)
>  	$(Q)cp -f $(LIB) $(RTE_OUTPUT)/lib
> +endif
>  
>  #
>  # Clean all generated files
> -- 
> 1.9.3
> 
Sergio
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCH 0/4] Add DSO symbol versioning to support backwards compatibility
  2014-09-19 14:18  0%     ` Venkatesan, Venky
@ 2014-09-19 17:45  4%       ` Neil Horman
  0 siblings, 0 replies; 86+ results
From: Neil Horman @ 2014-09-19 17:45 UTC (permalink / raw)
  To: Venkatesan, Venky; +Cc: dev
On Fri, Sep 19, 2014 at 07:18:36AM -0700, Venkatesan, Venky wrote:
> On 9/18/2014 12:14 PM, Neil Horman wrote:
> >On Thu, Sep 18, 2014 at 08:23:36PM +0200, Thomas Monjalon wrote:
> >>Hi Neil,
> >>
> >>2014-09-15 15:23, Neil Horman:
> >>>The DPDK ABI develops and changes quickly, which makes it difficult for
> >>>applications to keep up with the latest version of the library, especially when
> >>>it (the DPDK) is built as a set of shared objects, as applications may be built
> >>>against an older version of the library.
> >>>
> >>>To mitigate this, this patch series introduces support for library and symbol
> >>>versioning when the DPDK is built as a DSO.  Specifically, it does 4 things:
> >>>
> >>>1) Adds initial support for library versioning.  Each library now has a version
> >>>map that explicitly calls out what symbols are exported to using applications,
> >>>and assigns version(s) to them
> >>>
> >>>2) Adds support macros so that when libraries create incompatible ABIs,
> >>>multiple versions may be supported so that applications linked against older
> >>>DPDK releases can continue to function
> >>>
> >>>3) Adds library soname versioning suffixes so that when ABIs must be broken in
> >>>a fashion that requires a rebuild of older applications, they will break at load
> >>>time, rather than cause unexpected issues at run time.
> >>>
> >>>4) Adds documentation for ABI policy, and provides space to document deprecated
> >>>ABI versions, so that applications might be warned of impending changes.
> >>>
> >>>With these elements in place the DPDK has some support to allow for the extended
> >>>maintenance of older APIs while still allowing the freedom to develop new and
> >>>improved APIs.
> >>>
> >>>Implementing this feature will require some additional effort on the part of
> >>>developers and reviewers.  When reviewing, patches must be checked against
> >>>existing exports to ensure that the function prototypes are not changing.  If
> >>>they are, the versioning macros must be used, and the library export map should
> >>>be updated to reflect the new version of the function.
> >>>
> >>>When data structures change, if those structures are application accessible,
> >>>APIs that accept or return instances of those data structures should have new
> >>>versions created so that users of the old data structure version might co-exist
> >>>at the same time.
> >>Thanks for your efforts.
> >>But I feel this change has too many constraints for the current status of
> >>the DPDK. It's probably too early to adopt such a policy.
> >>
> >I think you may be misunderstanding something.  What constraints do you believe
> >that this patch imposes?  Note it doesn't in any way prevent changes to the ABI
> >of the DPDK, but rather gives us infrastructure to support multiple ABI
> >revisions at the same time, so that applications built against DPDK shared
> >libraries can continue to function properly at least for some time until we
> >decide to deprecate that ABI level.
> >
> >This is all based on the versioning strategy outlined here:
> >http://www.akkadia.org/drepper/dsohowto.pdf
> >
> >That may help clarify things for you.
> >
> >>By the way, this versioning doesn't cover structure changes.
> >No, it doesn't.  No link-time mechanism does so.
> >
> >>How could it be managed?
> >That's a subject that is open to discussion, but my initial thinking is that we
> >need to handle it on a case-by-case basis:
> >
> >* For minor updates, where allocation of a structure is done on the heap and new
> >fields need to be added, appending them to the end of a structure and providing
> >an initial value is sufficient.
> >
> >* For major changes, where fields need to be removed or re-arranged, most
> >likely the API surfaces which accept or return those structures as
> >inputs/outputs will need to have new versions written to accept the new version
> >of the structure, and internally we will have to support both formats for a time
> >(according to the policy I documented, that is currently a single major
> >release).  I.e. if you want to change struct foo, which is accepted as a
> >parameter for the function bar(struct foo *ptr), then for a release we would
> >need to create struct foo_v2 with the new format, map a new version of bar to
> >the exported bar@@DPDK_1.(X+1), and internally make the bar functions understand
> >both the original and v2 versions of the structure.  Then in DPDK release
> >1.(X+2), we can remove the old version after posting a deprecation notice with
> >version 1.(X+1)
> >
> >>Don't you think it would be more reliable if managed by packaging?
> >Solving this with packaging defeats the purpose of having shared libraries at
> >all.  While packaging each version of the DPDK separately is a possible stopgap
> >solution, in that it allows applications to link to differing versions of the
> >library independently, it negates any expectation of timely bugfixes for
> >any given version of the DPDK.  That is to say, if you package things this way
> >and wind up with several parallel versions of the same package, then for any
> >bugfix that comes out upstream, the packager has the responsibility to
> >adapt that fix to each package.  That's an unscalable solution, and not something
> >any packager is going to undertake willingly.  I did a hybrid version of this in
> >Fedora for exactly that reason.  I packaged the DPDK into its own directory, but
> >have every intention of changing that directory every major release, so that
> >application writers can clearly see when they need to stop updating the DPDK,
> >lest their applications stop linking.  I'm not going to have multiple DPDK
> >packages to maintain in parallel; that's just too much work.
>  I do think that this is something that needs to be addressed in the DPDK
> (and not with packaging). Besides what Neil points out, DPDK can work with a
> lot of linux distros, and other operating systems too. Replicating the work
> with each (even if it is just two or three that we focus on) is wasteful.
While it's nice and generous of the upstream community to provide packaging
samples, it's also not something that upstream development should really
need to worry about.  Packaging is really meant to address the needs of the
distribution doing the packaging.  Relying on it to solve versioning problems
in place of a more appropriate solution just leads to fragmentation across
distributions, as invariably different distros will manage that versioning
differently, which leads to applications needing to manage versioning
differently, which is what I'm trying to avoid :)
> >>Thank you for opening this discussion with a constructive proposal.
> >>Let's check it later on once structures are more stable, since
> >>performance is the most critical target.
> Performance will always be a critical target for us. However, as we find
> more problems that need to be solved, we will add new libraries and new
> APIs. That can't be a reason to
> >If I'm being honest, I have to say that's a cop-out answer.  We all know that
> >structure stability isn't a priority for the DPDK, nor will it ever be in all
> >likelihood.  It will continue to evolve and grow as the hardware does.  And this
> >patch set doesn't prevent that from happening.  All it does is provide some
> >level of stability in the API for a period of time to let 3rd party application
> >writers write and package applications with some allowance of time to keep up
> >with upstream changes on their own schedule.
> >
> >I grant you that writing a good API for a shared library is difficult, but
> >(and feel free to disagree with this), if we don't start enforcing policies that
> >require good API design, it's not going to happen on its own.  This patch set
> >will highlight those API points which are difficult to maintain across major
> >releases, and force us to address and improve them.  To that end I've already
> >begun talking to some of the individual library maintainers off list to address
> >some of the API aspects that I have concerns about (exporting variables rather
> >than accessor functions, structures that don't need to be visible to users,
> >etc), and they've started reviewing them.  We can make this better, but we can't
> >just say "later", because there's no roadmap that lists structure stability as a
> >line item.  As hardware improves, structures will change to operate more
> >efficiently or support more features.  Without a hard plan, the initial goals of
> >the DPDK (high performance networking) will relegate ABI to such a low priority
> >that it will never be addressed.
> Neil, you're spot on here. To an extent, there will always be changes to the
> API for various reasons. We've done a reasonable job of managing changes so
> far, but there are going to be changes. I do think that this patch provides
> a way for applications to manage through those changes at the pace they can
> absorb.
> 
Thank you.  Let me be clear for anyone who might not have heard me say this
before.  In no way am I trying to limit the evolution of the DPDK ABI or API.
All I'm trying to do here is provide some infrastructure that allows existing
ABI to carry on for a minimal guaranteed period of time (currently set to at
least one release beyond its deprecation notification).  I won't lie, that does
mean that API design and maintenance will be a potential extra work effort, but
I don't think that's a bad thing, as doing so will lead to better API design
(that will hopefully just last longer naturally), and it will help expand the
reach of the DPDK as application writers can better separate their development
cycles from that of the DPDK itself.
> Secondly, one other usage scenario that we will run into is when apps using
> different versions of DPDK are installed on the same system - this patch at
> least gives us a starting point to flag this problem.
Yeah, I'm not sure how I'll deal with that from a packaging standpoint yet, but
soname versioning at least gives me a tool to make it easier to do using
whatever method I choose.
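
(As a purely hypothetical illustration of what that tool enables, two ABI
levels could then coexist on one system, each application binding to the
soname it was linked against; the paths and version numbers are invented:

    /usr/lib64/librte_eal.so.1    <- apps built against the older ABI
    /usr/lib64/librte_eal.so.2    <- apps built against the new major ABI
)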
> >
> >To that end, can we discuss specifics?  Can you enumerate direct points that
> >you feel make this patch unworkable at this time?  I know you mentioned some
> >above, and I think I addressed them (though please ask follow up questions if
> >I've been unclear).  What other concerns do you have?
> >
> >Neil
>  This is a good start - I've put it into my development systems and will let
> you know if I find anything that is a showstopper.
> 
Thanks!
Neil
> Regards,
> -Venky
> 
^ permalink raw reply	[relevance 4%]
* Re: [dpdk-dev] [PATCH 0/4] Add DSO symbol versioning to support backwards compatibility
  2014-09-18 19:14  4%   ` Neil Horman
  2014-09-19  8:57  0%     ` Richardson, Bruce
@ 2014-09-19 14:18  0%     ` Venkatesan, Venky
  2014-09-19 17:45  4%       ` Neil Horman
  2014-09-24 18:19  3%     ` Neil Horman
  2 siblings, 1 reply; 86+ results
From: Venkatesan, Venky @ 2014-09-19 14:18 UTC (permalink / raw)
  To: dev
On 9/18/2014 12:14 PM, Neil Horman wrote:
> On Thu, Sep 18, 2014 at 08:23:36PM +0200, Thomas Monjalon wrote:
>> Hi Neil,
>>
>> 2014-09-15 15:23, Neil Horman:
>>> The DPDK ABI develops and changes quickly, which makes it difficult for
>>> applications to keep up with the latest version of the library, especially when
>>> it (the DPDK) is built as a set of shared objects, as applications may be built
>>> against an older version of the library.
>>>
>>> To mitigate this, this patch series introduces support for library and symbol
>>> versioning when the DPDK is built as a DSO.  Specifically, it does 4 things:
>>>
>>> 1) Adds initial support for library versioning.  Each library now has a version
>>> map that explicitly calls out what symbols are exported to using applications,
>>> and assigns version(s) to them
>>>
>>> 2) Adds support macros so that when libraries create incompatible ABI's,
>>> multiple versions may be supported so that applications linked against older
>>> DPDK releases can continue to function
>>>
>>> 3) Adds library soname versioning suffixes so that when ABI's must be broken in
>>> a fashion that requires a rebuild of older applications, they will break at load
>>> time, rather than cause unexpected issues at run time.
>>>
>>> 4) Adds documentation for ABI policy, and provides space to document deprecated
>>> ABI versions, so that applications might be warned of impending changes.
>>>
>>> With these elements in place the DPDK has some support to allow for the extended
>>> maintenance of older API's while still allowing the freedom to develop new and
>>> improved API's.
>>>
>>> Implementing this feature will require some additional effort on the part of
>>> developers and reviewers.  When reviewing patches, they must be checked against
>>> existing exports to ensure that the function prototypes are not changing.  If
>>> they are, the versioning macros must be used, and the library export map should
>>> be updated to reflect the new version of the function.
>>>
>>> When data structures change, if those structures are application accessible,
>>> APIs that accept or return instances of those data structures should have new
>>> versions created so that users of the old data structure version might co-exist
>>> at the same time.
>> Thanks for your efforts.
>> But I feel this change has too many constraints for the current status of
>> the DPDK. It's probably too early to adopt such a policy.
>>
> I think you may be misunderstanding something.  What constraints do you believe
> that this patch imposes?  Note it doesn't in any way prevent changes to the ABI
> of the DPDK, but rather gives us infrastructure to support multiple ABI
> revisions at the same time, so that applications built against DPDK shared
> libraries can continue to function properly at least for some time until we
> decide to deprecate that ABI level.
>
> This is all based on the versioning strategy outlined here:
> http://www.akkadia.org/drepper/dsohowto.pdf
>
> That may help clarify things for you.
>
>> By the way, this versioning doesn't cover structure changes.
> No, it doesn't.  No link-time mechanism does so.
>
>> How could it be managed?
> That's a subject that is open to discussion, but my initial thinking is that we
> need to handle it on a case by case basis:
>
> * For minor updates, where allocation of a structure is done on the heap and new
> fields need to be added, appending them to the end of a structure and providing
> an initial value is sufficient.
>
> * For major changes, where fields need to be removed or re-arranged, most
> likely the API surfaces which accept or return those structures as
> inputs/outputs will need to have new versions written to accept the new version
> of the structure, and internally we will have to support both formats for a time
> (according to the policy I documented, that is currently a single major
> release).  I.e. if you want to change struct foo, which is accepted as a
> parameter for the function bar(struct foo *ptr), then for a release we would
> need to create struct foo_v2 with the new format, map a new function bar_v2 to
> the exported bar@@DPDK_1.(X+1), and internally make the bar() implementations
> understand both the original and v2 versions of the structure.  Then in DPDK
> release 1.X+2, we can remove the old version after posting a deprecation notice
> with version 1.(X+1).
>
>> Don't you think it would be more reliable if managed by packaging?
> Solving this with packaging defeats the purpose of having shared libraries at
> all.  While packaging each version of the DPDK separately is a possible stopgap
> solution, in that it allows applications to link to differing versions of the
> library independently, it negates any expectation of timely bugfixes for
> any given version of the DPDK.  That is to say, if you package things this way,
> and wind up with several parallel versions of the same package, then for any
> bugfix that comes out upstream, the packager has the responsibility to
> adapt that fix to each package.  That's an unscalable solution, and not something
> any packager is going to undertake willingly.  I did a hybrid version of this in
> Fedora for exactly that reason.  I packaged the dpdk into its own directory, but
> have every intention of changing that directory every major release, so that
> application writers can clearly see when they need to stop updating the dpdk,
> lest their applications stop linking.  I'm not going to have multiple dpdk
> packages to maintain in parallel; that's just too much work.
  I do think that this is something that needs to be addressed in the 
DPDK (and not with packaging). Besides what Neil points out, DPDK can 
work with a lot of Linux distros, and other operating systems too. 
Replicating the work with each (even if it is just two or three that we 
focus on) is wasteful.
>> Thank you for opening this discussion with a constructive proposal.
>> Let's revisit it later on, once the structures are more stable, since
>> performance is the most critical target.
Performance will always be a critical target for us. However, as we find 
more problems that need to be solved, we will add new libraries and new 
APIs. That can't be a reason to
> If I'm being honest, I have to say that's a cop-out answer.  We all know that
> structure stability isn't a priority for the DPDK, nor will it ever be in all
> likelihood.  It will continue to evolve and grow as the hardware does.  And this
> patch set doesn't prevent that from happening.  All it does is provide some
> level of stability in the API for a period of time to let 3rd party application
> writers write and package applications with some allowance of time to keep up
> with upstream changes on their own schedule.
>
> I grant you that writing a good API for a shared library is difficult, but
> (and feel free to disagree with this), if we don't start enforcing policies that
> require good API design, it's not going to happen on its own.  This patch set
> will highlight those API points which are difficult to maintain across major
> releases, and force us to address and improve them.  To that end I've already
> begun talking to some of the individual library maintainers off list to address
> some of the API aspects that I have concerns about (exporting variables rather
> than accessor functions, structures that don't need to be visible to users,
> etc), and they've started reviewing them.  We can make this better, but we can't
> just say "later", because there's no roadmap that lists structure stability as a
> line item.  As hardware improves, structures will change to operate more
> efficiently or support more features.  Without a hard plan, the initial goals of
> the DPDK (high performance networking) will relegate ABI to such a low priority
> that it will never be addressed.
Neil, you're spot on here. To an extent, there will always be changes to 
the API for various reasons. We've done a reasonable job of managing 
changes so far, but there are going to be changes. I do think that this 
patch provides a way for applications to manage through those changes at 
the pace they can absorb.
Secondly, one other usage scenario that we will run into is when apps 
using different versions of DPDK are installed on the same system - this 
patch at least gives us a starting point to flag this problem.
>
> To that end, can we discuss specifics?  Can you enumerate direct points that
> you feel make this patch unworkable at this time?  I know you mentioned some
> above, and I think I addressed them (though please ask follow up questions if
> I've been unclear).  What other concerns do you have?
>
> Neil
>   
  This is a good start - I've put it into my development systems and 
will let you know if I find anything that is a showstopper.
Regards,
-Venky
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCH 2/4] Provide initial versioning for all DPDK libraries
  2014-09-19  9:45  4%   ` Bruce Richardson
@ 2014-09-19 10:22  0%     ` Neil Horman
  0 siblings, 0 replies; 86+ results
From: Neil Horman @ 2014-09-19 10:22 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev
On Fri, Sep 19, 2014 at 10:45:38AM +0100, Bruce Richardson wrote:
> On Mon, Sep 15, 2014 at 03:23:49PM -0400, Neil Horman wrote:
> > Add linker version script files to each DPDK library to put a stake in the
> > ground from which we can start cleaning up API's
> > 
> > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > CC: Thomas Monjalon <thomas.monjalon@6wind.com>
> > CC: "Richardson, Bruce" <bruce.richardson@intel.com>
> > ---
> >  <... snip for brevity ...>
> >
> > diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
> > index 65e566d..1f96645 100644
> > --- a/lib/librte_acl/Makefile
> > +++ b/lib/librte_acl/Makefile
> > @@ -37,6 +37,8 @@ LIB = librte_acl.a
> >  CFLAGS += -O3
> >  CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR)
> >  
> > +EXPORT_MAP := $(RTE_SDK)/lib/librte_acl/rte_acl_version.map
> > +
> >  # all source are stored in SRCS-y
> >  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += tb_mem.c
> >  
> > diff --git a/lib/librte_acl/rte_acl_version.map b/lib/librte_acl/rte_acl_version.map
> > new file mode 100644
> > index 0000000..4480690
> > --- /dev/null
> > +++ b/lib/librte_acl/rte_acl_version.map
> > @@ -0,0 +1,19 @@
> > +DPDK_1.8 {
> > +	global:
> > +	rte_acl_create;
> > +	rte_acl_find_existing;
> > +	rte_acl_free;
> > +	rte_acl_add_rules;
> > +	rte_acl_reset_rules;
> > +	rte_acl_build;
> > +	rte_acl_reset;
> > +	rte_acl_classify;
> > +	rte_acl_dump;
> > +	rte_acl_list_dump;
> > +	rte_acl_ipv4vlan_add_rules;
> > +	rte_acl_ipv4vlan_build;
> > +	rte_acl_classify_scalar;
> > +
> > +	local: *;
> > +};
> > +
> 
> Looking at this versioning, it strikes me that this looks like the perfect 
> opportunity to go to a 2.0 version number.
> 
> My reasoning:
> * We have already got fairly significant ABI and indeed API changes in this 
>   release due to the mbuf rework. That alone makes it a logical point to 
>   bump the Intel DPDK major version number to 2.0
> * Having the API versioning start at a 2.0 looks neater than having it at 
>   1.8, since .0 is a nice round version number to start with. Also if we 
>   decide in the near future for whatever reasons to go to a 2.0 release, the 
>   ABIs are probably still going to be 1.8. [Again, if we ever want to go to 
>   2.0, now looks the perfect time]
> * For the naming of the .so files, starting with them at a .2 now seems 
>   reasonable to me, denoting a clean break with the older releases which did 
>   have a different ABI. [Though again it makes more sense if you consider 
>   that we may want to move to a 2.0 in future].
> 
> What do people think?
> 
I'm fine with it.  Just so that we're clear, this patch treats versions as
arbitrary strings (the version script's structure denotes version ordinality),
so 1.8 vs 2.0 makes absolutely no difference as far as the mechanism goes; the
exported version value is a matter of policy, and I'm fine with making that
adjustment.
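
To illustrate, in a version script it is the predecessor clause after the
closing brace, not the version string, that establishes ordering (symbols
borrowed from the rte_acl map above; the DPDK_2.0 node and its added symbol
are hypothetical):

DPDK_1.8 {
	global:
	rte_acl_create;

	local: *;
};

/* the trailing predecessor clause, not the number, orders the nodes */
DPDK_2.0 {
	global:
	rte_acl_new_fn;
} DPDK_1.8;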
Neil
> /Bruce
> 
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCH 2/4] Provide initial versioning for all DPDK libraries
  @ 2014-09-19  9:45  4%   ` Bruce Richardson
  2014-09-19 10:22  0%     ` Neil Horman
  0 siblings, 1 reply; 86+ results
From: Bruce Richardson @ 2014-09-19  9:45 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev
On Mon, Sep 15, 2014 at 03:23:49PM -0400, Neil Horman wrote:
> Add linker version script files to each DPDK library to put a stake in the
> ground from which we can start cleaning up API's
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> CC: Thomas Monjalon <thomas.monjalon@6wind.com>
> CC: "Richardson, Bruce" <bruce.richardson@intel.com>
> ---
>  <... snip for brevity ...>
>
> diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
> index 65e566d..1f96645 100644
> --- a/lib/librte_acl/Makefile
> +++ b/lib/librte_acl/Makefile
> @@ -37,6 +37,8 @@ LIB = librte_acl.a
>  CFLAGS += -O3
>  CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR)
>  
> +EXPORT_MAP := $(RTE_SDK)/lib/librte_acl/rte_acl_version.map
> +
>  # all source are stored in SRCS-y
>  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += tb_mem.c
>  
> diff --git a/lib/librte_acl/rte_acl_version.map b/lib/librte_acl/rte_acl_version.map
> new file mode 100644
> index 0000000..4480690
> --- /dev/null
> +++ b/lib/librte_acl/rte_acl_version.map
> @@ -0,0 +1,19 @@
> +DPDK_1.8 {
> +	global:
> +	rte_acl_create;
> +	rte_acl_find_existing;
> +	rte_acl_free;
> +	rte_acl_add_rules;
> +	rte_acl_reset_rules;
> +	rte_acl_build;
> +	rte_acl_reset;
> +	rte_acl_classify;
> +	rte_acl_dump;
> +	rte_acl_list_dump;
> +	rte_acl_ipv4vlan_add_rules;
> +	rte_acl_ipv4vlan_build;
> +	rte_acl_classify_scalar;
> +
> +	local: *;
> +};
> +
Looking at this versioning, it strikes me that this looks like the perfect 
opportunity to go to a 2.0 version number.
My reasoning:
* We have already got fairly significant ABI and indeed API changes in this 
  release due to the mbuf rework. That alone makes it a logical point to 
  bump the Intel DPDK major version number to 2.0
* Having the API versioning start at a 2.0 looks neater than having it at 
  1.8, since .0 is a nice round version number to start with. Also if we 
  decide in the near future for whatever reasons to go to a 2.0 release, the 
  ABIs are probably still going to be 1.8. [Again, if we ever want to go to 
  2.0, now looks the perfect time]
* For the naming of the .so files, starting with them at a .2 now seems 
  reasonable to me, denoting a clean break with the older releases which did 
  have a different ABI. [Though again it makes more sense if you consider 
  that we may want to move to a 2.0 in future].
What do people think?
/Bruce
^ permalink raw reply	[relevance 4%]
* Re: [dpdk-dev] [PATCH 0/4] Add DSO symbol versioning to support backwards compatibility
  2014-09-18 19:14  4%   ` Neil Horman
@ 2014-09-19  8:57  0%     ` Richardson, Bruce
  2014-09-19 14:18  0%     ` Venkatesan, Venky
  2014-09-24 18:19  3%     ` Neil Horman
  2 siblings, 0 replies; 86+ results
From: Richardson, Bruce @ 2014-09-19  8:57 UTC (permalink / raw)
  To: Neil Horman, Thomas Monjalon; +Cc: dev
> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Thursday, September 18, 2014 8:14 PM
> To: Thomas Monjalon
> Cc: dev@dpdk.org; Richardson, Bruce
> Subject: Re: [PATCH 0/4] Add DSO symbol versioning to support backwards
> compatibility
> 
> On Thu, Sep 18, 2014 at 08:23:36PM +0200, Thomas Monjalon wrote:
> > Hi Neil,
> >
> > 2014-09-15 15:23, Neil Horman:
> > > The DPDK ABI develops and changes quickly, which makes it difficult for
> > > applications to keep up with the latest version of the library, especially when
> > > it (the DPDK) is built as a set of shared objects, as applications may be built
> > > against an older version of the library.
> > >
> > > To mitigate this, this patch series introduces support for library and symbol
> > > versioning when the DPDK is built as a DSO.  Specifically, it does 4 things:
> > >
> > > 1) Adds initial support for library versioning.  Each library now has a version
> > > map that explicitly calls out what symbols are exported to using applications,
> > > and assigns version(s) to them
> > >
> > > 2) Adds support macros so that when libraries create incompatible ABI's,
> > > multiple versions may be supported so that applications linked against older
> > > DPDK releases can continue to function
> > >
> > > 3) Adds library soname versioning suffixes so that when ABI's must be broken in
> > > a fashion that requires a rebuild of older applications, they will break at load
> > > time, rather than cause unexpected issues at run time.
> > >
> > > 4) Adds documentation for ABI policy, and provides space to document deprecated
> > > ABI versions, so that applications might be warned of impending changes.
> > >
> > > With these elements in place the DPDK has some support to allow for the extended
> > > maintenance of older API's while still allowing the freedom to develop new and
> > > improved API's.
> > >
> > > Implementing this feature will require some additional effort on the part of
> > > developers and reviewers.  When reviewing patches, they must be checked against
> > > existing exports to ensure that the function prototypes are not changing.  If
> > > they are, the versioning macros must be used, and the library export map should
> > > be updated to reflect the new version of the function.
> > >
> > > When data structures change, if those structures are application accessible,
> > > APIs that accept or return instances of those data structures should have new
> > > versions created so that users of the old data structure version might co-exist
> > > at the same time.
> >
> > Thanks for your efforts.
> > But I feel this change has too many constraints for the current status of
> > the DPDK. It's probably too early to adopt such a policy.
> >
> I think you may be misunderstanding something.  What constraints do you believe
> that this patch imposes?  Note it doesn't in any way prevent changes to the ABI
> of the DPDK, but rather gives us infrastructure to support multiple ABI
> revisions at the same time, so that applications built against DPDK shared
> libraries can continue to function properly at least for some time until we
> decide to deprecate that ABI level.
> 
I view all this as a positive step. I consider backward compatibility as something that should always be encouraged, and I agree with Neil that this should allow us to guarantee compatibility for our customers while still having a path open to us to change things if we really need to.
> This is all based on the versioning strategy outlined here:
> http://www.akkadia.org/drepper/dsohowto.pdf
> 
> That may help clarify things for you.
> 
> > By the way, this versioning doesn't cover structure changes.
> No, it doesn't.  No link-time mechanism does so.
> 
> > How could it be managed?
> That's a subject that is open to discussion, but my initial thinking is that we
> need to handle it on a case by case basis:
> 
> * For minor updates, where allocation of a structure is done on the heap and new
> fields need to be added, appending them to the end of a structure and providing
> an initial value is sufficient.
> 
> * For major changes, where fields need to be removed or re-arranged, most
> likely the API surfaces which accept or return those structures as
> inputs/outputs will need to have new versions written to accept the new version
> of the structure, and internally we will have to support both formats for a time
> (according to the policy I documented, that is currently a single major
> release).  I.e. if you want to change struct foo, which is accepted as a
> parameter for the function bar(struct foo *ptr), then for a release we would
> need to create struct foo_v2 with the new format, map a new function bar_v2 to
> the exported bar@@DPDK_1.(X+1), and internally make the bar() implementations
> understand both the original and v2 versions of the structure.  Then in DPDK
> release 1.X+2, we can remove the old version after posting a deprecation notice
> with version 1.(X+1).
I really, really like having an official deprecation policy. The one proposed seems reasonable as a start point - we can always decide later whether we want a 1, 2 or 3 release gap between marking something as deprecated and having it removed.
/Bruce
 
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCH 0/4] Add DSO symbol versioning to support backwards compatibility
  2014-09-18 18:23  0% ` [dpdk-dev] [PATCH 0/4] Add DSO symbol versioning to support backwards compatibility Thomas Monjalon
@ 2014-09-18 19:14  4%   ` Neil Horman
  2014-09-19  8:57  0%     ` Richardson, Bruce
                       ` (2 more replies)
  0 siblings, 3 replies; 86+ results
From: Neil Horman @ 2014-09-18 19:14 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev
On Thu, Sep 18, 2014 at 08:23:36PM +0200, Thomas Monjalon wrote:
> Hi Neil,
> 
> 2014-09-15 15:23, Neil Horman:
> > The DPDK ABI develops and changes quickly, which makes it difficult for
> > applications to keep up with the latest version of the library, especially when
> > it (the DPDK) is built as a set of shared objects, as applications may be built
> > against an older version of the library.
> > 
> > To mitigate this, this patch series introduces support for library and symbol
> > versioning when the DPDK is built as a DSO.  Specifically, it does 4 things:
> > 
> > 1) Adds initial support for library versioning.  Each library now has a version
> > map that explicitly calls out what symbols are exported to using applications,
> > and assigns version(s) to them
> > 
> > 2) Adds support macros so that when libraries create incompatible ABI's,
> > multiple versions may be supported so that applications linked against older
> > DPDK releases can continue to function
> > 
> > 3) Adds library soname versioning suffixes so that when ABI's must be broken in
> > a fashion that requires a rebuild of older applications, they will break at load
> > time, rather than cause unexpected issues at run time.
> > 
> > 4) Adds documentation for ABI policy, and provides space to document deprecated
> > ABI versions, so that applications might be warned of impending changes.
> > 
> > With these elements in place the DPDK has some support to allow for the extended
> > maintenance of older API's while still allowing the freedom to develop new and
> > improved API's.
> > 
> > Implementing this feature will require some additional effort on the part of
> > developers and reviewers.  When reviewing patches, they must be checked against
> > existing exports to ensure that the function prototypes are not changing.  If
> > they are, the versioning macros must be used, and the library export map should
> > be updated to reflect the new version of the function.
> > 
> > When data structures change, if those structures are application accessible,
> > APIs that accept or return instances of those data structures should have new
> > versions created so that users of the old data structure version might co-exist
> > at the same time.
> 
> Thanks for your efforts.
> But I feel this change has too many constraints for the current status of
> the DPDK. It's probably too early to adopt such a policy.
> 
I think you may be misunderstanding something.  What constraints do you believe
that this patch imposes?  Note it doesn't in any way prevent changes to the ABI
of the DPDK, but rather gives us infrastructure to support multiple ABI
revisions at the same time, so that applications built against DPDK shared
libraries can continue to function properly at least for some time until we
decide to deprecate that ABI level.
This is all based on the versioning strategy outlined here:
http://www.akkadia.org/drepper/dsohowto.pdf
That may help clarify things for you.
> By the way, this versioning doesn't cover structure changes.
No, it doesn't.  No link-time mechanism does so.
> How could it be managed?
That's a subject that is open to discussion, but my initial thinking is that we
need to handle it on a case by case basis:
* For minor updates, where allocation of a structure is done on the heap and new
fields need to be added, appending them to the end of a structure and providing
an initial value is sufficient.
* For major changes, where fields need to be removed or re-arranged, most
likely the API surfaces which accept or return those structures as
inputs/outputs will need to have new versions written to accept the new version
of the structure, and internally we will have to support both formats for a time
(according to the policy I documented, that is currently a single major
release).  I.e. if you want to change struct foo, which is accepted as a
parameter for the function bar(struct foo *ptr), then for a release we would
need to create struct foo_v2 with the new format, map a new function bar_v2 to
the exported bar@@DPDK_1.(X+1), and internally make the bar() implementations
understand both the original and v2 versions of the structure.  Then in DPDK
release 1.X+2, we can remove the old version after posting a deprecation notice
with version 1.(X+1).
> Don't you think it would be more reliable if managed by packaging?
Solving this with packaging defeats the purpose of having shared libraries at
all.  While packaging each version of the DPDK separately is a possible stopgap
solution, in that it allows applications to link to differing versions of the
library independently, it negates any expectation of timely bugfixes for
any given version of the DPDK.  That is to say, if you package things this way,
and wind up with several parallel versions of the same package, then for any
bugfix that comes out upstream, the packager has the responsibility to
adapt that fix to each package.  That's an unscalable solution, and not something
any packager is going to undertake willingly.  I did a hybrid version of this in
Fedora for exactly that reason.  I packaged the dpdk into its own directory, but
have every intention of changing that directory every major release, so that
application writers can clearly see when they need to stop updating the dpdk,
lest their applications stop linking.  I'm not going to have multiple dpdk
packages to maintain in parallel; that's just too much work.
> 
> Thank you for opening this discussion with a constructive proposal. 
> Let's revisit it later on, once the structures are more stable, since 
> performance is the most critical target.
If I'm being honest, I have to say that's a cop-out answer.  We all know that
structure stability isn't a priority for the DPDK, nor will it ever be in all
likelihood.  It will continue to evolve and grow as the hardware does.  And this
patch set doesn't prevent that from happening.  All it does is provide some
level of stability in the API for a period of time to let 3rd party application
writers write and package applications with some allowance of time to keep up
with upstream changes on their own schedule.
I grant you that writing a good API for a shared library is difficult, but
(and feel free to disagree with this), if we don't start enforcing policies that
require good API design, it's not going to happen on its own.  This patch set
will highlight those API points which are difficult to maintain across major
releases, and force us to address and improve them.  To that end I've already
begun talking to some of the individual library maintainers off list to address
some of the API aspects that I have concerns about (exporting variables rather
than accessor functions, structures that don't need to be visible to users,
etc), and they've started reviewing them.  We can make this better, but we can't
just say "later", because there's no roadmap that lists structure stability as a
line item.  As hardware improves, structures will change to operate more
efficiently or support more features.  Without a hard plan, the initial goals of
the DPDK (high performance networking) will relegate ABI to such a low priority
that it will never be addressed. 
To that end, can we discuss specifics?  Can you enumerate direct points that
you feel make this patch unworkable at this time?  I know you mentioned some
above, and I think I addressed them (though please ask follow up questions if
I've been unclear).  What other concerns do you have?
Neil
 
^ permalink raw reply	[relevance 4%]
* Re: [dpdk-dev] [PATCH 0/4] Add DSO symbol versioning to support backwards compatibility
  2014-09-15 19:23  4% [dpdk-dev] [PATCH 0/4] Add DSO symbol versioning to support backwards compatibility Neil Horman
                   ` (3 preceding siblings ...)
  2014-09-15 19:23 23% ` [dpdk-dev] [PATCH 4/4] docs: Add ABI documentation Neil Horman
@ 2014-09-18 18:23  0% ` Thomas Monjalon
  2014-09-18 19:14  4%   ` Neil Horman
  4 siblings, 1 reply; 86+ results
From: Thomas Monjalon @ 2014-09-18 18:23 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev
Hi Neil,
2014-09-15 15:23, Neil Horman:
> The DPDK ABI develops and changes quickly, which makes it difficult for
> applications to keep up with the latest version of the library, especially when
> it (the DPDK) is built as a set of shared objects, as applications may be built
> against an older version of the library.
> 
> To mitigate this, this patch series introduces support for library and symbol
> versioning when the DPDK is built as a DSO.  Specifically, it does 4 things:
> 
> 1) Adds initial support for library versioning.  Each library now has a version
> map that explicitly calls out what symbols are exported to using applications,
> and assigns version(s) to them
> 
> 2) Adds support macros so that when libraries create incompatible ABI's,
> multiple versions may be supported so that applications linked against older
> DPDK releases can continue to function
> 
> 3) Adds library soname versioning suffixes so that when ABI's must be broken in
> a fashion that requires a rebuild of older applications, they will break at load
> time, rather than cause unexpected issues at run time.
> 
> 4) Adds documentation for ABI policy, and provides space to document deprecated
> ABI versions, so that applications might be warned of impending changes.
> 
> With these elements in place the DPDK has some support to allow for the extended
> maintenance of older API's while still allowing the freedom to develop new and
> improved API's.
> 
> Implementing this feature will require some additional effort on the part of
> developers and reviewers.  When reviewing patches, they must be checked against
> existing exports to ensure that the function prototypes are not changing.  If
> they are, the versioning macros must be used, and the library export map should
> be updated to reflect the new version of the function.
> 
> When data structures change, if those structures are application accessible,
> APIs that accept or return instances of those data structures should have new
> versions created so that users of the old data structure version might co-exist
> at the same time.
Thanks for your efforts.
But I feel this change has too many constraints for the current status of
the DPDK. It's probably too early to adopt such a policy.
By the way, this versioning doesn't cover structure changes.
How could it be managed?
Don't you think it would be more reliable if managed by packaging?
Thank you for opening this discussion with a constructive proposal. 
Let's revisit it later on, once the structures are more stable, since 
performance is the most critical target.
-- 
Thomas
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCH 0/3] New Thread Safe Hash Library
  2014-09-18 15:31  0%   ` De Lara Guarch, Pablo
  2014-09-18 15:45  0%     ` Thomas Monjalon
@ 2014-09-18 16:09  3%     ` Neil Horman
  1 sibling, 0 replies; 86+ results
From: Neil Horman @ 2014-09-18 16:09 UTC (permalink / raw)
  To: De Lara Guarch, Pablo; +Cc: dev
On Thu, Sep 18, 2014 at 03:31:34PM +0000, De Lara Guarch, Pablo wrote:
> Hi Neil,
> 
> > -----Original Message-----
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Thursday, September 18, 2014 1:21 PM
> > To: De Lara Guarch, Pablo
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH 0/3] New Thread Safe Hash Library
> > 
> > On Thu, Sep 18, 2014 at 11:34:28AM +0100, Pablo de Lara wrote:
> > > This is an alternative hash implementation to the existing hash library.
> > > This patch set provides a thread safe hash implementation; it allows users
> > > to have multiple readers/writers working on the same hash table.
> > > Main differences between the previous and the new implementation are:
> > >
> > > - Multiple readers/writers can work on the same hash table,
> > >   whereas in the previous implementation writers could not work
> > >   on the table at the same time readers do.
> > > - Previous implementation returned an index to a table after a lookup.
> > >   This implementation returns 8-byte integers or pointers to external data.
> > > - Maximum entries to be looked up in bursts is 64, instead of 16.
> > > - Maximum key length has been increased to 128, instead of a maximum of 64.
> > >
> > > Basic implementation:
> > >
> > > - A sparse table containing buckets (64-byte long) with hashes,
> > >   most of which are empty, and indexes to the second table.
> > > - A compact table containing keys for final matching,
> > >   plus data associated to them.
> > >
> > Thread safe hash tables seem to me like a configuration option rather than a new
> > library.  Instead of creating a whole new library (with a new API and ABI to
> > maintain), why not just add thread safety as a configurable option to the
> > existing hash library?  That saves code space in the DPDK, and reduces
> > application complexity (as the same API is usable for thread safe and unsafe
> > hash tables)
> 
> Makes sense, but the implementation has changed too much to add it directly into the existing library.
> At first, this was designed to be a replacement for the existing library,
> but since the API is a bit different from the old one, we decided to leave it as an alternative,
> so users are not forced to change their applications if they don't want to use thread safe hash tables.
What are you talking about?  The API calls between rte_hash and the new
rte_tshash are identical.  The only things that differ slightly are the names
(rte_hash vs rte_tshash), and some of the elements of the internal data
structure, which really shouldn't be accessed by the application anyway (though
that does play into some of the ABI work we've started looking at).  It should
be pretty easy to modify the rte_hash library to optionally include thread
safety.  A flag in the config structure, a spinlock in the internal
representation, and you're home free.
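
A rough sketch of that suggestion (the flag name, struct, and helper below are
hypothetical illustrations, not the actual rte_hash API; only the rte_spinlock
calls are existing DPDK primitives):

#include <stdint.h>
#include <rte_spinlock.h>

#define HASH_F_THREAD_SAFE 0x01   /* hypothetical creation flag */

struct hash_sketch {
	/* ... existing table state ... */
	unsigned flags;           /* copied from the creation parameters */
	rte_spinlock_t lock;      /* rte_spinlock_init() at create time */
};

static int32_t
do_add_key(struct hash_sketch *h, const void *key)
{
	/* stand-in for the existing single-writer insert path */
	(void)h; (void)key;
	return 0;
}

static inline int32_t
hash_add_key(struct hash_sketch *h, const void *key)
{
	int32_t ret;

	if (h->flags & HASH_F_THREAD_SAFE)
		rte_spinlock_lock(&h->lock);
	ret = do_add_key(h, key);
	if (h->flags & HASH_F_THREAD_SAFE)
		rte_spinlock_unlock(&h->lock);
	return ret;
}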
Neil
^ permalink raw reply	[relevance 3%]
* Re: [dpdk-dev] [PATCH 0/3] New Thread Safe Hash Library
  2014-09-18 15:31  0%   ` De Lara Guarch, Pablo
@ 2014-09-18 15:45  0%     ` Thomas Monjalon
  2014-09-18 16:09  3%     ` Neil Horman
  1 sibling, 0 replies; 86+ results
From: Thomas Monjalon @ 2014-09-18 15:45 UTC (permalink / raw)
  To: De Lara Guarch, Pablo; +Cc: dev
2014-09-18 15:31, De Lara Guarch, Pablo:
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Thread safe hash tables seem to me like a configuration option rather than a new
> > library.  Instead of creating a whole new library (with a new API and ABI
> > to maintain), why not just add thread safety as a configurable option to
> > the existing hash library?  That saves code space in the DPDK, and
> > reduces application complexity (as the same API is usable for thread
> > safe and unsafe hash tables)
> 
> Makes sense, but the implementation has changed too much to add it directly into
> the existing library. At first, this was designed to be a replacement for
> the existing library, but since the API is a bit different from the old one,
> we decided to leave it as an alternative, so users are not forced
> to change their applications if they don't want to use thread safe hash
> tables.
It makes me smile :)
You basically explain that it's more complicated to properly merge two
different implementations than just throwing a new one in the big DPDK bucket.
My opinion is that it should not be integrated as-is, because we must try to
make DPDK something other than a trash bucket.
Thanks for continuing your effort to make DPDK easier and better.
-- 
Thomas
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCH 0/3] New Thread Safe Hash Library
  2014-09-18 12:21  3% ` Neil Horman
@ 2014-09-18 15:31  0%   ` De Lara Guarch, Pablo
  2014-09-18 15:45  0%     ` Thomas Monjalon
  2014-09-18 16:09  3%     ` Neil Horman
  0 siblings, 2 replies; 86+ results
From: De Lara Guarch, Pablo @ 2014-09-18 15:31 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev
Hi Neil,
> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Thursday, September 18, 2014 1:21 PM
> To: De Lara Guarch, Pablo
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/3] New Thread Safe Hash Library
> 
> On Thu, Sep 18, 2014 at 11:34:28AM +0100, Pablo de Lara wrote:
> > This is an alternative hash implementation to the existing hash library.
> > This patch set provides a thread safe hash implementation; it allows users
> > to have multiple readers/writers working on the same hash table.
> > Main differences between the previous and the new implementation are:
> >
> > - Multiple readers/writers can work on the same hash table,
> >   whereas in the previous implementation writers could not work
> >   on the table at the same time readers do.
> > - Previous implementation returned an index to a table after a lookup.
> >   This implementation returns 8-byte integers or pointers to external data.
> > - Maximum entries to be looked up in bursts is 64, instead of 16.
> > - Maximum key length has been increased to 128, instead of a maximum of 64.
> >
> > Basic implementation:
> >
> > - A sparse table containing buckets (64-byte long) with hashes,
> >   most of which are empty, and indexes to the second table.
> > - A compact table containing keys for final matching,
> >   plus data associated to them.
> >
> Thread safe hash tables seem to me like a configuration option rather than a new
> library.  Instead of creating a whole new library (with a new API and ABI to
> maintain), why not just add thread safety as a configurable option to the
> existing hash library?  That saves code space in the DPDK, and reduces
> application complexity (as the same API is usable for thread safe and unsafe
> hash tables)
Makes sense, but the implementation has changed too much to add it directly into the existing library.
At first, this was designed to be a replacement for the existing library,
but since the API is a bit different from the old one, we decided to leave it as an alternative,
so users are not forced to change their applications if they don't want to use thread safe hash tables.
> 
> Neil
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCH 0/3] New Thread Safe Hash Library
  @ 2014-09-18 12:21  3% ` Neil Horman
  2014-09-18 15:31  0%   ` De Lara Guarch, Pablo
  0 siblings, 1 reply; 86+ results
From: Neil Horman @ 2014-09-18 12:21 UTC (permalink / raw)
  To: Pablo de Lara; +Cc: dev
On Thu, Sep 18, 2014 at 11:34:28AM +0100, Pablo de Lara wrote:
> This is an alternative hash implementation to the existing hash library. 
> This patch set provides a thread safe hash implementation; it allows users 
> to have multiple readers/writers working on the same hash table.
> Main differences between the previous and the new implementation are:
> 
> - Multiple readers/writers can work on the same hash table, 
>   whereas in the previous implementation writers could not work 
>   on the table at the same time readers do.
> - Previous implementation returned an index to a table after a lookup. 
>   This implementation returns 8-byte integers or pointers to external data.
> - Maximum entries to be looked up in bursts is 64, instead of 16.
> - Maximum key length has been increased to 128, instead of a maximum of 64.
> 
> Basic implementation:
> 
> - A sparse table containing buckets (64-byte long) with hashes,
>   most of which are empty, and indexes to the second table.
> - A compact table containing keys for final matching, 
>   plus data associated to them.
> 
Thread safe hash tables seem to me like a configuration option rather than a new
library.  Instead of creating a whole new library (with a new API and ABI to
maintain), why not just add thread safety as a configurable option to the
existing hash library?  That saves code space in the DPDK, and reduces
application complexity (as the same API is usable for thread safe and unsafe
hash tables)
Neil
^ permalink raw reply	[relevance 3%]
* [dpdk-dev] [PATCH 4/4] docs: Add ABI documentation
  2014-09-15 19:23  4% [dpdk-dev] [PATCH 0/4] Add DSO symbol versioning to support backwards compatibility Neil Horman
                   ` (2 preceding siblings ...)
  2014-09-15 19:23  7% ` [dpdk-dev] [PATCH 3/4] Add library version extension Neil Horman
@ 2014-09-15 19:23 23% ` Neil Horman
  2014-09-18 18:23  0% ` [dpdk-dev] [PATCH 0/4] Add DSO symbol versioning to support backwards compatibility Thomas Monjalon
  4 siblings, 0 replies; 86+ results
From: Neil Horman @ 2014-09-15 19:23 UTC (permalink / raw)
  To: dev
Adding a document describing a rudimentary ABI policy, and adding notice space for
any deprecation announcements.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Thomas Monjalon <thomas.monjalon@6wind.com>
CC: "Richardson, Bruce" <bruce.richardson@intel.com>
---
 doc/abi.txt | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)
 create mode 100644 doc/abi.txt
diff --git a/doc/abi.txt b/doc/abi.txt
new file mode 100644
index 0000000..b6dcc7d
--- /dev/null
+++ b/doc/abi.txt
@@ -0,0 +1,17 @@
+ABI policy:
+	ABI versions are set at the time of major release labeling, and ABI may
+change multiple times between the last labeling and the HEAD label of the git
+tree without warning
+
+	ABI versions, once released, are available until such time as their
+deprecation has been noted here for at least one major release cycle, after it
+has been tagged.  E.g. the ABI for DPDK 1.8 is shipped, and then the decision to
+remove it is made during the development of DPDK 1.9.  The decision will be
+recorded here, shipped with the DPDK 1.9 release, and actually removed when DPDK
+1.10 ships.
+
+	ABI versions may be deprecated in whole, or in part as needed by a given
+update.
+
+Deprecation Notices:
+
-- 
1.9.3
^ permalink raw reply	[relevance 23%]
* [dpdk-dev] [PATCH 3/4] Add library version extension
  2014-09-15 19:23  4% [dpdk-dev] [PATCH 0/4] Add DSO symbol versioning to support backwards compatibility Neil Horman
  2014-09-15 19:23  4% ` [dpdk-dev] [PATCH 1/4] compat: Add infrastructure to support symbol versioning Neil Horman
  @ 2014-09-15 19:23  7% ` Neil Horman
  2014-09-15 19:23 23% ` [dpdk-dev] [PATCH 4/4] docs: Add ABI documentation Neil Horman
  2014-09-18 18:23  0% ` [dpdk-dev] [PATCH 0/4] Add DSO symbol versioning to support backwards compatibility Thomas Monjalon
  4 siblings, 0 replies; 86+ results
From: Neil Horman @ 2014-09-15 19:23 UTC (permalink / raw)
  To: dev
To differentiate libraries that break ABI, we add a library version number
suffix to the library, which must be incremented when a given library's ABI is
broken.  This patch enforces that addition, sets the initial ABI soname
extension to 1 for each library, and creates a symlink to the base SONAME so that
the test applications will link properly.
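
Roughly, the effect is the following (an illustrative sketch, not the actual
mk/rte.lib.mk hunk, with librte_acl standing in for any library and the
variable names invented):

LIB_SONAME := librte_acl.so.$(LIBABIVER)       # e.g. librte_acl.so.1

$(LIB_SONAME): $(OBJS-y)
	$(CC) $(LDFLAGS) -shared -Wl,-soname,$(LIB_SONAME) -o $@ $(OBJS-y)
	ln -sf $(LIB_SONAME) librte_acl.so     # base-SONAME symlink for linking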
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Thomas Monjalon <thomas.monjalon@6wind.com>
CC: "Richardson, Bruce" <bruce.richardson@intel.com>
---
 lib/librte_acl/Makefile              |  2 ++
 lib/librte_cfgfile/Makefile          |  2 ++
 lib/librte_cmdline/Makefile          |  2 ++
 lib/librte_compat/Makefile           |  2 ++
 lib/librte_distributor/Makefile      |  2 ++
 lib/librte_eal/bsdapp/eal/Makefile   |  2 ++
 lib/librte_eal/linuxapp/eal/Makefile |  2 ++
 lib/librte_ether/Makefile            |  2 ++
 lib/librte_hash/Makefile             |  2 ++
 lib/librte_ip_frag/Makefile          |  2 ++
 lib/librte_ivshmem/Makefile          |  2 ++
 lib/librte_kni/Makefile              |  2 ++
 lib/librte_kvargs/Makefile           |  2 ++
 lib/librte_lpm/Makefile              |  2 ++
 lib/librte_malloc/Makefile           |  2 ++
 lib/librte_mbuf/Makefile             |  2 ++
 lib/librte_mempool/Makefile          |  2 ++
 lib/librte_meter/Makefile            |  2 ++
 lib/librte_pipeline/Makefile         |  2 ++
 lib/librte_pmd_bond/Makefile         |  2 ++
 lib/librte_pmd_e1000/Makefile        |  2 ++
 lib/librte_pmd_i40e/Makefile         |  2 ++
 lib/librte_pmd_ixgbe/Makefile        |  2 ++
 lib/librte_pmd_pcap/Makefile         |  2 ++
 lib/librte_pmd_ring/Makefile         |  2 ++
 lib/librte_pmd_virtio/Makefile       |  2 ++
 lib/librte_pmd_vmxnet3/Makefile      |  2 ++
 lib/librte_pmd_xenvirt/Makefile      |  2 ++
 lib/librte_port/Makefile             |  2 ++
 lib/librte_power/Makefile            |  2 ++
 lib/librte_ring/Makefile             |  2 ++
 lib/librte_sched/Makefile            |  2 ++
 lib/librte_table/Makefile            |  2 ++
 lib/librte_timer/Makefile            |  2 ++
 mk/rte.lib.mk                        | 12 +++++++++---
 35 files changed, 77 insertions(+), 3 deletions(-)
diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
index 1f96645..4db403b 100644
--- a/lib/librte_acl/Makefile
+++ b/lib/librte_acl/Makefile
@@ -39,6 +39,8 @@ CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR)
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_acl/rte_acl_version.map
 
+LIBABIVER := 1
+
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_ACL) += tb_mem.c
 
diff --git a/lib/librte_cfgfile/Makefile b/lib/librte_cfgfile/Makefile
index e655098..1c81579 100644
--- a/lib/librte_cfgfile/Makefile
+++ b/lib/librte_cfgfile/Makefile
@@ -41,6 +41,8 @@ CFLAGS += $(WERROR_FLAGS)
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_cfgfile/rte_cfgfile_version.map
 
+LIBABIVER := 1
+
 #
 # all source are stored in SRCS-y
 #
diff --git a/lib/librte_cmdline/Makefile b/lib/librte_cmdline/Makefile
index 1a47173..b0ab5b6 100644
--- a/lib/librte_cmdline/Makefile
+++ b/lib/librte_cmdline/Makefile
@@ -38,6 +38,8 @@ CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_cmdline/rte_cmdline_version.map
 
+LIBABIVER := 1
+
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_CMDLINE) := cmdline.c
 SRCS-$(CONFIG_RTE_LIBRTE_CMDLINE) += cmdline_cirbuf.c
diff --git a/lib/librte_compat/Makefile b/lib/librte_compat/Makefile
index a61511a..5f369e5 100644
--- a/lib/librte_compat/Makefile
+++ b/lib/librte_compat/Makefile
@@ -32,6 +32,8 @@
 include $(RTE_SDK)/mk/rte.vars.mk
 
 
+LIBABIVER := 1
+
 # install includes
 SYMLINK-y-include := rte_compat.h
 
diff --git a/lib/librte_distributor/Makefile b/lib/librte_distributor/Makefile
index 97d8bbb..12d9df1 100644
--- a/lib/librte_distributor/Makefile
+++ b/lib/librte_distributor/Makefile
@@ -39,6 +39,8 @@ CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR)
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_distributor/rte_distributor_version.map
 
+LIBABIVER := 1
+
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_DISTRIBUTOR) := rte_distributor.c
 
diff --git a/lib/librte_eal/bsdapp/eal/Makefile b/lib/librte_eal/bsdapp/eal/Makefile
index 2caaf00..2edd880 100644
--- a/lib/librte_eal/bsdapp/eal/Makefile
+++ b/lib/librte_eal/bsdapp/eal/Makefile
@@ -47,6 +47,8 @@ CFLAGS += $(WERROR_FLAGS) -O3
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_eal/bsdapp/eal/rte_eal_version.map
 
+LIBABIVER := 1
+
 # specific to linuxapp exec-env
 SRCS-$(CONFIG_RTE_LIBRTE_EAL_BSDAPP) := eal.c
 SRCS-$(CONFIG_RTE_LIBRTE_EAL_BSDAPP) += eal_memory.c
diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
index 254d59c..267f2c7 100644
--- a/lib/librte_eal/linuxapp/eal/Makefile
+++ b/lib/librte_eal/linuxapp/eal/Makefile
@@ -35,6 +35,8 @@ LIB = librte_eal.a
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_eal/linuxapp/eal/rte_eal_version.map
 
+LIBABIVER := 1
+
 VPATH += $(RTE_SDK)/lib/librte_eal/common
 
 CFLAGS += -I$(SRCDIR)/include
diff --git a/lib/librte_ether/Makefile b/lib/librte_ether/Makefile
index f40b5cc..62bcf0c 100644
--- a/lib/librte_ether/Makefile
+++ b/lib/librte_ether/Makefile
@@ -41,6 +41,8 @@ CFLAGS += $(WERROR_FLAGS)
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_ether/rte_ether_version.map
 
+LIBABIVER := 1
+
 SRCS-y += rte_ethdev.c
 
 #
diff --git a/lib/librte_hash/Makefile b/lib/librte_hash/Makefile
index a449ec2..17778ba 100644
--- a/lib/librte_hash/Makefile
+++ b/lib/librte_hash/Makefile
@@ -39,6 +39,8 @@ CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR)
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_hash/rte_hash_version.map
 
+LIBABIVER := 1
+
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_HASH) := rte_hash.c
 SRCS-$(CONFIG_RTE_LIBRTE_HASH) += rte_fbk_hash.c
diff --git a/lib/librte_ip_frag/Makefile b/lib/librte_ip_frag/Makefile
index ede5a89..6b496dc 100644
--- a/lib/librte_ip_frag/Makefile
+++ b/lib/librte_ip_frag/Makefile
@@ -39,6 +39,8 @@ CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR)
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_ip_frag/rte_ipfrag_version.map
 
+LIBABIVER := 1
+
 #source files
 SRCS-$(CONFIG_RTE_LIBRTE_IP_FRAG) += rte_ipv4_fragmentation.c
 SRCS-$(CONFIG_RTE_LIBRTE_IP_FRAG) += rte_ipv4_reassembly.c
diff --git a/lib/librte_ivshmem/Makefile b/lib/librte_ivshmem/Makefile
index be6f21a..7c8dc17 100644
--- a/lib/librte_ivshmem/Makefile
+++ b/lib/librte_ivshmem/Makefile
@@ -38,6 +38,8 @@ CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3
 
EXPORT_MAP := $(RTE_SDK)/lib/librte_ivshmem/rte_ivshmem_version.map
 
+LIBABIVER := 1
+
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_IVSHMEM) := rte_ivshmem.c
 
diff --git a/lib/librte_kni/Makefile b/lib/librte_kni/Makefile
index c119fc1..59abd85 100644
--- a/lib/librte_kni/Makefile
+++ b/lib/librte_kni/Makefile
@@ -38,6 +38,8 @@ CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -fno-strict-aliasing
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_kni/rte_kni_version.map
 
+LIBABIVER := 1
+
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_KNI) := rte_kni.c
 
diff --git a/lib/librte_kvargs/Makefile b/lib/librte_kvargs/Makefile
index 83a42b1..10713db 100644
--- a/lib/librte_kvargs/Makefile
+++ b/lib/librte_kvargs/Makefile
@@ -40,6 +40,8 @@ CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_kvargs/rte_kvargs_version.map
 
+LIBABIVER := 1
+
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_KVARGS) := rte_kvargs.c
 
diff --git a/lib/librte_lpm/Makefile b/lib/librte_lpm/Makefile
index 05de8d9..c99bfbd 100644
--- a/lib/librte_lpm/Makefile
+++ b/lib/librte_lpm/Makefile
@@ -39,6 +39,8 @@ CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR)
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_lpm/rte_lpm_version.map
 
+LIBABIVER := 1
+
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_LPM) := rte_lpm.c rte_lpm6.c
 
diff --git a/lib/librte_malloc/Makefile b/lib/librte_malloc/Makefile
index 1a5c288..3bb7a99 100644
--- a/lib/librte_malloc/Makefile
+++ b/lib/librte_malloc/Makefile
@@ -34,6 +34,8 @@ include $(RTE_SDK)/mk/rte.vars.mk
 # library name
 LIB = librte_malloc.a
 
+LIBABIVER := 1
+
 CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_malloc/rte_malloc_version.map
diff --git a/lib/librte_mbuf/Makefile b/lib/librte_mbuf/Makefile
index 5cd4941..3cf94d1 100644
--- a/lib/librte_mbuf/Makefile
+++ b/lib/librte_mbuf/Makefile
@@ -38,6 +38,8 @@ CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_mbuf/rte_mbuf_version.map
 
+LIBABIVER := 1
+
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_MBUF) := rte_mbuf.c
 
diff --git a/lib/librte_mempool/Makefile b/lib/librte_mempool/Makefile
index 07b5b4e..2c2a6e8 100644
--- a/lib/librte_mempool/Makefile
+++ b/lib/librte_mempool/Makefile
@@ -38,6 +38,8 @@ CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_mempool/rte_mempool_version.map
 
+LIBABIVER := 1
+
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_MEMPOOL) +=  rte_mempool.c
 ifeq ($(CONFIG_RTE_LIBRTE_XEN_DOM0),y)
diff --git a/lib/librte_meter/Makefile b/lib/librte_meter/Makefile
index 0778690..f58822e 100644
--- a/lib/librte_meter/Makefile
+++ b/lib/librte_meter/Makefile
@@ -41,6 +41,8 @@ CFLAGS += $(WERROR_FLAGS)
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_meter/rte_meter_version.map
 
+LIBABIVER := 1
+
 #
 # all source are stored in SRCS-y
 #
diff --git a/lib/librte_pipeline/Makefile b/lib/librte_pipeline/Makefile
index 5465d00..df44f51 100644
--- a/lib/librte_pipeline/Makefile
+++ b/lib/librte_pipeline/Makefile
@@ -41,6 +41,8 @@ CFLAGS += $(WERROR_FLAGS)
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_pipeline/rte_pipeline_version.map
 
+LIBABIVER := 1
+
 #
 # all source are stored in SRCS-y
 #
diff --git a/lib/librte_pmd_bond/Makefile b/lib/librte_pmd_bond/Makefile
index 5b14ce2..2f1e83b 100644
--- a/lib/librte_pmd_bond/Makefile
+++ b/lib/librte_pmd_bond/Makefile
@@ -41,6 +41,8 @@ CFLAGS += $(WERROR_FLAGS)
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_pmd_bond/rte_eth_bond_version.map
 
+LIBABIVER := 1
+
 #
 # all source are stored in SRCS-y
 #
diff --git a/lib/librte_pmd_e1000/Makefile b/lib/librte_pmd_e1000/Makefile
index e225bfe..a5e3b66 100644
--- a/lib/librte_pmd_e1000/Makefile
+++ b/lib/librte_pmd_e1000/Makefile
@@ -41,6 +41,8 @@ CFLAGS += $(WERROR_FLAGS)
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_pmd_e1000/rte_pmd_e1000_version.map
 
+LIBABIVER := 1
+
 ifeq ($(CC), icc)
 #
 # CFLAGS for icc
diff --git a/lib/librte_pmd_i40e/Makefile b/lib/librte_pmd_i40e/Makefile
index cfbe816..d59967a 100644
--- a/lib/librte_pmd_i40e/Makefile
+++ b/lib/librte_pmd_i40e/Makefile
@@ -41,6 +41,8 @@ CFLAGS += $(WERROR_FLAGS)
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_pmd_i40e/rte_pmd_i40e_version.map
 
+LIBABIVER := 1
+
 #
 # Add extra flags for base driver files (also known as shared code)
 # to disable warnings
diff --git a/lib/librte_pmd_ixgbe/Makefile b/lib/librte_pmd_ixgbe/Makefile
index 1dd14a6..fd17c09 100644
--- a/lib/librte_pmd_ixgbe/Makefile
+++ b/lib/librte_pmd_ixgbe/Makefile
@@ -41,6 +41,8 @@ CFLAGS += $(WERROR_FLAGS)
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_pmd_ixgbe/rte_pmd_ixgbe_version.map
 
+LIBABIVER := 1
+
 ifeq ($(CC), icc)
 #
 # CFLAGS for icc
diff --git a/lib/librte_pmd_pcap/Makefile b/lib/librte_pmd_pcap/Makefile
index fff5572..8f05c2c 100644
--- a/lib/librte_pmd_pcap/Makefile
+++ b/lib/librte_pmd_pcap/Makefile
@@ -42,6 +42,8 @@ CFLAGS += $(WERROR_FLAGS)
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_pmd_pcap/rte_pmd_pcap_version.map
 
+LIBABIVER := 1
+
 #
 # all source are stored in SRCS-y
 #
diff --git a/lib/librte_pmd_ring/Makefile b/lib/librte_pmd_ring/Makefile
index 25ad27f..24c57fc 100644
--- a/lib/librte_pmd_ring/Makefile
+++ b/lib/librte_pmd_ring/Makefile
@@ -41,6 +41,8 @@ CFLAGS += $(WERROR_FLAGS)
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_pmd_ring/rte_eth_ring_version.map
 
+LIBABIVER := 1
+
 #
 # all source are stored in SRCS-y
 #
diff --git a/lib/librte_pmd_virtio/Makefile b/lib/librte_pmd_virtio/Makefile
index bf51bd9..d0bec84 100644
--- a/lib/librte_pmd_virtio/Makefile
+++ b/lib/librte_pmd_virtio/Makefile
@@ -41,6 +41,8 @@ CFLAGS += $(WERROR_FLAGS)
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_pmd_virtio/rte_pmd_virtio_version.map
 
+LIBABIVER := 1
+
 #
 # all source are stored in SRCS-y
 #
diff --git a/lib/librte_pmd_vmxnet3/Makefile b/lib/librte_pmd_vmxnet3/Makefile
index e5a1c6b..2b418f4 100644
--- a/lib/librte_pmd_vmxnet3/Makefile
+++ b/lib/librte_pmd_vmxnet3/Makefile
@@ -68,6 +68,8 @@ VPATH += $(RTE_SDK)/lib/librte_pmd_vmxnet3/vmxnet3
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_pmd_vmxnet3/rte_pmd_vmxnet3_version.map
 
+LIBABIVER := 1
+
 #
 # all source are stored in SRCS-y
 #
diff --git a/lib/librte_pmd_xenvirt/Makefile b/lib/librte_pmd_xenvirt/Makefile
index 0a08b1b..6132c1c 100644
--- a/lib/librte_pmd_xenvirt/Makefile
+++ b/lib/librte_pmd_xenvirt/Makefile
@@ -41,6 +41,8 @@ CFLAGS += $(WERROR_FLAGS)
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_pmd_xenvirt/rte_eth_xenvirt_version.map
 
+LIBABIVER := 1
+
 #
 # all source are stored in SRCS-y
 #
diff --git a/lib/librte_port/Makefile b/lib/librte_port/Makefile
index e812bda..828692f 100644
--- a/lib/librte_port/Makefile
+++ b/lib/librte_port/Makefile
@@ -41,6 +41,8 @@ CFLAGS += $(WERROR_FLAGS)
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_port/rte_port_version.map
 
+LIBABIVER := 1
+
 #
 # all source are stored in SRCS-y
 #
diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile
index 26ee542..3261176 100644
--- a/lib/librte_power/Makefile
+++ b/lib/librte_power/Makefile
@@ -38,6 +38,8 @@ CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -fno-strict-aliasing
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_power/rte_power_version.map
 
+LIBABIVER := 1
+
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c
 
diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index 0adaa00..fa697ea 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -38,6 +38,8 @@ CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_ring/rte_ring_version.map
 
+LIBABIVER := 1
+
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
 
diff --git a/lib/librte_sched/Makefile b/lib/librte_sched/Makefile
index 205fb7a..1a54bf9 100644
--- a/lib/librte_sched/Makefile
+++ b/lib/librte_sched/Makefile
@@ -43,6 +43,8 @@ CFLAGS_rte_red.o := -D_GNU_SOURCE
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_sched/rte_sched_version.map
 
+LIBABIVER := 1
+
 #
 # all source are stored in SRCS-y
 #
diff --git a/lib/librte_table/Makefile b/lib/librte_table/Makefile
index 5b54acc..29b768c 100644
--- a/lib/librte_table/Makefile
+++ b/lib/librte_table/Makefile
@@ -41,6 +41,8 @@ CFLAGS += $(WERROR_FLAGS)
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_table/rte_table_version.map
 
+LIBABIVER := 1
+
 #
 # all source are stored in SRCS-y
 #
diff --git a/lib/librte_timer/Makefile b/lib/librte_timer/Makefile
index f703e5f..01772c7 100644
--- a/lib/librte_timer/Makefile
+++ b/lib/librte_timer/Makefile
@@ -38,6 +38,8 @@ CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3
 
 EXPORT_MAP := $(RTE_SDK)/lib/librte_timer/rte_timer_version.map
 
+LIBABIVER := 1
+
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_TIMER) := rte_timer.c
 
diff --git a/mk/rte.lib.mk b/mk/rte.lib.mk
index 82ac309..4d55cc9 100644
--- a/mk/rte.lib.mk
+++ b/mk/rte.lib.mk
@@ -37,10 +37,8 @@ include $(RTE_SDK)/mk/internal/rte.depdirs-pre.mk
 
 # VPATH contains at least SRCDIR
 VPATH += $(SRCDIR)
-
 ifeq ($(RTE_BUILD_SHARED_LIB),y)
-LIB := $(patsubst %.a,%.so,$(LIB))
-
+LIB := $(patsubst %.a,%.so.$(LIBABIVER),$(LIB))
 CPU_LDFLAGS += --version-script=$(EXPORT_MAP)
 
 endif
@@ -63,6 +61,7 @@ build: _postbuild
 
 exe2cmd = $(strip $(call dotfile,$(patsubst %,%.cmd,$(1))))
 
+
 ifeq ($(LINK_USING_CC),1)
 # Override the definition of LD here, since we're linking with CC
 LD := $(CC)
@@ -112,6 +111,10 @@ lib_dir = [ -d $(RTE_OUTPUT)/lib ] || mkdir -p $(RTE_OUTPUT)/lib;
 #
 ifeq ($(RTE_BUILD_SHARED_LIB),y)
 $(LIB): $(OBJS-y) $(DEP_$(LIB)) FORCE
+ifeq ($(LIBABIVER),)
+	@echo "Must Specify a $(LIB) ABI version"
+	@exit 1
+endif
 	@[ -d $(dir $@) ] || mkdir -p $(dir $@)
 	$(if $(D),\
 		@echo -n "$< -> $@ " ; \
@@ -125,6 +128,7 @@ $(LIB): $(OBJS-y) $(DEP_$(LIB)) FORCE
 		$(depfile_missing),\
 		$(depfile_newer)),\
 		$(O_TO_S_DO))
+
 ifeq ($(RTE_BUILD_COMBINE_LIBS),y)
 	$(if $(or \
         $(file_missing),\
@@ -162,10 +166,12 @@ endif
 # install lib in $(RTE_OUTPUT)/lib
 #
 $(RTE_OUTPUT)/lib/$(LIB): $(LIB)
+	$(eval LIBSONAME := $(basename $(LIB)))
 	@echo "  INSTALL-LIB $(LIB)"
 	@[ -d $(RTE_OUTPUT)/lib ] || mkdir -p $(RTE_OUTPUT)/lib
 ifneq ($(LIB),)
 	$(Q)cp -f $(LIB) $(RTE_OUTPUT)/lib
+	$(Q)ln -s -f $(RTE_OUTPUT)/lib/$(LIB) $(RTE_OUTPUT)/lib/$(LIBSONAME)
 endif
 
 #
-- 
1.9.3
^ permalink raw reply	[relevance 7%]
* [dpdk-dev] [PATCH 1/4] compat: Add infrastructure to support symbol versioning
  2014-09-15 19:23  4% [dpdk-dev] [PATCH 0/4] Add DSO symbol versioning to support backwards compatibility Neil Horman
@ 2014-09-15 19:23  4% ` Neil Horman
  2014-09-23 10:39  0%   ` Sergio Gonzalez Monroy
  2014-09-25 18:52  4%   ` [dpdk-dev] [PATCH 1/4 v2] " Neil Horman
                     ` (3 subsequent siblings)
  4 siblings, 2 replies; 86+ results
From: Neil Horman @ 2014-09-15 19:23 UTC (permalink / raw)
  To: dev
Add an initial pass of header files to support symbol versioning.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Thomas Monjalon <thomas.monjalon@6wind.com>
CC: "Richardson, Bruce" <bruce.richardson@intel.com>
---
 lib/Makefile                   |  1 +
 lib/librte_compat/Makefile     | 38 +++++++++++++++++++
 lib/librte_compat/rte_compat.h | 86 ++++++++++++++++++++++++++++++++++++++++++
 mk/rte.lib.mk                  |  6 +++
 4 files changed, 131 insertions(+)
 create mode 100644 lib/librte_compat/Makefile
 create mode 100644 lib/librte_compat/rte_compat.h
diff --git a/lib/Makefile b/lib/Makefile
index 10c5bb3..a85b55b 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -32,6 +32,7 @@
 include $(RTE_SDK)/mk/rte.vars.mk
 
 DIRS-$(CONFIG_RTE_LIBC) += libc
+DIRS-y += librte_compat
 DIRS-$(CONFIG_RTE_LIBRTE_EAL) += librte_eal
 DIRS-$(CONFIG_RTE_LIBRTE_MALLOC) += librte_malloc
 DIRS-$(CONFIG_RTE_LIBRTE_RING) += librte_ring
diff --git a/lib/librte_compat/Makefile b/lib/librte_compat/Makefile
new file mode 100644
index 0000000..a61511a
--- /dev/null
+++ b/lib/librte_compat/Makefile
@@ -0,0 +1,38 @@
+#   BSD LICENSE
+#
+#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+#   All rights reserved.
+#
+#   Redistribution and use in source and binary forms, with or without
+#   modification, are permitted provided that the following conditions
+#   are met:
+#
+#     * Redistributions of source code must retain the above copyright
+#       notice, this list of conditions and the following disclaimer.
+#     * Redistributions in binary form must reproduce the above copyright
+#       notice, this list of conditions and the following disclaimer in
+#       the documentation and/or other materials provided with the
+#       distribution.
+#     * Neither the name of Intel Corporation nor the names of its
+#       contributors may be used to endorse or promote products derived
+#       from this software without specific prior written permission.
+#
+#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+
+# install includes
+SYMLINK-y-include := rte_compat.h
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_compat/rte_compat.h b/lib/librte_compat/rte_compat.h
new file mode 100644
index 0000000..6d65a53
--- /dev/null
+++ b/lib/librte_compat/rte_compat.h
@@ -0,0 +1,86 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_COMPAT_H_
+#define _RTE_COMPAT_H_
+
+/*
+ * This is just a stringification macro for use below.
+ */
+#define SA(x) #x
+
+#ifdef RTE_SYMBOL_VERSIONING
+
+/*
+ * Provides backwards compatibility when updating exported functions.
+ * When a symbol is exported from a library to provide an API, it also provides a
+ * calling convention (ABI) that is embodied in its name, return type,
+ * arguments, etc.  On occasion that function may need to change to accommodate
+ * new functionality, behavior, etc.  When that occurs, it is desirable to
+ * allow for backwards compatibility for a time with older binaries that are
+ * dynamically linked to the DPDK.  To support that, the __vsym and
+ * VERSION_SYMBOL macros are created.  They, in conjunction with the
+ * <library>_version.map file for a given library, allow for multiple versions of
+ * a symbol to exist in a shared library so that older binaries need not be
+ * immediately recompiled.  Their use is outlined in the following example:
+ * Assumptions: DPDK 1.(X) contains a function int foo(char *string)
+ *              DPDK 1.(X+1) needs to change foo to be int foo(int index)
+ *
+ * To accomplish this:
+ * 1) Edit lib/<library>/library_version.map to add a DPDK_1.8 node, in which
+ * foo is exported as a global symbol
+ *
+ * 2) Rename the existing function int foo(char *string) to
+ * 	int __vsym foo_v18(char *string)
+ *
+ * 3) Add this macro immediately below the function
+ * 	VERSION_SYMBOL(foo, _v18, 1.8);
+ *
+ */
+#define VERSION_SYMBOL(b, e, v) __asm__(".symver " SA(b) SA(e) ", "SA(b)"@DPDK_"SA(v))
+#define __vsym __attribute__((used))
+
+#else
+/*
+ * No symbol versioning in use
+ */
+#define VERSION_SYMBOL(b, e, v)
+#define __vsym
+
+/*
+ * RTE_SYMBOL_VERSIONING
+ */
+#endif
+
+
+#endif /* _RTE_COMPAT_H_ */
diff --git a/mk/rte.lib.mk b/mk/rte.lib.mk
index f458258..82ac309 100644
--- a/mk/rte.lib.mk
+++ b/mk/rte.lib.mk
@@ -40,8 +40,12 @@ VPATH += $(SRCDIR)
 
 ifeq ($(RTE_BUILD_SHARED_LIB),y)
 LIB := $(patsubst %.a,%.so,$(LIB))
+
+CPU_LDFLAGS += --version-script=$(EXPORT_MAP)
+
 endif
 
+
 _BUILD = $(LIB)
 _INSTALL = $(INSTALL-FILES-y) $(SYMLINK-FILES-y) $(RTE_OUTPUT)/lib/$(LIB)
 _CLEAN = doclean
@@ -160,7 +164,9 @@ endif
 $(RTE_OUTPUT)/lib/$(LIB): $(LIB)
 	@echo "  INSTALL-LIB $(LIB)"
 	@[ -d $(RTE_OUTPUT)/lib ] || mkdir -p $(RTE_OUTPUT)/lib
+ifneq ($(LIB),)
 	$(Q)cp -f $(LIB) $(RTE_OUTPUT)/lib
+endif
 
 #
 # Clean all generated files
-- 
1.9.3
^ permalink raw reply	[relevance 4%]
* [dpdk-dev] [PATCH 0/4] Add DSO symbol versioning to support backwards compatibility
@ 2014-09-15 19:23  4% Neil Horman
  2014-09-15 19:23  4% ` [dpdk-dev] [PATCH 1/4] compat: Add infrastructure to support symbol versioning Neil Horman
                   ` (4 more replies)
  0 siblings, 5 replies; 86+ results
From: Neil Horman @ 2014-09-15 19:23 UTC (permalink / raw)
  To: dev
The DPDK ABI develops and changes quickly, which makes it difficult for
applications to keep up with the latest version of the library, especially when
the DPDK is built as a set of shared objects, since applications may be built
against an older version of the library.
To mitigate this, this patch series introduces support for library and symbol
versioning when the DPDK is built as a DSO.  Specifically, it does 4 things:
1) Adds initial support for library versioning.  Each library now has a version
map that explicitly calls out what symbols are exported to consuming
applications, and assigns version(s) to them.
2) Adds support macros so that when libraries create incompatible ABIs,
multiple versions may be supported so that applications linked against older
DPDK releases can continue to function.
3) Adds library soname versioning suffixes so that when ABIs must be broken in
a fashion that requires a rebuild of older applications, they will break at load
time, rather than cause unexpected issues at run time.
4) Adds documentation for ABI policy, and provides space to document deprecated
ABI versions, so that applications might be warned of impending changes.
With these elements in place the DPDK has some support to allow for the extended
maintenance of older APIs while still allowing the freedom to develop new and
improved APIs.
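
As a quick illustration of points 1 and 3 above, a library Makefile under this
scheme carries both its export map and its ABI version (librte_foo here is a
hypothetical library, but the pattern mirrors the LIBABIVER additions made to
each Makefile in this series):

	# library name
	LIB = librte_foo.a

	EXPORT_MAP := $(RTE_SDK)/lib/librte_foo/rte_foo_version.map

	# major ABI version, bumped only on an incompatible ABI break
	LIBABIVER := 1

With RTE_BUILD_SHARED_LIB=y, rte.lib.mk then builds librte_foo.so.1 and
installs a librte_foo.so symlink pointing at the versioned file.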
Implementing this feature will require some additional effort on the part of
developers and reviewers.  When reviewing patches, they must be checked against
existing exports to ensure that the function prototypes are not changing.  If
they are, the versioning macros must be used, and the library export map should
be updated to reflect the new version of the function.
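
To make that review task concrete, here is a minimal sketch of what a
<library>_version.map might look like once a function foo has changed its
prototype (the names and version nodes are illustrative only; exactly how the
new default version is bound is handled by the map and the versioning macros):

	DPDK_1.8 {
		global:
			foo;
		local:
			*;
	};

	DPDK_1.9 {
		global:
			foo;
	} DPDK_1.8;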
When data structures change, if those structures are application accessible,
APIs that accept or return instances of those data structures should have new
versions created, so that users of the old data structure version can coexist
with users of the new one.
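
On the code side, the workflow documented in rte_compat.h (patch 1/4) keeps
the old implementation under a new name and binds it to the old version node,
while the new implementation keeps the public name.  A minimal sketch, reusing
the hypothetical foo from that example:

	#include <rte_compat.h>

	/* old prototype, kept for binaries linked against DPDK 1.8 */
	int __vsym
	foo_v18(char *string)
	{
		return (string == NULL) ? -1 : 0;
	}
	/* emits: .symver foo_v18, foo@DPDK_1.8 */
	VERSION_SYMBOL(foo, _v18, 1.8);

	/* new prototype, exported to newly built applications */
	int
	foo(int index)
	{
		return (index < 0) ? -1 : 0;
	}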
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Thomas Monjalon <thomas.monjalon@6wind.com>
CC: "Richardson, Bruce" <bruce.richardson@intel.com>
^ permalink raw reply	[relevance 4%]
* Re: [dpdk-dev] [PATCHv5] librte_acl make it build/work for 'default' target
  2014-09-02 13:43  0% ` Neil Horman
@ 2014-09-03  1:29  0%   ` Thomas Monjalon
  0 siblings, 0 replies; 86+ results
From: Thomas Monjalon @ 2014-09-03  1:29 UTC (permalink / raw)
  To: Neil Horman, Konstantin Ananyev; +Cc: dev
> > Make ACL library to build/work on 'default' architecture:
> > - make rte_acl_classify_scalar really scalar
> >  (make sure it wouldn't use sse4 intrinsics through resolve_priority()).
> > - Provide two versions of rte_acl_classify code path:
> >   rte_acl_classify_sse() - can be built and used only on systems with sse4.2
> >   and upper, returns -ENOTSUP on lower arch.
> >   rte_acl_classify_scalar() - a slower version, but can be built and used
> >   on all systems.
> > - keep common code shared between these two codepaths.
> > 
> > v2 changes:
> >  run-time selection of most appropriate code-path for given ISA.
> >  By default the highest supported one is selected.
> >  User can still override that selection by manually assigning new value to
> >  the global function pointer rte_acl_default_classify.
> >  rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
> >  points to.
> > 
> > V3 Changes
> >  Updated classify pointer to be a function so as to better preserve ABI
> >  REmoved macro definitions for match check functions to make them static inline
> > 
> > V4 Changes
> >  Rewrote classification selection mechanism to use a function table, so that we
> > can just store the preferred alg in the rte_acl_ctx struct so that multiprocess
> > access works.  I understand that leaves us with an extra load instruction, but I
> > think that's ok, because it also allows...
> > 
> >  Addition of a new function rte_acl_classify_alg.  This function lets you
> > specify an enum value to override the acl contexts default algorith when doing a
> > classification.  This allows an application to specify a classification
> > algorithm without needing to pulicize each method.  I know there was concern
> > over keeping those methods public, but we don't have a static ABI at the moment,
> > so this seems to me a reasonable thing to do, as it gives us less of an ABI
> > surface to worry about.
> > 
> >  Fixed misc missed static declarations
> >  Removed acl_match_check.h and moved match_check function to acl_run.h
> >  Typedef'd function pointer to match check.
> > 
> > V5 Changes
> >  Updated examples/l3fwd-acl to comply with latest changes.
> >  Applied other code review comments (mostly style changes).
> > 
> > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> Acked-by: Neil Horman <nhorman@tuxdriver.com>
> Thanks Konstantin!
Applied for version 1.7.1.
Thanks a lot
-- 
Thomas
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCHv5] librte_acl make it build/work for 'default' target
  2014-09-01 15:28  1% [dpdk-dev] [PATCHv5] " Konstantin Ananyev
@ 2014-09-02 13:43  0% ` Neil Horman
  2014-09-03  1:29  0%   ` Thomas Monjalon
  0 siblings, 1 reply; 86+ results
From: Neil Horman @ 2014-09-02 13:43 UTC (permalink / raw)
  To: Konstantin Ananyev; +Cc: dev
On Mon, Sep 01, 2014 at 04:28:44PM +0100, Konstantin Ananyev wrote:
> Make ACL library to build/work on 'default' architecture:
> - make rte_acl_classify_scalar really scalar
>  (make sure it wouldn't use sse4 intrinsics through resolve_priority()).
> - Provide two versions of rte_acl_classify code path:
>   rte_acl_classify_sse() - can be built and used only on systems with sse4.2
>   and upper, returns -ENOTSUP on lower arch.
>   rte_acl_classify_scalar() - a slower version, but can be built and used
>   on all systems.
> - keep common code shared between these two codepaths.
> 
> v2 changes:
>  run-time selection of most appropriate code-path for given ISA.
>  By default the highest supported one is selected.
>  User can still override that selection by manually assigning new value to
>  the global function pointer rte_acl_default_classify.
>  rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
>  points to.
> 
> V3 Changes
>  Updated classify pointer to be a function so as to better preserve ABI
>  Removed macro definitions for match check functions to make them static inline
> 
> V4 Changes
>  Rewrote classification selection mechanism to use a function table, so that we
> can just store the preferred alg in the rte_acl_ctx struct so that multiprocess
> access works.  I understand that leaves us with an extra load instruction, but I
> think that's ok, because it also allows...
> 
>  Addition of a new function rte_acl_classify_alg.  This function lets you
> specify an enum value to override the acl contexts default algorith when doing a
> classification.  This allows an application to specify a classification
> algorithm without needing to pulicize each method.  I know there was concern
> over keeping those methods public, but we don't have a static ABI at the moment,
> so this seems to me a reasonable thing to do, as it gives us less of an ABI
> surface to worry about.
> 
>  Fixed misc missed static declarations
>  Removed acl_match_check.h and moved match_check function to acl_run.h
>  Typedef'd function pointer to match check.
> 
> V5 Changes
>  Updated examples/l3fwd-acl to comply with latest changes.
>  Applied other code review comments (mostly style changes).
> 
> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Thanks Konstantin!
Neil
^ permalink raw reply	[relevance 0%]
* [dpdk-dev] [PATCHv5] librte_acl make it build/work for 'default' target
@ 2014-09-01 15:28  1% Konstantin Ananyev
  2014-09-02 13:43  0% ` Neil Horman
  0 siblings, 1 reply; 86+ results
From: Konstantin Ananyev @ 2014-09-01 15:28 UTC (permalink / raw)
  To: dev, dev
Make ACL library to build/work on 'default' architecture:
- make rte_acl_classify_scalar really scalar
 (make sure it wouldn't use sse4 intrinsics through resolve_priority()).
- Provide two versions of rte_acl_classify code path:
  rte_acl_classify_sse() - can be built and used only on systems with sse4.2
  and upper, returns -ENOTSUP on lower arch.
  rte_acl_classify_scalar() - a slower version, but can be built and used
  on all systems.
- keep common code shared between these two codepaths.
v2 changes:
 run-time selection of most appropriate code-path for given ISA.
 By default the highest supported one is selected.
 User can still override that selection by manually assigning new value to
 the global function pointer rte_acl_default_classify.
 rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
 points to.
V3 Changes
 Updated classify pointer to be a function so as to better preserve ABI
 Removed macro definitions for match check functions to make them static inline
V4 Changes
 Rewrote classification selection mechanism to use a function table, so that we
can just store the preferred alg in the rte_acl_ctx struct so that multiprocess
access works.  I understand that leaves us with an extra load instruction, but I
think that's ok, because it also allows...
 Addition of a new function rte_acl_classify_alg.  This function lets you
specify an enum value to override the ACL context's default algorithm when doing a
classification.  This allows an application to specify a classification
algorithm without needing to publicize each method.  I know there was concern
over keeping those methods public, but we don't have a static ABI at the moment,
so this seems to me a reasonable thing to do, as it gives us less of an ABI
surface to worry about.
 Fixed misc missed static declarations
 Removed acl_match_check.h and moved match_check function to acl_run.h
 Typedef'd function pointer to match check.
V5 Changes
 Updated examples/l3fwd-acl to comply with latest changes.
 Applied other code review comments (mostly style changes).
Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
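For reference, the algorithm selection described above looks as follows from
an application (a minimal sketch assuming a built ACL context ctx and prepared
data/results arrays, as in the test and example code below):

	/* make scalar the default classify method for this context */
	ret = rte_acl_set_ctx_classify(ctx, RTE_ACL_CLASSIFY_SCALAR);
	if (ret != 0)
		rte_exit(ret, "failed to setup classify method\n");

	/* uses the context's default algorithm */
	ret = rte_acl_classify(ctx, data, results, num, categories);

	/* one-off override that leaves the context default untouched */
	ret = rte_acl_classify_alg(ctx, data, results, num, categories,
		RTE_ACL_CLASSIFY_SCALAR);
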
 app/test-acl/main.c             |  20 +-
 app/test/test_acl.c             |  19 +-
 examples/l3fwd-acl/main.c       |  22 +-
 lib/librte_acl/Makefile         |   5 +-
 lib/librte_acl/acl.h            |  15 +
 lib/librte_acl/acl_bld.c        |   5 +-
 lib/librte_acl/acl_run.c        | 944 ----------------------------------------
 lib/librte_acl/acl_run.h        | 268 ++++++++++++
 lib/librte_acl/acl_run_scalar.c | 193 ++++++++
 lib/librte_acl/acl_run_sse.c    | 626 ++++++++++++++++++++++++++
 lib/librte_acl/rte_acl.c        |  55 +++
 lib/librte_acl/rte_acl.h        |  56 ++-
 12 files changed, 1239 insertions(+), 989 deletions(-)
 delete mode 100644 lib/librte_acl/acl_run.c
 create mode 100644 lib/librte_acl/acl_run.h
 create mode 100644 lib/librte_acl/acl_run_scalar.c
 create mode 100644 lib/librte_acl/acl_run_sse.c
diff --git a/app/test-acl/main.c b/app/test-acl/main.c
index d654409..44add10 100644
--- a/app/test-acl/main.c
+++ b/app/test-acl/main.c
@@ -772,6 +772,15 @@ acx_init(void)
 	if (config.acx == NULL)
 		rte_exit(rte_errno, "failed to create ACL context\n");
 
+	/* set default classify method to scalar for this context. */
+	if (config.scalar) {
+		ret = rte_acl_set_ctx_classify(config.acx,
+			RTE_ACL_CLASSIFY_SCALAR);
+		if (ret != 0)
+			rte_exit(ret, "failed to setup classify method "
+				"for ACL context\n");
+	}
+
 	/* add ACL rules. */
 	f = fopen(config.rule_file, "r");
 	if (f == NULL)
@@ -780,7 +789,7 @@ acx_init(void)
 
 	ret = add_cb_rules(f, config.acx);
 	if (ret != 0)
-		rte_exit(rte_errno, "failed to add rules into ACL context\n");
+		rte_exit(ret, "failed to add rules into ACL context\n");
 
 	fclose(f);
 
@@ -815,13 +824,8 @@ search_ip5tuples_once(uint32_t categories, uint32_t step, int scalar)
 			v += config.trace_sz;
 		}
 
-		if (scalar != 0)
-			ret = rte_acl_classify_scalar(config.acx, data,
-				results, n, categories);
-
-		else
-			ret = rte_acl_classify(config.acx, data,
-				results, n, categories);
+		ret = rte_acl_classify(config.acx, data, results,
+			n, categories);
 
 		if (ret != 0)
 			rte_exit(ret, "classify for ipv%c_5tuples returns %d\n",
diff --git a/app/test/test_acl.c b/app/test/test_acl.c
index c6b3f86..356d620 100644
--- a/app/test/test_acl.c
+++ b/app/test/test_acl.c
@@ -146,8 +146,9 @@ test_classify_run(struct rte_acl_ctx *acx)
 	}
 
 	/* make a quick check for scalar */
-	ret = rte_acl_classify_scalar(acx, data, results,
-			RTE_DIM(acl_test_data), RTE_ACL_MAX_CATEGORIES);
+	ret = rte_acl_classify_alg(acx, data, results,
+			RTE_DIM(acl_test_data), RTE_ACL_MAX_CATEGORIES,
+			RTE_ACL_CLASSIFY_SCALAR);
 	if (ret != 0) {
 		printf("Line %i: SSE classify failed!\n", __LINE__);
 		goto err;
@@ -341,8 +342,8 @@ test_invalid_layout(void)
 	}
 
 	/* classify tuples */
-	ret = rte_acl_classify(acx, data, results,
-			RTE_DIM(results), 1);
+	ret = rte_acl_classify_alg(acx, data, results,
+			RTE_DIM(results), 1, RTE_ACL_CLASSIFY_SCALAR);
 	if (ret != 0) {
 		printf("Line %i: SSE classify failed!\n", __LINE__);
 		rte_acl_free(acx);
@@ -360,8 +361,9 @@ test_invalid_layout(void)
 	}
 
 	/* classify tuples (scalar) */
-	ret = rte_acl_classify_scalar(acx, data, results,
-			RTE_DIM(results), 1);
+	ret = rte_acl_classify_alg(acx, data, results, RTE_DIM(results), 1,
+		RTE_ACL_CLASSIFY_SCALAR);
+
 	if (ret != 0) {
 		printf("Line %i: Scalar classify failed!\n", __LINE__);
 		rte_acl_free(acx);
@@ -848,7 +850,8 @@ test_invalid_parameters(void)
 	/* scalar classify test */
 
 	/* cover zero categories in classify (should not fail) */
-	result = rte_acl_classify_scalar(acx, NULL, NULL, 0, 0);
+	result = rte_acl_classify_alg(acx, NULL, NULL, 0, 0,
+		RTE_ACL_CLASSIFY_SCALAR);
 	if (result != 0) {
 		printf("Line %i: Scalar classify with zero categories "
 				"failed!\n", __LINE__);
@@ -857,7 +860,7 @@ test_invalid_parameters(void)
 	}
 
 	/* cover invalid but positive categories in classify */
-	result = rte_acl_classify_scalar(acx, NULL, NULL, 0, 3);
+	result = rte_acl_classify(acx, NULL, NULL, 0, 3);
 	if (result == 0) {
 		printf("Line %i: Scalar classify with 3 categories "
 				"should have failed!\n", __LINE__);
diff --git a/examples/l3fwd-acl/main.c b/examples/l3fwd-acl/main.c
index 9b2c21b..eac0eab 100644
--- a/examples/l3fwd-acl/main.c
+++ b/examples/l3fwd-acl/main.c
@@ -278,15 +278,6 @@ send_single_packet(struct rte_mbuf *m, uint8_t port);
 	(in) = end + 1;                                         \
 } while (0)
 
-#define CLASSIFY(context, data, res, num, cat) do {		\
-	if (scalar)						\
-		rte_acl_classify_scalar((context), (data),	\
-		(res), (num), (cat));				\
-	else							\
-		rte_acl_classify((context), (data),		\
-		(res), (num), (cat));				\
-} while (0)
-
 /*
   * ACL rules should have higher priorities than route ones to ensure ACL rule
   * always be found when input packets have multi-matches in the database.
@@ -1216,6 +1207,11 @@ setup_acl(struct rte_acl_rule *route_base,
 	if ((context = rte_acl_create(&acl_param)) == NULL)
 		rte_exit(EXIT_FAILURE, "Failed to create ACL context\n");
 
+	if (parm_config.scalar && rte_acl_set_ctx_classify(context,
+			RTE_ACL_CLASSIFY_SCALAR) != 0)
+		rte_exit(EXIT_FAILURE,
+			"Failed to setup classify method for  ACL context\n");
+
 	if (rte_acl_add_rules(context, route_base, route_num) < 0)
 			rte_exit(EXIT_FAILURE, "add rules failed\n");
 
@@ -1436,10 +1432,8 @@ main_loop(__attribute__((unused)) void *dummy)
 	int socketid;
 	const uint64_t drain_tsc = (rte_get_tsc_hz() + US_PER_S - 1)
 			/ US_PER_S * BURST_TX_DRAIN_US;
-	int scalar = parm_config.scalar;
 
 	prev_tsc = 0;
-
 	lcore_id = rte_lcore_id();
 	qconf = &lcore_conf[lcore_id];
 	socketid = rte_lcore_to_socket_id(lcore_id);
@@ -1503,7 +1497,8 @@ main_loop(__attribute__((unused)) void *dummy)
 					nb_rx);
 
 				if (acl_search.num_ipv4) {
-					CLASSIFY(acl_config.acx_ipv4[socketid],
+					rte_acl_classify(
+						acl_config.acx_ipv4[socketid],
 						acl_search.data_ipv4,
 						acl_search.res_ipv4,
 						acl_search.num_ipv4,
@@ -1515,7 +1510,8 @@ main_loop(__attribute__((unused)) void *dummy)
 				}
 
 				if (acl_search.num_ipv6) {
-					CLASSIFY(acl_config.acx_ipv6[socketid],
+					rte_acl_classify(
+						acl_config.acx_ipv6[socketid],
 						acl_search.data_ipv6,
 						acl_search.res_ipv6,
 						acl_search.num_ipv6,
diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
index 4fe4593..65e566d 100644
--- a/lib/librte_acl/Makefile
+++ b/lib/librte_acl/Makefile
@@ -43,7 +43,10 @@ SRCS-$(CONFIG_RTE_LIBRTE_ACL) += tb_mem.c
 SRCS-$(CONFIG_RTE_LIBRTE_ACL) += rte_acl.c
 SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_bld.c
 SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_gen.c
-SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run.c
+SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_scalar.c
+SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_sse.c
+
+CFLAGS_acl_run_sse.o += -msse4.1
 
 # install this header file
 SYMLINK-$(CONFIG_RTE_LIBRTE_ACL)-include := rte_acl_osdep.h
diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h
index b9d63fd..102fa51 100644
--- a/lib/librte_acl/acl.h
+++ b/lib/librte_acl/acl.h
@@ -153,6 +153,7 @@ struct rte_acl_ctx {
 	/** Name of the ACL context. */
 	int32_t             socket_id;
 	/** Socket ID to allocate memory from. */
+	enum rte_acl_classify_alg alg;
 	void               *rules;
 	uint32_t            max_rules;
 	uint32_t            rule_sz;
@@ -174,6 +175,20 @@ int rte_acl_gen(struct rte_acl_ctx *ctx, struct rte_acl_trie *trie,
 	struct rte_acl_bld_trie *node_bld_trie, uint32_t num_tries,
 	uint32_t num_categories, uint32_t data_index_sz, int match_num);
 
+typedef int (*rte_acl_classify_t)
+(const struct rte_acl_ctx *, const uint8_t **, uint32_t *, uint32_t, uint32_t);
+
+/*
+ * Different implementations of ACL classify.
+ */
+int
+rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories);
+
+int
+rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories);
+
 #ifdef __cplusplus
 }
 #endif /* __cplusplus */
diff --git a/lib/librte_acl/acl_bld.c b/lib/librte_acl/acl_bld.c
index 873447b..09d58ea 100644
--- a/lib/librte_acl/acl_bld.c
+++ b/lib/librte_acl/acl_bld.c
@@ -31,7 +31,6 @@
  *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
 
-#include <nmmintrin.h>
 #include <rte_acl.h>
 #include "tb_mem.h"
 #include "acl.h"
@@ -1480,8 +1479,8 @@ acl_calc_wildness(struct rte_acl_build_rule *head,
 
 			switch (rule->config->defs[n].type) {
 			case RTE_ACL_FIELD_TYPE_BITMASK:
-				wild = (size -
-					_mm_popcnt_u32(fld->mask_range.u8)) /
+				wild = (size - __builtin_popcount(
+					fld->mask_range.u8)) /
 					size;
 				break;
 
diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
deleted file mode 100644
index e3d9fc1..0000000
--- a/lib/librte_acl/acl_run.c
+++ /dev/null
@@ -1,944 +0,0 @@
-/*-
- *   BSD LICENSE
- *
- *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
- *   All rights reserved.
- *
- *   Redistribution and use in source and binary forms, with or without
- *   modification, are permitted provided that the following conditions
- *   are met:
- *
- *     * Redistributions of source code must retain the above copyright
- *       notice, this list of conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright
- *       notice, this list of conditions and the following disclaimer in
- *       the documentation and/or other materials provided with the
- *       distribution.
- *     * Neither the name of Intel Corporation nor the names of its
- *       contributors may be used to endorse or promote products derived
- *       from this software without specific prior written permission.
- *
- *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
- *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
- *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
- *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
- *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
- *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
- *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
- *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
- *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- */
-
-#include <rte_acl.h>
-#include "acl_vect.h"
-#include "acl.h"
-
-#define MAX_SEARCHES_SSE8	8
-#define MAX_SEARCHES_SSE4	4
-#define MAX_SEARCHES_SSE2	2
-#define MAX_SEARCHES_SCALAR	2
-
-#define GET_NEXT_4BYTES(prm, idx)	\
-	(*((const int32_t *)((prm)[(idx)].data + *(prm)[idx].data_index++)))
-
-
-#define RTE_ACL_NODE_INDEX	((uint32_t)~RTE_ACL_NODE_TYPE)
-
-#define	SCALAR_QRANGE_MULT	0x01010101
-#define	SCALAR_QRANGE_MASK	0x7f7f7f7f
-#define	SCALAR_QRANGE_MIN	0x80808080
-
-enum {
-	SHUFFLE32_SLOT1 = 0xe5,
-	SHUFFLE32_SLOT2 = 0xe6,
-	SHUFFLE32_SLOT3 = 0xe7,
-	SHUFFLE32_SWAP64 = 0x4e,
-};
-
-/*
- * Structure to manage N parallel trie traversals.
- * The runtime trie traversal routines can process 8, 4, or 2 tries
- * in parallel. Each packet may require multiple trie traversals (up to 4).
- * This structure is used to fill the slots (0 to n-1) for parallel processing
- * with the trie traversals needed for each packet.
- */
-struct acl_flow_data {
-	uint32_t            num_packets;
-	/* number of packets processed */
-	uint32_t            started;
-	/* number of trie traversals in progress */
-	uint32_t            trie;
-	/* current trie index (0 to N-1) */
-	uint32_t            cmplt_size;
-	uint32_t            total_packets;
-	uint32_t            categories;
-	/* number of result categories per packet. */
-	/* maximum number of packets to process */
-	const uint64_t     *trans;
-	const uint8_t     **data;
-	uint32_t           *results;
-	struct completion  *last_cmplt;
-	struct completion  *cmplt_array;
-};
-
-/*
- * Structure to maintain running results for
- * a single packet (up to 4 tries).
- */
-struct completion {
-	uint32_t *results;                          /* running results. */
-	int32_t   priority[RTE_ACL_MAX_CATEGORIES]; /* running priorities. */
-	uint32_t  count;                            /* num of remaining tries */
-	/* true for allocated struct */
-} __attribute__((aligned(XMM_SIZE)));
-
-/*
- * One parms structure for each slot in the search engine.
- */
-struct parms {
-	const uint8_t              *data;
-	/* input data for this packet */
-	const uint32_t             *data_index;
-	/* data indirection for this trie */
-	struct completion          *cmplt;
-	/* completion data for this packet */
-};
-
-/*
- * Define an global idle node for unused engine slots
- */
-static const uint32_t idle[UINT8_MAX + 1];
-
-static const rte_xmm_t mm_type_quad_range = {
-	.u32 = {
-		RTE_ACL_NODE_QRANGE,
-		RTE_ACL_NODE_QRANGE,
-		RTE_ACL_NODE_QRANGE,
-		RTE_ACL_NODE_QRANGE,
-	},
-};
-
-static const rte_xmm_t mm_type_quad_range64 = {
-	.u32 = {
-		RTE_ACL_NODE_QRANGE,
-		RTE_ACL_NODE_QRANGE,
-		0,
-		0,
-	},
-};
-
-static const rte_xmm_t mm_shuffle_input = {
-	.u32 = {0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c},
-};
-
-static const rte_xmm_t mm_shuffle_input64 = {
-	.u32 = {0x00000000, 0x04040404, 0x80808080, 0x80808080},
-};
-
-static const rte_xmm_t mm_ones_16 = {
-	.u16 = {1, 1, 1, 1, 1, 1, 1, 1},
-};
-
-static const rte_xmm_t mm_bytes = {
-	.u32 = {UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX},
-};
-
-static const rte_xmm_t mm_bytes64 = {
-	.u32 = {UINT8_MAX, UINT8_MAX, 0, 0},
-};
-
-static const rte_xmm_t mm_match_mask = {
-	.u32 = {
-		RTE_ACL_NODE_MATCH,
-		RTE_ACL_NODE_MATCH,
-		RTE_ACL_NODE_MATCH,
-		RTE_ACL_NODE_MATCH,
-	},
-};
-
-static const rte_xmm_t mm_match_mask64 = {
-	.u32 = {
-		RTE_ACL_NODE_MATCH,
-		0,
-		RTE_ACL_NODE_MATCH,
-		0,
-	},
-};
-
-static const rte_xmm_t mm_index_mask = {
-	.u32 = {
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-	},
-};
-
-static const rte_xmm_t mm_index_mask64 = {
-	.u32 = {
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-		0,
-		0,
-	},
-};
-
-/*
- * Allocate a completion structure to manage the tries for a packet.
- */
-static inline struct completion *
-alloc_completion(struct completion *p, uint32_t size, uint32_t tries,
-	uint32_t *results)
-{
-	uint32_t n;
-
-	for (n = 0; n < size; n++) {
-
-		if (p[n].count == 0) {
-
-			/* mark as allocated and set number of tries. */
-			p[n].count = tries;
-			p[n].results = results;
-			return &(p[n]);
-		}
-	}
-
-	/* should never get here */
-	return NULL;
-}
-
-/*
- * Resolve priority for a single result trie.
- */
-static inline void
-resolve_single_priority(uint64_t transition, int n,
-	const struct rte_acl_ctx *ctx, struct parms *parms,
-	const struct rte_acl_match_results *p)
-{
-	if (parms[n].cmplt->count == ctx->num_tries ||
-			parms[n].cmplt->priority[0] <=
-			p[transition].priority[0]) {
-
-		parms[n].cmplt->priority[0] = p[transition].priority[0];
-		parms[n].cmplt->results[0] = p[transition].results[0];
-	}
-
-	parms[n].cmplt->count--;
-}
-
-/*
- * Resolve priority for multiple results. This consists comparing
- * the priority of the current traversal with the running set of
- * results for the packet. For each result, keep a running array of
- * the result (rule number) and its priority for each category.
- */
-static inline void
-resolve_priority(uint64_t transition, int n, const struct rte_acl_ctx *ctx,
-	struct parms *parms, const struct rte_acl_match_results *p,
-	uint32_t categories)
-{
-	uint32_t x;
-	xmm_t results, priority, results1, priority1, selector;
-	xmm_t *saved_results, *saved_priority;
-
-	for (x = 0; x < categories; x += RTE_ACL_RESULTS_MULTIPLIER) {
-
-		saved_results = (xmm_t *)(&parms[n].cmplt->results[x]);
-		saved_priority =
-			(xmm_t *)(&parms[n].cmplt->priority[x]);
-
-		/* get results and priorities for completed trie */
-		results = MM_LOADU((const xmm_t *)&p[transition].results[x]);
-		priority = MM_LOADU((const xmm_t *)&p[transition].priority[x]);
-
-		/* if this is not the first completed trie */
-		if (parms[n].cmplt->count != ctx->num_tries) {
-
-			/* get running best results and their priorities */
-			results1 = MM_LOADU(saved_results);
-			priority1 = MM_LOADU(saved_priority);
-
-			/* select results that are highest priority */
-			selector = MM_CMPGT32(priority1, priority);
-			results = MM_BLENDV8(results, results1, selector);
-			priority = MM_BLENDV8(priority, priority1, selector);
-		}
-
-		/* save running best results and their priorities */
-		MM_STOREU(saved_results, results);
-		MM_STOREU(saved_priority, priority);
-	}
-
-	/* Count down completed tries for this search request */
-	parms[n].cmplt->count--;
-}
-
-/*
- * Routine to fill a slot in the parallel trie traversal array (parms) from
- * the list of packets (flows).
- */
-static inline uint64_t
-acl_start_next_trie(struct acl_flow_data *flows, struct parms *parms, int n,
-	const struct rte_acl_ctx *ctx)
-{
-	uint64_t transition;
-
-	/* if there are any more packets to process */
-	if (flows->num_packets < flows->total_packets) {
-		parms[n].data = flows->data[flows->num_packets];
-		parms[n].data_index = ctx->trie[flows->trie].data_index;
-
-		/* if this is the first trie for this packet */
-		if (flows->trie == 0) {
-			flows->last_cmplt = alloc_completion(flows->cmplt_array,
-				flows->cmplt_size, ctx->num_tries,
-				flows->results +
-				flows->num_packets * flows->categories);
-		}
-
-		/* set completion parameters and starting index for this slot */
-		parms[n].cmplt = flows->last_cmplt;
-		transition =
-			flows->trans[parms[n].data[*parms[n].data_index++] +
-			ctx->trie[flows->trie].root_index];
-
-		/*
-		 * if this is the last trie for this packet,
-		 * then setup next packet.
-		 */
-		flows->trie++;
-		if (flows->trie >= ctx->num_tries) {
-			flows->trie = 0;
-			flows->num_packets++;
-		}
-
-		/* keep track of number of active trie traversals */
-		flows->started++;
-
-	/* no more tries to process, set slot to an idle position */
-	} else {
-		transition = ctx->idle;
-		parms[n].data = (const uint8_t *)idle;
-		parms[n].data_index = idle;
-	}
-	return transition;
-}
-
-/*
- * Detect matches. If a match node transition is found, then this trie
- * traversal is complete and fill the slot with the next trie
- * to be processed.
- */
-static inline uint64_t
-acl_match_check_transition(uint64_t transition, int slot,
-	const struct rte_acl_ctx *ctx, struct parms *parms,
-	struct acl_flow_data *flows)
-{
-	const struct rte_acl_match_results *p;
-
-	p = (const struct rte_acl_match_results *)
-		(flows->trans + ctx->match_index);
-
-	if (transition & RTE_ACL_NODE_MATCH) {
-
-		/* Remove flags from index and decrement active traversals */
-		transition &= RTE_ACL_NODE_INDEX;
-		flows->started--;
-
-		/* Resolve priorities for this trie and running results */
-		if (flows->categories == 1)
-			resolve_single_priority(transition, slot, ctx,
-				parms, p);
-		else
-			resolve_priority(transition, slot, ctx, parms, p,
-				flows->categories);
-
-		/* Fill the slot with the next trie or idle trie */
-		transition = acl_start_next_trie(flows, parms, slot, ctx);
-
-	} else if (transition == ctx->idle) {
-		/* reset indirection table for idle slots */
-		parms[slot].data_index = idle;
-	}
-
-	return transition;
-}
-
-/*
- * Extract transitions from an XMM register and check for any matches
- */
-static void
-acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
-	struct parms *parms, struct acl_flow_data *flows)
-{
-	uint64_t transition1, transition2;
-
-	/* extract transition from low 64 bits. */
-	transition1 = MM_CVT64(*indicies);
-
-	/* extract transition from high 64 bits. */
-	*indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
-	transition2 = MM_CVT64(*indicies);
-
-	transition1 = acl_match_check_transition(transition1, slot, ctx,
-		parms, flows);
-	transition2 = acl_match_check_transition(transition2, slot + 1, ctx,
-		parms, flows);
-
-	/* update indicies with new transitions. */
-	*indicies = MM_SET64(transition2, transition1);
-}
-
-/*
- * Check for a match in 2 transitions (contained in SSE register)
- */
-static inline void
-acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
-	struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
-{
-	xmm_t temp;
-
-	temp = MM_AND(match_mask, *indicies);
-	while (!MM_TESTZ(temp, temp)) {
-		acl_process_matches(indicies, slot, ctx, parms, flows);
-		temp = MM_AND(match_mask, *indicies);
-	}
-}
-
-/*
- * Check for any match in 4 transitions (contained in 2 SSE registers)
- */
-static inline void
-acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
-	struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
-	xmm_t match_mask)
-{
-	xmm_t temp;
-
-	/* put low 32 bits of each transition into one register */
-	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
-		0x88);
-	/* test for match node */
-	temp = MM_AND(match_mask, temp);
-
-	while (!MM_TESTZ(temp, temp)) {
-		acl_process_matches(indicies1, slot, ctx, parms, flows);
-		acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
-
-		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
-					(__m128)*indicies2,
-					0x88);
-		temp = MM_AND(match_mask, temp);
-	}
-}
-
-/*
- * Calculate the address of the next transition for
- * all types of nodes. Note that only DFA nodes and range
- * nodes actually transition to another node. Match
- * nodes don't move.
- */
-static inline xmm_t
-acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
-	xmm_t *indicies1, xmm_t *indicies2)
-{
-	xmm_t addr, node_types, temp;
-
-	/*
-	 * Note that no transition is done for a match
-	 * node and therefore a stream freezes when
-	 * it reaches a match.
-	 */
-
-	/* Shuffle low 32 into temp and high 32 into indicies2 */
-	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
-		0x88);
-	*indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
-		(__m128)*indicies2, 0xdd);
-
-	/* Calc node type and node addr */
-	node_types = MM_ANDNOT(index_mask, temp);
-	addr = MM_AND(index_mask, temp);
-
-	/*
-	 * Calc addr for DFAs - addr = dfa_index + input_byte
-	 */
-
-	/* mask for DFA type (0) nodes */
-	temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
-
-	/* add input byte to DFA position */
-	temp = MM_AND(temp, bytes);
-	temp = MM_AND(temp, next_input);
-	addr = MM_ADD32(addr, temp);
-
-	/*
-	 * Calc addr for Range nodes -> range_index + range(input)
-	 */
-	node_types = MM_CMPEQ32(node_types, type_quad_range);
-
-	/*
-	 * Calculate number of range boundaries that are less than the
-	 * input value. Range boundaries for each node are in signed 8 bit,
-	 * ordered from -128 to 127 in the indicies2 register.
-	 * This is effectively a popcnt of bytes that are greater than the
-	 * input byte.
-	 */
-
-	/* shuffle input byte to all 4 positions of 32 bit value */
-	temp = MM_SHUFFLE8(next_input, shuffle_input);
-
-	/* check ranges */
-	temp = MM_CMPGT8(temp, *indicies2);
-
-	/* convert -1 to 1 (bytes greater than input byte */
-	temp = MM_SIGN8(temp, temp);
-
-	/* horizontal add pairs of bytes into words */
-	temp = MM_MADD8(temp, temp);
-
-	/* horizontal add pairs of words into dwords */
-	temp = MM_MADD16(temp, ones_16);
-
-	/* mask to range type nodes */
-	temp = MM_AND(temp, node_types);
-
-	/* add index into node position */
-	return MM_ADD32(addr, temp);
-}
-
-/*
- * Process 4 transitions (in 2 SIMD registers) in parallel
- */
-static inline xmm_t
-transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
-	const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
-{
-	xmm_t addr;
-	uint64_t trans0, trans2;
-
-	 /* Calculate the address (array index) for all 4 transitions. */
-
-	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
-		bytes, type_quad_range, indicies1, indicies2);
-
-	 /* Gather 64 bit transitions and pack back into 2 registers. */
-
-	trans0 = trans[MM_CVT32(addr)];
-
-	/* get slot 2 */
-
-	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
-	trans2 = trans[MM_CVT32(addr)];
-
-	/* get slot 1 */
-
-	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
-	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
-
-	/* get slot 3 */
-
-	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
-	*indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
-
-	return MM_SRL32(next_input, 8);
-}
-
-static inline void
-acl_set_flow(struct acl_flow_data *flows, struct completion *cmplt,
-	uint32_t cmplt_size, const uint8_t **data, uint32_t *results,
-	uint32_t data_num, uint32_t categories, const uint64_t *trans)
-{
-	flows->num_packets = 0;
-	flows->started = 0;
-	flows->trie = 0;
-	flows->last_cmplt = NULL;
-	flows->cmplt_array = cmplt;
-	flows->total_packets = data_num;
-	flows->categories = categories;
-	flows->cmplt_size = cmplt_size;
-	flows->data = data;
-	flows->results = results;
-	flows->trans = trans;
-}
-
-/*
- * Execute trie traversal with 8 traversals in parallel
- */
-static inline void
-search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t total_packets, uint32_t categories)
-{
-	int n;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SSE8];
-	struct completion cmplt[MAX_SEARCHES_SSE8];
-	struct parms parms[MAX_SEARCHES_SSE8];
-	xmm_t input0, input1;
-	xmm_t indicies1, indicies2, indicies3, indicies4;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
-		total_packets, categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	/*
-	 * indicies1 contains index_array[0,1]
-	 * indicies2 contains index_array[2,3]
-	 * indicies3 contains index_array[4,5]
-	 * indicies4 contains index_array[6,7]
-	 */
-
-	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
-	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
-
-	indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
-	indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
-
-	 /* Check for any matches. */
-	acl_match_check_x4(0, ctx, parms, &flows,
-		&indicies1, &indicies2, mm_match_mask.m);
-	acl_match_check_x4(4, ctx, parms, &flows,
-		&indicies3, &indicies4, mm_match_mask.m);
-
-	while (flows.started > 0) {
-
-		/* Gather 4 bytes of input data for each stream. */
-		input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
-			0);
-		input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
-			0);
-
-		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
-		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
-
-		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
-		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
-
-		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
-		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
-
-		 /* Process the 4 bytes of input on each stream. */
-
-		input0 = transition4(mm_index_mask.m, input0,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		input1 = transition4(mm_index_mask.m, input1,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies3, &indicies4);
-
-		input0 = transition4(mm_index_mask.m, input0,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		input1 = transition4(mm_index_mask.m, input1,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies3, &indicies4);
-
-		input0 = transition4(mm_index_mask.m, input0,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		input1 = transition4(mm_index_mask.m, input1,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies3, &indicies4);
-
-		input0 = transition4(mm_index_mask.m, input0,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		input1 = transition4(mm_index_mask.m, input1,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies3, &indicies4);
-
-		 /* Check for any matches. */
-		acl_match_check_x4(0, ctx, parms, &flows,
-			&indicies1, &indicies2, mm_match_mask.m);
-		acl_match_check_x4(4, ctx, parms, &flows,
-			&indicies3, &indicies4, mm_match_mask.m);
-	}
-}
-
-/*
- * Execute trie traversal with 4 traversals in parallel
- */
-static inline void
-search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	 uint32_t *results, int total_packets, uint32_t categories)
-{
-	int n;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SSE4];
-	struct completion cmplt[MAX_SEARCHES_SSE4];
-	struct parms parms[MAX_SEARCHES_SSE4];
-	xmm_t input, indicies1, indicies2;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
-		total_packets, categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
-	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
-
-	/* Check for any matches. */
-	acl_match_check_x4(0, ctx, parms, &flows,
-		&indicies1, &indicies2, mm_match_mask.m);
-
-	while (flows.started > 0) {
-
-		/* Gather 4 bytes of input data for each stream. */
-		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
-
-		/* Process the 4 bytes of input on each stream. */
-		input = transition4(mm_index_mask.m, input,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		 input = transition4(mm_index_mask.m, input,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		 input = transition4(mm_index_mask.m, input,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		 input = transition4(mm_index_mask.m, input,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		/* Check for any matches. */
-		acl_match_check_x4(0, ctx, parms, &flows,
-			&indicies1, &indicies2, mm_match_mask.m);
-	}
-}
-
-static inline xmm_t
-transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
-	const uint64_t *trans, xmm_t *indicies1)
-{
-	uint64_t t;
-	xmm_t addr, indicies2;
-
-	indicies2 = MM_XOR(ones_16, ones_16);
-
-	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
-		bytes, type_quad_range, indicies1, &indicies2);
-
-	/* Gather 64 bit transitions and pack 2 per register. */
-
-	t = trans[MM_CVT32(addr)];
-
-	/* get slot 1 */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
-	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
-
-	return MM_SRL32(next_input, 8);
-}
-
-/*
- * Execute trie traversal with 2 traversals in parallel.
- */
-static inline void
-search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t total_packets, uint32_t categories)
-{
-	int n;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SSE2];
-	struct completion cmplt[MAX_SEARCHES_SSE2];
-	struct parms parms[MAX_SEARCHES_SSE2];
-	xmm_t input, indicies;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
-		total_packets, categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	indicies = MM_LOADU((xmm_t *) &index_array[0]);
-
-	/* Check for any matches. */
-	acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
-
-	while (flows.started > 0) {
-
-		/* Gather 4 bytes of input data for each stream. */
-		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
-
-		/* Process the 4 bytes of input on each stream. */
-
-		input = transition2(mm_index_mask64.m, input,
-			mm_shuffle_input64.m, mm_ones_16.m,
-			mm_bytes64.m, mm_type_quad_range64.m,
-			flows.trans, &indicies);
-
-		input = transition2(mm_index_mask64.m, input,
-			mm_shuffle_input64.m, mm_ones_16.m,
-			mm_bytes64.m, mm_type_quad_range64.m,
-			flows.trans, &indicies);
-
-		input = transition2(mm_index_mask64.m, input,
-			mm_shuffle_input64.m, mm_ones_16.m,
-			mm_bytes64.m, mm_type_quad_range64.m,
-			flows.trans, &indicies);
-
-		input = transition2(mm_index_mask64.m, input,
-			mm_shuffle_input64.m, mm_ones_16.m,
-			mm_bytes64.m, mm_type_quad_range64.m,
-			flows.trans, &indicies);
-
-		/* Check for any matches. */
-		acl_match_check_x2(0, ctx, parms, &flows, &indicies,
-			mm_match_mask64.m);
-	}
-}
-
-/*
- * When processing the transition, rather than using if/else
- * construct, the offset is calculated for DFA and QRANGE and
- * then conditionally added to the address based on node type.
- * This is done to avoid branch mis-predictions. Since the
- * offset is rather simple calculation it is more efficient
- * to do the calculation and do a condition move rather than
- * a conditional branch to determine which calculation to do.
- */
-static inline uint32_t
-scan_forward(uint32_t input, uint32_t max)
-{
-	return (input == 0) ? max : rte_bsf32(input);
-}
-
-static inline uint64_t
-scalar_transition(const uint64_t *trans_table, uint64_t transition,
-	uint8_t input)
-{
-	uint32_t addr, index, ranges, x, a, b, c;
-
-	/* break transition into component parts */
-	ranges = transition >> (sizeof(index) * CHAR_BIT);
-
-	/* calc address for a QRANGE node */
-	c = input * SCALAR_QRANGE_MULT;
-	a = ranges | SCALAR_QRANGE_MIN;
-	index = transition & ~RTE_ACL_NODE_INDEX;
-	a -= (c & SCALAR_QRANGE_MASK);
-	b = c & SCALAR_QRANGE_MIN;
-	addr = transition ^ index;
-	a &= SCALAR_QRANGE_MIN;
-	a ^= (ranges ^ b) & (a ^ b);
-	x = scan_forward(a, 32) >> 3;
-	addr += (index == RTE_ACL_NODE_DFA) ? input : x;
-
-	/* pickup next transition */
-	transition = *(trans_table + addr);
-	return transition;
-}
-
-int
-rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t num, uint32_t categories)
-{
-	int n;
-	uint64_t transition0, transition1;
-	uint32_t input0, input1;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SCALAR];
-	struct completion cmplt[MAX_SEARCHES_SCALAR];
-	struct parms parms[MAX_SEARCHES_SCALAR];
-
-	if (categories != 1 &&
-		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
-		return -EINVAL;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, num,
-		categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SCALAR; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	transition0 = index_array[0];
-	transition1 = index_array[1];
-
-	while (flows.started > 0) {
-
-		input0 = GET_NEXT_4BYTES(parms, 0);
-		input1 = GET_NEXT_4BYTES(parms, 1);
-
-		for (n = 0; n < 4; n++) {
-			if (likely((transition0 & RTE_ACL_NODE_MATCH) == 0))
-				transition0 = scalar_transition(flows.trans,
-					transition0, (uint8_t)input0);
-
-			input0 >>= CHAR_BIT;
-
-			if (likely((transition1 & RTE_ACL_NODE_MATCH) == 0))
-				transition1 = scalar_transition(flows.trans,
-					transition1, (uint8_t)input1);
-
-			input1 >>= CHAR_BIT;
-
-		}
-		if ((transition0 | transition1) & RTE_ACL_NODE_MATCH) {
-			transition0 = acl_match_check_transition(transition0,
-				0, ctx, parms, &flows);
-			transition1 = acl_match_check_transition(transition1,
-				1, ctx, parms, &flows);
-
-		}
-	}
-	return 0;
-}
-
-int
-rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t num, uint32_t categories)
-{
-	if (categories != 1 &&
-		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
-		return -EINVAL;
-
-	if (likely(num >= MAX_SEARCHES_SSE8))
-		search_sse_8(ctx, data, results, num, categories);
-	else if (num >= MAX_SEARCHES_SSE4)
-		search_sse_4(ctx, data, results, num, categories);
-	else
-		search_sse_2(ctx, data, results, num, categories);
-
-	return 0;
-}
diff --git a/lib/librte_acl/acl_run.h b/lib/librte_acl/acl_run.h
new file mode 100644
index 0000000..c191053
--- /dev/null
+++ b/lib/librte_acl/acl_run.h
@@ -0,0 +1,268 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef	_ACL_RUN_H_
+#define	_ACL_RUN_H_
+
+#include <rte_acl.h>
+#include "acl_vect.h"
+#include "acl.h"
+
+#define MAX_SEARCHES_SSE8	8
+#define MAX_SEARCHES_SSE4	4
+#define MAX_SEARCHES_SSE2	2
+#define MAX_SEARCHES_SCALAR	2
+
+#define GET_NEXT_4BYTES(prm, idx)	\
+	(*((const int32_t *)((prm)[(idx)].data + *(prm)[idx].data_index++)))
+
+
+#define RTE_ACL_NODE_INDEX	((uint32_t)~RTE_ACL_NODE_TYPE)
+
+#define	SCALAR_QRANGE_MULT	0x01010101
+#define	SCALAR_QRANGE_MASK	0x7f7f7f7f
+#define	SCALAR_QRANGE_MIN	0x80808080
+
+/*
+ * Structure to manage N parallel trie traversals.
+ * The runtime trie traversal routines can process 8, 4, or 2 tries
+ * in parallel. Each packet may require multiple trie traversals (up to 4).
+ * This structure is used to fill the slots (0 to n-1) for parallel processing
+ * with the trie traversals needed for each packet.
+ */
+struct acl_flow_data {
+	uint32_t            num_packets;
+	/* number of packets processed */
+	uint32_t            started;
+	/* number of trie traversals in progress */
+	uint32_t            trie;
+	/* current trie index (0 to N-1) */
+	uint32_t            cmplt_size;
+	uint32_t            total_packets;
+	/* maximum number of packets to process */
+	uint32_t            categories;
+	/* number of result categories per packet. */
+	const uint64_t     *trans;
+	const uint8_t     **data;
+	uint32_t           *results;
+	struct completion  *last_cmplt;
+	struct completion  *cmplt_array;
+};
+
+/*
+ * Structure to maintain running results for
+ * a single packet (up to 4 tries).
+ */
+struct completion {
+	uint32_t *results;                          /* running results. */
+	int32_t   priority[RTE_ACL_MAX_CATEGORIES]; /* running priorities. */
+	uint32_t  count;                            /* num of remaining tries */
+	/* a non-zero count marks the structure as allocated */
+} __attribute__((aligned(XMM_SIZE)));
+
+/*
+ * One parms structure for each slot in the search engine.
+ */
+struct parms {
+	const uint8_t              *data;
+	/* input data for this packet */
+	const uint32_t             *data_index;
+	/* data indirection for this trie */
+	struct completion          *cmplt;
+	/* completion data for this packet */
+};
+
+/*
+ * Define a global idle node for unused engine slots
+ */
+static const uint32_t idle[UINT8_MAX + 1];
+
+/*
+ * Allocate a completion structure to manage the tries for a packet.
+ */
+static inline struct completion *
+alloc_completion(struct completion *p, uint32_t size, uint32_t tries,
+	uint32_t *results)
+{
+	uint32_t n;
+
+	for (n = 0; n < size; n++) {
+
+		if (p[n].count == 0) {
+
+			/* mark as allocated and set number of tries. */
+			p[n].count = tries;
+			p[n].results = results;
+			return &(p[n]);
+		}
+	}
+
+	/* should never get here */
+	return NULL;
+}
+
+/*
+ * Resolve priority for a single result trie.
+ */
+static inline void
+resolve_single_priority(uint64_t transition, int n,
+	const struct rte_acl_ctx *ctx, struct parms *parms,
+	const struct rte_acl_match_results *p)
+{
+	if (parms[n].cmplt->count == ctx->num_tries ||
+			parms[n].cmplt->priority[0] <=
+			p[transition].priority[0]) {
+
+		parms[n].cmplt->priority[0] = p[transition].priority[0];
+		parms[n].cmplt->results[0] = p[transition].results[0];
+	}
+}
+
+/*
+ * Routine to fill a slot in the parallel trie traversal array (parms) from
+ * the list of packets (flows).
+ */
+static inline uint64_t
+acl_start_next_trie(struct acl_flow_data *flows, struct parms *parms, int n,
+	const struct rte_acl_ctx *ctx)
+{
+	uint64_t transition;
+
+	/* if there are any more packets to process */
+	if (flows->num_packets < flows->total_packets) {
+		parms[n].data = flows->data[flows->num_packets];
+		parms[n].data_index = ctx->trie[flows->trie].data_index;
+
+		/* if this is the first trie for this packet */
+		if (flows->trie == 0) {
+			flows->last_cmplt = alloc_completion(flows->cmplt_array,
+				flows->cmplt_size, ctx->num_tries,
+				flows->results +
+				flows->num_packets * flows->categories);
+		}
+
+		/* set completion parameters and starting index for this slot */
+		parms[n].cmplt = flows->last_cmplt;
+		transition =
+			flows->trans[parms[n].data[*parms[n].data_index++] +
+			ctx->trie[flows->trie].root_index];
+
+		/*
+		 * if this is the last trie for this packet,
+		 * then set up the next packet.
+		 */
+		flows->trie++;
+		if (flows->trie >= ctx->num_tries) {
+			flows->trie = 0;
+			flows->num_packets++;
+		}
+
+		/* keep track of number of active trie traversals */
+		flows->started++;
+
+	/* no more tries to process, set slot to an idle position */
+	} else {
+		transition = ctx->idle;
+		parms[n].data = (const uint8_t *)idle;
+		parms[n].data_index = idle;
+	}
+	return transition;
+}
+
+static inline void
+acl_set_flow(struct acl_flow_data *flows, struct completion *cmplt,
+	uint32_t cmplt_size, const uint8_t **data, uint32_t *results,
+	uint32_t data_num, uint32_t categories, const uint64_t *trans)
+{
+	flows->num_packets = 0;
+	flows->started = 0;
+	flows->trie = 0;
+	flows->last_cmplt = NULL;
+	flows->cmplt_array = cmplt;
+	flows->total_packets = data_num;
+	flows->categories = categories;
+	flows->cmplt_size = cmplt_size;
+	flows->data = data;
+	flows->results = results;
+	flows->trans = trans;
+}
+
+typedef void (*resolve_priority_t)
+(uint64_t transition, int n, const struct rte_acl_ctx *ctx,
+	struct parms *parms, const struct rte_acl_match_results *p,
+	uint32_t categories);
+
+/*
+ * Detect matches. If a match node transition is found, then this trie
+ * traversal is complete; fill the slot with the next trie
+ * to be processed.
+ */
+static inline uint64_t
+acl_match_check(uint64_t transition, int slot,
+	const struct rte_acl_ctx *ctx, struct parms *parms,
+	struct acl_flow_data *flows, resolve_priority_t resolve_priority)
+{
+	const struct rte_acl_match_results *p;
+
+	p = (const struct rte_acl_match_results *)
+		(flows->trans + ctx->match_index);
+
+	if (transition & RTE_ACL_NODE_MATCH) {
+
+		/* Remove flags from index and decrement active traversals */
+		transition &= RTE_ACL_NODE_INDEX;
+		flows->started--;
+
+		/* Resolve priorities for this trie and running results */
+		if (flows->categories == 1)
+			resolve_single_priority(transition, slot, ctx,
+				parms, p);
+		else
+			resolve_priority(transition, slot, ctx, parms,
+				p, flows->categories);
+
+		/* Count down completed tries for this search request */
+		parms[slot].cmplt->count--;
+
+		/* Fill the slot with the next trie or idle trie */
+		transition = acl_start_next_trie(flows, parms, slot, ctx);
+
+	} else if (transition == ctx->idle) {
+		/* reset indirection table for idle slots */
+		parms[slot].data_index = idle;
+	}
+
+	return transition;
+}
+
+#endif /* _ACL_RUN_H_ */
diff --git a/lib/librte_acl/acl_run_scalar.c b/lib/librte_acl/acl_run_scalar.c
new file mode 100644
index 0000000..43c8fc3
--- /dev/null
+++ b/lib/librte_acl/acl_run_scalar.c
@@ -0,0 +1,193 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "acl_run.h"
+
+/*
+ * Resolve priority for multiple results (scalar version).
+ * This consists of comparing the priority of the current traversal with the
+ * running set of results for the packet.
+ * For each result, keep a running array of the result (rule number) and
+ * its priority for each category.
+ */
+static inline void
+resolve_priority_scalar(uint64_t transition, int n,
+	const struct rte_acl_ctx *ctx, struct parms *parms,
+	const struct rte_acl_match_results *p, uint32_t categories)
+{
+	uint32_t i;
+	int32_t *saved_priority;
+	uint32_t *saved_results;
+	const int32_t *priority;
+	const uint32_t *results;
+
+	saved_results = parms[n].cmplt->results;
+	saved_priority = parms[n].cmplt->priority;
+
+	/* results and priorities for completed trie */
+	results = p[transition].results;
+	priority = p[transition].priority;
+
+	/* if this is not the first completed trie */
+	if (parms[n].cmplt->count != ctx->num_tries) {
+		for (i = 0; i < categories; i += RTE_ACL_RESULTS_MULTIPLIER) {
+
+			if (saved_priority[i] <= priority[i]) {
+				saved_priority[i] = priority[i];
+				saved_results[i] = results[i];
+			}
+			if (saved_priority[i + 1] <= priority[i + 1]) {
+				saved_priority[i + 1] = priority[i + 1];
+				saved_results[i + 1] = results[i + 1];
+			}
+			if (saved_priority[i + 2] <= priority[i + 2]) {
+				saved_priority[i + 2] = priority[i + 2];
+				saved_results[i + 2] = results[i + 2];
+			}
+			if (saved_priority[i + 3] <= priority[i + 3]) {
+				saved_priority[i + 3] = priority[i + 3];
+				saved_results[i + 3] = results[i + 3];
+			}
+		}
+	} else {
+		for (i = 0; i < categories; i += RTE_ACL_RESULTS_MULTIPLIER) {
+			saved_priority[i] = priority[i];
+			saved_priority[i + 1] = priority[i + 1];
+			saved_priority[i + 2] = priority[i + 2];
+			saved_priority[i + 3] = priority[i + 3];
+
+			saved_results[i] = results[i];
+			saved_results[i + 1] = results[i + 1];
+			saved_results[i + 2] = results[i + 2];
+			saved_results[i + 3] = results[i + 3];
+		}
+	}
+}
+
+/*
+ * When processing the transition, rather than using if/else
+ * construct, the offset is calculated for DFA and QRANGE and
+ * then conditionally added to the address based on node type.
+ * This is done to avoid branch mis-predictions. Since the
+ * offset is a rather simple calculation, it is more efficient
+ * to do the calculation and use a conditional move rather than
+ * a conditional branch to determine which calculation to do.
+ */
+static inline uint32_t
+scan_forward(uint32_t input, uint32_t max)
+{
+	return (input == 0) ? max : rte_bsf32(input);
+}
+
+static inline uint64_t
+scalar_transition(const uint64_t *trans_table, uint64_t transition,
+	uint8_t input)
+{
+	uint32_t addr, index, ranges, x, a, b, c;
+
+	/* break transition into component parts */
+	ranges = transition >> (sizeof(index) * CHAR_BIT);
+
+	/* calc address for a QRANGE node */
+	c = input * SCALAR_QRANGE_MULT;
+	a = ranges | SCALAR_QRANGE_MIN;
+	index = transition & ~RTE_ACL_NODE_INDEX;
+	a -= (c & SCALAR_QRANGE_MASK);
+	b = c & SCALAR_QRANGE_MIN;
+	addr = transition ^ index;
+	a &= SCALAR_QRANGE_MIN;
+	a ^= (ranges ^ b) & (a ^ b);
+	x = scan_forward(a, 32) >> 3;
+	addr += (index == RTE_ACL_NODE_DFA) ? input : x;
+
+	/* pickup next transition */
+	transition = *(trans_table + addr);
+	return transition;
+}
+
+int
+rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories)
+{
+	int n;
+	uint64_t transition0, transition1;
+	uint32_t input0, input1;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SCALAR];
+	struct completion cmplt[MAX_SEARCHES_SCALAR];
+	struct parms parms[MAX_SEARCHES_SCALAR];
+
+	if (categories != 1 &&
+		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
+		return -EINVAL;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, num,
+		categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SCALAR; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	transition0 = index_array[0];
+	transition1 = index_array[1];
+
+	while (flows.started > 0) {
+
+		input0 = GET_NEXT_4BYTES(parms, 0);
+		input1 = GET_NEXT_4BYTES(parms, 1);
+
+		for (n = 0; n < 4; n++) {
+			if (likely((transition0 & RTE_ACL_NODE_MATCH) == 0))
+				transition0 = scalar_transition(flows.trans,
+					transition0, (uint8_t)input0);
+
+			input0 >>= CHAR_BIT;
+
+			if (likely((transition1 & RTE_ACL_NODE_MATCH) == 0))
+				transition1 = scalar_transition(flows.trans,
+					transition1, (uint8_t)input1);
+
+			input1 >>= CHAR_BIT;
+
+		}
+		if ((transition0 | transition1) & RTE_ACL_NODE_MATCH) {
+			transition0 = acl_match_check(transition0,
+				0, ctx, parms, &flows, resolve_priority_scalar);
+			transition1 = acl_match_check(transition1,
+				1, ctx, parms, &flows, resolve_priority_scalar);
+
+		}
+	}
+	return 0;
+}
diff --git a/lib/librte_acl/acl_run_sse.c b/lib/librte_acl/acl_run_sse.c
new file mode 100644
index 0000000..4f3f115
--- /dev/null
+++ b/lib/librte_acl/acl_run_sse.c
@@ -0,0 +1,626 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "acl_run.h"
+
+enum {
+	SHUFFLE32_SLOT1 = 0xe5,
+	SHUFFLE32_SLOT2 = 0xe6,
+	SHUFFLE32_SLOT3 = 0xe7,
+	SHUFFLE32_SWAP64 = 0x4e,
+};
+
+static const rte_xmm_t mm_type_quad_range = {
+	.u32 = {
+		RTE_ACL_NODE_QRANGE,
+		RTE_ACL_NODE_QRANGE,
+		RTE_ACL_NODE_QRANGE,
+		RTE_ACL_NODE_QRANGE,
+	},
+};
+
+static const rte_xmm_t mm_type_quad_range64 = {
+	.u32 = {
+		RTE_ACL_NODE_QRANGE,
+		RTE_ACL_NODE_QRANGE,
+		0,
+		0,
+	},
+};
+
+static const rte_xmm_t mm_shuffle_input = {
+	.u32 = {0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c},
+};
+
+static const rte_xmm_t mm_shuffle_input64 = {
+	.u32 = {0x00000000, 0x04040404, 0x80808080, 0x80808080},
+};
+
+static const rte_xmm_t mm_ones_16 = {
+	.u16 = {1, 1, 1, 1, 1, 1, 1, 1},
+};
+
+static const rte_xmm_t mm_bytes = {
+	.u32 = {UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX},
+};
+
+static const rte_xmm_t mm_bytes64 = {
+	.u32 = {UINT8_MAX, UINT8_MAX, 0, 0},
+};
+
+static const rte_xmm_t mm_match_mask = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+	},
+};
+
+static const rte_xmm_t mm_match_mask64 = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		0,
+		RTE_ACL_NODE_MATCH,
+		0,
+	},
+};
+
+static const rte_xmm_t mm_index_mask = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+	},
+};
+
+static const rte_xmm_t mm_index_mask64 = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		0,
+		0,
+	},
+};
+
+
+/*
+ * Resolve priority for multiple results (sse version).
+ * This consists of comparing the priority of the current traversal with the
+ * running set of results for the packet.
+ * For each result, keep a running array of the result (rule number) and
+ * its priority for each category.
+ */
+static inline void
+resolve_priority_sse(uint64_t transition, int n, const struct rte_acl_ctx *ctx,
+	struct parms *parms, const struct rte_acl_match_results *p,
+	uint32_t categories)
+{
+	uint32_t x;
+	xmm_t results, priority, results1, priority1, selector;
+	xmm_t *saved_results, *saved_priority;
+
+	for (x = 0; x < categories; x += RTE_ACL_RESULTS_MULTIPLIER) {
+
+		saved_results = (xmm_t *)(&parms[n].cmplt->results[x]);
+		saved_priority =
+			(xmm_t *)(&parms[n].cmplt->priority[x]);
+
+		/* get results and priorities for completed trie */
+		results = MM_LOADU((const xmm_t *)&p[transition].results[x]);
+		priority = MM_LOADU((const xmm_t *)&p[transition].priority[x]);
+
+		/* if this is not the first completed trie */
+		if (parms[n].cmplt->count != ctx->num_tries) {
+
+			/* get running best results and their priorities */
+			results1 = MM_LOADU(saved_results);
+			priority1 = MM_LOADU(saved_priority);
+
+			/* select results that are highest priority */
+			selector = MM_CMPGT32(priority1, priority);
+			results = MM_BLENDV8(results, results1, selector);
+			priority = MM_BLENDV8(priority, priority1, selector);
+		}
+
+		/* save running best results and their priorities */
+		MM_STOREU(saved_results, results);
+		MM_STOREU(saved_priority, priority);
+	}
+}
+
+/*
+ * Extract transitions from an XMM register and check for any matches
+ */
+static void
+acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
+	struct parms *parms, struct acl_flow_data *flows)
+{
+	uint64_t transition1, transition2;
+
+	/* extract transition from low 64 bits. */
+	transition1 = MM_CVT64(*indicies);
+
+	/* extract transition from high 64 bits. */
+	*indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
+	transition2 = MM_CVT64(*indicies);
+
+	transition1 = acl_match_check(transition1, slot, ctx,
+		parms, flows, resolve_priority_sse);
+	transition2 = acl_match_check(transition2, slot + 1, ctx,
+		parms, flows, resolve_priority_sse);
+
+	/* update indicies with new transitions. */
+	*indicies = MM_SET64(transition2, transition1);
+}
+
+/*
+ * Check for a match in 2 transitions (contained in SSE register)
+ */
+static inline void
+acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
+	struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
+{
+	xmm_t temp;
+
+	temp = MM_AND(match_mask, *indicies);
+	while (!MM_TESTZ(temp, temp)) {
+		acl_process_matches(indicies, slot, ctx, parms, flows);
+		temp = MM_AND(match_mask, *indicies);
+	}
+}
+
+/*
+ * Check for any match in 4 transitions (contained in 2 SSE registers)
+ */
+static inline void
+acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
+	struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
+	xmm_t match_mask)
+{
+	xmm_t temp;
+
+	/* put low 32 bits of each transition into one register */
+	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
+		0x88);
+	/* test for match node */
+	temp = MM_AND(match_mask, temp);
+
+	while (!MM_TESTZ(temp, temp)) {
+		acl_process_matches(indicies1, slot, ctx, parms, flows);
+		acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
+
+		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
+					(__m128)*indicies2,
+					0x88);
+		temp = MM_AND(match_mask, temp);
+	}
+}
+
+/*
+ * Calculate the address of the next transition for
+ * all types of nodes. Note that only DFA nodes and range
+ * nodes actually transition to another node. Match
+ * nodes don't move.
+ */
+static inline xmm_t
+acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
+	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
+	xmm_t *indicies1, xmm_t *indicies2)
+{
+	xmm_t addr, node_types, temp;
+
+	/*
+	 * Note that no transition is done for a match
+	 * node and therefore a stream freezes when
+	 * it reaches a match.
+	 */
+
+	/* Shuffle low 32 into temp and high 32 into indicies2 */
+	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
+		0x88);
+	*indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
+		(__m128)*indicies2, 0xdd);
+
+	/* Calc node type and node addr */
+	node_types = MM_ANDNOT(index_mask, temp);
+	addr = MM_AND(index_mask, temp);
+
+	/*
+	 * Calc addr for DFAs - addr = dfa_index + input_byte
+	 */
+
+	/* mask for DFA type (0) nodes */
+	temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
+
+	/* add input byte to DFA position */
+	temp = MM_AND(temp, bytes);
+	temp = MM_AND(temp, next_input);
+	addr = MM_ADD32(addr, temp);
+
+	/*
+	 * Calc addr for Range nodes -> range_index + range(input)
+	 */
+	node_types = MM_CMPEQ32(node_types, type_quad_range);
+
+	/*
+	 * Calculate number of range boundaries that are less than the
+	 * input value. Range boundaries for each node are in signed 8 bit,
+	 * ordered from -128 to 127 in the indicies2 register.
+	 * This is effectively a popcnt of the boundary bytes that the
+	 * input byte is greater than.
+	 */
+
+	/* shuffle input byte to all 4 positions of 32 bit value */
+	temp = MM_SHUFFLE8(next_input, shuffle_input);
+
+	/* check ranges */
+	temp = MM_CMPGT8(temp, *indicies2);
+
+	/* convert -1 to 1 (boundary bytes less than the input byte) */
+	temp = MM_SIGN8(temp, temp);
+
+	/* horizontal add pairs of bytes into words */
+	temp = MM_MADD8(temp, temp);
+
+	/* horizontal add pairs of words into dwords */
+	temp = MM_MADD16(temp, ones_16);
+
+	/* mask to range type nodes */
+	temp = MM_AND(temp, node_types);
+
+	/* add index into node position */
+	return MM_ADD32(addr, temp);
+}
+
+/*
+ * Process 4 transitions (in 2 SIMD registers) in parallel
+ */
+static inline xmm_t
+transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
+	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
+	const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
+{
+	xmm_t addr;
+	uint64_t trans0, trans2;
+
+	 /* Calculate the address (array index) for all 4 transitions. */
+
+	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
+		bytes, type_quad_range, indicies1, indicies2);
+
+	 /* Gather 64 bit transitions and pack back into 2 registers. */
+
+	trans0 = trans[MM_CVT32(addr)];
+
+	/* get slot 2 */
+
+	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
+	trans2 = trans[MM_CVT32(addr)];
+
+	/* get slot 1 */
+
+	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
+	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
+
+	/* get slot 3 */
+
+	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
+	*indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
+
+	return MM_SRL32(next_input, 8);
+}
+
+/*
+ * Execute trie traversal with 8 traversals in parallel
+ */
+static inline int
+search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	int n;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SSE8];
+	struct completion cmplt[MAX_SEARCHES_SSE8];
+	struct parms parms[MAX_SEARCHES_SSE8];
+	xmm_t input0, input1;
+	xmm_t indicies1, indicies2, indicies3, indicies4;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+		total_packets, categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	/*
+	 * indicies1 contains index_array[0,1]
+	 * indicies2 contains index_array[2,3]
+	 * indicies3 contains index_array[4,5]
+	 * indicies4 contains index_array[6,7]
+	 */
+
+	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
+	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
+
+	indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
+	indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
+
+	 /* Check for any matches. */
+	acl_match_check_x4(0, ctx, parms, &flows,
+		&indicies1, &indicies2, mm_match_mask.m);
+	acl_match_check_x4(4, ctx, parms, &flows,
+		&indicies3, &indicies4, mm_match_mask.m);
+
+	while (flows.started > 0) {
+
+		/* Gather 4 bytes of input data for each stream. */
+		input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
+			0);
+		input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
+			0);
+
+		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
+		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
+
+		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
+		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
+
+		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
+		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
+
+		 /* Process the 4 bytes of input on each stream. */
+
+		input0 = transition4(mm_index_mask.m, input0,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		input1 = transition4(mm_index_mask.m, input1,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies3, &indicies4);
+
+		input0 = transition4(mm_index_mask.m, input0,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		input1 = transition4(mm_index_mask.m, input1,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies3, &indicies4);
+
+		input0 = transition4(mm_index_mask.m, input0,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		input1 = transition4(mm_index_mask.m, input1,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies3, &indicies4);
+
+		input0 = transition4(mm_index_mask.m, input0,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		input1 = transition4(mm_index_mask.m, input1,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies3, &indicies4);
+
+		 /* Check for any matches. */
+		acl_match_check_x4(0, ctx, parms, &flows,
+			&indicies1, &indicies2, mm_match_mask.m);
+		acl_match_check_x4(4, ctx, parms, &flows,
+			&indicies3, &indicies4, mm_match_mask.m);
+	}
+
+	return 0;
+}
+
+/*
+ * Execute trie traversal with 4 traversals in parallel
+ */
+static inline int
+search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	int n;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SSE4];
+	struct completion cmplt[MAX_SEARCHES_SSE4];
+	struct parms parms[MAX_SEARCHES_SSE4];
+	xmm_t input, indicies1, indicies2;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+		total_packets, categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
+	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
+
+	/* Check for any matches. */
+	acl_match_check_x4(0, ctx, parms, &flows,
+		&indicies1, &indicies2, mm_match_mask.m);
+
+	while (flows.started > 0) {
+
+		/* Gather 4 bytes of input data for each stream. */
+		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
+
+		/* Process the 4 bytes of input on each stream. */
+		input = transition4(mm_index_mask.m, input,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		input = transition4(mm_index_mask.m, input,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		input = transition4(mm_index_mask.m, input,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		input = transition4(mm_index_mask.m, input,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		/* Check for any matches. */
+		acl_match_check_x4(0, ctx, parms, &flows,
+			&indicies1, &indicies2, mm_match_mask.m);
+	}
+
+	return 0;
+}
+
+static inline xmm_t
+transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
+	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
+	const uint64_t *trans, xmm_t *indicies1)
+{
+	uint64_t t;
+	xmm_t addr, indicies2;
+
+	indicies2 = MM_XOR(ones_16, ones_16);
+
+	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
+		bytes, type_quad_range, indicies1, &indicies2);
+
+	/* Gather 64 bit transitions and pack 2 per register. */
+
+	t = trans[MM_CVT32(addr)];
+
+	/* get slot 1 */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
+	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
+
+	return MM_SRL32(next_input, 8);
+}
+
+/*
+ * Execute trie traversal with 2 traversals in parallel.
+ */
+static inline int
+search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	int n;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SSE2];
+	struct completion cmplt[MAX_SEARCHES_SSE2];
+	struct parms parms[MAX_SEARCHES_SSE2];
+	xmm_t input, indicies;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+		total_packets, categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	indicies = MM_LOADU((xmm_t *) &index_array[0]);
+
+	/* Check for any matches. */
+	acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
+
+	while (flows.started > 0) {
+
+		/* Gather 4 bytes of input data for each stream. */
+		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
+
+		/* Process the 4 bytes of input on each stream. */
+
+		input = transition2(mm_index_mask64.m, input,
+			mm_shuffle_input64.m, mm_ones_16.m,
+			mm_bytes64.m, mm_type_quad_range64.m,
+			flows.trans, &indicies);
+
+		input = transition2(mm_index_mask64.m, input,
+			mm_shuffle_input64.m, mm_ones_16.m,
+			mm_bytes64.m, mm_type_quad_range64.m,
+			flows.trans, &indicies);
+
+		input = transition2(mm_index_mask64.m, input,
+			mm_shuffle_input64.m, mm_ones_16.m,
+			mm_bytes64.m, mm_type_quad_range64.m,
+			flows.trans, &indicies);
+
+		input = transition2(mm_index_mask64.m, input,
+			mm_shuffle_input64.m, mm_ones_16.m,
+			mm_bytes64.m, mm_type_quad_range64.m,
+			flows.trans, &indicies);
+
+		/* Check for any matches. */
+		acl_match_check_x2(0, ctx, parms, &flows, &indicies,
+			mm_match_mask64.m);
+	}
+
+	return 0;
+}
+
+int
+rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories)
+{
+	if (categories != 1 &&
+		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
+		return -EINVAL;
+
+	if (likely(num >= MAX_SEARCHES_SSE8))
+		return search_sse_8(ctx, data, results, num, categories);
+	else if (num >= MAX_SEARCHES_SSE4)
+		return search_sse_4(ctx, data, results, num, categories);
+	else
+		return search_sse_2(ctx, data, results, num, categories);
+}
diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 7c288bd..ea23220 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -38,6 +38,58 @@
 
 TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
 
+static const rte_acl_classify_t classify_fns[] = {
+	[RTE_ACL_CLASSIFY_DEFAULT] = rte_acl_classify_scalar,
+	[RTE_ACL_CLASSIFY_SCALAR] = rte_acl_classify_scalar,
+	[RTE_ACL_CLASSIFY_SSE] = rte_acl_classify_sse,
+};
+
+/* by default, use the always-available scalar code path. */
+static enum rte_acl_classify_alg rte_acl_default_classify =
+	RTE_ACL_CLASSIFY_SCALAR;
+
+static void
+rte_acl_set_default_classify(enum rte_acl_classify_alg alg)
+{
+	rte_acl_default_classify = alg;
+}
+
+extern int
+rte_acl_set_ctx_classify(struct rte_acl_ctx *ctx, enum rte_acl_classify_alg alg)
+{
+	if (ctx == NULL || (uint32_t)alg >= RTE_DIM(classify_fns))
+		return -EINVAL;
+
+	ctx->alg = alg;
+	return 0;
+}
+
+static void __attribute__((constructor))
+rte_acl_init(void)
+{
+	enum rte_acl_classify_alg alg = RTE_ACL_CLASSIFY_DEFAULT;
+
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
+		alg = RTE_ACL_CLASSIFY_SSE;
+
+	rte_acl_set_default_classify(alg);
+}
+
+int
+rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories)
+{
+	return classify_fns[ctx->alg](ctx, data, results, num, categories);
+}
+
+int
+rte_acl_classify_alg(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories,
+	enum rte_acl_classify_alg alg)
+{
+	return classify_fns[alg](ctx, data, results, num, categories);
+}
+
 struct rte_acl_ctx *
 rte_acl_find_existing(const char *name)
 {
@@ -165,6 +217,7 @@ rte_acl_create(const struct rte_acl_param *param)
 		ctx->max_rules = param->max_rule_num;
 		ctx->rule_sz = param->rule_size;
 		ctx->socket_id = param->socket_id;
+		ctx->alg = rte_acl_default_classify;
 		snprintf(ctx->name, sizeof(ctx->name), "%s", param->name);
 
 		te->data = (void *) ctx;
@@ -261,6 +314,8 @@ rte_acl_dump(const struct rte_acl_ctx *ctx)
 	if (!ctx)
 		return;
 	printf("acl context <%s>@%p\n", ctx->name, ctx);
+	printf("  socket_id=%"PRId32"\n", ctx->socket_id);
+	printf("  alg=%"PRId32"\n", ctx->alg);
 	printf("  max_rules=%"PRIu32"\n", ctx->max_rules);
 	printf("  rule_size=%"PRIu32"\n", ctx->rule_sz);
 	printf("  num_rules=%"PRIu32"\n", ctx->num_rules);
diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
index afc0f69..0e82339 100644
--- a/lib/librte_acl/rte_acl.h
+++ b/lib/librte_acl/rte_acl.h
@@ -259,7 +259,16 @@ void
 rte_acl_reset(struct rte_acl_ctx *ctx);
 
 /**
- * Search for a matching ACL rule for each input data buffer.
+ * Available implementations of ACL classify.
+ */
+enum rte_acl_classify_alg {
+	RTE_ACL_CLASSIFY_DEFAULT = 0,
+	RTE_ACL_CLASSIFY_SCALAR = 1,  /**< generic implementation. */
+	RTE_ACL_CLASSIFY_SSE = 2,     /**< requires SSE4.1 support. */
+};
+
+/**
+ * Perform search for a matching ACL rule for each input data buffer.
  * Each input data buffer can have up to *categories* matches.
  * That implies that results array should be big enough to hold
  * (categories * num) elements.
@@ -267,7 +276,7 @@ rte_acl_reset(struct rte_acl_ctx *ctx);
  * RTE_ACL_RESULTS_MULTIPLIER and can't be bigger than RTE_ACL_MAX_CATEGORIES.
  * If more than one rule is applicable for given input buffer and
  * given category, then rule with highest priority will be returned as a match.
- * Note, that it is a caller responsibility to ensure that input parameters
+ * Note that it is the caller's responsibility to ensure that input parameters
  * are valid and point to correct memory locations.
  *
  * @param ctx
@@ -287,15 +296,15 @@ rte_acl_reset(struct rte_acl_ctx *ctx);
  *   zero on successful completion.
  *   -EINVAL for incorrect arguments.
  */
-int
-rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t num, uint32_t categories);
+extern int
+rte_acl_classify(const struct rte_acl_ctx *ctx,
+		 const uint8_t **data,
+		 uint32_t *results, uint32_t num,
+		 uint32_t categories);
 
 /**
- * Perform scalar search for a matching ACL rule for each input data buffer.
- * Note, that while the search itself will avoid explicit use of SSE/AVX
- * intrinsics, code for comparing matching results/priorities sill might use
- * vector intrinsics (for  categories > 1).
+ * Perform search using specified algorithm for a matching ACL rule for
+ * each input data buffer.
  * Each input data buffer can have up to *categories* matches.
  * That implies that results array should be big enough to hold
  * (categories * num) elements.
@@ -319,13 +328,36 @@ rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
  * @param categories
  *   Number of maximum possible matches for each input buffer, one possible
  *   match per category.
+ * @param alg
+ *   Algorithm to be used for the search.
+ *   It is the caller's responsibility to ensure that the value refers to an
+ *   existing algorithm, and that it can be run on the given CPU.
  * @return
  *   zero on successful completion.
  *   -EINVAL for incorrect arguments.
  */
-int
-rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t num, uint32_t categories);
+extern int
+rte_acl_classify_alg(const struct rte_acl_ctx *ctx,
+		 const uint8_t **data,
+		 uint32_t *results, uint32_t num,
+		 uint32_t categories,
+		 enum rte_acl_classify_alg alg);
+
+/**
+ * Override the default classifier function for a given ACL context.
+ * @param ctx
+ *   ACL context to change classify function for.
+ * @param alg
+ *   New default classify algorithm for given ACL context.
+ *   It is the caller's responsibility to ensure that the value refers to an
+ *   existing algorithm, and that it can be run on the given CPU.
+ * @return
+ *   - -EINVAL if the parameters are invalid.
+ *   - Zero if operation completed successfully.
+ */
+extern int
+rte_acl_set_ctx_classify(struct rte_acl_ctx *ctx,
+	enum rte_acl_classify_alg alg);
 
 /**
  * Dump an ACL context structure to the console.
-- 
1.8.5.3
^ permalink raw reply	[relevance 1%]
* Re: [dpdk-dev] [PATCHv4] librte_acl make it build/work for 'default' target
  2014-08-28 20:38  1% ` [dpdk-dev] [PATCHv4] " Neil Horman
@ 2014-08-29 17:58  0%   ` Ananyev, Konstantin
  0 siblings, 0 replies; 86+ results
From: Ananyev, Konstantin @ 2014-08-29 17:58 UTC (permalink / raw)
  To: Neil Horman, dev
> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Thursday, August 28, 2014 9:38 PM
> To: dev@dpdk.org
> Cc: Neil Horman; Ananyev, Konstantin; thomas.monjalon@6wind.com
> Subject: [PATCHv4] librte_acl make it build/work for 'default' target
> 
> Make ACL library to build/work on 'default' architecture:
> - make rte_acl_classify_scalar really scalar
>  (make sure it wouldn't use sse4 intrinsics through resolve_priority()).
> - Provide two versions of rte_acl_classify code path:
>   rte_acl_classify_sse() - could be build and used only on systems with sse4.2
>   and upper, return -ENOTSUP on lower arch.
>   rte_acl_classify_scalar() - a slower version, but could be build and used
>   on all systems.
> - keep common code shared between these two codepaths.
> 
> v2 changes:
>  run-time selection of the most appropriate code path for the given ISA.
>  By default the highest supported one is selected.
>  User can still override that selection by manually assigning a new value to
>  the global function pointer rte_acl_default_classify.
>  rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
>  points to.
> 
> V3 Changes
>  Updated classify pointer to be a function so as to better preserve ABI
>  Removed macro definitions for match check functions to make them static inline
> 
> V4 Changes
>  Rewrote classification selection mechanism to use a function table, so that we
> can just store the preferred alg in the rte_acl_ctx struct so that multiprocess
> access works.  I understand that leaves us with an extra load instruction, but I
> think that's ok, because it also allows...
> 
>  Addition of a new function rte_acl_classify_alg.  This function lets you
> specify an enum value to override the acl context's default algorithm when doing a
> classification.  This allows an application to specify a classification
> algorithm without needing to publicize each method.  I know there was concern
> over keeping those methods public, but we don't have a static ABI at the moment,
> so this seems to me a reasonable thing to do, as it gives us less of an ABI
> surface to worry about.
Good way to overcome the problem.
From what I am seeing it adds a tiny slowdown (as expected) ...
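(The extra cost is one more load to fetch ctx->alg plus an indirect call
through the function table, i.e.:

	return classify_fns[ctx->alg](ctx, data, results, num, categories);

and that is paid once per call (burst), not per packet.)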
Though it provides good flexibility and I don't have any better ideas.
So I'd say let's stick with that approach.
Below are a few technical comments.
Thanks
Konstantin
> 
>  Fixed misc missed static declarations
> 
>  Removed acl_match_check.h and moved match_check function to acl_run.h
> 
>  typdeffed function pointer to match check.
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> CC: konstantin.ananyev@intel.com
> CC: thomas.monjalon@6wind.com
> ---
>  app/test-acl/main.c             |  13 +-
>  app/test/test_acl.c             |  10 +-
>  lib/librte_acl/Makefile         |   5 +-
>  lib/librte_acl/acl.h            |   1 +
>  lib/librte_acl/acl_bld.c        |   5 +-
>  lib/librte_acl/acl_run.c        | 944 ----------------------------------------
>  lib/librte_acl/acl_run.h        | 271 ++++++++++++
>  lib/librte_acl/acl_run_scalar.c | 197 +++++++++
>  lib/librte_acl/acl_run_sse.c    | 630 +++++++++++++++++++++++++++
>  lib/librte_acl/rte_acl.c        |  62 +++
>  lib/librte_acl/rte_acl.h        |  66 ++-
>  11 files changed, 1208 insertions(+), 996 deletions(-)
>  delete mode 100644 lib/librte_acl/acl_run.c
>  create mode 100644 lib/librte_acl/acl_run.h
>  create mode 100644 lib/librte_acl/acl_run_scalar.c
>  create mode 100644 lib/librte_acl/acl_run_sse.c
> 
> diff --git a/app/test/test_acl.c b/app/test/test_acl.c
> index 869f6d3..2169f59 100644
> --- a/app/test/test_acl.c
> +++ b/app/test/test_acl.c
> @@ -859,7 +859,7 @@ test_invalid_parameters(void)
>  	}
> 
>  	/* cover invalid but positive categories in classify */
> -	result = rte_acl_classify_scalar(acx, NULL, NULL, 0, 3);
> +	result = rte_acl_classify(acx, NULL, NULL, 0, 3);
Typo, should be:
rte_acl_classify_alg(acx, RTE_ACL_CLASSIFY_SCALAR, NULL, NULL, 0, 3); 
> diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h
> index b9d63fd..9236b7b 100644
> --- a/lib/librte_acl/acl.h
> +++ b/lib/librte_acl/acl.h
> @@ -168,6 +168,7 @@ struct rte_acl_ctx {
>  	void               *mem;
>  	size_t              mem_sz;
>  	struct rte_acl_config config; /* copy of build config. */
> +	enum rte_acl_classify_alg alg;
>  };
Each rte_acl_build() will reset all fields of rte_acl_ctx starting from num_categories and below.
So we need to move alg somewhere above num_categories:
--- a/lib/librte_acl/acl.h
+++ b/lib/librte_acl/acl.h
@@ -153,6 +153,7 @@ struct rte_acl_ctx {
        /** Name of the ACL context. */
        int32_t             socket_id;
        /** Socket ID to allocate memory from. */
+       enum rte_acl_classify_alg alg;
        void               *rules;
        uint32_t            max_rules;
        uint32_t            rule_sz;
@@ -168,9 +169,11 @@ struct rte_acl_ctx {
        void               *mem;
        size_t              mem_sz;
        struct rte_acl_config config; /* copy of build config. */
-       enum rte_acl_classify_alg alg;
 };
> diff --git a/lib/librte_acl/acl_run_scalar.c b/lib/librte_acl/acl_run_scalar.c
> new file mode 100644
> index 0000000..4bf58c7
> --- /dev/null
> +++ b/lib/librte_acl/acl_run_scalar.c
> @@ -0,0 +1,197 @@
> +
> +#include "acl_run.h"
> +
> +int
> +rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +        uint32_t *results, uint32_t num, uint32_t categories);
No need to put this declaration here.
I think you can put both rte_acl_classify_sse() and rte_acl_classify_scalar() into acl.h (it is an internal lib header).
And remove the other declarations of these functions from rte_acl.c.
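E.g. something like this in acl.h (just a sketch):

int
rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories);

int
rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories);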
> diff --git a/lib/librte_acl/acl_run_sse.c b/lib/librte_acl/acl_run_sse.c
> new file mode 100644
> index 0000000..7ae63dd
> --- /dev/null
> +++ b/lib/librte_acl/acl_run_sse.c
> +#include "acl_run.h"
> +
> +
> +int
> +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +        uint32_t *results, uint32_t num, uint32_t categories);
> +
Move to acl.h.
> diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
> index 7c288bd..741bed4 100644
> --- a/lib/librte_acl/rte_acl.c
> +++ b/lib/librte_acl/rte_acl.c
> @@ -33,11 +33,72 @@
> 
>  #include <rte_acl.h>
>  #include "acl.h"
> +#include "acl_run.h"
acl_run.h contains definitions for a lot of functions and should be included only by acl_run_*.c.
I think it is better to move the typedef int (*rte_acl_classify_t) into acl.h and not include acl_run.h here.
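I.e. something like this in acl.h (sketch):

typedef int (*rte_acl_classify_t)
(const struct rte_acl_ctx *, const uint8_t **, uint32_t *, uint32_t, uint32_t);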
> 
>  #define	BIT_SIZEOF(x)	(sizeof(x) * CHAR_BIT)
> 
>  TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
> 
> +extern int
> +rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +        uint32_t *results, uint32_t num, uint32_t categories);
> +
> +extern int
> +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +        uint32_t *results, uint32_t num, uint32_t categories);
> +
As above: I think it is safe to move these declarations into acl.h.
> +static rte_acl_classify_t classify_fns[] = {
> +	[RTE_ACL_CLASSIFY_DEFAULT] = rte_acl_classify_scalar,
> +	[RTE_ACL_CLASSIFY_SCALAR] = rte_acl_classify_scalar,
> +	[RTE_ACL_CLASSIFY_SSE] = rte_acl_classify_sse,
> +};
static const rte_acl_classify_t classify_fns[]
?
> +
> +
> +extern int
> +rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +        uint32_t *results, uint32_t num, uint32_t categories);
Duplicate.
> +
> +/* by default, use the always-available scalar code path. */
> +static enum rte_acl_classify_alg rte_acl_default_classify = RTE_ACL_CLASSIFY_SCALAR;
Line is longer than 80 chars?
> +
> +void rte_acl_set_default_classify(enum rte_acl_classify_alg alg)
> +{
> +	rte_acl_default_classify = alg;
> +}
void
rte_acl_set_default_classify(...)
Though, I am not sure why we need it to be public now.
Users can set up the ALG per context.
> +
> +void rte_acl_set_ctx_classify(struct rte_acl_ctx *ctx, enum rte_acl_classify_alg alg)
> +{
> +	ctx->alg = alg;
> +}
Same as above:
int
rte_acl_set_ctx_classify(...)
Plus probably add checking that alg is a valid argument:
	if ((uint32_t)alg < RTE_DIM(classify_fns)) {
		ctx->alg = alg;
		return 0;
	}
	return -EINVAL;
> +
> +static void __attribute__((constructor))
> +rte_acl_init(void)
> +{
> +	enum rte_acl_classify_alg alg = RTE_ACL_CLASSIFY_DEFAULT;
> +
> +	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
> +		alg = RTE_ACL_CLASSIFY_SSE;
> +
> +	rte_acl_set_default_classify(alg);
> +}
> +
> +int rte_acl_classify(const struct rte_acl_ctx *ctx,
> +		     const uint8_t **data,
> +		     uint32_t *results, uint32_t num,
> +		     uint32_t categories)
> +{
> +	return classify_fns[ctx->alg](ctx, data, results, num, categories);
> +}
> +
> +int rte_acl_classify_alg(const struct rte_acl_ctx *ctx,
> +			 enum rte_acl_classify_alg alg,
> +			 const uint8_t **data,
> +			 uint32_t *results, uint32_t num,
> +			 uint32_t categories)
> +{
> +	return classify_fns[alg](ctx, data, results, num, categories);
> +}
Can you move the alg argument to be the last one?
That would avoid copying parameters between registers at the call site.
Plus the same comment about the function definition style.
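I.e. something like (sketch):

int
rte_acl_classify_alg(const struct rte_acl_ctx *ctx,
	const uint8_t **data,
	uint32_t *results, uint32_t num,
	uint32_t categories,
	enum rte_acl_classify_alg alg);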
> +
>  struct rte_acl_ctx *
>  rte_acl_find_existing(const char *name)
>  {
> @@ -165,6 +226,7 @@ rte_acl_create(const struct rte_acl_param *param)
>  		ctx->max_rules = param->max_rule_num;
>  		ctx->rule_sz = param->rule_size;
>  		ctx->socket_id = param->socket_id;
> +		ctx->alg = rte_acl_default_classify;
>  		snprintf(ctx->name, sizeof(ctx->name), "%s", param->name);
> 
>  		te->data = (void *) ctx;
> diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
> index afc0f69..c092a49 100644
> --- a/lib/librte_acl/rte_acl.h
> +++ b/lib/librte_acl/rte_acl.h
> @@ -259,39 +259,6 @@ void
>  rte_acl_reset(struct rte_acl_ctx *ctx);
> 
>  /**
> - * Search for a matching ACL rule for each input data buffer.
> - * Each input data buffer can have up to *categories* matches.
> - * That implies that results array should be big enough to hold
> - * (categories * num) elements.
> - * Also categories parameter should be either one or multiple of
> - * RTE_ACL_RESULTS_MULTIPLIER and can't be bigger than RTE_ACL_MAX_CATEGORIES.
> - * If more than one rule is applicable for given input buffer and
> - * given category, then rule with highest priority will be returned as a match.
> - * Note, that it is a caller responsibility to ensure that input parameters
> - * are valid and point to correct memory locations.
> - *
> - * @param ctx
> - *   ACL context to search with.
> - * @param data
> - *   Array of pointers to input data buffers to perform search.
> - *   Note that all fields in input data buffers supposed to be in network
> - *   byte order (MSB).
> - * @param results
> - *   Array of search results, *categories* results per each input data buffer.
> - * @param num
> - *   Number of elements in the input data buffers array.
> - * @param categories
> - *   Number of maximum possible matches for each input buffer, one possible
> - *   match per category.
> - * @return
> - *   zero on successful completion.
> - *   -EINVAL for incorrect arguments.
> - */
> -int
> -rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
> -	uint32_t *results, uint32_t num, uint32_t categories);
> -
> -/**
>   * Perform scalar search for a matching ACL rule for each input data buffer.
>   * Note, that while the search itself will avoid explicit use of SSE/AVX
>   * intrinsics, code for comparing matching results/priorities sill might use
> @@ -323,9 +290,36 @@ rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
>   *   zero on successful completion.
>   *   -EINVAL for incorrect arguments.
>   */
> -int
> -rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> -	uint32_t *results, uint32_t num, uint32_t categories);
> +
> +enum rte_acl_classify_alg {
> +	RTE_ACL_CLASSIFY_DEFAULT = 0,
> +	RTE_ACL_CLASSIFY_SCALAR = 1,
> +	RTE_ACL_CLASSIFY_SSE = 2,
> +};
> +
I think you removed the wrong comment.
All public API function declarations are supposed to be preceded by a formal
doxygen-style comment: brief explanation, parameter and return value
descriptions, etc.
Please restore the proper comment for it.
BTW, the two new functions above need formal comments too.
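For example, a sketch of such a block for rte_acl_classify_alg (wording
borrowed from the removed rte_acl_classify comment, to be adjusted as needed):

/**
 * Search for a matching ACL rule for each input data buffer, using the
 * given classify algorithm instead of the context's default one.
 *
 * @param ctx
 *   ACL context to search with.
 * @param alg
 *   Classify algorithm to use for this search.
 * @param data
 *   Array of pointers to input data buffers to perform search.
 * @param results
 *   Array of search results, *categories* results per each input data buffer.
 * @param num
 *   Number of elements in the input data buffers array.
 * @param categories
 *   Number of maximum possible matches for each input buffer, one possible
 *   match per category.
 * @return
 *   zero on successful completion.
 *   -EINVAL for incorrect arguments.
 */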
> +extern int
> +rte_acl_classify(const struct rte_acl_ctx *ctx,
> +		 const uint8_t **data,
> +		 uint32_t *results, uint32_t num,
> +		 uint32_t categories);
> +
> +extern int
> +rte_acl_classify_alg(const struct rte_acl_ctx *ctx,
> +		 enum rte_acl_classify_alg alg,
> +		 const uint8_t **data,
> +		 uint32_t *results, uint32_t num,
> +		 uint32_t categories);
> +/*
> + * Set the default classify algorithm for newly allocated classify contexts
> + */
> +extern void
> +rte_acl_set_default_classify(enum rte_acl_classify_alg alg);
> +
> +/*
> + * Override the default classifier function for a given ctx
> + */
> +extern void
> +rte_acl_set_ctx_classify(struct rte_acl_ctx *ctx, enum rte_acl_classify_alg alg);
> 
>  /**
>   * Dump an ACL context structure to the console.
> --
> 1.9.3
Also, need to update examples/l3fwd-acl/ (remove rte_acl_classify_scalar() calls).
Something like:
diff --git a/examples/l3fwd-acl/main.c b/examples/l3fwd-acl/main.c
index 9b2c21b..8cbf202 100644
--- a/examples/l3fwd-acl/main.c
+++ b/examples/l3fwd-acl/main.c
@@ -278,15 +278,6 @@ send_single_packet(struct rte_mbuf *m, uint8_t port);
        (in) = end + 1;                                         \
 } while (0)
-#define CLASSIFY(context, data, res, num, cat) do {            \
-       if (scalar)                                             \
-               rte_acl_classify_scalar((context), (data),      \
-               (res), (num), (cat));                           \
-       else                                                    \
-               rte_acl_classify((context), (data),             \
-               (res), (num), (cat));                           \
-} while (0)
-
 /*
   * ACL rules should have higher priorities than route ones to ensure ACL rule
   * always be found when input packets have multi-matches in the database.
@@ -1253,6 +1244,9 @@ app_acl_init(void)
        dump_acl_config();
+       if (parm_config.scalar)
+                rte_acl_set_default_classify(RTE_ACL_CLASSIFY_SCALAR);
+
        /* Load  rules from the input file */
        if (add_rules(parm_config.rule_ipv4_name, &route_base_ipv4,
                        &route_num_ipv4, &acl_base_ipv4, &acl_num_ipv4,
@@ -1436,10 +1430,8 @@ main_loop(__attribute__((unused)) void *dummy)
        int socketid;
        const uint64_t drain_tsc = (rte_get_tsc_hz() + US_PER_S - 1)
                        / US_PER_S * BURST_TX_DRAIN_US;
-       int scalar = parm_config.scalar;
        prev_tsc = 0;
-
        lcore_id = rte_lcore_id();
        qconf = &lcore_conf[lcore_id];
        socketid = rte_lcore_to_socket_id(lcore_id);
@@ -1503,7 +1495,8 @@ main_loop(__attribute__((unused)) void *dummy)
                                        nb_rx);
                                if (acl_search.num_ipv4) {
-                                       CLASSIFY(acl_config.acx_ipv4[socketid],
+                                       rte_acl_classify(
+                                               acl_config.acx_ipv4[socketid],
                                                acl_search.data_ipv4,
                                                acl_search.res_ipv4,
                                                acl_search.num_ipv4,
@@ -1515,7 +1508,8 @@ main_loop(__attribute__((unused)) void *dummy)
                                }
                                if (acl_search.num_ipv6) {
-                                       CLASSIFY(acl_config.acx_ipv6[socketid],
+                                       rte_acl_classify(
+                                               acl_config.acx_ipv6[socketid],
                                                acl_search.data_ipv6,
                                                acl_search.res_ipv6,
                                                acl_search.num_ipv6,
^ permalink raw reply	[relevance 0%]
* [dpdk-dev] [PATCHv4] librte_acl make it build/work for 'default' target
    2014-08-07 20:11  4% ` Neil Horman
  2014-08-21 20:15  1% ` [dpdk-dev] [PATCHv3] " Neil Horman
@ 2014-08-28 20:38  1% ` Neil Horman
  2014-08-29 17:58  0%   ` Ananyev, Konstantin
  2 siblings, 1 reply; 86+ results
From: Neil Horman @ 2014-08-28 20:38 UTC (permalink / raw)
  To: dev
Make the ACL library build/work on the 'default' architecture:
- make rte_acl_classify_scalar really scalar
 (make sure it wouldn't use sse4 intrinsics through resolve_priority()).
- Provide two versions of the rte_acl_classify code path:
  rte_acl_classify_sse() - can be built and used only on systems with sse4.2
  and above; returns -ENOTSUP on lower archs.
  rte_acl_classify_scalar() - a slower version, but can be built and used
  on all systems.
- keep common code shared between these two code paths.
v2 changes:
 run-time selection of the most appropriate code path for the given ISA.
 By default the highest supported one is selected.
 The user can still override that selection by manually assigning a new value
 to the global function pointer rte_acl_default_classify.
 rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
 points to.
V3 Changes
 Updated the classify pointer to be a function so as to better preserve ABI.
 Removed macro definitions for match check functions to make them static inline.
V4 Changes
 Rewrote the classification selection mechanism to use a function table, so
that we can just store the preferred alg in the rte_acl_ctx struct and
multiprocess access works.  I understand that leaves us with an extra load
instruction, but I think that's ok, because it also allows...
 Addition of a new function rte_acl_classify_alg.  This function lets you
specify an enum value to override the acl context's default algorithm when
doing a classification.  This allows an application to specify a
classification algorithm without needing to publicize each method.  I know
there was concern over keeping those methods public, but we don't have a
static ABI at the moment, so this seems to me a reasonable thing to do, as it
gives us less of an ABI surface to worry about.
 Fixed misc missed static declarations.
 Removed acl_match_check.h and moved the match_check function to acl_run.h.
 typedef'd the function pointer for match check.
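For illustration, a minimal caller-side sketch of the two entry points this
patch ends up with (acx, data, results, num and categories are assumed to be
set up as usual; error handling omitted):

	/* classify with the algorithm stored in the context */
	ret = rte_acl_classify(acx, data, results, num, categories);

	/* or force a specific algorithm for this call only */
	ret = rte_acl_classify_alg(acx, RTE_ACL_CLASSIFY_SCALAR, data,
		results, num, categories);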
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: konstantin.ananyev@intel.com
CC: thomas.monjalon@6wind.com
---
 app/test-acl/main.c             |  13 +-
 app/test/test_acl.c             |  10 +-
 lib/librte_acl/Makefile         |   5 +-
 lib/librte_acl/acl.h            |   1 +
 lib/librte_acl/acl_bld.c        |   5 +-
 lib/librte_acl/acl_run.c        | 944 ----------------------------------------
 lib/librte_acl/acl_run.h        | 271 ++++++++++++
 lib/librte_acl/acl_run_scalar.c | 197 +++++++++
 lib/librte_acl/acl_run_sse.c    | 630 +++++++++++++++++++++++++++
 lib/librte_acl/rte_acl.c        |  62 +++
 lib/librte_acl/rte_acl.h        |  66 ++-
 11 files changed, 1208 insertions(+), 996 deletions(-)
 delete mode 100644 lib/librte_acl/acl_run.c
 create mode 100644 lib/librte_acl/acl_run.h
 create mode 100644 lib/librte_acl/acl_run_scalar.c
 create mode 100644 lib/librte_acl/acl_run_sse.c
diff --git a/app/test-acl/main.c b/app/test-acl/main.c
index d654409..6551918 100644
--- a/app/test-acl/main.c
+++ b/app/test-acl/main.c
@@ -787,6 +787,10 @@ acx_init(void)
 	/* perform build. */
 	ret = rte_acl_build(config.acx, &cfg);
 
+	/* setup default rte_acl_classify */
+	if (config.scalar)
+		rte_acl_set_default_classify(RTE_ACL_CLASSIFY_SCALAR);
+
 	dump_verbose(DUMP_NONE, stdout,
 		"rte_acl_build(%u) finished with %d\n",
 		config.bld_categories, ret);
@@ -815,13 +819,8 @@ search_ip5tuples_once(uint32_t categories, uint32_t step, int scalar)
 			v += config.trace_sz;
 		}
 
-		if (scalar != 0)
-			ret = rte_acl_classify_scalar(config.acx, data,
-				results, n, categories);
-
-		else
-			ret = rte_acl_classify(config.acx, data,
-				results, n, categories);
+		ret = rte_acl_classify(config.acx, data, results,
+			n, categories);
 
 		if (ret != 0)
 			rte_exit(ret, "classify for ipv%c_5tuples returns %d\n",
diff --git a/app/test/test_acl.c b/app/test/test_acl.c
index 869f6d3..2169f59 100644
--- a/app/test/test_acl.c
+++ b/app/test/test_acl.c
@@ -148,7 +148,7 @@ test_classify_run(struct rte_acl_ctx *acx)
 	}
 
 	/* make a quick check for scalar */
-	ret = rte_acl_classify_scalar(acx, data, results,
+	ret = rte_acl_classify_alg(acx, RTE_ACL_CLASSIFY_SCALAR, data, results,
 			RTE_DIM(acl_test_data), RTE_ACL_MAX_CATEGORIES);
 	if (ret != 0) {
 		printf("Line %i: SSE classify failed!\n", __LINE__);
@@ -343,7 +343,7 @@ test_invalid_layout(void)
 	}
 
 	/* classify tuples */
-	ret = rte_acl_classify(acx, data, results,
+	ret = rte_acl_classify_alg(acx, RTE_ACL_CLASSIFY_SCALAR, data, results,
 			RTE_DIM(results), 1);
 	if (ret != 0) {
 		printf("Line %i: SSE classify failed!\n", __LINE__);
@@ -362,7 +362,7 @@ test_invalid_layout(void)
 	}
 
 	/* classify tuples (scalar) */
-	ret = rte_acl_classify_scalar(acx, data, results,
+	ret = rte_acl_classify_alg(acx, RTE_ACL_CLASSIFY_SCALAR, data, results,
 			RTE_DIM(results), 1);
 	if (ret != 0) {
 		printf("Line %i: Scalar classify failed!\n", __LINE__);
@@ -850,7 +850,7 @@ test_invalid_parameters(void)
 	/* scalar classify test */
 
 	/* cover zero categories in classify (should not fail) */
-	result = rte_acl_classify_scalar(acx, NULL, NULL, 0, 0);
+	result = rte_acl_classify_alg(acx, RTE_ACL_CLASSIFY_SCALAR, NULL, NULL, 0, 0);
 	if (result != 0) {
 		printf("Line %i: Scalar classify with zero categories "
 				"failed!\n", __LINE__);
@@ -859,7 +859,7 @@ test_invalid_parameters(void)
 	}
 
 	/* cover invalid but positive categories in classify */
-	result = rte_acl_classify_scalar(acx, NULL, NULL, 0, 3);
+	result = rte_acl_classify(acx, NULL, NULL, 0, 3);
 	if (result == 0) {
 		printf("Line %i: Scalar classify with 3 categories "
 				"should have failed!\n", __LINE__);
diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
index 4fe4593..65e566d 100644
--- a/lib/librte_acl/Makefile
+++ b/lib/librte_acl/Makefile
@@ -43,7 +43,10 @@ SRCS-$(CONFIG_RTE_LIBRTE_ACL) += tb_mem.c
 SRCS-$(CONFIG_RTE_LIBRTE_ACL) += rte_acl.c
 SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_bld.c
 SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_gen.c
-SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run.c
+SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_scalar.c
+SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_sse.c
+
+CFLAGS_acl_run_sse.o += -msse4.1
 
 # install this header file
 SYMLINK-$(CONFIG_RTE_LIBRTE_ACL)-include := rte_acl_osdep.h
diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h
index b9d63fd..9236b7b 100644
--- a/lib/librte_acl/acl.h
+++ b/lib/librte_acl/acl.h
@@ -168,6 +168,7 @@ struct rte_acl_ctx {
 	void               *mem;
 	size_t              mem_sz;
 	struct rte_acl_config config; /* copy of build config. */
+	enum rte_acl_classify_alg alg;
 };
 
 int rte_acl_gen(struct rte_acl_ctx *ctx, struct rte_acl_trie *trie,
diff --git a/lib/librte_acl/acl_bld.c b/lib/librte_acl/acl_bld.c
index 873447b..09d58ea 100644
--- a/lib/librte_acl/acl_bld.c
+++ b/lib/librte_acl/acl_bld.c
@@ -31,7 +31,6 @@
  *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
 
-#include <nmmintrin.h>
 #include <rte_acl.h>
 #include "tb_mem.h"
 #include "acl.h"
@@ -1480,8 +1479,8 @@ acl_calc_wildness(struct rte_acl_build_rule *head,
 
 			switch (rule->config->defs[n].type) {
 			case RTE_ACL_FIELD_TYPE_BITMASK:
-				wild = (size -
-					_mm_popcnt_u32(fld->mask_range.u8)) /
+				wild = (size - __builtin_popcount(
+					fld->mask_range.u8)) /
 					size;
 				break;
 
diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
deleted file mode 100644
index e3d9fc1..0000000
--- a/lib/librte_acl/acl_run.c
+++ /dev/null
@@ -1,944 +0,0 @@
-/*-
- *   BSD LICENSE
- *
- *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
- *   All rights reserved.
- *
- *   Redistribution and use in source and binary forms, with or without
- *   modification, are permitted provided that the following conditions
- *   are met:
- *
- *     * Redistributions of source code must retain the above copyright
- *       notice, this list of conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright
- *       notice, this list of conditions and the following disclaimer in
- *       the documentation and/or other materials provided with the
- *       distribution.
- *     * Neither the name of Intel Corporation nor the names of its
- *       contributors may be used to endorse or promote products derived
- *       from this software without specific prior written permission.
- *
- *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
- *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
- *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
- *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
- *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
- *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
- *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
- *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
- *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- */
-
-#include <rte_acl.h>
-#include "acl_vect.h"
-#include "acl.h"
-
-#define MAX_SEARCHES_SSE8	8
-#define MAX_SEARCHES_SSE4	4
-#define MAX_SEARCHES_SSE2	2
-#define MAX_SEARCHES_SCALAR	2
-
-#define GET_NEXT_4BYTES(prm, idx)	\
-	(*((const int32_t *)((prm)[(idx)].data + *(prm)[idx].data_index++)))
-
-
-#define RTE_ACL_NODE_INDEX	((uint32_t)~RTE_ACL_NODE_TYPE)
-
-#define	SCALAR_QRANGE_MULT	0x01010101
-#define	SCALAR_QRANGE_MASK	0x7f7f7f7f
-#define	SCALAR_QRANGE_MIN	0x80808080
-
-enum {
-	SHUFFLE32_SLOT1 = 0xe5,
-	SHUFFLE32_SLOT2 = 0xe6,
-	SHUFFLE32_SLOT3 = 0xe7,
-	SHUFFLE32_SWAP64 = 0x4e,
-};
-
-/*
- * Structure to manage N parallel trie traversals.
- * The runtime trie traversal routines can process 8, 4, or 2 tries
- * in parallel. Each packet may require multiple trie traversals (up to 4).
- * This structure is used to fill the slots (0 to n-1) for parallel processing
- * with the trie traversals needed for each packet.
- */
-struct acl_flow_data {
-	uint32_t            num_packets;
-	/* number of packets processed */
-	uint32_t            started;
-	/* number of trie traversals in progress */
-	uint32_t            trie;
-	/* current trie index (0 to N-1) */
-	uint32_t            cmplt_size;
-	uint32_t            total_packets;
-	uint32_t            categories;
-	/* number of result categories per packet. */
-	/* maximum number of packets to process */
-	const uint64_t     *trans;
-	const uint8_t     **data;
-	uint32_t           *results;
-	struct completion  *last_cmplt;
-	struct completion  *cmplt_array;
-};
-
-/*
- * Structure to maintain running results for
- * a single packet (up to 4 tries).
- */
-struct completion {
-	uint32_t *results;                          /* running results. */
-	int32_t   priority[RTE_ACL_MAX_CATEGORIES]; /* running priorities. */
-	uint32_t  count;                            /* num of remaining tries */
-	/* true for allocated struct */
-} __attribute__((aligned(XMM_SIZE)));
-
-/*
- * One parms structure for each slot in the search engine.
- */
-struct parms {
-	const uint8_t              *data;
-	/* input data for this packet */
-	const uint32_t             *data_index;
-	/* data indirection for this trie */
-	struct completion          *cmplt;
-	/* completion data for this packet */
-};
-
-/*
- * Define an global idle node for unused engine slots
- */
-static const uint32_t idle[UINT8_MAX + 1];
-
-static const rte_xmm_t mm_type_quad_range = {
-	.u32 = {
-		RTE_ACL_NODE_QRANGE,
-		RTE_ACL_NODE_QRANGE,
-		RTE_ACL_NODE_QRANGE,
-		RTE_ACL_NODE_QRANGE,
-	},
-};
-
-static const rte_xmm_t mm_type_quad_range64 = {
-	.u32 = {
-		RTE_ACL_NODE_QRANGE,
-		RTE_ACL_NODE_QRANGE,
-		0,
-		0,
-	},
-};
-
-static const rte_xmm_t mm_shuffle_input = {
-	.u32 = {0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c},
-};
-
-static const rte_xmm_t mm_shuffle_input64 = {
-	.u32 = {0x00000000, 0x04040404, 0x80808080, 0x80808080},
-};
-
-static const rte_xmm_t mm_ones_16 = {
-	.u16 = {1, 1, 1, 1, 1, 1, 1, 1},
-};
-
-static const rte_xmm_t mm_bytes = {
-	.u32 = {UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX},
-};
-
-static const rte_xmm_t mm_bytes64 = {
-	.u32 = {UINT8_MAX, UINT8_MAX, 0, 0},
-};
-
-static const rte_xmm_t mm_match_mask = {
-	.u32 = {
-		RTE_ACL_NODE_MATCH,
-		RTE_ACL_NODE_MATCH,
-		RTE_ACL_NODE_MATCH,
-		RTE_ACL_NODE_MATCH,
-	},
-};
-
-static const rte_xmm_t mm_match_mask64 = {
-	.u32 = {
-		RTE_ACL_NODE_MATCH,
-		0,
-		RTE_ACL_NODE_MATCH,
-		0,
-	},
-};
-
-static const rte_xmm_t mm_index_mask = {
-	.u32 = {
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-	},
-};
-
-static const rte_xmm_t mm_index_mask64 = {
-	.u32 = {
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-		0,
-		0,
-	},
-};
-
-/*
- * Allocate a completion structure to manage the tries for a packet.
- */
-static inline struct completion *
-alloc_completion(struct completion *p, uint32_t size, uint32_t tries,
-	uint32_t *results)
-{
-	uint32_t n;
-
-	for (n = 0; n < size; n++) {
-
-		if (p[n].count == 0) {
-
-			/* mark as allocated and set number of tries. */
-			p[n].count = tries;
-			p[n].results = results;
-			return &(p[n]);
-		}
-	}
-
-	/* should never get here */
-	return NULL;
-}
-
-/*
- * Resolve priority for a single result trie.
- */
-static inline void
-resolve_single_priority(uint64_t transition, int n,
-	const struct rte_acl_ctx *ctx, struct parms *parms,
-	const struct rte_acl_match_results *p)
-{
-	if (parms[n].cmplt->count == ctx->num_tries ||
-			parms[n].cmplt->priority[0] <=
-			p[transition].priority[0]) {
-
-		parms[n].cmplt->priority[0] = p[transition].priority[0];
-		parms[n].cmplt->results[0] = p[transition].results[0];
-	}
-
-	parms[n].cmplt->count--;
-}
-
-/*
- * Resolve priority for multiple results. This consists comparing
- * the priority of the current traversal with the running set of
- * results for the packet. For each result, keep a running array of
- * the result (rule number) and its priority for each category.
- */
-static inline void
-resolve_priority(uint64_t transition, int n, const struct rte_acl_ctx *ctx,
-	struct parms *parms, const struct rte_acl_match_results *p,
-	uint32_t categories)
-{
-	uint32_t x;
-	xmm_t results, priority, results1, priority1, selector;
-	xmm_t *saved_results, *saved_priority;
-
-	for (x = 0; x < categories; x += RTE_ACL_RESULTS_MULTIPLIER) {
-
-		saved_results = (xmm_t *)(&parms[n].cmplt->results[x]);
-		saved_priority =
-			(xmm_t *)(&parms[n].cmplt->priority[x]);
-
-		/* get results and priorities for completed trie */
-		results = MM_LOADU((const xmm_t *)&p[transition].results[x]);
-		priority = MM_LOADU((const xmm_t *)&p[transition].priority[x]);
-
-		/* if this is not the first completed trie */
-		if (parms[n].cmplt->count != ctx->num_tries) {
-
-			/* get running best results and their priorities */
-			results1 = MM_LOADU(saved_results);
-			priority1 = MM_LOADU(saved_priority);
-
-			/* select results that are highest priority */
-			selector = MM_CMPGT32(priority1, priority);
-			results = MM_BLENDV8(results, results1, selector);
-			priority = MM_BLENDV8(priority, priority1, selector);
-		}
-
-		/* save running best results and their priorities */
-		MM_STOREU(saved_results, results);
-		MM_STOREU(saved_priority, priority);
-	}
-
-	/* Count down completed tries for this search request */
-	parms[n].cmplt->count--;
-}
-
-/*
- * Routine to fill a slot in the parallel trie traversal array (parms) from
- * the list of packets (flows).
- */
-static inline uint64_t
-acl_start_next_trie(struct acl_flow_data *flows, struct parms *parms, int n,
-	const struct rte_acl_ctx *ctx)
-{
-	uint64_t transition;
-
-	/* if there are any more packets to process */
-	if (flows->num_packets < flows->total_packets) {
-		parms[n].data = flows->data[flows->num_packets];
-		parms[n].data_index = ctx->trie[flows->trie].data_index;
-
-		/* if this is the first trie for this packet */
-		if (flows->trie == 0) {
-			flows->last_cmplt = alloc_completion(flows->cmplt_array,
-				flows->cmplt_size, ctx->num_tries,
-				flows->results +
-				flows->num_packets * flows->categories);
-		}
-
-		/* set completion parameters and starting index for this slot */
-		parms[n].cmplt = flows->last_cmplt;
-		transition =
-			flows->trans[parms[n].data[*parms[n].data_index++] +
-			ctx->trie[flows->trie].root_index];
-
-		/*
-		 * if this is the last trie for this packet,
-		 * then setup next packet.
-		 */
-		flows->trie++;
-		if (flows->trie >= ctx->num_tries) {
-			flows->trie = 0;
-			flows->num_packets++;
-		}
-
-		/* keep track of number of active trie traversals */
-		flows->started++;
-
-	/* no more tries to process, set slot to an idle position */
-	} else {
-		transition = ctx->idle;
-		parms[n].data = (const uint8_t *)idle;
-		parms[n].data_index = idle;
-	}
-	return transition;
-}
-
-/*
- * Detect matches. If a match node transition is found, then this trie
- * traversal is complete and fill the slot with the next trie
- * to be processed.
- */
-static inline uint64_t
-acl_match_check_transition(uint64_t transition, int slot,
-	const struct rte_acl_ctx *ctx, struct parms *parms,
-	struct acl_flow_data *flows)
-{
-	const struct rte_acl_match_results *p;
-
-	p = (const struct rte_acl_match_results *)
-		(flows->trans + ctx->match_index);
-
-	if (transition & RTE_ACL_NODE_MATCH) {
-
-		/* Remove flags from index and decrement active traversals */
-		transition &= RTE_ACL_NODE_INDEX;
-		flows->started--;
-
-		/* Resolve priorities for this trie and running results */
-		if (flows->categories == 1)
-			resolve_single_priority(transition, slot, ctx,
-				parms, p);
-		else
-			resolve_priority(transition, slot, ctx, parms, p,
-				flows->categories);
-
-		/* Fill the slot with the next trie or idle trie */
-		transition = acl_start_next_trie(flows, parms, slot, ctx);
-
-	} else if (transition == ctx->idle) {
-		/* reset indirection table for idle slots */
-		parms[slot].data_index = idle;
-	}
-
-	return transition;
-}
-
-/*
- * Extract transitions from an XMM register and check for any matches
- */
-static void
-acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
-	struct parms *parms, struct acl_flow_data *flows)
-{
-	uint64_t transition1, transition2;
-
-	/* extract transition from low 64 bits. */
-	transition1 = MM_CVT64(*indicies);
-
-	/* extract transition from high 64 bits. */
-	*indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
-	transition2 = MM_CVT64(*indicies);
-
-	transition1 = acl_match_check_transition(transition1, slot, ctx,
-		parms, flows);
-	transition2 = acl_match_check_transition(transition2, slot + 1, ctx,
-		parms, flows);
-
-	/* update indicies with new transitions. */
-	*indicies = MM_SET64(transition2, transition1);
-}
-
-/*
- * Check for a match in 2 transitions (contained in SSE register)
- */
-static inline void
-acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
-	struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
-{
-	xmm_t temp;
-
-	temp = MM_AND(match_mask, *indicies);
-	while (!MM_TESTZ(temp, temp)) {
-		acl_process_matches(indicies, slot, ctx, parms, flows);
-		temp = MM_AND(match_mask, *indicies);
-	}
-}
-
-/*
- * Check for any match in 4 transitions (contained in 2 SSE registers)
- */
-static inline void
-acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
-	struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
-	xmm_t match_mask)
-{
-	xmm_t temp;
-
-	/* put low 32 bits of each transition into one register */
-	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
-		0x88);
-	/* test for match node */
-	temp = MM_AND(match_mask, temp);
-
-	while (!MM_TESTZ(temp, temp)) {
-		acl_process_matches(indicies1, slot, ctx, parms, flows);
-		acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
-
-		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
-					(__m128)*indicies2,
-					0x88);
-		temp = MM_AND(match_mask, temp);
-	}
-}
-
-/*
- * Calculate the address of the next transition for
- * all types of nodes. Note that only DFA nodes and range
- * nodes actually transition to another node. Match
- * nodes don't move.
- */
-static inline xmm_t
-acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
-	xmm_t *indicies1, xmm_t *indicies2)
-{
-	xmm_t addr, node_types, temp;
-
-	/*
-	 * Note that no transition is done for a match
-	 * node and therefore a stream freezes when
-	 * it reaches a match.
-	 */
-
-	/* Shuffle low 32 into temp and high 32 into indicies2 */
-	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
-		0x88);
-	*indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
-		(__m128)*indicies2, 0xdd);
-
-	/* Calc node type and node addr */
-	node_types = MM_ANDNOT(index_mask, temp);
-	addr = MM_AND(index_mask, temp);
-
-	/*
-	 * Calc addr for DFAs - addr = dfa_index + input_byte
-	 */
-
-	/* mask for DFA type (0) nodes */
-	temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
-
-	/* add input byte to DFA position */
-	temp = MM_AND(temp, bytes);
-	temp = MM_AND(temp, next_input);
-	addr = MM_ADD32(addr, temp);
-
-	/*
-	 * Calc addr for Range nodes -> range_index + range(input)
-	 */
-	node_types = MM_CMPEQ32(node_types, type_quad_range);
-
-	/*
-	 * Calculate number of range boundaries that are less than the
-	 * input value. Range boundaries for each node are in signed 8 bit,
-	 * ordered from -128 to 127 in the indicies2 register.
-	 * This is effectively a popcnt of bytes that are greater than the
-	 * input byte.
-	 */
-
-	/* shuffle input byte to all 4 positions of 32 bit value */
-	temp = MM_SHUFFLE8(next_input, shuffle_input);
-
-	/* check ranges */
-	temp = MM_CMPGT8(temp, *indicies2);
-
-	/* convert -1 to 1 (bytes greater than input byte */
-	temp = MM_SIGN8(temp, temp);
-
-	/* horizontal add pairs of bytes into words */
-	temp = MM_MADD8(temp, temp);
-
-	/* horizontal add pairs of words into dwords */
-	temp = MM_MADD16(temp, ones_16);
-
-	/* mask to range type nodes */
-	temp = MM_AND(temp, node_types);
-
-	/* add index into node position */
-	return MM_ADD32(addr, temp);
-}
-
-/*
- * Process 4 transitions (in 2 SIMD registers) in parallel
- */
-static inline xmm_t
-transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
-	const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
-{
-	xmm_t addr;
-	uint64_t trans0, trans2;
-
-	 /* Calculate the address (array index) for all 4 transitions. */
-
-	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
-		bytes, type_quad_range, indicies1, indicies2);
-
-	 /* Gather 64 bit transitions and pack back into 2 registers. */
-
-	trans0 = trans[MM_CVT32(addr)];
-
-	/* get slot 2 */
-
-	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
-	trans2 = trans[MM_CVT32(addr)];
-
-	/* get slot 1 */
-
-	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
-	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
-
-	/* get slot 3 */
-
-	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
-	*indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
-
-	return MM_SRL32(next_input, 8);
-}
-
-static inline void
-acl_set_flow(struct acl_flow_data *flows, struct completion *cmplt,
-	uint32_t cmplt_size, const uint8_t **data, uint32_t *results,
-	uint32_t data_num, uint32_t categories, const uint64_t *trans)
-{
-	flows->num_packets = 0;
-	flows->started = 0;
-	flows->trie = 0;
-	flows->last_cmplt = NULL;
-	flows->cmplt_array = cmplt;
-	flows->total_packets = data_num;
-	flows->categories = categories;
-	flows->cmplt_size = cmplt_size;
-	flows->data = data;
-	flows->results = results;
-	flows->trans = trans;
-}
-
-/*
- * Execute trie traversal with 8 traversals in parallel
- */
-static inline void
-search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t total_packets, uint32_t categories)
-{
-	int n;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SSE8];
-	struct completion cmplt[MAX_SEARCHES_SSE8];
-	struct parms parms[MAX_SEARCHES_SSE8];
-	xmm_t input0, input1;
-	xmm_t indicies1, indicies2, indicies3, indicies4;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
-		total_packets, categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	/*
-	 * indicies1 contains index_array[0,1]
-	 * indicies2 contains index_array[2,3]
-	 * indicies3 contains index_array[4,5]
-	 * indicies4 contains index_array[6,7]
-	 */
-
-	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
-	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
-
-	indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
-	indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
-
-	 /* Check for any matches. */
-	acl_match_check_x4(0, ctx, parms, &flows,
-		&indicies1, &indicies2, mm_match_mask.m);
-	acl_match_check_x4(4, ctx, parms, &flows,
-		&indicies3, &indicies4, mm_match_mask.m);
-
-	while (flows.started > 0) {
-
-		/* Gather 4 bytes of input data for each stream. */
-		input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
-			0);
-		input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
-			0);
-
-		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
-		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
-
-		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
-		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
-
-		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
-		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
-
-		 /* Process the 4 bytes of input on each stream. */
-
-		input0 = transition4(mm_index_mask.m, input0,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		input1 = transition4(mm_index_mask.m, input1,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies3, &indicies4);
-
-		input0 = transition4(mm_index_mask.m, input0,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		input1 = transition4(mm_index_mask.m, input1,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies3, &indicies4);
-
-		input0 = transition4(mm_index_mask.m, input0,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		input1 = transition4(mm_index_mask.m, input1,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies3, &indicies4);
-
-		input0 = transition4(mm_index_mask.m, input0,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		input1 = transition4(mm_index_mask.m, input1,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies3, &indicies4);
-
-		 /* Check for any matches. */
-		acl_match_check_x4(0, ctx, parms, &flows,
-			&indicies1, &indicies2, mm_match_mask.m);
-		acl_match_check_x4(4, ctx, parms, &flows,
-			&indicies3, &indicies4, mm_match_mask.m);
-	}
-}
-
-/*
- * Execute trie traversal with 4 traversals in parallel
- */
-static inline void
-search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	 uint32_t *results, int total_packets, uint32_t categories)
-{
-	int n;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SSE4];
-	struct completion cmplt[MAX_SEARCHES_SSE4];
-	struct parms parms[MAX_SEARCHES_SSE4];
-	xmm_t input, indicies1, indicies2;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
-		total_packets, categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
-	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
-
-	/* Check for any matches. */
-	acl_match_check_x4(0, ctx, parms, &flows,
-		&indicies1, &indicies2, mm_match_mask.m);
-
-	while (flows.started > 0) {
-
-		/* Gather 4 bytes of input data for each stream. */
-		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
-
-		/* Process the 4 bytes of input on each stream. */
-		input = transition4(mm_index_mask.m, input,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		 input = transition4(mm_index_mask.m, input,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		 input = transition4(mm_index_mask.m, input,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		 input = transition4(mm_index_mask.m, input,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		/* Check for any matches. */
-		acl_match_check_x4(0, ctx, parms, &flows,
-			&indicies1, &indicies2, mm_match_mask.m);
-	}
-}
-
-static inline xmm_t
-transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
-	const uint64_t *trans, xmm_t *indicies1)
-{
-	uint64_t t;
-	xmm_t addr, indicies2;
-
-	indicies2 = MM_XOR(ones_16, ones_16);
-
-	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
-		bytes, type_quad_range, indicies1, &indicies2);
-
-	/* Gather 64 bit transitions and pack 2 per register. */
-
-	t = trans[MM_CVT32(addr)];
-
-	/* get slot 1 */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
-	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
-
-	return MM_SRL32(next_input, 8);
-}
-
-/*
- * Execute trie traversal with 2 traversals in parallel.
- */
-static inline void
-search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t total_packets, uint32_t categories)
-{
-	int n;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SSE2];
-	struct completion cmplt[MAX_SEARCHES_SSE2];
-	struct parms parms[MAX_SEARCHES_SSE2];
-	xmm_t input, indicies;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
-		total_packets, categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	indicies = MM_LOADU((xmm_t *) &index_array[0]);
-
-	/* Check for any matches. */
-	acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
-
-	while (flows.started > 0) {
-
-		/* Gather 4 bytes of input data for each stream. */
-		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
-
-		/* Process the 4 bytes of input on each stream. */
-
-		input = transition2(mm_index_mask64.m, input,
-			mm_shuffle_input64.m, mm_ones_16.m,
-			mm_bytes64.m, mm_type_quad_range64.m,
-			flows.trans, &indicies);
-
-		input = transition2(mm_index_mask64.m, input,
-			mm_shuffle_input64.m, mm_ones_16.m,
-			mm_bytes64.m, mm_type_quad_range64.m,
-			flows.trans, &indicies);
-
-		input = transition2(mm_index_mask64.m, input,
-			mm_shuffle_input64.m, mm_ones_16.m,
-			mm_bytes64.m, mm_type_quad_range64.m,
-			flows.trans, &indicies);
-
-		input = transition2(mm_index_mask64.m, input,
-			mm_shuffle_input64.m, mm_ones_16.m,
-			mm_bytes64.m, mm_type_quad_range64.m,
-			flows.trans, &indicies);
-
-		/* Check for any matches. */
-		acl_match_check_x2(0, ctx, parms, &flows, &indicies,
-			mm_match_mask64.m);
-	}
-}
-
-/*
- * When processing the transition, rather than using if/else
- * construct, the offset is calculated for DFA and QRANGE and
- * then conditionally added to the address based on node type.
- * This is done to avoid branch mis-predictions. Since the
- * offset is rather simple calculation it is more efficient
- * to do the calculation and do a condition move rather than
- * a conditional branch to determine which calculation to do.
- */
-static inline uint32_t
-scan_forward(uint32_t input, uint32_t max)
-{
-	return (input == 0) ? max : rte_bsf32(input);
-}
-
-static inline uint64_t
-scalar_transition(const uint64_t *trans_table, uint64_t transition,
-	uint8_t input)
-{
-	uint32_t addr, index, ranges, x, a, b, c;
-
-	/* break transition into component parts */
-	ranges = transition >> (sizeof(index) * CHAR_BIT);
-
-	/* calc address for a QRANGE node */
-	c = input * SCALAR_QRANGE_MULT;
-	a = ranges | SCALAR_QRANGE_MIN;
-	index = transition & ~RTE_ACL_NODE_INDEX;
-	a -= (c & SCALAR_QRANGE_MASK);
-	b = c & SCALAR_QRANGE_MIN;
-	addr = transition ^ index;
-	a &= SCALAR_QRANGE_MIN;
-	a ^= (ranges ^ b) & (a ^ b);
-	x = scan_forward(a, 32) >> 3;
-	addr += (index == RTE_ACL_NODE_DFA) ? input : x;
-
-	/* pickup next transition */
-	transition = *(trans_table + addr);
-	return transition;
-}
-
-int
-rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t num, uint32_t categories)
-{
-	int n;
-	uint64_t transition0, transition1;
-	uint32_t input0, input1;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SCALAR];
-	struct completion cmplt[MAX_SEARCHES_SCALAR];
-	struct parms parms[MAX_SEARCHES_SCALAR];
-
-	if (categories != 1 &&
-		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
-		return -EINVAL;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, num,
-		categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SCALAR; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	transition0 = index_array[0];
-	transition1 = index_array[1];
-
-	while (flows.started > 0) {
-
-		input0 = GET_NEXT_4BYTES(parms, 0);
-		input1 = GET_NEXT_4BYTES(parms, 1);
-
-		for (n = 0; n < 4; n++) {
-			if (likely((transition0 & RTE_ACL_NODE_MATCH) == 0))
-				transition0 = scalar_transition(flows.trans,
-					transition0, (uint8_t)input0);
-
-			input0 >>= CHAR_BIT;
-
-			if (likely((transition1 & RTE_ACL_NODE_MATCH) == 0))
-				transition1 = scalar_transition(flows.trans,
-					transition1, (uint8_t)input1);
-
-			input1 >>= CHAR_BIT;
-
-		}
-		if ((transition0 | transition1) & RTE_ACL_NODE_MATCH) {
-			transition0 = acl_match_check_transition(transition0,
-				0, ctx, parms, &flows);
-			transition1 = acl_match_check_transition(transition1,
-				1, ctx, parms, &flows);
-
-		}
-	}
-	return 0;
-}
-
-int
-rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t num, uint32_t categories)
-{
-	if (categories != 1 &&
-		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
-		return -EINVAL;
-
-	if (likely(num >= MAX_SEARCHES_SSE8))
-		search_sse_8(ctx, data, results, num, categories);
-	else if (num >= MAX_SEARCHES_SSE4)
-		search_sse_4(ctx, data, results, num, categories);
-	else
-		search_sse_2(ctx, data, results, num, categories);
-
-	return 0;
-}
diff --git a/lib/librte_acl/acl_run.h b/lib/librte_acl/acl_run.h
new file mode 100644
index 0000000..5009188
--- /dev/null
+++ b/lib/librte_acl/acl_run.h
@@ -0,0 +1,271 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef	_ACL_RUN_H_
+#define	_ACL_RUN_H_
+
+#include <rte_acl.h>
+#include "acl_vect.h"
+#include "acl.h"
+
+#define MAX_SEARCHES_SSE8	8
+#define MAX_SEARCHES_SSE4	4
+#define MAX_SEARCHES_SSE2	2
+#define MAX_SEARCHES_SCALAR	2
+
+#define GET_NEXT_4BYTES(prm, idx)	\
+	(*((const int32_t *)((prm)[(idx)].data + *(prm)[idx].data_index++)))
+
+
+#define RTE_ACL_NODE_INDEX	((uint32_t)~RTE_ACL_NODE_TYPE)
+
+#define	SCALAR_QRANGE_MULT	0x01010101
+#define	SCALAR_QRANGE_MASK	0x7f7f7f7f
+#define	SCALAR_QRANGE_MIN	0x80808080
+
+typedef int (*rte_acl_classify_t)
+(const struct rte_acl_ctx *, const uint8_t **, uint32_t *, uint32_t, uint32_t);
+
+/*
+ * Structure to manage N parallel trie traversals.
+ * The runtime trie traversal routines can process 8, 4, or 2 tries
+ * in parallel. Each packet may require multiple trie traversals (up to 4).
+ * This structure is used to fill the slots (0 to n-1) for parallel processing
+ * with the trie traversals needed for each packet.
+ */
+struct acl_flow_data {
+	uint32_t            num_packets;
+	/* number of packets processed */
+	uint32_t            started;
+	/* number of trie traversals in progress */
+	uint32_t            trie;
+	/* current trie index (0 to N-1) */
+	uint32_t            cmplt_size;
+	uint32_t            total_packets;
+	uint32_t            categories;
+	/* number of result categories per packet. */
+	/* maximum number of packets to process */
+	const uint64_t     *trans;
+	const uint8_t     **data;
+	uint32_t           *results;
+	struct completion  *last_cmplt;
+	struct completion  *cmplt_array;
+};
+
+/*
+ * Structure to maintain running results for
+ * a single packet (up to 4 tries).
+ */
+struct completion {
+	uint32_t *results;                          /* running results. */
+	int32_t   priority[RTE_ACL_MAX_CATEGORIES]; /* running priorities. */
+	uint32_t  count;                            /* num of remaining tries */
+	/* true for allocated struct */
+} __attribute__((aligned(XMM_SIZE)));
+
+/*
+ * One parms structure for each slot in the search engine.
+ */
+struct parms {
+	const uint8_t              *data;
+	/* input data for this packet */
+	const uint32_t             *data_index;
+	/* data indirection for this trie */
+	struct completion          *cmplt;
+	/* completion data for this packet */
+};
+
+/*
+ * Define an global idle node for unused engine slots
+ */
+static const uint32_t idle[UINT8_MAX + 1];
+
+/*
+ * Allocate a completion structure to manage the tries for a packet.
+ */
+static inline struct completion *
+alloc_completion(struct completion *p, uint32_t size, uint32_t tries,
+	uint32_t *results)
+{
+	uint32_t n;
+
+	for (n = 0; n < size; n++) {
+
+		if (p[n].count == 0) {
+
+			/* mark as allocated and set number of tries. */
+			p[n].count = tries;
+			p[n].results = results;
+			return &(p[n]);
+		}
+	}
+
+	/* should never get here */
+	return NULL;
+}
+
+/*
+ * Resolve priority for a single result trie.
+ */
+static inline void
+resolve_single_priority(uint64_t transition, int n,
+	const struct rte_acl_ctx *ctx, struct parms *parms,
+	const struct rte_acl_match_results *p)
+{
+	if (parms[n].cmplt->count == ctx->num_tries ||
+			parms[n].cmplt->priority[0] <=
+			p[transition].priority[0]) {
+
+		parms[n].cmplt->priority[0] = p[transition].priority[0];
+		parms[n].cmplt->results[0] = p[transition].results[0];
+	}
+}
+
+/*
+ * Routine to fill a slot in the parallel trie traversal array (parms) from
+ * the list of packets (flows).
+ */
+static inline uint64_t
+acl_start_next_trie(struct acl_flow_data *flows, struct parms *parms, int n,
+	const struct rte_acl_ctx *ctx)
+{
+	uint64_t transition;
+
+	/* if there are any more packets to process */
+	if (flows->num_packets < flows->total_packets) {
+		parms[n].data = flows->data[flows->num_packets];
+		parms[n].data_index = ctx->trie[flows->trie].data_index;
+
+		/* if this is the first trie for this packet */
+		if (flows->trie == 0) {
+			flows->last_cmplt = alloc_completion(flows->cmplt_array,
+				flows->cmplt_size, ctx->num_tries,
+				flows->results +
+				flows->num_packets * flows->categories);
+		}
+
+		/* set completion parameters and starting index for this slot */
+		parms[n].cmplt = flows->last_cmplt;
+		transition =
+			flows->trans[parms[n].data[*parms[n].data_index++] +
+			ctx->trie[flows->trie].root_index];
+
+		/*
+		 * if this is the last trie for this packet,
+		 * then setup next packet.
+		 */
+		flows->trie++;
+		if (flows->trie >= ctx->num_tries) {
+			flows->trie = 0;
+			flows->num_packets++;
+		}
+
+		/* keep track of number of active trie traversals */
+		flows->started++;
+
+	/* no more tries to process, set slot to an idle position */
+	} else {
+		transition = ctx->idle;
+		parms[n].data = (const uint8_t *)idle;
+		parms[n].data_index = idle;
+	}
+	return transition;
+}
+
+static inline void
+acl_set_flow(struct acl_flow_data *flows, struct completion *cmplt,
+	uint32_t cmplt_size, const uint8_t **data, uint32_t *results,
+	uint32_t data_num, uint32_t categories, const uint64_t *trans)
+{
+	flows->num_packets = 0;
+	flows->started = 0;
+	flows->trie = 0;
+	flows->last_cmplt = NULL;
+	flows->cmplt_array = cmplt;
+	flows->total_packets = data_num;
+	flows->categories = categories;
+	flows->cmplt_size = cmplt_size;
+	flows->data = data;
+	flows->results = results;
+	flows->trans = trans;
+}
+
+typedef void (*resolve_priority_t)
+(uint64_t transition, int n, const struct rte_acl_ctx *ctx,
+        struct parms *parms, const struct rte_acl_match_results *p,
+        uint32_t categories);
+
+/*
+ * Detect matches. If a match node transition is found, then this trie
+ * traversal is complete and fill the slot with the next trie
+ * to be processed.
+ */
+static inline uint64_t
+acl_match_check(uint64_t transition, int slot,
+	const struct rte_acl_ctx *ctx, struct parms *parms,
+	struct acl_flow_data *flows, resolve_priority_t resolve_priority)
+{
+	const struct rte_acl_match_results *p;
+
+	p = (const struct rte_acl_match_results *)
+		(flows->trans + ctx->match_index);
+
+	if (transition & RTE_ACL_NODE_MATCH) {
+
+		/* Remove flags from index and decrement active traversals */
+		transition &= RTE_ACL_NODE_INDEX;
+		flows->started--;
+
+		/* Resolve priorities for this trie and running results */
+		if (flows->categories == 1)
+			resolve_single_priority(transition, slot, ctx,
+				parms, p);
+		else
+			resolve_priority(transition, slot, ctx, parms,
+				p, flows->categories);
+
+		/* Count down completed tries for this search request */
+		parms[slot].cmplt->count--;
+
+		/* Fill the slot with the next trie or idle trie */
+		transition = acl_start_next_trie(flows, parms, slot, ctx);
+
+	} else if (transition == ctx->idle) {
+		/* reset indirection table for idle slots */
+		parms[slot].data_index = idle;
+	}
+
+	return transition;
+}
+
+#endif /* _ACL_RUN_H_ */
diff --git a/lib/librte_acl/acl_run_scalar.c b/lib/librte_acl/acl_run_scalar.c
new file mode 100644
index 0000000..4bf58c7
--- /dev/null
+++ b/lib/librte_acl/acl_run_scalar.c
@@ -0,0 +1,197 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "acl_run.h"
+
+int
+rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
+        uint32_t *results, uint32_t num, uint32_t categories);
+
+/*
+ * Resolve priority for multiple results (scalar version).
+ * This consists comparing the priority of the current traversal with the
+ * running set of results for the packet.
+ * For each result, keep a running array of the result (rule number) and
+ * its priority for each category.
+ */
+static inline void
+resolve_priority_scalar(uint64_t transition, int n,
+	const struct rte_acl_ctx *ctx, struct parms *parms,
+	const struct rte_acl_match_results *p, uint32_t categories)
+{
+	uint32_t i;
+	int32_t *saved_priority;
+	uint32_t *saved_results;
+	const int32_t *priority;
+	const uint32_t *results;
+
+	saved_results = parms[n].cmplt->results;
+	saved_priority = parms[n].cmplt->priority;
+
+	/* results and priorities for completed trie */
+	results = p[transition].results;
+	priority = p[transition].priority;
+
+	/* if this is not the first completed trie */
+	if (parms[n].cmplt->count != ctx->num_tries) {
+		for (i = 0; i < categories; i += RTE_ACL_RESULTS_MULTIPLIER) {
+
+			if (saved_priority[i] <= priority[i]) {
+				saved_priority[i] = priority[i];
+				saved_results[i] = results[i];
+			}
+			if (saved_priority[i + 1] <= priority[i + 1]) {
+				saved_priority[i + 1] = priority[i + 1];
+				saved_results[i + 1] = results[i + 1];
+			}
+			if (saved_priority[i + 2] <= priority[i + 2]) {
+				saved_priority[i + 2] = priority[i + 2];
+				saved_results[i + 2] = results[i + 2];
+			}
+			if (saved_priority[i + 3] <= priority[i + 3]) {
+				saved_priority[i + 3] = priority[i + 3];
+				saved_results[i + 3] = results[i + 3];
+			}
+		}
+	} else {
+		for (i = 0; i < categories; i += RTE_ACL_RESULTS_MULTIPLIER) {
+			saved_priority[i] = priority[i];
+			saved_priority[i + 1] = priority[i + 1];
+			saved_priority[i + 2] = priority[i + 2];
+			saved_priority[i + 3] = priority[i + 3];
+
+			saved_results[i] = results[i];
+			saved_results[i + 1] = results[i + 1];
+			saved_results[i + 2] = results[i + 2];
+			saved_results[i + 3] = results[i + 3];
+		}
+	}
+}
+
+/*
+ * When processing the transition, rather than using if/else
+ * construct, the offset is calculated for DFA and QRANGE and
+ * then conditionally added to the address based on node type.
+ * This is done to avoid branch mis-predictions. Since the
+ * offset is rather simple calculation it is more efficient
+ * to do the calculation and do a condition move rather than
+ * a conditional branch to determine which calculation to do.
+ */
+static inline uint32_t
+scan_forward(uint32_t input, uint32_t max)
+{
+	return (input == 0) ? max : rte_bsf32(input);
+}
+
+static inline uint64_t
+scalar_transition(const uint64_t *trans_table, uint64_t transition,
+	uint8_t input)
+{
+	uint32_t addr, index, ranges, x, a, b, c;
+
+	/* break transition into component parts */
+	ranges = transition >> (sizeof(index) * CHAR_BIT);
+
+	/* calc address for a QRANGE node */
+	c = input * SCALAR_QRANGE_MULT;
+	a = ranges | SCALAR_QRANGE_MIN;
+	index = transition & ~RTE_ACL_NODE_INDEX;
+	a -= (c & SCALAR_QRANGE_MASK);
+	b = c & SCALAR_QRANGE_MIN;
+	addr = transition ^ index;
+	a &= SCALAR_QRANGE_MIN;
+	a ^= (ranges ^ b) & (a ^ b);
+	x = scan_forward(a, 32) >> 3;
+	addr += (index == RTE_ACL_NODE_DFA) ? input : x;
+
+	/* pickup next transition */
+	transition = *(trans_table + addr);
+	return transition;
+}
+
+int
+rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories)
+{
+	int n;
+	uint64_t transition0, transition1;
+	uint32_t input0, input1;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SCALAR];
+	struct completion cmplt[MAX_SEARCHES_SCALAR];
+	struct parms parms[MAX_SEARCHES_SCALAR];
+
+	if (categories != 1 &&
+		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
+		return -EINVAL;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, num,
+		categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SCALAR; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	transition0 = index_array[0];
+	transition1 = index_array[1];
+
+	while (flows.started > 0) {
+
+		input0 = GET_NEXT_4BYTES(parms, 0);
+		input1 = GET_NEXT_4BYTES(parms, 1);
+
+		for (n = 0; n < 4; n++) {
+			if (likely((transition0 & RTE_ACL_NODE_MATCH) == 0))
+				transition0 = scalar_transition(flows.trans,
+					transition0, (uint8_t)input0);
+
+			input0 >>= CHAR_BIT;
+
+			if (likely((transition1 & RTE_ACL_NODE_MATCH) == 0))
+				transition1 = scalar_transition(flows.trans,
+					transition1, (uint8_t)input1);
+
+			input1 >>= CHAR_BIT;
+
+		}
+		if ((transition0 | transition1) & RTE_ACL_NODE_MATCH) {
+			transition0 = acl_match_check(transition0,
+				0, ctx, parms, &flows, resolve_priority_scalar);
+			transition1 = acl_match_check(transition1,
+				1, ctx, parms, &flows, resolve_priority_scalar);
+
+		}
+	}
+	return 0;
+}
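
Note the structure of the main loop above: two flows are advanced in
lock-step, four input bytes at a time, so the dependent table load in one
stream overlaps with work on the other. Reduced to a toy (the multiply below
merely stands in for scalar_transition(); it is not the real update):

#include <stdint.h>

static uint32_t
run_two(const uint8_t *a, const uint8_t *b, uint32_t len)
{
	uint32_t s0 = 0, s1 = 0, i;

	/* two independent state machines advanced in one loop */
	for (i = 0; i < len; i++) {
		s0 = s0 * 31 + a[i];
		s1 = s1 * 31 + b[i];
	}
	return s0 ^ s1;
}

int main(void)
{
	const uint8_t a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1};

	return (int)(run_two(a, b, 4) & 0x7f);
}
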
diff --git a/lib/librte_acl/acl_run_sse.c b/lib/librte_acl/acl_run_sse.c
new file mode 100644
index 0000000..7ae63dd
--- /dev/null
+++ b/lib/librte_acl/acl_run_sse.c
@@ -0,0 +1,630 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "acl_run.h"
+
+int
+rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
+        uint32_t *results, uint32_t num, uint32_t categories);
+
+enum {
+	SHUFFLE32_SLOT1 = 0xe5,
+	SHUFFLE32_SLOT2 = 0xe6,
+	SHUFFLE32_SLOT3 = 0xe7,
+	SHUFFLE32_SWAP64 = 0x4e,
+};
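
These values are _mm_shuffle_epi32() immediates: four 2-bit lane selectors,
low lane first. SHUFFLE32_SWAP64 (0x4e) selects lanes {2, 3, 0, 1}, swapping
the 64-bit halves of the register, while 0xe5/0xe6/0xe7 select {1,1,2,3},
{2,1,2,3} and {3,1,2,3}, i.e. they rotate slot 1, 2 or 3 into lane 0 for
extraction. A quick decode check:

#include <stdio.h>

int main(void)
{
	unsigned int imm = 0x4e;	/* SHUFFLE32_SWAP64 */
	unsigned int i;

	for (i = 0; i < 4; i++)
		printf("dst[%u] = src[%u]\n", i, (imm >> (2 * i)) & 3);
	return 0;
}
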
+
+static const rte_xmm_t mm_type_quad_range = {
+	.u32 = {
+		RTE_ACL_NODE_QRANGE,
+		RTE_ACL_NODE_QRANGE,
+		RTE_ACL_NODE_QRANGE,
+		RTE_ACL_NODE_QRANGE,
+	},
+};
+
+static const rte_xmm_t mm_type_quad_range64 = {
+	.u32 = {
+		RTE_ACL_NODE_QRANGE,
+		RTE_ACL_NODE_QRANGE,
+		0,
+		0,
+	},
+};
+
+static const rte_xmm_t mm_shuffle_input = {
+	.u32 = {0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c},
+};
+
+static const rte_xmm_t mm_shuffle_input64 = {
+	.u32 = {0x00000000, 0x04040404, 0x80808080, 0x80808080},
+};
+
+static const rte_xmm_t mm_ones_16 = {
+	.u16 = {1, 1, 1, 1, 1, 1, 1, 1},
+};
+
+static const rte_xmm_t mm_bytes = {
+	.u32 = {UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX},
+};
+
+static const rte_xmm_t mm_bytes64 = {
+	.u32 = {UINT8_MAX, UINT8_MAX, 0, 0},
+};
+
+static const rte_xmm_t mm_match_mask = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+	},
+};
+
+static const rte_xmm_t mm_match_mask64 = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		0,
+		RTE_ACL_NODE_MATCH,
+		0,
+	},
+};
+
+static const rte_xmm_t mm_index_mask = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+	},
+};
+
+static const rte_xmm_t mm_index_mask64 = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		0,
+		0,
+	},
+};
+
+
+/*
+ * Resolve priority for multiple results (sse version).
+ * This consists of comparing the priority of the current traversal with the
+ * running set of results for the packet.
+ * For each result, keep a running array of the result (rule number) and
+ * its priority for each category.
+ */
+static inline void
+resolve_priority_sse(uint64_t transition, int n, const struct rte_acl_ctx *ctx,
+	struct parms *parms, const struct rte_acl_match_results *p,
+	uint32_t categories)
+{
+	uint32_t x;
+	xmm_t results, priority, results1, priority1, selector;
+	xmm_t *saved_results, *saved_priority;
+
+	for (x = 0; x < categories; x += RTE_ACL_RESULTS_MULTIPLIER) {
+
+		saved_results = (xmm_t *)(&parms[n].cmplt->results[x]);
+		saved_priority =
+			(xmm_t *)(&parms[n].cmplt->priority[x]);
+
+		/* get results and priorities for completed trie */
+		results = MM_LOADU((const xmm_t *)&p[transition].results[x]);
+		priority = MM_LOADU((const xmm_t *)&p[transition].priority[x]);
+
+		/* if this is not the first completed trie */
+		if (parms[n].cmplt->count != ctx->num_tries) {
+
+			/* get running best results and their priorities */
+			results1 = MM_LOADU(saved_results);
+			priority1 = MM_LOADU(saved_priority);
+
+			/* select results that are highest priority */
+			selector = MM_CMPGT32(priority1, priority);
+			results = MM_BLENDV8(results, results1, selector);
+			priority = MM_BLENDV8(priority, priority1, selector);
+		}
+
+		/* save running best results and their priorities */
+		MM_STOREU(saved_results, results);
+		MM_STOREU(saved_priority, priority);
+	}
+}
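
Per 32-bit lane, the CMPGT32/BLENDV8 pair above behaves like the following
scalar selection; note that a zero selector lane (saved priority not strictly
greater) lets the new value through, so ties go to the later trie, matching
the `<=` in the scalar resolve_priority earlier in this series:

#include <stdint.h>

static void
blend_lane(uint32_t saved_res[4], int32_t saved_pri[4],
	const uint32_t res[4], const int32_t pri[4])
{
	int i;

	for (i = 0; i < 4; i++) {
		if (pri[i] >= saved_pri[i]) {
			saved_res[i] = res[i];
			saved_pri[i] = pri[i];
		}
	}
}

int main(void)
{
	uint32_t sr[4] = {1, 1, 1, 1};
	int32_t sp[4] = {5, 5, 5, 5};
	const uint32_t nr[4] = {2, 2, 2, 2};
	const int32_t np[4] = {9, 5, 1, 5};

	blend_lane(sr, sp, nr, np);	/* lanes 0, 1 and 3 take the new result */
	return (int)sr[0];
}
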
+
+/*
+ * Extract transitions from an XMM register and check for any matches
+ */
+static void
+acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
+	struct parms *parms, struct acl_flow_data *flows)
+{
+	uint64_t transition1, transition2;
+
+	/* extract transition from low 64 bits. */
+	transition1 = MM_CVT64(*indicies);
+
+	/* extract transition from high 64 bits. */
+	*indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
+	transition2 = MM_CVT64(*indicies);
+
+	transition1 = acl_match_check(transition1, slot, ctx,
+		parms, flows, resolve_priority_sse);
+	transition2 = acl_match_check(transition2, slot + 1, ctx,
+		parms, flows, resolve_priority_sse);
+
+	/* update indicies with new transitions. */
+	*indicies = MM_SET64(transition2, transition1);
+}
+
+/*
+ * Check for a match in 2 transitions (contained in SSE register)
+ */
+static inline void
+acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
+	struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
+{
+	xmm_t temp;
+
+	temp = MM_AND(match_mask, *indicies);
+	while (!MM_TESTZ(temp, temp)) {
+		acl_process_matches(indicies, slot, ctx, parms, flows);
+		temp = MM_AND(match_mask, *indicies);
+	}
+}
+
+/*
+ * Check for any match in 4 transitions (contained in 2 SSE registers)
+ */
+static inline void
+acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
+	struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
+	xmm_t match_mask)
+{
+	xmm_t temp;
+
+	/* put low 32 bits of each transition into one register */
+	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
+		0x88);
+	/* test for match node */
+	temp = MM_AND(match_mask, temp);
+
+	while (!MM_TESTZ(temp, temp)) {
+		acl_process_matches(indicies1, slot, ctx, parms, flows);
+		acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
+
+		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
+					(__m128)*indicies2,
+					0x88);
+		temp = MM_AND(match_mask, temp);
+	}
+}
+
+/*
+ * Calculate the address of the next transition for
+ * all types of nodes. Note that only DFA nodes and range
+ * nodes actually transition to another node. Match
+ * nodes don't move.
+ */
+static inline xmm_t
+acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
+	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
+	xmm_t *indicies1, xmm_t *indicies2)
+{
+	xmm_t addr, node_types, temp;
+
+	/*
+	 * Note that no transition is done for a match
+	 * node and therefore a stream freezes when
+	 * it reaches a match.
+	 */
+
+	/* Shuffle low 32 into temp and high 32 into indicies2 */
+	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
+		0x88);
+	*indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
+		(__m128)*indicies2, 0xdd);
+
+	/* Calc node type and node addr */
+	node_types = MM_ANDNOT(index_mask, temp);
+	addr = MM_AND(index_mask, temp);
+
+	/*
+	 * Calc addr for DFAs - addr = dfa_index + input_byte
+	 */
+
+	/* mask for DFA type (0) nodes */
+	temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
+
+	/* add input byte to DFA position */
+	temp = MM_AND(temp, bytes);
+	temp = MM_AND(temp, next_input);
+	addr = MM_ADD32(addr, temp);
+
+	/*
+	 * Calc addr for Range nodes -> range_index + range(input)
+	 */
+	node_types = MM_CMPEQ32(node_types, type_quad_range);
+
+	/*
+	 * Calculate number of range boundaries that are less than the
+	 * input value. Range boundaries for each node are in signed 8 bit,
+	 * ordered from -128 to 127 in the indicies2 register.
+	 * This is effectively a popcnt of the boundary bytes that are
+	 * less than the input byte.
+	 */
+
+	/* shuffle input byte to all 4 positions of 32 bit value */
+	temp = MM_SHUFFLE8(next_input, shuffle_input);
+
+	/* check ranges */
+	temp = MM_CMPGT8(temp, *indicies2);
+
+	/* convert -1 to 1 (bytes greater than input byte) */
+	temp = MM_SIGN8(temp, temp);
+
+	/* horizontal add pairs of bytes into words */
+	temp = MM_MADD8(temp, temp);
+
+	/* horizontal add pairs of words into dwords */
+	temp = MM_MADD16(temp, ones_16);
+
+	/* mask to range type nodes */
+	temp = MM_AND(temp, node_types);
+
+	/* add index into node position */
+	return MM_ADD32(addr, temp);
+}
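
The CMPGT8/SIGN8/MADD8/MADD16 sequence is a SWAR idiom: CMPGT8 leaves -1 in
every byte lane where the input exceeds a boundary, SIGN8 turns those -1s
into +1s, and the two multiply-add steps sum each group of four bytes into a
32-bit count. Per node it is equivalent to this plain loop:

#include <stdint.h>

static uint32_t
count_exceeded(int8_t input, const int8_t bounds[4])
{
	uint32_t n = 0;
	int i;

	for (i = 0; i < 4; i++)
		n += (input > bounds[i]);
	return n;
}

int main(void)
{
	const int8_t bounds[4] = {-100, -10, 50, 127};

	/* input 60 exceeds three of the four boundaries */
	return (int)count_exceeded(60, bounds);
}
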
+
+/*
+ * Process 4 transitions (in 2 SIMD registers) in parallel
+ */
+static inline xmm_t
+transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
+	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
+	const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
+{
+	xmm_t addr;
+	uint64_t trans0, trans2;
+
+	 /* Calculate the address (array index) for all 4 transitions. */
+
+	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
+		bytes, type_quad_range, indicies1, indicies2);
+
+	 /* Gather 64 bit transitions and pack back into 2 registers. */
+
+	trans0 = trans[MM_CVT32(addr)];
+
+	/* get slot 2 */
+
+	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
+	trans2 = trans[MM_CVT32(addr)];
+
+	/* get slot 1 */
+
+	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
+	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
+
+	/* get slot 3 */
+
+	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
+	*indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
+
+	return MM_SRL32(next_input, 8);
+}
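
SSE4 has no gather instruction, so the four 32-bit addresses are pulled out
one lane at a time: MM_CVT32 extracts lane 0, and each SHUFFLE32_SLOTn
rotates the next lane into position. Data-flow equivalent, with names
mirroring the code above:

#include <stdint.h>

static void
gather4(const uint64_t *trans, const uint32_t addr[4],
	uint64_t indicies1[2], uint64_t indicies2[2])
{
	indicies1[0] = trans[addr[0]];	/* MM_CVT32 on the raw addr */
	indicies2[0] = trans[addr[2]];	/* after SHUFFLE32_SLOT2 */
	indicies1[1] = trans[addr[1]];	/* after SHUFFLE32_SLOT1 */
	indicies2[1] = trans[addr[3]];	/* after SHUFFLE32_SLOT3 */
}

int main(void)
{
	const uint64_t trans[4] = {10, 11, 12, 13};
	const uint32_t addr[4] = {0, 1, 2, 3};
	uint64_t i1[2], i2[2];

	gather4(trans, addr, i1, i2);
	return (int)(i1[0] + i2[1]);	/* 10 + 13 */
}
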
+
+/*
+ * Execute trie traversal with 8 traversals in parallel
+ */
+static inline int
+search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	int n;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SSE8];
+	struct completion cmplt[MAX_SEARCHES_SSE8];
+	struct parms parms[MAX_SEARCHES_SSE8];
+	xmm_t input0, input1;
+	xmm_t indicies1, indicies2, indicies3, indicies4;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+		total_packets, categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	/*
+	 * indicies1 contains index_array[0,1]
+	 * indicies2 contains index_array[2,3]
+	 * indicies3 contains index_array[4,5]
+	 * indicies4 contains index_array[6,7]
+	 */
+
+	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
+	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
+
+	indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
+	indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
+
+	 /* Check for any matches. */
+	acl_match_check_x4(0, ctx, parms, &flows,
+		&indicies1, &indicies2, mm_match_mask.m);
+	acl_match_check_x4(4, ctx, parms, &flows,
+		&indicies3, &indicies4, mm_match_mask.m);
+
+	while (flows.started > 0) {
+
+		/* Gather 4 bytes of input data for each stream. */
+		input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
+			0);
+		input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
+			0);
+
+		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
+		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
+
+		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
+		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
+
+		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
+		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
+
+		 /* Process the 4 bytes of input on each stream. */
+
+		input0 = transition4(mm_index_mask.m, input0,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		input1 = transition4(mm_index_mask.m, input1,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies3, &indicies4);
+
+		input0 = transition4(mm_index_mask.m, input0,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		input1 = transition4(mm_index_mask.m, input1,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies3, &indicies4);
+
+		input0 = transition4(mm_index_mask.m, input0,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		input1 = transition4(mm_index_mask.m, input1,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies3, &indicies4);
+
+		input0 = transition4(mm_index_mask.m, input0,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		input1 = transition4(mm_index_mask.m, input1,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies3, &indicies4);
+
+		 /* Check for any matches. */
+		acl_match_check_x4(0, ctx, parms, &flows,
+			&indicies1, &indicies2, mm_match_mask.m);
+		acl_match_check_x4(4, ctx, parms, &flows,
+			&indicies3, &indicies4, mm_match_mask.m);
+	}
+
+	return 0;
+}
+
+/*
+ * Execute trie traversal with 4 traversals in parallel
+ */
+static inline int
+search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	 uint32_t *results, int total_packets, uint32_t categories)
+{
+	int n;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SSE4];
+	struct completion cmplt[MAX_SEARCHES_SSE4];
+	struct parms parms[MAX_SEARCHES_SSE4];
+	xmm_t input, indicies1, indicies2;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+		total_packets, categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
+	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
+
+	/* Check for any matches. */
+	acl_match_check_x4(0, ctx, parms, &flows,
+		&indicies1, &indicies2, mm_match_mask.m);
+
+	while (flows.started > 0) {
+
+		/* Gather 4 bytes of input data for each stream. */
+		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
+
+		/* Process the 4 bytes of input on each stream. */
+		input = transition4(mm_index_mask.m, input,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		 input = transition4(mm_index_mask.m, input,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		 input = transition4(mm_index_mask.m, input,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		 input = transition4(mm_index_mask.m, input,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		/* Check for any matches. */
+		acl_match_check_x4(0, ctx, parms, &flows,
+			&indicies1, &indicies2, mm_match_mask.m);
+	}
+
+	return 0;
+}
+
+static inline xmm_t
+transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
+	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
+	const uint64_t *trans, xmm_t *indicies1)
+{
+	uint64_t t;
+	xmm_t addr, indicies2;
+
+	indicies2 = MM_XOR(ones_16, ones_16);
+
+	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
+		bytes, type_quad_range, indicies1, &indicies2);
+
+	/* Gather 64 bit transitions and pack 2 per register. */
+
+	t = trans[MM_CVT32(addr)];
+
+	/* get slot 1 */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
+	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
+
+	return MM_SRL32(next_input, 8);
+}
+
+/*
+ * Execute trie traversal with 2 traversals in parallel.
+ */
+static inline int
+search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	int n;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SSE2];
+	struct completion cmplt[MAX_SEARCHES_SSE2];
+	struct parms parms[MAX_SEARCHES_SSE2];
+	xmm_t input, indicies;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+		total_packets, categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	indicies = MM_LOADU((xmm_t *) &index_array[0]);
+
+	/* Check for any matches. */
+	acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
+
+	while (flows.started > 0) {
+
+		/* Gather 4 bytes of input data for each stream. */
+		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
+
+		/* Process the 4 bytes of input on each stream. */
+
+		input = transition2(mm_index_mask64.m, input,
+			mm_shuffle_input64.m, mm_ones_16.m,
+			mm_bytes64.m, mm_type_quad_range64.m,
+			flows.trans, &indicies);
+
+		input = transition2(mm_index_mask64.m, input,
+			mm_shuffle_input64.m, mm_ones_16.m,
+			mm_bytes64.m, mm_type_quad_range64.m,
+			flows.trans, &indicies);
+
+		input = transition2(mm_index_mask64.m, input,
+			mm_shuffle_input64.m, mm_ones_16.m,
+			mm_bytes64.m, mm_type_quad_range64.m,
+			flows.trans, &indicies);
+
+		input = transition2(mm_index_mask64.m, input,
+			mm_shuffle_input64.m, mm_ones_16.m,
+			mm_bytes64.m, mm_type_quad_range64.m,
+			flows.trans, &indicies);
+
+		/* Check for any matches. */
+		acl_match_check_x2(0, ctx, parms, &flows, &indicies,
+			mm_match_mask64.m);
+	}
+
+	return 0;
+}
+
+int
+rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories)
+{
+	if (categories != 1 &&
+		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
+		return -EINVAL;
+
+	if (likely(num >= MAX_SEARCHES_SSE8))
+		return search_sse_8(ctx, data, results, num, categories);
+	else if (num >= MAX_SEARCHES_SSE4)
+		return search_sse_4(ctx, data, results, num, categories);
+	else
+		return search_sse_2(ctx, data, results, num, categories);
+}
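
The wrapper picks the widest path the burst can fill; assuming the usual
values of MAX_SEARCHES_SSE8/4/2 (8, 4 and 2, defined in acl_run.h), the
selection behaves as below, with acl_start_next_trie() feeding leftover
packets in as earlier flows finish:

#include <stdio.h>

static const char *
pick_path(unsigned int num)
{
	if (num >= 8)
		return "search_sse_8";
	if (num >= 4)
		return "search_sse_4";
	return "search_sse_2";
}

int main(void)
{
	unsigned int n;

	for (n = 1; n <= 9; n++)
		printf("num=%u -> %s\n", n, pick_path(n));
	return 0;
}
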
diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 7c288bd..741bed4 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -33,11 +33,72 @@
 
 #include <rte_acl.h>
 #include "acl.h"
+#include "acl_run.h"
 
 #define	BIT_SIZEOF(x)	(sizeof(x) * CHAR_BIT)
 
 TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
 
+extern int
+rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
+        uint32_t *results, uint32_t num, uint32_t categories);
+
+extern int
+rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
+        uint32_t *results, uint32_t num, uint32_t categories);
+
+static rte_acl_classify_t classify_fns[] = {
+	[RTE_ACL_CLASSIFY_DEFAULT] = rte_acl_classify_scalar,
+	[RTE_ACL_CLASSIFY_SCALAR] = rte_acl_classify_scalar,
+	[RTE_ACL_CLASSIFY_SSE] = rte_acl_classify_sse,
+};
+
+/* by default, use the always available scalar code path. */
+static enum rte_acl_classify_alg rte_acl_default_classify = RTE_ACL_CLASSIFY_SCALAR;
+
+void rte_acl_set_default_classify(enum rte_acl_classify_alg alg)
+{
+	rte_acl_default_classify = alg;
+}
+
+void rte_acl_set_ctx_classify(struct rte_acl_ctx *ctx, enum rte_acl_classify_alg alg)
+{
+	ctx->alg = alg;
+}
+
+static void __attribute__((constructor))
+rte_acl_init(void)
+{
+	enum rte_acl_classify_alg alg = RTE_ACL_CLASSIFY_DEFAULT;
+
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
+		alg = RTE_ACL_CLASSIFY_SSE;
+
+	rte_acl_set_default_classify(alg);
+}
+
+int rte_acl_classify(const struct rte_acl_ctx *ctx,
+		     const uint8_t **data,
+		     uint32_t *results, uint32_t num,
+		     uint32_t categories)
+{
+	return classify_fns[ctx->alg](ctx, data, results, num, categories);
+}
+
+int rte_acl_classify_alg(const struct rte_acl_ctx *ctx,
+			 enum rte_acl_classify_alg alg,
+			 const uint8_t **data,
+			 uint32_t *results, uint32_t num,
+			 uint32_t categories)
+{
+	return classify_fns[alg](ctx, data, results, num, categories);
+}
+
 struct rte_acl_ctx *
 rte_acl_find_existing(const char *name)
 {
@@ -165,6 +226,7 @@ rte_acl_create(const struct rte_acl_param *param)
 		ctx->max_rules = param->max_rule_num;
 		ctx->rule_sz = param->rule_size;
 		ctx->socket_id = param->socket_id;
+		ctx->alg = rte_acl_default_classify;
 		snprintf(ctx->name, sizeof(ctx->name), "%s", param->name);
 
 		te->data = (void *) ctx;
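
Taken together, the new entry points would be used roughly as below. This is
a hypothetical caller: `ctx` is assumed to be already built, and buffer setup
and error handling are elided.

#include <rte_acl.h>

static void
demo_classify(struct rte_acl_ctx *ctx, const uint8_t **bufs,
	uint32_t *res, uint32_t num)
{
	/* pin this context to the scalar path */
	rte_acl_set_ctx_classify(ctx, RTE_ACL_CLASSIFY_SCALAR);
	rte_acl_classify(ctx, bufs, res, num, 1);

	/* one-off override that leaves ctx->alg untouched */
	rte_acl_classify_alg(ctx, RTE_ACL_CLASSIFY_SSE,
		bufs, res, num, 1);
}
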
diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
index afc0f69..c092a49 100644
--- a/lib/librte_acl/rte_acl.h
+++ b/lib/librte_acl/rte_acl.h
@@ -259,39 +259,6 @@ void
 rte_acl_reset(struct rte_acl_ctx *ctx);
 
 /**
- * Search for a matching ACL rule for each input data buffer.
- * Each input data buffer can have up to *categories* matches.
- * That implies that results array should be big enough to hold
- * (categories * num) elements.
- * Also categories parameter should be either one or multiple of
- * RTE_ACL_RESULTS_MULTIPLIER and can't be bigger than RTE_ACL_MAX_CATEGORIES.
- * If more than one rule is applicable for given input buffer and
- * given category, then rule with highest priority will be returned as a match.
- * Note, that it is a caller responsibility to ensure that input parameters
- * are valid and point to correct memory locations.
- *
- * @param ctx
- *   ACL context to search with.
- * @param data
- *   Array of pointers to input data buffers to perform search.
- *   Note that all fields in input data buffers supposed to be in network
- *   byte order (MSB).
- * @param results
- *   Array of search results, *categories* results per each input data buffer.
- * @param num
- *   Number of elements in the input data buffers array.
- * @param categories
- *   Number of maximum possible matches for each input buffer, one possible
- *   match per category.
- * @return
- *   zero on successful completion.
- *   -EINVAL for incorrect arguments.
- */
-int
-rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t num, uint32_t categories);
-
-/**
  * Perform scalar search for a matching ACL rule for each input data buffer.
  * Note, that while the search itself will avoid explicit use of SSE/AVX
  *   intrinsics, code for comparing matching results/priorities still might use
@@ -323,9 +290,36 @@ rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
  *   zero on successful completion.
  *   -EINVAL for incorrect arguments.
  */
-int
-rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t num, uint32_t categories);
+
+enum rte_acl_classify_alg {
+	RTE_ACL_CLASSIFY_DEFAULT = 0,
+	RTE_ACL_CLASSIFY_SCALAR = 1,
+	RTE_ACL_CLASSIFY_SSE = 2,
+};
+
+extern int
+rte_acl_classify(const struct rte_acl_ctx *ctx,
+		 const uint8_t **data,
+		 uint32_t *results, uint32_t num,
+		 uint32_t categories);
+
+extern int
+rte_acl_classify_alg(const struct rte_acl_ctx *ctx,
+		 enum rte_acl_classify_alg alg,
+		 const uint8_t **data,
+		 uint32_t *results, uint32_t num,
+		 uint32_t categories);
+/*
+ * Set the default classify algorithm for newly allocated classify contexts
+ */
+extern void
+rte_acl_set_default_classify(enum rte_acl_classify_alg alg);
+
+/*
+ * Override the default classifier function for a given ctx
+ */
+extern void
+rte_acl_set_ctx_classify(struct rte_acl_ctx *ctx, enum rte_acl_classify_alg alg);
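
One ordering subtlety worth noting: rte_acl_create() snapshots the default
into ctx->alg, so rte_acl_set_default_classify() only affects contexts
created after the call. Sketch, with `param` assumed to be a filled-in
rte_acl_param:

#include <rte_acl.h>

static struct rte_acl_ctx *
create_scalar_ctx(const struct rte_acl_param *param)
{
	rte_acl_set_default_classify(RTE_ACL_CLASSIFY_SCALAR);
	return rte_acl_create(param);	/* inherits the scalar default */
}
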
 
 /**
  * Dump an ACL context structure to the console.
-- 
1.9.3
^ permalink raw reply	[relevance 1%]
* Re: [dpdk-dev] [PATCHv3] librte_acl make it build/work for 'default' target
  2014-08-27 19:18  0%           ` Ananyev, Konstantin
  2014-08-28  9:02  0%             ` Richardson, Bruce
@ 2014-08-28 15:55  0%             ` Neil Horman
  1 sibling, 0 replies; 86+ results
From: Neil Horman @ 2014-08-28 15:55 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev
On Wed, Aug 27, 2014 at 07:18:44PM +0000, Ananyev, Konstantin wrote:
> 
> 
> > -----Original Message-----
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Wednesday, August 27, 2014 7:57 PM
> > To: Ananyev, Konstantin
> > Cc: dev@dpdk.org; thomas.monjalon@6wind.com
> > Subject: Re: [PATCHv3] librte_acl make it build/work for 'default' target
> > 
> > On Wed, Aug 27, 2014 at 11:25:04AM +0000, Ananyev, Konstantin wrote:
> > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > Sent: Tuesday, August 26, 2014 6:45 PM
> > > > To: Ananyev, Konstantin
> > > > Cc: dev@dpdk.org; thomas.monjalon@6wind.com
> > > > Subject: Re: [PATCHv3] librte_acl make it build/work for 'default' target
> > > >
> > > > On Mon, Aug 25, 2014 at 04:30:05PM +0000, Ananyev, Konstantin wrote:
> > > > > Hi Neil,
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > > > Sent: Thursday, August 21, 2014 9:15 PM
> > > > > > To: dev@dpdk.org
> > > > > > Cc: Ananyev, Konstantin; thomas.monjalon@6wind.com; Neil Horman
> > > > > > Subject: [PATCHv3] librte_acl make it build/work for 'default' target
> > > > > >
> > > > > > Make ACL library to build/work on 'default' architecture:
> > > > > > - make rte_acl_classify_scalar really scalar
> > > > > >  (make sure it wouldn't use sse4 intrinsics through resolve_priority()).
> > > > > > - Provide two versions of rte_acl_classify code path:
> > > > > >   rte_acl_classify_sse() - could be build and used only on systems with sse4.2
> > > > > >   and upper, return -ENOTSUP on lower arch.
> > > > > >   rte_acl_classify_scalar() - a slower version, but could be build and used
> > > > > >   on all systems.
> > > > > > - keep common code shared between these two codepaths.
> > > > > >
> > > > > > v2 changes:
> > > > > >  run-time selection of most appropriate code-path for given ISA.
> > > > > >  By default the highest supported one is selected.
> > > > > >  User can still override that selection by manually assigning new value to
> > > > > >  the global function pointer rte_acl_default_classify.
> > > > > >  rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
> > > > > >  points to.
> > > > > >
> > > > >
> > > > > I see you decided not to wait for me and fix everything by yourself :)
> > > > >
> > > > Yeah, sorry, I'm getting pinged about enabling these features in Fedora, and it
> > > > had been about 2 weeks, so I figured I'd just take care of it.
> > >
> > > No worries. I admit that it was a long delay from my side.
> > >
> > > >
> > > > > > V3 Changes
> > > > > >  Updated classify pointer to be a function so as to better preserve ABI
> > > > >
> > > > > As I said in my previous mail it generates extra jump...
> > > > > Though from numbers I got the performance impact is negligible: < 1%.
> > > > > So I suppose, I don't have a good enough reason to object :)
> > > > >
> > > > Yeah, I just don't see a way around it.  I was hoping that the compiler would
> > > > have been smart enough to see that the rte_acl_classify function was small and
> > > > in-linable, but apparently it won't do that.  As you note however the
> > > > performance change is minor (I'm guessing within a standard deviation of your
> > > > results).
> > > >
> > > > > Though I still think we better keep  rte_acl_classify_scalar() publically available (same as we do for rte acl_classify_sse()):
> > > > > First of all keep  rte_acl_classify_scalar() is already part of our public API.
> > > > > Also, as I remember, one of the customers explicitly asked for scalar version and they planned to call it directly.
> > > > > Plus using rte_acl_select_classify() to always switch between implementations is not always handy:
> > > >
> > > > I'm not exactly opposed to this, though it seems odd to me that a user might
> > > > want to call a particular version of the classifier directly.  But I certainly
> > > > can't predict everything a consumer wants to do.  If we really need to keep it
> > > > public then, it begs the question, is providing a generic entry point even
> > > > worthwhile?  Is it just as easy to expose the scalar/sse and any future versions
> > > > directly so the application can just embody the intelligence to select the best
> > > > path?  That saves us having to maintain another API point.  I can go with
> > > > consensus on that.
> > > >
> > > > > -  it is global, which means that we can't simultaneously use classify_scalar() and classify_sse() for 2 different ACL contexts.
> > > > > - to properly support such switching we then will need to support something like (see app/test/test_acl.c below):
> > > > >   old_alg = rte_acl_get_classify();
> > > > >   rte_acl_select_classify(new_alg);
> > > > >   ...
> > > > >   rte_acl_select_classify(old_alg);
> > > > >
> > > > We could attach the classification method to the acl context, so each
> > > > rte_acl_ctx can point to whatever classifier function it wants to.  That would
> > > > remove the global issues you point out above.
> > >
> > > I thought about that approach too.
> > > But there is one implication with DPDK MP model:
> > > Same ACL context can be shared by different DPDK processes,
> > > while acl_classify() could be loaded to the different addresses.
> > > Of course we can overcome it by creating a global table of function pointers indexed by classify_alg and
> > > store inside ACL ctx alg instead of actual function pointer.
> > > But that means extra overhead of at least two loads per classify() call.
> > >
> > Hmm, how is the context shared around between processes?  Is it just shared as a
> > common cow data page resulting from a fork?  If so, then we should be good
> > because the DSO text will be at the same address (i.e. the pointer will still be
> > valid).  If you do some sort of message passing, then, yes, thats a problem.
> > 
> 
> No, it is not parent-child relationship.
> There could be a group of  independently spawned processes.
> One of them should be 'primary' (starts first), other 'secondary's'.
> All hugepage memory pages mapped by the primary process, supposed to be mapped to the same VAs by each secondary.    
> So all stuff that is allocated from hugepage memory is shared between all processes in the group.
> More  detailed  description: http://dpdk.org/doc/intel/dpdk-prog-guide-1.7.0.pdf, section 23.
> 
Ugh, so because you explicitly share heap memory space across all processes, we
can never guarantee any pointers to statically allocated symbols, like functions
or global data.  Great.  Ok, I'll try to rework this.
Neil
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCHv3] librte_acl make it build/work for 'default' target
  2014-08-27 19:18  0%           ` Ananyev, Konstantin
@ 2014-08-28  9:02  0%             ` Richardson, Bruce
  2014-08-28 15:55  0%             ` Neil Horman
  1 sibling, 0 replies; 86+ results
From: Richardson, Bruce @ 2014-08-28  9:02 UTC (permalink / raw)
  To: Ananyev, Konstantin, Neil Horman; +Cc: dev
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev, Konstantin
> Sent: Wednesday, August 27, 2014 8:19 PM
> To: Neil Horman
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCHv3] librte_acl make it build/work for 'default'
> target
> 
> 
> 
> > -----Original Message-----
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Wednesday, August 27, 2014 7:57 PM
> > To: Ananyev, Konstantin
> > Cc: dev@dpdk.org; thomas.monjalon@6wind.com
> > Subject: Re: [PATCHv3] librte_acl make it build/work for 'default' target
> >
> > On Wed, Aug 27, 2014 at 11:25:04AM +0000, Ananyev, Konstantin wrote:
> > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > Sent: Tuesday, August 26, 2014 6:45 PM
> > > > To: Ananyev, Konstantin
> > > > Cc: dev@dpdk.org; thomas.monjalon@6wind.com
> > > > Subject: Re: [PATCHv3] librte_acl make it build/work for 'default' target
> > > >
> > > > On Mon, Aug 25, 2014 at 04:30:05PM +0000, Ananyev, Konstantin wrote:
> > > > > Hi Neil,
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > > > Sent: Thursday, August 21, 2014 9:15 PM
> > > > > > To: dev@dpdk.org
> > > > > > Cc: Ananyev, Konstantin; thomas.monjalon@6wind.com; Neil Horman
> > > > > > Subject: [PATCHv3] librte_acl make it build/work for 'default' target
> > > > > >
> > > > > > Make ACL library to build/work on 'default' architecture:
> > > > > > - make rte_acl_classify_scalar really scalar
> > > > > >  (make sure it wouldn't use sse4 intrinsics through resolve_priority()).
> > > > > > - Provide two versions of rte_acl_classify code path:
> > > > > >   rte_acl_classify_sse() - could be build and used only on systems with
> sse4.2
> > > > > >   and upper, return -ENOTSUP on lower arch.
> > > > > >   rte_acl_classify_scalar() - a slower version, but could be build and used
> > > > > >   on all systems.
> > > > > > - keep common code shared between these two codepaths.
> > > > > >
> > > > > > v2 changes:
> > > > > >  run-time selection of most appropriate code-path for given ISA.
> > > > > >  By default the highest supported one is selected.
> > > > > >  User can still override that selection by manually assigning new value
> to
> > > > > >  the global function pointer rte_acl_default_classify.
> > > > > >  rte_acl_classify() becomes a macro calling whatever
> rte_acl_default_classify
> > > > > >  points to.
> > > > > >
> > > > >
> > > > > I see you decided not to wait for me and fix everything by yourself :)
> > > > >
> > > > Yeah, sorry, I'm getting pinged about enabling these features in Fedora,
> and it
> > > > had been about 2 weeks, so I figured I'd just take care of it.
> > >
> > > No worries. I admit that it was a long delay from my side.
> > >
> > > >
> > > > > > V3 Changes
> > > > > >  Updated classify pointer to be a function so as to better preserve ABI
> > > > >
> > > > > As I said in my previous mail it generates extra jump...
> > > > > Though from numbers I got the performance impact is negligible: < 1%.
> > > > > So I suppose, I don't have a good enough reason to object :)
> > > > >
> > > > Yeah, I just don't see a way around it.  I was hoping that the compiler
> would
> > > > have been smart enough to see that the rte_acl_classify function was small
> and
> > > > in-linable, but apparently it won't do that.  As you note however the
> > > > performance change is minor (I'm guessing within a standard deviation of
> your
> > > > results).
> > > >
> > > > > Though I still think we better keep  rte_acl_classify_scalar() publically
> available (same as we do for rte acl_classify_sse()):
> > > > > First of all keep  rte_acl_classify_scalar() is already part of our public API.
> > > > > Also, as I remember, one of the customers explicitly asked for scalar
> version and they planned to call it directly.
> > > > > Plus using rte_acl_select_classify() to always switch between
> implementations is not always handy:
> > > >
> > > > I'm not exactly opposed to this, though it seems odd to me that a user
> might
> > > > want to call a particular version of the classifier directly.  But I certainly
> > > > can't predict everything a consumer wants to do.  If we really need to keep
> it
> > > > public then, it begs the question, is providing a generic entry point even
> > > > worthwhile?  Is it just as easy to expose the scalar/sse and any future
> versions
> > > > directly so the application can just embody the intellegence to select the
> best
> > > > path?  That saves us having to maintain another API point.  I can go with
> > > > consensus on that.
> > > >
> > > > > -  it is global, which means that we can't simultaneously use
> classify_scalar() and classify_sse() for 2 different ACL contexts.
> > > > > - to properly support such switching we then will need to support
> something like (see app/test/test_acl.c below):
> > > > >   old_alg = rte_acl_get_classify();
> > > > >   rte_acl_select_classify(new_alg);
> > > > >   ...
> > > > >   rte_acl_select_classify(old_alg);
> > > > >
> > > > We could attach the classification method to the acl context, so each
> > > > rte_acl_ctx can point to whatever classifier function it wants to.  That would
> > > > remove the global issues you point out above.
> > >
> > > I thought about that approach too.
> > > But there is one implication with DPDK MP model:
> > > Same ACL context can be shared by different DPDK processes,
> > > while acl_classify() could be loaded to the different addresses.
> > > Of course we can overcome it by creating a global table of function pointers
> indexed by classify_alg and
> > > store inside ACL ctx alg instead of actual function pointer.
> > > But that means extra overhead of at least two loads per classify() call.
> > >
> > Hmm, how is the context shared around between processes?  Is it just shared
> as a
> > common cow data page resulting from a fork?  If so, then we should be good
> > because the DSO text will be at the same address (i.e. the pointer will still be
> > valid).  If you do some sort of message passing, then, yes, thats a problem.
> >
> 
> No, it is not parent-child relationship.
> There could be a group of  independently spawned processes.
> One of them should be 'primary' (starts first), other 'secondary's'.
> All hugepage memory pages mapped by the primary process, supposed to be
> mapped to the same VAs by each secondary.
> So all stuff that is allocated from hugepage memory is shared between all
> processes in the group.
> More  detailed  description: http://dpdk.org/doc/intel/dpdk-prog-guide-
> 1.7.0.pdf, section 23.
> 
Function pointers just don't work easily with multiprocess.  Again some
history, since today seems to be my Intel DPDK history day...

For the PMDs, originally we allowed NIC access only by the primary process,
but later removed that limitation by having the secondary processes do a
driver load and pci scan on startup, and having the ethdev structure split
between the function pointer part, which is not shared and is configured
independently in the secondary process as part of the pci scan, and the data
part, which is in hugepage memory and is shared across all processes.  For
the hash library, we needed a different approach and we looked at having
tables of functions, but discarded the idea as largely unworkable when we
took user-specified functions into account.  What we ended up doing was to
provide separate APIs to call the add/delete/lookup functions with a
pre-computed hash, so that multi-process apps could explicitly call the hash
function without using a fn pointer and then pass in the computed value to
the rest of the API calls.

Apologies for the digression from the immediate topic at hand, but I think
it's something that is good to make people generally aware of when working
with DPDK libs.
Regards,
/Bruce
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCHv3] librte_acl make it build/work for 'default' target
  2014-08-27 18:56  0%         ` Neil Horman
@ 2014-08-27 19:18  0%           ` Ananyev, Konstantin
  2014-08-28  9:02  0%             ` Richardson, Bruce
  2014-08-28 15:55  0%             ` Neil Horman
  0 siblings, 2 replies; 86+ results
From: Ananyev, Konstantin @ 2014-08-27 19:18 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev
> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Wednesday, August 27, 2014 7:57 PM
> To: Ananyev, Konstantin
> Cc: dev@dpdk.org; thomas.monjalon@6wind.com
> Subject: Re: [PATCHv3] librte_acl make it build/work for 'default' target
> 
> On Wed, Aug 27, 2014 at 11:25:04AM +0000, Ananyev, Konstantin wrote:
> > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > Sent: Tuesday, August 26, 2014 6:45 PM
> > > To: Ananyev, Konstantin
> > > Cc: dev@dpdk.org; thomas.monjalon@6wind.com
> > > Subject: Re: [PATCHv3] librte_acl make it build/work for 'default' target
> > >
> > > On Mon, Aug 25, 2014 at 04:30:05PM +0000, Ananyev, Konstantin wrote:
> > > > Hi Neil,
> > > >
> > > > > -----Original Message-----
> > > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > > Sent: Thursday, August 21, 2014 9:15 PM
> > > > > To: dev@dpdk.org
> > > > > Cc: Ananyev, Konstantin; thomas.monjalon@6wind.com; Neil Horman
> > > > > Subject: [PATCHv3] librte_acl make it build/work for 'default' target
> > > > >
> > > > > Make ACL library to build/work on 'default' architecture:
> > > > > - make rte_acl_classify_scalar really scalar
> > > > >  (make sure it wouldn't use sse4 intrinsics through resolve_priority()).
> > > > > - Provide two versions of rte_acl_classify code path:
> > > > >   rte_acl_classify_sse() - could be build and used only on systems with sse4.2
> > > > >   and upper, return -ENOTSUP on lower arch.
> > > > >   rte_acl_classify_scalar() - a slower version, but could be build and used
> > > > >   on all systems.
> > > > > - keep common code shared between these two codepaths.
> > > > >
> > > > > v2 changes:
> > > > >  run-time selection of most appropriate code-path for given ISA.
> > > > >  By default the highest supported one is selected.
> > > > >  User can still override that selection by manually assigning new value to
> > > > >  the global function pointer rte_acl_default_classify.
> > > > >  rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
> > > > >  points to.
> > > > >
> > > >
> > > > I see you decided not to wait for me and fix everything by yourself :)
> > > >
> > > Yeah, sorry, I'm getting pinged about enabling these features in Fedora, and it
> > > had been about 2 weeks, so I figured I'd just take care of it.
> >
> > No worries. I admit that it was a long delay from my side.
> >
> > >
> > > > > V3 Changes
> > > > >  Updated classify pointer to be a function so as to better preserve ABI
> > > >
> > > > As I said in my previous mail it generates extra jump...
> > > > Though from numbers I got the performance impact is negligible: < 1%.
> > > > So I suppose, I don't have a good enough reason to object :)
> > > >
> > > Yeah, I just don't see a way around it.  I was hoping that the compiler would
> > > have been smart enough to see that the rte_acl_classify function was small and
> > > in-linable, but apparently it won't do that.  As you note however the
> > > performance change is minor (I'm guessing within a standard deviation of your
> > > results).
> > >
> > > > Though I still think we better keep  rte_acl_classify_scalar() publically available (same as we do for rte acl_classify_sse()):
> > > > First of all keep  rte_acl_classify_scalar() is already part of our public API.
> > > > Also, as I remember, one of the customers explicitly asked for scalar version and they planned to call it directly.
> > > > Plus using rte_acl_select_classify() to always switch between implementations is not always handy:
> > >
> > > I'm not exactly opposed to this, though it seems odd to me that a user might
> > > want to call a particular version of the classifier directly.  But I certainly
> > > can't predict everything a consumer wants to do.  If we really need to keep it
> > > public then, it begs the question, is providing a generic entry point even
> > > worthwhile?  Is it just as easy to expose the scalar/sse and any future versions
> > > directly so the application can just embody the intelligence to select the best
> > > path?  That saves us having to maintain another API point.  I can go with
> > > consensus on that.
> > >
> > > > -  it is global, which means that we can't simultaneously use classify_scalar() and classify_sse() for 2 different ACL contexts.
> > > > - to properly support such switching we then will need to support something like (see app/test/test_acl.c below):
> > > >   old_alg = rte_acl_get_classify();
> > > >   rte_acl_select_classify(new_alg);
> > > >   ...
> > > >   rte_acl_select_classify(old_alg);
> > > >
> > > We could attach the classification method to the acl context, so each
> > > rte_acl_ctx can point to whatever classifier function it wants to.  That would
> > > remove the global issues you point out above.
> >
> > I thought about that approach too.
> > But there is one implication with DPDK MP model:
> > Same ACL context can be shared by different DPDK processes,
> > while acl_classify() could be loaded to the different addresses.
> > Of course we can overcome it by creating a global table of function pointers indexed by classify_alg and
> > store inside ACL ctx alg instead of actual function pointer.
> > But that means extra overhead of at least two loads per classify() call.
> >
> Hmm, how is the context shared around between processes?  Is it just shared as a
> common cow data page resulting from a fork?  If so, then we should be good
> because the DSO text will be at the same address (i.e. the pointer will still be
> valid).  If you do some sort of message passing, then, yes, thats a problem.
> 
No, it is not a parent-child relationship.
There could be a group of independently spawned processes.
One of them should be the 'primary' (starts first), the others 'secondaries'.
All hugepage memory pages mapped by the primary process are supposed to be
mapped to the same VAs by each secondary.
So all stuff that is allocated from hugepage memory is shared between all
processes in the group.
A more detailed description: http://dpdk.org/doc/intel/dpdk-prog-guide-1.7.0.pdf, section 23.
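
A minimal sketch of the table-of-pointers alternative described above, with
all names illustrative rather than DPDK code: the shared context stores only
an algorithm id, and each process rebuilds its own table of function pointers
at startup, so differing DSO load addresses don't matter.

typedef int (*classify_fn_t)(const void *ctx);

enum alg { ALG_SCALAR, ALG_SSE, ALG_MAX };

static int scalar_impl(const void *ctx) { (void)ctx; return 0; }
static int sse_impl(const void *ctx) { (void)ctx; return 1; }

/* per-process table, rebuilt by each process at init */
static const classify_fn_t fn_table[ALG_MAX] = {
	[ALG_SCALAR] = scalar_impl,
	[ALG_SSE] = sse_impl,
};

struct shared_ctx {
	enum alg alg;	/* safe to place in shared hugepage memory */
};

static int
classify(const struct shared_ctx *ctx)
{
	return fn_table[ctx->alg]((const void *)ctx);
}

int main(void)
{
	struct shared_ctx ctx = { .alg = ALG_SSE };

	return classify(&ctx);
}
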
> 
> > >  Or alternatively we can just not
> > > provide a generic entry point and let each user select a specific function.
> >
> > I wonder can we have sort of mixed approach:
> > 1. provide a generic entry point that would be set to the best (from our knowledge) available classify function.
> > 2. Let each user use a specific function if he wants too.
> >
> > i.e:
> > - keep classify_scalar/classify_sse/classify_... public.
> > - keep your current implementation of rte_acl_classify()
> > BTW in that way, we probably can make acl_select_classify() static.
> >
> Agreed, depending on your answer above, this might be the best solution.
> 
> > So most users would just use generic entry point and wouldn't need to write their own code wrappers around it.
> > For users who need to use a particular classify()  version - they can call it directly.
> >
> It does seem reasonable.  Let me know what the ctx sharing mechanism is from
> above, and we can settle this.
> 
> > >
> > >
> > > > >  REmoved macro definitions for match check functions to make them static inline
> > > >
> > > > More comments inlined below.
> > > >snip>
> > > > >
> > > > >  	/* make a quick check for scalar */
> > > > > -	ret = rte_acl_classify_scalar(acx, data, results,
> > > > > +	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
> > > > > +	ret = rte_acl_classify(acx, data, results,
> > > > >  			RTE_DIM(acl_test_data), RTE_ACL_MAX_CATEGORIES);
> > > >
> > > >
> > > > As I said above, that doesn't seem correct: we set rte_acl_default_classify = rte_acl_classify_scalar and never restore it back to
> the
> > > original value.
> > > > To support it properly, we need to:
> > > > old_alg = rte_acl_get_classify();
> > > >  rte_acl_select_classify(new_alg);
> > > >  ...
> > > >  rte_acl_select_classify(old_alg);
> > > >
> > > So, for the purposes of this test application, I don't see that as being needed.
> > > Every call to rte_acl_classify is preceded by a setting of the classifier
> > > function, so you're safe.
> >
> > Not every, that's a problem.
> > As I can see, in test/test_acl.c you replaced
> > rte_acl_classify_scalar();
> > with
> > rte_acl_select_classify(SCALAR);
> > rte_acl_classify();
> >
> > And never restore previous value of rte_acl_default_classify.
> > Right now rte_acl_default_classify is global, so after first:
> > rte_acl_select_classify(SCALAR);
> > all subsequent rte_acl_classify() will actually use scalar version.
> >
> Hmm, ok, I'll take a closer look at it.
> 
> > >  If you're concerned about other processes using the
> > > dpdk library at the same time, you're still safe, as despite being a global
> > > variable, data pages in a DSO are Copy on Write, so each process gets their own
> > > copy of the global variable.
> >
> > No, my concern here is only about  app/test here.
> >
> > >
> > > Multiple threads within the same process are problematic, I agree, and thats
> > > solvable with the per-acl-context mechanism that I described above, though that
> > > shouldn't be needed here as this seems to be a single threaded program.
> > >
> > > > Make all this just to keep UT valid seems like a big hassle to me.
> > > > So I said above - probably better just leave it to call rte_acl_classify_scalar() directly.
> > > >
> > > That works for me too, though the per-context mechanism seems kind of nice to
> > > me.  Let me know what you prefer.
> > >
> > > ><snip>
> > > > >
> > > > > diff --git a/lib/librte_acl/acl_match_check.h b/lib/librte_acl/acl_match_check.h
> > > > > new file mode 100644
> > > > > index 0000000..4dc1982
> > > > > --- /dev/null
> > > > > +++ b/lib/librte_acl/acl_match_check.h
> > > >
> > > > As a nit: we probably don't need a special header just for one function and can place it inside acl_run.h.
> > > >
> > > Agreed, I can move that to acl_run.h.
> > >
> > > ><snip>
> > > > > + */
> > > > > +static inline uint64_t
> > > > > +acl_match_check(uint64_t transition, int slot,
> > > > > +	const struct rte_acl_ctx *ctx, struct parms *parms,
> > > > > +	struct acl_flow_data *flows, void (*resolve_priority)(
> > > > > +	uint64_t transition, int n, const struct rte_acl_ctx *ctx,
> > > > > +	struct parms *parms, const struct rte_acl_match_results *p,
> > > > > +	uint32_t categories))
> > > >
> > > > Ugh, that's really hard to read.
> > > > Can we create a typedef for resolve_priority function type:
> > > > typedef void (*resolve_priority_t)(uint64_t, int,
> > > >         const struct rte_acl_ctx *ctx, struct parms *,
> > > >         const struct rte_acl_match_results *, uint32_t);
> > > > And use it here?
> > > >
> > > Sure, I'm fine with doing that.
> > >
> > > ><snip>
> > > > > +
> > > > > +/* by default, use the always available scalar code path. */
> > > > > +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
> > > >
> > > > Why not 'static'?
> > > > I thought you'd like to hide it  from external world.
> > > >
> > > Doh!  I didn't do the one thing that I really meant to do.  I removed it from
> > > the header file but I forgot to declare the variable static.  I'll fix that.
> > >
> > > > > +
> > > > > +void rte_acl_select_classify(enum acl_classify_alg alg)
> > > > > +{
> > > > > +
> > > > > +	switch(alg)
> > > > > +	{
> > > > > +		case ACL_CLASSIFY_DEFAULT:
> > > > > +		case ACL_CLASSIFY_SCALAR:
> > > > > +			rte_acl_default_classify = rte_acl_classify_scalar;
> > > > > +			break;
> > > > > +		case ACL_CLASSIFY_SSE:
> > > > > +			rte_acl_default_classify = rte_acl_classify_sse;
> > > > > +			break;
> > > > > +	}
> > > > > +
> > > > > +}
> > > >
> > > > As this is init phase function, I suppose we can add check that alg has a valid(supported) value, and return some error as return
> > > value, if not.
> > > >
> > > Not sure I follow what you're saying above, are you suggesting that we add a
> > > rte_cpu_get_flag_enabled check to rte_acl_select_classify above?
> > >
> > > ><snip>
> > > > >   *
> > > > > @@ -286,9 +289,10 @@ rte_acl_reset(struct rte_acl_ctx *ctx);
> > > > >   * @return
> > > > >   *   zero on successful completion.
> > > > >   *   -EINVAL for incorrect arguments.
> > > > > + *   -ENOTSUP for unsupported platforms.
> > > >
> > > > Please remove the line above: current implementation doesn't return ENOTSUP
> > > > (I think that was left from v1).
> > > >
> > > Yup, probably was.  I'll remove it.
> > >
> > > > >   */
> > > > >  int
> > > > > -rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > > +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > >  	uint32_t *results, uint32_t num, uint32_t categories);
> > > > >
> > > > >  /**
> > > > > @@ -323,9 +327,23 @@ rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > >   *   zero on successful completion.
> > > > >   *   -EINVAL for incorrect arguments.
> > > > >   */
> > > > > -int
> > > > > -rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > > -	uint32_t *results, uint32_t num, uint32_t categories);
> > > >
> > > >
> > > > As I said above we'd better keep it.
> > > >
> > > Ok, can do.
> > >
> > > > > +
> > > > > +enum acl_classify_alg {
> > > > > +	ACL_CLASSIFY_DEFAULT = 0,
> > > > > +	ACL_CLASSIFY_SCALAR = 1,
> > > > > +	ACL_CLASSIFY_SSE = 2,
> > > > > +};
> > > >
> > > > As a nit: as this emum is part of public API, I think it is better to add rte_ prefix: enum rte_acl_classify_alg
> > > >
> > > Sure, done.
> > >
> > > > > +
> > > > > +extern inline int rte_acl_classify(const struct rte_acl_ctx *ctx,
> > > > > +				   const uint8_t **data,
> > > > > +				   uint32_t *results, uint32_t num,
> > > > > +				   uint32_t categories);
> > > >
> > > > Again as a nit: here and everywhere can we keep same style through the whole DPDK - function name from the new line:
> > > > extern nt
> > > > rte_acl_classify(...);
> > > >
> > > Ok
> > >
> > > I'll produce another version based on your feedback regarding the
> > > per-context-calssifier method vs. just removing the generic classifier.
> > >
> > > Regards
> > > Neil
> >
> >
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCHv3] librte_acl make it build/work for 'default' target
  2014-08-27 11:25  0%       ` Ananyev, Konstantin
@ 2014-08-27 18:56  0%         ` Neil Horman
  2014-08-27 19:18  0%           ` Ananyev, Konstantin
  0 siblings, 1 reply; 86+ results
From: Neil Horman @ 2014-08-27 18:56 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev
On Wed, Aug 27, 2014 at 11:25:04AM +0000, Ananyev, Konstantin wrote:
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Tuesday, August 26, 2014 6:45 PM
> > To: Ananyev, Konstantin
> > Cc: dev@dpdk.org; thomas.monjalon@6wind.com
> > Subject: Re: [PATCHv3] librte_acl make it build/work for 'default' target
> > 
> > On Mon, Aug 25, 2014 at 04:30:05PM +0000, Ananyev, Konstantin wrote:
> > > Hi Neil,
> > >
> > > > -----Original Message-----
> > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > Sent: Thursday, August 21, 2014 9:15 PM
> > > > To: dev@dpdk.org
> > > > Cc: Ananyev, Konstantin; thomas.monjalon@6wind.com; Neil Horman
> > > > Subject: [PATCHv3] librte_acl make it build/work for 'default' target
> > > >
> > > > Make ACL library to build/work on 'default' architecture:
> > > > - make rte_acl_classify_scalar really scalar
> > > >  (make sure it wouldn't use sse4 intrinsics through resolve_priority()).
> > > > - Provide two versions of rte_acl_classify code path:
> > > >   rte_acl_classify_sse() - could be build and used only on systems with sse4.2
> > > >   and upper, return -ENOTSUP on lower arch.
> > > >   rte_acl_classify_scalar() - a slower version, but could be build and used
> > > >   on all systems.
> > > > - keep common code shared between these two codepaths.
> > > >
> > > > v2 changes:
> > > >  run-time selection of the most appropriate code path for the given ISA.
> > > >  By default the highest supported one is selected.
> > > >  User can still override that selection by manually assigning a new value to
> > > >  the global function pointer rte_acl_default_classify.
> > > >  rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
> > > >  points to.
> > > >
> > >
> > > I see you decided not to wait for me and fix everything by yourself :)
> > >
> > Yeah, sorry, I'm getting pinged about enabling these features in Fedora, and it
> > had been about 2 weeks, so I figured I'd just take care of it.
> 
> No worries. I admit that it was a long delay from my side.
> 
> > 
> > > > V3 Changes
> > > >  Updated classify pointer to be a function so as to better preserve ABI
> > >
> > > As I said in my previous mail, it generates an extra jump...
> > > Though from the numbers I got, the performance impact is negligible: < 1%.
> > > So I suppose I don't have a good enough reason to object :)
> > >
> > Yeah, I just don't see a way around it.  I was hoping that the compiler would
> > have been smart enough to see that the rte_acl_classify function was small and
> > inlinable, but apparently it won't do that.  As you note, however, the
> > performance change is minor (I'm guessing within a standard deviation of your
> > results).
> > 
> > > Though I still think we'd better keep rte_acl_classify_scalar() publicly available (same as we do for rte_acl_classify_sse()):
> > > First of all, rte_acl_classify_scalar() is already part of our public API.
> > > Also, as I remember, one of the customers explicitly asked for the scalar version and they planned to call it directly.
> > > Plus using rte_acl_select_classify() to always switch between implementations is not always handy:
> > 
> > I'm not exactly opposed to this, though it seems odd to me that a user might
> > want to call a particular version of the classifier directly.  But I certainly
> > can't predict everything a consumer wants to do.  If we really need to keep it
> > public, then it begs the question: is providing a generic entry point even
> > worthwhile?  Is it just as easy to expose the scalar/sse and any future versions
> > directly so the application can just embody the intelligence to select the best
> > path?  That saves us having to maintain another API point.  I can go with
> > consensus on that.
> > 
> > > -  it is global, which means that we can't simultaneously use classify_scalar() and classify_sse() for 2 different ACL contexts.
> > > - to properly support such switching, we will then need to support something like (see app/test/test_acl.c below):
> > >   old_alg = rte_acl_get_classify();
> > >   rte_acl_select_classify(new_alg);
> > >   ...
> > >   rte_acl_select_classify(old_alg);
> > >
> > We could attach the classification method to the acl context, so each
> > rte_acl_ctx can point to whatever classifier function it wants to.  That would
> > remove the global issues you point out above.
> 
> I thought about that approach too.
> But there is one implication with the DPDK MP model:
> The same ACL context can be shared by different DPDK processes,
> while acl_classify() could be loaded at different addresses.
> Of course we can overcome it by creating a global table of function pointers indexed by classify_alg and
> storing inside the ACL ctx the alg instead of the actual function pointer.
> But that means extra overhead of at least two loads per classify() call.
> 
Hmm, how is the context shared around between processes?  Is it just shared as a
common COW data page resulting from a fork?  If so, then we should be good
because the DSO text will be at the same address (i.e. the pointer will still be
valid).  If you do some sort of message passing, then, yes, that's a problem.
> >  Or alternatively we can just not
> > provide a generic entry point and let each user select a specific function.
> 
> I wonder, can we have a sort of mixed approach:
> 1. provide a generic entry point that would be set to the best (from our knowledge) available classify function.
> 2. Let each user use a specific function if he wants to.
> 
> i.e.:
> - keep classify_scalar/classify_sse/classify_... public.
> - keep your current implementation of rte_acl_classify()
> BTW, in that way we can probably make acl_select_classify() static.
> 
Agreed, depending on your answer above, this might be the best solution.
> So most users would just use the generic entry point and wouldn't need to write their own code wrappers around it.
> For users who need to use a particular classify() version - they can call it directly.
> 
It does seem reasonable.  Let me know what the ctx sharing mechanism is from
above, and we can settle this.
> > 
> > 
> > > >  Removed macro definitions for match check functions to make them static inline
> > >
> > > More comments inlined below.
> > >snip>
> > > >
> > > >  	/* make a quick check for scalar */
> > > > -	ret = rte_acl_classify_scalar(acx, data, results,
> > > > +	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
> > > > +	ret = rte_acl_classify(acx, data, results,
> > > >  			RTE_DIM(acl_test_data), RTE_ACL_MAX_CATEGORIES);
> > >
> > >
> > > As I said above, that doesn't seem correct: we set rte_acl_default_classify = rte_acl_classify_scalar and never restore it back to the
> > > original value.
> > > To support it properly, we need to:
> > > old_alg = rte_acl_get_classify();
> > >  rte_acl_select_classify(new_alg);
> > >  ...
> > >  rte_acl_select_classify(old_alg);
> > >
> > So, for the purposes of this test application, I don't see that as being needed.
> > Every call to rte_acl_classify is preceded by a setting of the classifier
> > function, so you're safe.
> 
> Not every call, and that's the problem.
> As I can see, in test/test_acl.c you replaced
> rte_acl_classify_scalar();
> with
> rte_acl_select_classify(SCALAR);
> rte_acl_classify();
> 
> And never restore the previous value of rte_acl_default_classify.
> Right now rte_acl_default_classify is global, so after the first:
> rte_acl_select_classify(SCALAR);
> all subsequent rte_acl_classify() calls will actually use the scalar version.
> 
Hmm, ok, I'll take a closer look at it.
> >  If you're concerned about other processes using the
> > dpdk library at the same time, you're still safe, as despite being a global
> > variable, data pages in a DSO are Copy on Write, so each process gets its own
> > copy of the global variable.
> 
> No, my concern here is only about app/test.
> 
> > 
> > Multiple threads within the same process are problematic, I agree, and that's
> > solvable with the per-acl-context mechanism that I described above, though that
> > shouldn't be needed here as this seems to be a single-threaded program.
> > 
> > > Making all this just to keep the UT valid seems like a big hassle to me.
> > > So, as I said above - probably better to just leave it calling rte_acl_classify_scalar() directly.
> > >
> > That works for me too, though the per-context mechanism seems kind of nice to
> > me.  Let me know what you prefer.
> > 
> > ><snip>
> > > >
> > > > diff --git a/lib/librte_acl/acl_match_check.h b/lib/librte_acl/acl_match_check.h
> > > > new file mode 100644
> > > > index 0000000..4dc1982
> > > > --- /dev/null
> > > > +++ b/lib/librte_acl/acl_match_check.h
> > >
> > > As a nit: we probably don't need a special header just for one function and can place it inside acl_run.h.
> > >
> > Agreed, I can move that to acl_run.h.
> > 
> > ><snip>
> > > > + */
> > > > +static inline uint64_t
> > > > +acl_match_check(uint64_t transition, int slot,
> > > > +	const struct rte_acl_ctx *ctx, struct parms *parms,
> > > > +	struct acl_flow_data *flows, void (*resolve_priority)(
> > > > +	uint64_t transition, int n, const struct rte_acl_ctx *ctx,
> > > > +	struct parms *parms, const struct rte_acl_match_results *p,
> > > > +	uint32_t categories))
> > >
> > > Ugh, that's really hard to read.
> > > Can we create a typedef for resolve_priority function type:
> > > typedef void (*resolve_priority_t)(uint64_t, int,
> > >         const struct rte_acl_ctx *ctx, struct parms *,
> > >         const struct rte_acl_match_results *, uint32_t);
> > > And use it here?
> > >
> > Sure, I'm fine with doing that.
> > 
> > ><snip>
> > > > +
> > > > +/* by default, use always avaialbe scalar code path. */
> > > > +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
> > >
> > > Why not 'static'?
> > > I thought you'd like to hide it  from external world.
> > >
> > Doh!  I didn't do the one thing that I really meant to do.  I removed it from
> > the header file but I forgot to declare the variable static.  I'll fix that.
> > 
> > > > +
> > > > +void rte_acl_select_classify(enum acl_classify_alg alg)
> > > > +{
> > > > +
> > > > +	switch(alg)
> > > > +	{
> > > > +		case ACL_CLASSIFY_DEFAULT:
> > > > +		case ACL_CLASSIFY_SCALAR:
> > > > +			rte_acl_default_classify = rte_acl_classify_scalar;
> > > > +			break;
> > > > +		case ACL_CLASSIFY_SSE:
> > > > +			rte_acl_default_classify = rte_acl_classify_sse;
> > > > +			break;
> > > > +	}
> > > > +
> > > > +}
> > >
> > > As this is an init-phase function, I suppose we can add a check that alg has a valid (supported) value, and return some error as the return
> > > value, if not.
> > >
> > Not sure I follow what you're saying above, are you suggesting that we add a
> > rte_cpu_get_flag_enabled check to rte_acl_select_classify above?
> > 
> > ><snip>
> > > >   *
> > > > @@ -286,9 +289,10 @@ rte_acl_reset(struct rte_acl_ctx *ctx);
> > > >   * @return
> > > >   *   zero on successful completion.
> > > >   *   -EINVAL for incorrect arguments.
> > > > + *   -ENOTSUP for unsupported platforms.
> > >
> > > Please remove the line above: current implementation doesn't return ENOTSUP
> > > (I think that was left from v1).
> > >
> > Yup, probably was.  I'll remove it.
> > 
> > > >   */
> > > >  int
> > > > -rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > >  	uint32_t *results, uint32_t num, uint32_t categories);
> > > >
> > > >  /**
> > > > @@ -323,9 +327,23 @@ rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > >   *   zero on successful completion.
> > > >   *   -EINVAL for incorrect arguments.
> > > >   */
> > > > -int
> > > > -rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > -	uint32_t *results, uint32_t num, uint32_t categories);
> > >
> > >
> > > As I said above we'd better keep it.
> > >
> > Ok, can do.
> > 
> > > > +
> > > > +enum acl_classify_alg {
> > > > +	ACL_CLASSIFY_DEFAULT = 0,
> > > > +	ACL_CLASSIFY_SCALAR = 1,
> > > > +	ACL_CLASSIFY_SSE = 2,
> > > > +};
> > >
> > > As a nit: as this enum is part of the public API, I think it is better to add an rte_ prefix: enum rte_acl_classify_alg
> > >
> > Sure, done.
> > 
> > > > +
> > > > +extern inline int rte_acl_classify(const struct rte_acl_ctx *ctx,
> > > > +				   const uint8_t **data,
> > > > +				   uint32_t *results, uint32_t num,
> > > > +				   uint32_t categories);
> > >
> > > Again as a nit: here and everywhere can we keep the same style through the whole DPDK - function name on a new line:
> > > extern int
> > > rte_acl_classify(...);
> > >
> > Ok
> > 
> > I'll produce another version based on your feedback regarding the
> > per-context-classifier method vs. just removing the generic classifier.
> > 
> > Regards
> > Neil
> 
> 
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCHv3] librte_acl make it build/work for 'default' target
  2014-08-26 17:44  0%     ` Neil Horman
@ 2014-08-27 11:25  0%       ` Ananyev, Konstantin
  2014-08-27 18:56  0%         ` Neil Horman
  0 siblings, 1 reply; 86+ results
From: Ananyev, Konstantin @ 2014-08-27 11:25 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Tuesday, August 26, 2014 6:45 PM
> To: Ananyev, Konstantin
> Cc: dev@dpdk.org; thomas.monjalon@6wind.com
> Subject: Re: [PATCHv3] librte_acl make it build/work for 'default' target
> 
> On Mon, Aug 25, 2014 at 04:30:05PM +0000, Ananyev, Konstantin wrote:
> > Hi Neil,
> >
> > > -----Original Message-----
> > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > Sent: Thursday, August 21, 2014 9:15 PM
> > > To: dev@dpdk.org
> > > Cc: Ananyev, Konstantin; thomas.monjalon@6wind.com; Neil Horman
> > > Subject: [PATCHv3] librte_acl make it build/work for 'default' target
> > >
> > > Make the ACL library build/work on the 'default' architecture:
> > > - make rte_acl_classify_scalar really scalar
> > >  (make sure it wouldn't use sse4 intrinsics through resolve_priority()).
> > > - Provide two versions of the rte_acl_classify code path:
> > >   rte_acl_classify_sse() - can be built and used only on systems with sse4.2
> > >   and above; returns -ENOTSUP on lower archs.
> > >   rte_acl_classify_scalar() - a slower version, but can be built and used
> > >   on all systems.
> > > - keep common code shared between these two codepaths.
> > >
> > > v2 changes:
> > >  run-time selection of the most appropriate code path for the given ISA.
> > >  By default the highest supported one is selected.
> > >  User can still override that selection by manually assigning a new value to
> > >  the global function pointer rte_acl_default_classify.
> > >  rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
> > >  points to.
> > >
> >
> > I see you decided not to wait for me and fix everything by yourself :)
> >
> Yeah, sorry, I'm getting pinged about enabling these features in Fedora, and it
> had been about 2 weeks, so I figured I'd just take care of it.
No worries. I admit that it was a long delay from my side.
> 
> > > V3 Changes
> > >  Updated classify pointer to be a function so as to better preserve ABI
> >
> > As I said in my previous mail, it generates an extra jump...
> > Though from the numbers I got, the performance impact is negligible: < 1%.
> > So I suppose I don't have a good enough reason to object :)
> >
> Yeah, I just don't see a way around it.  I was hoping that the compiler would
> have been smart enough to see that the rte_acl_classify function was small and
> inlinable, but apparently it won't do that.  As you note, however, the
> performance change is minor (I'm guessing within a standard deviation of your
> results).
> 
> > Though I still think we'd better keep rte_acl_classify_scalar() publicly available (same as we do for rte_acl_classify_sse()):
> > First of all, rte_acl_classify_scalar() is already part of our public API.
> > Also, as I remember, one of the customers explicitly asked for the scalar version and they planned to call it directly.
> > Plus using rte_acl_select_classify() to always switch between implementations is not always handy:
> 
> I'm not exactly opposed to this, though it seems odd to me that a user might
> want to call a particular version of the classifier directly.  But I certainly
> can't predict everything a consumer wants to do.  If we really need to keep it
> public, then it begs the question: is providing a generic entry point even
> worthwhile?  Is it just as easy to expose the scalar/sse and any future versions
> directly so the application can just embody the intelligence to select the best
> path?  That saves us having to maintain another API point.  I can go with
> consensus on that.
> 
> > -  it is global, which means that we can't simultaneously use classify_scalar() and classify_sse() for 2 different ACL contexts.
> > - to properly support such switching, we will then need to support something like (see app/test/test_acl.c below):
> >   old_alg = rte_acl_get_classify();
> >   rte_acl_select_classify(new_alg);
> >   ...
> >   rte_acl_select_classify(old_alg);
> >
> We could attach the classification method to the acl context, so each
> rte_acl_ctx can point to whatever classifier function it wants to.  That would
> remove the global issues you point out above.
I thought about that approach too.
But there is one implication with the DPDK MP model:
The same ACL context can be shared by different DPDK processes,
while acl_classify() could be loaded at different addresses.
Of course we can overcome it by creating a global table of function pointers indexed by classify_alg and
storing inside the ACL ctx the alg instead of the actual function pointer.
But that means extra overhead of at least two loads per classify() call.
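As a rough sketch of that table-based variant (names are illustrative,
not from the patch - the ctx would gain an 'alg' field):

/* one dispatch table per process image, so entries always resolve
 * to addresses valid in the calling process */
static const rte_acl_classify_t classify_fns[] = {
	[ACL_CLASSIFY_DEFAULT] = rte_acl_classify_scalar,
	[ACL_CLASSIFY_SCALAR]  = rte_acl_classify_scalar,
	[ACL_CLASSIFY_SSE]     = rte_acl_classify_sse,
};

/* the shared ctx stores only the enum, so it stays valid in every
 * process; the two loads are ctx->alg and the table entry */
int
rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories)
{
	return classify_fns[ctx->alg](ctx, data, results, num, categories);
}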
>  Or alternatively we can just not
> provide a generic entry point and let each user select a specific function.
I wonder, can we have a sort of mixed approach:
1. provide a generic entry point that would be set to the best (from our knowledge) available classify function.
2. Let each user use a specific function if he wants to.
i.e.:
- keep classify_scalar/classify_sse/classify_... public.
- keep your current implementation of rte_acl_classify()
BTW, in that way we can probably make acl_select_classify() static.
So most users would just use the generic entry point and wouldn't need to write their own code wrappers around it.
For users who need to use a particular classify() version - they can call it directly.
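A sketch of how that mixed approach could look, assuming a
constructor-style init and the rte_cpu_get_flag_enabled() check
discussed elsewhere in this thread (hypothetical code, not the patch):

static void __attribute__((constructor))
rte_acl_lib_init(void)
{
	/* pick the best ISA available at startup */
	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_2))
		rte_acl_default_classify = rte_acl_classify_sse;
	else
		rte_acl_default_classify = rte_acl_classify_scalar;
}

/* most callers stay on the generic entry point: */
ret = rte_acl_classify(acx, data, results, n, categories);

/* callers with specific needs pick an implementation directly: */
ret = rte_acl_classify_scalar(acx, data, results, n, categories);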
> 
> 
> > >  Removed macro definitions for match check functions to make them static inline
> >
> > More comments inlined below.
> >snip>
> > >
> > >  	/* make a quick check for scalar */
> > > -	ret = rte_acl_classify_scalar(acx, data, results,
> > > +	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
> > > +	ret = rte_acl_classify(acx, data, results,
> > >  			RTE_DIM(acl_test_data), RTE_ACL_MAX_CATEGORIES);
> >
> >
> > As I said above, that doesn't seem correct: we set rte_acl_default_classify = rte_acl_classify_scalar and never restore it back to the
> > original value.
> > To support it properly, we need to:
> > old_alg = rte_acl_get_classify();
> >  rte_acl_select_classify(new_alg);
> >  ...
> >  rte_acl_select_classify(old_alg);
> >
> So, for the purposes of this test application, I don't see that as being needed.
> Every call to rte_acl_classify is preceded by a setting of the classifier
> function, so you're safe.
Not every call, and that's the problem.
As I can see, in test/test_acl.c you replaced
rte_acl_classify_scalar();
with
rte_acl_select_classify(SCALAR);
rte_acl_classify();
And never restore the previous value of rte_acl_default_classify.
Right now rte_acl_default_classify is global, so after the first:
rte_acl_select_classify(SCALAR);
all subsequent rte_acl_classify() calls will actually use the scalar version.
>  If you're concerned about other processes using the
> dpdk library at the same time, you're still safe, as despite being a global
> variable, data pages in a DSO are Copy on Write, so each process gets its own
> copy of the global variable.
No, my concern here is only about app/test.
> 
> Multiple threads within the same process are problematic, I agree, and that's
> solvable with the per-acl-context mechanism that I described above, though that
> shouldn't be needed here as this seems to be a single-threaded program.
> 
> > Making all this just to keep the UT valid seems like a big hassle to me.
> > So, as I said above - probably better to just leave it calling rte_acl_classify_scalar() directly.
> >
> That works for me too, though the per-context mechanism seems kind of nice to
> me.  Let me know what you prefer.
> 
> ><snip>
> > >
> > > diff --git a/lib/librte_acl/acl_match_check.h b/lib/librte_acl/acl_match_check.h
> > > new file mode 100644
> > > index 0000000..4dc1982
> > > --- /dev/null
> > > +++ b/lib/librte_acl/acl_match_check.h
> >
> > As a nit: we probably don't need a special header just for one function and can place it inside acl_run.h.
> >
> Agreed, I can move that to acl_run.h.
> 
> ><snip>
> > > + */
> > > +static inline uint64_t
> > > +acl_match_check(uint64_t transition, int slot,
> > > +	const struct rte_acl_ctx *ctx, struct parms *parms,
> > > +	struct acl_flow_data *flows, void (*resolve_priority)(
> > > +	uint64_t transition, int n, const struct rte_acl_ctx *ctx,
> > > +	struct parms *parms, const struct rte_acl_match_results *p,
> > > +	uint32_t categories))
> >
> > Ugh, that's really hard to read.
> > Can we create a typedef for resolve_priority function type:
> > typedef void (*resolve_priority_t)(uint64_t, int,
> >         const struct rte_acl_ctx *ctx, struct parms *,
> >         const struct rte_acl_match_results *, uint32_t);
> > And use it here?
> >
> Sure, I'm fine with doing that.
> 
> ><snip>
> > > +
> > > +/* by default, use always avaialbe scalar code path. */
> > > +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
> >
> > Why not 'static'?
> > I thought you'd like to hide it  from external world.
> >
> Doh!  I didn't do the one thing that I really meant to do.  I removed it from
> the header file but I forgot to declare the variable static.  I'll fix that.
> 
> > > +
> > > +void rte_acl_select_classify(enum acl_classify_alg alg)
> > > +{
> > > +
> > > +	switch(alg)
> > > +	{
> > > +		case ACL_CLASSIFY_DEFAULT:
> > > +		case ACL_CLASSIFY_SCALAR:
> > > +			rte_acl_default_classify = rte_acl_classify_scalar;
> > > +			break;
> > > +		case ACL_CLASSIFY_SSE:
> > > +			rte_acl_default_classify = rte_acl_classify_sse;
> > > +			break;
> > > +	}
> > > +
> > > +}
> >
> > As this is an init-phase function, I suppose we can add a check that alg has a valid (supported) value, and return some error as the return
> > value, if not.
> >
> Not sure I follow what you're saying above, are you suggesting that we add a
> rte_cpu_get_flag_enabled check to rte_acl_select_classify above?
> 
> ><snip>
> > >   *
> > > @@ -286,9 +289,10 @@ rte_acl_reset(struct rte_acl_ctx *ctx);
> > >   * @return
> > >   *   zero on successful completion.
> > >   *   -EINVAL for incorrect arguments.
> > > + *   -ENOTSUP for unsupported platforms.
> >
> > Please remove the line above: current implementation doesn't return ENOTSUP
> > (I think that was left from v1).
> >
> Yup, probably was.  I'll remove it.
> 
> > >   */
> > >  int
> > > -rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > >  	uint32_t *results, uint32_t num, uint32_t categories);
> > >
> > >  /**
> > > @@ -323,9 +327,23 @@ rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > >   *   zero on successful completion.
> > >   *   -EINVAL for incorrect arguments.
> > >   */
> > > -int
> > > -rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > -	uint32_t *results, uint32_t num, uint32_t categories);
> >
> >
> > As I said above we'd better keep it.
> >
> Ok, can do.
> 
> > > +
> > > +enum acl_classify_alg {
> > > +	ACL_CLASSIFY_DEFAULT = 0,
> > > +	ACL_CLASSIFY_SCALAR = 1,
> > > +	ACL_CLASSIFY_SSE = 2,
> > > +};
> >
> > As a nit: as this enum is part of the public API, I think it is better to add an rte_ prefix: enum rte_acl_classify_alg
> >
> Sure, done.
> 
> > > +
> > > +extern inline int rte_acl_classify(const struct rte_acl_ctx *ctx,
> > > +				   const uint8_t **data,
> > > +				   uint32_t *results, uint32_t num,
> > > +				   uint32_t categories);
> >
> > Again as a nit: here and everywhere can we keep the same style through the whole DPDK - function name on a new line:
> > extern int
> > rte_acl_classify(...);
> >
> Ok
> 
> I'll produce another version based on your feedback regarding the
> per-context-classifier method vs. just removing the generic classifier.
> 
> Regards
> Neil
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCHv3] librte_acl make it build/work for 'default' target
  2014-08-25 16:30  0%   ` Ananyev, Konstantin
@ 2014-08-26 17:44  0%     ` Neil Horman
  2014-08-27 11:25  0%       ` Ananyev, Konstantin
  0 siblings, 1 reply; 86+ results
From: Neil Horman @ 2014-08-26 17:44 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev
On Mon, Aug 25, 2014 at 04:30:05PM +0000, Ananyev, Konstantin wrote:
> Hi Neil,
> 
> > -----Original Message-----
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Thursday, August 21, 2014 9:15 PM
> > To: dev@dpdk.org
> > Cc: Ananyev, Konstantin; thomas.monjalon@6wind.com; Neil Horman
> > Subject: [PATCHv3] librte_acl make it build/work for 'default' target
> > 
> > Make the ACL library build/work on the 'default' architecture:
> > - make rte_acl_classify_scalar really scalar
> >  (make sure it wouldn't use sse4 intrinsics through resolve_priority()).
> > - Provide two versions of the rte_acl_classify code path:
> >   rte_acl_classify_sse() - can be built and used only on systems with sse4.2
> >   and above; returns -ENOTSUP on lower archs.
> >   rte_acl_classify_scalar() - a slower version, but can be built and used
> >   on all systems.
> > - keep common code shared between these two codepaths.
> > 
> > v2 changes:
> >  run-time selection of the most appropriate code path for the given ISA.
> >  By default the highest supported one is selected.
> >  User can still override that selection by manually assigning a new value to
> >  the global function pointer rte_acl_default_classify.
> >  rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
> >  points to.
> > 
> 
> I see you decided not to wait for me and fix everything by yourself :)
> 
Yeah, sorry, I'm getting pinged about enabling these features in Fedora, and it
had been about 2 weeks, so I figured I'd just take care of it.
> > V3 Changes
> >  Updated classify pointer to be a function so as to better preserve ABI
> 
> As I said in my previous mail, it generates an extra jump...
> Though from the numbers I got, the performance impact is negligible: < 1%.
> So I suppose I don't have a good enough reason to object :)
> 
Yeah, I just don't see a way around it.  I was hoping that the compiler would
have been smart enough to see that the rte_acl_classify function was small and
inlinable, but apparently it won't do that.  As you note, however, the
performance change is minor (I'm guessing within a standard deviation of your
results).
> Though I still think we'd better keep rte_acl_classify_scalar() publicly available (same as we do for rte_acl_classify_sse()):
> First of all, rte_acl_classify_scalar() is already part of our public API.
> Also, as I remember, one of the customers explicitly asked for the scalar version and they planned to call it directly.
> Plus using rte_acl_select_classify() to always switch between implementations is not always handy:
I'm not exactly opposed to this, though it seems odd to me that a user might
want to call a particular version of the classifier directly.  But I certainly
can't predict everything a consumer wants to do.  If we really need to keep it
public, then it begs the question: is providing a generic entry point even
worthwhile?  Is it just as easy to expose the scalar/sse and any future versions
directly so the application can just embody the intelligence to select the best
path?  That saves us having to maintain another API point.  I can go with
consensus on that.
> -  it is global, which means that we can't simultaneously use classify_scalar() and classify_sse() for 2 different ACL contexts.  
> - to properly support such switching, we will then need to support something like (see app/test/test_acl.c below):
>   old_alg = rte_acl_get_classify();
>   rte_acl_select_classify(new_alg);
>   ...
>   rte_acl_select_classify(old_alg); 
>   
We could attach the classification method to the acl context, so each
rte_acl_ctx can point to whatever classifier function it wants to.  That would
remove the global issues you point out above.  Or alternatively we can just not
provide a generic entry point and let each user select a specific function.
> >  Removed macro definitions for match check functions to make them static inline
> 
> More comments inlined below.
>snip> 
> > 
> >  	/* make a quick check for scalar */
> > -	ret = rte_acl_classify_scalar(acx, data, results,
> > +	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
> > +	ret = rte_acl_classify(acx, data, results,
> >  			RTE_DIM(acl_test_data), RTE_ACL_MAX_CATEGORIES);
> 
> 
> As I said above, that doesn't seem correct: we set rte_acl_default_classify = rte_acl_classify_scalar and never restore it back to the original value.
> To support it properly, we need to:
> old_alg = rte_acl_get_classify();
>  rte_acl_select_classify(new_alg);
>  ...
>  rte_acl_select_classify(old_alg);
> 
So, for the purposes of this test application, I don't see that as being needed.
Every call to rte_acl_classify is preceded by a setting of the classifier
function, so you're safe.  If you're concerned about other processes using the
dpdk library at the same time, you're still safe, as despite being a global
variable, data pages in a DSO are Copy on Write, so each process gets its own
copy of the global variable.
Multiple threads within the same process are problematic, I agree, and that's
solvable with the per-acl-context mechanism that I described above, though that
shouldn't be needed here as this seems to be a single-threaded program.
> Making all this just to keep the UT valid seems like a big hassle to me.
> So, as I said above - probably better to just leave it calling rte_acl_classify_scalar() directly.
> 
That works for me too, though the per-context mechanism seems kind of nice to
me.  Let me know what you prefer.
><snip>
> > 
> > diff --git a/lib/librte_acl/acl_match_check.h b/lib/librte_acl/acl_match_check.h
> > new file mode 100644
> > index 0000000..4dc1982
> > --- /dev/null
> > +++ b/lib/librte_acl/acl_match_check.h
> 
> As a nit: we probably don't need a special header just for one function and can place it inside acl_run.h.
> 
Agreed, I can move that to acl_run.h.
><snip>
> > + */
> > +static inline uint64_t
> > +acl_match_check(uint64_t transition, int slot,
> > +	const struct rte_acl_ctx *ctx, struct parms *parms,
> > +	struct acl_flow_data *flows, void (*resolve_priority)(
> > +	uint64_t transition, int n, const struct rte_acl_ctx *ctx,
> > +	struct parms *parms, const struct rte_acl_match_results *p,
> > +	uint32_t categories))
> 
> Ugh, that's really hard to read.
> Can we create a typedef for resolve_priority function type:
> typedef void (*resolve_priority_t)(uint64_t, int,
>         const struct rte_acl_ctx *ctx, struct parms *,
>         const struct rte_acl_match_results *, uint32_t);
> And use it here?
> 
Sure, I'm fine with doing that.
><snip>
> > +
> > +/* by default, use always avaialbe scalar code path. */
> > +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
> 
> Why not 'static'?
> I thought you'd like to hide it  from external world.
> 
Doh!  I didn't do the one thing that I really meant to do.  I removed it from
the header file but I forgot to declare the variable static.  I'll fix that.
> > +
> > +void rte_acl_select_classify(enum acl_classify_alg alg)
> > +{
> > +
> > +	switch(alg)
> > +	{
> > +		case ACL_CLASSIFY_DEFAULT:
> > +		case ACL_CLASSIFY_SCALAR:
> > +			rte_acl_default_classify = rte_acl_classify_scalar;
> > +			break;
> > +		case ACL_CLASSIFY_SSE:
> > +			rte_acl_default_classify = rte_acl_classify_sse;
> > +			break;
> > +	}
> > +
> > +}
> 
> As this is an init-phase function, I suppose we can add a check that alg has a valid (supported) value, and return some error as the return value, if not.
> 
Not sure I follow what you're saying above, are you suggesting that we add a
rte_cpu_get_flag_enabled check to rte_acl_select_classify above?
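For reference, one way that suggestion could be read - a sketch only,
with the error values and the cpuflag test being assumptions rather
than code from the patch:

int
rte_acl_select_classify(enum acl_classify_alg alg)
{
	switch (alg) {
	case ACL_CLASSIFY_DEFAULT:
	case ACL_CLASSIFY_SCALAR:
		rte_acl_default_classify = rte_acl_classify_scalar;
		return 0;
	case ACL_CLASSIFY_SSE:
		/* refuse the SSE path on CPUs that can't run it */
		if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_2))
			return -ENOTSUP;
		rte_acl_default_classify = rte_acl_classify_sse;
		return 0;
	default:
		return -EINVAL;
	}
}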
><snip>
> >   *
> > @@ -286,9 +289,10 @@ rte_acl_reset(struct rte_acl_ctx *ctx);
> >   * @return
> >   *   zero on successful completion.
> >   *   -EINVAL for incorrect arguments.
> > + *   -ENOTSUP for unsupported platforms.
> 
> Please remove the line above: current implementation doesn't return ENOTSUP
> (I think that was left from v1).
> 
Yup, probably was.  I'll remove it.
> >   */
> >  int
> > -rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> >  	uint32_t *results, uint32_t num, uint32_t categories);
> > 
> >  /**
> > @@ -323,9 +327,23 @@ rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
> >   *   zero on successful completion.
> >   *   -EINVAL for incorrect arguments.
> >   */
> > -int
> > -rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > -	uint32_t *results, uint32_t num, uint32_t categories);
> 
> 
> As I said above we'd better keep it.  
> 
Ok, can do.
> > +
> > +enum acl_classify_alg {
> > +	ACL_CLASSIFY_DEFAULT = 0,
> > +	ACL_CLASSIFY_SCALAR = 1,
> > +	ACL_CLASSIFY_SSE = 2,
> > +};
> 
> As a nit: as this enum is part of the public API, I think it is better to add an rte_ prefix: enum rte_acl_classify_alg
> 
Sure, done.
> > +
> > +extern inline int rte_acl_classify(const struct rte_acl_ctx *ctx,
> > +				   const uint8_t **data,
> > +				   uint32_t *results, uint32_t num,
> > +				   uint32_t categories);
> 
> Again as a nit: here and everywhere can we keep the same style through the whole DPDK - function name on a new line:
> extern int
> rte_acl_classify(...);
> 
Ok
I'll produce another version based on your feedback regarding the
per-context-classifier method vs. just removing the generic classifier.
Regards
Neil
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCHv3] librte_acl make it build/work for 'default' target
  2014-08-21 20:15  1% ` [dpdk-dev] [PATCHv3] " Neil Horman
@ 2014-08-25 16:30  0%   ` Ananyev, Konstantin
  2014-08-26 17:44  0%     ` Neil Horman
  0 siblings, 1 reply; 86+ results
From: Ananyev, Konstantin @ 2014-08-25 16:30 UTC (permalink / raw)
  To: Neil Horman, dev
Hi Neil,
> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Thursday, August 21, 2014 9:15 PM
> To: dev@dpdk.org
> Cc: Ananyev, Konstantin; thomas.monjalon@6wind.com; Neil Horman
> Subject: [PATCHv3] librte_acl make it build/work for 'default' target
> 
> Make the ACL library build/work on the 'default' architecture:
> - make rte_acl_classify_scalar really scalar
>  (make sure it wouldn't use sse4 intrinsics through resolve_priority()).
> - Provide two versions of the rte_acl_classify code path:
>   rte_acl_classify_sse() - can be built and used only on systems with sse4.2
>   and above; returns -ENOTSUP on lower archs.
>   rte_acl_classify_scalar() - a slower version, but can be built and used
>   on all systems.
> - keep common code shared between these two codepaths.
> 
> v2 changes:
>  run-time selection of the most appropriate code path for the given ISA.
>  By default the highest supported one is selected.
>  User can still override that selection by manually assigning a new value to
>  the global function pointer rte_acl_default_classify.
>  rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
>  points to.
> 
I see you decided not to wait for me and fix everything by yourself :)
> V3 Changes
>  Updated classify pointer to be a function so as to better preserve ABI
As I said in my previous mail, it generates an extra jump...
Though from the numbers I got, the performance impact is negligible: < 1%.
So I suppose I don't have a good enough reason to object :)
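For context, the trade-off in rough code (illustrative only, not the
actual patch). The v2 macro dispatched straight through the pointer:

#define rte_acl_classify(ctx, data, results, num, cat) \
	rte_acl_default_classify(ctx, data, results, num, cat)

while v3 wraps the pointer in a real exported function, which costs the
extra jump but keeps rte_acl_classify itself as a stable ABI symbol:

int
rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories)
{
	return rte_acl_default_classify(ctx, data, results, num, categories);
}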
Though I still think we'd better keep rte_acl_classify_scalar() publicly available (same as we do for rte_acl_classify_sse()):
First of all, rte_acl_classify_scalar() is already part of our public API.
Also, as I remember, one of the customers explicitly asked for the scalar version and they planned to call it directly.
Plus using rte_acl_select_classify() to always switch between implementations is not always handy:
-  it is global, which means that we can't simultaneously use classify_scalar() and classify_sse() for 2 different ACL contexts.  
- to properly support such switching, we will then need to support something like (see app/test/test_acl.c below):
  old_alg = rte_acl_get_classify();
  rte_acl_select_classify(new_alg);
  ...
  rte_acl_select_classify(old_alg); 
  
>  Removed macro definitions for match check functions to make them static inline
More comments inlined below.
Thanks
Konstantin
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> ---
>  app/test-acl/main.c              |  13 +-
>  app/test/test_acl.c              |  12 +-
>  lib/librte_acl/Makefile          |   5 +-
>  lib/librte_acl/acl_bld.c         |   5 +-
>  lib/librte_acl/acl_match_check.h |  83 ++++
>  lib/librte_acl/acl_run.c         | 944 ---------------------------------------
>  lib/librte_acl/acl_run.h         | 220 +++++++++
>  lib/librte_acl/acl_run_scalar.c  | 198 ++++++++
>  lib/librte_acl/acl_run_sse.c     | 627 ++++++++++++++++++++++++++
>  lib/librte_acl/rte_acl.c         |  46 ++
>  lib/librte_acl/rte_acl.h         |  26 +-
>  11 files changed, 1216 insertions(+), 963 deletions(-)
>  create mode 100644 lib/librte_acl/acl_match_check.h
>  delete mode 100644 lib/librte_acl/acl_run.c
>  create mode 100644 lib/librte_acl/acl_run.h
>  create mode 100644 lib/librte_acl/acl_run_scalar.c
>  create mode 100644 lib/librte_acl/acl_run_sse.c
> 
> diff --git a/app/test-acl/main.c b/app/test-acl/main.c
> index d654409..a77f47d 100644
> --- a/app/test-acl/main.c
> +++ b/app/test-acl/main.c
> @@ -787,6 +787,10 @@ acx_init(void)
>  	/* perform build. */
>  	ret = rte_acl_build(config.acx, &cfg);
> 
> +	/* setup default rte_acl_classify */
> +	if (config.scalar)
> +		rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
> +
>  	dump_verbose(DUMP_NONE, stdout,
>  		"rte_acl_build(%u) finished with %d\n",
>  		config.bld_categories, ret);
> @@ -815,13 +819,8 @@ search_ip5tuples_once(uint32_t categories, uint32_t step, int scalar)
>  			v += config.trace_sz;
>  		}
> 
> -		if (scalar != 0)
> -			ret = rte_acl_classify_scalar(config.acx, data,
> -				results, n, categories);
> -
> -		else
> -			ret = rte_acl_classify(config.acx, data,
> -				results, n, categories);
> +		ret = rte_acl_classify(config.acx, data, results,
> +			n, categories);
> 
>  		if (ret != 0)
>  			rte_exit(ret, "classify for ipv%c_5tuples returns %d\n",
> diff --git a/app/test/test_acl.c b/app/test/test_acl.c
> index 869f6d3..2fcef6e 100644
> --- a/app/test/test_acl.c
> +++ b/app/test/test_acl.c
> @@ -148,7 +148,8 @@ test_classify_run(struct rte_acl_ctx *acx)
>  	}
> 
>  	/* make a quick check for scalar */
> -	ret = rte_acl_classify_scalar(acx, data, results,
> +	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
> +	ret = rte_acl_classify(acx, data, results,
>  			RTE_DIM(acl_test_data), RTE_ACL_MAX_CATEGORIES);
As I said above, that doesn't seem correct: we set rte_acl_default_classify = rte_acl_classify_scalar and never restore it back to the original value.
To support it properly, we need to:
old_alg = rte_acl_get_classify();
 rte_acl_select_classify(new_alg);
 ...
 rte_acl_select_classify(old_alg);
Making all this just to keep the UT valid seems like a big hassle to me.
So, as I said above - probably better to just leave it calling rte_acl_classify_scalar() directly.
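Spelled out against the UT call below (rte_acl_get_classify() being a
hypothetical getter that doesn't exist in the patch):

old_alg = rte_acl_get_classify();
rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
ret = rte_acl_classify(acx, data, results,
		RTE_DIM(acl_test_data), RTE_ACL_MAX_CATEGORIES);
rte_acl_select_classify(old_alg);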
>  	if (ret != 0) {
>  		printf("Line %i: SSE classify failed!\n", __LINE__);
> @@ -362,7 +363,8 @@ test_invalid_layout(void)
>  	}
> 
>  	/* classify tuples (scalar) */
> -	ret = rte_acl_classify_scalar(acx, data, results,
> +	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
> +	ret = rte_acl_classify(acx, data, results,
>  			RTE_DIM(results), 1);
>  	if (ret != 0) {
>  		printf("Line %i: Scalar classify failed!\n", __LINE__);
> @@ -850,7 +852,8 @@ test_invalid_parameters(void)
>  	/* scalar classify test */
> 
>  	/* cover zero categories in classify (should not fail) */
> -	result = rte_acl_classify_scalar(acx, NULL, NULL, 0, 0);
> +	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
> +	result = rte_acl_classify(acx, NULL, NULL, 0, 0);
>  	if (result != 0) {
>  		printf("Line %i: Scalar classify with zero categories "
>  				"failed!\n", __LINE__);
> @@ -859,7 +862,8 @@ test_invalid_parameters(void)
>  	}
> 
>  	/* cover invalid but positive categories in classify */
> -	result = rte_acl_classify_scalar(acx, NULL, NULL, 0, 3);
> +	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
> +	result = rte_acl_classify(acx, NULL, NULL, 0, 3);
>  	if (result == 0) {
>  		printf("Line %i: Scalar classify with 3 categories "
>  				"should have failed!\n", __LINE__);
> diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
> index 4fe4593..65e566d 100644
> --- a/lib/librte_acl/Makefile
> +++ b/lib/librte_acl/Makefile
> @@ -43,7 +43,10 @@ SRCS-$(CONFIG_RTE_LIBRTE_ACL) += tb_mem.c
>  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += rte_acl.c
>  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_bld.c
>  SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_gen.c
> -SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run.c
> +SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_scalar.c
> +SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_sse.c
> +
> +CFLAGS_acl_run_sse.o += -msse4.1
> 
>  # install this header file
>  SYMLINK-$(CONFIG_RTE_LIBRTE_ACL)-include := rte_acl_osdep.h
> diff --git a/lib/librte_acl/acl_bld.c b/lib/librte_acl/acl_bld.c
> index 873447b..09d58ea 100644
> --- a/lib/librte_acl/acl_bld.c
> +++ b/lib/librte_acl/acl_bld.c
> @@ -31,7 +31,6 @@
>   *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
>   */
> 
> -#include <nmmintrin.h>
>  #include <rte_acl.h>
>  #include "tb_mem.h"
>  #include "acl.h"
> @@ -1480,8 +1479,8 @@ acl_calc_wildness(struct rte_acl_build_rule *head,
> 
>  			switch (rule->config->defs[n].type) {
>  			case RTE_ACL_FIELD_TYPE_BITMASK:
> -				wild = (size -
> -					_mm_popcnt_u32(fld->mask_range.u8)) /
> +				wild = (size - __builtin_popcount(
> +					fld->mask_range.u8)) /
>  					size;
>  				break;
> 
> diff --git a/lib/librte_acl/acl_match_check.h b/lib/librte_acl/acl_match_check.h
> new file mode 100644
> index 0000000..4dc1982
> --- /dev/null
> +++ b/lib/librte_acl/acl_match_check.h
As a nit: we probably don't need a special header just for one function and can place it inside acl_run.h.
> @@ -0,0 +1,83 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef _ACL_MATCH_CHECK_H_
> +#define _ACL_MATCH_CHECK_H_
> +
> +/*
> + * Detect matches. If a match node transition is found, then this trie
> + * traversal is complete and fill the slot with the next trie
> + * to be processed.
> + */
> +static inline uint64_t
> +acl_match_check(uint64_t transition, int slot,
> +	const struct rte_acl_ctx *ctx, struct parms *parms,
> +	struct acl_flow_data *flows, void (*resolve_priority)(
> +	uint64_t transition, int n, const struct rte_acl_ctx *ctx,
> +	struct parms *parms, const struct rte_acl_match_results *p,
> +	uint32_t categories))
Ugh, that's really hard to read.
Can we create a typedef for resolve_priority function type:
typedef void (*resolve_priority_t)(uint64_t, int,
        const struct rte_acl_ctx *ctx, struct parms *,
        const struct rte_acl_match_results *, uint32_t);
And use it here?
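With that typedef in place, the signature above would collapse to
something like:

static inline uint64_t
acl_match_check(uint64_t transition, int slot,
	const struct rte_acl_ctx *ctx, struct parms *parms,
	struct acl_flow_data *flows, resolve_priority_t resolve_priority);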
> +{
> +	const struct rte_acl_match_results *p;
> +
> +	p = (const struct rte_acl_match_results *)
> +		(flows->trans + ctx->match_index);
> +
> +	if (transition & RTE_ACL_NODE_MATCH) {
> +
> +		/* Remove flags from index and decrement active traversals */
> +		transition &= RTE_ACL_NODE_INDEX;
> +		flows->started--;
> +
> +		/* Resolve priorities for this trie and running results */
> +		if (flows->categories == 1)
> +			resolve_single_priority(transition, slot, ctx,
> +				parms, p);
> +		else
> +			resolve_priority(transition, slot, ctx, parms,
> +				p, flows->categories);
> +
> +		/* Count down completed tries for this search request */
> +		parms[slot].cmplt->count--;
> +
> +		/* Fill the slot with the next trie or idle trie */
> +		transition = acl_start_next_trie(flows, parms, slot, ctx);
> +
> +	} else if (transition == ctx->idle) {
> +		/* reset indirection table for idle slots */
> +		parms[slot].data_index = idle;
> +	}
> +
> +	return transition;
> +}
> +
> +#endif
> diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
> deleted file mode 100644
> index e3d9fc1..0000000
> --- a/lib/librte_acl/acl_run.c
> +++ /dev/null
> @@ -1,944 +0,0 @@
> -/*-
> - *   BSD LICENSE
> - *
> - *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> - *   All rights reserved.
> - *
> - *   Redistribution and use in source and binary forms, with or without
> - *   modification, are permitted provided that the following conditions
> - *   are met:
> - *
> - *     * Redistributions of source code must retain the above copyright
> - *       notice, this list of conditions and the following disclaimer.
> - *     * Redistributions in binary form must reproduce the above copyright
> - *       notice, this list of conditions and the following disclaimer in
> - *       the documentation and/or other materials provided with the
> - *       distribution.
> - *     * Neither the name of Intel Corporation nor the names of its
> - *       contributors may be used to endorse or promote products derived
> - *       from this software without specific prior written permission.
> - *
> - *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> - *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> - *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> - *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> - *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> - *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> - *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> - *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> - *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> - *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> - *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> - */
> -
> -#include <rte_acl.h>
> -#include "acl_vect.h"
> -#include "acl.h"
> -
> -#define MAX_SEARCHES_SSE8	8
> -#define MAX_SEARCHES_SSE4	4
> -#define MAX_SEARCHES_SSE2	2
> -#define MAX_SEARCHES_SCALAR	2
> -
> -#define GET_NEXT_4BYTES(prm, idx)	\
> -	(*((const int32_t *)((prm)[(idx)].data + *(prm)[idx].data_index++)))
> -
> -
> -#define RTE_ACL_NODE_INDEX	((uint32_t)~RTE_ACL_NODE_TYPE)
> -
> -#define	SCALAR_QRANGE_MULT	0x01010101
> -#define	SCALAR_QRANGE_MASK	0x7f7f7f7f
> -#define	SCALAR_QRANGE_MIN	0x80808080
> -
> -enum {
> -	SHUFFLE32_SLOT1 = 0xe5,
> -	SHUFFLE32_SLOT2 = 0xe6,
> -	SHUFFLE32_SLOT3 = 0xe7,
> -	SHUFFLE32_SWAP64 = 0x4e,
> -};
> -
> -/*
> - * Structure to manage N parallel trie traversals.
> - * The runtime trie traversal routines can process 8, 4, or 2 tries
> - * in parallel. Each packet may require multiple trie traversals (up to 4).
> - * This structure is used to fill the slots (0 to n-1) for parallel processing
> - * with the trie traversals needed for each packet.
> - */
> -struct acl_flow_data {
> -	uint32_t            num_packets;
> -	/* number of packets processed */
> -	uint32_t            started;
> -	/* number of trie traversals in progress */
> -	uint32_t            trie;
> -	/* current trie index (0 to N-1) */
> -	uint32_t            cmplt_size;
> -	uint32_t            total_packets;
> -	uint32_t            categories;
> -	/* number of result categories per packet. */
> -	/* maximum number of packets to process */
> -	const uint64_t     *trans;
> -	const uint8_t     **data;
> -	uint32_t           *results;
> -	struct completion  *last_cmplt;
> -	struct completion  *cmplt_array;
> -};
> -
> -/*
> - * Structure to maintain running results for
> - * a single packet (up to 4 tries).
> - */
> -struct completion {
> -	uint32_t *results;                          /* running results. */
> -	int32_t   priority[RTE_ACL_MAX_CATEGORIES]; /* running priorities. */
> -	uint32_t  count;                            /* num of remaining tries */
> -	/* true for allocated struct */
> -} __attribute__((aligned(XMM_SIZE)));
> -
> -/*
> - * One parms structure for each slot in the search engine.
> - */
> -struct parms {
> -	const uint8_t              *data;
> -	/* input data for this packet */
> -	const uint32_t             *data_index;
> -	/* data indirection for this trie */
> -	struct completion          *cmplt;
> -	/* completion data for this packet */
> -};
> -
> -/*
> - * Define an global idle node for unused engine slots
> - */
> -static const uint32_t idle[UINT8_MAX + 1];
> -
> -static const rte_xmm_t mm_type_quad_range = {
> -	.u32 = {
> -		RTE_ACL_NODE_QRANGE,
> -		RTE_ACL_NODE_QRANGE,
> -		RTE_ACL_NODE_QRANGE,
> -		RTE_ACL_NODE_QRANGE,
> -	},
> -};
> -
> -static const rte_xmm_t mm_type_quad_range64 = {
> -	.u32 = {
> -		RTE_ACL_NODE_QRANGE,
> -		RTE_ACL_NODE_QRANGE,
> -		0,
> -		0,
> -	},
> -};
> -
> -static const rte_xmm_t mm_shuffle_input = {
> -	.u32 = {0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c},
> -};
> -
> -static const rte_xmm_t mm_shuffle_input64 = {
> -	.u32 = {0x00000000, 0x04040404, 0x80808080, 0x80808080},
> -};
> -
> -static const rte_xmm_t mm_ones_16 = {
> -	.u16 = {1, 1, 1, 1, 1, 1, 1, 1},
> -};
> -
> -static const rte_xmm_t mm_bytes = {
> -	.u32 = {UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX},
> -};
> -
> -static const rte_xmm_t mm_bytes64 = {
> -	.u32 = {UINT8_MAX, UINT8_MAX, 0, 0},
> -};
> -
> -static const rte_xmm_t mm_match_mask = {
> -	.u32 = {
> -		RTE_ACL_NODE_MATCH,
> -		RTE_ACL_NODE_MATCH,
> -		RTE_ACL_NODE_MATCH,
> -		RTE_ACL_NODE_MATCH,
> -	},
> -};
> -
> -static const rte_xmm_t mm_match_mask64 = {
> -	.u32 = {
> -		RTE_ACL_NODE_MATCH,
> -		0,
> -		RTE_ACL_NODE_MATCH,
> -		0,
> -	},
> -};
> -
> -static const rte_xmm_t mm_index_mask = {
> -	.u32 = {
> -		RTE_ACL_NODE_INDEX,
> -		RTE_ACL_NODE_INDEX,
> -		RTE_ACL_NODE_INDEX,
> -		RTE_ACL_NODE_INDEX,
> -	},
> -};
> -
> -static const rte_xmm_t mm_index_mask64 = {
> -	.u32 = {
> -		RTE_ACL_NODE_INDEX,
> -		RTE_ACL_NODE_INDEX,
> -		0,
> -		0,
> -	},
> -};
> -
> -/*
> - * Allocate a completion structure to manage the tries for a packet.
> - */
> -static inline struct completion *
> -alloc_completion(struct completion *p, uint32_t size, uint32_t tries,
> -	uint32_t *results)
> -{
> -	uint32_t n;
> -
> -	for (n = 0; n < size; n++) {
> -
> -		if (p[n].count == 0) {
> -
> -			/* mark as allocated and set number of tries. */
> -			p[n].count = tries;
> -			p[n].results = results;
> -			return &(p[n]);
> -		}
> -	}
> -
> -	/* should never get here */
> -	return NULL;
> -}
> -
> -/*
> - * Resolve priority for a single result trie.
> - */
> -static inline void
> -resolve_single_priority(uint64_t transition, int n,
> -	const struct rte_acl_ctx *ctx, struct parms *parms,
> -	const struct rte_acl_match_results *p)
> -{
> -	if (parms[n].cmplt->count == ctx->num_tries ||
> -			parms[n].cmplt->priority[0] <=
> -			p[transition].priority[0]) {
> -
> -		parms[n].cmplt->priority[0] = p[transition].priority[0];
> -		parms[n].cmplt->results[0] = p[transition].results[0];
> -	}
> -
> -	parms[n].cmplt->count--;
> -}
> -
> -/*
> - * Resolve priority for multiple results. This consists comparing
> - * the priority of the current traversal with the running set of
> - * results for the packet. For each result, keep a running array of
> - * the result (rule number) and its priority for each category.
> - */
> -static inline void
> -resolve_priority(uint64_t transition, int n, const struct rte_acl_ctx *ctx,
> -	struct parms *parms, const struct rte_acl_match_results *p,
> -	uint32_t categories)
> -{
> -	uint32_t x;
> -	xmm_t results, priority, results1, priority1, selector;
> -	xmm_t *saved_results, *saved_priority;
> -
> -	for (x = 0; x < categories; x += RTE_ACL_RESULTS_MULTIPLIER) {
> -
> -		saved_results = (xmm_t *)(&parms[n].cmplt->results[x]);
> -		saved_priority =
> -			(xmm_t *)(&parms[n].cmplt->priority[x]);
> -
> -		/* get results and priorities for completed trie */
> -		results = MM_LOADU((const xmm_t *)&p[transition].results[x]);
> -		priority = MM_LOADU((const xmm_t *)&p[transition].priority[x]);
> -
> -		/* if this is not the first completed trie */
> -		if (parms[n].cmplt->count != ctx->num_tries) {
> -
> -			/* get running best results and their priorities */
> -			results1 = MM_LOADU(saved_results);
> -			priority1 = MM_LOADU(saved_priority);
> -
> -			/* select results that are highest priority */
> -			selector = MM_CMPGT32(priority1, priority);
> -			results = MM_BLENDV8(results, results1, selector);
> -			priority = MM_BLENDV8(priority, priority1, selector);
> -		}
> -
> -		/* save running best results and their priorities */
> -		MM_STOREU(saved_results, results);
> -		MM_STOREU(saved_priority, priority);
> -	}
> -
> -	/* Count down completed tries for this search request */
> -	parms[n].cmplt->count--;
> -}
> -
> -/*
> - * Routine to fill a slot in the parallel trie traversal array (parms) from
> - * the list of packets (flows).
> - */
> -static inline uint64_t
> -acl_start_next_trie(struct acl_flow_data *flows, struct parms *parms, int n,
> -	const struct rte_acl_ctx *ctx)
> -{
> -	uint64_t transition;
> -
> -	/* if there are any more packets to process */
> -	if (flows->num_packets < flows->total_packets) {
> -		parms[n].data = flows->data[flows->num_packets];
> -		parms[n].data_index = ctx->trie[flows->trie].data_index;
> -
> -		/* if this is the first trie for this packet */
> -		if (flows->trie == 0) {
> -			flows->last_cmplt = alloc_completion(flows->cmplt_array,
> -				flows->cmplt_size, ctx->num_tries,
> -				flows->results +
> -				flows->num_packets * flows->categories);
> -		}
> -
> -		/* set completion parameters and starting index for this slot */
> -		parms[n].cmplt = flows->last_cmplt;
> -		transition =
> -			flows->trans[parms[n].data[*parms[n].data_index++] +
> -			ctx->trie[flows->trie].root_index];
> -
> -		/*
> -		 * if this is the last trie for this packet,
> -		 * then setup next packet.
> -		 */
> -		flows->trie++;
> -		if (flows->trie >= ctx->num_tries) {
> -			flows->trie = 0;
> -			flows->num_packets++;
> -		}
> -
> -		/* keep track of number of active trie traversals */
> -		flows->started++;
> -
> -	/* no more tries to process, set slot to an idle position */
> -	} else {
> -		transition = ctx->idle;
> -		parms[n].data = (const uint8_t *)idle;
> -		parms[n].data_index = idle;
> -	}
> -	return transition;
> -}
> -
> -/*
> - * Detect matches. If a match node transition is found, then this trie
> - * traversal is complete and fill the slot with the next trie
> - * to be processed.
> - */
> -static inline uint64_t
> -acl_match_check_transition(uint64_t transition, int slot,
> -	const struct rte_acl_ctx *ctx, struct parms *parms,
> -	struct acl_flow_data *flows)
> -{
> -	const struct rte_acl_match_results *p;
> -
> -	p = (const struct rte_acl_match_results *)
> -		(flows->trans + ctx->match_index);
> -
> -	if (transition & RTE_ACL_NODE_MATCH) {
> -
> -		/* Remove flags from index and decrement active traversals */
> -		transition &= RTE_ACL_NODE_INDEX;
> -		flows->started--;
> -
> -		/* Resolve priorities for this trie and running results */
> -		if (flows->categories == 1)
> -			resolve_single_priority(transition, slot, ctx,
> -				parms, p);
> -		else
> -			resolve_priority(transition, slot, ctx, parms, p,
> -				flows->categories);
> -
> -		/* Fill the slot with the next trie or idle trie */
> -		transition = acl_start_next_trie(flows, parms, slot, ctx);
> -
> -	} else if (transition == ctx->idle) {
> -		/* reset indirection table for idle slots */
> -		parms[slot].data_index = idle;
> -	}
> -
> -	return transition;
> -}
> -
> -/*
> - * Extract transitions from an XMM register and check for any matches
> - */
> -static void
> -acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
> -	struct parms *parms, struct acl_flow_data *flows)
> -{
> -	uint64_t transition1, transition2;
> -
> -	/* extract transition from low 64 bits. */
> -	transition1 = MM_CVT64(*indicies);
> -
> -	/* extract transition from high 64 bits. */
> -	*indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
> -	transition2 = MM_CVT64(*indicies);
> -
> -	transition1 = acl_match_check_transition(transition1, slot, ctx,
> -		parms, flows);
> -	transition2 = acl_match_check_transition(transition2, slot + 1, ctx,
> -		parms, flows);
> -
> -	/* update indicies with new transitions. */
> -	*indicies = MM_SET64(transition2, transition1);
> -}
> -
> -/*
> - * Check for a match in 2 transitions (contained in SSE register)
> - */
> -static inline void
> -acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> -	struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
> -{
> -	xmm_t temp;
> -
> -	temp = MM_AND(match_mask, *indicies);
> -	while (!MM_TESTZ(temp, temp)) {
> -		acl_process_matches(indicies, slot, ctx, parms, flows);
> -		temp = MM_AND(match_mask, *indicies);
> -	}
> -}
> -
> -/*
> - * Check for any match in 4 transitions (contained in 2 SSE registers)
> - */
> -static inline void
> -acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> -	struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
> -	xmm_t match_mask)
> -{
> -	xmm_t temp;
> -
> -	/* put low 32 bits of each transition into one register */
> -	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> -		0x88);
> -	/* test for match node */
> -	temp = MM_AND(match_mask, temp);
> -
> -	while (!MM_TESTZ(temp, temp)) {
> -		acl_process_matches(indicies1, slot, ctx, parms, flows);
> -		acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
> -
> -		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> -					(__m128)*indicies2,
> -					0x88);
> -		temp = MM_AND(match_mask, temp);
> -	}
> -}
> -
> -/*
> - * Calculate the address of the next transition for
> - * all types of nodes. Note that only DFA nodes and range
> - * nodes actually transition to another node. Match
> - * nodes don't move.
> - */
> -static inline xmm_t
> -acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> -	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> -	xmm_t *indicies1, xmm_t *indicies2)
> -{
> -	xmm_t addr, node_types, temp;
> -
> -	/*
> -	 * Note that no transition is done for a match
> -	 * node and therefore a stream freezes when
> -	 * it reaches a match.
> -	 */
> -
> -	/* Shuffle low 32 into temp and high 32 into indicies2 */
> -	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> -		0x88);
> -	*indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> -		(__m128)*indicies2, 0xdd);
> -
> -	/* Calc node type and node addr */
> -	node_types = MM_ANDNOT(index_mask, temp);
> -	addr = MM_AND(index_mask, temp);
> -
> -	/*
> -	 * Calc addr for DFAs - addr = dfa_index + input_byte
> -	 */
> -
> -	/* mask for DFA type (0) nodes */
> -	temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
> -
> -	/* add input byte to DFA position */
> -	temp = MM_AND(temp, bytes);
> -	temp = MM_AND(temp, next_input);
> -	addr = MM_ADD32(addr, temp);
> -
> -	/*
> -	 * Calc addr for Range nodes -> range_index + range(input)
> -	 */
> -	node_types = MM_CMPEQ32(node_types, type_quad_range);
> -
> -	/*
> -	 * Calculate number of range boundaries that are less than the
> -	 * input value. Range boundaries for each node are in signed 8 bit,
> -	 * ordered from -128 to 127 in the indicies2 register.
> -	 * This is effectively a popcnt of bytes that are greater than the
> -	 * input byte.
> -	 */
> -
> -	/* shuffle input byte to all 4 positions of 32 bit value */
> -	temp = MM_SHUFFLE8(next_input, shuffle_input);
> -
> -	/* check ranges */
> -	temp = MM_CMPGT8(temp, *indicies2);
> -
> -	/* convert -1 to 1 (bytes greater than input byte) */
> -	temp = MM_SIGN8(temp, temp);
> -
> -	/* horizontal add pairs of bytes into words */
> -	temp = MM_MADD8(temp, temp);
> -
> -	/* horizontal add pairs of words into dwords */
> -	temp = MM_MADD16(temp, ones_16);
> -
> -	/* mask to range type nodes */
> -	temp = MM_AND(temp, node_types);
> -
> -	/* add index into node position */
> -	return MM_ADD32(addr, temp);
> -}
> -
> -/*
> - * Process 4 transitions (in 2 SIMD registers) in parallel
> - */
> -static inline xmm_t
> -transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> -	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> -	const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
> -{
> -	xmm_t addr;
> -	uint64_t trans0, trans2;
> -
> -	 /* Calculate the address (array index) for all 4 transitions. */
> -
> -	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> -		bytes, type_quad_range, indicies1, indicies2);
> -
> -	 /* Gather 64 bit transitions and pack back into 2 registers. */
> -
> -	trans0 = trans[MM_CVT32(addr)];
> -
> -	/* get slot 2 */
> -
> -	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
> -	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
> -	trans2 = trans[MM_CVT32(addr)];
> -
> -	/* get slot 1 */
> -
> -	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
> -	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> -	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
> -
> -	/* get slot 3 */
> -
> -	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
> -	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
> -	*indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
> -
> -	return MM_SRL32(next_input, 8);
> -}
> -
> -static inline void
> -acl_set_flow(struct acl_flow_data *flows, struct completion *cmplt,
> -	uint32_t cmplt_size, const uint8_t **data, uint32_t *results,
> -	uint32_t data_num, uint32_t categories, const uint64_t *trans)
> -{
> -	flows->num_packets = 0;
> -	flows->started = 0;
> -	flows->trie = 0;
> -	flows->last_cmplt = NULL;
> -	flows->cmplt_array = cmplt;
> -	flows->total_packets = data_num;
> -	flows->categories = categories;
> -	flows->cmplt_size = cmplt_size;
> -	flows->data = data;
> -	flows->results = results;
> -	flows->trans = trans;
> -}
> -
> -/*
> - * Execute trie traversal with 8 traversals in parallel
> - */
> -static inline void
> -search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
> -	uint32_t *results, uint32_t total_packets, uint32_t categories)
> -{
> -	int n;
> -	struct acl_flow_data flows;
> -	uint64_t index_array[MAX_SEARCHES_SSE8];
> -	struct completion cmplt[MAX_SEARCHES_SSE8];
> -	struct parms parms[MAX_SEARCHES_SSE8];
> -	xmm_t input0, input1;
> -	xmm_t indicies1, indicies2, indicies3, indicies4;
> -
> -	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> -		total_packets, categories, ctx->trans_table);
> -
> -	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
> -		cmplt[n].count = 0;
> -		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> -	}
> -
> -	/*
> -	 * indicies1 contains index_array[0,1]
> -	 * indicies2 contains index_array[2,3]
> -	 * indicies3 contains index_array[4,5]
> -	 * indicies4 contains index_array[6,7]
> -	 */
> -
> -	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> -	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> -
> -	indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
> -	indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
> -
> -	 /* Check for any matches. */
> -	acl_match_check_x4(0, ctx, parms, &flows,
> -		&indicies1, &indicies2, mm_match_mask.m);
> -	acl_match_check_x4(4, ctx, parms, &flows,
> -		&indicies3, &indicies4, mm_match_mask.m);
> -
> -	while (flows.started > 0) {
> -
> -		/* Gather 4 bytes of input data for each stream. */
> -		input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
> -			0);
> -		input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
> -			0);
> -
> -		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
> -		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
> -
> -		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
> -		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
> -
> -		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
> -		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
> -
> -		 /* Process the 4 bytes of input on each stream. */
> -
> -		input0 = transition4(mm_index_mask.m, input0,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies1, &indicies2);
> -
> -		input1 = transition4(mm_index_mask.m, input1,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies3, &indicies4);
> -
> -		input0 = transition4(mm_index_mask.m, input0,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies1, &indicies2);
> -
> -		input1 = transition4(mm_index_mask.m, input1,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies3, &indicies4);
> -
> -		input0 = transition4(mm_index_mask.m, input0,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies1, &indicies2);
> -
> -		input1 = transition4(mm_index_mask.m, input1,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies3, &indicies4);
> -
> -		input0 = transition4(mm_index_mask.m, input0,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies1, &indicies2);
> -
> -		input1 = transition4(mm_index_mask.m, input1,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies3, &indicies4);
> -
> -		 /* Check for any matches. */
> -		acl_match_check_x4(0, ctx, parms, &flows,
> -			&indicies1, &indicies2, mm_match_mask.m);
> -		acl_match_check_x4(4, ctx, parms, &flows,
> -			&indicies3, &indicies4, mm_match_mask.m);
> -	}
> -}
> -
> -/*
> - * Execute trie traversal with 4 traversals in parallel
> - */
> -static inline void
> -search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
> -	 uint32_t *results, int total_packets, uint32_t categories)
> -{
> -	int n;
> -	struct acl_flow_data flows;
> -	uint64_t index_array[MAX_SEARCHES_SSE4];
> -	struct completion cmplt[MAX_SEARCHES_SSE4];
> -	struct parms parms[MAX_SEARCHES_SSE4];
> -	xmm_t input, indicies1, indicies2;
> -
> -	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> -		total_packets, categories, ctx->trans_table);
> -
> -	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
> -		cmplt[n].count = 0;
> -		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> -	}
> -
> -	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> -	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> -
> -	/* Check for any matches. */
> -	acl_match_check_x4(0, ctx, parms, &flows,
> -		&indicies1, &indicies2, mm_match_mask.m);
> -
> -	while (flows.started > 0) {
> -
> -		/* Gather 4 bytes of input data for each stream. */
> -		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> -		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> -		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
> -		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
> -
> -		/* Process the 4 bytes of input on each stream. */
> -		input = transition4(mm_index_mask.m, input,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies1, &indicies2);
> -
> -		 input = transition4(mm_index_mask.m, input,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies1, &indicies2);
> -
> -		 input = transition4(mm_index_mask.m, input,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies1, &indicies2);
> -
> -		 input = transition4(mm_index_mask.m, input,
> -			mm_shuffle_input.m, mm_ones_16.m,
> -			mm_bytes.m, mm_type_quad_range.m,
> -			flows.trans, &indicies1, &indicies2);
> -
> -		/* Check for any matches. */
> -		acl_match_check_x4(0, ctx, parms, &flows,
> -			&indicies1, &indicies2, mm_match_mask.m);
> -	}
> -}
> -
> -static inline xmm_t
> -transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> -	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> -	const uint64_t *trans, xmm_t *indicies1)
> -{
> -	uint64_t t;
> -	xmm_t addr, indicies2;
> -
> -	indicies2 = MM_XOR(ones_16, ones_16);
> -
> -	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> -		bytes, type_quad_range, indicies1, &indicies2);
> -
> -	/* Gather 64 bit transitions and pack 2 per register. */
> -
> -	t = trans[MM_CVT32(addr)];
> -
> -	/* get slot 1 */
> -	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> -	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
> -
> -	return MM_SRL32(next_input, 8);
> -}
> -
> -/*
> - * Execute trie traversal with 2 traversals in parallel.
> - */
> -static inline void
> -search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
> -	uint32_t *results, uint32_t total_packets, uint32_t categories)
> -{
> -	int n;
> -	struct acl_flow_data flows;
> -	uint64_t index_array[MAX_SEARCHES_SSE2];
> -	struct completion cmplt[MAX_SEARCHES_SSE2];
> -	struct parms parms[MAX_SEARCHES_SSE2];
> -	xmm_t input, indicies;
> -
> -	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> -		total_packets, categories, ctx->trans_table);
> -
> -	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
> -		cmplt[n].count = 0;
> -		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> -	}
> -
> -	indicies = MM_LOADU((xmm_t *) &index_array[0]);
> -
> -	/* Check for any matches. */
> -	acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
> -
> -	while (flows.started > 0) {
> -
> -		/* Gather 4 bytes of input data for each stream. */
> -		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> -		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> -
> -		/* Process the 4 bytes of input on each stream. */
> -
> -		input = transition2(mm_index_mask64.m, input,
> -			mm_shuffle_input64.m, mm_ones_16.m,
> -			mm_bytes64.m, mm_type_quad_range64.m,
> -			flows.trans, &indicies);
> -
> -		input = transition2(mm_index_mask64.m, input,
> -			mm_shuffle_input64.m, mm_ones_16.m,
> -			mm_bytes64.m, mm_type_quad_range64.m,
> -			flows.trans, &indicies);
> -
> -		input = transition2(mm_index_mask64.m, input,
> -			mm_shuffle_input64.m, mm_ones_16.m,
> -			mm_bytes64.m, mm_type_quad_range64.m,
> -			flows.trans, &indicies);
> -
> -		input = transition2(mm_index_mask64.m, input,
> -			mm_shuffle_input64.m, mm_ones_16.m,
> -			mm_bytes64.m, mm_type_quad_range64.m,
> -			flows.trans, &indicies);
> -
> -		/* Check for any matches. */
> -		acl_match_check_x2(0, ctx, parms, &flows, &indicies,
> -			mm_match_mask64.m);
> -	}
> -}
> -
> -/*
> - * When processing the transition, rather than using an if/else
> - * construct, the offset is calculated for DFA and QRANGE and
> - * then conditionally added to the address based on node type.
> - * This is done to avoid branch mis-predictions. Since the
> - * offset is a rather simple calculation, it is more efficient
> - * to do the calculation and a conditional move rather than
> - * a conditional branch to determine which calculation to do.
> - */
> -static inline uint32_t
> -scan_forward(uint32_t input, uint32_t max)
> -{
> -	return (input == 0) ? max : rte_bsf32(input);
> -}
> -
> -static inline uint64_t
> -scalar_transition(const uint64_t *trans_table, uint64_t transition,
> -	uint8_t input)
> -{
> -	uint32_t addr, index, ranges, x, a, b, c;
> -
> -	/* break transition into component parts */
> -	ranges = transition >> (sizeof(index) * CHAR_BIT);
> -
> -	/* calc address for a QRANGE node */
> -	c = input * SCALAR_QRANGE_MULT;
> -	a = ranges | SCALAR_QRANGE_MIN;
> -	index = transition & ~RTE_ACL_NODE_INDEX;
> -	a -= (c & SCALAR_QRANGE_MASK);
> -	b = c & SCALAR_QRANGE_MIN;
> -	addr = transition ^ index;
> -	a &= SCALAR_QRANGE_MIN;
> -	a ^= (ranges ^ b) & (a ^ b);
> -	x = scan_forward(a, 32) >> 3;
> -	addr += (index == RTE_ACL_NODE_DFA) ? input : x;
> -
> -	/* pickup next transition */
> -	transition = *(trans_table + addr);
> -	return transition;
> -}
> -
> -int
> -rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> -	uint32_t *results, uint32_t num, uint32_t categories)
> -{
> -	int n;
> -	uint64_t transition0, transition1;
> -	uint32_t input0, input1;
> -	struct acl_flow_data flows;
> -	uint64_t index_array[MAX_SEARCHES_SCALAR];
> -	struct completion cmplt[MAX_SEARCHES_SCALAR];
> -	struct parms parms[MAX_SEARCHES_SCALAR];
> -
> -	if (categories != 1 &&
> -		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> -		return -EINVAL;
> -
> -	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, num,
> -		categories, ctx->trans_table);
> -
> -	for (n = 0; n < MAX_SEARCHES_SCALAR; n++) {
> -		cmplt[n].count = 0;
> -		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> -	}
> -
> -	transition0 = index_array[0];
> -	transition1 = index_array[1];
> -
> -	while (flows.started > 0) {
> -
> -		input0 = GET_NEXT_4BYTES(parms, 0);
> -		input1 = GET_NEXT_4BYTES(parms, 1);
> -
> -		for (n = 0; n < 4; n++) {
> -			if (likely((transition0 & RTE_ACL_NODE_MATCH) == 0))
> -				transition0 = scalar_transition(flows.trans,
> -					transition0, (uint8_t)input0);
> -
> -			input0 >>= CHAR_BIT;
> -
> -			if (likely((transition1 & RTE_ACL_NODE_MATCH) == 0))
> -				transition1 = scalar_transition(flows.trans,
> -					transition1, (uint8_t)input1);
> -
> -			input1 >>= CHAR_BIT;
> -
> -		}
> -		if ((transition0 | transition1) & RTE_ACL_NODE_MATCH) {
> -			transition0 = acl_match_check_transition(transition0,
> -				0, ctx, parms, &flows);
> -			transition1 = acl_match_check_transition(transition1,
> -				1, ctx, parms, &flows);
> -
> -		}
> -	}
> -	return 0;
> -}
> -
> -int
> -rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
> -	uint32_t *results, uint32_t num, uint32_t categories)
> -{
> -	if (categories != 1 &&
> -		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> -		return -EINVAL;
> -
> -	if (likely(num >= MAX_SEARCHES_SSE8))
> -		search_sse_8(ctx, data, results, num, categories);
> -	else if (num >= MAX_SEARCHES_SSE4)
> -		search_sse_4(ctx, data, results, num, categories);
> -	else
> -		search_sse_2(ctx, data, results, num, categories);
> -
> -	return 0;
> -}
> diff --git a/lib/librte_acl/acl_run.h b/lib/librte_acl/acl_run.h
> new file mode 100644
> index 0000000..c39650e
> --- /dev/null
> +++ b/lib/librte_acl/acl_run.h
> @@ -0,0 +1,220 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef	_ACL_RUN_H_
> +#define	_ACL_RUN_H_
> +
> +#include <rte_acl.h>
> +#include "acl_vect.h"
> +#include "acl.h"
> +
> +#define MAX_SEARCHES_SSE8	8
> +#define MAX_SEARCHES_SSE4	4
> +#define MAX_SEARCHES_SSE2	2
> +#define MAX_SEARCHES_SCALAR	2
> +
> +#define GET_NEXT_4BYTES(prm, idx)	\
> +	(*((const int32_t *)((prm)[(idx)].data + *(prm)[idx].data_index++)))
> +
> +
> +#define RTE_ACL_NODE_INDEX	((uint32_t)~RTE_ACL_NODE_TYPE)
> +
> +#define	SCALAR_QRANGE_MULT	0x01010101
> +#define	SCALAR_QRANGE_MASK	0x7f7f7f7f
> +#define	SCALAR_QRANGE_MIN	0x80808080
> +
> +/*
> + * Structure to manage N parallel trie traversals.
> + * The runtime trie traversal routines can process 8, 4, or 2 tries
> + * in parallel. Each packet may require multiple trie traversals (up to 4).
> + * This structure is used to fill the slots (0 to n-1) for parallel processing
> + * with the trie traversals needed for each packet.
> + */
> +struct acl_flow_data {
> +	uint32_t            num_packets;
> +	/* number of packets processed */
> +	uint32_t            started;
> +	/* number of trie traversals in progress */
> +	uint32_t            trie;
> +	/* current trie index (0 to N-1) */
> +	uint32_t            cmplt_size;
> +	uint32_t            total_packets;
> +	/* maximum number of packets to process */
> +	uint32_t            categories;
> +	/* number of result categories per packet. */
> +	const uint64_t     *trans;
> +	const uint8_t     **data;
> +	uint32_t           *results;
> +	struct completion  *last_cmplt;
> +	struct completion  *cmplt_array;
> +};
> +
> +/*
> + * Structure to maintain running results for
> + * a single packet (up to 4 tries).
> + */
> +struct completion {
> +	uint32_t *results;                          /* running results. */
> +	int32_t   priority[RTE_ACL_MAX_CATEGORIES]; /* running priorities. */
> +	uint32_t  count;                            /* num of remaining tries */
> +	/* true for allocated struct */
> +} __attribute__((aligned(XMM_SIZE)));
> +
> +/*
> + * One parms structure for each slot in the search engine.
> + */
> +struct parms {
> +	const uint8_t              *data;
> +	/* input data for this packet */
> +	const uint32_t             *data_index;
> +	/* data indirection for this trie */
> +	struct completion          *cmplt;
> +	/* completion data for this packet */
> +};
> +
> +/*
> + * Define a global idle node for unused engine slots
> + */
> +static const uint32_t idle[UINT8_MAX + 1];
> +
> +/*
> + * Allocate a completion structure to manage the tries for a packet.
> + */
> +static inline struct completion *
> +alloc_completion(struct completion *p, uint32_t size, uint32_t tries,
> +	uint32_t *results)
> +{
> +	uint32_t n;
> +
> +	for (n = 0; n < size; n++) {
> +
> +		if (p[n].count == 0) {
> +
> +			/* mark as allocated and set number of tries. */
> +			p[n].count = tries;
> +			p[n].results = results;
> +			return &(p[n]);
> +		}
> +	}
> +
> +	/* should never get here */
> +	return NULL;
> +}
> +
> +/*
> + * Resolve priority for a single result trie.
> + */
> +static inline void
> +resolve_single_priority(uint64_t transition, int n,
> +	const struct rte_acl_ctx *ctx, struct parms *parms,
> +	const struct rte_acl_match_results *p)
> +{
> +	if (parms[n].cmplt->count == ctx->num_tries ||
> +			parms[n].cmplt->priority[0] <=
> +			p[transition].priority[0]) {
> +
> +		parms[n].cmplt->priority[0] = p[transition].priority[0];
> +		parms[n].cmplt->results[0] = p[transition].results[0];
> +	}
> +}
> +
> +/*
> + * Routine to fill a slot in the parallel trie traversal array (parms) from
> + * the list of packets (flows).
> + */
> +static inline uint64_t
> +acl_start_next_trie(struct acl_flow_data *flows, struct parms *parms, int n,
> +	const struct rte_acl_ctx *ctx)
> +{
> +	uint64_t transition;
> +
> +	/* if there are any more packets to process */
> +	if (flows->num_packets < flows->total_packets) {
> +		parms[n].data = flows->data[flows->num_packets];
> +		parms[n].data_index = ctx->trie[flows->trie].data_index;
> +
> +		/* if this is the first trie for this packet */
> +		if (flows->trie == 0) {
> +			flows->last_cmplt = alloc_completion(flows->cmplt_array,
> +				flows->cmplt_size, ctx->num_tries,
> +				flows->results +
> +				flows->num_packets * flows->categories);
> +		}
> +
> +		/* set completion parameters and starting index for this slot */
> +		parms[n].cmplt = flows->last_cmplt;
> +		transition =
> +			flows->trans[parms[n].data[*parms[n].data_index++] +
> +			ctx->trie[flows->trie].root_index];
> +
> +		/*
> +		 * if this is the last trie for this packet,
> +		 * then setup next packet.
> +		 */
> +		flows->trie++;
> +		if (flows->trie >= ctx->num_tries) {
> +			flows->trie = 0;
> +			flows->num_packets++;
> +		}
> +
> +		/* keep track of number of active trie traversals */
> +		flows->started++;
> +
> +	/* no more tries to process, set slot to an idle position */
> +	} else {
> +		transition = ctx->idle;
> +		parms[n].data = (const uint8_t *)idle;
> +		parms[n].data_index = idle;
> +	}
> +	return transition;
> +}
> +
> +static inline void
> +acl_set_flow(struct acl_flow_data *flows, struct completion *cmplt,
> +	uint32_t cmplt_size, const uint8_t **data, uint32_t *results,
> +	uint32_t data_num, uint32_t categories, const uint64_t *trans)
> +{
> +	flows->num_packets = 0;
> +	flows->started = 0;
> +	flows->trie = 0;
> +	flows->last_cmplt = NULL;
> +	flows->cmplt_array = cmplt;
> +	flows->total_packets = data_num;
> +	flows->categories = categories;
> +	flows->cmplt_size = cmplt_size;
> +	flows->data = data;
> +	flows->results = results;
> +	flows->trans = trans;
> +}
> +
> +#endif /* _ACL_RUN_H_ */
> diff --git a/lib/librte_acl/acl_run_scalar.c b/lib/librte_acl/acl_run_scalar.c
> new file mode 100644
> index 0000000..a59ff17
> --- /dev/null
> +++ b/lib/librte_acl/acl_run_scalar.c
> @@ -0,0 +1,198 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#include "acl_run.h"
> +#include "acl_match_check.h"
> +
> +int
> +rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +        uint32_t *results, uint32_t num, uint32_t categories);
> +
> +/*
> + * Resolve priority for multiple results (scalar version).
> + * This consists of comparing the priority of the current traversal with the
> + * running set of results for the packet.
> + * For each result, keep a running array of the result (rule number) and
> + * its priority for each category.
> + */
> +static inline void
> +resolve_priority_scalar(uint64_t transition, int n,
> +	const struct rte_acl_ctx *ctx, struct parms *parms,
> +	const struct rte_acl_match_results *p, uint32_t categories)
> +{
> +	uint32_t i;
> +	int32_t *saved_priority;
> +	uint32_t *saved_results;
> +	const int32_t *priority;
> +	const uint32_t *results;
> +
> +	saved_results = parms[n].cmplt->results;
> +	saved_priority = parms[n].cmplt->priority;
> +
> +	/* results and priorities for completed trie */
> +	results = p[transition].results;
> +	priority = p[transition].priority;
> +
> +	/* if this is not the first completed trie */
> +	if (parms[n].cmplt->count != ctx->num_tries) {
> +		for (i = 0; i < categories; i += RTE_ACL_RESULTS_MULTIPLIER) {
> +
> +			if (saved_priority[i] <= priority[i]) {
> +				saved_priority[i] = priority[i];
> +				saved_results[i] = results[i];
> +			}
> +			if (saved_priority[i + 1] <= priority[i + 1]) {
> +				saved_priority[i + 1] = priority[i + 1];
> +				saved_results[i + 1] = results[i + 1];
> +			}
> +			if (saved_priority[i + 2] <= priority[i + 2]) {
> +				saved_priority[i + 2] = priority[i + 2];
> +				saved_results[i + 2] = results[i + 2];
> +			}
> +			if (saved_priority[i + 3] <= priority[i + 3]) {
> +				saved_priority[i + 3] = priority[i + 3];
> +				saved_results[i + 3] = results[i + 3];
> +			}
> +		}
> +	} else {
> +		for (i = 0; i < categories; i += RTE_ACL_RESULTS_MULTIPLIER) {
> +			saved_priority[i] = priority[i];
> +			saved_priority[i + 1] = priority[i + 1];
> +			saved_priority[i + 2] = priority[i + 2];
> +			saved_priority[i + 3] = priority[i + 3];
> +
> +			saved_results[i] = results[i];
> +			saved_results[i + 1] = results[i + 1];
> +			saved_results[i + 2] = results[i + 2];
> +			saved_results[i + 3] = results[i + 3];
> +		}
> +	}
> +}
> +
> +/*
> + * When processing the transition, rather than using an if/else
> + * construct, the offset is calculated for DFA and QRANGE and
> + * then conditionally added to the address based on node type.
> + * This is done to avoid branch mis-predictions. Since the
> + * offset is a rather simple calculation, it is more efficient
> + * to do the calculation and a conditional move rather than
> + * a conditional branch to determine which calculation to do.
> + */
> +static inline uint32_t
> +scan_forward(uint32_t input, uint32_t max)
> +{
> +	return (input == 0) ? max : rte_bsf32(input);
> +}
> +
> +static inline uint64_t
> +scalar_transition(const uint64_t *trans_table, uint64_t transition,
> +	uint8_t input)
> +{
> +	uint32_t addr, index, ranges, x, a, b, c;
> +
> +	/* break transition into component parts */
> +	ranges = transition >> (sizeof(index) * CHAR_BIT);
> +
> +	/* calc address for a QRANGE node */
> +	c = input * SCALAR_QRANGE_MULT;
> +	a = ranges | SCALAR_QRANGE_MIN;
> +	index = transition & ~RTE_ACL_NODE_INDEX;
> +	a -= (c & SCALAR_QRANGE_MASK);
> +	b = c & SCALAR_QRANGE_MIN;
> +	addr = transition ^ index;
> +	a &= SCALAR_QRANGE_MIN;
> +	a ^= (ranges ^ b) & (a ^ b);
> +	x = scan_forward(a, 32) >> 3;
> +	addr += (index == RTE_ACL_NODE_DFA) ? input : x;
> +
> +	/* pickup next transition */
> +	transition = *(trans_table + addr);
> +	return transition;
> +}
> +
> +int
> +rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +	uint32_t *results, uint32_t num, uint32_t categories)
> +{
> +	int n;
> +	uint64_t transition0, transition1;
> +	uint32_t input0, input1;
> +	struct acl_flow_data flows;
> +	uint64_t index_array[MAX_SEARCHES_SCALAR];
> +	struct completion cmplt[MAX_SEARCHES_SCALAR];
> +	struct parms parms[MAX_SEARCHES_SCALAR];
> +
> +	if (categories != 1 &&
> +		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> +		return -EINVAL;
> +
> +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, num,
> +		categories, ctx->trans_table);
> +
> +	for (n = 0; n < MAX_SEARCHES_SCALAR; n++) {
> +		cmplt[n].count = 0;
> +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> +	}
> +
> +	transition0 = index_array[0];
> +	transition1 = index_array[1];
> +
> +	while (flows.started > 0) {
> +
> +		input0 = GET_NEXT_4BYTES(parms, 0);
> +		input1 = GET_NEXT_4BYTES(parms, 1);
> +
> +		for (n = 0; n < 4; n++) {
> +			if (likely((transition0 & RTE_ACL_NODE_MATCH) == 0))
> +				transition0 = scalar_transition(flows.trans,
> +					transition0, (uint8_t)input0);
> +
> +			input0 >>= CHAR_BIT;
> +
> +			if (likely((transition1 & RTE_ACL_NODE_MATCH) == 0))
> +				transition1 = scalar_transition(flows.trans,
> +					transition1, (uint8_t)input1);
> +
> +			input1 >>= CHAR_BIT;
> +
> +		}
> +		if ((transition0 | transition1) & RTE_ACL_NODE_MATCH) {
> +			transition0 = acl_match_check(transition0,
> +				0, ctx, parms, &flows, resolve_priority_scalar);
> +			transition1 = acl_match_check(transition1,
> +				1, ctx, parms, &flows, resolve_priority_scalar);
> +
> +		}
> +	}
> +	return 0;
> +}
> diff --git a/lib/librte_acl/acl_run_sse.c b/lib/librte_acl/acl_run_sse.c
> new file mode 100644
> index 0000000..3f5c721
> --- /dev/null
> +++ b/lib/librte_acl/acl_run_sse.c
> @@ -0,0 +1,627 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#include "acl_run.h"
> +#include "acl_match_check.h"
> +
> +enum {
> +	SHUFFLE32_SLOT1 = 0xe5,
> +	SHUFFLE32_SLOT2 = 0xe6,
> +	SHUFFLE32_SLOT3 = 0xe7,
> +	SHUFFLE32_SWAP64 = 0x4e,
> +};
> +
> +static const rte_xmm_t mm_type_quad_range = {
> +	.u32 = {
> +		RTE_ACL_NODE_QRANGE,
> +		RTE_ACL_NODE_QRANGE,
> +		RTE_ACL_NODE_QRANGE,
> +		RTE_ACL_NODE_QRANGE,
> +	},
> +};
> +
> +static const rte_xmm_t mm_type_quad_range64 = {
> +	.u32 = {
> +		RTE_ACL_NODE_QRANGE,
> +		RTE_ACL_NODE_QRANGE,
> +		0,
> +		0,
> +	},
> +};
> +
> +static const rte_xmm_t mm_shuffle_input = {
> +	.u32 = {0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c},
> +};
> +
> +static const rte_xmm_t mm_shuffle_input64 = {
> +	.u32 = {0x00000000, 0x04040404, 0x80808080, 0x80808080},
> +};
> +
> +static const rte_xmm_t mm_ones_16 = {
> +	.u16 = {1, 1, 1, 1, 1, 1, 1, 1},
> +};
> +
> +static const rte_xmm_t mm_bytes = {
> +	.u32 = {UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX},
> +};
> +
> +static const rte_xmm_t mm_bytes64 = {
> +	.u32 = {UINT8_MAX, UINT8_MAX, 0, 0},
> +};
> +
> +static const rte_xmm_t mm_match_mask = {
> +	.u32 = {
> +		RTE_ACL_NODE_MATCH,
> +		RTE_ACL_NODE_MATCH,
> +		RTE_ACL_NODE_MATCH,
> +		RTE_ACL_NODE_MATCH,
> +	},
> +};
> +
> +static const rte_xmm_t mm_match_mask64 = {
> +	.u32 = {
> +		RTE_ACL_NODE_MATCH,
> +		0,
> +		RTE_ACL_NODE_MATCH,
> +		0,
> +	},
> +};
> +
> +static const rte_xmm_t mm_index_mask = {
> +	.u32 = {
> +		RTE_ACL_NODE_INDEX,
> +		RTE_ACL_NODE_INDEX,
> +		RTE_ACL_NODE_INDEX,
> +		RTE_ACL_NODE_INDEX,
> +	},
> +};
> +
> +static const rte_xmm_t mm_index_mask64 = {
> +	.u32 = {
> +		RTE_ACL_NODE_INDEX,
> +		RTE_ACL_NODE_INDEX,
> +		0,
> +		0,
> +	},
> +};
> +
> +
> +/*
> + * Resolve priority for multiple results (sse version).
> + * This consists of comparing the priority of the current traversal with the
> + * running set of results for the packet.
> + * For each result, keep a running array of the result (rule number) and
> + * its priority for each category.
> + */
> +static inline void
> +resolve_priority_sse(uint64_t transition, int n, const struct rte_acl_ctx *ctx,
> +	struct parms *parms, const struct rte_acl_match_results *p,
> +	uint32_t categories)
> +{
> +	uint32_t x;
> +	xmm_t results, priority, results1, priority1, selector;
> +	xmm_t *saved_results, *saved_priority;
> +
> +	for (x = 0; x < categories; x += RTE_ACL_RESULTS_MULTIPLIER) {
> +
> +		saved_results = (xmm_t *)(&parms[n].cmplt->results[x]);
> +		saved_priority =
> +			(xmm_t *)(&parms[n].cmplt->priority[x]);
> +
> +		/* get results and priorities for completed trie */
> +		results = MM_LOADU((const xmm_t *)&p[transition].results[x]);
> +		priority = MM_LOADU((const xmm_t *)&p[transition].priority[x]);
> +
> +		/* if this is not the first completed trie */
> +		if (parms[n].cmplt->count != ctx->num_tries) {
> +
> +			/* get running best results and their priorities */
> +			results1 = MM_LOADU(saved_results);
> +			priority1 = MM_LOADU(saved_priority);
> +
> +			/* select results that are highest priority */
> +			selector = MM_CMPGT32(priority1, priority);
> +			results = MM_BLENDV8(results, results1, selector);
> +			priority = MM_BLENDV8(priority, priority1, selector);
> +		}
> +
> +		/* save running best results and their priorities */
> +		MM_STOREU(saved_results, results);
> +		MM_STOREU(saved_priority, priority);
> +	}
> +}
> +
> +/*
> + * Extract transitions from an XMM register and check for any matches
> + */
> +static void
> +acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
> +	struct parms *parms, struct acl_flow_data *flows)
> +{
> +	uint64_t transition1, transition2;
> +
> +	/* extract transition from low 64 bits. */
> +	transition1 = MM_CVT64(*indicies);
> +
> +	/* extract transition from high 64 bits. */
> +	*indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
> +	transition2 = MM_CVT64(*indicies);
> +
> +	transition1 = acl_match_check(transition1, slot, ctx,
> +		parms, flows, resolve_priority_sse);
> +	transition2 = acl_match_check(transition2, slot + 1, ctx,
> +		parms, flows, resolve_priority_sse);
> +
> +	/* update indicies with new transitions. */
> +	*indicies = MM_SET64(transition2, transition1);
> +}
> +
> +/*
> + * Check for a match in 2 transitions (contained in SSE register)
> + */
> +static inline void
> +acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> +	struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
> +{
> +	xmm_t temp;
> +
> +	temp = MM_AND(match_mask, *indicies);
> +	while (!MM_TESTZ(temp, temp)) {
> +		acl_process_matches(indicies, slot, ctx, parms, flows);
> +		temp = MM_AND(match_mask, *indicies);
> +	}
> +}
> +
> +/*
> + * Check for any match in 4 transitions (contained in 2 SSE registers)
> + */
> +static inline void
> +acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> +	struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
> +	xmm_t match_mask)
> +{
> +	xmm_t temp;
> +
> +	/* put low 32 bits of each transition into one register */
> +	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> +		0x88);
> +	/* test for match node */
> +	temp = MM_AND(match_mask, temp);
> +
> +	while (!MM_TESTZ(temp, temp)) {
> +		acl_process_matches(indicies1, slot, ctx, parms, flows);
> +		acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
> +
> +		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> +					(__m128)*indicies2,
> +					0x88);
> +		temp = MM_AND(match_mask, temp);
> +	}
> +}
> +
> +/*
> + * Calculate the address of the next transition for
> + * all types of nodes. Note that only DFA nodes and range
> + * nodes actually transition to another node. Match
> + * nodes don't move.
> + */
> +static inline xmm_t
> +acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> +	xmm_t *indicies1, xmm_t *indicies2)
> +{
> +	xmm_t addr, node_types, temp;
> +
> +	/*
> +	 * Note that no transition is done for a match
> +	 * node and therefore a stream freezes when
> +	 * it reaches a match.
> +	 */
> +
> +	/* Shuffle low 32 into temp and high 32 into indicies2 */
> +	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> +		0x88);
> +	*indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> +		(__m128)*indicies2, 0xdd);
> +
> +	/* Calc node type and node addr */
> +	node_types = MM_ANDNOT(index_mask, temp);
> +	addr = MM_AND(index_mask, temp);
> +
> +	/*
> +	 * Calc addr for DFAs - addr = dfa_index + input_byte
> +	 */
> +
> +	/* mask for DFA type (0) nodes */
> +	temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
> +
> +	/* add input byte to DFA position */
> +	temp = MM_AND(temp, bytes);
> +	temp = MM_AND(temp, next_input);
> +	addr = MM_ADD32(addr, temp);
> +
> +	/*
> +	 * Calc addr for Range nodes -> range_index + range(input)
> +	 */
> +	node_types = MM_CMPEQ32(node_types, type_quad_range);
> +
> +	/*
> +	 * Calculate number of range boundaries that are less than the
> +	 * input value. Range boundaries for each node are in signed 8 bit,
> +	 * ordered from -128 to 127 in the indicies2 register.
> +	 * This is effectively a popcnt of bytes that are greater than the
> +	 * input byte.
> +	 */
> +
> +	/* shuffle input byte to all 4 positions of 32 bit value */
> +	temp = MM_SHUFFLE8(next_input, shuffle_input);
> +
> +	/* check ranges */
> +	temp = MM_CMPGT8(temp, *indicies2);
> +
> +	/* convert -1 to 1 (bytes greater than input byte) */
> +	temp = MM_SIGN8(temp, temp);
> +
> +	/* horizontal add pairs of bytes into words */
> +	temp = MM_MADD8(temp, temp);
> +
> +	/* horizontal add pairs of words into dwords */
> +	temp = MM_MADD16(temp, ones_16);
> +
> +	/* mask to range type nodes */
> +	temp = MM_AND(temp, node_types);
> +
> +	/* add index into node position */
> +	return MM_ADD32(addr, temp);
> +}
> +
> +/*
> + * Process 4 transitions (in 2 SIMD registers) in parallel
> + */
> +static inline xmm_t
> +transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> +	const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
> +{
> +	xmm_t addr;
> +	uint64_t trans0, trans2;
> +
> +	 /* Calculate the address (array index) for all 4 transitions. */
> +
> +	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> +		bytes, type_quad_range, indicies1, indicies2);
> +
> +	 /* Gather 64 bit transitions and pack back into 2 registers. */
> +
> +	trans0 = trans[MM_CVT32(addr)];
> +
> +	/* get slot 2 */
> +
> +	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
> +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
> +	trans2 = trans[MM_CVT32(addr)];
> +
> +	/* get slot 1 */
> +
> +	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
> +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> +	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
> +
> +	/* get slot 3 */
> +
> +	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
> +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
> +	*indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
> +
> +	return MM_SRL32(next_input, 8);
> +}
> +
> +/*
> + * Execute trie traversal with 8 traversals in parallel
> + */
> +static inline int
> +search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +	uint32_t *results, uint32_t total_packets, uint32_t categories)
> +{
> +	int n;
> +	struct acl_flow_data flows;
> +	uint64_t index_array[MAX_SEARCHES_SSE8];
> +	struct completion cmplt[MAX_SEARCHES_SSE8];
> +	struct parms parms[MAX_SEARCHES_SSE8];
> +	xmm_t input0, input1;
> +	xmm_t indicies1, indicies2, indicies3, indicies4;
> +
> +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> +		total_packets, categories, ctx->trans_table);
> +
> +	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
> +		cmplt[n].count = 0;
> +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> +	}
> +
> +	/*
> +	 * indicies1 contains index_array[0,1]
> +	 * indicies2 contains index_array[2,3]
> +	 * indicies3 contains index_array[4,5]
> +	 * indicies4 contains index_array[6,7]
> +	 */
> +
> +	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> +	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> +
> +	indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
> +	indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
> +
> +	 /* Check for any matches. */
> +	acl_match_check_x4(0, ctx, parms, &flows,
> +		&indicies1, &indicies2, mm_match_mask.m);
> +	acl_match_check_x4(4, ctx, parms, &flows,
> +		&indicies3, &indicies4, mm_match_mask.m);
> +
> +	while (flows.started > 0) {
> +
> +		/* Gather 4 bytes of input data for each stream. */
> +		input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
> +			0);
> +		input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
> +			0);
> +
> +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
> +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
> +
> +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
> +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
> +
> +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
> +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
> +
> +		 /* Process the 4 bytes of input on each stream. */
> +
> +		input0 = transition4(mm_index_mask.m, input0,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		input1 = transition4(mm_index_mask.m, input1,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies3, &indicies4);
> +
> +		input0 = transition4(mm_index_mask.m, input0,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		input1 = transition4(mm_index_mask.m, input1,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies3, &indicies4);
> +
> +		input0 = transition4(mm_index_mask.m, input0,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		input1 = transition4(mm_index_mask.m, input1,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies3, &indicies4);
> +
> +		input0 = transition4(mm_index_mask.m, input0,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		input1 = transition4(mm_index_mask.m, input1,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies3, &indicies4);
> +
> +		 /* Check for any matches. */
> +		acl_match_check_x4(0, ctx, parms, &flows,
> +			&indicies1, &indicies2, mm_match_mask.m);
> +		acl_match_check_x4(4, ctx, parms, &flows,
> +			&indicies3, &indicies4, mm_match_mask.m);
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Execute trie traversal with 4 traversals in parallel
> + */
> +static inline int
> +search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +	 uint32_t *results, int total_packets, uint32_t categories)
> +{
> +	int n;
> +	struct acl_flow_data flows;
> +	uint64_t index_array[MAX_SEARCHES_SSE4];
> +	struct completion cmplt[MAX_SEARCHES_SSE4];
> +	struct parms parms[MAX_SEARCHES_SSE4];
> +	xmm_t input, indicies1, indicies2;
> +
> +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> +		total_packets, categories, ctx->trans_table);
> +
> +	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
> +		cmplt[n].count = 0;
> +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> +	}
> +
> +	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> +	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> +
> +	/* Check for any matches. */
> +	acl_match_check_x4(0, ctx, parms, &flows,
> +		&indicies1, &indicies2, mm_match_mask.m);
> +
> +	while (flows.started > 0) {
> +
> +		/* Gather 4 bytes of input data for each stream. */
> +		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
> +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
> +
> +		/* Process the 4 bytes of input on each stream. */
> +		input = transition4(mm_index_mask.m, input,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		 input = transition4(mm_index_mask.m, input,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		 input = transition4(mm_index_mask.m, input,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		 input = transition4(mm_index_mask.m, input,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		/* Check for any matches. */
> +		acl_match_check_x4(0, ctx, parms, &flows,
> +			&indicies1, &indicies2, mm_match_mask.m);
> +	}
> +
> +	return 0;
> +}
> +
> +static inline xmm_t
> +transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> +	const uint64_t *trans, xmm_t *indicies1)
> +{
> +	uint64_t t;
> +	xmm_t addr, indicies2;
> +
> +	indicies2 = MM_XOR(ones_16, ones_16);
> +
> +	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> +		bytes, type_quad_range, indicies1, &indicies2);
> +
> +	/* Gather 64 bit transitions and pack 2 per register. */
> +
> +	t = trans[MM_CVT32(addr)];
> +
> +	/* get slot 1 */
> +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> +	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
> +
> +	return MM_SRL32(next_input, 8);
> +}
> +
> +/*
> + * Execute trie traversal with 2 traversals in parallel.
> + */
> +static inline int
> +search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +	uint32_t *results, uint32_t total_packets, uint32_t categories)
> +{
> +	int n;
> +	struct acl_flow_data flows;
> +	uint64_t index_array[MAX_SEARCHES_SSE2];
> +	struct completion cmplt[MAX_SEARCHES_SSE2];
> +	struct parms parms[MAX_SEARCHES_SSE2];
> +	xmm_t input, indicies;
> +
> +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> +		total_packets, categories, ctx->trans_table);
> +
> +	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
> +		cmplt[n].count = 0;
> +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> +	}
> +
> +	indicies = MM_LOADU((xmm_t *) &index_array[0]);
> +
> +	/* Check for any matches. */
> +	acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
> +
> +	while (flows.started > 0) {
> +
> +		/* Gather 4 bytes of input data for each stream. */
> +		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> +
> +		/* Process the 4 bytes of input on each stream. */
> +
> +		input = transition2(mm_index_mask64.m, input,
> +			mm_shuffle_input64.m, mm_ones_16.m,
> +			mm_bytes64.m, mm_type_quad_range64.m,
> +			flows.trans, &indicies);
> +
> +		input = transition2(mm_index_mask64.m, input,
> +			mm_shuffle_input64.m, mm_ones_16.m,
> +			mm_bytes64.m, mm_type_quad_range64.m,
> +			flows.trans, &indicies);
> +
> +		input = transition2(mm_index_mask64.m, input,
> +			mm_shuffle_input64.m, mm_ones_16.m,
> +			mm_bytes64.m, mm_type_quad_range64.m,
> +			flows.trans, &indicies);
> +
> +		input = transition2(mm_index_mask64.m, input,
> +			mm_shuffle_input64.m, mm_ones_16.m,
> +			mm_bytes64.m, mm_type_quad_range64.m,
> +			flows.trans, &indicies);
> +
> +		/* Check for any matches. */
> +		acl_match_check_x2(0, ctx, parms, &flows, &indicies,
> +			mm_match_mask64.m);
> +	}
> +
> +	return 0;
> +}
> +
> +int
> +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +	uint32_t *results, uint32_t num, uint32_t categories)
> +{
> +	if (categories != 1 &&
> +		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> +		return -EINVAL;
> +
> +	if (likely(num >= MAX_SEARCHES_SSE8))
> +		return search_sse_8(ctx, data, results, num, categories);
> +	else if (num >= MAX_SEARCHES_SSE4)
> +		return search_sse_4(ctx, data, results, num, categories);
> +	else
> +		return search_sse_2(ctx, data, results, num, categories);
> +}
> diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
> index 7c288bd..b9173c1 100644
> --- a/lib/librte_acl/rte_acl.c
> +++ b/lib/librte_acl/rte_acl.c
> @@ -38,6 +38,52 @@
> 
>  TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
> 
> +typedef int (*rte_acl_classify_t)
> +(const struct rte_acl_ctx *, const uint8_t **, uint32_t *, uint32_t, uint32_t);
> +
> +extern int
> +rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +        uint32_t *results, uint32_t num, uint32_t categories);
> +
> +/* by default, use the always-available scalar code path. */
> +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
Why not 'static'?
I thought you'd want to hide it from the external world.
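A minimal sketch of what I have in mind (untested, just to illustrate;
it assumes rte_acl_classify() stays the only public entry point and that
the dispatch moves into rte_acl.c):

static rte_acl_classify_t rte_acl_default_classify =
	rte_acl_classify_scalar;

int
rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories)
{
	/* dispatch through the now file-local function pointer */
	return rte_acl_default_classify(ctx, data, results, num, categories);
}

Since rte_acl_classify() would live in the same translation unit as the
pointer, the pointer no longer needs external linkage (the extern inline
wrapper in the header would have to go, though).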
> +
> +void rte_acl_select_classify(enum acl_classify_alg alg)
> +{
> +
> +	switch(alg)
> +	{
> +		case ACL_CLASSIFY_DEFAULT:
> +		case ACL_CLASSIFY_SCALAR:
> +			rte_acl_default_classify = rte_acl_classify_scalar;
> +			break;
> +		case ACL_CLASSIFY_SSE:
> +			rte_acl_default_classify = rte_acl_classify_sse;
> +			break;
> +	}
> +
> +}
As this is an init-phase function, I suppose we could add a check that alg has a valid (supported) value, and return an error code if it does not.
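Something along these lines, perhaps (sketch only, against rte_acl.c of
this patch; changing the return type to int and the exact error values
are my suggestion, not anything the patch currently does):

int
rte_acl_select_classify(enum acl_classify_alg alg)
{
	switch (alg) {
	case ACL_CLASSIFY_DEFAULT:
	case ACL_CLASSIFY_SCALAR:
		rte_acl_default_classify = rte_acl_classify_scalar;
		return 0;
	case ACL_CLASSIFY_SSE:
		/* refuse the SSE path on CPUs without SSE4.1 */
		if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
			return -ENOTSUP;
		rte_acl_default_classify = rte_acl_classify_sse;
		return 0;
	default:
		/* unknown/unsupported algorithm */
		return -EINVAL;
	}
}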
> +
> +static void __attribute__((constructor))
> +rte_acl_init(void)
> +{
> +	enum acl_classify_alg alg = ACL_CLASSIFY_DEFAULT;
> +
> +	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
> +		alg = ACL_CLASSIFY_SSE;
> +
> +	rte_acl_select_classify(alg);
> +}
> +
> +inline int rte_acl_classify(const struct rte_acl_ctx *ctx,
> +                            const uint8_t **data,
> +                            uint32_t *results, uint32_t num,
> +                            uint32_t categories)
> +{
> +	return rte_acl_default_classify(ctx, data, results, num, categories);
> +}
> +
> +
>  struct rte_acl_ctx *
>  rte_acl_find_existing(const char *name)
>  {
> diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
> index afc0f69..650b306 100644
> --- a/lib/librte_acl/rte_acl.h
> +++ b/lib/librte_acl/rte_acl.h
> @@ -267,6 +267,9 @@ rte_acl_reset(struct rte_acl_ctx *ctx);
>   * RTE_ACL_RESULTS_MULTIPLIER and can't be bigger than RTE_ACL_MAX_CATEGORIES.
>   * If more than one rule is applicable for given input buffer and
>   * given category, then rule with highest priority will be returned as a match.
> + * Note, that this function could be run only on CPUs with SSE4.1 support.
> + * It is up to the caller to make sure that this function is only invoked on
> + * a machine that supports SSE4.1 ISA.
>   * Note, that it is a caller responsibility to ensure that input parameters
>   * are valid and point to correct memory locations.
>   *
> @@ -286,9 +289,10 @@ rte_acl_reset(struct rte_acl_ctx *ctx);
>   * @return
>   *   zero on successful completion.
>   *   -EINVAL for incorrect arguments.
> + *   -ENOTSUP for unsupported platforms.
Please remove the line above: the current implementation doesn't return -ENOTSUP
(I think that was left over from v1).
>   */
>  int
> -rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
>  	uint32_t *results, uint32_t num, uint32_t categories);
> 
>  /**
> @@ -323,9 +327,23 @@ rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
>   *   zero on successful completion.
>   *   -EINVAL for incorrect arguments.
>   */
> -int
> -rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
> -	uint32_t *results, uint32_t num, uint32_t categories);
As I said above, we'd better keep it.
> +
> +enum acl_classify_alg {
> +	ACL_CLASSIFY_DEFAULT = 0,
> +	ACL_CLASSIFY_SCALAR = 1,
> +	ACL_CLASSIFY_SSE = 2,
> +};
As a nit: as this enum is part of the public API, I think it is better to add the rte_ prefix: enum rte_acl_classify_alg
> +
> +extern inline int rte_acl_classify(const struct rte_acl_ctx *ctx,
> +				   const uint8_t **data,
> +				   uint32_t *results, uint32_t num,
> +				   uint32_t categories);
Again as a nit: here and everywhere, can we keep the same style throughout the whole DPDK - return type on one line, function name starting the next:
extern int
rte_acl_classify(...);
> +/**
> + * Analyze the ISA of the current CPU and point rte_acl_default_classify
> + * to the highest applicable version of the classify function.
> + */
> +extern void
> +rte_acl_select_classify(enum acl_classify_alg alg);
> 
>  /**
>   * Dump an ACL context structure to the console.
> --
> 1.9.3
^ permalink raw reply	[relevance 0%]
* [dpdk-dev] [PATCHv3] librte_acl make it build/work for 'default' target
    2014-08-07 20:11  4% ` Neil Horman
@ 2014-08-21 20:15  1% ` Neil Horman
  2014-08-25 16:30  0%   ` Ananyev, Konstantin
  2014-08-28 20:38  1% ` [dpdk-dev] [PATCHv4] " Neil Horman
  2 siblings, 1 reply; 86+ results
From: Neil Horman @ 2014-08-21 20:15 UTC (permalink / raw)
  To: dev
Make the ACL library build/work on the 'default' architecture:
- make rte_acl_classify_scalar really scalar
 (make sure it doesn't use SSE4 intrinsics through resolve_priority()).
- Provide two versions of the rte_acl_classify code path:
  rte_acl_classify_sse() - can be built and used only on systems with SSE4.2
  and above; returns -ENOTSUP on lower archs.
  rte_acl_classify_scalar() - a slower version, but one that can be built and
  used on all systems.
- keep common code shared between these two code paths.
v2 changes:
 run-time selection of the most appropriate code path for the given ISA.
 By default the highest supported one is selected.
 The user can still override that selection by manually assigning a new value
 to the global function pointer rte_acl_default_classify.
 rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
 points to.
v3 changes:
 Updated the classify pointer to be a function so as to better preserve ABI.
 Removed macro definitions for the match check functions to make them static
 inline.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
---
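With this patch applied, application usage might look like the following
minimal sketch (acx, data, results, n and categories initialized as in the
test-acl changes below):

	/* optionally force the scalar code path; otherwise the
	 * library constructor picks the best one the CPU supports */
	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);

	ret = rte_acl_classify(acx, data, results, n, categories);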
 app/test-acl/main.c              |  13 +-
 app/test/test_acl.c              |  12 +-
 lib/librte_acl/Makefile          |   5 +-
 lib/librte_acl/acl_bld.c         |   5 +-
 lib/librte_acl/acl_match_check.h |  83 ++++
 lib/librte_acl/acl_run.c         | 944 ---------------------------------------
 lib/librte_acl/acl_run.h         | 220 +++++++++
 lib/librte_acl/acl_run_scalar.c  | 198 ++++++++
 lib/librte_acl/acl_run_sse.c     | 627 ++++++++++++++++++++++++++
 lib/librte_acl/rte_acl.c         |  46 ++
 lib/librte_acl/rte_acl.h         |  26 +-
 11 files changed, 1216 insertions(+), 963 deletions(-)
 create mode 100644 lib/librte_acl/acl_match_check.h
 delete mode 100644 lib/librte_acl/acl_run.c
 create mode 100644 lib/librte_acl/acl_run.h
 create mode 100644 lib/librte_acl/acl_run_scalar.c
 create mode 100644 lib/librte_acl/acl_run_sse.c
diff --git a/app/test-acl/main.c b/app/test-acl/main.c
index d654409..a77f47d 100644
--- a/app/test-acl/main.c
+++ b/app/test-acl/main.c
@@ -787,6 +787,10 @@ acx_init(void)
 	/* perform build. */
 	ret = rte_acl_build(config.acx, &cfg);
 
+	/* setup default rte_acl_classify */
+	if (config.scalar)
+		rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
+
 	dump_verbose(DUMP_NONE, stdout,
 		"rte_acl_build(%u) finished with %d\n",
 		config.bld_categories, ret);
@@ -815,13 +819,8 @@ search_ip5tuples_once(uint32_t categories, uint32_t step, int scalar)
 			v += config.trace_sz;
 		}
 
-		if (scalar != 0)
-			ret = rte_acl_classify_scalar(config.acx, data,
-				results, n, categories);
-
-		else
-			ret = rte_acl_classify(config.acx, data,
-				results, n, categories);
+		ret = rte_acl_classify(config.acx, data, results,
+			n, categories);
 
 		if (ret != 0)
 			rte_exit(ret, "classify for ipv%c_5tuples returns %d\n",
diff --git a/app/test/test_acl.c b/app/test/test_acl.c
index 869f6d3..2fcef6e 100644
--- a/app/test/test_acl.c
+++ b/app/test/test_acl.c
@@ -148,7 +148,8 @@ test_classify_run(struct rte_acl_ctx *acx)
 	}
 
 	/* make a quick check for scalar */
-	ret = rte_acl_classify_scalar(acx, data, results,
+	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
+	ret = rte_acl_classify(acx, data, results,
 			RTE_DIM(acl_test_data), RTE_ACL_MAX_CATEGORIES);
 	if (ret != 0) {
 		printf("Line %i: SSE classify failed!\n", __LINE__);
@@ -362,7 +363,8 @@ test_invalid_layout(void)
 	}
 
 	/* classify tuples (scalar) */
-	ret = rte_acl_classify_scalar(acx, data, results,
+	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
+	ret = rte_acl_classify(acx, data, results,
 			RTE_DIM(results), 1);
 	if (ret != 0) {
 		printf("Line %i: Scalar classify failed!\n", __LINE__);
@@ -850,7 +852,8 @@ test_invalid_parameters(void)
 	/* scalar classify test */
 
 	/* cover zero categories in classify (should not fail) */
-	result = rte_acl_classify_scalar(acx, NULL, NULL, 0, 0);
+	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
+	result = rte_acl_classify(acx, NULL, NULL, 0, 0);
 	if (result != 0) {
 		printf("Line %i: Scalar classify with zero categories "
 				"failed!\n", __LINE__);
@@ -859,7 +862,8 @@ test_invalid_parameters(void)
 	}
 
 	/* cover invalid but positive categories in classify */
-	result = rte_acl_classify_scalar(acx, NULL, NULL, 0, 3);
+	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);
+	result = rte_acl_classify(acx, NULL, NULL, 0, 3);
 	if (result == 0) {
 		printf("Line %i: Scalar classify with 3 categories "
 				"should have failed!\n", __LINE__);
diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
index 4fe4593..65e566d 100644
--- a/lib/librte_acl/Makefile
+++ b/lib/librte_acl/Makefile
@@ -43,7 +43,10 @@ SRCS-$(CONFIG_RTE_LIBRTE_ACL) += tb_mem.c
 SRCS-$(CONFIG_RTE_LIBRTE_ACL) += rte_acl.c
 SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_bld.c
 SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_gen.c
-SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run.c
+SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_scalar.c
+SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_sse.c
+
+CFLAGS_acl_run_sse.o += -msse4.1
 
 # install this header file
 SYMLINK-$(CONFIG_RTE_LIBRTE_ACL)-include := rte_acl_osdep.h
diff --git a/lib/librte_acl/acl_bld.c b/lib/librte_acl/acl_bld.c
index 873447b..09d58ea 100644
--- a/lib/librte_acl/acl_bld.c
+++ b/lib/librte_acl/acl_bld.c
@@ -31,7 +31,6 @@
  *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
 
-#include <nmmintrin.h>
 #include <rte_acl.h>
 #include "tb_mem.h"
 #include "acl.h"
@@ -1480,8 +1479,8 @@ acl_calc_wildness(struct rte_acl_build_rule *head,
 
 			switch (rule->config->defs[n].type) {
 			case RTE_ACL_FIELD_TYPE_BITMASK:
-				wild = (size -
-					_mm_popcnt_u32(fld->mask_range.u8)) /
+				wild = (size - __builtin_popcount(
+					fld->mask_range.u8)) /
 					size;
 				break;
 
diff --git a/lib/librte_acl/acl_match_check.h b/lib/librte_acl/acl_match_check.h
new file mode 100644
index 0000000..4dc1982
--- /dev/null
+++ b/lib/librte_acl/acl_match_check.h
@@ -0,0 +1,83 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _ACL_MATCH_CHECK_H_
+#define _ACL_MATCH_CHECK_H_
+
+/*
+ * Detect matches. If a match node transition is found, then this trie
+ * traversal is complete and the slot is filled with the next trie
+ * to be processed.
+ */
+static inline uint64_t
+acl_match_check(uint64_t transition, int slot,
+	const struct rte_acl_ctx *ctx, struct parms *parms,
+	struct acl_flow_data *flows, void (*resolve_priority)(
+	uint64_t transition, int n, const struct rte_acl_ctx *ctx,
+	struct parms *parms, const struct rte_acl_match_results *p,
+	uint32_t categories))
+{
+	const struct rte_acl_match_results *p;
+
+	p = (const struct rte_acl_match_results *)
+		(flows->trans + ctx->match_index);
+
+	if (transition & RTE_ACL_NODE_MATCH) {
+
+		/* Remove flags from index and decrement active traversals */
+		transition &= RTE_ACL_NODE_INDEX;
+		flows->started--;
+
+		/* Resolve priorities for this trie and running results */
+		if (flows->categories == 1)
+			resolve_single_priority(transition, slot, ctx,
+				parms, p);
+		else
+			resolve_priority(transition, slot, ctx, parms,
+				p, flows->categories);
+
+		/* Count down completed tries for this search request */
+		parms[slot].cmplt->count--;
+
+		/* Fill the slot with the next trie or idle trie */
+		transition = acl_start_next_trie(flows, parms, slot, ctx);
+
+	} else if (transition == ctx->idle) {
+		/* reset indirection table for idle slots */
+		parms[slot].data_index = idle;
+	}
+
+	return transition;
+}
+
+#endif
diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
deleted file mode 100644
index e3d9fc1..0000000
--- a/lib/librte_acl/acl_run.c
+++ /dev/null
@@ -1,944 +0,0 @@
-/*-
- *   BSD LICENSE
- *
- *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
- *   All rights reserved.
- *
- *   Redistribution and use in source and binary forms, with or without
- *   modification, are permitted provided that the following conditions
- *   are met:
- *
- *     * Redistributions of source code must retain the above copyright
- *       notice, this list of conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright
- *       notice, this list of conditions and the following disclaimer in
- *       the documentation and/or other materials provided with the
- *       distribution.
- *     * Neither the name of Intel Corporation nor the names of its
- *       contributors may be used to endorse or promote products derived
- *       from this software without specific prior written permission.
- *
- *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
- *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
- *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
- *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
- *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
- *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
- *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
- *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
- *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- */
-
-#include <rte_acl.h>
-#include "acl_vect.h"
-#include "acl.h"
-
-#define MAX_SEARCHES_SSE8	8
-#define MAX_SEARCHES_SSE4	4
-#define MAX_SEARCHES_SSE2	2
-#define MAX_SEARCHES_SCALAR	2
-
-#define GET_NEXT_4BYTES(prm, idx)	\
-	(*((const int32_t *)((prm)[(idx)].data + *(prm)[idx].data_index++)))
-
-
-#define RTE_ACL_NODE_INDEX	((uint32_t)~RTE_ACL_NODE_TYPE)
-
-#define	SCALAR_QRANGE_MULT	0x01010101
-#define	SCALAR_QRANGE_MASK	0x7f7f7f7f
-#define	SCALAR_QRANGE_MIN	0x80808080
-
-enum {
-	SHUFFLE32_SLOT1 = 0xe5,
-	SHUFFLE32_SLOT2 = 0xe6,
-	SHUFFLE32_SLOT3 = 0xe7,
-	SHUFFLE32_SWAP64 = 0x4e,
-};
-
-/*
- * Structure to manage N parallel trie traversals.
- * The runtime trie traversal routines can process 8, 4, or 2 tries
- * in parallel. Each packet may require multiple trie traversals (up to 4).
- * This structure is used to fill the slots (0 to n-1) for parallel processing
- * with the trie traversals needed for each packet.
- */
-struct acl_flow_data {
-	uint32_t            num_packets;
-	/* number of packets processed */
-	uint32_t            started;
-	/* number of trie traversals in progress */
-	uint32_t            trie;
-	/* current trie index (0 to N-1) */
-	uint32_t            cmplt_size;
-	uint32_t            total_packets;
-	uint32_t            categories;
-	/* number of result categories per packet. */
-	/* maximum number of packets to process */
-	const uint64_t     *trans;
-	const uint8_t     **data;
-	uint32_t           *results;
-	struct completion  *last_cmplt;
-	struct completion  *cmplt_array;
-};
-
-/*
- * Structure to maintain running results for
- * a single packet (up to 4 tries).
- */
-struct completion {
-	uint32_t *results;                          /* running results. */
-	int32_t   priority[RTE_ACL_MAX_CATEGORIES]; /* running priorities. */
-	uint32_t  count;                            /* num of remaining tries */
-	/* true for allocated struct */
-} __attribute__((aligned(XMM_SIZE)));
-
-/*
- * One parms structure for each slot in the search engine.
- */
-struct parms {
-	const uint8_t              *data;
-	/* input data for this packet */
-	const uint32_t             *data_index;
-	/* data indirection for this trie */
-	struct completion          *cmplt;
-	/* completion data for this packet */
-};
-
-/*
- * Define an global idle node for unused engine slots
- */
-static const uint32_t idle[UINT8_MAX + 1];
-
-static const rte_xmm_t mm_type_quad_range = {
-	.u32 = {
-		RTE_ACL_NODE_QRANGE,
-		RTE_ACL_NODE_QRANGE,
-		RTE_ACL_NODE_QRANGE,
-		RTE_ACL_NODE_QRANGE,
-	},
-};
-
-static const rte_xmm_t mm_type_quad_range64 = {
-	.u32 = {
-		RTE_ACL_NODE_QRANGE,
-		RTE_ACL_NODE_QRANGE,
-		0,
-		0,
-	},
-};
-
-static const rte_xmm_t mm_shuffle_input = {
-	.u32 = {0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c},
-};
-
-static const rte_xmm_t mm_shuffle_input64 = {
-	.u32 = {0x00000000, 0x04040404, 0x80808080, 0x80808080},
-};
-
-static const rte_xmm_t mm_ones_16 = {
-	.u16 = {1, 1, 1, 1, 1, 1, 1, 1},
-};
-
-static const rte_xmm_t mm_bytes = {
-	.u32 = {UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX},
-};
-
-static const rte_xmm_t mm_bytes64 = {
-	.u32 = {UINT8_MAX, UINT8_MAX, 0, 0},
-};
-
-static const rte_xmm_t mm_match_mask = {
-	.u32 = {
-		RTE_ACL_NODE_MATCH,
-		RTE_ACL_NODE_MATCH,
-		RTE_ACL_NODE_MATCH,
-		RTE_ACL_NODE_MATCH,
-	},
-};
-
-static const rte_xmm_t mm_match_mask64 = {
-	.u32 = {
-		RTE_ACL_NODE_MATCH,
-		0,
-		RTE_ACL_NODE_MATCH,
-		0,
-	},
-};
-
-static const rte_xmm_t mm_index_mask = {
-	.u32 = {
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-	},
-};
-
-static const rte_xmm_t mm_index_mask64 = {
-	.u32 = {
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-		0,
-		0,
-	},
-};
-
-/*
- * Allocate a completion structure to manage the tries for a packet.
- */
-static inline struct completion *
-alloc_completion(struct completion *p, uint32_t size, uint32_t tries,
-	uint32_t *results)
-{
-	uint32_t n;
-
-	for (n = 0; n < size; n++) {
-
-		if (p[n].count == 0) {
-
-			/* mark as allocated and set number of tries. */
-			p[n].count = tries;
-			p[n].results = results;
-			return &(p[n]);
-		}
-	}
-
-	/* should never get here */
-	return NULL;
-}
-
-/*
- * Resolve priority for a single result trie.
- */
-static inline void
-resolve_single_priority(uint64_t transition, int n,
-	const struct rte_acl_ctx *ctx, struct parms *parms,
-	const struct rte_acl_match_results *p)
-{
-	if (parms[n].cmplt->count == ctx->num_tries ||
-			parms[n].cmplt->priority[0] <=
-			p[transition].priority[0]) {
-
-		parms[n].cmplt->priority[0] = p[transition].priority[0];
-		parms[n].cmplt->results[0] = p[transition].results[0];
-	}
-
-	parms[n].cmplt->count--;
-}
-
-/*
- * Resolve priority for multiple results. This consists comparing
- * the priority of the current traversal with the running set of
- * results for the packet. For each result, keep a running array of
- * the result (rule number) and its priority for each category.
- */
-static inline void
-resolve_priority(uint64_t transition, int n, const struct rte_acl_ctx *ctx,
-	struct parms *parms, const struct rte_acl_match_results *p,
-	uint32_t categories)
-{
-	uint32_t x;
-	xmm_t results, priority, results1, priority1, selector;
-	xmm_t *saved_results, *saved_priority;
-
-	for (x = 0; x < categories; x += RTE_ACL_RESULTS_MULTIPLIER) {
-
-		saved_results = (xmm_t *)(&parms[n].cmplt->results[x]);
-		saved_priority =
-			(xmm_t *)(&parms[n].cmplt->priority[x]);
-
-		/* get results and priorities for completed trie */
-		results = MM_LOADU((const xmm_t *)&p[transition].results[x]);
-		priority = MM_LOADU((const xmm_t *)&p[transition].priority[x]);
-
-		/* if this is not the first completed trie */
-		if (parms[n].cmplt->count != ctx->num_tries) {
-
-			/* get running best results and their priorities */
-			results1 = MM_LOADU(saved_results);
-			priority1 = MM_LOADU(saved_priority);
-
-			/* select results that are highest priority */
-			selector = MM_CMPGT32(priority1, priority);
-			results = MM_BLENDV8(results, results1, selector);
-			priority = MM_BLENDV8(priority, priority1, selector);
-		}
-
-		/* save running best results and their priorities */
-		MM_STOREU(saved_results, results);
-		MM_STOREU(saved_priority, priority);
-	}
-
-	/* Count down completed tries for this search request */
-	parms[n].cmplt->count--;
-}
-
-/*
- * Routine to fill a slot in the parallel trie traversal array (parms) from
- * the list of packets (flows).
- */
-static inline uint64_t
-acl_start_next_trie(struct acl_flow_data *flows, struct parms *parms, int n,
-	const struct rte_acl_ctx *ctx)
-{
-	uint64_t transition;
-
-	/* if there are any more packets to process */
-	if (flows->num_packets < flows->total_packets) {
-		parms[n].data = flows->data[flows->num_packets];
-		parms[n].data_index = ctx->trie[flows->trie].data_index;
-
-		/* if this is the first trie for this packet */
-		if (flows->trie == 0) {
-			flows->last_cmplt = alloc_completion(flows->cmplt_array,
-				flows->cmplt_size, ctx->num_tries,
-				flows->results +
-				flows->num_packets * flows->categories);
-		}
-
-		/* set completion parameters and starting index for this slot */
-		parms[n].cmplt = flows->last_cmplt;
-		transition =
-			flows->trans[parms[n].data[*parms[n].data_index++] +
-			ctx->trie[flows->trie].root_index];
-
-		/*
-		 * if this is the last trie for this packet,
-		 * then setup next packet.
-		 */
-		flows->trie++;
-		if (flows->trie >= ctx->num_tries) {
-			flows->trie = 0;
-			flows->num_packets++;
-		}
-
-		/* keep track of number of active trie traversals */
-		flows->started++;
-
-	/* no more tries to process, set slot to an idle position */
-	} else {
-		transition = ctx->idle;
-		parms[n].data = (const uint8_t *)idle;
-		parms[n].data_index = idle;
-	}
-	return transition;
-}
-
-/*
- * Detect matches. If a match node transition is found, then this trie
- * traversal is complete and fill the slot with the next trie
- * to be processed.
- */
-static inline uint64_t
-acl_match_check_transition(uint64_t transition, int slot,
-	const struct rte_acl_ctx *ctx, struct parms *parms,
-	struct acl_flow_data *flows)
-{
-	const struct rte_acl_match_results *p;
-
-	p = (const struct rte_acl_match_results *)
-		(flows->trans + ctx->match_index);
-
-	if (transition & RTE_ACL_NODE_MATCH) {
-
-		/* Remove flags from index and decrement active traversals */
-		transition &= RTE_ACL_NODE_INDEX;
-		flows->started--;
-
-		/* Resolve priorities for this trie and running results */
-		if (flows->categories == 1)
-			resolve_single_priority(transition, slot, ctx,
-				parms, p);
-		else
-			resolve_priority(transition, slot, ctx, parms, p,
-				flows->categories);
-
-		/* Fill the slot with the next trie or idle trie */
-		transition = acl_start_next_trie(flows, parms, slot, ctx);
-
-	} else if (transition == ctx->idle) {
-		/* reset indirection table for idle slots */
-		parms[slot].data_index = idle;
-	}
-
-	return transition;
-}
-
-/*
- * Extract transitions from an XMM register and check for any matches
- */
-static void
-acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
-	struct parms *parms, struct acl_flow_data *flows)
-{
-	uint64_t transition1, transition2;
-
-	/* extract transition from low 64 bits. */
-	transition1 = MM_CVT64(*indicies);
-
-	/* extract transition from high 64 bits. */
-	*indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
-	transition2 = MM_CVT64(*indicies);
-
-	transition1 = acl_match_check_transition(transition1, slot, ctx,
-		parms, flows);
-	transition2 = acl_match_check_transition(transition2, slot + 1, ctx,
-		parms, flows);
-
-	/* update indicies with new transitions. */
-	*indicies = MM_SET64(transition2, transition1);
-}
-
-/*
- * Check for a match in 2 transitions (contained in SSE register)
- */
-static inline void
-acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
-	struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
-{
-	xmm_t temp;
-
-	temp = MM_AND(match_mask, *indicies);
-	while (!MM_TESTZ(temp, temp)) {
-		acl_process_matches(indicies, slot, ctx, parms, flows);
-		temp = MM_AND(match_mask, *indicies);
-	}
-}
-
-/*
- * Check for any match in 4 transitions (contained in 2 SSE registers)
- */
-static inline void
-acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
-	struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
-	xmm_t match_mask)
-{
-	xmm_t temp;
-
-	/* put low 32 bits of each transition into one register */
-	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
-		0x88);
-	/* test for match node */
-	temp = MM_AND(match_mask, temp);
-
-	while (!MM_TESTZ(temp, temp)) {
-		acl_process_matches(indicies1, slot, ctx, parms, flows);
-		acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
-
-		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
-					(__m128)*indicies2,
-					0x88);
-		temp = MM_AND(match_mask, temp);
-	}
-}
-
-/*
- * Calculate the address of the next transition for
- * all types of nodes. Note that only DFA nodes and range
- * nodes actually transition to another node. Match
- * nodes don't move.
- */
-static inline xmm_t
-acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
-	xmm_t *indicies1, xmm_t *indicies2)
-{
-	xmm_t addr, node_types, temp;
-
-	/*
-	 * Note that no transition is done for a match
-	 * node and therefore a stream freezes when
-	 * it reaches a match.
-	 */
-
-	/* Shuffle low 32 into temp and high 32 into indicies2 */
-	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
-		0x88);
-	*indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
-		(__m128)*indicies2, 0xdd);
-
-	/* Calc node type and node addr */
-	node_types = MM_ANDNOT(index_mask, temp);
-	addr = MM_AND(index_mask, temp);
-
-	/*
-	 * Calc addr for DFAs - addr = dfa_index + input_byte
-	 */
-
-	/* mask for DFA type (0) nodes */
-	temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
-
-	/* add input byte to DFA position */
-	temp = MM_AND(temp, bytes);
-	temp = MM_AND(temp, next_input);
-	addr = MM_ADD32(addr, temp);
-
-	/*
-	 * Calc addr for Range nodes -> range_index + range(input)
-	 */
-	node_types = MM_CMPEQ32(node_types, type_quad_range);
-
-	/*
-	 * Calculate number of range boundaries that are less than the
-	 * input value. Range boundaries for each node are in signed 8 bit,
-	 * ordered from -128 to 127 in the indicies2 register.
-	 * This is effectively a popcnt of bytes that are greater than the
-	 * input byte.
-	 */
-
-	/* shuffle input byte to all 4 positions of 32 bit value */
-	temp = MM_SHUFFLE8(next_input, shuffle_input);
-
-	/* check ranges */
-	temp = MM_CMPGT8(temp, *indicies2);
-
-	/* convert -1 to 1 (bytes greater than input byte */
-	temp = MM_SIGN8(temp, temp);
-
-	/* horizontal add pairs of bytes into words */
-	temp = MM_MADD8(temp, temp);
-
-	/* horizontal add pairs of words into dwords */
-	temp = MM_MADD16(temp, ones_16);
-
-	/* mask to range type nodes */
-	temp = MM_AND(temp, node_types);
-
-	/* add index into node position */
-	return MM_ADD32(addr, temp);
-}
-
-/*
- * Process 4 transitions (in 2 SIMD registers) in parallel
- */
-static inline xmm_t
-transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
-	const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
-{
-	xmm_t addr;
-	uint64_t trans0, trans2;
-
-	 /* Calculate the address (array index) for all 4 transitions. */
-
-	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
-		bytes, type_quad_range, indicies1, indicies2);
-
-	 /* Gather 64 bit transitions and pack back into 2 registers. */
-
-	trans0 = trans[MM_CVT32(addr)];
-
-	/* get slot 2 */
-
-	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
-	trans2 = trans[MM_CVT32(addr)];
-
-	/* get slot 1 */
-
-	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
-	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
-
-	/* get slot 3 */
-
-	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
-	*indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
-
-	return MM_SRL32(next_input, 8);
-}
-
-static inline void
-acl_set_flow(struct acl_flow_data *flows, struct completion *cmplt,
-	uint32_t cmplt_size, const uint8_t **data, uint32_t *results,
-	uint32_t data_num, uint32_t categories, const uint64_t *trans)
-{
-	flows->num_packets = 0;
-	flows->started = 0;
-	flows->trie = 0;
-	flows->last_cmplt = NULL;
-	flows->cmplt_array = cmplt;
-	flows->total_packets = data_num;
-	flows->categories = categories;
-	flows->cmplt_size = cmplt_size;
-	flows->data = data;
-	flows->results = results;
-	flows->trans = trans;
-}
-
-/*
- * Execute trie traversal with 8 traversals in parallel
- */
-static inline void
-search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t total_packets, uint32_t categories)
-{
-	int n;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SSE8];
-	struct completion cmplt[MAX_SEARCHES_SSE8];
-	struct parms parms[MAX_SEARCHES_SSE8];
-	xmm_t input0, input1;
-	xmm_t indicies1, indicies2, indicies3, indicies4;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
-		total_packets, categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	/*
-	 * indicies1 contains index_array[0,1]
-	 * indicies2 contains index_array[2,3]
-	 * indicies3 contains index_array[4,5]
-	 * indicies4 contains index_array[6,7]
-	 */
-
-	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
-	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
-
-	indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
-	indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
-
-	 /* Check for any matches. */
-	acl_match_check_x4(0, ctx, parms, &flows,
-		&indicies1, &indicies2, mm_match_mask.m);
-	acl_match_check_x4(4, ctx, parms, &flows,
-		&indicies3, &indicies4, mm_match_mask.m);
-
-	while (flows.started > 0) {
-
-		/* Gather 4 bytes of input data for each stream. */
-		input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
-			0);
-		input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
-			0);
-
-		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
-		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
-
-		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
-		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
-
-		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
-		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
-
-		 /* Process the 4 bytes of input on each stream. */
-
-		input0 = transition4(mm_index_mask.m, input0,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		input1 = transition4(mm_index_mask.m, input1,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies3, &indicies4);
-
-		input0 = transition4(mm_index_mask.m, input0,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		input1 = transition4(mm_index_mask.m, input1,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies3, &indicies4);
-
-		input0 = transition4(mm_index_mask.m, input0,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		input1 = transition4(mm_index_mask.m, input1,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies3, &indicies4);
-
-		input0 = transition4(mm_index_mask.m, input0,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		input1 = transition4(mm_index_mask.m, input1,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies3, &indicies4);
-
-		 /* Check for any matches. */
-		acl_match_check_x4(0, ctx, parms, &flows,
-			&indicies1, &indicies2, mm_match_mask.m);
-		acl_match_check_x4(4, ctx, parms, &flows,
-			&indicies3, &indicies4, mm_match_mask.m);
-	}
-}
-
-/*
- * Execute trie traversal with 4 traversals in parallel
- */
-static inline void
-search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	 uint32_t *results, int total_packets, uint32_t categories)
-{
-	int n;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SSE4];
-	struct completion cmplt[MAX_SEARCHES_SSE4];
-	struct parms parms[MAX_SEARCHES_SSE4];
-	xmm_t input, indicies1, indicies2;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
-		total_packets, categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
-	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
-
-	/* Check for any matches. */
-	acl_match_check_x4(0, ctx, parms, &flows,
-		&indicies1, &indicies2, mm_match_mask.m);
-
-	while (flows.started > 0) {
-
-		/* Gather 4 bytes of input data for each stream. */
-		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
-
-		/* Process the 4 bytes of input on each stream. */
-		input = transition4(mm_index_mask.m, input,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		 input = transition4(mm_index_mask.m, input,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		 input = transition4(mm_index_mask.m, input,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		 input = transition4(mm_index_mask.m, input,
-			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
-			flows.trans, &indicies1, &indicies2);
-
-		/* Check for any matches. */
-		acl_match_check_x4(0, ctx, parms, &flows,
-			&indicies1, &indicies2, mm_match_mask.m);
-	}
-}
-
-static inline xmm_t
-transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
-	const uint64_t *trans, xmm_t *indicies1)
-{
-	uint64_t t;
-	xmm_t addr, indicies2;
-
-	indicies2 = MM_XOR(ones_16, ones_16);
-
-	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
-		bytes, type_quad_range, indicies1, &indicies2);
-
-	/* Gather 64 bit transitions and pack 2 per register. */
-
-	t = trans[MM_CVT32(addr)];
-
-	/* get slot 1 */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
-	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
-
-	return MM_SRL32(next_input, 8);
-}
-
-/*
- * Execute trie traversal with 2 traversals in parallel.
- */
-static inline void
-search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t total_packets, uint32_t categories)
-{
-	int n;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SSE2];
-	struct completion cmplt[MAX_SEARCHES_SSE2];
-	struct parms parms[MAX_SEARCHES_SSE2];
-	xmm_t input, indicies;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
-		total_packets, categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	indicies = MM_LOADU((xmm_t *) &index_array[0]);
-
-	/* Check for any matches. */
-	acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
-
-	while (flows.started > 0) {
-
-		/* Gather 4 bytes of input data for each stream. */
-		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
-
-		/* Process the 4 bytes of input on each stream. */
-
-		input = transition2(mm_index_mask64.m, input,
-			mm_shuffle_input64.m, mm_ones_16.m,
-			mm_bytes64.m, mm_type_quad_range64.m,
-			flows.trans, &indicies);
-
-		input = transition2(mm_index_mask64.m, input,
-			mm_shuffle_input64.m, mm_ones_16.m,
-			mm_bytes64.m, mm_type_quad_range64.m,
-			flows.trans, &indicies);
-
-		input = transition2(mm_index_mask64.m, input,
-			mm_shuffle_input64.m, mm_ones_16.m,
-			mm_bytes64.m, mm_type_quad_range64.m,
-			flows.trans, &indicies);
-
-		input = transition2(mm_index_mask64.m, input,
-			mm_shuffle_input64.m, mm_ones_16.m,
-			mm_bytes64.m, mm_type_quad_range64.m,
-			flows.trans, &indicies);
-
-		/* Check for any matches. */
-		acl_match_check_x2(0, ctx, parms, &flows, &indicies,
-			mm_match_mask64.m);
-	}
-}
-
-/*
- * When processing the transition, rather than using if/else
- * construct, the offset is calculated for DFA and QRANGE and
- * then conditionally added to the address based on node type.
- * This is done to avoid branch mis-predictions. Since the
- * offset is rather simple calculation it is more efficient
- * to do the calculation and do a condition move rather than
- * a conditional branch to determine which calculation to do.
- */
-static inline uint32_t
-scan_forward(uint32_t input, uint32_t max)
-{
-	return (input == 0) ? max : rte_bsf32(input);
-}
-
-static inline uint64_t
-scalar_transition(const uint64_t *trans_table, uint64_t transition,
-	uint8_t input)
-{
-	uint32_t addr, index, ranges, x, a, b, c;
-
-	/* break transition into component parts */
-	ranges = transition >> (sizeof(index) * CHAR_BIT);
-
-	/* calc address for a QRANGE node */
-	c = input * SCALAR_QRANGE_MULT;
-	a = ranges | SCALAR_QRANGE_MIN;
-	index = transition & ~RTE_ACL_NODE_INDEX;
-	a -= (c & SCALAR_QRANGE_MASK);
-	b = c & SCALAR_QRANGE_MIN;
-	addr = transition ^ index;
-	a &= SCALAR_QRANGE_MIN;
-	a ^= (ranges ^ b) & (a ^ b);
-	x = scan_forward(a, 32) >> 3;
-	addr += (index == RTE_ACL_NODE_DFA) ? input : x;
-
-	/* pickup next transition */
-	transition = *(trans_table + addr);
-	return transition;
-}
-
-int
-rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t num, uint32_t categories)
-{
-	int n;
-	uint64_t transition0, transition1;
-	uint32_t input0, input1;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SCALAR];
-	struct completion cmplt[MAX_SEARCHES_SCALAR];
-	struct parms parms[MAX_SEARCHES_SCALAR];
-
-	if (categories != 1 &&
-		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
-		return -EINVAL;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, num,
-		categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SCALAR; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	transition0 = index_array[0];
-	transition1 = index_array[1];
-
-	while (flows.started > 0) {
-
-		input0 = GET_NEXT_4BYTES(parms, 0);
-		input1 = GET_NEXT_4BYTES(parms, 1);
-
-		for (n = 0; n < 4; n++) {
-			if (likely((transition0 & RTE_ACL_NODE_MATCH) == 0))
-				transition0 = scalar_transition(flows.trans,
-					transition0, (uint8_t)input0);
-
-			input0 >>= CHAR_BIT;
-
-			if (likely((transition1 & RTE_ACL_NODE_MATCH) == 0))
-				transition1 = scalar_transition(flows.trans,
-					transition1, (uint8_t)input1);
-
-			input1 >>= CHAR_BIT;
-
-		}
-		if ((transition0 | transition1) & RTE_ACL_NODE_MATCH) {
-			transition0 = acl_match_check_transition(transition0,
-				0, ctx, parms, &flows);
-			transition1 = acl_match_check_transition(transition1,
-				1, ctx, parms, &flows);
-
-		}
-	}
-	return 0;
-}
-
-int
-rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t num, uint32_t categories)
-{
-	if (categories != 1 &&
-		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
-		return -EINVAL;
-
-	if (likely(num >= MAX_SEARCHES_SSE8))
-		search_sse_8(ctx, data, results, num, categories);
-	else if (num >= MAX_SEARCHES_SSE4)
-		search_sse_4(ctx, data, results, num, categories);
-	else
-		search_sse_2(ctx, data, results, num, categories);
-
-	return 0;
-}
diff --git a/lib/librte_acl/acl_run.h b/lib/librte_acl/acl_run.h
new file mode 100644
index 0000000..c39650e
--- /dev/null
+++ b/lib/librte_acl/acl_run.h
@@ -0,0 +1,220 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef	_ACL_RUN_H_
+#define	_ACL_RUN_H_
+
+#include <rte_acl.h>
+#include "acl_vect.h"
+#include "acl.h"
+
+#define MAX_SEARCHES_SSE8	8
+#define MAX_SEARCHES_SSE4	4
+#define MAX_SEARCHES_SSE2	2
+#define MAX_SEARCHES_SCALAR	2
+
+#define GET_NEXT_4BYTES(prm, idx)	\
+	(*((const int32_t *)((prm)[(idx)].data + *(prm)[idx].data_index++)))
+
+
+#define RTE_ACL_NODE_INDEX	((uint32_t)~RTE_ACL_NODE_TYPE)
+
+#define	SCALAR_QRANGE_MULT	0x01010101
+#define	SCALAR_QRANGE_MASK	0x7f7f7f7f
+#define	SCALAR_QRANGE_MIN	0x80808080
+
+/*
+ * Structure to manage N parallel trie traversals.
+ * The runtime trie traversal routines can process 8, 4, or 2 tries
+ * in parallel. Each packet may require multiple trie traversals (up to 4).
+ * This structure is used to fill the slots (0 to n-1) for parallel processing
+ * with the trie traversals needed for each packet.
+ */
+struct acl_flow_data {
+	uint32_t            num_packets;
+	/* number of packets processed */
+	uint32_t            started;
+	/* number of trie traversals in progress */
+	uint32_t            trie;
+	/* current trie index (0 to N-1) */
+	uint32_t            cmplt_size;
+	uint32_t            total_packets;
+	/* maximum number of packets to process */
+	uint32_t            categories;
+	/* number of result categories per packet. */
+	const uint64_t     *trans;
+	const uint8_t     **data;
+	uint32_t           *results;
+	struct completion  *last_cmplt;
+	struct completion  *cmplt_array;
+};
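As an editorial illustration of the slot-filling scheme described above (a
sketch, assuming a context built with two tries and the eight-slot SSE path):

	/* slot:  0           1           2           3          ...
	 * work:  pkt0/trie0  pkt0/trie1  pkt1/trie0  pkt1/trie1 ...
	 * Whenever a traversal reaches a match node, its slot is
	 * refilled with the next pending (packet, trie) pair, so all
	 * slots stay busy until the input runs out. */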
+
+/*
+ * Structure to maintain running results for
+ * a single packet (up to 4 tries).
+ */
+struct completion {
+	uint32_t *results;                          /* running results. */
+	int32_t   priority[RTE_ACL_MAX_CATEGORIES]; /* running priorities. */
+	uint32_t  count;                            /* num of remaining tries */
+	/* true for allocated struct */
+} __attribute__((aligned(XMM_SIZE)));
+
+/*
+ * One parms structure for each slot in the search engine.
+ */
+struct parms {
+	const uint8_t              *data;
+	/* input data for this packet */
+	const uint32_t             *data_index;
+	/* data indirection for this trie */
+	struct completion          *cmplt;
+	/* completion data for this packet */
+};
+
+/*
+ * Define a global idle node for unused engine slots
+ */
+static const uint32_t idle[UINT8_MAX + 1];
+
+/*
+ * Allocate a completion structure to manage the tries for a packet.
+ */
+static inline struct completion *
+alloc_completion(struct completion *p, uint32_t size, uint32_t tries,
+	uint32_t *results)
+{
+	uint32_t n;
+
+	for (n = 0; n < size; n++) {
+
+		if (p[n].count == 0) {
+
+			/* mark as allocated and set number of tries. */
+			p[n].count = tries;
+			p[n].results = results;
+			return &(p[n]);
+		}
+	}
+
+	/* should never get here */
+	return NULL;
+}
+
+/*
+ * Resolve priority for a single result trie.
+ */
+static inline void
+resolve_single_priority(uint64_t transition, int n,
+	const struct rte_acl_ctx *ctx, struct parms *parms,
+	const struct rte_acl_match_results *p)
+{
+	if (parms[n].cmplt->count == ctx->num_tries ||
+			parms[n].cmplt->priority[0] <=
+			p[transition].priority[0]) {
+
+		parms[n].cmplt->priority[0] = p[transition].priority[0];
+		parms[n].cmplt->results[0] = p[transition].results[0];
+	}
+}
+
+/*
+ * Routine to fill a slot in the parallel trie traversal array (parms) from
+ * the list of packets (flows).
+ */
+static inline uint64_t
+acl_start_next_trie(struct acl_flow_data *flows, struct parms *parms, int n,
+	const struct rte_acl_ctx *ctx)
+{
+	uint64_t transition;
+
+	/* if there are any more packets to process */
+	if (flows->num_packets < flows->total_packets) {
+		parms[n].data = flows->data[flows->num_packets];
+		parms[n].data_index = ctx->trie[flows->trie].data_index;
+
+		/* if this is the first trie for this packet */
+		if (flows->trie == 0) {
+			flows->last_cmplt = alloc_completion(flows->cmplt_array,
+				flows->cmplt_size, ctx->num_tries,
+				flows->results +
+				flows->num_packets * flows->categories);
+		}
+
+		/* set completion parameters and starting index for this slot */
+		parms[n].cmplt = flows->last_cmplt;
+		transition =
+			flows->trans[parms[n].data[*parms[n].data_index++] +
+			ctx->trie[flows->trie].root_index];
+
+		/*
+		 * if this is the last trie for this packet,
+		 * then setup next packet.
+		 */
+		flows->trie++;
+		if (flows->trie >= ctx->num_tries) {
+			flows->trie = 0;
+			flows->num_packets++;
+		}
+
+		/* keep track of number of active trie traversals */
+		flows->started++;
+
+	/* no more tries to process, set slot to an idle position */
+	} else {
+		transition = ctx->idle;
+		parms[n].data = (const uint8_t *)idle;
+		parms[n].data_index = idle;
+	}
+	return transition;
+}
+
+static inline void
+acl_set_flow(struct acl_flow_data *flows, struct completion *cmplt,
+	uint32_t cmplt_size, const uint8_t **data, uint32_t *results,
+	uint32_t data_num, uint32_t categories, const uint64_t *trans)
+{
+	flows->num_packets = 0;
+	flows->started = 0;
+	flows->trie = 0;
+	flows->last_cmplt = NULL;
+	flows->cmplt_array = cmplt;
+	flows->total_packets = data_num;
+	flows->categories = categories;
+	flows->cmplt_size = cmplt_size;
+	flows->data = data;
+	flows->results = results;
+	flows->trans = trans;
+}
+
+#endif /* _ACL_RUN_H_ */
diff --git a/lib/librte_acl/acl_run_scalar.c b/lib/librte_acl/acl_run_scalar.c
new file mode 100644
index 0000000..a59ff17
--- /dev/null
+++ b/lib/librte_acl/acl_run_scalar.c
@@ -0,0 +1,198 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "acl_run.h"
+#include "acl_match_check.h"
+
+int
+rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
+        uint32_t *results, uint32_t num, uint32_t categories);
+
+/*
+ * Resolve priority for multiple results (scalar version).
+ * This consists of comparing the priority of the current traversal with the
+ * running set of results for the packet.
+ * For each result, keep a running array of the result (rule number) and
+ * its priority for each category.
+ */
+static inline void
+resolve_priority_scalar(uint64_t transition, int n,
+	const struct rte_acl_ctx *ctx, struct parms *parms,
+	const struct rte_acl_match_results *p, uint32_t categories)
+{
+	uint32_t i;
+	int32_t *saved_priority;
+	uint32_t *saved_results;
+	const int32_t *priority;
+	const uint32_t *results;
+
+	saved_results = parms[n].cmplt->results;
+	saved_priority = parms[n].cmplt->priority;
+
+	/* results and priorities for completed trie */
+	results = p[transition].results;
+	priority = p[transition].priority;
+
+	/* if this is not the first completed trie */
+	if (parms[n].cmplt->count != ctx->num_tries) {
+		for (i = 0; i < categories; i += RTE_ACL_RESULTS_MULTIPLIER) {
+
+			if (saved_priority[i] <= priority[i]) {
+				saved_priority[i] = priority[i];
+				saved_results[i] = results[i];
+			}
+			if (saved_priority[i + 1] <= priority[i + 1]) {
+				saved_priority[i + 1] = priority[i + 1];
+				saved_results[i + 1] = results[i + 1];
+			}
+			if (saved_priority[i + 2] <= priority[i + 2]) {
+				saved_priority[i + 2] = priority[i + 2];
+				saved_results[i + 2] = results[i + 2];
+			}
+			if (saved_priority[i + 3] <= priority[i + 3]) {
+				saved_priority[i + 3] = priority[i + 3];
+				saved_results[i + 3] = results[i + 3];
+			}
+		}
+	} else {
+		for (i = 0; i < categories; i += RTE_ACL_RESULTS_MULTIPLIER) {
+			saved_priority[i] = priority[i];
+			saved_priority[i + 1] = priority[i + 1];
+			saved_priority[i + 2] = priority[i + 2];
+			saved_priority[i + 3] = priority[i + 3];
+
+			saved_results[i] = results[i];
+			saved_results[i + 1] = results[i + 1];
+			saved_results[i + 2] = results[i + 2];
+			saved_results[i + 3] = results[i + 3];
+		}
+	}
+}
+
+/*
+ * When processing the transition, rather than using if/else
+ * construct, the offset is calculated for DFA and QRANGE and
+ * then conditionally added to the address based on node type.
+ * This is done to avoid branch mis-predictions. Since the
+ * offset is rather simple calculation it is more efficient
+ * to do the calculation and do a condition move rather than
+ * a conditional branch to determine which calculation to do.
+ */
+static inline uint32_t
+scan_forward(uint32_t input, uint32_t max)
+{
+	return (input == 0) ? max : rte_bsf32(input);
+}
+
+static inline uint64_t
+scalar_transition(const uint64_t *trans_table, uint64_t transition,
+	uint8_t input)
+{
+	uint32_t addr, index, ranges, x, a, b, c;
+
+	/* break transition into component parts */
+	ranges = transition >> (sizeof(index) * CHAR_BIT);
+
+	/* calc address for a QRANGE node */
+	c = input * SCALAR_QRANGE_MULT;
+	a = ranges | SCALAR_QRANGE_MIN;
+	index = transition & ~RTE_ACL_NODE_INDEX;
+	a -= (c & SCALAR_QRANGE_MASK);
+	b = c & SCALAR_QRANGE_MIN;
+	addr = transition ^ index;
+	a &= SCALAR_QRANGE_MIN;
+	a ^= (ranges ^ b) & (a ^ b);
+	x = scan_forward(a, 32) >> 3;
+	addr += (index == RTE_ACL_NODE_DFA) ? input : x;
+
+	/* pickup next transition */
+	transition = *(trans_table + addr);
+	return transition;
+}
+
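For a concrete feel of the scan_forward() step above (an editorial sketch, not
part of the patch): the final >> 3 shift converts the bit position of the
lowest set marker bit in a into a byte index:

	/* a == 0x00800000: lowest set bit is bit 23, 23 >> 3 == 2,
	 * so byte 2 selects the quad range; a == 0 falls through to
	 * scan_forward(0, 32) >> 3 == 4, i.e. past the last range. */
	x = scan_forward(a, 32) >> 3;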
+int
+rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories)
+{
+	int n;
+	uint64_t transition0, transition1;
+	uint32_t input0, input1;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SCALAR];
+	struct completion cmplt[MAX_SEARCHES_SCALAR];
+	struct parms parms[MAX_SEARCHES_SCALAR];
+
+	if (categories != 1 &&
+		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
+		return -EINVAL;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, num,
+		categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SCALAR; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	transition0 = index_array[0];
+	transition1 = index_array[1];
+
+	while (flows.started > 0) {
+
+		input0 = GET_NEXT_4BYTES(parms, 0);
+		input1 = GET_NEXT_4BYTES(parms, 1);
+
+		for (n = 0; n < 4; n++) {
+			if (likely((transition0 & RTE_ACL_NODE_MATCH) == 0))
+				transition0 = scalar_transition(flows.trans,
+					transition0, (uint8_t)input0);
+
+			input0 >>= CHAR_BIT;
+
+			if (likely((transition1 & RTE_ACL_NODE_MATCH) == 0))
+				transition1 = scalar_transition(flows.trans,
+					transition1, (uint8_t)input1);
+
+			input1 >>= CHAR_BIT;
+
+		}
+		if ((transition0 | transition1) & RTE_ACL_NODE_MATCH) {
+			transition0 = acl_match_check(transition0,
+				0, ctx, parms, &flows, resolve_priority_scalar);
+			transition1 = acl_match_check(transition1,
+				1, ctx, parms, &flows, resolve_priority_scalar);
+
+		}
+	}
+	return 0;
+}
diff --git a/lib/librte_acl/acl_run_sse.c b/lib/librte_acl/acl_run_sse.c
new file mode 100644
index 0000000..3f5c721
--- /dev/null
+++ b/lib/librte_acl/acl_run_sse.c
@@ -0,0 +1,627 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "acl_run.h"
+#include "acl_match_check.h"
+
+enum {
+	SHUFFLE32_SLOT1 = 0xe5,
+	SHUFFLE32_SLOT2 = 0xe6,
+	SHUFFLE32_SLOT3 = 0xe7,
+	SHUFFLE32_SWAP64 = 0x4e,
+};
+
+static const rte_xmm_t mm_type_quad_range = {
+	.u32 = {
+		RTE_ACL_NODE_QRANGE,
+		RTE_ACL_NODE_QRANGE,
+		RTE_ACL_NODE_QRANGE,
+		RTE_ACL_NODE_QRANGE,
+	},
+};
+
+static const rte_xmm_t mm_type_quad_range64 = {
+	.u32 = {
+		RTE_ACL_NODE_QRANGE,
+		RTE_ACL_NODE_QRANGE,
+		0,
+		0,
+	},
+};
+
+static const rte_xmm_t mm_shuffle_input = {
+	.u32 = {0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c},
+};
+
+static const rte_xmm_t mm_shuffle_input64 = {
+	.u32 = {0x00000000, 0x04040404, 0x80808080, 0x80808080},
+};
+
+static const rte_xmm_t mm_ones_16 = {
+	.u16 = {1, 1, 1, 1, 1, 1, 1, 1},
+};
+
+static const rte_xmm_t mm_bytes = {
+	.u32 = {UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX},
+};
+
+static const rte_xmm_t mm_bytes64 = {
+	.u32 = {UINT8_MAX, UINT8_MAX, 0, 0},
+};
+
+static const rte_xmm_t mm_match_mask = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+	},
+};
+
+static const rte_xmm_t mm_match_mask64 = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		0,
+		RTE_ACL_NODE_MATCH,
+		0,
+	},
+};
+
+static const rte_xmm_t mm_index_mask = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+	},
+};
+
+static const rte_xmm_t mm_index_mask64 = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		0,
+		0,
+	},
+};
+
+
+/*
+ * Resolve priority for multiple results (sse version).
+ * This consists of comparing the priority of the current traversal with the
+ * running set of results for the packet.
+ * For each result, keep a running array of the result (rule number) and
+ * its priority for each category.
+ */
+static inline void
+resolve_priority_sse(uint64_t transition, int n, const struct rte_acl_ctx *ctx,
+	struct parms *parms, const struct rte_acl_match_results *p,
+	uint32_t categories)
+{
+	uint32_t x;
+	xmm_t results, priority, results1, priority1, selector;
+	xmm_t *saved_results, *saved_priority;
+
+	for (x = 0; x < categories; x += RTE_ACL_RESULTS_MULTIPLIER) {
+
+		saved_results = (xmm_t *)(&parms[n].cmplt->results[x]);
+		saved_priority =
+			(xmm_t *)(&parms[n].cmplt->priority[x]);
+
+		/* get results and priorities for completed trie */
+		results = MM_LOADU((const xmm_t *)&p[transition].results[x]);
+		priority = MM_LOADU((const xmm_t *)&p[transition].priority[x]);
+
+		/* if this is not the first completed trie */
+		if (parms[n].cmplt->count != ctx->num_tries) {
+
+			/* get running best results and their priorities */
+			results1 = MM_LOADU(saved_results);
+			priority1 = MM_LOADU(saved_priority);
+
+			/* select results that are highest priority */
+			selector = MM_CMPGT32(priority1, priority);
+			results = MM_BLENDV8(results, results1, selector);
+			priority = MM_BLENDV8(priority, priority1, selector);
+		}
+
+		/* save running best results and their priorities */
+		MM_STOREU(saved_results, results);
+		MM_STOREU(saved_priority, priority);
+	}
+}
+
+/*
+ * Extract transitions from an XMM register and check for any matches
+ */
+static void
+acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
+	struct parms *parms, struct acl_flow_data *flows)
+{
+	uint64_t transition1, transition2;
+
+	/* extract transition from low 64 bits. */
+	transition1 = MM_CVT64(*indicies);
+
+	/* extract transition from high 64 bits. */
+	*indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
+	transition2 = MM_CVT64(*indicies);
+
+	transition1 = acl_match_check(transition1, slot, ctx,
+		parms, flows, resolve_priority_sse);
+	transition2 = acl_match_check(transition2, slot + 1, ctx,
+		parms, flows, resolve_priority_sse);
+
+	/* update indicies with new transitions. */
+	*indicies = MM_SET64(transition2, transition1);
+}
+
+/*
+ * Check for a match in 2 transitions (contained in SSE register)
+ */
+static inline void
+acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
+	struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
+{
+	xmm_t temp;
+
+	temp = MM_AND(match_mask, *indicies);
+	while (!MM_TESTZ(temp, temp)) {
+		acl_process_matches(indicies, slot, ctx, parms, flows);
+		temp = MM_AND(match_mask, *indicies);
+	}
+}
+
+/*
+ * Check for any match in 4 transitions (contained in 2 SSE registers)
+ */
+static inline void
+acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
+	struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
+	xmm_t match_mask)
+{
+	xmm_t temp;
+
+	/* put low 32 bits of each transition into one register */
+	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
+		0x88);
+	/* test for match node */
+	temp = MM_AND(match_mask, temp);
+
+	while (!MM_TESTZ(temp, temp)) {
+		acl_process_matches(indicies1, slot, ctx, parms, flows);
+		acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
+
+		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
+					(__m128)*indicies2,
+					0x88);
+		temp = MM_AND(match_mask, temp);
+	}
+}
+
+/*
+ * Calculate the address of the next transition for
+ * all types of nodes. Note that only DFA nodes and range
+ * nodes actually transition to another node. Match
+ * nodes don't move.
+ */
+static inline xmm_t
+acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
+	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
+	xmm_t *indicies1, xmm_t *indicies2)
+{
+	xmm_t addr, node_types, temp;
+
+	/*
+	 * Note that no transition is done for a match
+	 * node and therefore a stream freezes when
+	 * it reaches a match.
+	 */
+
+	/* Shuffle low 32 into temp and high 32 into indicies2 */
+	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
+		0x88);
+	*indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
+		(__m128)*indicies2, 0xdd);
+
+	/* Calc node type and node addr */
+	node_types = MM_ANDNOT(index_mask, temp);
+	addr = MM_AND(index_mask, temp);
+
+	/*
+	 * Calc addr for DFAs - addr = dfa_index + input_byte
+	 */
+
+	/* mask for DFA type (0) nodes */
+	temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
+
+	/* add input byte to DFA position */
+	temp = MM_AND(temp, bytes);
+	temp = MM_AND(temp, next_input);
+	addr = MM_ADD32(addr, temp);
+
+	/*
+	 * Calc addr for Range nodes -> range_index + range(input)
+	 */
+	node_types = MM_CMPEQ32(node_types, type_quad_range);
+
+	/*
+	 * Calculate number of range boundaries that are less than the
+	 * input value. Range boundaries for each node are in signed 8 bit,
+	 * ordered from -128 to 127 in the indicies2 register.
+	 * This is effectively a popcnt of bytes that are greater than the
+	 * input byte.
+	 */
+
+	/* shuffle input byte to all 4 positions of 32 bit value */
+	temp = MM_SHUFFLE8(next_input, shuffle_input);
+
+	/* check ranges */
+	temp = MM_CMPGT8(temp, *indicies2);
+
+	/* convert -1 to 1 (bytes greater than input byte) */
+	temp = MM_SIGN8(temp, temp);
+
+	/* horizontal add pairs of bytes into words */
+	temp = MM_MADD8(temp, temp);
+
+	/* horizontal add pairs of words into dwords */
+	temp = MM_MADD16(temp, ones_16);
+
+	/* mask to range type nodes */
+	temp = MM_AND(temp, node_types);
+
+	/* add index into node position */
+	return MM_ADD32(addr, temp);
+}
+
+/*
+ * Process 4 transitions (in 2 SIMD registers) in parallel
+ */
+static inline xmm_t
+transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
+	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
+	const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
+{
+	xmm_t addr;
+	uint64_t trans0, trans2;
+
+	 /* Calculate the address (array index) for all 4 transitions. */
+
+	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
+		bytes, type_quad_range, indicies1, indicies2);
+
+	 /* Gather 64 bit transitions and pack back into 2 registers. */
+
+	trans0 = trans[MM_CVT32(addr)];
+
+	/* get slot 2 */
+
+	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
+	trans2 = trans[MM_CVT32(addr)];
+
+	/* get slot 1 */
+
+	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
+	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
+
+	/* get slot 3 */
+
+	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
+	*indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
+
+	return MM_SRL32(next_input, 8);
+}
+
+/*
+ * Execute trie traversal with 8 traversals in parallel
+ */
+static inline int
+search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	int n;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SSE8];
+	struct completion cmplt[MAX_SEARCHES_SSE8];
+	struct parms parms[MAX_SEARCHES_SSE8];
+	xmm_t input0, input1;
+	xmm_t indicies1, indicies2, indicies3, indicies4;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+		total_packets, categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	/*
+	 * indicies1 contains index_array[0,1]
+	 * indicies2 contains index_array[2,3]
+	 * indicies3 contains index_array[4,5]
+	 * indicies4 contains index_array[6,7]
+	 */
+
+	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
+	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
+
+	indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
+	indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
+
+	 /* Check for any matches. */
+	acl_match_check_x4(0, ctx, parms, &flows,
+		&indicies1, &indicies2, mm_match_mask.m);
+	acl_match_check_x4(4, ctx, parms, &flows,
+		&indicies3, &indicies4, mm_match_mask.m);
+
+	while (flows.started > 0) {
+
+		/* Gather 4 bytes of input data for each stream. */
+		input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
+			0);
+		input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
+			0);
+
+		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
+		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
+
+		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
+		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
+
+		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
+		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
+
+		 /* Process the 4 bytes of input on each stream. */
+
+		input0 = transition4(mm_index_mask.m, input0,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		input1 = transition4(mm_index_mask.m, input1,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies3, &indicies4);
+
+		input0 = transition4(mm_index_mask.m, input0,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		input1 = transition4(mm_index_mask.m, input1,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies3, &indicies4);
+
+		input0 = transition4(mm_index_mask.m, input0,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		input1 = transition4(mm_index_mask.m, input1,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies3, &indicies4);
+
+		input0 = transition4(mm_index_mask.m, input0,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		input1 = transition4(mm_index_mask.m, input1,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies3, &indicies4);
+
+		 /* Check for any matches. */
+		acl_match_check_x4(0, ctx, parms, &flows,
+			&indicies1, &indicies2, mm_match_mask.m);
+		acl_match_check_x4(4, ctx, parms, &flows,
+			&indicies3, &indicies4, mm_match_mask.m);
+	}
+
+	return 0;
+}
+
+/*
+ * Execute trie traversal with 4 traversals in parallel
+ */
+static inline int
+search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	 uint32_t *results, int total_packets, uint32_t categories)
+{
+	int n;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SSE4];
+	struct completion cmplt[MAX_SEARCHES_SSE4];
+	struct parms parms[MAX_SEARCHES_SSE4];
+	xmm_t input, indicies1, indicies2;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+		total_packets, categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
+	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
+
+	/* Check for any matches. */
+	acl_match_check_x4(0, ctx, parms, &flows,
+		&indicies1, &indicies2, mm_match_mask.m);
+
+	while (flows.started > 0) {
+
+		/* Gather 4 bytes of input data for each stream. */
+		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
+
+		/* Process the 4 bytes of input on each stream. */
+		input = transition4(mm_index_mask.m, input,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		 input = transition4(mm_index_mask.m, input,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		 input = transition4(mm_index_mask.m, input,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		 input = transition4(mm_index_mask.m, input,
+			mm_shuffle_input.m, mm_ones_16.m,
+			mm_bytes.m, mm_type_quad_range.m,
+			flows.trans, &indicies1, &indicies2);
+
+		/* Check for any matches. */
+		acl_match_check_x4(0, ctx, parms, &flows,
+			&indicies1, &indicies2, mm_match_mask.m);
+	}
+
+	return 0;
+}
+
+static inline xmm_t
+transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
+	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
+	const uint64_t *trans, xmm_t *indicies1)
+{
+	uint64_t t;
+	xmm_t addr, indicies2;
+
+	indicies2 = MM_XOR(ones_16, ones_16);
+
+	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
+		bytes, type_quad_range, indicies1, &indicies2);
+
+	/* Gather 64 bit transitions and pack 2 per register. */
+
+	t = trans[MM_CVT32(addr)];
+
+	/* get slot 1 */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
+	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
+
+	return MM_SRL32(next_input, 8);
+}
+
+/*
+ * Execute trie traversal with 2 traversals in parallel.
+ */
+static inline int
+search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	int n;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SSE2];
+	struct completion cmplt[MAX_SEARCHES_SSE2];
+	struct parms parms[MAX_SEARCHES_SSE2];
+	xmm_t input, indicies;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+		total_packets, categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	indicies = MM_LOADU((xmm_t *) &index_array[0]);
+
+	/* Check for any matches. */
+	acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
+
+	while (flows.started > 0) {
+
+		/* Gather 4 bytes of input data for each stream. */
+		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
+
+		/* Process the 4 bytes of input on each stream. */
+
+		input = transition2(mm_index_mask64.m, input,
+			mm_shuffle_input64.m, mm_ones_16.m,
+			mm_bytes64.m, mm_type_quad_range64.m,
+			flows.trans, &indicies);
+
+		input = transition2(mm_index_mask64.m, input,
+			mm_shuffle_input64.m, mm_ones_16.m,
+			mm_bytes64.m, mm_type_quad_range64.m,
+			flows.trans, &indicies);
+
+		input = transition2(mm_index_mask64.m, input,
+			mm_shuffle_input64.m, mm_ones_16.m,
+			mm_bytes64.m, mm_type_quad_range64.m,
+			flows.trans, &indicies);
+
+		input = transition2(mm_index_mask64.m, input,
+			mm_shuffle_input64.m, mm_ones_16.m,
+			mm_bytes64.m, mm_type_quad_range64.m,
+			flows.trans, &indicies);
+
+		/* Check for any matches. */
+		acl_match_check_x2(0, ctx, parms, &flows, &indicies,
+			mm_match_mask64.m);
+	}
+
+	return 0;
+}
+
+int
+rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories)
+{
+	if (categories != 1 &&
+		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
+		return -EINVAL;
+
+	if (likely(num >= MAX_SEARCHES_SSE8))
+		return search_sse_8(ctx, data, results, num, categories);
+	else if (num >= MAX_SEARCHES_SSE4)
+		return search_sse_4(ctx, data, results, num, categories);
+	else
+		return search_sse_2(ctx, data, results, num, categories);
+}
diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 7c288bd..b9173c1 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -38,6 +38,52 @@
 
 TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
 
+typedef int (*rte_acl_classify_t)
+(const struct rte_acl_ctx *, const uint8_t **, uint32_t *, uint32_t, uint32_t);
+
+extern int
+rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories);
+
+/* By default, use the always-available scalar code path. */
+rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
+
+void
+rte_acl_select_classify(enum acl_classify_alg alg)
+{
+	switch (alg) {
+	case ACL_CLASSIFY_DEFAULT:
+	case ACL_CLASSIFY_SCALAR:
+		rte_acl_default_classify = rte_acl_classify_scalar;
+		break;
+	case ACL_CLASSIFY_SSE:
+		rte_acl_default_classify = rte_acl_classify_sse;
+		break;
+	default:
+		break;
+	}
+}
+
+static void __attribute__((constructor))
+rte_acl_init(void)
+{
+	enum acl_classify_alg alg = ACL_CLASSIFY_DEFAULT;
+
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
+		alg = ACL_CLASSIFY_SSE;
+
+	rte_acl_select_classify(alg);
+}
+
+inline int rte_acl_classify(const struct rte_acl_ctx *ctx,
+                            const uint8_t **data,
+                            uint32_t *results, uint32_t num,
+                            uint32_t categories)
+{
+	return rte_acl_default_classify(ctx, data, results, num, categories);
+}
+
+
 struct rte_acl_ctx *
 rte_acl_find_existing(const char *name)
 {
diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
index afc0f69..650b306 100644
--- a/lib/librte_acl/rte_acl.h
+++ b/lib/librte_acl/rte_acl.h
@@ -267,6 +267,9 @@ rte_acl_reset(struct rte_acl_ctx *ctx);
  * RTE_ACL_RESULTS_MULTIPLIER and can't be bigger than RTE_ACL_MAX_CATEGORIES.
  * If more than one rule is applicable for given input buffer and
  * given category, then rule with highest priority will be returned as a match.
+ * Note that this function can be run only on CPUs with SSE4.1 support.
+ * It is up to the caller to make sure it is invoked only on
+ * a machine that supports the SSE4.1 ISA.
  * Note, that it is a caller responsibility to ensure that input parameters
  * are valid and point to correct memory locations.
  *
@@ -286,9 +289,10 @@ rte_acl_reset(struct rte_acl_ctx *ctx);
  * @return
  *   zero on successful completion.
  *   -EINVAL for incorrect arguments.
+ *   -ENOTSUP for unsupported platforms.
  */
 int
-rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
+rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories);
 
 /**
@@ -323,9 +327,23 @@ rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
  *   zero on successful completion.
  *   -EINVAL for incorrect arguments.
  */
-int
-rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t num, uint32_t categories);
+
+enum acl_classify_alg {
+	ACL_CLASSIFY_DEFAULT = 0,
+	ACL_CLASSIFY_SCALAR = 1,
+	ACL_CLASSIFY_SSE = 2,
+};
+
+extern inline int rte_acl_classify(const struct rte_acl_ctx *ctx,
+				   const uint8_t **data,
+				   uint32_t *results, uint32_t num,
+				   uint32_t categories);
+/**
+ * Analyze the ISA of the current CPU and point rte_acl_default_classify
+ * to the highest applicable version of the classify function.
+ */
+extern void
+rte_acl_select_classify(enum acl_classify_alg alg);
 
 /**
  * Dump an ACL context structure to the console.
-- 
1.9.3
^ permalink raw reply	[relevance 1%]
* Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
  2014-08-08 14:30  3%         ` Neil Horman
@ 2014-08-11 22:23  0%           ` Thomas Monjalon
  0 siblings, 0 replies; 86+ results
From: Thomas Monjalon @ 2014-08-11 22:23 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev
Hi all,
2014-08-08 10:30, Neil Horman:
> On Fri, Aug 08, 2014 at 01:09:34PM +0000, Ananyev, Konstantin wrote:
> > > > Also I think the user should have an ability to change the default classify code path without modifying/rebuilding the acl library.
> > > I agree, but both the methods we are advocating for allow that.  It's really just
> > > a question of exposing the mechanism as data or text in the binary.  Exposing it
> > > as data comes with implicit ABI constraints that are less prevalent when done
> > > as code entry points.
> >  
> > > > For example: a bug in an optimised code path is discovered, or a user may want to implement and use his own version of classify().
> 
> > Of course, he will probably report it and we will probably fix it sooner or later.
> > But with such an ability he can switch to the safe implementation immediately
> > without touching the library and then wait for the fix.
> 
> That's not how users of a binary package from a distribution operate.  If they're
> using a binary package they either:
> 
> 1) Don't want to rebuild anything themselves, in which case they file the bug,
> and wait for the developers to fix the issue.
> 
> or 
> 
> 2) Have a staff to help them work around the issue, which will be done by
> rebuilding/fixing the library, not the application.
> 
> With (2), what I am saying is that, if a 3rd party finds a bug in the classifier
> code within dpdk which is built as a shared library within a distribution, and
> they need it fixed immediately, they have a choice of what to do, they can
> either (a), write a custom classifier function and point the dpdk library to it,
> or (b), just fix the bug in the library directly.  Given that, if they can
> accomplish (a), they by all rights can also accomplish (b), the only decision
> they need to make is one which makes the most sense for them.  The answer is
> (b), because that's where the functionality lives.  i.e. when the fix occurs
> upstream and a new release gets issued, you can go back to using the library
> maintained version, and you don't have to clean up what has become vestigial
> unused code.
I think it's even simpler: thinking an API should allow behaviour changes without
rebuilding is not sane. So should we expose all functions?
Please try to reduce API as much as possible.
Thanks
-- 
Thomas
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
  2014-08-08 13:09  3%       ` Ananyev, Konstantin
@ 2014-08-08 14:30  3%         ` Neil Horman
  2014-08-11 22:23  0%           ` Thomas Monjalon
  0 siblings, 1 reply; 86+ results
From: Neil Horman @ 2014-08-08 14:30 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev
On Fri, Aug 08, 2014 at 01:09:34PM +0000, Ananyev, Konstantin wrote:
> 
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Friday, August 08, 2014 1:25 PM
> > To: Ananyev, Konstantin
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
> > 
> > On Fri, Aug 08, 2014 at 11:49:58AM +0000, Ananyev, Konstantin wrote:
> > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > Sent: Thursday, August 07, 2014 9:12 PM
> > > > To: Ananyev, Konstantin
> > > > Cc: dev@dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
> > > >
> > > > On Thu, Aug 07, 2014 at 07:31:03PM +0100, Konstantin Ananyev wrote:
> > > > > Make ACL library to build/work on 'default' architecture:
> > > > > - make rte_acl_classify_scalar really scalar
> > > > >  (make sure it wouldn't use sse4 intrinsics through resolve_priority()).
> > > > > - Provide two versions of rte_acl_classify code path:
> > > > >   rte_acl_classify_sse() - could be build and used only on systems with sse4.2
> > > > >   and upper, return -ENOTSUP on lower arch.
> > > > >   rte_acl_classify_scalar() - a slower version, but could be build and used
> > > > >   on all systems.
> > > > > - keep common code shared between these two codepaths.
> > > > >
> > > > > v2 changes:
> > > > >  run-time selection of most appropriate code-path for given ISA.
> > > > >  By default the highest supported one is selected.
> > > > >  User can still override that selection by manually assigning new value to
> > > > >  the global function pointer rte_acl_default_classify.
> > > > >  rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
> > > > >  points to.
> > > > >
> > > > >
> > > > > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > > >
> > > This is a lot better, thank you.  A few remaining issues.
> > >
> > > My comments inline too.
> > > Thanks
> > > Konstantin
> > >
> > > >
> > > > > ---
> > > > >  app/test-acl/main.c                |  13 +-
> > > > >  lib/librte_acl/Makefile            |   5 +-
> > > > >  lib/librte_acl/acl_bld.c           |   5 +-
> > > > >  lib/librte_acl/acl_match_check.def |  92 ++++
> > > > >  lib/librte_acl/acl_run.c           | 944 -------------------------------------
> > > > >  lib/librte_acl/acl_run.h           | 220 +++++++++
> > > > >  lib/librte_acl/acl_run_scalar.c    | 197 ++++++++
> > > > >  lib/librte_acl/acl_run_sse.c       | 630 +++++++++++++++++++++++++
> > > > >  lib/librte_acl/rte_acl.c           |  15 +
> > > > >  lib/librte_acl/rte_acl.h           |  24 +-
> > > > >  10 files changed, 1189 insertions(+), 956 deletions(-)
> > > > >  create mode 100644 lib/librte_acl/acl_match_check.def
> > > > >  delete mode 100644 lib/librte_acl/acl_run.c
> > > > >  create mode 100644 lib/librte_acl/acl_run.h
> > > > >  create mode 100644 lib/librte_acl/acl_run_scalar.c
> > > > >  create mode 100644 lib/librte_acl/acl_run_sse.c
> > > > >
> > > > > diff --git a/app/test-acl/main.c b/app/test-acl/main.c
> > > > > index d654409..45c6fa6 100644
> > > > > --- a/app/test-acl/main.c
> > > > > +++ b/app/test-acl/main.c
> > > > > @@ -787,6 +787,10 @@ acx_init(void)
> > > > >  	/* perform build. */
> > > > >  	ret = rte_acl_build(config.acx, &cfg);
> > > > >
> > > > > +	/* setup default rte_acl_classify */
> > > > > +	if (config.scalar)
> > > > > +		rte_acl_default_classify = rte_acl_classify_scalar;
> > > > > +
> > > > Exporting this variable as part of the ABI is a bad idea.  If the prototype of
> > > > the function changes you have to update all your applications.
> > >
> > If the prototype of rte_acl_classify changes, most likely you'll have to update code that uses it anyway.
> > >
> > Why?  If you hide this from the application, changes to the internal
> > implementation will also be invisible.  When building as a DSO, an application
> > will be able to transition between libraries without the need for a rebuild.
> 
> Because rte_acl_classify() is part of the ACL API that users use.
> If we add/modify its parameters and/or return value, users will have to change their apps anyway.
>  
That's not at all true.  With API versioning scripts you can make several
versions of the same function with different prototypes as future needs dictate.
Hiding the internal implementation just makes that easier.
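To make that concrete, here is a minimal sketch of the kind of versioning I
mean (the DPDK_1.7/DPDK_2.0 version nodes, the _v17/_v20 helper names and the
extra parameter are all illustrative, not part of this patch set, and both
nodes would also have to be declared in the library's linker version script):

	/* keep the old prototype alive for binaries already linked against it */
	int rte_acl_classify_v17(const struct rte_acl_ctx *ctx,
		const uint8_t **data, uint32_t *results,
		uint32_t num, uint32_t categories);
	__asm__(".symver rte_acl_classify_v17, rte_acl_classify@DPDK_1.7");

	/* the new prototype becomes the default version of the symbol */
	int rte_acl_classify_v20(const struct rte_acl_ctx *ctx,
		const uint8_t **data, uint32_t *results,
		uint32_t num, uint32_t categories, uint32_t flags);
	__asm__(".symver rte_acl_classify_v20, rte_acl_classify@@DPDK_2.0");

Old binaries keep resolving rte_acl_classify@DPDK_1.7, newly built ones bind
to rte_acl_classify@@DPDK_2.0, and the DSO carries both until the old version
node is eventually retired.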
> > > >  Make the pointer
> > > > an internal symbol and set it using a get/set routine with an enum to represent
> > > > the path to choose.  That will help isolate the ABI from the internal
> > > > implementation.
> > >
> > That was my first intention too.
> > But then I realised that if we make it internal, we'll need to make rte_acl_classify() a proper function
> > and it will cost us an extra call (or jump).
> That's true, but I don't see that as a problem.  We're not talking about a hot
> code path here, it's a setup function.
> 
> I am not talking about rte_acl_select_classify() but about rte_acl_classify() itself (not the code path).
> If I make rte_acl_default_classify static, rte_acl_classify() would need to become a real function, and it'll be something like this:
> 
> ->call rte_acl_classify
> ---> load rte_acl_default_classify value into the reg
> --->  jmp (*reg)
> 
Ah, yes, the actual classification path, you will need an extra call
instruction there.  I would say if that's the case, then you should either make
rte_acl_classify a macro or a real function based on whether you're building as a
shared library or a static library.
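Roughly along these lines (RTE_BUILD_SHARED_LIB below stands in for whatever
config flag the build system exposes; treat the exact spelling as
illustrative, not as something this patch set defines):

	#ifdef RTE_BUILD_SHARED_LIB
	/* shared build: a real, versionable entry point; costs the one
	 * extra call we're discussing */
	int rte_acl_classify(const struct rte_acl_ctx *ctx,
		const uint8_t **data, uint32_t *results,
		uint32_t num, uint32_t categories);
	#else
	/* static build: expand straight to the pointer, no extra call */
	#define rte_acl_classify(ctx, data, results, num, categories)	\
		rte_acl_default_classify(ctx, data, results, num, categories)
	#endif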
> >  Or do you think that an application will
> > be switching between classification functions on every classify operation?
> 
> God no.
> 
> > Also I think the user should have an ability to change the default classify code path without modifying/rebuilding the acl library.
> I agree, but both the methods we are advocating for allow that.  It's really just
> a question of exposing the mechanism as data or text in the binary.  Exposing it
> as data comes with implicit ABI constraints that are less prevalent when done
> > as code entry points.
>  
> > For example: a bug in an optimised code path is discovered, or a user may want to implement and use his own version of classify().
> 
> > In the case of a bug in the optimized path, you just fix the bug. 
> 
> It is not about me. It is about a user who gets librte_acl as part of a binary distribution.
Yes, those are my users :)
> Of course, he will probably report it and we will probably fix it sooner or later.
> But with such an ability he can switch to the safe implementation immediately
> without touching the library and then wait for the fix.
> 
That's not how users of a binary package from a distribution operate.  If they're
using a binary package they either:
1) Don't want to rebuild anything themselves, in which case they file the bug,
and wait for the developers to fix the issue.
or 
2) Have a staff to help them work around the issue, which will be done by
rebuilding/fixing the library, not the application.
With (2), what I am saying is that, if a 3rd party finds a bug in the classifier
code within dpdk which is built as a shared library within a distribution, and
they need it fixed immediately, they have a choice of what to do, they can
either (a), write a custom classifier function and point the dpdk library to it,
or (b), just fix the bug in the library directly.  Given that, if they can
accomplish (a), they by all rights can also accomplish (b), the only decision
they need to make is one which makes the most sense for them.  The answer is
(b), because that's where the functionality lives.  i.e. when the fix occurs
upstream and a new release gets issued, you can go back to using the library
maintained version, and you don't have to clean up what has become vestigial
unused code.
 
> >  If you want
> > to provide your own classification function, thats fine I suppose, but that
> > seems completely outside the scope of what we're trying to do here.  It's not
> > advantageous to just throw that in there.  If you want to be able to provide
> > your own classifier function, lets at least take some time to make sure that the
> > function prototype is sufficiently capable to accept all the data you might want
> > to pass it in the future, before we go exposing it.  Otherwise you'll have to
> > break the ABI in future versions, which is something we've been discussing
> > trying to avoid.
> 
> rte_acl_classify() is already exposed (part of the API), same as rte_acl_classify_scalar().
> If we change these function prototypes in the future, it will break the ABI anyway.
> 
Well, at the moment, that's fine because you don't make any ABI promises anyway;
I've been working to change that, so distributions can have greater dpdk
adoption.
> > 
> > > > It will also let you prevent things like selecting a run time
> > > > path that is incompatible with the running system
> > >
> > If the user is going to update rte_acl_default_classify, he is probably smart enough to know what he is doing.
> > That really seems like poor design to me.  I don't see why you wouldn't at least
> want to warn the developer of an application if they were to assign, at run time,
> a default classifier method that was incompatible with the running system.  Yes,
> they're likely smart enough to know what they're doing, but smart people make
> > mistakes, and appreciate being told when they're doing so, especially if the
> > method of telling is something a bit more civil than a machine check that
> might occur well after the application has been initialized.
> 
> I have no problem providing rte_acl_check_classify(flags_required, classify_ptr) that would do the checking and emit the warning.
> Though as I said above, I'd prefer not to hide rte_acl_default_classify, as it will cause extra overhead for rte_acl_classify().
> 
> > 
> > On the other hand, a user can hit the same problem by simply calling rte_acl_classify_sse() directly.
> > Not if the function is statically declared and not exposed to the application
> they can't :)
> 
> I don't really want to hide rte_acl_classify_sse()/rte_acl_classify_scalar().
> They should be available directly, I think.
> In the future we might introduce new versions for more sophisticated ISAs (rte_acl_classify_avx() or something).
> Users should have the ability to downgrade their classify() function if they like.
What, in your mind, is the reasoning behind being able to do so?  What is
advantageous about that?  Aside from debugging, that is (for which I
can see a use).  But in normal production operation, why would you choose the
scalar classifier over the sse classifier?
^ permalink raw reply	[relevance 3%]
* Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
  2014-08-08 12:25  4%     ` Neil Horman
@ 2014-08-08 13:09  3%       ` Ananyev, Konstantin
  2014-08-08 14:30  3%         ` Neil Horman
  0 siblings, 1 reply; 86+ results
From: Ananyev, Konstantin @ 2014-08-08 13:09 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Friday, August 08, 2014 1:25 PM
> To: Ananyev, Konstantin
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
> 
> On Fri, Aug 08, 2014 at 11:49:58AM +0000, Ananyev, Konstantin wrote:
> > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > Sent: Thursday, August 07, 2014 9:12 PM
> > > To: Ananyev, Konstantin
> > > Cc: dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
> > >
> > > On Thu, Aug 07, 2014 at 07:31:03PM +0100, Konstantin Ananyev wrote:
> > > > Make ACL library to build/work on 'default' architecture:
> > > > - make rte_acl_classify_scalar really scalar
> > > >  (make sure it wouldn't use sse4 intrinsics through resolve_priority()).
> > > > - Provide two versions of rte_acl_classify code path:
> > > >   rte_acl_classify_sse() - could be build and used only on systems with sse4.2
> > > >   and upper, return -ENOTSUP on lower arch.
> > > >   rte_acl_classify_scalar() - a slower version, but could be build and used
> > > >   on all systems.
> > > > - keep common code shared between these two codepaths.
> > > >
> > > > v2 changes:
> > > >  run-time selection of most appropriate code-path for given ISA.
> > > >  By default the highest supported one is selected.
> > > >  User can still override that selection by manually assigning new value to
> > > >  the global function pointer rte_acl_default_classify.
> > > >  rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
> > > >  points to.
> > > >
> > > >
> > > > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > >
> > This is a lot better, thank you.  A few remaining issues.
> >
> > My comments inline too.
> > Thanks
> > Konstantin
> >
> > >
> > > > ---
> > > >  app/test-acl/main.c                |  13 +-
> > > >  lib/librte_acl/Makefile            |   5 +-
> > > >  lib/librte_acl/acl_bld.c           |   5 +-
> > > >  lib/librte_acl/acl_match_check.def |  92 ++++
> > > >  lib/librte_acl/acl_run.c           | 944 -------------------------------------
> > > >  lib/librte_acl/acl_run.h           | 220 +++++++++
> > > >  lib/librte_acl/acl_run_scalar.c    | 197 ++++++++
> > > >  lib/librte_acl/acl_run_sse.c       | 630 +++++++++++++++++++++++++
> > > >  lib/librte_acl/rte_acl.c           |  15 +
> > > >  lib/librte_acl/rte_acl.h           |  24 +-
> > > >  10 files changed, 1189 insertions(+), 956 deletions(-)
> > > >  create mode 100644 lib/librte_acl/acl_match_check.def
> > > >  delete mode 100644 lib/librte_acl/acl_run.c
> > > >  create mode 100644 lib/librte_acl/acl_run.h
> > > >  create mode 100644 lib/librte_acl/acl_run_scalar.c
> > > >  create mode 100644 lib/librte_acl/acl_run_sse.c
> > > >
> > > > diff --git a/app/test-acl/main.c b/app/test-acl/main.c
> > > > index d654409..45c6fa6 100644
> > > > --- a/app/test-acl/main.c
> > > > +++ b/app/test-acl/main.c
> > > > @@ -787,6 +787,10 @@ acx_init(void)
> > > >  	/* perform build. */
> > > >  	ret = rte_acl_build(config.acx, &cfg);
> > > >
> > > > +	/* setup default rte_acl_classify */
> > > > +	if (config.scalar)
> > > > +		rte_acl_default_classify = rte_acl_classify_scalar;
> > > > +
> > > Exporting this variable as part of the ABI is a bad idea.  If the prototype of
> > > the function changes you have to update all your applications.
> >
> > If the prototype of rte_acl_classify changes, most likely you'll have to update code that uses it anyway.
> >
> Why?  If you hide this from the application, changes to the internal
> implementation will also be invisible.  When building as a DSO, an application
> will be able to transition between libraries without the need for a rebuild.
Because rte_acl_classify() is part of the ACL API that users use.
If we add/modify its parameters and/or return value, users will have to change their apps anyway.
 
> > >  Make the pointer
> > > an internal symbol and set it using a get/set routine with an enum to represent
> > > the path to choose.  That will help isolate the ABI from the internal
> > > implementation.
> >
> > That was my first intention too.
> > But then I realised that if we make it internal, we'll need to make rte_acl_classify() a proper function
> > and it will cost us an extra call (or jump).
> That's true, but I don't see that as a problem.  We're not talking about a hot
> code path here, it's a setup function.
I am not talking about rte_acl_select_classify() but about rte_acl_classify() itself (not the code path).
If I make rte_acl_default_classify static, rte_acl_classify() would need to become a real function, and it'll be something like this:
->call rte_acl_classify
---> load rte_acl_default_classify value into the reg
--->  jmp (*reg)
>  Or do you think that an application will
> be switching between classification functions on every classify operation?
God no.
> > Also I think the user should have an ability to change the default classify code path without modifying/rebuilding the acl library.
> I agree, but both the methods we are advocating for allow that.  It's really just
> a question of exposing the mechanism as data or text in the binary.  Exposing it
> as data comes with implicit ABI constraints that are less prevalent when done
> as code entry points.
 
> > For example: a bug in an optimised code path is discovered, or a user may want to implement and use his own version of classify().
> In the case of a bug in the optimized path, you just fix the bug. 
It is not about me. It is about a user who gets librte_acl as part of a binary distribution.
Of course, he will probably report it and we will probably fix it sooner or later.
But with such an ability he can switch to the safe implementation immediately
without touching the library and then wait for the fix.
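With the v2 patch that switch is a single assignment on the application side
(the same thing the test-acl app does when config.scalar is set):

	/* force the always-available scalar code path for all subsequent
	 * rte_acl_classify() calls, without rebuilding librte_acl */
	rte_acl_default_classify = rte_acl_classify_scalar;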
>  If you want
> to provide your own classification function, thats fine I suppose, but that
> seems completely outside the scope of what we're trying to do here.  It's not
> advantageous to just throw that in there.  If you want to be able to provide
> your own classifier function, lets at least take some time to make sure that the
> function prototype is sufficiently capable to accept all the data you might want
> to pass it in the future, before we go exposing it.  Otherwise you'll have to
> break the ABI in future versions, which is something we've been discussing
> trying to avoid.
rte_acl_classify() is already exposed (part of the API), same as rte_acl_classify_scalar().
If we change these function prototypes in the future, it will break the ABI anyway.
> 
> > > It will also let you prevent things like selecting a run time
> > > path that is incompatible with the running system
> >
> > If the user is going to update rte_acl_default_classify, he is probably smart enough to know what he is doing.
> That really seems like poor design to me.  I don't see why you wouldn't at least
> want to warn the developer of an application if they were to assign, at run time,
> a default classifier method that was incompatible with the running system.  Yes,
> they're likely smart enough to know what they're doing, but smart people make
> mistakes, and appreciate being told when they're doing so, especially if the
> method of telling is something a bit more civil than a machine check that
> might occur well after the application has been initialized.
I have no problem providing rte_acl_check_classify(flags_required, classify_ptr) that would do the checking and emit the warning.
Though as I said above, I'd prefer not to hide rte_acl_default_classify, as it will cause extra overhead for rte_acl_classify().
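As a sketch of what I mean (rte_acl_check_classify() doesn't exist anywhere
yet; the name, the warning text and the exact check are illustrative, assuming
the usual rte_cpuflags.h/rte_log.h/errno.h includes):

	static int
	rte_acl_check_classify(enum rte_cpu_flag_t flags_required,
		rte_acl_classify_t classify_ptr)
	{
		/* warn, but still let a smart user take the risk */
		if (classify_ptr != rte_acl_classify_scalar &&
				!rte_cpu_get_flag_enabled(flags_required)) {
			RTE_LOG(WARNING, ACL,
				"selected classify() needs an unsupported ISA extension\n");
			return -ENOTSUP;
		}
		return 0;
	}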
> 
> > On the other hand, a user can hit the same problem by simply calling rte_acl_classify_sse() directly.
> Not if the function is statically declared and not exposed to the application
> they can't :)
I don't really want to hide rte_acl_classify_sse()/rte_acl_classify_scalar().
They should be available directly, I think.
In the future we might introduce new versions for more sophisticated ISAs (rte_acl_classify_avx() or something).
Users should have the ability to downgrade their classify() function if they like.
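And if we go with a select-style setter, as the follow-up revision of this
patch earlier in the thread does with rte_acl_select_classify(), such a
downgrade stays a one-liner for the application:

	/* e.g. fall back from the sse code path after hitting a bug in it */
	rte_acl_select_classify(ACL_CLASSIFY_SCALAR);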
> >
> > > and prevent path switching
> > > during searches, which may produce unexpected results.
> >
> > Not that I am advertising it, but  it should be safe to update rte_acl_default_classify during searches:
> > All versions of classify should produce exactly the same result for each input packet and treat acl context as read-only.
> >
> Fair enough.
> 
> > >
> > > ><snip>
> > > > diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
> > > > deleted file mode 100644
> > > > index e3d9fc1..0000000
> > > > --- a/lib/librte_acl/acl_run.c
> > > > +++ /dev/null
> > > > @@ -1,944 +0,0 @@
> > > > -/*-
> > > > - *   BSD LICENSE
> > > > - *
> > > > - *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > > - *   All rights reserved.
> > > > - *
> > > > - *   Redistribution and use in source and binary forms, with or without
> > > > - *   modification, are permitted provided that the following conditions
> > > ><snip>
> > > > +
> > > > +#define	__func_resolve_priority__	resolve_priority_scalar
> > > > +#define	__func_match_check__		acl_match_check_scalar
> > > > +#include "acl_match_check.def"
> > > > +
> > > I get that this lets you make some more code common, but it's just unpleasant to trace
> > > through.  Looking at the definition of __func_match_check__ I don't see anything
> > > particularly performance sensitive there.  What if instead you simply redefined
> > > __func_match_check__ in a common internal header as acl_match_check (a generic
> > > function), and had it accept priority resolution function as an argument?  That
> > > would still give you all the performance enhancements without having to include
> > > c files in the middle of other c files, and would make the code a bit more
> > > parseable.
> >
> > Yes, that way it would look much better.
> > And it seems that with '-findirect-inlining' gcc is able to inline them via pointers properly.
> > Will change as you suggested.
> >
> Thank you
> Neil
> 
> > >
> > > > +/*
> > > > + * When processing the transition, rather than using if/else
> > > > + * construct, the offset is calculated for DFA and QRANGE and
> > > > + * then conditionally added to the address based on node type.
> > > > + * This is done to avoid branch mis-predictions. Since the
> > > > + * offset is rather simple calculation it is more efficient
> > > > + * to do the calculation and do a condition move rather than
> > > > + * a conditional branch to determine which calculation to do.
> > > > + */
> > > > +static inline uint32_t
> > > > +scan_forward(uint32_t input, uint32_t max)
> > > > +{
> > > > +	return (input == 0) ? max : rte_bsf32(input);
> > > > +}
> > > > +	}
> > > > +}
> > > ><snip>
> > > > +
> > > > +#define	__func_resolve_priority__	resolve_priority_sse
> > > > +#define	__func_match_check__		acl_match_check_sse
> > > > +#include "acl_match_check.def"
> > > > +
> > > Same deal as above.
> > >
> > > > +/*
> > > > + * Extract transitions from an XMM register and check for any matches
> > > > + */
> > > > +static void
> > > > +acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
> > > > +	struct parms *parms, struct acl_flow_data *flows)
> > > > +{
> > > > +	uint64_t transition1, transition2;
> > > > +
> > > > +	/* extract transition from low 64 bits. */
> > > > +	transition1 = MM_CVT64(*indicies);
> > > > +
> > > > +	/* extract transition from high 64 bits. */
> > > > +	*indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
> > > > +	transition2 = MM_CVT64(*indicies);
> > > > +
> > > > +	transition1 = acl_match_check_sse(transition1, slot, ctx,
> > > > +		parms, flows);
> > > > +	transition2 = acl_match_check_sse(transition2, slot + 1, ctx,
> > > > +		parms, flows);
> > > > +
> > > > +	/* update indicies with new transitions. */
> > > > +	*indicies = MM_SET64(transition2, transition1);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Check for a match in 2 transitions (contained in SSE register)
> > > > + */
> > > > +static inline void
> > > > +acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> > > > +	struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
> > > > +{
> > > > +	xmm_t temp;
> > > > +
> > > > +	temp = MM_AND(match_mask, *indicies);
> > > > +	while (!MM_TESTZ(temp, temp)) {
> > > > +		acl_process_matches(indicies, slot, ctx, parms, flows);
> > > > +		temp = MM_AND(match_mask, *indicies);
> > > > +	}
> > > > +}
> > > > +
> > > > +/*
> > > > + * Check for any match in 4 transitions (contained in 2 SSE registers)
> > > > + */
> > > > +static inline void
> > > > +acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> > > > +	struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
> > > > +	xmm_t match_mask)
> > > > +{
> > > > +	xmm_t temp;
> > > > +
> > > > +	/* put low 32 bits of each transition into one register */
> > > > +	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> > > > +		0x88);
> > > > +	/* test for match node */
> > > > +	temp = MM_AND(match_mask, temp);
> > > > +
> > > > +	while (!MM_TESTZ(temp, temp)) {
> > > > +		acl_process_matches(indicies1, slot, ctx, parms, flows);
> > > > +		acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
> > > > +
> > > > +		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> > > > +					(__m128)*indicies2,
> > > > +					0x88);
> > > > +		temp = MM_AND(match_mask, temp);
> > > > +	}
> > > > +}
> > > > +
> > > > +/*
> > > > + * Calculate the address of the next transition for
> > > > + * all types of nodes. Note that only DFA nodes and range
> > > > + * nodes actually transition to another node. Match
> > > > + * nodes don't move.
> > > > + */
> > > > +static inline xmm_t
> > > > +acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > > > +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > > > +	xmm_t *indicies1, xmm_t *indicies2)
> > > > +{
> > > > +	xmm_t addr, node_types, temp;
> > > > +
> > > > +	/*
> > > > +	 * Note that no transition is done for a match
> > > > +	 * node and therefore a stream freezes when
> > > > +	 * it reaches a match.
> > > > +	 */
> > > > +
> > > > +	/* Shuffle low 32 into temp and high 32 into indicies2 */
> > > > +	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> > > > +		0x88);
> > > > +	*indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> > > > +		(__m128)*indicies2, 0xdd);
> > > > +
> > > > +	/* Calc node type and node addr */
> > > > +	node_types = MM_ANDNOT(index_mask, temp);
> > > > +	addr = MM_AND(index_mask, temp);
> > > > +
> > > > +	/*
> > > > +	 * Calc addr for DFAs - addr = dfa_index + input_byte
> > > > +	 */
> > > > +
> > > > +	/* mask for DFA type (0) nodes */
> > > > +	temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
> > > > +
> > > > +	/* add input byte to DFA position */
> > > > +	temp = MM_AND(temp, bytes);
> > > > +	temp = MM_AND(temp, next_input);
> > > > +	addr = MM_ADD32(addr, temp);
> > > > +
> > > > +	/*
> > > > +	 * Calc addr for Range nodes -> range_index + range(input)
> > > > +	 */
> > > > +	node_types = MM_CMPEQ32(node_types, type_quad_range);
> > > > +
> > > > +	/*
> > > > +	 * Calculate number of range boundaries that are less than the
> > > > +	 * input value. Range boundaries for each node are in signed 8 bit,
> > > > +	 * ordered from -128 to 127 in the indicies2 register.
> > > > +	 * This is effectively a popcnt of bytes that are greater than the
> > > > +	 * input byte.
> > > > +	 */
> > > > +
> > > > +	/* shuffle input byte to all 4 positions of 32 bit value */
> > > > +	temp = MM_SHUFFLE8(next_input, shuffle_input);
> > > > +
> > > > +	/* check ranges */
> > > > +	temp = MM_CMPGT8(temp, *indicies2);
> > > > +
> > > > +	/* convert -1 to 1 (bytes greater than input byte) */
> > > > +	temp = MM_SIGN8(temp, temp);
> > > > +
> > > > +	/* horizontal add pairs of bytes into words */
> > > > +	temp = MM_MADD8(temp, temp);
> > > > +
> > > > +	/* horizontal add pairs of words into dwords */
> > > > +	temp = MM_MADD16(temp, ones_16);
> > > > +
> > > > +	/* mask to range type nodes */
> > > > +	temp = MM_AND(temp, node_types);
> > > > +
> > > > +	/* add index into node position */
> > > > +	return MM_ADD32(addr, temp);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Process 4 transitions (in 2 SIMD registers) in parallel
> > > > + */
> > > > +static inline xmm_t
> > > > +transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > > > +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > > > +	const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
> > > > +{
> > > > +	xmm_t addr;
> > > > +	uint64_t trans0, trans2;
> > > > +
> > > > +	 /* Calculate the address (array index) for all 4 transitions. */
> > > > +
> > > > +	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> > > > +		bytes, type_quad_range, indicies1, indicies2);
> > > > +
> > > > +	 /* Gather 64 bit transitions and pack back into 2 registers. */
> > > > +
> > > > +	trans0 = trans[MM_CVT32(addr)];
> > > > +
> > > > +	/* get slot 2 */
> > > > +
> > > > +	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
> > > > +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
> > > > +	trans2 = trans[MM_CVT32(addr)];
> > > > +
> > > > +	/* get slot 1 */
> > > > +
> > > > +	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
> > > > +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> > > > +	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
> > > > +
> > > > +	/* get slot 3 */
> > > > +
> > > > +	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
> > > > +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
> > > > +	*indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
> > > > +
> > > > +	return MM_SRL32(next_input, 8);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Execute trie traversal with 8 traversals in parallel
> > > > + */
> > > > +static inline int
> > > > +search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > +	uint32_t *results, uint32_t total_packets, uint32_t categories)
> > > > +{
> > > > +	int n;
> > > > +	struct acl_flow_data flows;
> > > > +	uint64_t index_array[MAX_SEARCHES_SSE8];
> > > > +	struct completion cmplt[MAX_SEARCHES_SSE8];
> > > > +	struct parms parms[MAX_SEARCHES_SSE8];
> > > > +	xmm_t input0, input1;
> > > > +	xmm_t indicies1, indicies2, indicies3, indicies4;
> > > > +
> > > > +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > > > +		total_packets, categories, ctx->trans_table);
> > > > +
> > > > +	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
> > > > +		cmplt[n].count = 0;
> > > > +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > > > +	}
> > > > +
> > > > +	/*
> > > > +	 * indicies1 contains index_array[0,1]
> > > > +	 * indicies2 contains index_array[2,3]
> > > > +	 * indicies3 contains index_array[4,5]
> > > > +	 * indicies4 contains index_array[6,7]
> > > > +	 */
> > > > +
> > > > +	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> > > > +	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> > > > +
> > > > +	indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
> > > > +	indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
> > > > +
> > > > +	 /* Check for any matches. */
> > > > +	acl_match_check_x4(0, ctx, parms, &flows,
> > > > +		&indicies1, &indicies2, mm_match_mask.m);
> > > > +	acl_match_check_x4(4, ctx, parms, &flows,
> > > > +		&indicies3, &indicies4, mm_match_mask.m);
> > > > +
> > > > +	while (flows.started > 0) {
> > > > +
> > > > +		/* Gather 4 bytes of input data for each stream. */
> > > > +		input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
> > > > +			0);
> > > > +		input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
> > > > +			0);
> > > > +
> > > > +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
> > > > +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
> > > > +
> > > > +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
> > > > +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
> > > > +
> > > > +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
> > > > +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
> > > > +
> > > > +		 /* Process the 4 bytes of input on each stream. */
> > > > +
> > > > +		input0 = transition4(mm_index_mask.m, input0,
> > > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > > +			mm_bytes.m, mm_type_quad_range.m,
> > > > +			flows.trans, &indicies1, &indicies2);
> > > > +
> > > > +		input1 = transition4(mm_index_mask.m, input1,
> > > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > > +			mm_bytes.m, mm_type_quad_range.m,
> > > > +			flows.trans, &indicies3, &indicies4);
> > > > +
> > > > +		input0 = transition4(mm_index_mask.m, input0,
> > > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > > +			mm_bytes.m, mm_type_quad_range.m,
> > > > +			flows.trans, &indicies1, &indicies2);
> > > > +
> > > > +		input1 = transition4(mm_index_mask.m, input1,
> > > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > > +			mm_bytes.m, mm_type_quad_range.m,
> > > > +			flows.trans, &indicies3, &indicies4);
> > > > +
> > > > +		input0 = transition4(mm_index_mask.m, input0,
> > > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > > +			mm_bytes.m, mm_type_quad_range.m,
> > > > +			flows.trans, &indicies1, &indicies2);
> > > > +
> > > > +		input1 = transition4(mm_index_mask.m, input1,
> > > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > > +			mm_bytes.m, mm_type_quad_range.m,
> > > > +			flows.trans, &indicies3, &indicies4);
> > > > +
> > > > +		input0 = transition4(mm_index_mask.m, input0,
> > > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > > +			mm_bytes.m, mm_type_quad_range.m,
> > > > +			flows.trans, &indicies1, &indicies2);
> > > > +
> > > > +		input1 = transition4(mm_index_mask.m, input1,
> > > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > > +			mm_bytes.m, mm_type_quad_range.m,
> > > > +			flows.trans, &indicies3, &indicies4);
> > > > +
> > > > +		 /* Check for any matches. */
> > > > +		acl_match_check_x4(0, ctx, parms, &flows,
> > > > +			&indicies1, &indicies2, mm_match_mask.m);
> > > > +		acl_match_check_x4(4, ctx, parms, &flows,
> > > > +			&indicies3, &indicies4, mm_match_mask.m);
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +/*
> > > > + * Execute trie traversal with 4 traversals in parallel
> > > > + */
> > > > +static inline int
> > > > +search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > +	 uint32_t *results, int total_packets, uint32_t categories)
> > > > +{
> > > > +	int n;
> > > > +	struct acl_flow_data flows;
> > > > +	uint64_t index_array[MAX_SEARCHES_SSE4];
> > > > +	struct completion cmplt[MAX_SEARCHES_SSE4];
> > > > +	struct parms parms[MAX_SEARCHES_SSE4];
> > > > +	xmm_t input, indicies1, indicies2;
> > > > +
> > > > +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > > > +		total_packets, categories, ctx->trans_table);
> > > > +
> > > > +	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
> > > > +		cmplt[n].count = 0;
> > > > +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > > > +	}
> > > > +
> > > > +	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> > > > +	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> > > > +
> > > > +	/* Check for any matches. */
> > > > +	acl_match_check_x4(0, ctx, parms, &flows,
> > > > +		&indicies1, &indicies2, mm_match_mask.m);
> > > > +
> > > > +	while (flows.started > 0) {
> > > > +
> > > > +		/* Gather 4 bytes of input data for each stream. */
> > > > +		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> > > > +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> > > > +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
> > > > +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
> > > > +
> > > > +		/* Process the 4 bytes of input on each stream. */
> > > > +		input = transition4(mm_index_mask.m, input,
> > > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > > +			mm_bytes.m, mm_type_quad_range.m,
> > > > +			flows.trans, &indicies1, &indicies2);
> > > > +
> > > > +		 input = transition4(mm_index_mask.m, input,
> > > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > > +			mm_bytes.m, mm_type_quad_range.m,
> > > > +			flows.trans, &indicies1, &indicies2);
> > > > +
> > > > +		 input = transition4(mm_index_mask.m, input,
> > > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > > +			mm_bytes.m, mm_type_quad_range.m,
> > > > +			flows.trans, &indicies1, &indicies2);
> > > > +
> > > > +		 input = transition4(mm_index_mask.m, input,
> > > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > > +			mm_bytes.m, mm_type_quad_range.m,
> > > > +			flows.trans, &indicies1, &indicies2);
> > > > +
> > > > +		/* Check for any matches. */
> > > > +		acl_match_check_x4(0, ctx, parms, &flows,
> > > > +			&indicies1, &indicies2, mm_match_mask.m);
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static inline xmm_t
> > > > +transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > > > +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > > > +	const uint64_t *trans, xmm_t *indicies1)
> > > > +{
> > > > +	uint64_t t;
> > > > +	xmm_t addr, indicies2;
> > > > +
> > > > +	indicies2 = MM_XOR(ones_16, ones_16);
> > > > +
> > > > +	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> > > > +		bytes, type_quad_range, indicies1, &indicies2);
> > > > +
> > > > +	/* Gather 64 bit transitions and pack 2 per register. */
> > > > +
> > > > +	t = trans[MM_CVT32(addr)];
> > > > +
> > > > +	/* get slot 1 */
> > > > +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> > > > +	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
> > > > +
> > > > +	return MM_SRL32(next_input, 8);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Execute trie traversal with 2 traversals in parallel.
> > > > + */
> > > > +static inline int
> > > > +search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > +	uint32_t *results, uint32_t total_packets, uint32_t categories)
> > > > +{
> > > > +	int n;
> > > > +	struct acl_flow_data flows;
> > > > +	uint64_t index_array[MAX_SEARCHES_SSE2];
> > > > +	struct completion cmplt[MAX_SEARCHES_SSE2];
> > > > +	struct parms parms[MAX_SEARCHES_SSE2];
> > > > +	xmm_t input, indicies;
> > > > +
> > > > +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > > > +		total_packets, categories, ctx->trans_table);
> > > > +
> > > > +	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
> > > > +		cmplt[n].count = 0;
> > > > +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > > > +	}
> > > > +
> > > > +	indicies = MM_LOADU((xmm_t *) &index_array[0]);
> > > > +
> > > > +	/* Check for any matches. */
> > > > +	acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
> > > > +
> > > > +	while (flows.started > 0) {
> > > > +
> > > > +		/* Gather 4 bytes of input data for each stream. */
> > > > +		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> > > > +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> > > > +
> > > > +		/* Process the 4 bytes of input on each stream. */
> > > > +
> > > > +		input = transition2(mm_index_mask64.m, input,
> > > > +			mm_shuffle_input64.m, mm_ones_16.m,
> > > > +			mm_bytes64.m, mm_type_quad_range64.m,
> > > > +			flows.trans, &indicies);
> > > > +
> > > > +		input = transition2(mm_index_mask64.m, input,
> > > > +			mm_shuffle_input64.m, mm_ones_16.m,
> > > > +			mm_bytes64.m, mm_type_quad_range64.m,
> > > > +			flows.trans, &indicies);
> > > > +
> > > > +		input = transition2(mm_index_mask64.m, input,
> > > > +			mm_shuffle_input64.m, mm_ones_16.m,
> > > > +			mm_bytes64.m, mm_type_quad_range64.m,
> > > > +			flows.trans, &indicies);
> > > > +
> > > > +		input = transition2(mm_index_mask64.m, input,
> > > > +			mm_shuffle_input64.m, mm_ones_16.m,
> > > > +			mm_bytes64.m, mm_type_quad_range64.m,
> > > > +			flows.trans, &indicies);
> > > > +
> > > > +		/* Check for any matches. */
> > > > +		acl_match_check_x2(0, ctx, parms, &flows, &indicies,
> > > > +			mm_match_mask64.m);
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +int
> > > > +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > > +	uint32_t *results, uint32_t num, uint32_t categories)
> > > > +{
> > > > +	if (categories != 1 &&
> > > > +		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> > > > +		return -EINVAL;
> > > > +
> > > > +	if (likely(num >= MAX_SEARCHES_SSE8))
> > > > +		return search_sse_8(ctx, data, results, num, categories);
> > > > +	else if (num >= MAX_SEARCHES_SSE4)
> > > > +		return search_sse_4(ctx, data, results, num, categories);
> > > > +	else
> > > > +		return search_sse_2(ctx, data, results, num, categories);
> > > > +}
> > > > diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
> > > > index 7c288bd..0cde07e 100644
> > > > --- a/lib/librte_acl/rte_acl.c
> > > > +++ b/lib/librte_acl/rte_acl.c
> > > > @@ -38,6 +38,21 @@
> > > >
> > > >  TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
> > > >
> > > > +/* by default, use the always-available scalar code path. */
> > > > +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
> > > > +
> > > make this static, the outside world shouldn't need to see it.
> >
> > As I said above, I think it more plausible to keep it globally visible.
> >
> > >
> > > > +void __attribute__((constructor(INT16_MAX)))
> > > > +rte_acl_select_classify(void)
> > > Make it static, the outside world doesn't need to call this.
> >
> > See above, I would like the user to have the ability to call it manually if needed.
> >
> > >
> > > > +{
> > > > +	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) {
> > > > +		/* SSE version requires SSE4.1 */
> > > > +		rte_acl_default_classify = rte_acl_classify_sse;
> > > > +	} else {
> > > > +		/* reset to scalar version. */
> > > > +		rte_acl_default_classify = rte_acl_classify_scalar;
> > > Don't need the else clause here, the static initializer has you covered.
> >
> > I think we'd better keep it like that - in case the user calls it manually.
> > We always reset rte_acl_default_classify to the 'best proper' value.
> >
> > > > +	}
> > > > +}
> > > > +
> > > > +
> > > > +/**
> > > > + * Invokes default rte_acl_classify function.
> > > > + */
> > > > +extern rte_acl_classify_t rte_acl_default_classify;
> > > > +
> > > Doesn't need to be extern.
> > > > +#define	rte_acl_classify(ctx, data, results, num, categories)	\
> > > > +	(*rte_acl_default_classify)(ctx, data, results, num, categories)
> > > > +
> > > Not sure why you need this either.  The rte_acl_classify_t should be enough, no?
> >
> > We preserve the existing rte_acl_classify() API, so users don't need to modify their code.
> >
> This would be a great candidate for versioning (Bruce and I have been discussing
> this).
> 
> Neil
> 
> >
^ permalink raw reply	[relevance 3%]
* Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
  2014-08-08 11:49  0%   ` Ananyev, Konstantin
@ 2014-08-08 12:25  4%     ` Neil Horman
  2014-08-08 13:09  3%       ` Ananyev, Konstantin
  0 siblings, 1 reply; 86+ results
From: Neil Horman @ 2014-08-08 12:25 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev
On Fri, Aug 08, 2014 at 11:49:58AM +0000, Ananyev, Konstantin wrote:
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Thursday, August 07, 2014 9:12 PM
> > To: Ananyev, Konstantin
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
> > 
> > On Thu, Aug 07, 2014 at 07:31:03PM +0100, Konstantin Ananyev wrote:
> > > Make ACL library to build/work on 'default' architecture:
> > > - make rte_acl_classify_scalar really scalar
> > >  (make sure it wouldn't use sse4 intrinsics through resolve_priority()).
> > > - Provide two versions of rte_acl_classify code path:
> > >   rte_acl_classify_sse() - could be built and used only on systems with sse4.2
> > >   and upper, return -ENOTSUP on lower arch.
> > >   rte_acl_classify_scalar() - a slower version, but could be built and used
> > >   on all systems.
> > > - keep common code shared between these two codepaths.
> > >
> > > v2 changes:
> > >  run-time selection of the most appropriate code path for a given ISA.
> > >  By default the highest supported one is selected.
> > >  User can still override that selection by manually assigning new value to
> > >  the global function pointer rte_acl_default_classify.
> > >  rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
> > >  points to.
> > >
> > >
> > > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > 
> > This is a lot better, thank you.  A few remaining issues.
> 
> My comments inline too.
> Thanks
> Konstantin
> 
> > 
> > > ---
> > >  app/test-acl/main.c                |  13 +-
> > >  lib/librte_acl/Makefile            |   5 +-
> > >  lib/librte_acl/acl_bld.c           |   5 +-
> > >  lib/librte_acl/acl_match_check.def |  92 ++++
> > >  lib/librte_acl/acl_run.c           | 944 -------------------------------------
> > >  lib/librte_acl/acl_run.h           | 220 +++++++++
> > >  lib/librte_acl/acl_run_scalar.c    | 197 ++++++++
> > >  lib/librte_acl/acl_run_sse.c       | 630 +++++++++++++++++++++++++
> > >  lib/librte_acl/rte_acl.c           |  15 +
> > >  lib/librte_acl/rte_acl.h           |  24 +-
> > >  10 files changed, 1189 insertions(+), 956 deletions(-)
> > >  create mode 100644 lib/librte_acl/acl_match_check.def
> > >  delete mode 100644 lib/librte_acl/acl_run.c
> > >  create mode 100644 lib/librte_acl/acl_run.h
> > >  create mode 100644 lib/librte_acl/acl_run_scalar.c
> > >  create mode 100644 lib/librte_acl/acl_run_sse.c
> > >
> > > diff --git a/app/test-acl/main.c b/app/test-acl/main.c
> > > index d654409..45c6fa6 100644
> > > --- a/app/test-acl/main.c
> > > +++ b/app/test-acl/main.c
> > > @@ -787,6 +787,10 @@ acx_init(void)
> > >  	/* perform build. */
> > >  	ret = rte_acl_build(config.acx, &cfg);
> > >
> > > +	/* setup default rte_acl_classify */
> > > +	if (config.scalar)
> > > +		rte_acl_default_classify = rte_acl_classify_scalar;
> > > +
> > Exporting this variable as part of the ABI is a bad idea.  If the prototype of
> > the function changes, you have to update all your applications.
> 
> If the prototype of rte_acl_classify changes, you'll most likely have to update the code that uses it anyway.
> 
Why?  If you hide this from the application, changes to the internal
implementation will also be invisible.  When building as a DSO, an application
will be able to transition between libraries without the need for a rebuild.
> >  Make the pointer
> > an internal symbol and set it using a get/set routine with an enum to represent
> > the path to choose.  That will help isolate the ABI from the internal
> > implementation. 
> 
> That was my first intention too.
> But then I realised that if we make it internal, then we'll need to make rte_acl_classify() a proper function
> and it will cost us an extra call (or jump).
That's true, but I don't see that as a problem.  We're not talking about a hot
code path here, it's a setup function.  Or do you think that an application will
be switching between classification functions on every classify operation?
> Also, I think the user should have the ability to change the default classify code path without modifying/rebuilding the ACL library.
I agree, but both the methods we are advocating for allow that.  It's really just
a question of exposing the mechanism as data or text in the binary.  Exposing it
as data comes with implicit ABI constraints that are less prevalent when done
as code entry points.
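To make the shape of what I'm proposing concrete, something along these lines
(a rough sketch only, all names here are hypothetical):

	enum rte_acl_classify_alg {
		RTE_ACL_CLASSIFY_SCALAR,
		RTE_ACL_CLASSIFY_SSE,
	};

	/* returns 0 on success, -EINVAL for an unknown algorithm,
	 * -ENOTSUP if the running cpu can't execute it */
	int rte_acl_set_default_classify(enum rte_acl_classify_alg alg);

	enum rte_acl_classify_alg rte_acl_get_default_classify(void);

The function pointer itself stays static inside rte_acl.c, the set routine can
validate the requested path against rte_cpu_get_flag_enabled() before switching,
and new enum values can be added later without touching the ABI.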
> For example: a bug in an optimised code path is discovered, or the user may want to implement and use their own version of classify().
In the case of a bug in the optimized path, you just fix the bug.  If you want
to provide your own classification function, that's fine I suppose, but that
seems completely outside the scope of what we're trying to do here.  It's not
advantageous to just throw that in there.  If you want to be able to provide
your own classifier function, let's at least take some time to make sure that the
function prototype is sufficiently capable of accepting all the data you might want
to pass it in the future, before we go exposing it.  Otherwise you'll have to
break the ABI in future versions, which is something we've been discussing
trying to avoid.
> > It will also let you prevent things like selecting a run time
> > path that is incompatible with the running system
> 
> If the user is going to update rte_acl_default_classify, he is probably smart enough to know what he is doing.
That really seems like poor design to me.  I don't see why you wouldn't at least
want to warn the developer of an application if they were to assign, at run time,
a default classifier method that is incompatible with the running system.  Yes,
they're likely smart enough to know what they're doing, but smart people make
mistakes, and appreciate being told when they're doing so, especially if the
method of telling is something a bit more civil than a machine check that
might occur well after the application has been initialized.
> On the other hand, the user can hit the same problem by simply calling rte_acl_classify_sse() directly.
Not if the function is statically declared and not exposed to the application,
they can't :)
> 
> > and prevent path switching
> > during searches, which may produce unexpected results.
> 
> Not that I am advertising it, but it should be safe to update rte_acl_default_classify during searches:
> All versions of classify should produce exactly the same result for each input packet and treat the ACL context as read-only.
> 
Fair enough.
> > 
> > ><snip>
> > > diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
> > > deleted file mode 100644
> > > index e3d9fc1..0000000
> > > --- a/lib/librte_acl/acl_run.c
> > > +++ /dev/null
> > > @@ -1,944 +0,0 @@
> > > -/*-
> > > - *   BSD LICENSE
> > > - *
> > > - *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > > - *   All rights reserved.
> > > - *
> > > - *   Redistribution and use in source and binary forms, with or without
> > > - *   modification, are permitted provided that the following conditions
> > ><snip>
> > > +
> > > +#define	__func_resolve_priority__	resolve_priority_scalar
> > > +#define	__func_match_check__		acl_match_check_scalar
> > > +#include "acl_match_check.def"
> > > +
> > I get this lets you make some more code common, but it's just unpleasant to trace
> > through.  Looking at the definition of __func_match_check__ I don't see anything
> > particularly performance sensitive there.  What if instead you simply redefined
> > __func_match_check__ in a common internal header as acl_match_check (a generic
> > function), and had it accept a priority resolution function as an argument?  That
> > would still give you all the performance enhancements without having to include
> > c files in the middle of other c files, and would make the code a bit more
> > parseable.
> 
> Yes, that way it would look much better.
> And it seems that with '-findirect-inlining' gcc is able to inline them via pointers properly.
> Will change as you suggested. 
> 
Thank you
Neil
> > 
> > > +/*
> > > + * When processing the transition, rather than using if/else
> > > + * construct, the offset is calculated for DFA and QRANGE and
> > > + * then conditionally added to the address based on node type.
> > > + * This is done to avoid branch mis-predictions. Since the
> > > + * offset is a rather simple calculation, it is more efficient
> > > + * to do the calculation and do a conditional move rather than
> > > + * a conditional branch to determine which calculation to do.
> > > + */
> > > +static inline uint32_t
> > > +scan_forward(uint32_t input, uint32_t max)
> > > +{
> > > +	return (input == 0) ? max : rte_bsf32(input);
> > > +}
> > > +	}
> > > +}
> > ><snip>
> > > +
> > > +#define	__func_resolve_priority__	resolve_priority_sse
> > > +#define	__func_match_check__		acl_match_check_sse
> > > +#include "acl_match_check.def"
> > > +
> > Same deal as above.
> > 
> > > +/*
> > > + * Extract transitions from an XMM register and check for any matches
> > > + */
> > > +static void
> > > +acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
> > > +	struct parms *parms, struct acl_flow_data *flows)
> > > +{
> > > +	uint64_t transition1, transition2;
> > > +
> > > +	/* extract transition from low 64 bits. */
> > > +	transition1 = MM_CVT64(*indicies);
> > > +
> > > +	/* extract transition from high 64 bits. */
> > > +	*indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
> > > +	transition2 = MM_CVT64(*indicies);
> > > +
> > > +	transition1 = acl_match_check_sse(transition1, slot, ctx,
> > > +		parms, flows);
> > > +	transition2 = acl_match_check_sse(transition2, slot + 1, ctx,
> > > +		parms, flows);
> > > +
> > > +	/* update indicies with new transitions. */
> > > +	*indicies = MM_SET64(transition2, transition1);
> > > +}
> > > +
> > > +/*
> > > + * Check for a match in 2 transitions (contained in SSE register)
> > > + */
> > > +static inline void
> > > +acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> > > +	struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
> > > +{
> > > +	xmm_t temp;
> > > +
> > > +	temp = MM_AND(match_mask, *indicies);
> > > +	while (!MM_TESTZ(temp, temp)) {
> > > +		acl_process_matches(indicies, slot, ctx, parms, flows);
> > > +		temp = MM_AND(match_mask, *indicies);
> > > +	}
> > > +}
> > > +
> > > +/*
> > > + * Check for any match in 4 transitions (contained in 2 SSE registers)
> > > + */
> > > +static inline void
> > > +acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> > > +	struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
> > > +	xmm_t match_mask)
> > > +{
> > > +	xmm_t temp;
> > > +
> > > +	/* put low 32 bits of each transition into one register */
> > > +	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> > > +		0x88);
> > > +	/* test for match node */
> > > +	temp = MM_AND(match_mask, temp);
> > > +
> > > +	while (!MM_TESTZ(temp, temp)) {
> > > +		acl_process_matches(indicies1, slot, ctx, parms, flows);
> > > +		acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
> > > +
> > > +		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> > > +					(__m128)*indicies2,
> > > +					0x88);
> > > +		temp = MM_AND(match_mask, temp);
> > > +	}
> > > +}
> > > +
> > > +/*
> > > + * Calculate the address of the next transition for
> > > + * all types of nodes. Note that only DFA nodes and range
> > > + * nodes actually transition to another node. Match
> > > + * nodes don't move.
> > > + */
> > > +static inline xmm_t
> > > +acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > > +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > > +	xmm_t *indicies1, xmm_t *indicies2)
> > > +{
> > > +	xmm_t addr, node_types, temp;
> > > +
> > > +	/*
> > > +	 * Note that no transition is done for a match
> > > +	 * node and therefore a stream freezes when
> > > +	 * it reaches a match.
> > > +	 */
> > > +
> > > +	/* Shuffle low 32 into temp and high 32 into indicies2 */
> > > +	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> > > +		0x88);
> > > +	*indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> > > +		(__m128)*indicies2, 0xdd);
> > > +
> > > +	/* Calc node type and node addr */
> > > +	node_types = MM_ANDNOT(index_mask, temp);
> > > +	addr = MM_AND(index_mask, temp);
> > > +
> > > +	/*
> > > +	 * Calc addr for DFAs - addr = dfa_index + input_byte
> > > +	 */
> > > +
> > > +	/* mask for DFA type (0) nodes */
> > > +	temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
> > > +
> > > +	/* add input byte to DFA position */
> > > +	temp = MM_AND(temp, bytes);
> > > +	temp = MM_AND(temp, next_input);
> > > +	addr = MM_ADD32(addr, temp);
> > > +
> > > +	/*
> > > +	 * Calc addr for Range nodes -> range_index + range(input)
> > > +	 */
> > > +	node_types = MM_CMPEQ32(node_types, type_quad_range);
> > > +
> > > +	/*
> > > +	 * Calculate number of range boundaries that are less than the
> > > +	 * input value. Range boundaries for each node are in signed 8 bit,
> > > +	 * ordered from -128 to 127 in the indicies2 register.
> > > +	 * This is effectively a popcnt of bytes that are greater than the
> > > +	 * input byte.
> > > +	 */
> > > +
> > > +	/* shuffle input byte to all 4 positions of 32 bit value */
> > > +	temp = MM_SHUFFLE8(next_input, shuffle_input);
> > > +
> > > +	/* check ranges */
> > > +	temp = MM_CMPGT8(temp, *indicies2);
> > > +
> > > +	/* convert -1 to 1 (bytes greater than input byte) */
> > > +	temp = MM_SIGN8(temp, temp);
> > > +
> > > +	/* horizontal add pairs of bytes into words */
> > > +	temp = MM_MADD8(temp, temp);
> > > +
> > > +	/* horizontal add pairs of words into dwords */
> > > +	temp = MM_MADD16(temp, ones_16);
> > > +
> > > +	/* mask to range type nodes */
> > > +	temp = MM_AND(temp, node_types);
> > > +
> > > +	/* add index into node position */
> > > +	return MM_ADD32(addr, temp);
> > > +}
> > > +
> > > +/*
> > > + * Process 4 transitions (in 2 SIMD registers) in parallel
> > > + */
> > > +static inline xmm_t
> > > +transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > > +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > > +	const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
> > > +{
> > > +	xmm_t addr;
> > > +	uint64_t trans0, trans2;
> > > +
> > > +	 /* Calculate the address (array index) for all 4 transitions. */
> > > +
> > > +	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> > > +		bytes, type_quad_range, indicies1, indicies2);
> > > +
> > > +	 /* Gather 64 bit transitions and pack back into 2 registers. */
> > > +
> > > +	trans0 = trans[MM_CVT32(addr)];
> > > +
> > > +	/* get slot 2 */
> > > +
> > > +	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
> > > +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
> > > +	trans2 = trans[MM_CVT32(addr)];
> > > +
> > > +	/* get slot 1 */
> > > +
> > > +	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
> > > +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> > > +	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
> > > +
> > > +	/* get slot 3 */
> > > +
> > > +	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
> > > +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
> > > +	*indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
> > > +
> > > +	return MM_SRL32(next_input, 8);
> > > +}
> > > +
> > > +/*
> > > + * Execute trie traversal with 8 traversals in parallel
> > > + */
> > > +static inline int
> > > +search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > +	uint32_t *results, uint32_t total_packets, uint32_t categories)
> > > +{
> > > +	int n;
> > > +	struct acl_flow_data flows;
> > > +	uint64_t index_array[MAX_SEARCHES_SSE8];
> > > +	struct completion cmplt[MAX_SEARCHES_SSE8];
> > > +	struct parms parms[MAX_SEARCHES_SSE8];
> > > +	xmm_t input0, input1;
> > > +	xmm_t indicies1, indicies2, indicies3, indicies4;
> > > +
> > > +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > > +		total_packets, categories, ctx->trans_table);
> > > +
> > > +	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
> > > +		cmplt[n].count = 0;
> > > +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > > +	}
> > > +
> > > +	/*
> > > +	 * indicies1 contains index_array[0,1]
> > > +	 * indicies2 contains index_array[2,3]
> > > +	 * indicies3 contains index_array[4,5]
> > > +	 * indicies4 contains index_array[6,7]
> > > +	 */
> > > +
> > > +	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> > > +	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> > > +
> > > +	indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
> > > +	indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
> > > +
> > > +	 /* Check for any matches. */
> > > +	acl_match_check_x4(0, ctx, parms, &flows,
> > > +		&indicies1, &indicies2, mm_match_mask.m);
> > > +	acl_match_check_x4(4, ctx, parms, &flows,
> > > +		&indicies3, &indicies4, mm_match_mask.m);
> > > +
> > > +	while (flows.started > 0) {
> > > +
> > > +		/* Gather 4 bytes of input data for each stream. */
> > > +		input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
> > > +			0);
> > > +		input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
> > > +			0);
> > > +
> > > +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
> > > +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
> > > +
> > > +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
> > > +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
> > > +
> > > +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
> > > +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
> > > +
> > > +		 /* Process the 4 bytes of input on each stream. */
> > > +
> > > +		input0 = transition4(mm_index_mask.m, input0,
> > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > +			mm_bytes.m, mm_type_quad_range.m,
> > > +			flows.trans, &indicies1, &indicies2);
> > > +
> > > +		input1 = transition4(mm_index_mask.m, input1,
> > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > +			mm_bytes.m, mm_type_quad_range.m,
> > > +			flows.trans, &indicies3, &indicies4);
> > > +
> > > +		input0 = transition4(mm_index_mask.m, input0,
> > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > +			mm_bytes.m, mm_type_quad_range.m,
> > > +			flows.trans, &indicies1, &indicies2);
> > > +
> > > +		input1 = transition4(mm_index_mask.m, input1,
> > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > +			mm_bytes.m, mm_type_quad_range.m,
> > > +			flows.trans, &indicies3, &indicies4);
> > > +
> > > +		input0 = transition4(mm_index_mask.m, input0,
> > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > +			mm_bytes.m, mm_type_quad_range.m,
> > > +			flows.trans, &indicies1, &indicies2);
> > > +
> > > +		input1 = transition4(mm_index_mask.m, input1,
> > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > +			mm_bytes.m, mm_type_quad_range.m,
> > > +			flows.trans, &indicies3, &indicies4);
> > > +
> > > +		input0 = transition4(mm_index_mask.m, input0,
> > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > +			mm_bytes.m, mm_type_quad_range.m,
> > > +			flows.trans, &indicies1, &indicies2);
> > > +
> > > +		input1 = transition4(mm_index_mask.m, input1,
> > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > +			mm_bytes.m, mm_type_quad_range.m,
> > > +			flows.trans, &indicies3, &indicies4);
> > > +
> > > +		 /* Check for any matches. */
> > > +		acl_match_check_x4(0, ctx, parms, &flows,
> > > +			&indicies1, &indicies2, mm_match_mask.m);
> > > +		acl_match_check_x4(4, ctx, parms, &flows,
> > > +			&indicies3, &indicies4, mm_match_mask.m);
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/*
> > > + * Execute trie traversal with 4 traversals in parallel
> > > + */
> > > +static inline int
> > > +search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > +	 uint32_t *results, int total_packets, uint32_t categories)
> > > +{
> > > +	int n;
> > > +	struct acl_flow_data flows;
> > > +	uint64_t index_array[MAX_SEARCHES_SSE4];
> > > +	struct completion cmplt[MAX_SEARCHES_SSE4];
> > > +	struct parms parms[MAX_SEARCHES_SSE4];
> > > +	xmm_t input, indicies1, indicies2;
> > > +
> > > +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > > +		total_packets, categories, ctx->trans_table);
> > > +
> > > +	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
> > > +		cmplt[n].count = 0;
> > > +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > > +	}
> > > +
> > > +	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> > > +	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> > > +
> > > +	/* Check for any matches. */
> > > +	acl_match_check_x4(0, ctx, parms, &flows,
> > > +		&indicies1, &indicies2, mm_match_mask.m);
> > > +
> > > +	while (flows.started > 0) {
> > > +
> > > +		/* Gather 4 bytes of input data for each stream. */
> > > +		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> > > +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> > > +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
> > > +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
> > > +
> > > +		/* Process the 4 bytes of input on each stream. */
> > > +		input = transition4(mm_index_mask.m, input,
> > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > +			mm_bytes.m, mm_type_quad_range.m,
> > > +			flows.trans, &indicies1, &indicies2);
> > > +
> > > +		 input = transition4(mm_index_mask.m, input,
> > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > +			mm_bytes.m, mm_type_quad_range.m,
> > > +			flows.trans, &indicies1, &indicies2);
> > > +
> > > +		 input = transition4(mm_index_mask.m, input,
> > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > +			mm_bytes.m, mm_type_quad_range.m,
> > > +			flows.trans, &indicies1, &indicies2);
> > > +
> > > +		 input = transition4(mm_index_mask.m, input,
> > > +			mm_shuffle_input.m, mm_ones_16.m,
> > > +			mm_bytes.m, mm_type_quad_range.m,
> > > +			flows.trans, &indicies1, &indicies2);
> > > +
> > > +		/* Check for any matches. */
> > > +		acl_match_check_x4(0, ctx, parms, &flows,
> > > +			&indicies1, &indicies2, mm_match_mask.m);
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static inline xmm_t
> > > +transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > > +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > > +	const uint64_t *trans, xmm_t *indicies1)
> > > +{
> > > +	uint64_t t;
> > > +	xmm_t addr, indicies2;
> > > +
> > > +	indicies2 = MM_XOR(ones_16, ones_16);
> > > +
> > > +	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> > > +		bytes, type_quad_range, indicies1, &indicies2);
> > > +
> > > +	/* Gather 64 bit transitions and pack 2 per register. */
> > > +
> > > +	t = trans[MM_CVT32(addr)];
> > > +
> > > +	/* get slot 1 */
> > > +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> > > +	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
> > > +
> > > +	return MM_SRL32(next_input, 8);
> > > +}
> > > +
> > > +/*
> > > + * Execute trie traversal with 2 traversals in parallel.
> > > + */
> > > +static inline int
> > > +search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > +	uint32_t *results, uint32_t total_packets, uint32_t categories)
> > > +{
> > > +	int n;
> > > +	struct acl_flow_data flows;
> > > +	uint64_t index_array[MAX_SEARCHES_SSE2];
> > > +	struct completion cmplt[MAX_SEARCHES_SSE2];
> > > +	struct parms parms[MAX_SEARCHES_SSE2];
> > > +	xmm_t input, indicies;
> > > +
> > > +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > > +		total_packets, categories, ctx->trans_table);
> > > +
> > > +	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
> > > +		cmplt[n].count = 0;
> > > +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > > +	}
> > > +
> > > +	indicies = MM_LOADU((xmm_t *) &index_array[0]);
> > > +
> > > +	/* Check for any matches. */
> > > +	acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
> > > +
> > > +	while (flows.started > 0) {
> > > +
> > > +		/* Gather 4 bytes of input data for each stream. */
> > > +		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> > > +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> > > +
> > > +		/* Process the 4 bytes of input on each stream. */
> > > +
> > > +		input = transition2(mm_index_mask64.m, input,
> > > +			mm_shuffle_input64.m, mm_ones_16.m,
> > > +			mm_bytes64.m, mm_type_quad_range64.m,
> > > +			flows.trans, &indicies);
> > > +
> > > +		input = transition2(mm_index_mask64.m, input,
> > > +			mm_shuffle_input64.m, mm_ones_16.m,
> > > +			mm_bytes64.m, mm_type_quad_range64.m,
> > > +			flows.trans, &indicies);
> > > +
> > > +		input = transition2(mm_index_mask64.m, input,
> > > +			mm_shuffle_input64.m, mm_ones_16.m,
> > > +			mm_bytes64.m, mm_type_quad_range64.m,
> > > +			flows.trans, &indicies);
> > > +
> > > +		input = transition2(mm_index_mask64.m, input,
> > > +			mm_shuffle_input64.m, mm_ones_16.m,
> > > +			mm_bytes64.m, mm_type_quad_range64.m,
> > > +			flows.trans, &indicies);
> > > +
> > > +		/* Check for any matches. */
> > > +		acl_match_check_x2(0, ctx, parms, &flows, &indicies,
> > > +			mm_match_mask64.m);
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +int
> > > +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > > +	uint32_t *results, uint32_t num, uint32_t categories)
> > > +{
> > > +	if (categories != 1 &&
> > > +		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> > > +		return -EINVAL;
> > > +
> > > +	if (likely(num >= MAX_SEARCHES_SSE8))
> > > +		return search_sse_8(ctx, data, results, num, categories);
> > > +	else if (num >= MAX_SEARCHES_SSE4)
> > > +		return search_sse_4(ctx, data, results, num, categories);
> > > +	else
> > > +		return search_sse_2(ctx, data, results, num, categories);
> > > +}
> > > diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
> > > index 7c288bd..0cde07e 100644
> > > --- a/lib/librte_acl/rte_acl.c
> > > +++ b/lib/librte_acl/rte_acl.c
> > > @@ -38,6 +38,21 @@
> > >
> > >  TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
> > >
> > > +/* by default, use the always-available scalar code path. */
> > > +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
> > > +
> > make this static, the outside world shouldn't need to see it.
> 
> As I said above, I think it more plausible to keep it globally visible.
> 
> > 
> > > +void __attribute__((constructor(INT16_MAX)))
> > > +rte_acl_select_classify(void)
> > Make it static, the outside world doesn't need to call this.
> 
> See above, I would like the user to have the ability to call it manually if needed.
> 
> > 
> > > +{
> > > +	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) {
> > > +		/* SSE version requires SSE4.1 */
> > > +		rte_acl_default_classify = rte_acl_classify_sse;
> > > +	} else {
> > > +		/* reset to scalar version. */
> > > +		rte_acl_default_classify = rte_acl_classify_scalar;
> > Don't need the else clause here, the static initializer has you covered.
> 
> I think we'd better keep it like that - in case the user calls it manually.
> We always reset rte_acl_default_classify to the 'best proper' value.
> 
> > > +	}
> > > +}
> > > +
> > > +
> > > +/**
> > > + * Invokes default rte_acl_classify function.
> > > + */
> > > +extern rte_acl_classify_t rte_acl_default_classify;
> > > +
> > Doesn't need to be extern.
> > > +#define	rte_acl_classify(ctx, data, results, num, categories)	\
> > > +	(*rte_acl_default_classify)(ctx, data, results, num, categories)
> > > +
> > Not sure why you need this either.  The rte_acl_classify_t should be enough, no?
> 
> We preserve the existing rte_acl_classify() API, so users don't need to modify their code.
> 
This would be a great candidate for versioning (Bruce and I have been discussing
this).
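Roughly what I have in mind (a sketch only; the version node names are made up
and you'd need a matching linker version script):

	/* old entry point, kept for binaries linked against
	 * the previous release */
	int rte_acl_classify_old(const struct rte_acl_ctx *ctx,
		const uint8_t **data, uint32_t *results,
		uint32_t num, uint32_t categories);
	__asm__(".symver rte_acl_classify_old, rte_acl_classify@DPDK_1.7");

	/* new dispatching entry point, the default for newly
	 * linked applications */
	int rte_acl_classify_new(const struct rte_acl_ctx *ctx,
		const uint8_t **data, uint32_t *results,
		uint32_t num, uint32_t categories);
	__asm__(".symver rte_acl_classify_new, rte_acl_classify@@DPDK_1.8");

Existing binaries keep resolving the old version, newly linked ones pick up the
new behavior, and neither needs a recompile.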
Neil
> 
^ permalink raw reply	[relevance 4%]
* Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
  2014-08-07 20:11  4% ` Neil Horman
  2014-08-07 20:58  0%   ` Vincent JARDIN
@ 2014-08-08 11:49  0%   ` Ananyev, Konstantin
  2014-08-08 12:25  4%     ` Neil Horman
  1 sibling, 1 reply; 86+ results
From: Ananyev, Konstantin @ 2014-08-08 11:49 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Thursday, August 07, 2014 9:12 PM
> To: Ananyev, Konstantin
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
> 
> On Thu, Aug 07, 2014 at 07:31:03PM +0100, Konstantin Ananyev wrote:
> > Make ACL library to build/work on 'default' architecture:
> > - make rte_acl_classify_scalar really scalar
> >  (make sure it wouldn't use sse4 intrinsics through resolve_priority()).
> > - Provide two versions of rte_acl_classify code path:
> >   rte_acl_classify_sse() - could be built and used only on systems with sse4.2
> >   and upper, return -ENOTSUP on lower arch.
> >   rte_acl_classify_scalar() - a slower version, but could be built and used
> >   on all systems.
> > - keep common code shared between these two codepaths.
> >
> > v2 changes:
> >  run-time selection of the most appropriate code path for a given ISA.
> >  By default the highest supported one is selected.
> >  User can still override that selection by manually assigning new value to
> >  the global function pointer rte_acl_default_classify.
> >  rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
> >  points to.
> >
> >
> > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> 
> This is a lot better, thank you.  A few remaining issues.
My comments inline too.
Thanks
Konstantin
> 
> > ---
> >  app/test-acl/main.c                |  13 +-
> >  lib/librte_acl/Makefile            |   5 +-
> >  lib/librte_acl/acl_bld.c           |   5 +-
> >  lib/librte_acl/acl_match_check.def |  92 ++++
> >  lib/librte_acl/acl_run.c           | 944 -------------------------------------
> >  lib/librte_acl/acl_run.h           | 220 +++++++++
> >  lib/librte_acl/acl_run_scalar.c    | 197 ++++++++
> >  lib/librte_acl/acl_run_sse.c       | 630 +++++++++++++++++++++++++
> >  lib/librte_acl/rte_acl.c           |  15 +
> >  lib/librte_acl/rte_acl.h           |  24 +-
> >  10 files changed, 1189 insertions(+), 956 deletions(-)
> >  create mode 100644 lib/librte_acl/acl_match_check.def
> >  delete mode 100644 lib/librte_acl/acl_run.c
> >  create mode 100644 lib/librte_acl/acl_run.h
> >  create mode 100644 lib/librte_acl/acl_run_scalar.c
> >  create mode 100644 lib/librte_acl/acl_run_sse.c
> >
> > diff --git a/app/test-acl/main.c b/app/test-acl/main.c
> > index d654409..45c6fa6 100644
> > --- a/app/test-acl/main.c
> > +++ b/app/test-acl/main.c
> > @@ -787,6 +787,10 @@ acx_init(void)
> >  	/* perform build. */
> >  	ret = rte_acl_build(config.acx, &cfg);
> >
> > +	/* setup default rte_acl_classify */
> > +	if (config.scalar)
> > +		rte_acl_default_classify = rte_acl_classify_scalar;
> > +
> Exporting this variable as part of the ABI is a bad idea.  If the prototype of
> the function changes, you have to update all your applications.
If the prototype of rte_acl_classify changes, you'll most likely have to update the code that uses it anyway.
>  Make the pointer
> an internal symbol and set it using a get/set routine with an enum to represent
> the path to choose.  That will help isolate the ABI from the internal
> implementation. 
That was my first intention too.
But then I realised that if we make it internal, then we'll need to make rte_acl_classify() a proper function
and it will cost us an extra call (or jump).
Also, I think the user should have the ability to change the default classify code path without modifying/rebuilding the ACL library.
For example: a bug in an optimised code path is discovered, or the user may want to implement and use their own version of classify().
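E.g. (purely illustrative):

	static int
	my_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
		uint32_t *results, uint32_t num, uint32_t categories)
	{
		/* add whatever instrumentation is needed, then fall
		 * back to the stock scalar implementation */
		return rte_acl_classify_scalar(ctx, data, results,
			num, categories);
	}

	/* at init time: */
	rte_acl_default_classify = my_classify;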
> It will also let you prevent things like selecting a run time
> path that is incompatible with the running system
If the user is going to update rte_acl_default_classify, he is probably smart enough to know what he is doing.
On the other hand, the user can hit the same problem by simply calling rte_acl_classify_sse() directly.
> and prevent path switching
> during searches, which may produce unexpected results.
Not that I am advertising it, but it should be safe to update rte_acl_default_classify during searches:
All versions of classify should produce exactly the same result for each input packet and treat the ACL context as read-only.
> 
> ><snip>
> > diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
> > deleted file mode 100644
> > index e3d9fc1..0000000
> > --- a/lib/librte_acl/acl_run.c
> > +++ /dev/null
> > @@ -1,944 +0,0 @@
> > -/*-
> > - *   BSD LICENSE
> > - *
> > - *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > - *   All rights reserved.
> > - *
> > - *   Redistribution and use in source and binary forms, with or without
> > - *   modification, are permitted provided that the following conditions
> ><snip>
> > +
> > +#define	__func_resolve_priority__	resolve_priority_scalar
> > +#define	__func_match_check__		acl_match_check_scalar
> > +#include "acl_match_check.def"
> > +
> I get this lets you make some more code common, but it's just unpleasant to trace
> through.  Looking at the definition of __func_match_check__ I don't see anything
> particularly performance sensitive there.  What if instead you simply redefined
> __func_match_check__ in a common internal header as acl_match_check (a generic
> function), and had it accept a priority resolution function as an argument?  That
> would still give you all the performance enhancements without having to include
> c files in the middle of other c files, and would make the code a bit more
> parseable.
Yes, that way it would look much better.
And it seems that with '-findirect-inlining' gcc is able to inline them via pointers properly.
Will change as you suggested. 
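I.e. something like the following (just a sketch of the direction, not the
final code):

	typedef void (*resolve_priority_t)(uint64_t transition, int n,
		const struct rte_acl_ctx *ctx, struct parms *parms,
		const struct rte_acl_match_results *p,
		uint32_t categories);

	static inline uint64_t
	acl_match_check(uint64_t transition, int slot,
		const struct rte_acl_ctx *ctx, struct parms *parms,
		struct acl_flow_data *flows,
		resolve_priority_t resolve_priority)
	{
		/* same body as in acl_match_check.def today, except
		 * that priorities are resolved through the
		 * resolve_priority() pointer instead of the
		 * __func_resolve_priority__ macro */
		return transition;
	}

with resolve_priority_scalar/resolve_priority_sse passed in by the scalar and
sse code paths respectively.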
> 
> > +/*
> > + * When processing the transition, rather than using if/else
> > + * construct, the offset is calculated for DFA and QRANGE and
> > + * then conditionally added to the address based on node type.
> > + * This is done to avoid branch mis-predictions. Since the
> > + * offset is a rather simple calculation, it is more efficient
> > + * to do the calculation and do a conditional move rather than
> > + * a conditional branch to determine which calculation to do.
> > + */
> > +static inline uint32_t
> > +scan_forward(uint32_t input, uint32_t max)
> > +{
> > +	return (input == 0) ? max : rte_bsf32(input);
> > +}
> > +	}
> > +}
> ><snip>
> > +
> > +#define	__func_resolve_priority__	resolve_priority_sse
> > +#define	__func_match_check__		acl_match_check_sse
> > +#include "acl_match_check.def"
> > +
> Same deal as above.
> 
> > +/*
> > + * Extract transitions from an XMM register and check for any matches
> > + */
> > +static void
> > +acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
> > +	struct parms *parms, struct acl_flow_data *flows)
> > +{
> > +	uint64_t transition1, transition2;
> > +
> > +	/* extract transition from low 64 bits. */
> > +	transition1 = MM_CVT64(*indicies);
> > +
> > +	/* extract transition from high 64 bits. */
> > +	*indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
> > +	transition2 = MM_CVT64(*indicies);
> > +
> > +	transition1 = acl_match_check_sse(transition1, slot, ctx,
> > +		parms, flows);
> > +	transition2 = acl_match_check_sse(transition2, slot + 1, ctx,
> > +		parms, flows);
> > +
> > +	/* update indicies with new transitions. */
> > +	*indicies = MM_SET64(transition2, transition1);
> > +}
> > +
> > +/*
> > + * Check for a match in 2 transitions (contained in SSE register)
> > + */
> > +static inline void
> > +acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> > +	struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
> > +{
> > +	xmm_t temp;
> > +
> > +	temp = MM_AND(match_mask, *indicies);
> > +	while (!MM_TESTZ(temp, temp)) {
> > +		acl_process_matches(indicies, slot, ctx, parms, flows);
> > +		temp = MM_AND(match_mask, *indicies);
> > +	}
> > +}
> > +
> > +/*
> > + * Check for any match in 4 transitions (contained in 2 SSE registers)
> > + */
> > +static inline void
> > +acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> > +	struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
> > +	xmm_t match_mask)
> > +{
> > +	xmm_t temp;
> > +
> > +	/* put low 32 bits of each transition into one register */
> > +	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> > +		0x88);
> > +	/* test for match node */
> > +	temp = MM_AND(match_mask, temp);
> > +
> > +	while (!MM_TESTZ(temp, temp)) {
> > +		acl_process_matches(indicies1, slot, ctx, parms, flows);
> > +		acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
> > +
> > +		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> > +					(__m128)*indicies2,
> > +					0x88);
> > +		temp = MM_AND(match_mask, temp);
> > +	}
> > +}
> > +
> > +/*
> > + * Calculate the address of the next transition for
> > + * all types of nodes. Note that only DFA nodes and range
> > + * nodes actually transition to another node. Match
> > + * nodes don't move.
> > + */
> > +static inline xmm_t
> > +acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > +	xmm_t *indicies1, xmm_t *indicies2)
> > +{
> > +	xmm_t addr, node_types, temp;
> > +
> > +	/*
> > +	 * Note that no transition is done for a match
> > +	 * node and therefore a stream freezes when
> > +	 * it reaches a match.
> > +	 */
> > +
> > +	/* Shuffle low 32 into temp and high 32 into indicies2 */
> > +	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> > +		0x88);
> > +	*indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> > +		(__m128)*indicies2, 0xdd);
> > +
> > +	/* Calc node type and node addr */
> > +	node_types = MM_ANDNOT(index_mask, temp);
> > +	addr = MM_AND(index_mask, temp);
> > +
> > +	/*
> > +	 * Calc addr for DFAs - addr = dfa_index + input_byte
> > +	 */
> > +
> > +	/* mask for DFA type (0) nodes */
> > +	temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
> > +
> > +	/* add input byte to DFA position */
> > +	temp = MM_AND(temp, bytes);
> > +	temp = MM_AND(temp, next_input);
> > +	addr = MM_ADD32(addr, temp);
> > +
> > +	/*
> > +	 * Calc addr for Range nodes -> range_index + range(input)
> > +	 */
> > +	node_types = MM_CMPEQ32(node_types, type_quad_range);
> > +
> > +	/*
> > +	 * Calculate number of range boundaries that are less than the
> > +	 * input value. Range boundaries for each node are in signed 8 bit,
> > +	 * ordered from -128 to 127 in the indicies2 register.
> > +	 * This is effectively a popcnt of bytes that are greater than the
> > +	 * input byte.
> > +	 */
> > +
> > +	/* shuffle input byte to all 4 positions of 32 bit value */
> > +	temp = MM_SHUFFLE8(next_input, shuffle_input);
> > +
> > +	/* check ranges */
> > +	temp = MM_CMPGT8(temp, *indicies2);
> > +
> > +	/* convert -1 to 1 (bytes greater than input byte) */
> > +	temp = MM_SIGN8(temp, temp);
> > +
> > +	/* horizontal add pairs of bytes into words */
> > +	temp = MM_MADD8(temp, temp);
> > +
> > +	/* horizontal add pairs of words into dwords */
> > +	temp = MM_MADD16(temp, ones_16);
> > +
> > +	/* mask to range type nodes */
> > +	temp = MM_AND(temp, node_types);
> > +
> > +	/* add index into node position */
> > +	return MM_ADD32(addr, temp);
> > +}
> > +
> > +/*
> > + * Process 4 transitions (in 2 SIMD registers) in parallel
> > + */
> > +static inline xmm_t
> > +transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > +	const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
> > +{
> > +	xmm_t addr;
> > +	uint64_t trans0, trans2;
> > +
> > +	 /* Calculate the address (array index) for all 4 transitions. */
> > +
> > +	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> > +		bytes, type_quad_range, indicies1, indicies2);
> > +
> > +	 /* Gather 64 bit transitions and pack back into 2 registers. */
> > +
> > +	trans0 = trans[MM_CVT32(addr)];
> > +
> > +	/* get slot 2 */
> > +
> > +	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
> > +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
> > +	trans2 = trans[MM_CVT32(addr)];
> > +
> > +	/* get slot 1 */
> > +
> > +	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
> > +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> > +	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
> > +
> > +	/* get slot 3 */
> > +
> > +	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
> > +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
> > +	*indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
> > +
> > +	return MM_SRL32(next_input, 8);
> > +}
> > +
> > +/*
> > + * Execute trie traversal with 8 traversals in parallel
> > + */
> > +static inline int
> > +search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > +	uint32_t *results, uint32_t total_packets, uint32_t categories)
> > +{
> > +	int n;
> > +	struct acl_flow_data flows;
> > +	uint64_t index_array[MAX_SEARCHES_SSE8];
> > +	struct completion cmplt[MAX_SEARCHES_SSE8];
> > +	struct parms parms[MAX_SEARCHES_SSE8];
> > +	xmm_t input0, input1;
> > +	xmm_t indicies1, indicies2, indicies3, indicies4;
> > +
> > +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > +		total_packets, categories, ctx->trans_table);
> > +
> > +	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
> > +		cmplt[n].count = 0;
> > +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > +	}
> > +
> > +	/*
> > +	 * indicies1 contains index_array[0,1]
> > +	 * indicies2 contains index_array[2,3]
> > +	 * indicies3 contains index_array[4,5]
> > +	 * indicies4 contains index_array[6,7]
> > +	 */
> > +
> > +	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> > +	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> > +
> > +	indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
> > +	indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
> > +
> > +	 /* Check for any matches. */
> > +	acl_match_check_x4(0, ctx, parms, &flows,
> > +		&indicies1, &indicies2, mm_match_mask.m);
> > +	acl_match_check_x4(4, ctx, parms, &flows,
> > +		&indicies3, &indicies4, mm_match_mask.m);
> > +
> > +	while (flows.started > 0) {
> > +
> > +		/* Gather 4 bytes of input data for each stream. */
> > +		input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
> > +			0);
> > +		input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
> > +			0);
> > +
> > +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
> > +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
> > +
> > +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
> > +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
> > +
> > +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
> > +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
> > +
> > +		 /* Process the 4 bytes of input on each stream. */
> > +
> > +		input0 = transition4(mm_index_mask.m, input0,
> > +			mm_shuffle_input.m, mm_ones_16.m,
> > +			mm_bytes.m, mm_type_quad_range.m,
> > +			flows.trans, &indicies1, &indicies2);
> > +
> > +		input1 = transition4(mm_index_mask.m, input1,
> > +			mm_shuffle_input.m, mm_ones_16.m,
> > +			mm_bytes.m, mm_type_quad_range.m,
> > +			flows.trans, &indicies3, &indicies4);
> > +
> > +		input0 = transition4(mm_index_mask.m, input0,
> > +			mm_shuffle_input.m, mm_ones_16.m,
> > +			mm_bytes.m, mm_type_quad_range.m,
> > +			flows.trans, &indicies1, &indicies2);
> > +
> > +		input1 = transition4(mm_index_mask.m, input1,
> > +			mm_shuffle_input.m, mm_ones_16.m,
> > +			mm_bytes.m, mm_type_quad_range.m,
> > +			flows.trans, &indicies3, &indicies4);
> > +
> > +		input0 = transition4(mm_index_mask.m, input0,
> > +			mm_shuffle_input.m, mm_ones_16.m,
> > +			mm_bytes.m, mm_type_quad_range.m,
> > +			flows.trans, &indicies1, &indicies2);
> > +
> > +		input1 = transition4(mm_index_mask.m, input1,
> > +			mm_shuffle_input.m, mm_ones_16.m,
> > +			mm_bytes.m, mm_type_quad_range.m,
> > +			flows.trans, &indicies3, &indicies4);
> > +
> > +		input0 = transition4(mm_index_mask.m, input0,
> > +			mm_shuffle_input.m, mm_ones_16.m,
> > +			mm_bytes.m, mm_type_quad_range.m,
> > +			flows.trans, &indicies1, &indicies2);
> > +
> > +		input1 = transition4(mm_index_mask.m, input1,
> > +			mm_shuffle_input.m, mm_ones_16.m,
> > +			mm_bytes.m, mm_type_quad_range.m,
> > +			flows.trans, &indicies3, &indicies4);
> > +
> > +		 /* Check for any matches. */
> > +		acl_match_check_x4(0, ctx, parms, &flows,
> > +			&indicies1, &indicies2, mm_match_mask.m);
> > +		acl_match_check_x4(4, ctx, parms, &flows,
> > +			&indicies3, &indicies4, mm_match_mask.m);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/*
> > + * Execute trie traversal with 4 traversals in parallel
> > + */
> > +static inline int
> > +search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > +	 uint32_t *results, int total_packets, uint32_t categories)
> > +{
> > +	int n;
> > +	struct acl_flow_data flows;
> > +	uint64_t index_array[MAX_SEARCHES_SSE4];
> > +	struct completion cmplt[MAX_SEARCHES_SSE4];
> > +	struct parms parms[MAX_SEARCHES_SSE4];
> > +	xmm_t input, indicies1, indicies2;
> > +
> > +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > +		total_packets, categories, ctx->trans_table);
> > +
> > +	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
> > +		cmplt[n].count = 0;
> > +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > +	}
> > +
> > +	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> > +	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> > +
> > +	/* Check for any matches. */
> > +	acl_match_check_x4(0, ctx, parms, &flows,
> > +		&indicies1, &indicies2, mm_match_mask.m);
> > +
> > +	while (flows.started > 0) {
> > +
> > +		/* Gather 4 bytes of input data for each stream. */
> > +		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> > +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> > +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
> > +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
> > +
> > +		/* Process the 4 bytes of input on each stream. */
> > +		input = transition4(mm_index_mask.m, input,
> > +			mm_shuffle_input.m, mm_ones_16.m,
> > +			mm_bytes.m, mm_type_quad_range.m,
> > +			flows.trans, &indicies1, &indicies2);
> > +
> > +		 input = transition4(mm_index_mask.m, input,
> > +			mm_shuffle_input.m, mm_ones_16.m,
> > +			mm_bytes.m, mm_type_quad_range.m,
> > +			flows.trans, &indicies1, &indicies2);
> > +
> > +		 input = transition4(mm_index_mask.m, input,
> > +			mm_shuffle_input.m, mm_ones_16.m,
> > +			mm_bytes.m, mm_type_quad_range.m,
> > +			flows.trans, &indicies1, &indicies2);
> > +
> > +		 input = transition4(mm_index_mask.m, input,
> > +			mm_shuffle_input.m, mm_ones_16.m,
> > +			mm_bytes.m, mm_type_quad_range.m,
> > +			flows.trans, &indicies1, &indicies2);
> > +
> > +		/* Check for any matches. */
> > +		acl_match_check_x4(0, ctx, parms, &flows,
> > +			&indicies1, &indicies2, mm_match_mask.m);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static inline xmm_t
> > +transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > +	const uint64_t *trans, xmm_t *indicies1)
> > +{
> > +	uint64_t t;
> > +	xmm_t addr, indicies2;
> > +
> > +	indicies2 = MM_XOR(ones_16, ones_16);
> > +
> > +	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> > +		bytes, type_quad_range, indicies1, &indicies2);
> > +
> > +	/* Gather 64 bit transitions and pack 2 per register. */
> > +
> > +	t = trans[MM_CVT32(addr)];
> > +
> > +	/* get slot 1 */
> > +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> > +	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
> > +
> > +	return MM_SRL32(next_input, 8);
> > +}
> > +
> > +/*
> > + * Execute trie traversal with 2 traversals in parallel.
> > + */
> > +static inline int
> > +search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > +	uint32_t *results, uint32_t total_packets, uint32_t categories)
> > +{
> > +	int n;
> > +	struct acl_flow_data flows;
> > +	uint64_t index_array[MAX_SEARCHES_SSE2];
> > +	struct completion cmplt[MAX_SEARCHES_SSE2];
> > +	struct parms parms[MAX_SEARCHES_SSE2];
> > +	xmm_t input, indicies;
> > +
> > +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > +		total_packets, categories, ctx->trans_table);
> > +
> > +	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
> > +		cmplt[n].count = 0;
> > +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > +	}
> > +
> > +	indicies = MM_LOADU((xmm_t *) &index_array[0]);
> > +
> > +	/* Check for any matches. */
> > +	acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
> > +
> > +	while (flows.started > 0) {
> > +
> > +		/* Gather 4 bytes of input data for each stream. */
> > +		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> > +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> > +
> > +		/* Process the 4 bytes of input on each stream. */
> > +
> > +		input = transition2(mm_index_mask64.m, input,
> > +			mm_shuffle_input64.m, mm_ones_16.m,
> > +			mm_bytes64.m, mm_type_quad_range64.m,
> > +			flows.trans, &indicies);
> > +
> > +		input = transition2(mm_index_mask64.m, input,
> > +			mm_shuffle_input64.m, mm_ones_16.m,
> > +			mm_bytes64.m, mm_type_quad_range64.m,
> > +			flows.trans, &indicies);
> > +
> > +		input = transition2(mm_index_mask64.m, input,
> > +			mm_shuffle_input64.m, mm_ones_16.m,
> > +			mm_bytes64.m, mm_type_quad_range64.m,
> > +			flows.trans, &indicies);
> > +
> > +		input = transition2(mm_index_mask64.m, input,
> > +			mm_shuffle_input64.m, mm_ones_16.m,
> > +			mm_bytes64.m, mm_type_quad_range64.m,
> > +			flows.trans, &indicies);
> > +
> > +		/* Check for any matches. */
> > +		acl_match_check_x2(0, ctx, parms, &flows, &indicies,
> > +			mm_match_mask64.m);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +int
> > +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > +	uint32_t *results, uint32_t num, uint32_t categories)
> > +{
> > +	if (categories != 1 &&
> > +		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> > +		return -EINVAL;
> > +
> > +	if (likely(num >= MAX_SEARCHES_SSE8))
> > +		return search_sse_8(ctx, data, results, num, categories);
> > +	else if (num >= MAX_SEARCHES_SSE4)
> > +		return search_sse_4(ctx, data, results, num, categories);
> > +	else
> > +		return search_sse_2(ctx, data, results, num, categories);
> > +}
> > diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
> > index 7c288bd..0cde07e 100644
> > --- a/lib/librte_acl/rte_acl.c
> > +++ b/lib/librte_acl/rte_acl.c
> > @@ -38,6 +38,21 @@
> >
> >  TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
> >
> > +/* by default, use the always available scalar code path. */
> > +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
> > +
> Make this static; the outside world shouldn't need to see it.
As I said above, I think it is preferable to keep it globally visible.
> 
> > +void __attribute__((constructor(INT16_MAX)))
> > +rte_acl_select_classify(void)
> Make it static; the outside world doesn't need to call this.
See above; we would like the user to have the ability to call it manually if needed.
> 
> > +{
> > +	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) {
> > +		/* SSE version requires SSE4.1 */
> > +		rte_acl_default_classify = rte_acl_classify_sse;
> > +	} else {
> > +		/* reset to scalar version. */
> > +		rte_acl_default_classify = rte_acl_classify_scalar;
> Don't need the else clause here, the static initializer has you covered.
I think we had better keep it like that, in case the user calls it manually.
We always reset rte_acl_default_classify to the best supported value.
> > +	}
> > +}
> > +
> > +
> > +/**
> > + * Invokes the default rte_acl_classify function.
> > + */
> > +extern rte_acl_classify_t rte_acl_default_classify;
> > +
> Doesn't need to be extern.
> > +#define	rte_acl_classify(ctx, data, results, num, categories)	\
> > +	(*rte_acl_default_classify)(ctx, data, results, num, categories)
> > +
> Not sure why you need this either.  The rte_acl_classify_t should be enough, no?
We preserve the existing rte_acl_classify() API, so users don't need to modify their code.
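For illustration, with this patch an application can pin or restore the code path
like this (a minimal sketch using only the symbols the patch exports; the
ctx/data/results variables are placeholders):

    /* pin the scalar path, e.g. for debugging or benchmarking */
    rte_acl_default_classify = rte_acl_classify_scalar;

    ret = rte_acl_classify(ctx, data, results, num, categories);

    /* later, restore the best path for the running CPU */
    rte_acl_select_classify();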
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
  2014-08-07 20:11  4% ` Neil Horman
@ 2014-08-07 20:58  0%   ` Vincent JARDIN
  2014-08-08 11:49  0%   ` Ananyev, Konstantin
  1 sibling, 0 replies; 86+ results
From: Vincent JARDIN @ 2014-08-07 20:58 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev
What's about using function versioning attributes too:
https://gcc.gnu.org/wiki/FunctionMultiVersioning
?
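For reference, the basic shape would be something like this (a sketch only;
fmv_classify is an invented name, and whether GCC's multiversioning is usable
here depends on compiler version and C vs C++ support, so treat it as an open
question for DPDK's supported toolchains):

    __attribute__((target("default")))
    int fmv_classify(int x) { return x; }       /* stands in for the scalar path */

    __attribute__((target("sse4.1")))
    int fmv_classify(int x) { return x + 1; }   /* stands in for the SSE path */

    /* Callers just call fmv_classify(); the dynamic linker's ifunc
     * resolver picks the best variant for the running CPU at load
     * time, instead of DPDK dispatching through a function pointer. */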
On 7 August 2014 at 22:11, "Neil Horman" <nhorman@tuxdriver.com> wrote:
>
> On Thu, Aug 07, 2014 at 07:31:03PM +0100, Konstantin Ananyev wrote:
> > > Make the ACL library build/work on 'default' architecture:
> > > - make rte_acl_classify_scalar really scalar
> > >  (make sure it wouldn't use SSE4 intrinsics through resolve_priority()).
> > > - Provide two versions of the rte_acl_classify code path:
> > >   rte_acl_classify_sse() - can be built and used only on systems with sse4.2
> > >   and above, returns -ENOTSUP on lower arch.
> > >   rte_acl_classify_scalar() - a slower version, but can be built and used
> > >   on all systems.
> > - keep common code shared between these two codepaths.
> >
> > v2 changes:
> >  run-time selection of the most appropriate code-path for the given ISA.
> >  By default the highest supported one is selected.
> >  The user can still override that selection by manually assigning a new
> >  value to the global function pointer rte_acl_default_classify.
> >  rte_acl_classify() becomes a macro calling whatever
> >  rte_acl_default_classify points to.
> >
> >
> > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
>
> This is a lot better, thank you.  A few remaining issues.
>
> > ---
> >  app/test-acl/main.c                |  13 +-
> >  lib/librte_acl/Makefile            |   5 +-
> >  lib/librte_acl/acl_bld.c           |   5 +-
> >  lib/librte_acl/acl_match_check.def |  92 ++++
> >  lib/librte_acl/acl_run.c           | 944 -------------------------------------
> >  lib/librte_acl/acl_run.h           | 220 +++++++++
> >  lib/librte_acl/acl_run_scalar.c    | 197 ++++++++
> >  lib/librte_acl/acl_run_sse.c       | 630 +++++++++++++++++++++++++
> >  lib/librte_acl/rte_acl.c           |  15 +
> >  lib/librte_acl/rte_acl.h           |  24 +-
> >  10 files changed, 1189 insertions(+), 956 deletions(-)
> >  create mode 100644 lib/librte_acl/acl_match_check.def
> >  delete mode 100644 lib/librte_acl/acl_run.c
> >  create mode 100644 lib/librte_acl/acl_run.h
> >  create mode 100644 lib/librte_acl/acl_run_scalar.c
> >  create mode 100644 lib/librte_acl/acl_run_sse.c
> >
> > diff --git a/app/test-acl/main.c b/app/test-acl/main.c
> > index d654409..45c6fa6 100644
> > --- a/app/test-acl/main.c
> > +++ b/app/test-acl/main.c
> > @@ -787,6 +787,10 @@ acx_init(void)
> >       /* perform build. */
> >       ret = rte_acl_build(config.acx, &cfg);
> >
> > +     /* setup default rte_acl_classify */
> > +     if (config.scalar)
> > +             rte_acl_default_classify = rte_acl_classify_scalar;
> > +
> Exporting this variable as part of the ABI is a bad idea.  If the prototype of
> the function changes you have to update all your applications.  Make the pointer
> an internal symbol and set it using a get/set routine with an enum to represent
> the path to choose.  That will help isolate the ABI from the internal
> implementation.  It will also let you prevent things like selecting a run time
> path that is incompatible with the running system, and prevent path switching
> during searches, which may produce unexpected results.
>
> ><snip>
> > diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
> > deleted file mode 100644
> > index e3d9fc1..0000000
> > --- a/lib/librte_acl/acl_run.c
> > +++ /dev/null
> > @@ -1,944 +0,0 @@
> > -/*-
> > - *   BSD LICENSE
> > - *
> > - *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> > - *   All rights reserved.
> > - *
> > - *   Redistribution and use in source and binary forms, with or without
> > - *   modification, are permitted provided that the following conditions
> ><snip>
> > +
> > +#define      __func_resolve_priority__       resolve_priority_scalar
> > +#define      __func_match_check__            acl_match_check_scalar
> > +#include "acl_match_check.def"
> > +
> I get this lets you make some more code common, but it's just unpleasant to
> trace through.  Looking at the definition of __func_match_check__ I don't see
> anything particularly performance sensitive there.  What if instead you simply
> redefined __func_match_check__ in a common internal header as acl_match_check
> (a generic function), and had it accept a priority resolution function as an
> argument?  That would still give you all the performance enhancements without
> having to include C files in the middle of other C files, and would make the
> code a bit more parseable.
>
> > +/*
> > + * When processing the transition, rather than using if/else
> > + * construct, the offset is calculated for DFA and QRANGE and
> > + * then conditionally added to the address based on node type.
> > + * This is done to avoid branch mis-predictions. Since the
> > + * offset is rather simple calculation it is more efficient
> > + * to do the calculation and do a condition move rather than
> > + * a conditional branch to determine which calculation to do.
> > + */
> > +static inline uint32_t
> > +scan_forward(uint32_t input, uint32_t max)
> > +{
> > +     return (input == 0) ? max : rte_bsf32(input);
> > +}
> > +     }
> > +}
> ><snip>
> > +
> > +#define      __func_resolve_priority__       resolve_priority_sse
> > +#define      __func_match_check__            acl_match_check_sse
> > +#include "acl_match_check.def"
> > +
> Same deal as above.
>
> > +/*
> > + * Extract transitions from an XMM register and check for any matches
> > + */
> > +static void
> > +acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
> > +     struct parms *parms, struct acl_flow_data *flows)
> > +{
> > +     uint64_t transition1, transition2;
> > +
> > +     /* extract transition from low 64 bits. */
> > +     transition1 = MM_CVT64(*indicies);
> > +
> > +     /* extract transition from high 64 bits. */
> > +     *indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
> > +     transition2 = MM_CVT64(*indicies);
> > +
> > +     transition1 = acl_match_check_sse(transition1, slot, ctx,
> > +             parms, flows);
> > +     transition2 = acl_match_check_sse(transition2, slot + 1, ctx,
> > +             parms, flows);
> > +
> > +     /* update indicies with new transitions. */
> > +     *indicies = MM_SET64(transition2, transition1);
> > +}
> > +
> > +/*
> > + * Check for a match in 2 transitions (contained in SSE register)
> > + */
> > +static inline void
> > +acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> > +     struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
> > +{
> > +     xmm_t temp;
> > +
> > +     temp = MM_AND(match_mask, *indicies);
> > +     while (!MM_TESTZ(temp, temp)) {
> > +             acl_process_matches(indicies, slot, ctx, parms, flows);
> > +             temp = MM_AND(match_mask, *indicies);
> > +     }
> > +}
> > +
> > +/*
> > + * Check for any match in 4 transitions (contained in 2 SSE registers)
> > + */
> > +static inline void
> > +acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> > +     struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
> > +     xmm_t match_mask)
> > +{
> > +     xmm_t temp;
> > +
> > +     /* put low 32 bits of each transition into one register */
> > +     temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> > +             0x88);
> > +     /* test for match node */
> > +     temp = MM_AND(match_mask, temp);
> > +
> > +     while (!MM_TESTZ(temp, temp)) {
> > +             acl_process_matches(indicies1, slot, ctx, parms, flows);
> > +             acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
> > +
> > +             temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> > +                                     (__m128)*indicies2,
> > +                                     0x88);
> > +             temp = MM_AND(match_mask, temp);
> > +     }
> > +}
> > +
> > +/*
> > + * Calculate the address of the next transition for
> > + * all types of nodes. Note that only DFA nodes and range
> > + * nodes actually transition to another node. Match
> > + * nodes don't move.
> > + */
> > +static inline xmm_t
> > +acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > +     xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > +     xmm_t *indicies1, xmm_t *indicies2)
> > +{
> > +     xmm_t addr, node_types, temp;
> > +
> > +     /*
> > +      * Note that no transition is done for a match
> > +      * node and therefore a stream freezes when
> > +      * it reaches a match.
> > +      */
> > +
> > +     /* Shuffle low 32 into temp and high 32 into indicies2 */
> > +     temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> > +             0x88);
> > +     *indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> > +             (__m128)*indicies2, 0xdd);
> > +
> > +     /* Calc node type and node addr */
> > +     node_types = MM_ANDNOT(index_mask, temp);
> > +     addr = MM_AND(index_mask, temp);
> > +
> > +     /*
> > +      * Calc addr for DFAs - addr = dfa_index + input_byte
> > +      */
> > +
> > +     /* mask for DFA type (0) nodes */
> > +     temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
> > +
> > +     /* add input byte to DFA position */
> > +     temp = MM_AND(temp, bytes);
> > +     temp = MM_AND(temp, next_input);
> > +     addr = MM_ADD32(addr, temp);
> > +
> > +     /*
> > +      * Calc addr for Range nodes -> range_index + range(input)
> > +      */
> > +     node_types = MM_CMPEQ32(node_types, type_quad_range);
> > +
> > +     /*
> > +      * Calculate number of range boundaries that are less than the
> > +      * input value. Range boundaries for each node are in signed 8 bit,
> > +      * ordered from -128 to 127 in the indicies2 register.
> > +      * This is effectively a popcnt of bytes that are greater than the
> > +      * input byte.
> > +      */
> > +
> > +     /* shuffle input byte to all 4 positions of 32 bit value */
> > +     temp = MM_SHUFFLE8(next_input, shuffle_input);
> > +
> > +     /* check ranges */
> > +     temp = MM_CMPGT8(temp, *indicies2);
> > +
> > +     /* convert -1 to 1 (bytes greater than input byte) */
> > +     temp = MM_SIGN8(temp, temp);
> > +
> > +     /* horizontal add pairs of bytes into words */
> > +     temp = MM_MADD8(temp, temp);
> > +
> > +     /* horizontal add pairs of words into dwords */
> > +     temp = MM_MADD16(temp, ones_16);
> > +
> > +     /* mask to range type nodes */
> > +     temp = MM_AND(temp, node_types);
> > +
> > +     /* add index into node position */
> > +     return MM_ADD32(addr, temp);
> > +}
> > +
> > +/*
> > + * Process 4 transitions (in 2 SIMD registers) in parallel
> > + */
> > +static inline xmm_t
> > +transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > +     xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > +     const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
> > +{
> > +     xmm_t addr;
> > +     uint64_t trans0, trans2;
> > +
> > +      /* Calculate the address (array index) for all 4 transitions. */
> > +
> > +     addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> > +             bytes, type_quad_range, indicies1, indicies2);
> > +
> > +      /* Gather 64 bit transitions and pack back into 2 registers. */
> > +
> > +     trans0 = trans[MM_CVT32(addr)];
> > +
> > +     /* get slot 2 */
> > +
> > +     /* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
> > +     addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
> > +     trans2 = trans[MM_CVT32(addr)];
> > +
> > +     /* get slot 1 */
> > +
> > +     /* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
> > +     addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> > +     *indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
> > +
> > +     /* get slot 3 */
> > +
> > +     /* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
> > +     addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
> > +     *indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
> > +
> > +     return MM_SRL32(next_input, 8);
> > +}
> > +
> > +/*
> > + * Execute trie traversal with 8 traversals in parallel
> > + */
> > +static inline int
> > +search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > +     uint32_t *results, uint32_t total_packets, uint32_t categories)
> > +{
> > +     int n;
> > +     struct acl_flow_data flows;
> > +     uint64_t index_array[MAX_SEARCHES_SSE8];
> > +     struct completion cmplt[MAX_SEARCHES_SSE8];
> > +     struct parms parms[MAX_SEARCHES_SSE8];
> > +     xmm_t input0, input1;
> > +     xmm_t indicies1, indicies2, indicies3, indicies4;
> > +
> > +     acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > +             total_packets, categories, ctx->trans_table);
> > +
> > +     for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
> > +             cmplt[n].count = 0;
> > +             index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > +     }
> > +
> > +     /*
> > +      * indicies1 contains index_array[0,1]
> > +      * indicies2 contains index_array[2,3]
> > +      * indicies3 contains index_array[4,5]
> > +      * indicies4 contains index_array[6,7]
> > +      */
> > +
> > +     indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> > +     indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> > +
> > +     indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
> > +     indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
> > +
> > +      /* Check for any matches. */
> > +     acl_match_check_x4(0, ctx, parms, &flows,
> > +             &indicies1, &indicies2, mm_match_mask.m);
> > +     acl_match_check_x4(4, ctx, parms, &flows,
> > +             &indicies3, &indicies4, mm_match_mask.m);
> > +
> > +     while (flows.started > 0) {
> > +
> > +             /* Gather 4 bytes of input data for each stream. */
> > +             input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
> > +                     0);
> > +             input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
> > +                     0);
> > +
> > +             input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
> > +             input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
> > +
> > +             input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
> > +             input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
> > +
> > +             input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
> > +             input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
> > +
> > +              /* Process the 4 bytes of input on each stream. */
> > +
> > +             input0 = transition4(mm_index_mask.m, input0,
> > +                     mm_shuffle_input.m, mm_ones_16.m,
> > +                     mm_bytes.m, mm_type_quad_range.m,
> > +                     flows.trans, &indicies1, &indicies2);
> > +
> > +             input1 = transition4(mm_index_mask.m, input1,
> > +                     mm_shuffle_input.m, mm_ones_16.m,
> > +                     mm_bytes.m, mm_type_quad_range.m,
> > +                     flows.trans, &indicies3, &indicies4);
> > +
> > +             input0 = transition4(mm_index_mask.m, input0,
> > +                     mm_shuffle_input.m, mm_ones_16.m,
> > +                     mm_bytes.m, mm_type_quad_range.m,
> > +                     flows.trans, &indicies1, &indicies2);
> > +
> > +             input1 = transition4(mm_index_mask.m, input1,
> > +                     mm_shuffle_input.m, mm_ones_16.m,
> > +                     mm_bytes.m, mm_type_quad_range.m,
> > +                     flows.trans, &indicies3, &indicies4);
> > +
> > +             input0 = transition4(mm_index_mask.m, input0,
> > +                     mm_shuffle_input.m, mm_ones_16.m,
> > +                     mm_bytes.m, mm_type_quad_range.m,
> > +                     flows.trans, &indicies1, &indicies2);
> > +
> > +             input1 = transition4(mm_index_mask.m, input1,
> > +                     mm_shuffle_input.m, mm_ones_16.m,
> > +                     mm_bytes.m, mm_type_quad_range.m,
> > +                     flows.trans, &indicies3, &indicies4);
> > +
> > +             input0 = transition4(mm_index_mask.m, input0,
> > +                     mm_shuffle_input.m, mm_ones_16.m,
> > +                     mm_bytes.m, mm_type_quad_range.m,
> > +                     flows.trans, &indicies1, &indicies2);
> > +
> > +             input1 = transition4(mm_index_mask.m, input1,
> > +                     mm_shuffle_input.m, mm_ones_16.m,
> > +                     mm_bytes.m, mm_type_quad_range.m,
> > +                     flows.trans, &indicies3, &indicies4);
> > +
> > +              /* Check for any matches. */
> > +             acl_match_check_x4(0, ctx, parms, &flows,
> > +                     &indicies1, &indicies2, mm_match_mask.m);
> > +             acl_match_check_x4(4, ctx, parms, &flows,
> > +                     &indicies3, &indicies4, mm_match_mask.m);
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +/*
> > + * Execute trie traversal with 4 traversals in parallel
> > + */
> > +static inline int
> > +search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > +      uint32_t *results, int total_packets, uint32_t categories)
> > +{
> > +     int n;
> > +     struct acl_flow_data flows;
> > +     uint64_t index_array[MAX_SEARCHES_SSE4];
> > +     struct completion cmplt[MAX_SEARCHES_SSE4];
> > +     struct parms parms[MAX_SEARCHES_SSE4];
> > +     xmm_t input, indicies1, indicies2;
> > +
> > +     acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > +             total_packets, categories, ctx->trans_table);
> > +
> > +     for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
> > +             cmplt[n].count = 0;
> > +             index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > +     }
> > +
> > +     indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> > +     indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> > +
> > +     /* Check for any matches. */
> > +     acl_match_check_x4(0, ctx, parms, &flows,
> > +             &indicies1, &indicies2, mm_match_mask.m);
> > +
> > +     while (flows.started > 0) {
> > +
> > +             /* Gather 4 bytes of input data for each stream. */
> > +             input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> > +             input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> > +             input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
> > +             input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
> > +
> > +             /* Process the 4 bytes of input on each stream. */
> > +             input = transition4(mm_index_mask.m, input,
> > +                     mm_shuffle_input.m, mm_ones_16.m,
> > +                     mm_bytes.m, mm_type_quad_range.m,
> > +                     flows.trans, &indicies1, &indicies2);
> > +
> > +              input = transition4(mm_index_mask.m, input,
> > +                     mm_shuffle_input.m, mm_ones_16.m,
> > +                     mm_bytes.m, mm_type_quad_range.m,
> > +                     flows.trans, &indicies1, &indicies2);
> > +
> > +              input = transition4(mm_index_mask.m, input,
> > +                     mm_shuffle_input.m, mm_ones_16.m,
> > +                     mm_bytes.m, mm_type_quad_range.m,
> > +                     flows.trans, &indicies1, &indicies2);
> > +
> > +              input = transition4(mm_index_mask.m, input,
> > +                     mm_shuffle_input.m, mm_ones_16.m,
> > +                     mm_bytes.m, mm_type_quad_range.m,
> > +                     flows.trans, &indicies1, &indicies2);
> > +
> > +             /* Check for any matches. */
> > +             acl_match_check_x4(0, ctx, parms, &flows,
> > +                     &indicies1, &indicies2, mm_match_mask.m);
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +static inline xmm_t
> > +transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> > +     xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> > +     const uint64_t *trans, xmm_t *indicies1)
> > +{
> > +     uint64_t t;
> > +     xmm_t addr, indicies2;
> > +
> > +     indicies2 = MM_XOR(ones_16, ones_16);
> > +
> > +     addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> > +             bytes, type_quad_range, indicies1, &indicies2);
> > +
> > +     /* Gather 64 bit transitions and pack 2 per register. */
> > +
> > +     t = trans[MM_CVT32(addr)];
> > +
> > +     /* get slot 1 */
> > +     addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> > +     *indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
> > +
> > +     return MM_SRL32(next_input, 8);
> > +}
> > +
> > +/*
> > + * Execute trie traversal with 2 traversals in parallel.
> > + */
> > +static inline int
> > +search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > +     uint32_t *results, uint32_t total_packets, uint32_t categories)
> > +{
> > +     int n;
> > +     struct acl_flow_data flows;
> > +     uint64_t index_array[MAX_SEARCHES_SSE2];
> > +     struct completion cmplt[MAX_SEARCHES_SSE2];
> > +     struct parms parms[MAX_SEARCHES_SSE2];
> > +     xmm_t input, indicies;
> > +
> > +     acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> > +             total_packets, categories, ctx->trans_table);
> > +
> > +     for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
> > +             cmplt[n].count = 0;
> > +             index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> > +     }
> > +
> > +     indicies = MM_LOADU((xmm_t *) &index_array[0]);
> > +
> > +     /* Check for any matches. */
> > +     acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
> > +
> > +     while (flows.started > 0) {
> > +
> > +             /* Gather 4 bytes of input data for each stream. */
> > +             input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> > +             input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> > +
> > +             /* Process the 4 bytes of input on each stream. */
> > +
> > +             input = transition2(mm_index_mask64.m, input,
> > +                     mm_shuffle_input64.m, mm_ones_16.m,
> > +                     mm_bytes64.m, mm_type_quad_range64.m,
> > +                     flows.trans, &indicies);
> > +
> > +             input = transition2(mm_index_mask64.m, input,
> > +                     mm_shuffle_input64.m, mm_ones_16.m,
> > +                     mm_bytes64.m, mm_type_quad_range64.m,
> > +                     flows.trans, &indicies);
> > +
> > +             input = transition2(mm_index_mask64.m, input,
> > +                     mm_shuffle_input64.m, mm_ones_16.m,
> > +                     mm_bytes64.m, mm_type_quad_range64.m,
> > +                     flows.trans, &indicies);
> > +
> > +             input = transition2(mm_index_mask64.m, input,
> > +                     mm_shuffle_input64.m, mm_ones_16.m,
> > +                     mm_bytes64.m, mm_type_quad_range64.m,
> > +                     flows.trans, &indicies);
> > +
> > +             /* Check for any matches. */
> > +             acl_match_check_x2(0, ctx, parms, &flows, &indicies,
> > +                     mm_match_mask64.m);
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +int
> > +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> > +     uint32_t *results, uint32_t num, uint32_t categories)
> > +{
> > +     if (categories != 1 &&
> > +             ((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> > +             return -EINVAL;
> > +
> > +     if (likely(num >= MAX_SEARCHES_SSE8))
> > +             return search_sse_8(ctx, data, results, num, categories);
> > +     else if (num >= MAX_SEARCHES_SSE4)
> > +             return search_sse_4(ctx, data, results, num, categories);
> > +     else
> > +             return search_sse_2(ctx, data, results, num, categories);
> > +}
> > diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
> > index 7c288bd..0cde07e 100644
> > --- a/lib/librte_acl/rte_acl.c
> > +++ b/lib/librte_acl/rte_acl.c
> > @@ -38,6 +38,21 @@
> >
> >  TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
> >
> > +/* by default, use the always available scalar code path. */
> > +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
> > +
> Make this static; the outside world shouldn't need to see it.
>
> > +void __attribute__((constructor(INT16_MAX)))
> > +rte_acl_select_classify(void)
> Make it static; the outside world doesn't need to call this.
>
> > +{
> > +     if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) {
> > +             /* SSE version requires SSE4.1 */
> > +             rte_acl_default_classify = rte_acl_classify_sse;
> > +     } else {
> > +             /* reset to scalar version. */
> > +             rte_acl_default_classify = rte_acl_classify_scalar;
> Don't need the else clause here, the static initializer has you covered.
> > +     }
> > +}
> > +
> > +
> > +/**
> > + * Invokes the default rte_acl_classify function.
> > + */
> > +extern rte_acl_classify_t rte_acl_default_classify;
> > +
> Doesn't need to be extern.
> > +#define      rte_acl_classify(ctx, data, results, num, categories)   \
> > +     (*rte_acl_default_classify)(ctx, data, results, num, categories)
> > +
> Not sure why you need this either.  The rte_acl_classify_t should be enough, no?
>
> Regards
> Neil
>
>
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target
  @ 2014-08-07 20:11  4% ` Neil Horman
  2014-08-07 20:58  0%   ` Vincent JARDIN
  2014-08-08 11:49  0%   ` Ananyev, Konstantin
  2014-08-21 20:15  1% ` [dpdk-dev] [PATCHv3] " Neil Horman
  2014-08-28 20:38  1% ` [dpdk-dev] [PATCHv4] " Neil Horman
  2 siblings, 2 replies; 86+ results
From: Neil Horman @ 2014-08-07 20:11 UTC (permalink / raw)
  To: Konstantin Ananyev; +Cc: dev
On Thu, Aug 07, 2014 at 07:31:03PM +0100, Konstantin Ananyev wrote:
> Make the ACL library build/work on 'default' architecture:
> - make rte_acl_classify_scalar really scalar
>  (make sure it wouldn't use SSE4 intrinsics through resolve_priority()).
> - Provide two versions of the rte_acl_classify code path:
>   rte_acl_classify_sse() - can be built and used only on systems with sse4.2
>   and above, returns -ENOTSUP on lower arch.
>   rte_acl_classify_scalar() - a slower version, but can be built and used
>   on all systems.
> - keep common code shared between these two codepaths.
> 
> v2 changes:
>  run-time selection of the most appropriate code-path for the given ISA.
>  By default the highest supported one is selected.
>  The user can still override that selection by manually assigning a new value
>  to the global function pointer rte_acl_default_classify.
>  rte_acl_classify() becomes a macro calling whatever rte_acl_default_classify
>  points to.
> 
> 
> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
This is a lot better, thank you.  A few remaining issues.
> ---
>  app/test-acl/main.c                |  13 +-
>  lib/librte_acl/Makefile            |   5 +-
>  lib/librte_acl/acl_bld.c           |   5 +-
>  lib/librte_acl/acl_match_check.def |  92 ++++
>  lib/librte_acl/acl_run.c           | 944 -------------------------------------
>  lib/librte_acl/acl_run.h           | 220 +++++++++
>  lib/librte_acl/acl_run_scalar.c    | 197 ++++++++
>  lib/librte_acl/acl_run_sse.c       | 630 +++++++++++++++++++++++++
>  lib/librte_acl/rte_acl.c           |  15 +
>  lib/librte_acl/rte_acl.h           |  24 +-
>  10 files changed, 1189 insertions(+), 956 deletions(-)
>  create mode 100644 lib/librte_acl/acl_match_check.def
>  delete mode 100644 lib/librte_acl/acl_run.c
>  create mode 100644 lib/librte_acl/acl_run.h
>  create mode 100644 lib/librte_acl/acl_run_scalar.c
>  create mode 100644 lib/librte_acl/acl_run_sse.c
> 
> diff --git a/app/test-acl/main.c b/app/test-acl/main.c
> index d654409..45c6fa6 100644
> --- a/app/test-acl/main.c
> +++ b/app/test-acl/main.c
> @@ -787,6 +787,10 @@ acx_init(void)
>  	/* perform build. */
>  	ret = rte_acl_build(config.acx, &cfg);
>  
> +	/* setup default rte_acl_classify */
> +	if (config.scalar)
> +		rte_acl_default_classify = rte_acl_classify_scalar;
> +
Exporting this variable as part of the ABI is a bad idea.  If the prototype of
the function changes you have to update all your applications.  Make the pointer
an internal symbol and set it using a get/set routine with an enum to represent
the path to choose.  That will help isolate the ABI from the internal
implementation.  It will also let you prevent things like selecting a run time
path that is incompatible with the running system, and prevent path switching
during searches, which may produce unexpected results.
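Something along these lines would keep the pointer private (a sketch only; the
enum and setter name are invented for illustration, not taken from the patch):

    enum rte_acl_classify_alg {
        RTE_ACL_CLASSIFY_SCALAR,
        RTE_ACL_CLASSIFY_SSE,
    };

    static rte_acl_classify_t rte_acl_default_classify =
        rte_acl_classify_scalar;

    int
    rte_acl_set_default_classify(enum rte_acl_classify_alg alg)
    {
        switch (alg) {
        case RTE_ACL_CLASSIFY_SSE:
            /* refuse a path the running CPU cannot execute */
            if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
                return -ENOTSUP;
            rte_acl_default_classify = rte_acl_classify_sse;
            return 0;
        case RTE_ACL_CLASSIFY_SCALAR:
            rte_acl_default_classify = rte_acl_classify_scalar;
            return 0;
        }
        return -EINVAL;
    }

Applications would then pass the enum value rather than a function pointer, so
the ABI stays stable even if the classify prototype changes later.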
><snip>
> diff --git a/lib/librte_acl/acl_run.c b/lib/librte_acl/acl_run.c
> deleted file mode 100644
> index e3d9fc1..0000000
> --- a/lib/librte_acl/acl_run.c
> +++ /dev/null
> @@ -1,944 +0,0 @@
> -/*-
> - *   BSD LICENSE
> - *
> - *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> - *   All rights reserved.
> - *
> - *   Redistribution and use in source and binary forms, with or without
> - *   modification, are permitted provided that the following conditions
><snip>
> +
> +#define	__func_resolve_priority__	resolve_priority_scalar
> +#define	__func_match_check__		acl_match_check_scalar
> +#include "acl_match_check.def"
> +
I get this lets you make some more code common, but it's just unpleasant to
trace through.  Looking at the definition of __func_match_check__ I don't see
anything particularly performance sensitive there.  What if instead you simply
redefined __func_match_check__ in a common internal header as acl_match_check
(a generic function), and had it accept a priority resolution function as an
argument?  That would still give you all the performance enhancements without
having to include C files in the middle of other C files, and would make the
code a bit more parseable.
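Roughly like this (a sketch; the prototype is inferred from how the .def
template is used elsewhere in the patch, so treat the exact signature as a
guess):

    typedef void (*resolve_priority_t)(uint64_t transition, int n,
        const struct rte_acl_ctx *ctx, struct parms *parms,
        const struct rte_acl_match_results *p, uint32_t categories);

    static inline uint64_t
    acl_match_check(uint64_t transition, int slot,
        const struct rte_acl_ctx *ctx, struct parms *parms,
        struct acl_flow_data *flows, resolve_priority_t resolve_priority)
    {
        if (transition & RTE_ACL_NODE_MATCH) {
            /* same match bookkeeping as in acl_match_check.def,
             * with the priority resolver passed in; each caller
             * passes a compile-time constant, so the compiler can
             * still inline it and the fast path loses nothing. */
            resolve_priority(transition, slot, ctx, parms,
                (const struct rte_acl_match_results *)
                (flows->trans + ctx->match_index),
                flows->categories);
        }
        return transition;
    }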
> +/*
> + * When processing the transition, rather than using if/else
> + * construct, the offset is calculated for DFA and QRANGE and
> + * then conditionally added to the address based on node type.
> + * This is done to avoid branch mis-predictions. Since the
> + * offset is rather simple calculation it is more efficient
> + * to do the calculation and do a condition move rather than
> + * a conditional branch to determine which calculation to do.
> + */
> +static inline uint32_t
> +scan_forward(uint32_t input, uint32_t max)
> +{
> +	return (input == 0) ? max : rte_bsf32(input);
> +}
> +	}
> +}
><snip>
> +
> +#define	__func_resolve_priority__	resolve_priority_sse
> +#define	__func_match_check__		acl_match_check_sse
> +#include "acl_match_check.def"
> +
Same deal as above.
> +/*
> + * Extract transitions from an XMM register and check for any matches
> + */
> +static void
> +acl_process_matches(xmm_t *indicies, int slot, const struct rte_acl_ctx *ctx,
> +	struct parms *parms, struct acl_flow_data *flows)
> +{
> +	uint64_t transition1, transition2;
> +
> +	/* extract transition from low 64 bits. */
> +	transition1 = MM_CVT64(*indicies);
> +
> +	/* extract transition from high 64 bits. */
> +	*indicies = MM_SHUFFLE32(*indicies, SHUFFLE32_SWAP64);
> +	transition2 = MM_CVT64(*indicies);
> +
> +	transition1 = acl_match_check_sse(transition1, slot, ctx,
> +		parms, flows);
> +	transition2 = acl_match_check_sse(transition2, slot + 1, ctx,
> +		parms, flows);
> +
> +	/* update indicies with new transitions. */
> +	*indicies = MM_SET64(transition2, transition1);
> +}
> +
> +/*
> + * Check for a match in 2 transitions (contained in SSE register)
> + */
> +static inline void
> +acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> +	struct acl_flow_data *flows, xmm_t *indicies, xmm_t match_mask)
> +{
> +	xmm_t temp;
> +
> +	temp = MM_AND(match_mask, *indicies);
> +	while (!MM_TESTZ(temp, temp)) {
> +		acl_process_matches(indicies, slot, ctx, parms, flows);
> +		temp = MM_AND(match_mask, *indicies);
> +	}
> +}
> +
> +/*
> + * Check for any match in 4 transitions (contained in 2 SSE registers)
> + */
> +static inline void
> +acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
> +	struct acl_flow_data *flows, xmm_t *indicies1, xmm_t *indicies2,
> +	xmm_t match_mask)
> +{
> +	xmm_t temp;
> +
> +	/* put low 32 bits of each transition into one register */
> +	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> +		0x88);
> +	/* test for match node */
> +	temp = MM_AND(match_mask, temp);
> +
> +	while (!MM_TESTZ(temp, temp)) {
> +		acl_process_matches(indicies1, slot, ctx, parms, flows);
> +		acl_process_matches(indicies2, slot + 2, ctx, parms, flows);
> +
> +		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> +					(__m128)*indicies2,
> +					0x88);
> +		temp = MM_AND(match_mask, temp);
> +	}
> +}
> +
> +/*
> + * Calculate the address of the next transition for
> + * all types of nodes. Note that only DFA nodes and range
> + * nodes actually transition to another node. Match
> + * nodes don't move.
> + */
> +static inline xmm_t
> +acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> +	xmm_t *indicies1, xmm_t *indicies2)
> +{
> +	xmm_t addr, node_types, temp;
> +
> +	/*
> +	 * Note that no transition is done for a match
> +	 * node and therefore a stream freezes when
> +	 * it reaches a match.
> +	 */
> +
> +	/* Shuffle low 32 into temp and high 32 into indicies2 */
> +	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1, (__m128)*indicies2,
> +		0x88);
> +	*indicies2 = (xmm_t)MM_SHUFFLEPS((__m128)*indicies1,
> +		(__m128)*indicies2, 0xdd);
> +
> +	/* Calc node type and node addr */
> +	node_types = MM_ANDNOT(index_mask, temp);
> +	addr = MM_AND(index_mask, temp);
> +
> +	/*
> +	 * Calc addr for DFAs - addr = dfa_index + input_byte
> +	 */
> +
> +	/* mask for DFA type (0) nodes */
> +	temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
> +
> +	/* add input byte to DFA position */
> +	temp = MM_AND(temp, bytes);
> +	temp = MM_AND(temp, next_input);
> +	addr = MM_ADD32(addr, temp);
> +
> +	/*
> +	 * Calc addr for Range nodes -> range_index + range(input)
> +	 */
> +	node_types = MM_CMPEQ32(node_types, type_quad_range);
> +
> +	/*
> +	 * Calculate number of range boundaries that are less than the
> +	 * input value. Range boundaries for each node are in signed 8 bit,
> +	 * ordered from -128 to 127 in the indicies2 register.
> +	 * This is effectively a popcnt of bytes that are greater than the
> +	 * input byte.
> +	 */
> +
> +	/* shuffle input byte to all 4 positions of 32 bit value */
> +	temp = MM_SHUFFLE8(next_input, shuffle_input);
> +
> +	/* check ranges */
> +	temp = MM_CMPGT8(temp, *indicies2);
> +
> +	/* convert -1 to 1 (bytes greater than input byte) */
> +	temp = MM_SIGN8(temp, temp);
> +
> +	/* horizontal add pairs of bytes into words */
> +	temp = MM_MADD8(temp, temp);
> +
> +	/* horizontal add pairs of words into dwords */
> +	temp = MM_MADD16(temp, ones_16);
> +
> +	/* mask to range type nodes */
> +	temp = MM_AND(temp, node_types);
> +
> +	/* add index into node position */
> +	return MM_ADD32(addr, temp);
> +}
> +
> +/*
> + * Process 4 transitions (in 2 SIMD registers) in parallel
> + */
> +static inline xmm_t
> +transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> +	const uint64_t *trans, xmm_t *indicies1, xmm_t *indicies2)
> +{
> +	xmm_t addr;
> +	uint64_t trans0, trans2;
> +
> +	 /* Calculate the address (array index) for all 4 transitions. */
> +
> +	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> +		bytes, type_quad_range, indicies1, indicies2);
> +
> +	 /* Gather 64 bit transitions and pack back into 2 registers. */
> +
> +	trans0 = trans[MM_CVT32(addr)];
> +
> +	/* get slot 2 */
> +
> +	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
> +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
> +	trans2 = trans[MM_CVT32(addr)];
> +
> +	/* get slot 1 */
> +
> +	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
> +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> +	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
> +
> +	/* get slot 3 */
> +
> +	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
> +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
> +	*indicies2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
> +
> +	return MM_SRL32(next_input, 8);
> +}
> +
> +/*
> + * Execute trie traversal with 8 traversals in parallel
> + */
> +static inline int
> +search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +	uint32_t *results, uint32_t total_packets, uint32_t categories)
> +{
> +	int n;
> +	struct acl_flow_data flows;
> +	uint64_t index_array[MAX_SEARCHES_SSE8];
> +	struct completion cmplt[MAX_SEARCHES_SSE8];
> +	struct parms parms[MAX_SEARCHES_SSE8];
> +	xmm_t input0, input1;
> +	xmm_t indicies1, indicies2, indicies3, indicies4;
> +
> +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> +		total_packets, categories, ctx->trans_table);
> +
> +	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
> +		cmplt[n].count = 0;
> +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> +	}
> +
> +	/*
> +	 * indicies1 contains index_array[0,1]
> +	 * indicies2 contains index_array[2,3]
> +	 * indicies3 contains index_array[4,5]
> +	 * indicies4 contains index_array[6,7]
> +	 */
> +
> +	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> +	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> +
> +	indicies3 = MM_LOADU((xmm_t *) &index_array[4]);
> +	indicies4 = MM_LOADU((xmm_t *) &index_array[6]);
> +
> +	 /* Check for any matches. */
> +	acl_match_check_x4(0, ctx, parms, &flows,
> +		&indicies1, &indicies2, mm_match_mask.m);
> +	acl_match_check_x4(4, ctx, parms, &flows,
> +		&indicies3, &indicies4, mm_match_mask.m);
> +
> +	while (flows.started > 0) {
> +
> +		/* Gather 4 bytes of input data for each stream. */
> +		input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
> +			0);
> +		input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
> +			0);
> +
> +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
> +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
> +
> +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
> +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
> +
> +		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
> +		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
> +
> +		 /* Process the 4 bytes of input on each stream. */
> +
> +		input0 = transition4(mm_index_mask.m, input0,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		input1 = transition4(mm_index_mask.m, input1,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies3, &indicies4);
> +
> +		input0 = transition4(mm_index_mask.m, input0,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		input1 = transition4(mm_index_mask.m, input1,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies3, &indicies4);
> +
> +		input0 = transition4(mm_index_mask.m, input0,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		input1 = transition4(mm_index_mask.m, input1,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies3, &indicies4);
> +
> +		input0 = transition4(mm_index_mask.m, input0,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		input1 = transition4(mm_index_mask.m, input1,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies3, &indicies4);
> +
> +		 /* Check for any matches. */
> +		acl_match_check_x4(0, ctx, parms, &flows,
> +			&indicies1, &indicies2, mm_match_mask.m);
> +		acl_match_check_x4(4, ctx, parms, &flows,
> +			&indicies3, &indicies4, mm_match_mask.m);
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Execute trie traversal with 4 traversals in parallel
> + */
> +static inline int
> +search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +	 uint32_t *results, int total_packets, uint32_t categories)
> +{
> +	int n;
> +	struct acl_flow_data flows;
> +	uint64_t index_array[MAX_SEARCHES_SSE4];
> +	struct completion cmplt[MAX_SEARCHES_SSE4];
> +	struct parms parms[MAX_SEARCHES_SSE4];
> +	xmm_t input, indicies1, indicies2;
> +
> +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> +		total_packets, categories, ctx->trans_table);
> +
> +	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
> +		cmplt[n].count = 0;
> +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> +	}
> +
> +	indicies1 = MM_LOADU((xmm_t *) &index_array[0]);
> +	indicies2 = MM_LOADU((xmm_t *) &index_array[2]);
> +
> +	/* Check for any matches. */
> +	acl_match_check_x4(0, ctx, parms, &flows,
> +		&indicies1, &indicies2, mm_match_mask.m);
> +
> +	while (flows.started > 0) {
> +
> +		/* Gather 4 bytes of input data for each stream. */
> +		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
> +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
> +
> +		/* Process the 4 bytes of input on each stream. */
> +		input = transition4(mm_index_mask.m, input,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		 input = transition4(mm_index_mask.m, input,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		 input = transition4(mm_index_mask.m, input,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		 input = transition4(mm_index_mask.m, input,
> +			mm_shuffle_input.m, mm_ones_16.m,
> +			mm_bytes.m, mm_type_quad_range.m,
> +			flows.trans, &indicies1, &indicies2);
> +
> +		/* Check for any matches. */
> +		acl_match_check_x4(0, ctx, parms, &flows,
> +			&indicies1, &indicies2, mm_match_mask.m);
> +	}
> +
> +	return 0;
> +}
> +
> +static inline xmm_t
> +transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
> +	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
> +	const uint64_t *trans, xmm_t *indicies1)
> +{
> +	uint64_t t;
> +	xmm_t addr, indicies2;
> +
> +	indicies2 = MM_XOR(ones_16, ones_16);
> +
> +	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
> +		bytes, type_quad_range, indicies1, &indicies2);
> +
> +	/* Gather 64 bit transitions and pack 2 per register. */
> +
> +	t = trans[MM_CVT32(addr)];
> +
> +	/* get slot 1 */
> +	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
> +	*indicies1 = MM_SET64(trans[MM_CVT32(addr)], t);
> +
> +	return MM_SRL32(next_input, 8);
> +}
> +
> +/*
> + * Execute trie traversal with 2 traversals in parallel.
> + */
> +static inline int
> +search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +	uint32_t *results, uint32_t total_packets, uint32_t categories)
> +{
> +	int n;
> +	struct acl_flow_data flows;
> +	uint64_t index_array[MAX_SEARCHES_SSE2];
> +	struct completion cmplt[MAX_SEARCHES_SSE2];
> +	struct parms parms[MAX_SEARCHES_SSE2];
> +	xmm_t input, indicies;
> +
> +	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
> +		total_packets, categories, ctx->trans_table);
> +
> +	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
> +		cmplt[n].count = 0;
> +		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
> +	}
> +
> +	indicies = MM_LOADU((xmm_t *) &index_array[0]);
> +
> +	/* Check for any matches. */
> +	acl_match_check_x2(0, ctx, parms, &flows, &indicies, mm_match_mask64.m);
> +
> +	while (flows.started > 0) {
> +
> +		/* Gather 4 bytes of input data for each stream. */
> +		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
> +		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
> +
> +		/* Process the 4 bytes of input on each stream. */
> +
> +		input = transition2(mm_index_mask64.m, input,
> +			mm_shuffle_input64.m, mm_ones_16.m,
> +			mm_bytes64.m, mm_type_quad_range64.m,
> +			flows.trans, &indicies);
> +
> +		input = transition2(mm_index_mask64.m, input,
> +			mm_shuffle_input64.m, mm_ones_16.m,
> +			mm_bytes64.m, mm_type_quad_range64.m,
> +			flows.trans, &indicies);
> +
> +		input = transition2(mm_index_mask64.m, input,
> +			mm_shuffle_input64.m, mm_ones_16.m,
> +			mm_bytes64.m, mm_type_quad_range64.m,
> +			flows.trans, &indicies);
> +
> +		input = transition2(mm_index_mask64.m, input,
> +			mm_shuffle_input64.m, mm_ones_16.m,
> +			mm_bytes64.m, mm_type_quad_range64.m,
> +			flows.trans, &indicies);
> +
> +		/* Check for any matches. */
> +		acl_match_check_x2(0, ctx, parms, &flows, &indicies,
> +			mm_match_mask64.m);
> +	}
> +
> +	return 0;
> +}
> +
> +int
> +rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
> +	uint32_t *results, uint32_t num, uint32_t categories)
> +{
> +	if (categories != 1 &&
> +		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
> +		return -EINVAL;
> +
> +	if (likely(num >= MAX_SEARCHES_SSE8))
> +		return search_sse_8(ctx, data, results, num, categories);
> +	else if (num >= MAX_SEARCHES_SSE4)
> +		return search_sse_4(ctx, data, results, num, categories);
> +	else
> +		return search_sse_2(ctx, data, results, num, categories);
> +}
> diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
> index 7c288bd..0cde07e 100644
> --- a/lib/librte_acl/rte_acl.c
> +++ b/lib/librte_acl/rte_acl.c
> @@ -38,6 +38,21 @@
>  
>  TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
>  
> +/* by default, use the always available scalar code path. */
> +rte_acl_classify_t rte_acl_default_classify = rte_acl_classify_scalar;
> +
Make this static; the outside world shouldn't need to see it.
> +void __attribute__((constructor(INT16_MAX)))
> +rte_acl_select_classify(void)
Make it static; the outside world doesn't need to call this.
> +{
> +	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) {
> +		/* SSE version requires SSE4.1 */
> +		rte_acl_default_classify = rte_acl_classify_sse;
> +	} else {
> +		/* reset to scalar version. */
> +		rte_acl_default_classify = rte_acl_classify_scalar;
Don't need the else clause here, the static initializer has you covered.
> +	}
> +}
> +
> +
> +/**
> + * Invokes the default rte_acl_classify function.
> + */
> +extern rte_acl_classify_t rte_acl_default_classify;
> +
Doesn't need to be extern.
> +#define	rte_acl_classify(ctx, data, results, num, categories)	\
> +	(*rte_acl_default_classify)(ctx, data, results, num, categories)
> +
Not sure why you need this either.  The rte_acl_classify_t should be enough, no?
Regards
Neil
^ permalink raw reply	[relevance 4%]
* Re: [dpdk-dev] [PATCH] kni: fixed compilation error on Ubuntu 14.04 LTS (kernel 3.13.0-30.54)
  2014-07-24 14:28 11% [dpdk-dev] [PATCH] kni: fixed compilation error on Ubuntu 14.04 LTS (kernel 3.13.0-30.54) Pablo de Lara
  2014-07-24 14:54  0% ` Thomas Monjalon
@ 2014-07-24 15:20  0% ` Chris Wright
  1 sibling, 0 replies; 86+ results
From: Chris Wright @ 2014-07-24 15:20 UTC (permalink / raw)
  To: Pablo de Lara; +Cc: dev, Patrice Buriez
* Pablo de Lara (pablo.de.lara.guarch@intel.com) wrote:
> Signed-off-by: Patrice Buriez <patrice.buriez@intel.com>
Just a mechanical nitpick on DCO.  Pablo, this patch appears to be
written by Patrice.  If so, it should begin with "From: Patrice Buriez
<patrice.buriez@intel.com>" and should include your own Signed-off-by.
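i.e. the patch mail would carry something like:

    From: Patrice Buriez <patrice.buriez@intel.com>

    ...patch description...

    Signed-off-by: Patrice Buriez <patrice.buriez@intel.com>
    Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>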
thanks,
-chris
> ---
>  lib/librte_eal/linuxapp/kni/Makefile              |    9 +++++++++
>  lib/librte_eal/linuxapp/kni/ethtool/igb/kcompat.h |   16 ++++++++++++++++
>  2 files changed, 25 insertions(+), 0 deletions(-)
> 
> diff --git a/lib/librte_eal/linuxapp/kni/Makefile b/lib/librte_eal/linuxapp/kni/Makefile
> index fb9462f..725d3e7 100644
> --- a/lib/librte_eal/linuxapp/kni/Makefile
> +++ b/lib/librte_eal/linuxapp/kni/Makefile
> @@ -44,6 +44,15 @@ MODULE_CFLAGS += -I$(RTE_OUTPUT)/include -I$(SRCDIR)/ethtool/ixgbe -I$(SRCDIR)/e
>  MODULE_CFLAGS += -include $(RTE_OUTPUT)/include/rte_config.h
>  MODULE_CFLAGS += -Wall -Werror
>  
> +ifeq ($(shell type lsb_release >/dev/null 2>&1 && lsb_release -si),Ubuntu)
> +MODULE_CFLAGS += -DUBUNTU_RELEASE_CODE=$(subst .,,$(shell lsb_release -sr))
> +UBUNTU_KERNEL_CODE := $(shell cut -d' ' -f2 /proc/version_signature |cut -d- -f1,2)
> +UBUNTU_KERNEL_CODE := $(subst -,$(comma),$(UBUNTU_KERNEL_CODE))
> +UBUNTU_KERNEL_CODE := $(subst .,$(comma),$(UBUNTU_KERNEL_CODE))
> +MODULE_CFLAGS += -D"UBUNTU_KERNEL_CODE=UBUNTU_KERNEL_VERSION($(UBUNTU_KERNEL_CODE))"
> +endif
> +
> +
>  # this lib needs main eal
>  DEPDIRS-y += lib/librte_eal/linuxapp/eal
>  
> diff --git a/lib/librte_eal/linuxapp/kni/ethtool/igb/kcompat.h b/lib/librte_eal/linuxapp/kni/ethtool/igb/kcompat.h
> index 521a35d..5a06383 100644
> --- a/lib/librte_eal/linuxapp/kni/ethtool/igb/kcompat.h
> +++ b/lib/librte_eal/linuxapp/kni/ethtool/igb/kcompat.h
> @@ -713,6 +713,20 @@ struct _kc_ethtool_pauseparam {
>  #define SLE_VERSION_CODE 0
>  #endif /* SLE_VERSION_CODE */
>  
> +/* Ubuntu release and kernel codes must be specified from Makefile */
> +#ifndef UBUNTU_RELEASE_VERSION
> +#define UBUNTU_RELEASE_VERSION(a,b) (((a) * 100) + (b))
> +#endif
> +#ifndef UBUNTU_KERNEL_VERSION
> +#define UBUNTU_KERNEL_VERSION(a,b,c,abi,upload) (((a) << 40) + ((b) << 32) + ((c) << 24) + ((abi) << 8) + (upload))
> +#endif
> +#ifndef UBUNTU_RELEASE_CODE
> +#define UBUNTU_RELEASE_CODE 0
> +#endif
> +#ifndef UBUNTU_KERNEL_CODE
> +#define UBUNTU_KERNEL_CODE 0
> +#endif
> +
>  #ifdef __KLOCWORK__
>  #ifdef ARRAY_SIZE
>  #undef ARRAY_SIZE
> @@ -3847,6 +3861,7 @@ static inline struct sk_buff *__kc__vlan_hwaccel_put_tag(struct sk_buff *skb,
>  
>  #if ( LINUX_VERSION_CODE < KERNEL_VERSION(3,14,0) )
>  #if (!(RHEL_RELEASE_CODE && RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(7,0)))
> +#if (!(UBUNTU_RELEASE_CODE == UBUNTU_RELEASE_VERSION(14,4) && UBUNTU_KERNEL_CODE >= UBUNTU_KERNEL_VERSION(3,13,0,30,54)))
>  #ifdef NETIF_F_RXHASH
>  #define PKT_HASH_TYPE_L3 0
>  static inline void
> @@ -3855,6 +3870,7 @@ skb_set_hash(struct sk_buff *skb, __u32 hash, __always_unused int type)
>  	skb->rxhash = hash;
>  }
>  #endif /* NETIF_F_RXHASH */
> +#endif /* < 3.13.0-30.54 (Ubuntu 14.04) */
>  #endif /* < RHEL7 */
>  #endif /* < 3.14.0 */
>  
> -- 
> 1.7.0.7
> 
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCH] kni: fixed compilation error on Ubuntu 14.04 LTS (kernel 3.13.0-30.54)
  2014-07-24 14:54  0% ` Thomas Monjalon
@ 2014-07-24 14:59  0%   ` Thomas Monjalon
  0 siblings, 0 replies; 86+ results
From: Thomas Monjalon @ 2014-07-24 14:59 UTC (permalink / raw)
  To: Patrice Buriez; +Cc: dev
2014-07-24 16:54, Thomas Monjalon:
> > Unlike RHEL_RELEASE_CODE, there is no such UBUNTU_RELEASE_CODE available out of
> > the box, so it needs to be crafted from the Makefile.
> > Similarly, UBUNTU_KERNEL_CODE is generated with ABI and upload numbers.
> 
> It's quite amazing to see that Linux distributions do backports and do not
> provide a way to check them.
> Anyway, thanks for the fix.
> 
> > +ifeq ($(shell type lsb_release >/dev/null 2>&1 && lsb_release -si),Ubuntu)
> 
> Why not this simpler form?
> $(shell lsb_release -si 2>/dev/null)
> 
> > +MODULE_CFLAGS += -DUBUNTU_RELEASE_CODE=$(subst .,,$(shell lsb_release -sr))
> 
> Or you can use | tr -d . instead of subst and keep the flow from left to right.
> 
> > +UBUNTU_KERNEL_CODE := $(shell cut -d' ' -f2 /proc/version_signature |cut -d- -f1,2)
>                                                                         ^
>                                                          space missing here
> 
> > +UBUNTU_KERNEL_CODE := $(subst -,$(comma),$(UBUNTU_KERNEL_CODE))
> > +UBUNTU_KERNEL_CODE := $(subst .,$(comma),$(UBUNTU_KERNEL_CODE))
> 
> Would be simpler with | tr -d .-
Sorry, I mean tr -d .- $(comma)
-- 
Thomas
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCH] kni: fixed compilation error on Ubuntu 14.04 LTS (kernel 3.13.0-30.54)
  2014-07-24 14:28 11% [dpdk-dev] [PATCH] kni: fixed compilation error on Ubuntu 14.04 LTS (kernel 3.13.0-30.54) Pablo de Lara
@ 2014-07-24 14:54  0% ` Thomas Monjalon
  2014-07-24 14:59  0%   ` Thomas Monjalon
  2014-07-24 15:20  0% ` Chris Wright
  1 sibling, 1 reply; 86+ results
From: Thomas Monjalon @ 2014-07-24 14:54 UTC (permalink / raw)
  To: Patrice Buriez; +Cc: dev
> Unlike RHEL_RELEASE_CODE, there is no such UBUNTU_RELEASE_CODE available out of
> the box, so it needs to be crafted from the Makefile.
> Similarly, UBUNTU_KERNEL_CODE is generated with ABI and upload numbers.
It's quite amazing to see that Linux distributions do backports and do not
provide a way to check them.
Anyway, thanks for the fix.
> +ifeq ($(shell type lsb_release >/dev/null 2>&1 && lsb_release -si),Ubuntu)
Why not this simpler form?
$(shell lsb_release -si 2>/dev/null)
> +MODULE_CFLAGS += -DUBUNTU_RELEASE_CODE=$(subst .,,$(shell lsb_release -sr))
Or you can use | tr -d . instead of subst and keep the flow from left to right.
> +UBUNTU_KERNEL_CODE := $(shell cut -d' ' -f2 /proc/version_signature |cut -d- -f1,2)
                                                                        ^
                                                         space missing here
> +UBUNTU_KERNEL_CODE := $(subst -,$(comma),$(UBUNTU_KERNEL_CODE))
> +UBUNTU_KERNEL_CODE := $(subst .,$(comma),$(UBUNTU_KERNEL_CODE))
Would be simpler with | tr -d .-
-- 
Thomas
^ permalink raw reply	[relevance 0%]
* [dpdk-dev] [PATCH] kni: fixed compilation error on Ubuntu 14.04 LTS (kernel 3.13.0-30.54)
@ 2014-07-24 14:28 11% Pablo de Lara
  2014-07-24 14:54  0% ` Thomas Monjalon
  2014-07-24 15:20  0% ` Chris Wright
  0 siblings, 2 replies; 86+ results
From: Pablo de Lara @ 2014-07-24 14:28 UTC (permalink / raw)
  To: dev; +Cc: Patrice Buriez
Recent Ubuntu kernel 3.13.0-30.54, although based on Linux kernel 3.13.11,
already provides skb_set_hash() inline function, slightly different than
the one provided by lib/librte_eal/linuxapp/kni/ethtool/igb/kcompat.h
Ubuntu kernel 3.13.0-30.54 provides:
    * i40e/i40evf: i40e implementation for skb_set_hash
    - https://bugs.launchpad.net/bugs/1328037
    - http://changelogs.ubuntu.com/changelogs/pool/main/l/linux/linux_3.13.0-30.54/changelog
As a result, the implementation provided by kcompat.h must be skipped.
It is not appropriate to test whether LINUX_VERSION_CODE >= KERNEL_VERSION(3,13,11)
because previous Ubuntu kernel 3.13.0-29.53, already based on 3.13.11, needs to
get the implementation provided by kcompat.h
So the full Ubuntu kernel version numbering scheme must be tested:
<base kernel version>-<ABI number>.<upload number>-<flavour>
See "What does a specific Ubuntu kernel version number mean?"
and "How can we determine the version of the running kernel?"
at: https://wiki.ubuntu.com/Kernel/FAQ
Unlike RHEL_RELEASE_CODE, there is no such UBUNTU_RELEASE_CODE available out of
the box, so it needs to be crafted from the Makefile
Similarly, UBUNTU_KERNEL_CODE is generated with ABI and upload numbers.
`lsb_release -si` is first used to check whether we are running Ubuntu
`lsb_release -sr` provides release number 14.04, then converted to integer 1404
/proc/version_signature is parsed to get base kernel version, ABI and upload
numbers, and flavour is dropped
UBUNTU_KERNEL_CODE is indirectly defined using the UBUNTU_KERNEL_VERSION macro,
which in turn is defined in kcompat.h
This makes a single place to define the Ubuntu kernel version numbering scheme,
which is slightly different from the usual "shift by 8" scheme: ABI numbers can
be big (see: https://wiki.ubuntu.com/Kernel/Dev/TopicBranches), so 16 bits have
been reserved for them.
Finally, the implementation of skb_set_hash is skipped in kcompat.h if we are
running Ubuntu 14.04 with an Ubuntu kernel >= 3.13.0-30.54
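As a worked example (illustration only, not part of the patch), the scheme
comes out as below; note that the comparison is evaluated by the C
preprocessor, whose arithmetic is at least 64-bit, so the 40-bit shift is safe:

#define UBUNTU_RELEASE_VERSION(a,b) (((a) * 100) + (b))
#define UBUNTU_KERNEL_VERSION(a,b,c,abi,upload) \
	(((a) << 40) + ((b) << 32) + ((c) << 24) + ((abi) << 8) + (upload))

#define UBUNTU_RELEASE_CODE 1404	/* "14.04" as crafted by the Makefile */
#define UBUNTU_KERNEL_CODE  UBUNTU_KERNEL_VERSION(3,13,0,30,54)

#if (UBUNTU_RELEASE_CODE == UBUNTU_RELEASE_VERSION(14,4) && \
     UBUNTU_KERNEL_CODE >= UBUNTU_KERNEL_VERSION(3,13,0,30,54))
/* kcompat.h skips its own skb_set_hash() in this case */
#endif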
Signed-off-by: Patrice Buriez <patrice.buriez@intel.com>
---
 lib/librte_eal/linuxapp/kni/Makefile              |    9 +++++++++
 lib/librte_eal/linuxapp/kni/ethtool/igb/kcompat.h |   16 ++++++++++++++++
 2 files changed, 25 insertions(+), 0 deletions(-)
diff --git a/lib/librte_eal/linuxapp/kni/Makefile b/lib/librte_eal/linuxapp/kni/Makefile
index fb9462f..725d3e7 100644
--- a/lib/librte_eal/linuxapp/kni/Makefile
+++ b/lib/librte_eal/linuxapp/kni/Makefile
@@ -44,6 +44,15 @@ MODULE_CFLAGS += -I$(RTE_OUTPUT)/include -I$(SRCDIR)/ethtool/ixgbe -I$(SRCDIR)/e
 MODULE_CFLAGS += -include $(RTE_OUTPUT)/include/rte_config.h
 MODULE_CFLAGS += -Wall -Werror
 
+ifeq ($(shell type lsb_release >/dev/null 2>&1 && lsb_release -si),Ubuntu)
+MODULE_CFLAGS += -DUBUNTU_RELEASE_CODE=$(subst .,,$(shell lsb_release -sr))
+UBUNTU_KERNEL_CODE := $(shell cut -d' ' -f2 /proc/version_signature |cut -d- -f1,2)
+UBUNTU_KERNEL_CODE := $(subst -,$(comma),$(UBUNTU_KERNEL_CODE))
+UBUNTU_KERNEL_CODE := $(subst .,$(comma),$(UBUNTU_KERNEL_CODE))
+MODULE_CFLAGS += -D"UBUNTU_KERNEL_CODE=UBUNTU_KERNEL_VERSION($(UBUNTU_KERNEL_CODE))"
+endif
+
+
 # this lib needs main eal
 DEPDIRS-y += lib/librte_eal/linuxapp/eal
 
diff --git a/lib/librte_eal/linuxapp/kni/ethtool/igb/kcompat.h b/lib/librte_eal/linuxapp/kni/ethtool/igb/kcompat.h
index 521a35d..5a06383 100644
--- a/lib/librte_eal/linuxapp/kni/ethtool/igb/kcompat.h
+++ b/lib/librte_eal/linuxapp/kni/ethtool/igb/kcompat.h
@@ -713,6 +713,20 @@ struct _kc_ethtool_pauseparam {
 #define SLE_VERSION_CODE 0
 #endif /* SLE_VERSION_CODE */
 
+/* Ubuntu release and kernel codes must be specified from Makefile */
+#ifndef UBUNTU_RELEASE_VERSION
+#define UBUNTU_RELEASE_VERSION(a,b) (((a) * 100) + (b))
+#endif
+#ifndef UBUNTU_KERNEL_VERSION
+#define UBUNTU_KERNEL_VERSION(a,b,c,abi,upload) (((a) << 40) + ((b) << 32) + ((c) << 24) + ((abi) << 8) + (upload))
+#endif
+#ifndef UBUNTU_RELEASE_CODE
+#define UBUNTU_RELEASE_CODE 0
+#endif
+#ifndef UBUNTU_KERNEL_CODE
+#define UBUNTU_KERNEL_CODE 0
+#endif
+
 #ifdef __KLOCWORK__
 #ifdef ARRAY_SIZE
 #undef ARRAY_SIZE
@@ -3847,6 +3861,7 @@ static inline struct sk_buff *__kc__vlan_hwaccel_put_tag(struct sk_buff *skb,
 
 #if ( LINUX_VERSION_CODE < KERNEL_VERSION(3,14,0) )
 #if (!(RHEL_RELEASE_CODE && RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(7,0)))
+#if (!(UBUNTU_RELEASE_CODE == UBUNTU_RELEASE_VERSION(14,4) && UBUNTU_KERNEL_CODE >= UBUNTU_KERNEL_VERSION(3,13,0,30,54)))
 #ifdef NETIF_F_RXHASH
 #define PKT_HASH_TYPE_L3 0
 static inline void
@@ -3855,6 +3870,7 @@ skb_set_hash(struct sk_buff *skb, __u32 hash, __always_unused int type)
 	skb->rxhash = hash;
 }
 #endif /* NETIF_F_RXHASH */
+#endif /* < 3.13.0-30.54 (Ubuntu 14.04) */
 #endif /* < RHEL7 */
 #endif /* < 3.14.0 */
 
-- 
1.7.0.7
^ permalink raw reply	[relevance 11%]
* Re: [dpdk-dev] [PATCH 2/4] distributor: new packet distributor library
  2014-05-21 10:21  3%     ` Richardson, Bruce
@ 2014-05-21 15:23  3%       ` Neil Horman
  0 siblings, 0 replies; 86+ results
From: Neil Horman @ 2014-05-21 15:23 UTC (permalink / raw)
  To: Richardson, Bruce; +Cc: dev
On Wed, May 21, 2014 at 10:21:26AM +0000, Richardson, Bruce wrote:
> > -----Original Message-----
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Tuesday, May 20, 2014 7:19 PM
> > To: Richardson, Bruce
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH 2/4] distributor: new packet distributor library
> > 
> > On Tue, May 20, 2014 at 11:00:55AM +0100, Bruce Richardson wrote:
> > > This adds the code for a new Intel DPDK library for packet distribution.
> > > The distributor is a component which is designed to pass packets
> > > one-at-a-time to workers, with dynamic load balancing. Using the RSS
> > > field in the mbuf as a tag, the distributor tracks what packet tag is
> > > being processed by what worker and then ensures that no two packets with
> > > the same tag are in-flight simultaneously. Once a tag is not in-flight,
> > > then the next packet with that tag will be sent to the next available
> > > core.
> > >
> > > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > ><snip>
> > 
> > ><snip other comments as I agree with your responses to them all save below>
> > Don't need to reserve an extra argument here.  You're not ABI safe currently,
> > and if DPDK becomes ABI safe in the future, we will use a linker script to
> > provide versions with backward compatibility easily enough.
> We may not have ABI compatibility between releases, but on the other hand we try to reduce the number of code changes that need to be made by our customers who are compiling their code against the libraries - generally linking against static rather than shared libraries. Since we have a reasonable expectation that this field will be needed in a future release, we want to include it now so that when we do need it, no code changes need to be made to upgrade this particular library to a new Intel DPDK version.
I understand why you added the reserved argument, but I still don't think it's a
good idea, especially since you're not ABI safe/stable at the moment.  By adding
this argument, you're forcing early users to declare a variable to pass into
your library that they know is unused, and as such likely uninitialized (or at
least initialized to an unknown value).  When you do in the future make use of
this unknown value, your internal implementation will have to support being
called by both 'old' applications that just pass in any old value, and 'new'
users who pass in valid data, and the implementation won't have any way to
differentiate between the two.  You can certainly document a reserved value that
current users must initialize that variable to, so that you can make that
differentiation, but you have to hope they do that correctly and consistently.
It seems to me it would be better to do something like:
1) Not include the reserved parameter
2) When you do add the extra parameter, rename the function as well, and
3) provide a compatibility function that preserves the old API and passes the
reserved value as the new parameter to the renamed function in (2)
That way old applications will run transparently, and you don't have to hope
they code the reserved values properly (note you can also do this with a macro
if you want to save the call instruction)
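For instance, a minimal sketch of (2) and (3) - the _v2 name and signature
here are invented for illustration:

/* the new implementation takes the extra parameter under a new name; the
 * old three-argument API survives as a thin shim that supplies a
 * well-defined default instead of a caller-chosen "reserved" value */
struct rte_distributor;
struct rte_mbuf;

struct rte_mbuf *
rte_distributor_get_pkt_v2(struct rte_distributor *d,
		unsigned worker_id, struct rte_mbuf *oldpkt, unsigned flags);

static inline struct rte_mbuf *
rte_distributor_get_pkt(struct rte_distributor *d,
		unsigned worker_id, struct rte_mbuf *oldpkt)
{
	return rte_distributor_get_pkt_v2(d, worker_id, oldpkt, 0);
}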
Ideally, you would just do this with a version script during linking, so that
you could include 2 versions of the same function name (v1 without the extra
parameter and v2 with the extra parameter), and old applications linked against
v1 would just continue to work, but dpdk isn't there yet :)
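Something along these lines - a sketch only, with invented version tags,
helper names and stub bodies:

/* the linker version script (map file) would contain:
 *
 *   DPDK_1.0 { global: rte_distributor_get_pkt; local: *; };
 *   DPDK_1.1 { global: rte_distributor_get_pkt; } DPDK_1.0;
 *
 * and the DSO is linked with -Wl,--version-script=<mapfile> */
struct rte_distributor;
struct rte_mbuf;

/* v2: implementation with the extra parameter (stub body for the sketch) */
struct rte_mbuf *
get_pkt_v2(struct rte_distributor *d, unsigned worker_id,
		struct rte_mbuf *oldpkt, unsigned flags)
{
	(void)d; (void)worker_id; (void)flags;
	return oldpkt;
}

/* v1: the legacy entry point without the extra parameter */
struct rte_mbuf *
get_pkt_v1(struct rte_distributor *d, unsigned worker_id,
		struct rte_mbuf *oldpkt)
{
	return get_pkt_v2(d, worker_id, oldpkt, 0);
}

/* old binaries keep resolving to v1; newly linked apps get the @@ default */
__asm__(".symver get_pkt_v1, rte_distributor_get_pkt@DPDK_1.0");
__asm__(".symver get_pkt_v2, rte_distributor_get_pkt@@DPDK_1.1");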
Neil
^ permalink raw reply	[relevance 3%]
* Re: [dpdk-dev] [PATCH 2/4] distributor: new packet distributor library
  2014-05-20 18:18  4%   ` Neil Horman
@ 2014-05-21 10:21  3%     ` Richardson, Bruce
  2014-05-21 15:23  3%       ` Neil Horman
  0 siblings, 1 reply; 86+ results
From: Richardson, Bruce @ 2014-05-21 10:21 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev
> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Tuesday, May 20, 2014 7:19 PM
> To: Richardson, Bruce
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 2/4] distributor: new packet distributor library
> 
> On Tue, May 20, 2014 at 11:00:55AM +0100, Bruce Richardson wrote:
> > This adds the code for a new Intel DPDK library for packet distribution.
> > The distributor is a component which is designed to pass packets
> > one-at-a-time to workers, with dynamic load balancing. Using the RSS
> > field in the mbuf as a tag, the distributor tracks what packet tag is
> > being processed by what worker and then ensures that no two packets with
> > the same tag are in-flight simultaneously. Once a tag is not in-flight,
> > then the next packet with that tag will be sent to the next available
> > core.
> >
> > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> ><snip>
> 
> > +#define RTE_DISTRIB_GET_BUF (1)
> > +#define RTE_DISTRIB_RETURN_BUF (2)
> > +
> Can you document the meaning of these bits please, the code makes it
> somewhat
> confusing to differentiate them.  As I read the code, GET_BUF is used as a flag
> to indicate that rte_distributor_get_pkt needs to wait while a buffer is
> filled in by the processing thread, while RETURN_BUF indicates that a worker is
> leaving and the buffer needs to be (re)assigned to an alternate worker, is that
> correct?
Pretty much. I'll add additional comments to the code.
> 
> > +#define RTE_DISTRIB_BACKLOG_SIZE 8
> > +#define RTE_DISTRIB_BACKLOG_MASK (RTE_DISTRIB_BACKLOG_SIZE - 1)
> > +
> > +#define RTE_DISTRIB_MAX_RETURNS 128
> > +#define RTE_DISTRIB_RETURNS_MASK (RTE_DISTRIB_MAX_RETURNS - 1)
> > +
> > +union rte_distributor_buffer {
> > +	volatile int64_t bufptr64;
> > +	char pad[CACHE_LINE_SIZE*3];
> Do you need the pad, if you mark the struct as cache aligned?
Yes, for performance reasons we actually want the structure to take up three cache lines, not just one. For instance, this guarantees that the hardware's adjacent-line prefetcher won't pull in an additional cache line - belonging to a different worker - when we access the memory.
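Roughly like this - a sketch assuming a 64-byte line (the real code uses
CACHE_LINE_SIZE from the DPDK config):

/* alignment alone would only give a one-line footprint; the pad forces a
 * three-line footprint so the adjacent-line prefetcher cannot drag in a
 * neighbouring worker's data */
#define LINE 64

union dist_buffer {
	volatile long long bufptr64;
	char pad[LINE * 3];
} __attribute__((aligned(LINE)));

/* compile-time check of the intended footprint */
typedef char dist_buffer_size_check[
	sizeof(union dist_buffer) == 3 * LINE ? 1 : -1];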
> > +} __rte_cache_aligned;
> >
> +
> ><snip>
> > +
> > +struct rte_mbuf *
> > +rte_distributor_get_pkt(struct rte_distributor *d,
> > +		unsigned worker_id, struct rte_mbuf *oldpkt,
> > +		unsigned reserved __rte_unused)
> > +{
> > +	union rte_distributor_buffer *buf = &d->bufs[worker_id];
> > +	int64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS) |
> \
> > +			RTE_DISTRIB_GET_BUF;
> > +	while (unlikely(buf->bufptr64 & RTE_DISTRIB_FLAGS_MASK))
> > +		rte_pause();
> > +	buf->bufptr64 = req;
> > +	while (buf->bufptr64 & RTE_DISTRIB_GET_BUF)
> > +		rte_pause();
> You may want to document the fact that this is deadlock prone.  You clearly
> state that only a single thread can run the processing routine, but if a user
> selects a single worker thread to perform double duty, the GET_BUF_FLAG will
> never get cleared here, and no other queues will get processed.
Agreed, I'll update the comments.
> 
> > +	/* since bufptr64 is a signed value, this should be an arithmetic shift */
> > +	int64_t ret = buf->bufptr64 >> RTE_DISTRIB_FLAG_BITS;
> > +	return (struct rte_mbuf *)((uintptr_t)ret);
> > +}
> > +
> > +int
> > +rte_distributor_return_pkt(struct rte_distributor *d,
> > +		unsigned worker_id, struct rte_mbuf *oldpkt)
> > +{
> Maybe some optional sanity checking, here and above, to ensure that a packet
> returned through get_pkt doesn't also get returned here, mangling the flags
> field?
That actually shouldn't be an issue. 
When we return a packet using this call, we just set the in_flight_ids value for the worker to zero (and re-assign the backlog, if any), and move on to the next worker. No checking of the returned packet is done. Also, since get_pkt always returns a new packet, the internal logic will still work ok - all that will happen if you return the wrong packet, e.g. by returning the same packet twice rather than returning the latest packet each time, is that the returns array will have the duplicated pointer in it. Whatever gets passed back by the worker gets stored directly there - it's up to the worker to return the correct pointer to the distributor.
> 
> ><snip>
> > +
> > +/* flush the distributor, so that there are no outstanding packets in flight or
> > + * queued up. */
> > +int
> > +rte_distributor_flush(struct rte_distributor *d)
> > +{
> You need to document that this function can only be called by the same thread
> that is running rte_distributor_process, lest you corrupt your queue data.
> Alternatively, it might be nicer to modify this function's internals to set a
> flag in the distributor status bits to make the process routine do the flush
> work when it gets set.  that would allow this function to be called by any
> other thread, which seems like a more natural interface.
Agreed. At minimum I'll update the comments, and I'll also look into what would be involved in changing the mechanism like you describe. However, given the limited time to the code freeze date, it may not be possible to do here. [I also don't anticipate this function being much used in normal operations anyway - it was written in order to allow me to write proper unit tests to test the process function. We need a flush function for unit testing to ensure that our packet counts are predictable at the end of each test run, and to eliminate any dependency in the tests on the internal buffer sizes of the distributor.]
> 
> ><snip>
> > +}
> > +
> > +/* clears the internal returns array in the distributor */
> > +void
> > +rte_distributor_clear_returns(struct rte_distributor *d)
> > +{
> This can also only be called by the same thread that runs the process routine,
> lest the start and count values get mis-assigned.
Agreed. Will update comments.
> 
> > +	d->returns.start = d->returns.count = 0;
> > +#ifndef __OPTIMIZE__
> > +	memset(d->returns.mbufs, 0, sizeof(d->returns.mbufs));
> > +#endif
> > +}
> > +
> > +/* creates a distributor instance */
> > +struct rte_distributor *
> > +rte_distributor_create(const char *name,
> > +		unsigned socket_id,
> > +		unsigned num_workers,
> > +		struct rte_distributor_extra_args *args __rte_unused)
> > +{
> > +	struct rte_distributor *d;
> > +	struct rte_distributor_list *distributor_list;
> > +	char mz_name[RTE_MEMZONE_NAMESIZE];
> > +	const struct rte_memzone *mz;
> > +
> > +	/* compilation-time checks */
> > +	RTE_BUILD_BUG_ON((sizeof(*d) & CACHE_LINE_MASK) != 0);
> > +	RTE_BUILD_BUG_ON((RTE_MAX_LCORE & 7) != 0);
> > +
> > +	if (name == NULL || num_workers >= RTE_MAX_LCORE) {
> > +		rte_errno = EINVAL;
> > +		return NULL;
> > +	}
> > +	rte_snprintf(mz_name, sizeof(mz_name), RTE_DISTRIB_PREFIX"%s",
> name);
> > +	mz = rte_memzone_reserve(mz_name, sizeof(*d), socket_id,
> NO_FLAGS);
> > +	if (mz == NULL) {
> > +		rte_errno = ENOMEM;
> > +		return NULL;
> > +	}
> > +
> > +	/* check that we have an initialised tail queue */
> > +	if ((distributor_list =
> RTE_TAILQ_LOOKUP_BY_IDX(RTE_TAILQ_DISTRIBUTOR,
> > +			rte_distributor_list)) == NULL) {
> > +		rte_errno = E_RTE_NO_TAILQ;
> > +		return NULL;
> > +	}
> > +
> > +	d = mz->addr;
> > +	rte_snprintf(d->name, sizeof(d->name), "%s", name);
> > +	d->num_workers = num_workers;
> > +	TAILQ_INSERT_TAIL(distributor_list, d, next);
> You need locking around this list unless you intend to assert that distributor
> creation and destruction must only be performed from a single thread.  Also,
> where is the API method to tear down a distributor instance?
Ack re locking, will make this as used in other structures.
For tearing down, that's not possible until such time as we get a function to free memzones back. Rings and mempools similarly have no free function.
> 
> ><snip>
> > +#endif
> > +
> > +#include <rte_mbuf.h>
> > +
> > +#define RTE_DISTRIBUTOR_NAMESIZE 32 /**< Length of name for instance
> */
> > +
> > +struct rte_distributor;
> > +
> > +struct rte_distributor_extra_args { }; /**< reserved for future use*/
> > +
> You don't need to reserve a struct name for future use.  No one will use it
> until you create it.
> 
> > +struct rte_mbuf *
> > +rte_distributor_get_pkt(struct rte_distributor *d,
> > +		unsigned worker_id, struct rte_mbuf *oldpkt, unsigned
> reserved);
> > +
> Don't need to reserve an extra argument here.  You're not ABI safe currently,
> and if DPDK becomes ABI safe in the future, we will use a linker script to
> provide versions with backward compatibility easily enough.
We may not have ABI compatibility between releases, but on the other hand we try to reduce the number of code changes that need to be made by our customers who are compiling their code against the libraries - generally linking against static rather than shared libraries. Since we have a reasonable expectation that this field will be needed in a future release, we want to include it now so that when we do need it, no code changes need to be made to upgrade this particular library to a new Intel DPDK version.
^ permalink raw reply	[relevance 3%]
* Re: [dpdk-dev] [PATCH 2/4] distributor: new packet distributor library
  @ 2014-05-20 18:18  4%   ` Neil Horman
  2014-05-21 10:21  3%     ` Richardson, Bruce
  0 siblings, 1 reply; 86+ results
From: Neil Horman @ 2014-05-20 18:18 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev
On Tue, May 20, 2014 at 11:00:55AM +0100, Bruce Richardson wrote:
> This adds the code for a new Intel DPDK library for packet distribution.
> The distributor is a component which is designed to pass packets
> one-at-a-time to workers, with dynamic load balancing. Using the RSS
> field in the mbuf as a tag, the distributor tracks what packet tag is
> being processed by what worker and then ensures that no two packets with
> the same tag are in-flight simultaneously. Once a tag is not in-flight,
> then the next packet with that tag will be sent to the next available
> core.
> 
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
><snip>
> +#define RTE_DISTRIB_GET_BUF (1)
> +#define RTE_DISTRIB_RETURN_BUF (2)
> +
Can you document the meaning of these bits please, the code makes it somewhat
confusing to differentiate them.  As I read the code, GET_BUF is used as a flag
to indicate that rte_distributor_get_pkt needs to wait while a buffer is
filled in by the processing thread, while RETURN_BUF indicates that a worker is
leaving and the buffer needs to be (re)assigned to an alternate worker, is that
correct?
> +#define RTE_DISTRIB_BACKLOG_SIZE 8
> +#define RTE_DISTRIB_BACKLOG_MASK (RTE_DISTRIB_BACKLOG_SIZE - 1)
> +
> +#define RTE_DISTRIB_MAX_RETURNS 128
> +#define RTE_DISTRIB_RETURNS_MASK (RTE_DISTRIB_MAX_RETURNS - 1)
> +
> +union rte_distributor_buffer {
> +	volatile int64_t bufptr64;
> +	char pad[CACHE_LINE_SIZE*3];
Do you need the pad, if you mark the struct as cache aligned?
> +} __rte_cache_aligned;
> 
+
><snip>
> +
> +struct rte_mbuf *
> +rte_distributor_get_pkt(struct rte_distributor *d,
> +		unsigned worker_id, struct rte_mbuf *oldpkt,
> +		unsigned reserved __rte_unused)
> +{
> +	union rte_distributor_buffer *buf = &d->bufs[worker_id];
> +	int64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS) | \
> +			RTE_DISTRIB_GET_BUF;
> +	while (unlikely(buf->bufptr64 & RTE_DISTRIB_FLAGS_MASK))
> +		rte_pause();
> +	buf->bufptr64 = req;
> +	while (buf->bufptr64 & RTE_DISTRIB_GET_BUF)
> +		rte_pause();
You may want to document the fact that this is deadlock prone.  You clearly
state that only a single thread can run the processing routine, but if a user
selects a single worker thread to perform double duty, the GET_BUF_FLAG will
never get cleared here, and no other queues will get processed.
> +	/* since bufptr64 is a signed value, this should be an arithmetic shift */
> +	int64_t ret = buf->bufptr64 >> RTE_DISTRIB_FLAG_BITS;
> +	return (struct rte_mbuf *)((uintptr_t)ret);
> +}
> +
> +int
> +rte_distributor_return_pkt(struct rte_distributor *d,
> +		unsigned worker_id, struct rte_mbuf *oldpkt)
> +{
Maybe some optional sanity checking, here and above, to ensure that a packet
returned through get_pkt doesn't also get returned here, mangling the flags
field?
><snip>
> +
> +/* flush the distributor, so that there are no outstanding packets in flight or
> + * queued up. */
> +int
> +rte_distributor_flush(struct rte_distributor *d)
> +{
You need to document that this function can only be called by the same thread
that is running rte_distributor_process, lest you corrupt your queue data.
Alternatively, it might be nicer to modify this function's internals to set a
flag in the distributor status bits to make the process routine do the flush
work when it gets set.  that would allow this function to be called by any
other thread, which seems like a more natural interface.
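e.g. something like this rough sketch - all names here are invented, and
real code would use rte_pause() and the distributor's own status word:

/* any thread may request a flush; the single thread running
 * rte_distributor_process() performs it on its next pass */
#define DIST_FLAG_FLUSH_REQ 1

struct dist_status {
	volatile int flags;
};

static void
dist_request_flush(struct dist_status *st)
{
	__sync_fetch_and_or(&st->flags, DIST_FLAG_FLUSH_REQ);
	while (st->flags & DIST_FLAG_FLUSH_REQ)
		;	/* rte_pause() here in real code */
}

static void
dist_process_pass(struct dist_status *st)
{
	if (st->flags & DIST_FLAG_FLUSH_REQ) {
		/* ... drain in-flight and backlogged packets here ... */
		__sync_fetch_and_and(&st->flags, ~DIST_FLAG_FLUSH_REQ);
	}
}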
><snip>
> +}
> +
> +/* clears the internal returns array in the distributor */
> +void
> +rte_distributor_clear_returns(struct rte_distributor *d)
> +{
This can also only be called by the same thread that runs the process routine,
lest the start and count values get mis-assigned.
> +	d->returns.start = d->returns.count = 0;
> +#ifndef __OPTIMIZE__
> +	memset(d->returns.mbufs, 0, sizeof(d->returns.mbufs));
> +#endif
> +}
> +
> +/* creates a distributor instance */
> +struct rte_distributor *
> +rte_distributor_create(const char *name,
> +		unsigned socket_id,
> +		unsigned num_workers,
> +		struct rte_distributor_extra_args *args __rte_unused)
> +{
> +	struct rte_distributor *d;
> +	struct rte_distributor_list *distributor_list;
> +	char mz_name[RTE_MEMZONE_NAMESIZE];
> +	const struct rte_memzone *mz;
> +
> +	/* compilation-time checks */
> +	RTE_BUILD_BUG_ON((sizeof(*d) & CACHE_LINE_MASK) != 0);
> +	RTE_BUILD_BUG_ON((RTE_MAX_LCORE & 7) != 0);
> +
> +	if (name == NULL || num_workers >= RTE_MAX_LCORE) {
> +		rte_errno = EINVAL;
> +		return NULL;
> +	}
> +	rte_snprintf(mz_name, sizeof(mz_name), RTE_DISTRIB_PREFIX"%s", name);
> +	mz = rte_memzone_reserve(mz_name, sizeof(*d), socket_id, NO_FLAGS);
> +	if (mz == NULL) {
> +		rte_errno = ENOMEM;
> +		return NULL;
> +	}
> +
> +	/* check that we have an initialised tail queue */
> +	if ((distributor_list = RTE_TAILQ_LOOKUP_BY_IDX(RTE_TAILQ_DISTRIBUTOR,
> +			rte_distributor_list)) == NULL) {
> +		rte_errno = E_RTE_NO_TAILQ;
> +		return NULL;
> +	}
> +
> +	d = mz->addr;
> +	rte_snprintf(d->name, sizeof(d->name), "%s", name);
> +	d->num_workers = num_workers;
> +	TAILQ_INSERT_TAIL(distributor_list, d, next);
You need locking around this list unless you intend to assert that distributor
creation and destruction must only be performed from a single thread.  Also,
where is the API method to tear down a distributor instance?
><snip>
> +#endif
> +
> +#include <rte_mbuf.h>
> +
> +#define RTE_DISTRIBUTOR_NAMESIZE 32 /**< Length of name for instance */
> +
> +struct rte_distributor;
> +
> +struct rte_distributor_extra_args { }; /**< reserved for future use*/
> +
You don't need to reserve a struct name for future use.  No one will use it
until you create it.
> +struct rte_mbuf *
> +rte_distributor_get_pkt(struct rte_distributor *d,
> +		unsigned worker_id, struct rte_mbuf *oldpkt, unsigned reserved);
> +
Don't need to reserve an extra argument here.  You're not ABI safe currently,
and if DPDK becomes ABI safe in the future, we will use a linker script to
provide versions with backward compatibility easily enough.
Neil
^ permalink raw reply	[relevance 4%]
* Re: [dpdk-dev] Heads up: Fedora packaging plans
  2014-05-19 10:11  0% ` Thomas Monjalon
@ 2014-05-19 13:18  0%   ` Neil Horman
  0 siblings, 0 replies; 86+ results
From: Neil Horman @ 2014-05-19 13:18 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev
On Mon, May 19, 2014 at 12:11:35PM +0200, Thomas Monjalon wrote:
> Hi Neil,
> 
> Thanks for sharing your progress.
> 
No worries.
> My main concerns are about naming and extensions.
> We must keep "dpdk-core" naming in order to distinguish it from PMD 
> extensions.
I don't see why.  We can name packages whatever we want, as long as the spec and
srpm share the same name. It seems to me that the core should be the base name
of the package while the extensions should have some extension on their name.
> And then, packaging of memnic and non-uio paravirtualization PMDs 
> (virtio/vmxnet3) are missing.
> 
They're in separate repositories; I was planning on packaging them at a later
time, since their versioning and development are handled separately.
> 2014-05-13 15:08, Neil Horman:
> > My current effort to do so.  I've made some changes from the stock spec file
> > included in dpdk:
> 
> We should try to get .spec for Fedora and in-tree .spec as common as possible.
> There are probably some things to push.
> 
Ok, sure, just keep in mind that different distributions have different
packaging requirements that may affect the contents of the spec file, and so
attaining parity may not be possible (or even worthwhile).
> > * Modified the version and release values to be separate from the name.  I
> > did some reading on requirements for packaging and it seems we can be a bit
> > more lax with ABI version on a pre-release I think, so I set up the N-V-R to
> > use pre-release conventions, which makes sense, given that this is a 1.7.0
> > pre-release.  The git tag on the release value will get bumped as we move
> > forward in the patch series.
> 
> I thought that we should put version in the name, in order to be able to 
> install many versions together. How is it handled by yum?
> 
So, I spent some time thinking about this, and I _really_ want to avoid the
inclusion of a version with the package name.  Doing so, while it allows yum to
install multiple versions side-by-side, is a real overhead for me, as it
requires that I go through a new package review process for each released
version that we want to package.  I do not have time to do that.  If someone
from 6wind or intel wants to get involved in the packaging process we can look
at that as a solution, but while I'm doing it, it's really just too much
overhead.  This method will allow multiple versions to be installed side by side
as well.  The tradeoff is that yum doesn't directly allow that, as it will just
perform an upgrade.  The multiple version solution will require that you
download older versions and install them directly using rpm commands.  I think
that's a fair tradeoff.
> > * Added config files to match desired configs for Fedora (i.e. disabled
> > PMD's that require out-of-tree kernel modules)
> 
> It would be clearer to make your configuration changes with "sed -i".
> In a near future we would probably need a "configure" script to do it.
> 
I really disagree.  It's not clearer in my mind at all - in that the final config
file is a product of two pieces of information (the base config file, and the
sed scripts that you run on it), as opposed to one piece (the canonical modified
config specified in the source line).  Using sed also implies that you need to
list sed as a BuildRequires (minimal buildroots may not include sed when they
are spun up).
> So you don't package igb_uio but you build it because there is no option to 
> disable it currently. We should add such option.
> 
Not sure what you mean here.  The only uio code I see in the package is the uio
unbind script for igb, which should still work just fine (save for the fact that
we don't have a user space PMD to attach the hardware to).  I can certainly
remove the script though so it doesn't appear in the package until such time as
the LAD group integrates the uio code in the upstream driver.
> > * Moved the package target directories to include N-V of the package in the
> > path names.  This allows for multiple versions of the dpdk to be installed
> > in parallel (I.e. dpdk-1.7.0 files are in /lib/dpdk-1.7.0,
> > /usr/include/dpdk-1.7.0, etc).  This is how java packages allow for
> > multiple version installs, and makes sense given ABI instability in dpdk. 
> > It will require that developers add some -I / -L paths to their makefiles
> > to pull the proper version, but I think that's a fair tradeoff.
> 
> I don't see version for include directory and bin directory (testpmd).
> 
Yup, need to fix that.  Thank you!
Neil
> Thanks
> -- 
> Thomas
> 
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] Heads up: Fedora packaging plans
  2014-05-13 19:08  4% [dpdk-dev] Heads up: Fedora packaging plans Neil Horman
@ 2014-05-19 10:11  0% ` Thomas Monjalon
  2014-05-19 13:18  0%   ` Neil Horman
  0 siblings, 1 reply; 86+ results
From: Thomas Monjalon @ 2014-05-19 10:11 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev
Hi Neil,
Thanks for sharing your progress.
My main concerns are about naming and extensions.
We must keep "dpdk-core" naming in order to distinguish it from PMD 
extensions. And then, packaging of memnic and non-uio paravirtualization PMDs 
(virtio/vmxnet3) are missing.
2014-05-13 15:08, Neil Horman:
> My current effort to do so.  I've made some changes from the stock spec file
> included in dpdk:
We should try to get .spec for Fedora and in-tree .spec as common as possible.
There are probably some things to push.
> * Modified the version and release values to be separate from the name.  I
> did some reading on requirements for packaging and it seems we can be a bit
> more lax with ABI version on a pre-release I think, so I set up the N-V-R to
> use pre-release conventions, which makes sense, given that this is a 1.7.0
> pre-release.  The git tag on the release value will get bumped as we move
> forward in the patch series.
I thought that we should put version in the name, in order to be able to 
install many versions together. How is it handled by yum?
> * Added config files to match desired configs for Fedora (i.e. disabled
> PMD's that require out-of-tree kernel modules)
It would be clearer to make your configuration changes with "sed -i".
In a near future we would probably need a "configure" script to do it.
So you don't package igb_uio but you build it because there is no option to 
disable it currently. We should add such option.
> * Moved the package target directories to include N-V of the package in the
> path names.  This allows for multiple versions of the dpdk to be installed
> in parallel (I.e. dpdk-1.7.0 files are in /lib/dpdk-1.7.0,
> /usr/include/dpdk-1.7.0, etc).  This is how java packages allow for
> multiple version installs, and makes sense given ABI instability in dpdk. 
> It will require that developers add some -I / -L paths to their makefiles
> to pull the proper version, but I think that's a fair tradeoff.
I don't see version for include directory and bin directory (testpmd).
Thanks
-- 
Thomas
^ permalink raw reply	[relevance 0%]
* [dpdk-dev] Heads up: Fedora packaging plans
@ 2014-05-13 19:08  4% Neil Horman
  2014-05-19 10:11  0% ` Thomas Monjalon
  0 siblings, 1 reply; 86+ results
From: Neil Horman @ 2014-05-13 19:08 UTC (permalink / raw)
  To: dev
Hey all-
	This isn't really germane to dpdk development, but Thomas and Vincent,
you expressed interest in my progress regarding packaging of dpdk for Fedora, so
I figured I would post here in case others were interested.
Please find here:
http://people.redhat.com/nhorman/dpdk-1.7.0-0.1.gitb20539d68.src.rpm
My current effort to do so.  I've made some changes from the stock spec file
included in dpdk:
* Modified the version and release values to be separate from the name.  I did
some reading on requirements for packaging and it seems we can be a bit more lax
with ABI version on a pre-release I think, so I set up the N-V-R to use
pre-release conventions, which makes sense, given that this is a 1.7.0
pre-release.  The git tag on the release value will get bumped as we move forward
in the patch series.
* Added config files to match desired configs for Fedora (i.e. disabled PMD's
that require out-of-tree kernel modules)
* Removed Packager tag (Fedora doesn't use those)
* Moved the package target directories to include N-V of the package in the path
names.  This allows for multiple versions of the dpdk to be installed in
parallel (I.e. dpdk-1.7.0 files are in /lib/dpdk-1.7.0, /usr/include/dpdk-1.7.0,
etc).  This is how java packages allow for multiple version installs, and makes
sense given ABI instability in dpdk.  It will require that developers add some
-I / -L paths to their makefiles to pull the proper version, but I think that's a
fair tradeoff.
My plan is to go through the review process with this package, and update to
tagged 1.7.0 as soon as it's ready.
Neil
 
^ permalink raw reply	[relevance 4%]
* Re: [dpdk-dev] [PATCH v2 0/4] recipes for RPM packages
  2014-05-01 13:14  0% ` Neil Horman
@ 2014-05-01 21:15  0%   ` Thomas Monjalon
  0 siblings, 0 replies; 86+ results
From: Thomas Monjalon @ 2014-05-01 21:15 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev
2014-05-01 09:14, Neil Horman:
> On Wed, Apr 30, 2014 at 02:46:41AM +0200, Thomas Monjalon wrote:
> > The goal of this patch series is to be able to package DPDK
> > for RPM-based distributions.
> > 
> > The file naming currently doesn't allow to install different DPDK
> > versions.
> > But the packaging naming should be ready to manage different DPDK versions
> > 
> > having different API/ABI for different applications:
> > 	- dpdk-core has full version in its name to manage API breaking
> > 	- extensions have a number as name suffix to manage PMD API breaking.
> > 
> > When API/ABI will be stable, package names could be simpler.
> > 
> > I suggest to add these .spec files as a starting point for integration
> > in Linux distributions.
> > 
> > Changes since v1:
> > 	- name of .spec file match package name
> > 	- version in package name
> > 	- no static library
> > 	- ldconfig/depmod in scriptlets
> > 
> > Thanks for your comments/reviews.
> 
> I understand that this is holding up the 1.6.0r2 release, as well as the
> 1.7.0 integration.  As such, given that my concerns, while valid IMO,
> aren't required for the release:
> 
> Acked-by: Neil Horman <nhorman@tuxdriver.com>
Applied for dpdk-1.6.0r2, memnic-1.1, vmxnet3-usermap 1.2
and virtio-net-pmd-1.2.
Thanks Neil and other RedHat people for helping in this first step.
-- 
Thomas
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCH v2 0/4] recipes for RPM packages
  2014-04-30  0:46  4% [dpdk-dev] [PATCH v2 0/4] recipes for RPM packages Thomas Monjalon
  2014-04-30  0:46  4% ` [dpdk-dev] [PATCH v2 1/4] pkg: add recipe for RPM Thomas Monjalon
  2014-04-30 10:52  0% ` [dpdk-dev] [PATCH v2 0/4] recipes for RPM packages Neil Horman
@ 2014-05-01 13:14  0% ` Neil Horman
  2014-05-01 21:15  0%   ` Thomas Monjalon
  2 siblings, 1 reply; 86+ results
From: Neil Horman @ 2014-05-01 13:14 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev
On Wed, Apr 30, 2014 at 02:46:41AM +0200, Thomas Monjalon wrote:
> The goal of this patch series is to be able to package DPDK
> for RPM-based distributions.
> 
> The file naming currently doesn't allow to install different DPDK versions.
> But the packaging naming should be ready to manage different DPDK versions
> having different API/ABI for different applications:
> 	- dpdk-core has full version in its name to manage API breaking
> 	- extensions have a number as name suffix to manage PMD API breaking.
> When API/ABI will be stable, package names could be simpler.
> 
> I suggest to add these .spec files as a starting point for integration
> in Linux distributions.
> 
> Changes since v1:
> 	- name of .spec file match package name
> 	- version in package name
> 	- no static library
> 	- ldconfig/depmod in scriptlets
> 
> Thanks for your comments/reviews.
> -- 
> Thomas
> 
I understand that this is holding up the 1.6.0r2 release, as well as the 1.7.0
integration.  As such, given that my concerns, while valid IMO, aren't required
for the release:
Acked-by: Neil Horman <nhorman@tuxdriver.com>
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCH v2 0/4] recipes for RPM packages
  2014-04-30  0:46  4% [dpdk-dev] [PATCH v2 0/4] recipes for RPM packages Thomas Monjalon
  2014-04-30  0:46  4% ` [dpdk-dev] [PATCH v2 1/4] pkg: add recipe for RPM Thomas Monjalon
@ 2014-04-30 10:52  0% ` Neil Horman
  2014-05-01 13:14  0% ` Neil Horman
  2 siblings, 0 replies; 86+ results
From: Neil Horman @ 2014-04-30 10:52 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev
On Wed, Apr 30, 2014 at 02:46:41AM +0200, Thomas Monjalon wrote:
> The goal of this patch serie is to be able to package DPDK
> for RPM-based distributions.
> 
> The file naming currently doesn't allow to install different DPDK versions.
> But the packaging naming should be ready to manage different DPDK versions
> having different API/ABI for different applications:
> 	- dpdk-core has full version in its name to manage API breaking
> 	- extensions have a number as name suffix to manage PMD API breaking.
> When API/ABI will be stable, package names could be simpler.
> 
> I suggest to add these .spec files as a starting point for integration
> in Linux distributions.
> 
> Changes since v1:
> 	- name of .spec file match package name
> 	- version in package name
> 	- no static library
> 	- ldconfig/depmod in scriplets
> 
> Thanks for your comments/reviews.
> -- 
> Thomas
> 
You should merge these into a single spec file so that you only have to build
once.  That way you only need to adjust the version information in the spec
file once, and the built packages all get the same versioning.
Neil
^ permalink raw reply	[relevance 0%]
* [dpdk-dev] [PATCH v2 1/4] pkg: add recipe for RPM
  2014-04-30  0:46  4% [dpdk-dev] [PATCH v2 0/4] recipes for RPM packages Thomas Monjalon
@ 2014-04-30  0:46  4% ` Thomas Monjalon
  2014-04-30 10:52  0% ` [dpdk-dev] [PATCH v2 0/4] recipes for RPM packages Neil Horman
  2014-05-01 13:14  0% ` Neil Horman
  2 siblings, 0 replies; 86+ results
From: Thomas Monjalon @ 2014-04-30  0:46 UTC (permalink / raw)
  To: dev
Packages can be built with:
	RPM_BUILD_NCPUS=8 rpmbuild -ta dpdk-1.6.0r1.tar.gz
There are packages for runtime and development.
Once devel package is installed, it can be used like this:
	make -C /usr/share/dpdk/examples/helloworld RTE_SDK=/usr/share/dpdk
Signed-off-by: Thomas Monjalon <thomas.monjalon@6wind.com>
---
 pkg/dpdk-core.spec | 129 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 129 insertions(+)
 create mode 100644 pkg/dpdk-core.spec
diff --git a/pkg/dpdk-core.spec b/pkg/dpdk-core.spec
new file mode 100644
index 0000000..77d6c76
--- /dev/null
+++ b/pkg/dpdk-core.spec
@@ -0,0 +1,129 @@
+# Copyright 2014 6WIND S.A.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#
+# - Redistributions of source code must retain the above copyright
+#   notice, this list of conditions and the following disclaimer.
+#
+# - Redistributions in binary form must reproduce the above copyright
+#   notice, this list of conditions and the following disclaimer in
+#   the documentation and/or other materials provided with the
+#   distribution.
+#
+# - Neither the name of 6WIND S.A. nor the names of its
+#   contributors may be used to endorse or promote products derived
+#   from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+# FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
+# COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
+# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+# HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+# STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
+# OF THE POSSIBILITY OF SUCH DAMAGE.
+
+# name includes full version because there is no ABI stability yet
+Name: dpdk-core-1.6.0
+Version: r1
+%define fullversion 1.6.0%{version}
+Release: 1
+Packager: packaging@6wind.com
+URL: http://dpdk.org
+Source: http://dpdk.org/browse/dpdk/snapshot/dpdk-%{fullversion}.tar.gz
+
+Summary: Intel(r) Data Plane Development Kit core
+Group: System Environment/Libraries
+License: BSD and LGPLv2 and GPLv2
+
+ExclusiveArch: i686, x86_64
+%define target %{_arch}-default-linuxapp-gcc
+%define machine default
+
+BuildRequires: kernel-devel, kernel-headers, doxygen
+
+%description
+Intel(r) DPDK core includes kernel modules, core libraries and tools.
+testpmd application allows to test fast packet processing environments
+on x86 platforms. For instance, it can be used to check that environment
+can support fast path applications such as 6WINDGate, pktgen, rumptcpip, etc.
+More libraries are available as extensions in other packages.
+
+%package devel
+Summary: Intel(r) Data Plane Development Kit core for development
+%description devel
+Intel(r) DPDK core-devel is a set of makefiles, headers, examples and documentation
+for fast packet processing on x86 platforms.
+More libraries are available as extensions in other packages.
+
+%define destdir %{buildroot}%{_prefix}
+%define moddir  /lib/modules/%(uname -r)/extra
+%define datadir %{_datadir}/dpdk
+%define docdir  %{_docdir}/dpdk
+
+%prep
+%setup -qn dpdk-%{fullversion}
+
+%build
+make O=%{target} T=%{target} config
+sed -ri 's,(RTE_MACHINE=).*,\1%{machine},' %{target}/.config
+sed -ri 's,(RTE_APP_TEST=).*,\1n,'         %{target}/.config
+sed -ri 's,(RTE_BUILD_SHARED_LIB=).*,\1y,' %{target}/.config
+make O=%{target} %{?_smp_mflags}
+make O=%{target} doc
+
+%install
+rm -rf %{buildroot}
+make           O=%{target}     DESTDIR=%{destdir}
+mkdir -p                               %{buildroot}%{moddir}
+mv    %{destdir}/%{target}/kmod/*.ko   %{buildroot}%{moddir}
+rmdir %{destdir}/%{target}/kmod
+mkdir -p                               %{buildroot}%{_sbindir}
+ln -s %{datadir}/tools/igb_uio_bind.py %{buildroot}%{_sbindir}/igb_uio_bind
+mkdir -p                               %{buildroot}%{_bindir}
+mv    %{destdir}/%{target}/app/testpmd %{buildroot}%{_bindir}
+rmdir %{destdir}/%{target}/app
+mv    %{destdir}/%{target}/include     %{buildroot}%{_includedir}
+mv    %{destdir}/%{target}/lib         %{buildroot}%{_libdir}
+mkdir -p                               %{buildroot}%{docdir}
+mv    %{destdir}/%{target}/doc/*       %{buildroot}%{docdir}
+rmdir %{destdir}/%{target}/doc
+mkdir -p                               %{buildroot}%{datadir}
+mv    %{destdir}/%{target}/.config     %{buildroot}%{datadir}/config
+mv    %{destdir}/%{target}             %{buildroot}%{datadir}
+mv    %{destdir}/mk                    %{buildroot}%{datadir}
+cp -a            examples              %{buildroot}%{datadir}
+cp -a            tools                 %{buildroot}%{datadir}
+ln -s            %{datadir}/config     %{buildroot}%{datadir}/%{target}/.config
+ln -s            %{_includedir}        %{buildroot}%{datadir}/%{target}/include
+ln -s            %{_libdir}            %{buildroot}%{datadir}/%{target}/lib
+
+%files
+%dir %{datadir}
+%{datadir}/config
+%{datadir}/tools
+%{moddir}/*
+%{_sbindir}/*
+%{_bindir}/*
+%{_libdir}/*
+
+%files devel
+%{_includedir}/*
+%{datadir}/mk
+%{datadir}/%{target}
+%{datadir}/examples
+%doc %{docdir}
+
+%post
+/sbin/ldconfig
+/sbin/depmod
+
+%postun
+/sbin/ldconfig
+/sbin/depmod
-- 
1.9.2
^ permalink raw reply	[relevance 4%]
* [dpdk-dev] [PATCH v2 0/4] recipes for RPM packages
@ 2014-04-30  0:46  4% Thomas Monjalon
  2014-04-30  0:46  4% ` [dpdk-dev] [PATCH v2 1/4] pkg: add recipe for RPM Thomas Monjalon
                   ` (2 more replies)
  0 siblings, 3 replies; 86+ results
From: Thomas Monjalon @ 2014-04-30  0:46 UTC (permalink / raw)
  To: dev
The goal of this patch series is to be able to package DPDK
for RPM-based distributions.
The file naming currently doesn't allow to install different DPDK versions.
But the packaging naming should be ready to manage different DPDK versions
having different API/ABI for different applications:
	- dpdk-core has full version in its name to manage API breaking
	- extensions have a number as name suffix to manage PMD API breaking.
When API/ABI will be stable, package names could be simpler.
I suggest to add these .spec files as a starting point for integration
in Linux distributions.
Changes since v1:
	- name of .spec file match package name
	- version in package name
	- no static library
	- ldconfig/depmod in scriptlets
Thanks for your comments/reviews.
-- 
Thomas
^ permalink raw reply	[relevance 4%]
* Re: [dpdk-dev] [PATCH 0/19] Separate compile time linkage between eal lib and pmd's
  @ 2014-04-15 13:46  3%     ` Neil Horman
  0 siblings, 0 replies; 86+ results
From: Neil Horman @ 2014-04-15 13:46 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev
On Tue, Apr 15, 2014 at 10:31:25AM +0200, Thomas Monjalon wrote:
> 2014-04-12 07:04, Neil Horman:
> > On Thu, Apr 10, 2014 at 04:47:07PM -0400, Neil Horman wrote:
> > > Disconnect compile time linkage between eal library / applications and
> > > pmd's
> > > 
> > > I noticed that, while tinkering with dpdk, building for shared libraries
> > > still resulted in all the test applications linking to all the built
> > > pmd's, despite not actually needing them all.  We are able to tell an
> > > application at run time (via the -d/--blacklist/--whitelist/--vdev
> > > options) which pmd's we want to use, and so have no need to link them at
> > > all. The only reason they get pulled in is because
> > > rte_eal_non_pci_init_etherdev and rte_pmd_init_all contain static lists
> > > to the individual pmd init functions. The result is that, even when
> > > building as DSO's, we have to load all the pmd libraries, which is space
> > > inefficient and defeats some of the purpose of shared objects.
> > > 
> > > To correct this, I developed this patch series, which introduces two new
> > > macros, PMD_INIT_NONPCI and PMD_INIT.  These two macros use constructors
> > > to register their init routines at runtime, either prior to the execution
> > > of main() when linked statically, or when dlopen is called on a DSO at
> > > run time.  The result is that PMD's can be loaded at run time without the
> > > application or eal library having to hold a reference to them.  They work
> > > in a very similar fashion to the module_init routine in the linux
> > > kernel.
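(For reference, constructor-based registration works roughly like this - a
minimal sketch, not the actual PMD_INIT implementation:)

struct pmd_driver {
	const char *name;
	int (*init)(const char *args);
	struct pmd_driver *next;
};

static struct pmd_driver *pmd_driver_list;

static void
pmd_register(struct pmd_driver *drv)
{
	drv->next = pmd_driver_list;
	pmd_driver_list = drv;
}

/* runs before main() when linked statically, or at dlopen() time when the
 * PMD is built as a DSO */
#define PMD_REGISTER(drv) \
	static void __attribute__((constructor)) pmdinit_##drv(void) \
	{ pmd_register(&drv); }

/* hypothetical example driver */
static int pcap_devinit(const char *args) { (void)args; return 0; }
static struct pmd_driver pcap_drv = { "eth_pcap", pcap_devinit, NULL };
PMD_REGISTER(pcap_drv)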
> > > 
> > > I've tested this feature using the igb and pcap pmd's, both statically and
> > > dynamically linked with the test and testpmd sample applications, and it
> > > seems to work well.
> > > 
> > > Note, I encountered  a few bugs along the way, which I fixed and noted in
> > > the series.
> > > 
> > > Regards
> > > Neil
> > 
> > Self NAK on this, based on the conversation Thomas and I had about Oliviers
> > patches from a while back, I'm going to rebase and repost these soon.
> > Neil
> 
> I'll be glad to get your fixes soon. So I could apply them for version 1.6.0r2 
> and release it.
> But I think you should post API changes (if any) in another series. Then we'll 
> think if we want to push it in another branch for next major version.
> 
I presume at this point you're fairly close to tagging
1.6.0r2, which, based on what I see in the git tree, is usually the last rc
before you merge to the next major version.  Do you want to put this in now,
before that happens, or will you commit to the first 1.7.0 rc?  If the latter,
that seems like the best time to make ABI changes, so you maximize testing.
Neil
> Thanks Neil
> -- 
> Thomas
> 
^ permalink raw reply	[relevance 3%]
* Re: [dpdk-dev] DPDK API/ABI Stability
  2014-04-09 21:08  4% ` Stephen Hemminger
@ 2014-04-10 10:54  7%   ` Neil Horman
  0 siblings, 0 replies; 86+ results
From: Neil Horman @ 2014-04-10 10:54 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev
On Wed, Apr 09, 2014 at 02:08:49PM -0700, Stephen Hemminger wrote:
> On Wed, 9 Apr 2014 14:39:52 -0400
> Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > Hey all-
> > 	I was going to include this as an addendum to the packaging thread on
> > this list, but I can't seem to find it in my inbox, so forgive me starting a new
> > one.
> > 
> > 	I wanted to broach the subject of ABI/API stability on the list here.
> > Given the recent great efforts to make dpdk packageable by distributions, I think
> > we probably need to discuss API stability in more depth and come up with a plan
> > to implement it.  Has anyone started looking into this?  If not, it seems to me
> > to be reasonable to start by placing a line in the sand with the functions
> > documented here:
> > 
> > http://dpdk.org/doc/api/
> > 
> > It seems to me we can start reviewing the API library by library, ensuring only
> > those functions are exported, making sure the data types are appropriate for
> > export, and marking them with a linker script to version them appropriately.
> 
> To what level? source? binary, internal functions?
> 
Well, I was thinking both (hence the API/ABI comment above), but at least API
stability as a start.  Stabilizing internal functions doesn't make any sense to
me since, by definition, those aren't exposed to users trying to make use of the
library.
> Some of the APIs could be stabilized without much impact, but others, such
> as the device driver interface, are incomplete, and freezing them would make
> life hard.
But the driver interface isn't listed in the API documentation above.  Clearly
we'd need to address that eventually, but as a start it can likely be ignored;
at least then we can give applications a modicum of stability.
Neil
^ permalink raw reply	[relevance 7%]
* Re: [dpdk-dev] DPDK API/ABI Stability
  2014-04-09 18:39  7% [dpdk-dev] DPDK API/ABI Stability Neil Horman
@ 2014-04-09 21:08  4% ` Stephen Hemminger
  2014-04-10 10:54  7%   ` Neil Horman
  0 siblings, 1 reply; 86+ results
From: Stephen Hemminger @ 2014-04-09 21:08 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev
On Wed, 9 Apr 2014 14:39:52 -0400
Neil Horman <nhorman@tuxdriver.com> wrote:
> Hey all-
> 	I was going to include this as an addendum to the packaging thread on
> this list, but I can't seem to find it in my inbox, so forgive me starting a new
> one.
> 
> 	I wanted to broach the subject of ABI/API stability on the list here.
> Given the recent great efforts to make dpdk packageable by distributions, I think
> we probably need to discuss API stability in more depth and come up with a plan
> to implement it.  Has anyone started looking into this?  If not, it seems to me
> to be reasonable to start by placing a line in the sand with the functions
> documented here:
> 
> http://dpdk.org/doc/api/
> 
> It seems to me we can start reviewing the API library by library, ensuring only
> those functions are exported, making sure the data types are appropriate for
> export, and marking them with a linker script to version them appropriately.
To what level? source? binary, internal functions?
Some of the APIs could be stabilized without much impact, but others, such
as the device driver interface, are incomplete, and freezing them would make
life hard.
^ permalink raw reply	[relevance 4%]
* [dpdk-dev] DPDK API/ABI Stability
@ 2014-04-09 18:39  7% Neil Horman
  2014-04-09 21:08  4% ` Stephen Hemminger
  0 siblings, 1 reply; 86+ results
From: Neil Horman @ 2014-04-09 18:39 UTC (permalink / raw)
  To: dev
Hey all-
	I was going to include this as an addendum to the packaging thread on
this list, but I can't seem to find it in my inbox, so forgive me starting a new
one.
	I wanted to broach the subject of ABI/API stability on the list here.
Given the recent great efforts to make dpdk packageable by distributions, I think
we probably need to discuss API stability in more depth and come up with a plan
to implement it.  Has anyone started looking into this?  If not, it seems to me
to be reasonable to start by placing a line in the sand with the functions
documented here:
http://dpdk.org/doc/api/
It seems to me we can start reviewing the API library by library, ensuring only
those functions are exported, making sure the data types are appropriate for
export, and marking them with a linker script to version them appropriately.
Thoughts?
Neil
^ permalink raw reply	[relevance 7%]
* Re: [dpdk-dev] [PATCH 03/16] pkg: add recipe for RPM
  2014-04-02  9:53  3%     ` Thomas Monjalon
@ 2014-04-02 11:29  0%       ` Neil Horman
  0 siblings, 0 replies; 86+ results
From: Neil Horman @ 2014-04-02 11:29 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev
On Wed, Apr 02, 2014 at 11:53:51AM +0200, Thomas Monjalon wrote:
> Hello,
> 
> Sorry for the long delay.
> 
> 2014-02-26 14:07, Thomas Graf:
> > On 02/04/2014 04:54 PM, Thomas Monjalon wrote:
> > > +Version: 1.5.2r1
> > > +Release: 1
> > 
> > What kind of upgrade strategy do you have in mind?
> > 
> > I'm raising this because Fedora and other distributions will require
> > a unique package name for every version of the package that is not
> > backwards compatible.
> > 
> > Typically libraries provide backwards compatibility within a major release,
> > i.e. all 1.x.x releases would be compatible. I realize that this might
> > not be applicable yet but maybe 1.5.x?
> > 
> > Depending on the versioning schema the name would be dpdk15, dpdk16, ...
> > or dpdk152, dpdk153, ...
> 
> We are working on this but at the moment there is no restriction on API/ABI 
breakage. So I think it's too early to define such a rule.
>  
Now that you have DSO builds in place, there's no reason not to take the extra
step of versioning your APIs, making backwards compatibility fairly
straightforward.  Monolithic builds are still somewhat problematic regarding API
stability, but you could certainly offer stability in the DSOs.
> > > +BuildRequires: kernel-devel, kernel-headers, doxygen
> > 
> > Is a python environment required as well?
> 
Python is only needed to run some tools on the target. But it is optional.
> Do you think it should be written somewhere?
> 
> > > +%description
> > > +Dummy main package. Make only subpackages.
> > 
> > I would just call the main package "libdpdk152" so you don't have to
> > repeat the encoding versioning in all the subpackages.
> > 
> > > +
> > > +%package core-runtime
> > 
> > What about calling it just "libdpdk"?
> 
The version number should be left out of the library name, whatever you do.
Packaging can be responsible for versioning.
> In this case, it should be libdpdk-core in order to distinguish it from dpdk 
> extensions. But the name of the project is dpdk so it seems simpler to call it 
> dpdk-core.
> Is the "lib" prefix mandatory for libraries?
> 
Not strictly, but IIRC if you don't add the lib, the linker won't find it with
the -l option, so you'll want to add it.
> > > +%files core-runtime
> > > +%dir %{datadir}
> > > +%{datadir}/config
> > > +%{datadir}/tools
> > > +%{moddir}/*
> > > +%{_sbindir}/*
> > > +%{_bindir}/*
> > > +%{_libdir}/*.so
> > 
> > This brings up the question of multiple parallel DPDK installations.
> > A specific application linking to library version X will also require
> > tools of version X, right? A second application linking against version
> > Y will require tools version Y. Right now, these could not be installed
> > in parallel. Any chance we can make the runtime version independent?
> 
> Are you thinking about installing different major versions? In my 
> understanding, we cannot install 2 different minor versions of a package.
> As long as there is no stable API, there are no major versions defined.
> So don't you think we should speak about it later?
> 
If the versioning is done properly (i.e. shared libraries get version ids
attached to the library files), you can install as many library
versions as you like.  You can only install a single -devel package, since it
links lib<name>.so to a specific version.
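For illustration, a correctly versioned install could then look like this
(paths and version numbers are hypothetical):

/usr/lib64/libdpdk.so.1.5.2    <- dpdk 1.5 runtime package
/usr/lib64/libdpdk.so.1.6.0    <- dpdk 1.6 runtime package, coexists
/usr/lib64/libdpdk.so -> libdpdk.so.1.6.0    <- from the single -devel package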
> > Same applies to header files. A good option here would be to install
> > them to /usr/include/libdpdk{version}/ and have a dpdk-1.5.2.pc which
> > provides Cflags: -I${includedir}/libdpdk${version}
> 
> Yes same applies :)
> I agree that a .pc file would be a good idea. But we also must allow building
> with the DPDK framework.
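A sketch of what such a .pc file could contain, with hypothetical paths and
names:

# dpdk-1.5.2.pc
prefix=/usr
includedir=${prefix}/include
libdir=${prefix}/lib64

Name: dpdk
Description: Data Plane Development Kit
Version: 1.5.2
Cflags: -I${includedir}/libdpdk-1.5.2
Libs: -L${libdir} -ldpdk

pkg-config --cflags dpdk-1.5.2 would then hand an application the versioned
include path.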
> 
> > You'll also need for all packages and subpackages installing shared
> > libraries:
> > 
> > %post -p /sbin/ldconfig
> > %postun -p /sbin/ldconfig
> 
> OK
> 
> Thanks for the review
> -- 
> Thomas
> 
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCH 03/16] pkg: add recipe for RPM
  @ 2014-04-02  9:53  3%     ` Thomas Monjalon
  2014-04-02 11:29  0%       ` Neil Horman
  0 siblings, 1 reply; 86+ results
From: Thomas Monjalon @ 2014-04-02  9:53 UTC (permalink / raw)
  To: Thomas Graf; +Cc: dev
Hello,
Sorry for the long delay.
2014-02-26 14:07, Thomas Graf:
> On 02/04/2014 04:54 PM, Thomas Monjalon wrote:
> > +Version: 1.5.2r1
> > +Release: 1
> 
> What kind of upgrade strategy do you have in mind?
> 
> I'm raising this because Fedora and other distributions will require
> a unique package name for every version of the package that is not
> backwards compatible.
> 
> Typically libraries provide backwards compatibility within a major release,
> i.e. all 1.x.x releases would be compatible. I realize that this might
> not be applicable yet but maybe 1.5.x?
> 
> Depending on the versioning schema the name would be dpdk15, dpdk16, ...
> or dpdk152, dpdk153, ...
We are working on this but at the moment there is no restriction on API/ABI 
breakage. So I think it's too early to define such a rule.
 
> > +BuildRequires: kernel-devel, kernel-headers, doxygen
> 
> Is a python environment required as well?
Python is only needed to run some tools on the target. But it is optional.
Do you think it should be written somewhere?
> > +%description
> > +Dummy main package. Make only subpackages.
> 
> I would just call the main package "libdpdk152" so you don't have to
> repeat encoding the version in all the subpackages.
> 
> > +
> > +%package core-runtime
> 
> What about calling it just "libdpdk"?
In this case, it should be libdpdk-core in order to distinguish it from dpdk 
extensions. But the name of the project is dpdk so it seems simpler to call it 
dpdk-core.
Is the "lib" prefix mandatory for libraries?
> > +%files core-runtime
> > +%dir %{datadir}
> > +%{datadir}/config
> > +%{datadir}/tools
> > +%{moddir}/*
> > +%{_sbindir}/*
> > +%{_bindir}/*
> > +%{_libdir}/*.so
> 
> This brings up the question of multiple parallel DPDK installations.
> A specific application linking to library version X will also require
> tools of version X, right? A second application linking against version
> Y will require tools version Y. Right now, these could not be installed
> in parallel. Any chance we can make the runtime version independent?
Are you thinking about installing different major versions? In my 
understanding, we cannot install 2 different minor versions of a package.
As long as there is no stable API, there are no major versions defined.
So don't you think we should speak about it later?
> Same applies to header files. A good option here would be to install
> them to /usr/include/libdpdk{version}/ and have a dpdk-1.5.2.pc which
> provides Cflags: -I${includedir}/libdpdk${version}
Yes same applies :)
I agree that a .pc file would be a good idea. But we also must allow building
with the DPDK framework.
> You'll also need for all packages and subpackages installing shared
> libraries:
> 
> %post -p /sbin/ldconfig
> %postun -p /sbin/ldconfig
OK
Thanks for the review
-- 
Thomas
^ permalink raw reply	[relevance 3%]
* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
  2014-01-29 20:47  0%               ` François-Frédéric Ozog
  2014-01-29 23:15  3%                 ` Thomas Graf
@ 2014-03-13  7:37  0%                 ` David Nyström
  1 sibling, 0 replies; 86+ results
From: David Nyström @ 2014-03-13  7:37 UTC (permalink / raw)
  To: François-Frédéric Ozog, 'Thomas Graf',
	'Vincent JARDIN'
  Cc: dev, dev, dpdk-ovs
On 2014-01-29 21:47, François-Frédéric Ozog wrote:
>>> First and easy answer: it is open source, so anyone can recompile. So,
>>> what's the issue?
>>
>>> I'm talking from a pure distribution perspective here: Requiring a
>>> recompile of all DPDK-based applications to distribute a bugfix or to add
>>> support for a new PMD is not ideal.
>
>>
>> So ideally OVS would have the possibility to link against the shared
>> library long term.
>
> I agree that distribution of DPDK apps is not covered properly at present.
> Identifying the proper scheme requires a specific analysis based on the
> constraints of the Telecom/Cloud/Networking markets.
>
> In the telecom world, if you fix the underlying framework of an app, you
> will still have to validate the solution, ie app/framework. In addition, the
> idea of shared libraries introduces the implied requirement to validate apps
> against diverse versions of DPDK shared libraries. This translates into
> development and support costs.
>
> I also expect many DPDK applications to tackle core networking features,
> with sub-microsecond packet handling delays, even lower than 200ns
> (NAT64...). The lazy binding based on ELF PLT represents quite a cost, not
> to mention that optimization stops at shared library boundaries (gcc
> whole program optimization can be very effective...). Microsoft DLL linkage
> is an order of magnitude faster. If Linux were to provide that, I would
> probably revise my judgment. (I haven't checked Linux dynamic linking
> implementation for some time so my understanding of Linux dynamic linking
> may be outdated).
>
>
>>
>>> I get lost: do you mean ABI + API toward the PMDs or towards the
>>> applications using the librte ?
>>
>> Towards the PMDs is more straightforward at first so it seems logical to
>> focus on that first.
>
> I don't think it is so straightforward. Many recent cards such as Chelsio
> and Myricom have a very different "packet memory layout" that does not fit
> so easily into the actual DPDK architecture.
>
> 1) "traditional" architecture: the driver reserves X buffers and provides the
> card with descriptors of those buffers. Each packet is DMA'ed into exactly
> one buffer. Typically you have 2K buffers, a 64 byte packet consumes exactly
> one buffer
>
> 2) "alternative" new architecture: the driver reserves a memory zone, say
> 4MB, without any structure, and provides a single zone description and a
> ring buffer to the card. (there are no individual buffer descriptors any more).
> The card fills the memory zone with packets, one next to the other and
> specifies where the packets are by updating the supplied ring. Out of the
> many issues fitting this scheme into DPDK, you cannot free a single mbuf:
> you have to maintain a ref count to the memory zone so that, when all mbufs
> have been "released", the memory zone can be freed.
> That's quite a stretch from the actual paradigm.
>
> Apart from this aspect, managing RSS is too tied to Intel's flow director
> concepts and cannot directly accommodate smarter or dumber RSS mechanisms.
>
> That said, I fully agree the PMD API should be revisited.
Hi,
Sorry for jumping in late.
Perhaps you are already aware of OpenDataPlane, which can use DPDK as
its southbound NIC interface.
>
> Cordially,
>
> François-Frédéric
>
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
  2014-01-29 20:47  0%               ` François-Frédéric Ozog
@ 2014-01-29 23:15  3%                 ` Thomas Graf
  2014-03-13  7:37  0%                 ` David Nyström
  1 sibling, 0 replies; 86+ results
From: Thomas Graf @ 2014-01-29 23:15 UTC (permalink / raw)
  To: François-Frédéric Ozog, 'Vincent JARDIN'
  Cc: dev, dev, 'Gerald Rogers', dpdk-ovs
On 01/29/2014 09:47 PM, François-Frédéric Ozog wrote:
> In the telecom world, if you fix the underlying framework of an app, you
> will still have to validate the solution, i.e. app/framework. In addition, the
> idea of shared libraries introduces the implied requirement to validate apps
> against diverse versions of DPDK shared libraries. This translates into
> development and support costs.
>
> I also expect many DPDK applications to tackle core networking features,
> with sub-microsecond packet handling delays, even lower than 200ns
> (NAT64...). The lazy binding based on ELF PLT represents quite a cost, not
> to mention that optimization stops at shared library boundaries (gcc
> whole program optimization can be very effective...). Microsoft DLL linkage
> is an order of magnitude faster. If Linux were to provide that, I would
> probably revise my judgment. (I haven't checked Linux dynamic linking
> implementation for some time so my understanding of Linux dynamic linking
> may be outdated).
All very valid points and I am not suggesting we stop offering the
static linking option in any way. Dynamic linking will by design result
in more cycles. My sole point is that for a core platform component
like OVS, the shared library benefits _might_ outweigh the performance
difference. In order for a shared library to be effective, some form of
ABI compatibility must be guaranteed though.
> I don't think it is so straightforward. Many recent cards such as Chelsio
> and Myricom have a very different "packet memory layout" that does not fit
> so easily into the actual DPDK architecture.
>
> 1) "traditional" architecture: the driver reserves X buffers and provides the
> card with descriptors of those buffers. Each packet is DMA'ed into exactly
> one buffer. Typically you have 2K buffers, a 64 byte packet consumes exactly
> one buffer
>
> 2) "alternative" new architecture: the driver reserves a memory zone, say
> 4MB, without any structure, and provides a single zone description and a
> ring buffer to the card. (there are no individual buffer descriptors any more).
> The card fills the memory zone with packets, one next to the other and
> specifies where the packets are by updating the supplied ring. Out of the
> many issues fitting this scheme into DPDK, you cannot free a single mbuf:
> you have to maintain a ref count to the memory zone so that, when all mbufs
> have been "released", the memory zone can be freed.
> That's quite a stretch from the actual paradigm.
>
> Apart from this aspect, managing RSS is too tied to Intel's flow director
> concepts and cannot directly accommodate smarter or dumber RSS mechanisms.
>
> That said, I fully agree the PMD API should be revisited.
Fair enough. I don't see a reason why multiple interfaces could not
coexist in order to support multiple memory layouts. What I'm hearing
so far is that while there is no objection to bringing stability to the
APIs, it should not result in performance side effects, and it is still too
early to nail down the yet-fluid APIs.
^ permalink raw reply	[relevance 3%]
* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
  2014-01-29 17:14  3%             ` Thomas Graf
  2014-01-29 18:42  4%               ` Stephen Hemminger
@ 2014-01-29 20:47  0%               ` François-Frédéric Ozog
  2014-01-29 23:15  3%                 ` Thomas Graf
  2014-03-13  7:37  0%                 ` David Nyström
  1 sibling, 2 replies; 86+ results
From: François-Frédéric Ozog @ 2014-01-29 20:47 UTC (permalink / raw)
  To: 'Thomas Graf', 'Vincent JARDIN'
  Cc: dev, dev, 'Gerald Rogers', dpdk-ovs
> > First and easy answer: it is open source, so anyone can recompile. So,
> > what's the issue?
> 
> I'm talking from a pure distribution perspective here: Requiring a
> recompile of all DPDK-based applications to distribute a bugfix or to add
> support for a new PMD is not ideal.
> 
> So ideally OVS would have the possibility to link against the shared
> library long term.
I agree that distribution of DPDK apps is not covered properly at present.
Identifying the proper scheme requires a specific analysis based on the
constraints of the Telecom/Cloud/Networking markets.
In the telecom world, if you fix the underlying framework of an app, you
will still have to validate the solution, i.e. app/framework. In addition, the
idea of shared libraries introduces the implied requirement to validate apps
against diverse versions of DPDK shared libraries. This translates into
development and support costs.
I also expect many DPDK applications to tackle core networking features,
with sub-microsecond packet handling delays, even lower than 200ns
(NAT64...). The lazy binding based on ELF PLT represents quite a cost, not
to mention that optimization stops at shared library boundaries (gcc
whole program optimization can be very effective...). Microsoft DLL linkage
is an order of magnitude faster. If Linux were to provide that, I would
probably revise my judgment. (I haven't checked Linux dynamic linking
implementation for some time so my understanding of Linux dynamic linking
may be outdated).
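For what it is worth, the lazy part of that cost can be turned off at link
or load time, though the PLT indirection itself remains; a sketch assuming
the GNU toolchain (app.c and -ldpdk are placeholder names):

$ cc -O3 app.c -ldpdk -Wl,-z,now    # resolve all PLT entries at startup
$ LD_BIND_NOW=1 ./app               # or force resolution at load time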
> 
> > I get lost: do you mean ABI + API toward the PMDs or towards the
> > applications using the librte ?
> 
> Towards the PMDs is more straightforward at first so it seems logical to
> focus on that first.
I don't think it is so straightforward. Many recent cards such as Chelsio
and Myricom have a very different "packet memory layout" that does not fit
so easily into the actual DPDK architecture.
1) "traditional" architecture: the driver reserves X buffers and provides the
card with descriptors of those buffers. Each packet is DMA'ed into exactly
one buffer. Typically you have 2K buffers, a 64 byte packet consumes exactly
one buffer
2) "alternative" new architecture: the driver reserves a memory zone, say
4MB, without any structure, and provides a single zone description and a
ring buffer to the card. (there are no individual buffer descriptors any more).
The card fills the memory zone with packets, one next to the other and
specifies where the packets are by updating the supplied ring. Out of the
many issues fitting this scheme into DPDK, you cannot free a single mbuf:
you have to maintain a ref count to the memory zone so that, when all mbufs
have been "released", the memory zone can be freed.
That's quite a stretch from the actual paradigm.
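A minimal sketch of the refcounting such a zone needs (the names are
hypothetical, and a real implementation would use atomic operations):

#include <stdlib.h>

struct pkt_zone {
	void  *base;    /* start of the contiguous DMA region */
	size_t len;
	int    refcnt;  /* one reference per packet handed out */
};

/* called when a single packet carved from the zone is released */
static void zone_pkt_free(struct pkt_zone *z)
{
	if (--z->refcnt == 0) {
		/* last outstanding packet: the whole zone can go */
		free(z->base);
		free(z);
	}
}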
Apart from this aspect, managing RSS is too tied to Intel's flow director
concepts and cannot directly accommodate smarter or dumber RSS mechanisms.
That said, I fully agree the PMD API should be revisited.
Cordially,
François-Frédéric
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
  2014-01-29 17:14  3%             ` Thomas Graf
@ 2014-01-29 18:42  4%               ` Stephen Hemminger
  2014-01-29 20:47  0%               ` François-Frédéric Ozog
  1 sibling, 0 replies; 86+ results
From: Stephen Hemminger @ 2014-01-29 18:42 UTC (permalink / raw)
  To: Thomas Graf; +Cc: dev, dev, Gerald Rogers, dpdk-ovs
On Wed, 29 Jan 2014 18:14:01 +0100
Thomas Graf <tgraf@redhat.com> wrote:
> On 01/29/2014 05:34 PM, Vincent JARDIN wrote:
> > Thomas,
> >
> > First and easy answer: it is open source, so anyone can recompile. So,
> > what's the issue?
> 
> I'm talking from a pure distribution perspective here: Requiring a
> recompile of all DPDK-based applications to distribute a bugfix or to
> add support for a new PMD is not ideal.
> 
> So ideally OVS would have the possibility to link against the shared
> library long term.
> 
> > I get lost: do you mean ABI + API toward the PMDs or towards the
> > applications using the librte ?
> 
> Towards the PMDs is more straightforward at first so it seems logical
> to focus on that first.
> 
> A stable API and ABI for librte seems required as well in the long term, as
> DPDK does offer shared libraries, but I realize that this is a stretch
> goal in the initial phase.
I would hate to see the API/ABI nailed down. We have lots of bug fixes
and new drivers that are ready to contribute, but most of them have some
changes to the existing ABI.
^ permalink raw reply	[relevance 4%]
* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
  2014-01-29 16:34  3%           ` Vincent JARDIN
@ 2014-01-29 17:14  3%             ` Thomas Graf
  2014-01-29 18:42  4%               ` Stephen Hemminger
  2014-01-29 20:47  0%               ` François-Frédéric Ozog
  0 siblings, 2 replies; 86+ results
From: Thomas Graf @ 2014-01-29 17:14 UTC (permalink / raw)
  To: Vincent JARDIN; +Cc: dev, dev, Gerald Rogers, dpdk-ovs
On 01/29/2014 05:34 PM, Vincent JARDIN wrote:
> Thomas,
>
> First and easy answer: it is open source, so anyone can recompile. So,
> what's the issue?
I'm talking from a pure distribution perspective here: Requiring a
recompile of all DPDK-based applications to distribute a bugfix or to
add support for a new PMD is not ideal.
So ideally OVS would have the possibility to link against the shared
library long term.
> I get lost: do you mean ABI + API toward the PMDs or towards the
> applications using the librte ?
Towards the PMDs is more straightforward at first so it seems logical
to focus on that first.
A stable API and ABI for librte seems required as well in the long term, as
DPDK does offer shared libraries, but I realize that this is a stretch
goal in the initial phase.
^ permalink raw reply	[relevance 3%]
* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
  2014-01-29 11:14  4%         ` Thomas Graf
@ 2014-01-29 16:34  3%           ` Vincent JARDIN
  2014-01-29 17:14  3%             ` Thomas Graf
  0 siblings, 1 reply; 86+ results
From: Vincent JARDIN @ 2014-01-29 16:34 UTC (permalink / raw)
  To: Thomas Graf; +Cc: dev, dev, Gerald Rogers, dpdk-ovs
Thomas,
First and easy answer: it is open source, so anyone can recompile. So, 
what's the issue?
> Without a concept of stable interfaces, it will be difficult to
> package and distribute RTE libraries, PMD, and DPDK applications. Right
> now, the obvious path would include packaging the PMD bits together
> with each DPDK application depending on the version of DPDK the binary
> was compiled against. This is clearly not ideal.
>
>> I agree that some areas could be improved since they are not in the
>> critical datapath of packets, but still other areas remain very CPU
>> constrained. For instance:
>> http://dpdk.org/browse/dpdk/commit/lib/librte_ether/rte_ethdev.h?id=c3d0564cf0f00c3c9a61cf72bd4bd1c441740637
>>
>> is bad:
>>     struct eth_dev_ops
>> is churned, no comment, and an #ifdef that changes the structure
>> according to compilation!
>
> This is a very good example as it outlines the difference between
> control structures and the fast path. We have this same exact trade off
> in the kernel a lot where we have highly optimized internal APIs
> towards modules and drivers but want to provide binary compatibility to
> a certain extent.
As long as we agree on this limited scope, we'll think about it and 
provide a proposal on the dev@dpdk.org mailing list.
> As for the specific example you mention, it is relatively trivial to
> make eth_dev_ops backwards compatible by appending appropriate padding
> to the struct before a new major release and ensuring that new members
> are added by replacing the padding accordingly. Obviously no ifdefs
> would be allowed anymore.
Of course, it is basic C!
>> Should an application use the librte libraries of the DPDK:
>>    - you can use RTE_VERSION and RTE_VERSION_NUM :
>> http://dpdk.org/doc/api/rte__version_8h.html#a8775053b0f721b9fa0457494cfbb7ed9
>
> Right. This would be more or less identical to requiring a specific
> DPDK version in OVS_CHECK_DPDK. It's not ideal to require applications to
> clutter their code with #ifdefs all over for every new minor release
> though.
>
>>    - you can write your own wrapper (with CPU overhead) in order to have
>> a stable ABI; that wrapper should be tied to the versions of the librte
>> => the overhead is part of your application instead of the DPDK,
>>    - *otherwise recompile your software, it is open source, what's the
>> issue?*
>>
>> We are open to any suggestion to have a stable ABI, but it should never
>> remove the options to have fast/efficient compilation/CPU execution
>> processing.
>
> Absolutely agreed. We also don't want to add tons of abstraction and
> overcomplicate everything. Still, I strongly believe that the definition
> of stable interfaces towards applications and especially PMD is
> essential.
>
> I'm not proposing to standardize all the APIs towards applications on
> the level of POSIX. DPDK is in early stages and disruptive changes will
> come along. What I would propose on an abstract level is:
>
> 1. Extend but not break API between minor releases. Postpone API
>     breakages to the next major release. High cadence of major
>     releases initially, lower cadence as DPDK matures.
>
> 2. Define ABI stability towards PMD for minor releases to allow
>     isolated packaging of PMD by padding control structures and keeping
>     functions ABI stable.
I get lost: do you mean ABI + API toward the PMDs or towards the 
applications using the librte ?
Best regards,
   Vincent
^ permalink raw reply	[relevance 3%]
* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
  2014-01-29 10:26  5%       ` Vincent JARDIN
@ 2014-01-29 11:14  4%         ` Thomas Graf
  2014-01-29 16:34  3%           ` Vincent JARDIN
  0 siblings, 1 reply; 86+ results
From: Thomas Graf @ 2014-01-29 11:14 UTC (permalink / raw)
  To: Vincent JARDIN; +Cc: dev, dev, Gerald Rogers, dpdk-ovs
Vincent,
On 01/29/2014 11:26 AM, Vincent JARDIN wrote:
> DPDK's ABIs are not Kernel's ABIs, they are not POSIX, there is no
> standard. Currently, there is no such plan to have a stable ABI since we
> need to keep freedom to chase CPU cycles over having a stable ABI. For
> instance, some applications on top of the DPDK process the packets in
> less than 150 CPU cycles (have a look at testpmd:
>    http://dpdk.org/browse/dpdk/tree/app/test-pmd )
I understand the requirement to not introduce overhead with wrappers
or shim layers. No problem with that. I believe this is mainly a policy
and release process issue.
Without a concept of stable interfaces, it will be difficult to
package and distribute RTE libraries, PMD, and DPDK applications. Right
now, the obvious path would include packaging the PMD bits together
with each DPDK application depending on the version of DPDK the binary
was compiled against. This is clearly not ideal.
> I agree that some areas could be improved since they are not in the
> critical datapath of packets, but still other areas remain very CPU
> constrained. For instance:
> http://dpdk.org/browse/dpdk/commit/lib/librte_ether/rte_ethdev.h?id=c3d0564cf0f00c3c9a61cf72bd4bd1c441740637
>
> is bad:
>     struct eth_dev_ops
> is churned, no comment, and an #ifdef that changes the structure
> according to compilation!
This is a very good example as it outlines the difference between
control structures and the fast path. We have this same exact trade off
in the kernel a lot where we have highly optimized internal APIs
towards modules and drivers but want to provide binary compatibility to
a certain extent.
As for the specific example you mention, it is relatively trivial to
make eth_dev_ops backwards compatible by appending appropriate padding
to the struct before a new major release and ensuring that new members
are added by replacing the padding accordingly. Obviously no ifdefs
would be allowed anymore.
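A sketch of the padding idea; this is only the shape of the technique, not
the real struct eth_dev_ops:

struct dev_ops {
	int (*dev_start)(void *dev);
	int (*dev_stop)(void *dev);
	/* spare slots reserved before release: a later minor release can
	 * turn one into a new callback without changing the structure's
	 * size or the offsets of the existing members */
	void (*reserved[8])(void);
};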
> Should an application use the librte libraries of the DPDK:
>    - you can use RTE_VERSION and RTE_VERSION_NUM :
> http://dpdk.org/doc/api/rte__version_8h.html#a8775053b0f721b9fa0457494cfbb7ed9
Right. This would be more or less identical to requiring a specific
DPDK version in OVS_CHECK_DPDK. It's not ideal to require applications to
clutter their code with #ifdefs all over for every new minor release
though.
>    - you can write your own wrapper (with CPU overhead) in order to have
> a stable ABI; that wrapper should be tied to the versions of the librte
> => the overhead is part of your application instead of the DPDK,
>    - *otherwise recompile your software, it is open source, what's the
> issue?*
>
> We are open to any suggestion to have a stable ABI, but it should never
> remove the options to have fast/efficient compilation/CPU execution
> processing.
Absolutely agreed. We also don't want to add tons of abstraction and
overcomplicate everything. Still, I strongly believe that the definition
of stable interfaces towards applications and especially PMD is
essential.
I'm not proposing to standardize all the APIs towards applications on
the level of POSIX. DPDK is in early stages and disruptive changes will
come along. What I would propose on an abstract level is:
1. Extend but not break API between minor releases. Postpone API
    breakages to the next major release. High cadence of major
    releases initially, lower cadence as DPDK matures.
2. Define ABI stability towards PMD for minor releases to allow
    isolated packaging of PMD by padding control structures and keeping
    functions ABI stable.
I realize that this might be less trivial than it seems without
sacrificing performance but I consider it effort well spent.
Thomas
^ permalink raw reply	[relevance 4%]
* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
  2014-01-29  8:15  3%     ` Thomas Graf
@ 2014-01-29 10:26  5%       ` Vincent JARDIN
  2014-01-29 11:14  4%         ` Thomas Graf
  0 siblings, 1 reply; 86+ results
From: Vincent JARDIN @ 2014-01-29 10:26 UTC (permalink / raw)
  To: Thomas Graf; +Cc: dev, dev, Gerald Rogers, dpdk-ovs
Hi Thomas,
On 29/01/2014 09:15, Thomas Graf wrote:
 > The obvious and usual best practice would be for DPDK to guarantee
 > ABI stability between minor releases.
 >
 > Since dpdk-dev is copied as well, any comments?
DPDK's ABIs are not Kernel's ABIs, they are not POSIX, there is no 
standard. Currently, there is no such plan to have a stable ABI since we 
need to keep freedom to chase CPU cycles over having a stable ABI. For 
instance, some applications on top of the DPDK process the packets in 
less than 150 CPU cycles (have a look at testpmd:
   http://dpdk.org/browse/dpdk/tree/app/test-pmd )
I agree that some areas could be improved since they are not in the
critical datapath of packets, but still other areas remain very CPU
constrained. For instance:
http://dpdk.org/browse/dpdk/commit/lib/librte_ether/rte_ethdev.h?id=c3d0564cf0f00c3c9a61cf72bd4bd1c441740637
is bad:
    struct eth_dev_ops
is churned, no comment, and an #ifdef that changes the structure
according to compilation!
Should an application use the librte libraries of the DPDK:
   - you can use RTE_VERSION and RTE_VERSION_NUM (see the sketch after this list):
http://dpdk.org/doc/api/rte__version_8h.html#a8775053b0f721b9fa0457494cfbb7ed9
   - you can write your own wrapper (with CPU overhead) in order to have 
a stable ABI; that wrapper should be tied to the versions of the librte
=> the overhead is part of your application instead of the DPDK,
   - *otherwise recompile your software, it is open source, what's the
issue?*
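A sketch of that version check, using the RTE_VERSION macros named above
(the 1.5.2 threshold is only an example):

#include <rte_version.h>

#if RTE_VERSION >= RTE_VERSION_NUM(1, 5, 2, 0)
	/* use the newer librte interface */
#else
	/* fall back to the older one */
#endif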
We are open to any suggestion to have a stable ABI, but it should never
remove the options to have fast/efficient compilation/CPU execution
processing.
Best regards,
   Vincent
^ permalink raw reply	[relevance 5%]
* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
  2014-01-28 18:17  0%   ` [dpdk-dev] [ovs-dev] " Pravin Shelar
@ 2014-01-29  8:15  3%     ` Thomas Graf
  2014-01-29 10:26  5%       ` Vincent JARDIN
  0 siblings, 1 reply; 86+ results
From: Thomas Graf @ 2014-01-29  8:15 UTC (permalink / raw)
  To: Pravin Shelar; +Cc: dev, dev, Gerald Rogers, dpdk-ovs
On 01/28/2014 07:17 PM, Pravin Shelar wrote:
> Right, version mismatch will not work. APIs provided by DPDK are not
> stable, so OVS has to be built for different releases for now.
>
> I do not see how we can fix it from the OVS side. DPDK needs to
> standardize its API; actually, OVS also needs more APIs, like DPDK
> initialization, mempool destroy, etc.
Agreed. It's not fixable from the OVS side. I also don't want to
object to including this. I'm just raising awareness of the issue
as this will become essential for distribution.
The obvious and usual best practice would be for DPDK to guarantee
ABI stability between minor releases.
Since dpdk-dev is copied as well, any comments?
^ permalink raw reply	[relevance 3%]
* Re: [dpdk-dev] [ovs-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports.
       [not found]     ` <52E7D13B.9020404@redhat.com>
@ 2014-01-28 18:17  0%   ` Pravin Shelar
  2014-01-29  8:15  3%     ` Thomas Graf
  0 siblings, 1 reply; 86+ results
From: Pravin Shelar @ 2014-01-28 18:17 UTC (permalink / raw)
  To: Thomas Graf; +Cc: dev, dev, Gerald Rogers, dpdk-ovs
On Tue, Jan 28, 2014 at 7:48 AM, Thomas Graf <tgraf@redhat.com> wrote:
> On 01/28/2014 02:48 AM, pshelar@nicira.com wrote:
>>
>> From: Pravin B Shelar <pshelar@nicira.com>
>>
>> Following patch adds DPDK netdev-class to userspace datapath.
>> Approach taken in this patch differs from Intel® DPDK vSwitch
>> where DPDK datapath switching is done in saparate process.  This
>> patch adds support for DPDK type port and uses OVS userspace
>> datapath for switching.  Therefore all DPDK processing and flow
>> miss handling is done in single process.  This also avoids code
>> duplication by reusing OVS userspace datapath switching and
>> therefore it supports all flow matching and actions that
>> user-space datapath supports.  Refer to INSTALL.DPDK doc for
>> further info.
>>
>> With this patch I got similar performance for netperf TCP_STREAM
>> tests compared to kernel datapath.
>
>
> I'm happy to see this happen!
>
>
>
>> +static const struct rte_eth_conf port_conf = {
>> +        .rxmode = {
>> +                .mq_mode = ETH_MQ_RX_RSS,
>> +                .split_hdr_size = 0,
>> +                .header_split   = 0, /* Header Split disabled */
>> +                .hw_ip_checksum = 0, /* IP checksum offload disabled */
>> +                .hw_vlan_filter = 0, /* VLAN filtering disabled */
>> +                .jumbo_frame    = 0, /* Jumbo Frame Support disabled */
>> +                .hw_strip_crc   = 0, /* CRC not stripped by hardware */
>> +        },
>> +        .rx_adv_conf = {
>> +                .rss_conf = {
>> +                        .rss_key = NULL,
>> +                        .rss_hf = ETH_RSS_IPV4_TCP | ETH_RSS_IPV4 |
>> ETH_RSS_IPV6,
>> +                },
>> +        },
>> +        .txmode = {
>> +                .mq_mode = ETH_MQ_TX_NONE,
>> +        },
>> +};
>
>
> I realize this is an RFC patch but I will ask anyway:
>
> What are the plans on managing runtime dependencies of a DPDK enabled OVS
> and DPDK itself? Will an OVS built against DPDK 1.5.2 work with
> drivers written for 1.5.3?
>
> Based on the above use of struct rte_eth_conf it would seem that once
> released, rte_eth_conf cannot be extended anymore without breaking
> ABI compatibility. The same applies to many of the other user
> structures. I see various structure changes between minor releases,
> for example dpdk.org ed2c69c3ef7 between 1.5.1 and 1.5.2.
>
Right, version mismatch will not work. APIs provided by DPDK are not
stable, so OVS has to be built for different releases for now.
I do not see how we can fix it from the OVS side. DPDK needs to
standardize its API; actually, OVS also needs more APIs, like DPDK
initialization, mempool destroy, etc.
^ permalink raw reply	[relevance 0%]
* Re: [dpdk-dev] [PATCH 4/7] eal: support different modules
  @ 2013-06-03 17:25  3%       ` Stephen Hemminger
  0 siblings, 0 replies; 86+ results
From: Stephen Hemminger @ 2013-06-03 17:25 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev
On Mon, 3 Jun 2013 18:29:02 +0200
Thomas Monjalon <thomas.monjalon@6wind.com> wrote:
> 03/06/2013 18:08, Antti Kantee :
> > On 03.06.2013 10:58, Damien Millescamps wrote:
> > >> -/** Device needs igb_uio kernel module */
> > >> -#define RTE_PCI_DRV_NEED_IGB_UIO 0x0001
> > >> 
> > >>   /** Device driver must be registered several times until failure */
> > >> 
> > >> -#define RTE_PCI_DRV_MULTIPLE 0x0002
> > >> +#define RTE_PCI_DRV_MULTIPLE 0x0001
> > > 
> > > You are breaking a public API here, and I don't see any technical reason
> > > to do so. The RTE_PCI_DRV_NEED_IGB_UIO flag could be deprecated, but
> > > there is no way its value could be recycled into an already existing
> > > flag.
> > 
> > Is breaking the API a bad thing in this context?  IMHO the
> > initialization APIs need work before they're general enough and
> > perpetually supporting the current ones seems like an unnecessary
> > burden.  I'm trying to understand the general guidelines of the project.
> > 
> > (and nittily, recycling flag values is fine for static-only libs as long
> > as you remove the old macro, but of course removal is the API breakage
> > you mentioned)
> 
> Yes, DPDK is a young project but breaking the API should always be justified.
> In this case it is not mandatory to change it.
> 
This is a source project, there is no fixed ABI.
^ permalink raw reply	[relevance 3%]
-- links below jump to the message on this page --
2013-05-30 17:12     [dpdk-dev] [PATCH 0/7] Vyatta patches Stephen Hemminger
2013-06-03  8:58     ` [dpdk-dev] [PATCH 4/7] eal: support different modules Damien Millescamps
2013-06-03 16:08       ` Antti Kantee
2013-06-03 16:29         ` Thomas Monjalon
2013-06-03 17:25  3%       ` Stephen Hemminger
2014-01-28  1:48     [dpdk-dev] [PATCH RFC] dpif-netdev: Add support Intel DPDK based ports pshelar
     [not found]     ` <52E7D13B.9020404@redhat.com>
2014-01-28 18:17  0%   ` [dpdk-dev] [ovs-dev] " Pravin Shelar
2014-01-29  8:15  3%     ` Thomas Graf
2014-01-29 10:26  5%       ` Vincent JARDIN
2014-01-29 11:14  4%         ` Thomas Graf
2014-01-29 16:34  3%           ` Vincent JARDIN
2014-01-29 17:14  3%             ` Thomas Graf
2014-01-29 18:42  4%               ` Stephen Hemminger
2014-01-29 20:47  0%               ` François-Frédéric Ozog
2014-01-29 23:15  3%                 ` Thomas Graf
2014-03-13  7:37  0%                 ` David Nyström
2014-02-04 15:54     [dpdk-dev] [PATCH 00/16] recipes for RPM packages Thomas Monjalon
2014-02-04 15:54     ` [dpdk-dev] [PATCH 03/16] pkg: add recipe for RPM Thomas Monjalon
2014-02-26 13:07       ` Thomas Graf
2014-04-02  9:53  3%     ` Thomas Monjalon
2014-04-02 11:29  0%       ` Neil Horman
2014-04-09 18:39  7% [dpdk-dev] DPDK API/ABI Stability Neil Horman
2014-04-09 21:08  4% ` Stephen Hemminger
2014-04-10 10:54  7%   ` Neil Horman
2014-04-10 20:47     [dpdk-dev] [PATCH 0/19] Separate compile time linkage between eal lib and pmd's Neil Horman
2014-04-12 11:04     ` Neil Horman
2014-04-15  8:31       ` Thomas Monjalon
2014-04-15 13:46  3%     ` Neil Horman
2014-04-30  0:46  4% [dpdk-dev] [PATCH v2 0/4] recipes for RPM packages Thomas Monjalon
2014-04-30  0:46  4% ` [dpdk-dev] [PATCH v2 1/4] pkg: add recipe for RPM Thomas Monjalon
2014-04-30 10:52  0% ` [dpdk-dev] [PATCH v2 0/4] recipes for RPM packages Neil Horman
2014-05-01 13:14  0% ` Neil Horman
2014-05-01 21:15  0%   ` Thomas Monjalon
2014-05-13 19:08  4% [dpdk-dev] Heads up: Fedora packaging plans Neil Horman
2014-05-19 10:11  0% ` Thomas Monjalon
2014-05-19 13:18  0%   ` Neil Horman
2014-05-20 10:00     [dpdk-dev] [PATCH 0/4] New library: rte_distributor Bruce Richardson
2014-05-20 10:00     ` [dpdk-dev] [PATCH 2/4] distributor: new packet distributor library Bruce Richardson
2014-05-20 18:18  4%   ` Neil Horman
2014-05-21 10:21  3%     ` Richardson, Bruce
2014-05-21 15:23  3%       ` Neil Horman
2014-07-24 14:28 11% [dpdk-dev] [PATCH] kni: fixed compilation error on Ubuntu 14.04 LTS (kernel 3.13.0-30.54) Pablo de Lara
2014-07-24 14:54  0% ` Thomas Monjalon
2014-07-24 14:59  0%   ` Thomas Monjalon
2014-07-24 15:20  0% ` Chris Wright
2014-08-07 18:31     [dpdk-dev] [PATCHv2] librte_acl make it build/work for 'default' target Konstantin Ananyev
2014-08-07 20:11  4% ` Neil Horman
2014-08-07 20:58  0%   ` Vincent JARDIN
2014-08-08 11:49  0%   ` Ananyev, Konstantin
2014-08-08 12:25  4%     ` Neil Horman
2014-08-08 13:09  3%       ` Ananyev, Konstantin
2014-08-08 14:30  3%         ` Neil Horman
2014-08-11 22:23  0%           ` Thomas Monjalon
2014-08-21 20:15  1% ` [dpdk-dev] [PATCHv3] " Neil Horman
2014-08-25 16:30  0%   ` Ananyev, Konstantin
2014-08-26 17:44  0%     ` Neil Horman
2014-08-27 11:25  0%       ` Ananyev, Konstantin
2014-08-27 18:56  0%         ` Neil Horman
2014-08-27 19:18  0%           ` Ananyev, Konstantin
2014-08-28  9:02  0%             ` Richardson, Bruce
2014-08-28 15:55  0%             ` Neil Horman
2014-08-28 20:38  1% ` [dpdk-dev] [PATCHv4] " Neil Horman
2014-08-29 17:58  0%   ` Ananyev, Konstantin
2014-09-01 15:28  1% [dpdk-dev] [PATCHv5] " Konstantin Ananyev
2014-09-02 13:43  0% ` Neil Horman
2014-09-03  1:29  0%   ` Thomas Monjalon
2014-09-15 19:23  4% [dpdk-dev] [PATCH 0/4] Add DSO symbol versioning to support backwards compatibility Neil Horman
2014-09-15 19:23  4% ` [dpdk-dev] [PATCH 1/4] compat: Add infrastructure to support symbol versioning Neil Horman
2014-09-23 10:39  0%   ` Sergio Gonzalez Monroy
2014-09-23 14:58  0%     ` Neil Horman
2014-09-23 16:29  0%       ` Sergio Gonzalez Monroy
2014-09-23 17:31  0%         ` Neil Horman
2014-09-25 18:52  4%   ` [dpdk-dev] [PATCH 1/4 v2] " Neil Horman
2014-09-26 14:16  0%     ` Sergio Gonzalez Monroy
2014-09-26 15:16  0%       ` Neil Horman
2014-09-26 15:33  0%         ` Sergio Gonzalez Monroy
2014-09-26 16:22  0%           ` Neil Horman
2014-09-26 19:19  0%             ` Neil Horman
2014-09-15 19:23     ` [dpdk-dev] [PATCH 2/4] Provide initial versioning for all DPDK libraries Neil Horman
2014-09-19  9:45  4%   ` Bruce Richardson
2014-09-19 10:22  0%     ` Neil Horman
2014-09-15 19:23  7% ` [dpdk-dev] [PATCH 3/4] Add library version extenstion Neil Horman
2014-09-15 19:23 23% ` [dpdk-dev] [PATCH 4/4] docs: Add ABI documentation Neil Horman
2014-09-18 18:23  0% ` [dpdk-dev] [PATCH 0/4] Add DSO symbol versioning to support backwards compatibility Thomas Monjalon
2014-09-18 19:14  4%   ` Neil Horman
2014-09-19  8:57  0%     ` Richardson, Bruce
2014-09-19 14:18  0%     ` Venkatesan, Venky
2014-09-19 17:45  4%       ` Neil Horman
2014-09-24 18:19  3%     ` Neil Horman
2014-09-26 10:41  0%       ` Thomas Monjalon
2014-09-26 14:45  5%         ` Neil Horman
2014-09-26 22:02  4%           ` Stephen Hemminger
2014-09-27  2:22  5%             ` Neil Horman
2014-09-18 10:34     [dpdk-dev] [PATCH 0/3] New Thread Safe Hash Library Pablo de Lara
2014-09-18 12:21  3% ` Neil Horman
2014-09-18 15:31  0%   ` De Lara Guarch, Pablo
2014-09-18 15:45  0%     ` Thomas Monjalon
2014-09-18 16:09  3%     ` Neil Horman
2014-09-25 12:56     [dpdk-dev] [PATCH v2] Change alarm cancel function to thread-safe: Michal Jastrzebski
2014-09-25 15:08     ` Neil Horman
2014-09-25 16:03       ` Ananyev, Konstantin
2014-09-25 17:23         ` Neil Horman
2014-09-25 23:24           ` Ananyev, Konstantin
2014-09-26 11:46             ` Neil Horman
2014-09-26 12:37               ` Wodkowski, PawelX
2014-09-26 13:40                 ` Neil Horman
2014-09-26 14:01                   ` Wodkowski, PawelX
2014-09-26 15:01  5%                 ` Neil Horman
2014-09-26 15:41  0%                   ` Ananyev, Konstantin
2014-09-26 16:21  3%                     ` Neil Horman
2014-09-26  6:33           ` Wodkowski, PawelX
2014-09-26 13:43  3%         ` Neil Horman