DPDK patches and discussions
 help / color / mirror / Atom feed
* [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
@ 2014-07-29 20:24 Neil Horman
  2014-07-29 20:24 ` [dpdk-dev] [PATCH 1/2] ixgbe: test sse4.2 support at runtime for vectorized receive operations Neil Horman
                   ` (4 more replies)
  0 siblings, 5 replies; 45+ messages in thread
From: Neil Horman @ 2014-07-29 20:24 UTC (permalink / raw)
  To: dev

Hey all-
        I've been trying to update the fedora dpdk package to support VFIO 
enabled drivers and ran into a problem in which ixgbe didn't compile because the 
rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the 
default config I have.  I tried to remedy this by replacing the intrinsics with 
the __builtin macros, but it was pointed out (correctly), that this doesn't work 
properly.  So this is my second attempt, which I actually like a bit better.  I 
noted that code that uses intrinsics (ixgbe and the acl library), don't need to 
have those instructions turned on build-wide.  Rather, we can just enable the 
instructions in the specific code we want to build with support for that, and 
test for instruction support dynamically at run time.  This allows me to build 
the dpdk for a generic platform, but in such a way that some optimizations can 
be used if the executing cpu supports them at run time.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Thomas Monjalon <thomas.monjalon@6wind.com>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [dpdk-dev] [PATCH 1/2] ixgbe: test sse4.2 support at runtime for vectorized receive operations
  2014-07-29 20:24 [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features Neil Horman
@ 2014-07-29 20:24 ` Neil Horman
  2014-07-29 20:24 ` [dpdk-dev] [PATCH 2/2] acl: Preform dynamic sse4.2 support check Neil Horman
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 45+ messages in thread
From: Neil Horman @ 2014-07-29 20:24 UTC (permalink / raw)
  To: dev

The ixgbe vector receive code uses the sse4.2 intrinsics directly.  Instead of
requiring that they be enabled at build time for the entire dpdk library, just
enable sse4.2 for the ixgbe_rxtx_vec.c file, and test for support at runtime
when using it

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Thomas Monjalon <thomas.monjalon@6wind.com>
---
 lib/librte_pmd_ixgbe/Makefile         | 6 ++++++
 lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c | 3 +++
 2 files changed, 9 insertions(+)

diff --git a/lib/librte_pmd_ixgbe/Makefile b/lib/librte_pmd_ixgbe/Makefile
index 00ccedb..c002239 100644
--- a/lib/librte_pmd_ixgbe/Makefile
+++ b/lib/librte_pmd_ixgbe/Makefile
@@ -39,6 +39,12 @@ LIB = librte_pmd_ixgbe.a
 CFLAGS += -O3
 CFLAGS += $(WERROR_FLAGS)
 
+#
+# the vectorized recieve functions need sse4.2 instruction 
+# intrinsics, make sure we emit them from the compiler
+#
+CFLAGS_ixgbe_rxtx_vec.o += -msse4.2
+
 ifeq ($(CC), icc)
 #
 # CFLAGS for icc
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
index 09e19a3..ae615bc 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
@@ -34,6 +34,7 @@
 #include <stdint.h>
 #include <rte_ethdev.h>
 #include <rte_malloc.h>
+#include <rte_cpuflags.h>
 
 #include "ixgbe_ethdev.h"
 #include "ixgbe_rxtx.h"
@@ -679,6 +680,8 @@ int ixgbe_rx_vec_condition_check(struct rte_eth_dev *dev)
 	struct rte_eth_rxmode *rxmode = &dev->data->dev_conf.rxmode;
 	struct rte_fdir_conf *fconf = &dev->data->dev_conf.fdir_conf;
 
+	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_2))
+		return -1;
 #ifndef RTE_IXGBE_RX_OLFLAGS_ENABLE
 	/* whithout rx ol_flags, no VP flag report */
 	if (rxmode->hw_vlan_strip != 0 ||
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [dpdk-dev] [PATCH 2/2] acl: Preform dynamic sse4.2 support check
  2014-07-29 20:24 [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features Neil Horman
  2014-07-29 20:24 ` [dpdk-dev] [PATCH 1/2] ixgbe: test sse4.2 support at runtime for vectorized receive operations Neil Horman
@ 2014-07-29 20:24 ` Neil Horman
  2014-07-30 12:07 ` [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features Ananyev, Konstantin
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 45+ messages in thread
From: Neil Horman @ 2014-07-29 20:24 UTC (permalink / raw)
  To: dev

The ACL library relies on sse4.2 intrinsics to operate, but we don't have to
enable sse4.2 in the entire build.  Instead enable it for the ACL library alone,
and use rte_acl_create as a choke point, at which we can test for sse4.2
support, and return NULL.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Thomas Monjalon <thomas.monjalon@6wind.com>
---
 lib/librte_acl/Makefile  | 2 +-
 lib/librte_acl/rte_acl.c | 9 +++++++++
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
index 4fe4593..3646439 100644
--- a/lib/librte_acl/Makefile
+++ b/lib/librte_acl/Makefile
@@ -35,7 +35,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
 LIB = librte_acl.a
 
 CFLAGS += -O3
-CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR)
+CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -msse4.2
 
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_ACL) += tb_mem.c
diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 7c288bd..14ea7e0 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -112,6 +112,15 @@ rte_acl_create(const struct rte_acl_param *param)
 	struct rte_acl_list *acl_list;
 	struct rte_tailq_entry *te;
 	char name[sizeof(ctx->name)];
+	static int acl_supported = -1;
+
+	if (acl_supported == -1)
+		acl_supported = rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_2);
+
+	if (!acl_supported) {
+		rte_errno = EOPNOTSUPP;
+		return NULL;
+	}
 
 	/* check that we have an initialised tail queue */
 	acl_list = RTE_TAILQ_LOOKUP_BY_IDX(RTE_TAILQ_ACL, rte_acl_list);
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-29 20:24 [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features Neil Horman
  2014-07-29 20:24 ` [dpdk-dev] [PATCH 1/2] ixgbe: test sse4.2 support at runtime for vectorized receive operations Neil Horman
  2014-07-29 20:24 ` [dpdk-dev] [PATCH 2/2] acl: Preform dynamic sse4.2 support check Neil Horman
@ 2014-07-30 12:07 ` Ananyev, Konstantin
  2014-07-30 13:01   ` Neil Horman
  2014-07-30 14:49 ` [dpdk-dev] [PATCH v2 " Neil Horman
  2014-07-30 18:59 ` [dpdk-dev] [PATCH " Bruce Richardson
  4 siblings, 1 reply; 45+ messages in thread
From: Ananyev, Konstantin @ 2014-07-30 12:07 UTC (permalink / raw)
  To: Neil Horman, dev



Hi Neil,
 
> Hey all-
>         I've been trying to update the fedora dpdk package to support VFIO
> enabled drivers and ran into a problem in which ixgbe didn't compile because the
> rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the
> default config I have.  I tried to remedy this by replacing the intrinsics with
> the __builtin macros, but it was pointed out (correctly), that this doesn't work
> properly.  So this is my second attempt, which I actually like a bit better.  I
> noted that code that uses intrinsics (ixgbe and the acl library), don't need to
> have those instructions turned on build-wide.  Rather, we can just enable the
> instructions in the specific code we want to build with support for that, and
> test for instruction support dynamically at run time.  This allows me to build
> the dpdk for a generic platform, but in such a way that some optimizations can
> be used if the executing cpu supports them at run time.
 
Indeed it looks much better to me too.
Just few nits from me:

1. > @@ -112,6 +112,15 @@ rte_acl_create(const struct rte_acl_param *param)
>  	struct rte_acl_list *acl_list;
>  	struct rte_tailq_entry *te;
>  	char name[sizeof(ctx->name)];
> +	static int acl_supported = -1;
> +
> +	if (acl_supported == -1)
> +		acl_supported = rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_2);

Do we really need acl_supported here?
It seems not a big deal to just always call  rte_cpu_get_flag_enabled().
After all it is a create function, and no-one expects it to be extremely fast.

2. Can you add RTE_LOG(ERR, ...) for re_acl_create() and ixgbe_rx_vec_condition_check() if sse4.2 is not supported?

Konstantin

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-30 12:07 ` [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features Ananyev, Konstantin
@ 2014-07-30 13:01   ` Neil Horman
  2014-07-30 13:44     ` Ananyev, Konstantin
  0 siblings, 1 reply; 45+ messages in thread
From: Neil Horman @ 2014-07-30 13:01 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev

On Wed, Jul 30, 2014 at 12:07:39PM +0000, Ananyev, Konstantin wrote:
> 
> 
> Hi Neil,
>  
> > Hey all-
> >         I've been trying to update the fedora dpdk package to support VFIO
> > enabled drivers and ran into a problem in which ixgbe didn't compile because the
> > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the
> > default config I have.  I tried to remedy this by replacing the intrinsics with
> > the __builtin macros, but it was pointed out (correctly), that this doesn't work
> > properly.  So this is my second attempt, which I actually like a bit better.  I
> > noted that code that uses intrinsics (ixgbe and the acl library), don't need to
> > have those instructions turned on build-wide.  Rather, we can just enable the
> > instructions in the specific code we want to build with support for that, and
> > test for instruction support dynamically at run time.  This allows me to build
> > the dpdk for a generic platform, but in such a way that some optimizations can
> > be used if the executing cpu supports them at run time.
>  
> Indeed it looks much better to me too.
> Just few nits from me:
> 
> 1. > @@ -112,6 +112,15 @@ rte_acl_create(const struct rte_acl_param *param)
> >  	struct rte_acl_list *acl_list;
> >  	struct rte_tailq_entry *te;
> >  	char name[sizeof(ctx->name)];
> > +	static int acl_supported = -1;
> > +
> > +	if (acl_supported == -1)
> > +		acl_supported = rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_2);
> 
> Do we really need acl_supported here?
> It seems not a big deal to just always call  rte_cpu_get_flag_enabled().
> After all it is a create function, and no-one expects it to be extremely fast.
> 
Need, no.  My only thought was that some poorly behaved application will call
rte_acl_create multiple times regardless of the error returned, and doing so
will cause large volumes of calls to cpuid, which evicts several high-use
registers, so I didn't want to call it more than needed.  If you think its ok to
call it multiple times though, I'm fine with removing it.

> 2. Can you add RTE_LOG(ERR, ...) for re_acl_create() and ixgbe_rx_vec_condition_check() if sse4.2 is not supported?
> 
Absolutely, v2 shortly.
Neil

> Konstantin
> 
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-30 13:01   ` Neil Horman
@ 2014-07-30 13:44     ` Ananyev, Konstantin
  0 siblings, 0 replies; 45+ messages in thread
From: Ananyev, Konstantin @ 2014-07-30 13:44 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Wednesday, July 30, 2014 2:01 PM
> To: Ananyev, Konstantin
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
> 
> On Wed, Jul 30, 2014 at 12:07:39PM +0000, Ananyev, Konstantin wrote:
> >
> >
> > Hi Neil,
> >
> > > Hey all-
> > >         I've been trying to update the fedora dpdk package to support VFIO
> > > enabled drivers and ran into a problem in which ixgbe didn't compile because the
> > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the
> > > default config I have.  I tried to remedy this by replacing the intrinsics with
> > > the __builtin macros, but it was pointed out (correctly), that this doesn't work
> > > properly.  So this is my second attempt, which I actually like a bit better.  I
> > > noted that code that uses intrinsics (ixgbe and the acl library), don't need to
> > > have those instructions turned on build-wide.  Rather, we can just enable the
> > > instructions in the specific code we want to build with support for that, and
> > > test for instruction support dynamically at run time.  This allows me to build
> > > the dpdk for a generic platform, but in such a way that some optimizations can
> > > be used if the executing cpu supports them at run time.
> >
> > Indeed it looks much better to me too.
> > Just few nits from me:
> >
> > 1. > @@ -112,6 +112,15 @@ rte_acl_create(const struct rte_acl_param *param)
> > >  	struct rte_acl_list *acl_list;
> > >  	struct rte_tailq_entry *te;
> > >  	char name[sizeof(ctx->name)];
> > > +	static int acl_supported = -1;
> > > +
> > > +	if (acl_supported == -1)
> > > +		acl_supported = rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_2);
> >
> > Do we really need acl_supported here?
> > It seems not a big deal to just always call  rte_cpu_get_flag_enabled().
> > After all it is a create function, and no-one expects it to be extremely fast.
> >
> Need, no.  My only thought was that some poorly behaved application will call
> rte_acl_create multiple times regardless of the error returned, and doing so
> will cause large volumes of calls to cpuid, which evicts several high-use
> registers, so I didn't want to call it more than needed.  If you think its ok to
> call it multiple times though, I'm fine with removing it.

>From my thought rte_acl_create() is not supposed to be called in  the middle packet processing.
It is sort of setup function. That's why I think nothing wrong would happen even if cpuid would be called several times. 
Again  ixgbe_rx_vec_condition_check() would probably be called much more often (for each ixgbe rx queue we are going to use).

> > 2. Can you add RTE_LOG(ERR, ...) for re_acl_create() and ixgbe_rx_vec_condition_check() if sse4.2 is not supported?
> >
> Absolutely, v2 shortly.
> Neil
> 
> > Konstantin
> >
> >

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [dpdk-dev] [PATCH v2 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-29 20:24 [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features Neil Horman
                   ` (2 preceding siblings ...)
  2014-07-30 12:07 ` [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features Ananyev, Konstantin
@ 2014-07-30 14:49 ` Neil Horman
  2014-07-30 14:49   ` [dpdk-dev] [PATCH v2 1/2] ixgbe: test sse4.2 support at runtime for vectorized receive operations Neil Horman
                     ` (3 more replies)
  2014-07-30 18:59 ` [dpdk-dev] [PATCH " Bruce Richardson
  4 siblings, 4 replies; 45+ messages in thread
From: Neil Horman @ 2014-07-30 14:49 UTC (permalink / raw)
  To: dev

Hey all-
        I've been trying to update the fedora dpdk package to support VFIO
enabled drivers and ran into a problem in which ixgbe didn't compile because the
rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the
default config I have.  I tried to remedy this by replacing the intrinsics with
the __builtin macros, but it was pointed out (correctly), that this doesn't work
properly.  So this is my second attempt, which I actually like a bit better.  I
noted that code that uses intrinsics (ixgbe and the acl library), don't need to
have those instructions turned on build-wide.  Rather, we can just enable the
instructions in the specific code we want to build with support for that, and
test for instruction support dynamically at run time.  This allows me to build
the dpdk for a generic platform, but in such a way that some optimizations can
be used if the executing cpu supports them at run time.

Change notes:

v2)
	* Added Log messages to run time check failures per Konstantin
	* Removed run time check caching in acl per Konstantin

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Thomas Monjalon <thomas.monjalon@6wind.com>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [dpdk-dev] [PATCH v2 1/2] ixgbe: test sse4.2 support at runtime for vectorized receive operations
  2014-07-30 14:49 ` [dpdk-dev] [PATCH v2 " Neil Horman
@ 2014-07-30 14:49   ` Neil Horman
  2014-07-30 14:49   ` [dpdk-dev] [PATCH v2 2/2] acl: Preform dynamic sse4.2 support check Neil Horman
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 45+ messages in thread
From: Neil Horman @ 2014-07-30 14:49 UTC (permalink / raw)
  To: dev

The ixgbe vector receive code uses the sse4.2 intrinsics directly.  Instead of
requiring that they be enabled at build time for the entire dpdk library, just
enable sse4.2 for the ixgbe_rxtx_vec.c file, and test for support at runtime
when using it

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Thomas Monjalon <thomas.monjalon@6wind.com>
---
 lib/librte_pmd_ixgbe/Makefile         | 6 ++++++
 lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c | 5 +++++
 2 files changed, 11 insertions(+)

diff --git a/lib/librte_pmd_ixgbe/Makefile b/lib/librte_pmd_ixgbe/Makefile
index 00ccedb..c002239 100644
--- a/lib/librte_pmd_ixgbe/Makefile
+++ b/lib/librte_pmd_ixgbe/Makefile
@@ -39,6 +39,12 @@ LIB = librte_pmd_ixgbe.a
 CFLAGS += -O3
 CFLAGS += $(WERROR_FLAGS)
 
+#
+# the vectorized recieve functions need sse4.2 instruction 
+# intrinsics, make sure we emit them from the compiler
+#
+CFLAGS_ixgbe_rxtx_vec.o += -msse4.2
+
 ifeq ($(CC), icc)
 #
 # CFLAGS for icc
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
index 09e19a3..18e8bfe 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
@@ -34,6 +34,7 @@
 #include <stdint.h>
 #include <rte_ethdev.h>
 #include <rte_malloc.h>
+#include <rte_cpuflags.h>
 
 #include "ixgbe_ethdev.h"
 #include "ixgbe_rxtx.h"
@@ -679,6 +680,10 @@ int ixgbe_rx_vec_condition_check(struct rte_eth_dev *dev)
 	struct rte_eth_rxmode *rxmode = &dev->data->dev_conf.rxmode;
 	struct rte_fdir_conf *fconf = &dev->data->dev_conf.fdir_conf;
 
+	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_2)) {
+		RTE_LOG(ERR, PMD, "ixgbe vector rx needs sse4.2!\n");
+		return -1;
+	}
 #ifndef RTE_IXGBE_RX_OLFLAGS_ENABLE
 	/* whithout rx ol_flags, no VP flag report */
 	if (rxmode->hw_vlan_strip != 0 ||
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [dpdk-dev] [PATCH v2 2/2] acl: Preform dynamic sse4.2 support check
  2014-07-30 14:49 ` [dpdk-dev] [PATCH v2 " Neil Horman
  2014-07-30 14:49   ` [dpdk-dev] [PATCH v2 1/2] ixgbe: test sse4.2 support at runtime for vectorized receive operations Neil Horman
@ 2014-07-30 14:49   ` Neil Horman
  2014-07-30 15:36   ` [dpdk-dev] [PATCH v2 0/2] dpdk: Allow for dynamic enablement of some isolated features Ananyev, Konstantin
  2014-07-30 19:03   ` Venky Venkatesan
  3 siblings, 0 replies; 45+ messages in thread
From: Neil Horman @ 2014-07-30 14:49 UTC (permalink / raw)
  To: dev

The ACL library relies on sse4.2 intrinsics to operate, but we don't have to
enable sse4.2 in the entire build.  Instead enable it for the ACL library alone,
and use rte_acl_create as a choke point, at which we can test for sse4.2
support, and return NULL.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Thomas Monjalon <thomas.monjalon@6wind.com>
---
 lib/librte_acl/Makefile  | 2 +-
 lib/librte_acl/rte_acl.c | 6 ++++++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
index 4fe4593..3646439 100644
--- a/lib/librte_acl/Makefile
+++ b/lib/librte_acl/Makefile
@@ -35,7 +35,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
 LIB = librte_acl.a
 
 CFLAGS += -O3
-CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR)
+CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -msse4.2
 
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_ACL) += tb_mem.c
diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 7c288bd..37a590b 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -112,6 +112,12 @@ rte_acl_create(const struct rte_acl_param *param)
 	struct rte_acl_list *acl_list;
 	struct rte_tailq_entry *te;
 	char name[sizeof(ctx->name)];
+	
+	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_2)) {
+		RTE_LOG(ERR, ACL, "librte_acl requires sse4.2 instructions!\n");
+		rte_errno = EOPNOTSUPP;
+		return NULL;
+	}
 
 	/* check that we have an initialised tail queue */
 	acl_list = RTE_TAILQ_LOOKUP_BY_IDX(RTE_TAILQ_ACL, rte_acl_list);
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH v2 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-30 14:49 ` [dpdk-dev] [PATCH v2 " Neil Horman
  2014-07-30 14:49   ` [dpdk-dev] [PATCH v2 1/2] ixgbe: test sse4.2 support at runtime for vectorized receive operations Neil Horman
  2014-07-30 14:49   ` [dpdk-dev] [PATCH v2 2/2] acl: Preform dynamic sse4.2 support check Neil Horman
@ 2014-07-30 15:36   ` Ananyev, Konstantin
  2014-07-30 19:03   ` Venky Venkatesan
  3 siblings, 0 replies; 45+ messages in thread
From: Ananyev, Konstantin @ 2014-07-30 15:36 UTC (permalink / raw)
  To: Neil Horman, dev

> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Wednesday, July 30, 2014 3:49 PM
> To: dev@dpdk.org
> Cc: Ananyev, Konstantin; Neil Horman; Thomas Monjalon
> Subject: [PATCH v2 0/2] dpdk: Allow for dynamic enablement of some isolated features
> 
> Hey all-
>         I've been trying to update the fedora dpdk package to support VFIO
> enabled drivers and ran into a problem in which ixgbe didn't compile because the
> rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the
> default config I have.  I tried to remedy this by replacing the intrinsics with
> the __builtin macros, but it was pointed out (correctly), that this doesn't work
> properly.  So this is my second attempt, which I actually like a bit better.  I
> noted that code that uses intrinsics (ixgbe and the acl library), don't need to
> have those instructions turned on build-wide.  Rather, we can just enable the
> instructions in the specific code we want to build with support for that, and
> test for instruction support dynamically at run time.  This allows me to build
> the dpdk for a generic platform, but in such a way that some optimizations can
> be used if the executing cpu supports them at run time.
> 
> Change notes:
> 
> v2)
> 	* Added Log messages to run time check failures per Konstantin
> 	* Removed run time check caching in acl per Konstantin
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> CC: Thomas Monjalon <thomas.monjalon@6wind.com>

Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-29 20:24 [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features Neil Horman
                   ` (3 preceding siblings ...)
  2014-07-30 14:49 ` [dpdk-dev] [PATCH v2 " Neil Horman
@ 2014-07-30 18:59 ` Bruce Richardson
  2014-07-30 19:28   ` Neil Horman
  4 siblings, 1 reply; 45+ messages in thread
From: Bruce Richardson @ 2014-07-30 18:59 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote:
> Hey all-
>         I've been trying to update the fedora dpdk package to support VFIO 
> enabled drivers and ran into a problem in which ixgbe didn't compile because the 
> rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the 
> default config I have.  I tried to remedy this by replacing the intrinsics with 
> the __builtin macros, but it was pointed out (correctly), that this doesn't work 
> properly.  So this is my second attempt, which I actually like a bit better.  I 
> noted that code that uses intrinsics (ixgbe and the acl library), don't need to 
> have those instructions turned on build-wide.  Rather, we can just enable the 
> instructions in the specific code we want to build with support for that, and 
> test for instruction support dynamically at run time.  This allows me to build 
> the dpdk for a generic platform, but in such a way that some optimizations can 
> be used if the executing cpu supports them at run time.
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> CC: Thomas Monjalon <thomas.monjalon@6wind.com>
>
I'd prefer if a solution could be found based off your original patch
set, as it gives us more chance to deprecate the older code paths in
future. Looking at the Intel Intrinsics Guide site online, it shows that
the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x,
and so should be available on all 64-bit systems, I believe. The
popcount intrinsic is newer, but it's a much more basic instruction so
hopefully the __builtin should work for that.

Regards,
/Bruce

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH v2 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-30 14:49 ` [dpdk-dev] [PATCH v2 " Neil Horman
                     ` (2 preceding siblings ...)
  2014-07-30 15:36   ` [dpdk-dev] [PATCH v2 0/2] dpdk: Allow for dynamic enablement of some isolated features Ananyev, Konstantin
@ 2014-07-30 19:03   ` Venky Venkatesan
  2014-07-30 19:17     ` Neil Horman
  2014-07-30 19:34     ` Neil Horman
  3 siblings, 2 replies; 45+ messages in thread
From: Venky Venkatesan @ 2014-07-30 19:03 UTC (permalink / raw)
  To: dev

Neil,

> Hey all-
>          I've been trying to update the fedora dpdk package to support VFIO
> enabled drivers and ran into a problem in which ixgbe didn't compile because the
> rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the
> default config I have.  I tried to remedy this by replacing the intrinsics with
> the __builtin macros, but it was pointed out (correctly), that this doesn't work
> properly.  So this is my second attempt, which I actually like a bit better.  I
> noted that code that uses intrinsics (ixgbe and the acl library), don't need to
> have those instructions turned on build-wide.  Rather, we can just enable the
> instructions in the specific code we want to build with support for that, and
> test for instruction support dynamically at run time.  This allows me to build
> the dpdk for a generic platform, but in such a way that some optimizations can
> be used if the executing cpu supports them at run time.
>
> Change notes:
>
> v2)
> 	* Added Log messages to run time check failures per Konstantin
> 	* Removed run time check caching in acl per Konstantin
>
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> CC: Thomas Monjalon <thomas.monjalon@6wind.com>
>

One possible change to look at - the _mm_shuffle_epi8 is something that 
is available all the way from SSSE3, so you should be able to leave that 
in. The SSE4.2 dependency is only for the popcount (and 
__builtin_popcountll will work for that anyway). That should make things 
simpler.

-Venky

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH v2 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-30 19:03   ` Venky Venkatesan
@ 2014-07-30 19:17     ` Neil Horman
  2014-07-30 19:34     ` Neil Horman
  1 sibling, 0 replies; 45+ messages in thread
From: Neil Horman @ 2014-07-30 19:17 UTC (permalink / raw)
  To: Venky Venkatesan; +Cc: dev

On Wed, Jul 30, 2014 at 12:03:38PM -0700, Venky Venkatesan wrote:
> Neil,
> 
> >Hey all-
> >         I've been trying to update the fedora dpdk package to support VFIO
> >enabled drivers and ran into a problem in which ixgbe didn't compile because the
> >rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the
> >default config I have.  I tried to remedy this by replacing the intrinsics with
> >the __builtin macros, but it was pointed out (correctly), that this doesn't work
> >properly.  So this is my second attempt, which I actually like a bit better.  I
> >noted that code that uses intrinsics (ixgbe and the acl library), don't need to
> >have those instructions turned on build-wide.  Rather, we can just enable the
> >instructions in the specific code we want to build with support for that, and
> >test for instruction support dynamically at run time.  This allows me to build
> >the dpdk for a generic platform, but in such a way that some optimizations can
> >be used if the executing cpu supports them at run time.
> >
> >Change notes:
> >
> >v2)
> >	* Added Log messages to run time check failures per Konstantin
> >	* Removed run time check caching in acl per Konstantin
> >
> >Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> >CC: Thomas Monjalon <thomas.monjalon@6wind.com>
> >
> 
> One possible change to look at - the _mm_shuffle_epi8 is something that is
> available all the way from SSSE3, so you should be able to leave that in.
> The SSE4.2 dependency is only for the popcount (and __builtin_popcountll
> will work for that anyway). That should make things simpler.
> 
Hmm, I'd just as soon leave the use of the __builtin_* intrinsics alone at this
point, as I'd like to focus on the optional enablement of these features at run
time more than making them work in all cases (at least for right now), as I'd
like a distribution build to take advantage of the accelerated instructions when
they're available at run time..  But if we can reduce the intrinsic requirement
to ssse3 in some locations that would be good.  Let me experiment and I'll post
a v3
Neil

> -Venky
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-30 18:59 ` [dpdk-dev] [PATCH " Bruce Richardson
@ 2014-07-30 19:28   ` Neil Horman
  2014-07-30 21:09     ` Bruce Richardson
  0 siblings, 1 reply; 45+ messages in thread
From: Neil Horman @ 2014-07-30 19:28 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote:
> On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote:
> > Hey all-
> >         I've been trying to update the fedora dpdk package to support VFIO 
> > enabled drivers and ran into a problem in which ixgbe didn't compile because the 
> > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the 
> > default config I have.  I tried to remedy this by replacing the intrinsics with 
> > the __builtin macros, but it was pointed out (correctly), that this doesn't work 
> > properly.  So this is my second attempt, which I actually like a bit better.  I 
> > noted that code that uses intrinsics (ixgbe and the acl library), don't need to 
> > have those instructions turned on build-wide.  Rather, we can just enable the 
> > instructions in the specific code we want to build with support for that, and 
> > test for instruction support dynamically at run time.  This allows me to build 
> > the dpdk for a generic platform, but in such a way that some optimizations can 
> > be used if the executing cpu supports them at run time.
> > 
> > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > CC: Thomas Monjalon <thomas.monjalon@6wind.com>
> >
> I'd prefer if a solution could be found based off your original patch
> set, as it gives us more chance to deprecate the older code paths in
> future. Looking at the Intel Intrinsics Guide site online, it shows that
> the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x,
> and so should be available on all 64-bit systems, I believe. The
> popcount intrinsic is newer, but it's a much more basic instruction so
> hopefully the __builtin should work for that.
> 
Yes, but as I look at it, thats somewhat counter to my goal, which is to offer
accelerated code paths on systems that can make use of it at run time.  If We
use the __builtin compiler functions, we will either:

1) Build those code paths with advanced instructions that won't work on older
systems (i.e. crash)

2) Build those code paths with less advanced instructions, meaning that we won't
speedup execution on systems that are capable of using the more advanced
instructions.

Using this run time check, we can, at least in these situations, make use of the
accelerated paths when the instructions are available, and ignore them when
they're not, at run time.

What would be ideal, would be an alternative type macro, like the linux kernel
employs, but implementing that would require some pretty significant work and
testing.  This seems like a much simpler approach.

Neil

> Regards,
> /Bruce
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH v2 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-30 19:03   ` Venky Venkatesan
  2014-07-30 19:17     ` Neil Horman
@ 2014-07-30 19:34     ` Neil Horman
  1 sibling, 0 replies; 45+ messages in thread
From: Neil Horman @ 2014-07-30 19:34 UTC (permalink / raw)
  To: Venky Venkatesan; +Cc: dev

On Wed, Jul 30, 2014 at 12:03:38PM -0700, Venky Venkatesan wrote:
> Neil,
> 
> >Hey all-
> >         I've been trying to update the fedora dpdk package to support VFIO
> >enabled drivers and ran into a problem in which ixgbe didn't compile because the
> >rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the
> >default config I have.  I tried to remedy this by replacing the intrinsics with
> >the __builtin macros, but it was pointed out (correctly), that this doesn't work
> >properly.  So this is my second attempt, which I actually like a bit better.  I
> >noted that code that uses intrinsics (ixgbe and the acl library), don't need to
> >have those instructions turned on build-wide.  Rather, we can just enable the
> >instructions in the specific code we want to build with support for that, and
> >test for instruction support dynamically at run time.  This allows me to build
> >the dpdk for a generic platform, but in such a way that some optimizations can
> >be used if the executing cpu supports them at run time.
> >
> >Change notes:
> >
> >v2)
> >	* Added Log messages to run time check failures per Konstantin
> >	* Removed run time check caching in acl per Konstantin
> >
> >Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> >CC: Thomas Monjalon <thomas.monjalon@6wind.com>
> >
> 
> One possible change to look at - the _mm_shuffle_epi8 is something that is
> available all the way from SSSE3, so you should be able to leave that in.
> The SSE4.2 dependency is only for the popcount (and __builtin_popcountll
> will work for that anyway). That should make things simpler.
> 
> -Venky
> 

ugh, actually both the ixgbe and acl libraries use the popcnt instruction, so I
think I'd rather just keep the sse4.2 instruction set as it is, unless you think
the __builtin_popcnt is equally performant.

Neil

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-30 19:28   ` Neil Horman
@ 2014-07-30 21:09     ` Bruce Richardson
  2014-07-31  9:30       ` Thomas Monjalon
                         ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Bruce Richardson @ 2014-07-30 21:09 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote:
> On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote:
> > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote:
> > > Hey all-
> > >         I've been trying to update the fedora dpdk package to support VFIO 
> > > enabled drivers and ran into a problem in which ixgbe didn't compile because the 
> > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the 
> > > default config I have.  I tried to remedy this by replacing the intrinsics with 
> > > the __builtin macros, but it was pointed out (correctly), that this doesn't work 
> > > properly.  So this is my second attempt, which I actually like a bit better.  I 
> > > noted that code that uses intrinsics (ixgbe and the acl library), don't need to 
> > > have those instructions turned on build-wide.  Rather, we can just enable the 
> > > instructions in the specific code we want to build with support for that, and 
> > > test for instruction support dynamically at run time.  This allows me to build 
> > > the dpdk for a generic platform, but in such a way that some optimizations can 
> > > be used if the executing cpu supports them at run time.
> > > 
> > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > CC: Thomas Monjalon <thomas.monjalon@6wind.com>
> > >
> > I'd prefer if a solution could be found based off your original patch
> > set, as it gives us more chance to deprecate the older code paths in
> > future. Looking at the Intel Intrinsics Guide site online, it shows that
> > the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x,
> > and so should be available on all 64-bit systems, I believe. The
> > popcount intrinsic is newer, but it's a much more basic instruction so
> > hopefully the __builtin should work for that.
> > 
> Yes, but as I look at it, thats somewhat counter to my goal, which is to offer
> accelerated code paths on systems that can make use of it at run time.  If We
> use the __builtin compiler functions, we will either:
> 
> 1) Build those code paths with advanced instructions that won't work on older
> systems (i.e. crash)
> 
> 2) Build those code paths with less advanced instructions, meaning that we won't
> speedup execution on systems that are capable of using the more advanced
> instructions.
> 
> Using this run time check, we can, at least in these situations, make use of the
> accelerated paths when the instructions are available, and ignore them when
> they're not, at run time.
> 
> What would be ideal, would be an alternative type macro, like the linux kernel
> employs, but implementing that would require some pretty significant work and
> testing.  This seems like a much simpler approach.
> 

Ok, I understand where you are coming from indeed. However, within that,
I'd like to see us reduce the amount of code that's needed for
maintenance.

What we should really aim for, is to have common code, with perhaps some
small ifdefs or __builtins, and then compile that code multiple times
for multiple different architectures. So in this case, it would be nice
to use the __builtin, and then compile that code up with and without SSE
and select at runtime the code path to be used. Ideally, this could be
done at the driver level.

However, once you get down this path, you are dealing with more than
just SSE. If I compile up the PMD on my system, which has a chip based
on Sandy Bridge uarch, I find that there are multiple instructions
starting with "vp" which means that they are actually AVX instructions.
Even though the code is written using intrinsics which correspond to SSE
operations, the compiler is free to use AVX instructions where necessary
to improve performance. Therefore, if we go down this road, we need to
look to compile up the code for all microarchitectures, rather than just
assuming that we will get equivalent performance to "native" by turning
on the instruction set indicated by the primitives in the code. This is
where having one codepath recompiled multiple times will work far better
than having multiple code paths.

Regards,
/Bruce

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-30 21:09     ` Bruce Richardson
@ 2014-07-31  9:30       ` Thomas Monjalon
  2014-07-31 11:36       ` Ananyev, Konstantin
  2014-07-31 13:13       ` Neil Horman
  2 siblings, 0 replies; 45+ messages in thread
From: Thomas Monjalon @ 2014-07-31  9:30 UTC (permalink / raw)
  To: Bruce Richardson, Neil Horman; +Cc: dev

2014-07-30 14:09, Bruce Richardson:
> On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote:
> > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote:
> > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote:
> > > > Hey all-
> > > >         I've been trying to update the fedora dpdk package to support VFIO 
> > > > enabled drivers and ran into a problem in which ixgbe didn't compile because the 
> > > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the 
> > > > default config I have.  I tried to remedy this by replacing the intrinsics with 
> > > > the __builtin macros, but it was pointed out (correctly), that this doesn't work 
> > > > properly.  So this is my second attempt, which I actually like a bit better.  I 
> > > > noted that code that uses intrinsics (ixgbe and the acl library), don't need to 
> > > > have those instructions turned on build-wide.  Rather, we can just enable the 
> > > > instructions in the specific code we want to build with support for that, and 
> > > > test for instruction support dynamically at run time.  This allows me to build 
> > > > the dpdk for a generic platform, but in such a way that some optimizations can 
> > > > be used if the executing cpu supports them at run time.
> > > > 
> > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > > CC: Thomas Monjalon <thomas.monjalon@6wind.com>
> > > >
> > > I'd prefer if a solution could be found based off your original patch
> > > set, as it gives us more chance to deprecate the older code paths in
> > > future. Looking at the Intel Intrinsics Guide site online, it shows that
> > > the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x,
> > > and so should be available on all 64-bit systems, I believe. The
> > > popcount intrinsic is newer, but it's a much more basic instruction so
> > > hopefully the __builtin should work for that.
> > > 
> > Yes, but as I look at it, thats somewhat counter to my goal, which is to offer
> > accelerated code paths on systems that can make use of it at run time.  If We
> > use the __builtin compiler functions, we will either:
> > 
> > 1) Build those code paths with advanced instructions that won't work on older
> > systems (i.e. crash)
> > 
> > 2) Build those code paths with less advanced instructions, meaning that we won't
> > speedup execution on systems that are capable of using the more advanced
> > instructions.
> > 
> > Using this run time check, we can, at least in these situations, make use of the
> > accelerated paths when the instructions are available, and ignore them when
> > they're not, at run time.
> > 
> > What would be ideal, would be an alternative type macro, like the linux kernel
> > employs, but implementing that would require some pretty significant work and
> > testing.  This seems like a much simpler approach.
> > 
> 
> Ok, I understand where you are coming from indeed. However, within that,
> I'd like to see us reduce the amount of code that's needed for
> maintenance.
> 
> What we should really aim for, is to have common code, with perhaps some
> small ifdefs or __builtins, and then compile that code multiple times
> for multiple different architectures. So in this case, it would be nice
> to use the __builtin, and then compile that code up with and without SSE
> and select at runtime the code path to be used. Ideally, this could be
> done at the driver level.

Yes, having a runtime fallback in the driver to make it work efficiently
on all architectures seems a good idea.
We should keep in mind that it's very convenient to functionnaly test
a code path on a basic machine.

> However, once you get down this path, you are dealing with more than
> just SSE. If I compile up the PMD on my system, which has a chip based
> on Sandy Bridge uarch, I find that there are multiple instructions
> starting with "vp" which means that they are actually AVX instructions.
> Even though the code is written using intrinsics which correspond to SSE
> operations, the compiler is free to use AVX instructions where necessary
> to improve performance. Therefore, if we go down this road, we need to
> look to compile up the code for all microarchitectures, rather than just
> assuming that we will get equivalent performance to "native" by turning
> on the instruction set indicated by the primitives in the code. This is
> where having one codepath recompiled multiple times will work far better
> than having multiple code paths.

Choosing the best instructions for a task is the work of the compiler.
Making this choice at runtime is a big task and probably not desirable.
For performance, compilation must be done in "native mode" to let compiler
make the right decisions. For compatibility, compilation must be done in
"default mode".
Here the problem is that "default mode" compilation is broken for ixgbe and
acl. So we must fix it with, at least, an ifdef against SSE4.
Having a runtime decision is better because it brings some performance with
a default compilation but it's not really required today.

-- 
Thomas

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-30 21:09     ` Bruce Richardson
  2014-07-31  9:30       ` Thomas Monjalon
@ 2014-07-31 11:36       ` Ananyev, Konstantin
  2014-07-31 13:13       ` Neil Horman
  2 siblings, 0 replies; 45+ messages in thread
From: Ananyev, Konstantin @ 2014-07-31 11:36 UTC (permalink / raw)
  To: Richardson, Bruce, Neil Horman; +Cc: dev


Hi Bruce,

> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> Sent: Wednesday, July 30, 2014 10:09 PM
> To: Neil Horman
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
> 
> On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote:
> > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote:
> > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote:
> > > > Hey all-
> > > >         I've been trying to update the fedora dpdk package to support VFIO
> > > > enabled drivers and ran into a problem in which ixgbe didn't compile because the
> > > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the
> > > > default config I have.  I tried to remedy this by replacing the intrinsics with
> > > > the __builtin macros, but it was pointed out (correctly), that this doesn't work
> > > > properly.  So this is my second attempt, which I actually like a bit better.  I
> > > > noted that code that uses intrinsics (ixgbe and the acl library), don't need to
> > > > have those instructions turned on build-wide.  Rather, we can just enable the
> > > > instructions in the specific code we want to build with support for that, and
> > > > test for instruction support dynamically at run time.  This allows me to build
> > > > the dpdk for a generic platform, but in such a way that some optimizations can
> > > > be used if the executing cpu supports them at run time.
> > > >
> > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > > CC: Thomas Monjalon <thomas.monjalon@6wind.com>
> > > >
> > > I'd prefer if a solution could be found based off your original patch
> > > set, as it gives us more chance to deprecate the older code paths in
> > > future. Looking at the Intel Intrinsics Guide site online, it shows that
> > > the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x,
> > > and so should be available on all 64-bit systems, I believe. The
> > > popcount intrinsic is newer, but it's a much more basic instruction so
> > > hopefully the __builtin should work for that.
> > >
> > Yes, but as I look at it, thats somewhat counter to my goal, which is to offer
> > accelerated code paths on systems that can make use of it at run time.  If We
> > use the __builtin compiler functions, we will either:
> >
> > 1) Build those code paths with advanced instructions that won't work on older
> > systems (i.e. crash)
> >
> > 2) Build those code paths with less advanced instructions, meaning that we won't
> > speedup execution on systems that are capable of using the more advanced
> > instructions.
> >
> > Using this run time check, we can, at least in these situations, make use of the
> > accelerated paths when the instructions are available, and ignore them when
> > they're not, at run time.
> >
> > What would be ideal, would be an alternative type macro, like the linux kernel
> > employs, but implementing that would require some pretty significant work and
> > testing.  This seems like a much simpler approach.
> >
> 
> Ok, I understand where you are coming from indeed. However, within that,
> I'd like to see us reduce the amount of code that's needed for
> maintenance.
> 
> What we should really aim for, is to have common code, with perhaps some
> small ifdefs or __builtins, and then compile that code multiple times
> for multiple different architectures. So in this case, it would be nice
> to use the __builtin, and then compile that code up with and without SSE
> and select at runtime the code path to be used. Ideally, this could be
> done at the driver level.
> 
> However, once you get down this path, you are dealing with more than
> just SSE. If I compile up the PMD on my system, which has a chip based
> on Sandy Bridge uarch, I find that there are multiple instructions
> starting with "vp" which means that they are actually AVX instructions.
> Even though the code is written using intrinsics which correspond to SSE
> operations, the compiler is free to use AVX instructions where necessary
> to improve performance. 
> Therefore, if we go down this road, we need to
> look to compile up the code for all microarchitectures, rather than just
> assuming that we will get equivalent performance to "native" by turning
> on the instruction set indicated by the primitives in the code. This is
> where having one codepath recompiled multiple times will work far better
> than having multiple code paths.

Using your example - as long as we specify '-mavx' compiler can (and does) use AVX instructions
even for 'scalar' code (code without any SIMD instrincts).
And yes, that probably affects performance.
So, as I understand your suggestion, we'll then need to divide our code into:
- generic one - compiled to run on all supported platforms   
- performance critical that will be recompiled for each supported platform.
Then generic code would have to make decision at run-time what particular version of recompiled code to use.
And that for each PMD and all others performance-critical DPDK libraries.
Looks like too much hassle to me.
After all - if someone needs a package with binaries optimised for different architectures,
he can provide multiple DPDK binaries (build for different architectures) and small install script,
that would decide which binary is more appropriate for the given platform.

Konstantin
   

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-30 21:09     ` Bruce Richardson
  2014-07-31  9:30       ` Thomas Monjalon
  2014-07-31 11:36       ` Ananyev, Konstantin
@ 2014-07-31 13:13       ` Neil Horman
  2014-07-31 13:26         ` Thomas Monjalon
  2 siblings, 1 reply; 45+ messages in thread
From: Neil Horman @ 2014-07-31 13:13 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote:
> On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote:
> > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote:
> > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote:
> > > > Hey all-
> > > >         I've been trying to update the fedora dpdk package to support VFIO 
> > > > enabled drivers and ran into a problem in which ixgbe didn't compile because the 
> > > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the 
> > > > default config I have.  I tried to remedy this by replacing the intrinsics with 
> > > > the __builtin macros, but it was pointed out (correctly), that this doesn't work 
> > > > properly.  So this is my second attempt, which I actually like a bit better.  I 
> > > > noted that code that uses intrinsics (ixgbe and the acl library), don't need to 
> > > > have those instructions turned on build-wide.  Rather, we can just enable the 
> > > > instructions in the specific code we want to build with support for that, and 
> > > > test for instruction support dynamically at run time.  This allows me to build 
> > > > the dpdk for a generic platform, but in such a way that some optimizations can 
> > > > be used if the executing cpu supports them at run time.
> > > > 
> > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > > CC: Thomas Monjalon <thomas.monjalon@6wind.com>
> > > >
> > > I'd prefer if a solution could be found based off your original patch
> > > set, as it gives us more chance to deprecate the older code paths in
> > > future. Looking at the Intel Intrinsics Guide site online, it shows that
> > > the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x,
> > > and so should be available on all 64-bit systems, I believe. The
> > > popcount intrinsic is newer, but it's a much more basic instruction so
> > > hopefully the __builtin should work for that.
> > > 
> > Yes, but as I look at it, thats somewhat counter to my goal, which is to offer
> > accelerated code paths on systems that can make use of it at run time.  If We
> > use the __builtin compiler functions, we will either:
> > 
> > 1) Build those code paths with advanced instructions that won't work on older
> > systems (i.e. crash)
> > 
> > 2) Build those code paths with less advanced instructions, meaning that we won't
> > speedup execution on systems that are capable of using the more advanced
> > instructions.
> > 
> > Using this run time check, we can, at least in these situations, make use of the
> > accelerated paths when the instructions are available, and ignore them when
> > they're not, at run time.
> > 
> > What would be ideal, would be an alternative type macro, like the linux kernel
> > employs, but implementing that would require some pretty significant work and
> > testing.  This seems like a much simpler approach.
> > 
> 
> Ok, I understand where you are coming from indeed. However, within that,
> I'd like to see us reduce the amount of code that's needed for
> maintenance.
> 
Ok, but that seems orthogonal to what I've done here.  I've added about 10 lines
of easily understandable code, which seems reasonable to me.

> What we should really aim for, is to have common code, with perhaps some
> small ifdefs or __builtins, and then compile that code multiple times
> for multiple different architectures. So in this case, it would be nice
> to use the __builtin, and then compile that code up with and without SSE
> and select at runtime the code path to be used. Ideally, this could be
> done at the driver level.
> 
No, that is in direct conflict with what I'm trying to do here.  My goal is to
enable a build for a least common denominator system, but include those code
paths that allow some performance benefit on systems which support the feature,
to be determined at run time.  My goal is improving performance in the Fedora
dpdk package, which we build once for all supported systems on an arch.
Building multiple libraries for multiple system configurations is simply an
unmaintainable solution.

Now, a macro that selected an instruction optimized or generic path is fine, as
long as it can happen at run time.  The Linux kernel has such a feature, called
alternatives.  But its a complex subsystem that does run time replacement of
instructions based on cpu feature flags.  It would be great to have in the DPDK,
but its a significant code base and difficult to maintain, which goes against
your desire to reduce code.

> However, once you get down this path, you are dealing with more than
> just SSE. If I compile up the PMD on my system, which has a chip based
> on Sandy Bridge uarch, I find that there are multiple instructions
> starting with "vp" which means that they are actually AVX instructions.

You've done it wrong, you're building for the native machine target.  For fedora
I build for the default machine target (core2) which is the minimal system that
fedora supports.  So all the code emitted for the dpdk is functional for all
fedora systems.  This patch set then just enables the sse4.2 intrinsics for the
ixgbe vec rx/tx path and the acl library, and inculdes a run time check for
SSE4.2 before either of those paths can be used.

> Even though the code is written using intrinsics which correspond to SSE
> operations, the compiler is free to use AVX instructions where necessary
Not if you use the default machine target.

> to improve performance. Therefore, if we go down this road, we need to
> look to compile up the code for all microarchitectures, rather than just
> assuming that we will get equivalent performance to "native" by turning
> on the instruction set indicated by the primitives in the code. This is
No, you compile for the least common demonitor system, and enable more
performant paths opportunistically as run time checks allow.

> where having one codepath recompiled multiple times will work far better
> than having multiple code paths.
Only if you're only concern is performance.  As noted above, my goal is more
than just performance, its compatibility accross systems.  Multiple builds for
multiple cpu flag availability is simply a non-starter for a generic
distribution.

Regards
Neil

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-31 13:13       ` Neil Horman
@ 2014-07-31 13:26         ` Thomas Monjalon
  2014-07-31 14:32           ` Neil Horman
  0 siblings, 1 reply; 45+ messages in thread
From: Thomas Monjalon @ 2014-07-31 13:26 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

2014-07-31 09:13, Neil Horman:
> On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote:
> > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote:
> > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote:
> > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote:
> > > > > Hey all-
> > > > >         I've been trying to update the fedora dpdk package to support VFIO 
> > > > > enabled drivers and ran into a problem in which ixgbe didn't compile because the 
> > > > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the 
> > > > > default config I have.  I tried to remedy this by replacing the intrinsics with 
> > > > > the __builtin macros, but it was pointed out (correctly), that this doesn't work 
> > > > > properly.  So this is my second attempt, which I actually like a bit better.  I 
> > > > > noted that code that uses intrinsics (ixgbe and the acl library), don't need to 
> > > > > have those instructions turned on build-wide.  Rather, we can just enable the 
> > > > > instructions in the specific code we want to build with support for that, and 
> > > > > test for instruction support dynamically at run time.  This allows me to build 
> > > > > the dpdk for a generic platform, but in such a way that some optimizations can 
> > > > > be used if the executing cpu supports them at run time.
> > > > > 
> > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > > > CC: Thomas Monjalon <thomas.monjalon@6wind.com>
> > > > >
> > > > I'd prefer if a solution could be found based off your original patch
> > > > set, as it gives us more chance to deprecate the older code paths in
> > > > future. Looking at the Intel Intrinsics Guide site online, it shows that
> > > > the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x,
> > > > and so should be available on all 64-bit systems, I believe. The
> > > > popcount intrinsic is newer, but it's a much more basic instruction so
> > > > hopefully the __builtin should work for that.
> > > > 
> > > Yes, but as I look at it, thats somewhat counter to my goal, which is to offer
> > > accelerated code paths on systems that can make use of it at run time.  If We
> > > use the __builtin compiler functions, we will either:
> > > 
> > > 1) Build those code paths with advanced instructions that won't work on older
> > > systems (i.e. crash)
> > > 
> > > 2) Build those code paths with less advanced instructions, meaning that we won't
> > > speedup execution on systems that are capable of using the more advanced
> > > instructions.
> > > 
> > > Using this run time check, we can, at least in these situations, make use of the
> > > accelerated paths when the instructions are available, and ignore them when
> > > they're not, at run time.
> > > 
> > > What would be ideal, would be an alternative type macro, like the linux kernel
> > > employs, but implementing that would require some pretty significant work and
> > > testing.  This seems like a much simpler approach.

[...]

> Now, a macro that selected an instruction optimized or generic path is fine, as
> long as it can happen at run time.  The Linux kernel has such a feature, called
> alternatives.  But its a complex subsystem that does run time replacement of
> instructions based on cpu feature flags.  It would be great to have in the DPDK,
> but its a significant code base and difficult to maintain, which goes against
> your desire to reduce code.

[...]

> > Even though the code is written using intrinsics which correspond to SSE
> > operations, the compiler is free to use AVX instructions where necessary
> Not if you use the default machine target.
> 
> > to improve performance. Therefore, if we go down this road, we need to
> > look to compile up the code for all microarchitectures, rather than just
> > assuming that we will get equivalent performance to "native" by turning
> > on the instruction set indicated by the primitives in the code. This is
> No, you compile for the least common demonitor system, and enable more
> performant paths opportunistically as run time checks allow.
> 
> > where having one codepath recompiled multiple times will work far better
> > than having multiple code paths.
> Only if you're only concern is performance.  As noted above, my goal is more
> than just performance, its compatibility accross systems.  Multiple builds for
> multiple cpu flag availability is simply a non-starter for a generic
> distribution.

Neil, we are mixing 2 different problems here.
1) we have to fix default build (without SSE-4.2)
2) we could try to have performance with default build

Please, let's focus on the first item and we could discuss about performance
later. Having some different code path choosed at runtime is a big rework and
imply changing the compilation model (RFC welcome).

-- 
Thomas

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-31 13:26         ` Thomas Monjalon
@ 2014-07-31 14:32           ` Neil Horman
  2014-07-31 18:10             ` Neil Horman
  2014-07-31 21:25             ` Thomas Monjalon
  0 siblings, 2 replies; 45+ messages in thread
From: Neil Horman @ 2014-07-31 14:32 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev

On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote:
> 2014-07-31 09:13, Neil Horman:
> > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote:
> > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote:
> > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote:
> > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote:
> > > > > > Hey all-
> > > > > >         I've been trying to update the fedora dpdk package to support VFIO 
> > > > > > enabled drivers and ran into a problem in which ixgbe didn't compile because the 
> > > > > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the 
> > > > > > default config I have.  I tried to remedy this by replacing the intrinsics with 
> > > > > > the __builtin macros, but it was pointed out (correctly), that this doesn't work 
> > > > > > properly.  So this is my second attempt, which I actually like a bit better.  I 
> > > > > > noted that code that uses intrinsics (ixgbe and the acl library), don't need to 
> > > > > > have those instructions turned on build-wide.  Rather, we can just enable the 
> > > > > > instructions in the specific code we want to build with support for that, and 
> > > > > > test for instruction support dynamically at run time.  This allows me to build 
> > > > > > the dpdk for a generic platform, but in such a way that some optimizations can 
> > > > > > be used if the executing cpu supports them at run time.
> > > > > > 
> > > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > > > > CC: Thomas Monjalon <thomas.monjalon@6wind.com>
> > > > > >
> > > > > I'd prefer if a solution could be found based off your original patch
> > > > > set, as it gives us more chance to deprecate the older code paths in
> > > > > future. Looking at the Intel Intrinsics Guide site online, it shows that
> > > > > the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x,
> > > > > and so should be available on all 64-bit systems, I believe. The
> > > > > popcount intrinsic is newer, but it's a much more basic instruction so
> > > > > hopefully the __builtin should work for that.
> > > > > 
> > > > Yes, but as I look at it, thats somewhat counter to my goal, which is to offer
> > > > accelerated code paths on systems that can make use of it at run time.  If We
> > > > use the __builtin compiler functions, we will either:
> > > > 
> > > > 1) Build those code paths with advanced instructions that won't work on older
> > > > systems (i.e. crash)
> > > > 
> > > > 2) Build those code paths with less advanced instructions, meaning that we won't
> > > > speedup execution on systems that are capable of using the more advanced
> > > > instructions.
> > > > 
> > > > Using this run time check, we can, at least in these situations, make use of the
> > > > accelerated paths when the instructions are available, and ignore them when
> > > > they're not, at run time.
> > > > 
> > > > What would be ideal, would be an alternative type macro, like the linux kernel
> > > > employs, but implementing that would require some pretty significant work and
> > > > testing.  This seems like a much simpler approach.
> 
> [...]
> 
> > Now, a macro that selected an instruction optimized or generic path is fine, as
> > long as it can happen at run time.  The Linux kernel has such a feature, called
> > alternatives.  But its a complex subsystem that does run time replacement of
> > instructions based on cpu feature flags.  It would be great to have in the DPDK,
> > but its a significant code base and difficult to maintain, which goes against
> > your desire to reduce code.
> 
> [...]
> 
> > > Even though the code is written using intrinsics which correspond to SSE
> > > operations, the compiler is free to use AVX instructions where necessary
> > Not if you use the default machine target.
> > 
> > > to improve performance. Therefore, if we go down this road, we need to
> > > look to compile up the code for all microarchitectures, rather than just
> > > assuming that we will get equivalent performance to "native" by turning
> > > on the instruction set indicated by the primitives in the code. This is
> > No, you compile for the least common demonitor system, and enable more
> > performant paths opportunistically as run time checks allow.
> > 
> > > where having one codepath recompiled multiple times will work far better
> > > than having multiple code paths.
> > Only if you're only concern is performance.  As noted above, my goal is more
> > than just performance, its compatibility accross systems.  Multiple builds for
> > multiple cpu flag availability is simply a non-starter for a generic
> > distribution.
> 
> Neil, we are mixing 2 different problems here.
> 1) we have to fix default build (without SSE-4.2)
Thats nothing to fix, thats a configuration issue.  Just build for a lesser
machine.  I've already done that in the fedora build, using the defalut machine
target.  What exactly is missing from that?

> 2) we could try to have performance with default build
> 
Yes, we can, thats what this patch does.  It doesn't address every code path,
no, but it addresses two paths that are low hanging fruit for doing so, and we
can incrementally build on that

> Please, let's focus on the first item and we could discuss about performance
> later. Having some different code path choosed at runtime is a big rework and
> imply changing the compilation model (RFC welcome).
> 
Are you referring to the use of the Alternatives code here?  Yes, I absolutely
agree, thats a huge rework, and not worth considering yet (though it would be
great if we had it).  I'm not proposing that though, I'm just proposing that we
allow a few isolated paths to be selected at run time if and only if the cpu
supports doing so.  The check for both of these cases is in the setup path, and
so the check itself should not have any significant impact on performance.

I really don't see what the conflict is here.

Neil

> -- 
> Thomas
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-31 14:32           ` Neil Horman
@ 2014-07-31 18:10             ` Neil Horman
  2014-07-31 18:36               ` Bruce Richardson
  2014-07-31 21:53               ` Thomas Monjalon
  2014-07-31 21:25             ` Thomas Monjalon
  1 sibling, 2 replies; 45+ messages in thread
From: Neil Horman @ 2014-07-31 18:10 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev

On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote:
> On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote:
> > 2014-07-31 09:13, Neil Horman:
> > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote:
> > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote:
> > > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote:
> > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote:
> > > > > > > Hey all-
> > > > > > >         I've been trying to update the fedora dpdk package to support VFIO 
> > > > > > > enabled drivers and ran into a problem in which ixgbe didn't compile because the 
> > > > > > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the 
> > > > > > > default config I have.  I tried to remedy this by replacing the intrinsics with 
> > > > > > > the __builtin macros, but it was pointed out (correctly), that this doesn't work 
> > > > > > > properly.  So this is my second attempt, which I actually like a bit better.  I 
> > > > > > > noted that code that uses intrinsics (ixgbe and the acl library), don't need to 
> > > > > > > have those instructions turned on build-wide.  Rather, we can just enable the 
> > > > > > > instructions in the specific code we want to build with support for that, and 
> > > > > > > test for instruction support dynamically at run time.  This allows me to build 
> > > > > > > the dpdk for a generic platform, but in such a way that some optimizations can 
> > > > > > > be used if the executing cpu supports them at run time.
> > > > > > > 
> > > > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > > > > > CC: Thomas Monjalon <thomas.monjalon@6wind.com>
> > > > > > >
> > > > > > I'd prefer if a solution could be found based off your original patch
> > > > > > set, as it gives us more chance to deprecate the older code paths in
> > > > > > future. Looking at the Intel Intrinsics Guide site online, it shows that
> > > > > > the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x,
> > > > > > and so should be available on all 64-bit systems, I believe. The
> > > > > > popcount intrinsic is newer, but it's a much more basic instruction so
> > > > > > hopefully the __builtin should work for that.
> > > > > > 
> > > > > Yes, but as I look at it, thats somewhat counter to my goal, which is to offer
> > > > > accelerated code paths on systems that can make use of it at run time.  If We
> > > > > use the __builtin compiler functions, we will either:
> > > > > 
> > > > > 1) Build those code paths with advanced instructions that won't work on older
> > > > > systems (i.e. crash)
> > > > > 
> > > > > 2) Build those code paths with less advanced instructions, meaning that we won't
> > > > > speedup execution on systems that are capable of using the more advanced
> > > > > instructions.
> > > > > 
> > > > > Using this run time check, we can, at least in these situations, make use of the
> > > > > accelerated paths when the instructions are available, and ignore them when
> > > > > they're not, at run time.
> > > > > 
> > > > > What would be ideal, would be an alternative type macro, like the linux kernel
> > > > > employs, but implementing that would require some pretty significant work and
> > > > > testing.  This seems like a much simpler approach.
> > 
> > [...]
> > 
> > > Now, a macro that selected an instruction optimized or generic path is fine, as
> > > long as it can happen at run time.  The Linux kernel has such a feature, called
> > > alternatives.  But its a complex subsystem that does run time replacement of
> > > instructions based on cpu feature flags.  It would be great to have in the DPDK,
> > > but its a significant code base and difficult to maintain, which goes against
> > > your desire to reduce code.
> > 
> > [...]
> > 
> > > > Even though the code is written using intrinsics which correspond to SSE
> > > > operations, the compiler is free to use AVX instructions where necessary
> > > Not if you use the default machine target.
> > > 
> > > > to improve performance. Therefore, if we go down this road, we need to
> > > > look to compile up the code for all microarchitectures, rather than just
> > > > assuming that we will get equivalent performance to "native" by turning
> > > > on the instruction set indicated by the primitives in the code. This is
> > > No, you compile for the least common demonitor system, and enable more
> > > performant paths opportunistically as run time checks allow.
> > > 
> > > > where having one codepath recompiled multiple times will work far better
> > > > than having multiple code paths.
> > > Only if you're only concern is performance.  As noted above, my goal is more
> > > than just performance, its compatibility accross systems.  Multiple builds for
> > > multiple cpu flag availability is simply a non-starter for a generic
> > > distribution.
> > 
> > Neil, we are mixing 2 different problems here.
> > 1) we have to fix default build (without SSE-4.2)
> Thats nothing to fix, thats a configuration issue.  Just build for a lesser
> machine.  I've already done that in the fedora build, using the defalut machine
> target.  What exactly is missing from that?
> 
Re-reading this, I'm wondering if I missed what you were trying to say, if so I
apologize.  Were you trying to assert that the right thing to do here is to
adjust the ixgbe and acl code paths to not use the sse4.2 intrinsics so that
they are buildable on the default platform?  If so, I agree, thats a nice idea,
and am supportive of it, though I don't think that fully solves teh problem.  In
the case of the ixgbe pmd, what we have is 2 code paths, a generic code path,
and an optimized code path using sse4.2 intrinsics.  In this case, I don't think
theres anything to fix, in that I'm fine with the optimized path needing sse4.2
to execute.  There I just want to be able to do a run time check and use the
optimized path if the cpu supports it, and just use the default path otherwise.
In effect we already have exactly what you are looking for there.

As far as the ACL library goes, yes, thats more complex.  The use of sse4.2
intrinsics there is done througout the code, so theres no easy way to select a
path.  we're just left with either using the code or returning an error at run
time, as my patch does.  Certainly we can build some macros that either use the
intrinsics for sse4.2 or code up some C-level variants of those instructions
based on generic code, and build for the least common demoniator, or compile the
code twice (once without sse4.2 support, and once with), and do a runtime
selection between the two.  Either way, thats going to be a useful, though
significant effort.

> > 2) we could try to have performance with default build
> > 
> Yes, we can, thats what this patch does.  It doesn't address every code path,
> no, but it addresses two paths that are low hanging fruit for doing so, and we
> can incrementally build on that
> 
> > Please, let's focus on the first item and we could discuss about performance
> > later. Having some different code path choosed at runtime is a big rework and
> > imply changing the compilation model (RFC welcome).
> > 
Even if I misinterpreted your statement above, I'm still not sure why your
asserting this. Fixing the build to work with the default target machine is
good, and should be undertaken, and I'll happily do so, but why reject the
solution in front of you to wait for it?  Even if I write macros to fix up the
ACL library, I'd still like to be able to do a run time check and select the
optimized version or the generic version based on cpu support.  Just doing a
compile time check to determine if sse4.2 is available really isn't going to cut
it for me, as I don't want the fedora dpdk to have pessimal performance if it
doesn't have to.

Regards
Neil

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-31 18:10             ` Neil Horman
@ 2014-07-31 18:36               ` Bruce Richardson
  2014-07-31 19:01                 ` Neil Horman
                                   ` (2 more replies)
  2014-07-31 21:53               ` Thomas Monjalon
  1 sibling, 3 replies; 45+ messages in thread
From: Bruce Richardson @ 2014-07-31 18:36 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

Thu, Jul 31, 2014 at 02:10:32PM -0400, Neil Horman wrote:
> On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote:
> > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote:
> > > 2014-07-31 09:13, Neil Horman:
> > > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote:
> > > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote:
> > > > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote:
> > > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote:
> > > > > > > > Hey all-
> > > > > > > >         I've been trying to update the fedora dpdk package to support VFIO 
> > > > > > > > enabled drivers and ran into a problem in which ixgbe didn't compile because the 
> > > > > > > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the 
> > > > > > > > default config I have.  I tried to remedy this by replacing the intrinsics with 
> > > > > > > > the __builtin macros, but it was pointed out (correctly), that this doesn't work 
> > > > > > > > properly.  So this is my second attempt, which I actually like a bit better.  I 
> > > > > > > > noted that code that uses intrinsics (ixgbe and the acl library), don't need to 
> > > > > > > > have those instructions turned on build-wide.  Rather, we can just enable the 
> > > > > > > > instructions in the specific code we want to build with support for that, and 
> > > > > > > > test for instruction support dynamically at run time.  This allows me to build 
> > > > > > > > the dpdk for a generic platform, but in such a way that some optimizations can 
> > > > > > > > be used if the executing cpu supports them at run time.
> > > > > > > > 
> > > > > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > > > > > > CC: Thomas Monjalon <thomas.monjalon@6wind.com>
> > > > > > > >
> > > > > > > I'd prefer if a solution could be found based off your original patch
> > > > > > > set, as it gives us more chance to deprecate the older code paths in
> > > > > > > future. Looking at the Intel Intrinsics Guide site online, it shows that
> > > > > > > the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x,
> > > > > > > and so should be available on all 64-bit systems, I believe. The
> > > > > > > popcount intrinsic is newer, but it's a much more basic instruction so
> > > > > > > hopefully the __builtin should work for that.
> > > > > > > 
> > > > > > Yes, but as I look at it, thats somewhat counter to my goal, which is to offer
> > > > > > accelerated code paths on systems that can make use of it at run time.  If We
> > > > > > use the __builtin compiler functions, we will either:
> > > > > > 
> > > > > > 1) Build those code paths with advanced instructions that won't work on older
> > > > > > systems (i.e. crash)
> > > > > > 
> > > > > > 2) Build those code paths with less advanced instructions, meaning that we won't
> > > > > > speedup execution on systems that are capable of using the more advanced
> > > > > > instructions.
> > > > > > 
> > > > > > Using this run time check, we can, at least in these situations, make use of the
> > > > > > accelerated paths when the instructions are available, and ignore them when
> > > > > > they're not, at run time.
> > > > > > 
> > > > > > What would be ideal, would be an alternative type macro, like the linux kernel
> > > > > > employs, but implementing that would require some pretty significant work and
> > > > > > testing.  This seems like a much simpler approach.
> > > 
> > > [...]
> > > 
> > > > Now, a macro that selected an instruction optimized or generic path is fine, as
> > > > long as it can happen at run time.  The Linux kernel has such a feature, called
> > > > alternatives.  But its a complex subsystem that does run time replacement of
> > > > instructions based on cpu feature flags.  It would be great to have in the DPDK,
> > > > but its a significant code base and difficult to maintain, which goes against
> > > > your desire to reduce code.
> > > 
> > > [...]
> > > 
> > > > > Even though the code is written using intrinsics which correspond to SSE
> > > > > operations, the compiler is free to use AVX instructions where necessary
> > > > Not if you use the default machine target.
> > > > 
> > > > > to improve performance. Therefore, if we go down this road, we need to
> > > > > look to compile up the code for all microarchitectures, rather than just
> > > > > assuming that we will get equivalent performance to "native" by turning
> > > > > on the instruction set indicated by the primitives in the code. This is
> > > > No, you compile for the least common demonitor system, and enable more
> > > > performant paths opportunistically as run time checks allow.
> > > > 
> > > > > where having one codepath recompiled multiple times will work far better
> > > > > than having multiple code paths.
> > > > Only if you're only concern is performance.  As noted above, my goal is more
> > > > than just performance, its compatibility accross systems.  Multiple builds for
> > > > multiple cpu flag availability is simply a non-starter for a generic
> > > > distribution.
> > > 
> > > Neil, we are mixing 2 different problems here.
> > > 1) we have to fix default build (without SSE-4.2)
> > Thats nothing to fix, thats a configuration issue.  Just build for a lesser
> > machine.  I've already done that in the fedora build, using the defalut machine
> > target.  What exactly is missing from that?
> > 
> Re-reading this, I'm wondering if I missed what you were trying to say, if so I
> apologize.  Were you trying to assert that the right thing to do here is to
> adjust the ixgbe and acl code paths to not use the sse4.2 intrinsics so that
> they are buildable on the default platform?  If so, I agree, thats a nice idea,
> and am supportive of it, though I don't think that fully solves teh problem.  In
> the case of the ixgbe pmd, what we have is 2 code paths, a generic code path,
> and an optimized code path using sse4.2 intrinsics.  In this case, I don't think
> theres anything to fix, in that I'm fine with the optimized path needing sse4.2
> to execute.  There I just want to be able to do a run time check and use the
> optimized path if the cpu supports it, and just use the default path otherwise.
> In effect we already have exactly what you are looking for there.
> 
> As far as the ACL library goes, yes, thats more complex.  The use of sse4.2
> intrinsics there is done througout the code, so theres no easy way to select a
> path.  we're just left with either using the code or returning an error at run
> time, as my patch does.  Certainly we can build some macros that either use the
> intrinsics for sse4.2 or code up some C-level variants of those instructions
> based on generic code, and build for the least common demoniator, or compile the
> code twice (once without sse4.2 support, and once with), and do a runtime
> selection between the two.  Either way, thats going to be a useful, though
> significant effort.

I think a good first step here that I can't see anyone objecting to is
to enable the ixgbe driver to use the vector code path for a generic
x86_64 build. I've run a quick test here, and changing "_mm_popcnt_u64"
to "__builtin_popcountll" [and the include from nmmintrin to tmmintrin]
allows a compile for machine type default, and testpmd can still forward
packets at a good rate (roughly perf down about 10% vs native compile on
SNB).
The ACL is a tougher nut to crack, but anyone see any issues with that
two-line change to ixgbe_rxtx_vec.c? [Neil, since you started the patch
set thread, do you want to submit an official patch here, or would you prefer I
do so?]

> 
> > > 2) we could try to have performance with default build
> > > 
> > Yes, we can, thats what this patch does.  It doesn't address every code path,
> > no, but it addresses two paths that are low hanging fruit for doing so, and we
> > can incrementally build on that
> > 
> > > Please, let's focus on the first item and we could discuss about performance
> > > later. Having some different code path choosed at runtime is a big rework and
> > > imply changing the compilation model (RFC welcome).
> > > 
> Even if I misinterpreted your statement above, I'm still not sure why your
> asserting this. Fixing the build to work with the default target machine is
> good, and should be undertaken, and I'll happily do so, but why reject the
> solution in front of you to wait for it?  Even if I write macros to fix up the
> ACL library, I'd still like to be able to do a run time check and select the
> optimized version or the generic version based on cpu support.  Just doing a
> compile time check to determine if sse4.2 is available really isn't going to cut
> it for me, as I don't want the fedora dpdk to have pessimal performance if it
> doesn't have to.
> 
> Regards
> Neil
> 

With regards to the general approach for runtime detection of software
functions, I wonder if something like this can be handled by the
packaging system? Is it possible to ship out a set of shared libs
compiled up for different instruction sets, and then at rpm install
time, symlink the appropriate library? This would push the whole issue
of detection of code paths outside of code, work across all our
libraries and ensure each user got the best performance they could get
form a binary?
Has something like this been done before? The building of all the
libraries could be scripted easy enough, just do multiple builds using
different EXTRA_CFLAGS each time, and move and rename the .so's after
each run.

/Bruce

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-31 18:36               ` Bruce Richardson
@ 2014-07-31 19:01                 ` Neil Horman
  2014-07-31 20:19                   ` Bruce Richardson
  2014-07-31 19:58                 ` John W. Linville
  2014-07-31 20:10                 ` Neil Horman
  2 siblings, 1 reply; 45+ messages in thread
From: Neil Horman @ 2014-07-31 19:01 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote:
> Thu, Jul 31, 2014 at 02:10:32PM -0400, Neil Horman wrote:
> > On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote:
> > > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote:
> > > > 2014-07-31 09:13, Neil Horman:
> > > > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote:
> > > > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote:
> > > > > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote:
> > > > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote:
> > > > > > > > > Hey all-
> > > > > > > > >         I've been trying to update the fedora dpdk package to support VFIO 
> > > > > > > > > enabled drivers and ran into a problem in which ixgbe didn't compile because the 
> > > > > > > > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the 
> > > > > > > > > default config I have.  I tried to remedy this by replacing the intrinsics with 
> > > > > > > > > the __builtin macros, but it was pointed out (correctly), that this doesn't work 
> > > > > > > > > properly.  So this is my second attempt, which I actually like a bit better.  I 
> > > > > > > > > noted that code that uses intrinsics (ixgbe and the acl library), don't need to 
> > > > > > > > > have those instructions turned on build-wide.  Rather, we can just enable the 
> > > > > > > > > instructions in the specific code we want to build with support for that, and 
> > > > > > > > > test for instruction support dynamically at run time.  This allows me to build 
> > > > > > > > > the dpdk for a generic platform, but in such a way that some optimizations can 
> > > > > > > > > be used if the executing cpu supports them at run time.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > > > > > > > CC: Thomas Monjalon <thomas.monjalon@6wind.com>
> > > > > > > > >
> > > > > > > > I'd prefer if a solution could be found based off your original patch
> > > > > > > > set, as it gives us more chance to deprecate the older code paths in
> > > > > > > > future. Looking at the Intel Intrinsics Guide site online, it shows that
> > > > > > > > the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x,
> > > > > > > > and so should be available on all 64-bit systems, I believe. The
> > > > > > > > popcount intrinsic is newer, but it's a much more basic instruction so
> > > > > > > > hopefully the __builtin should work for that.
> > > > > > > > 
> > > > > > > Yes, but as I look at it, thats somewhat counter to my goal, which is to offer
> > > > > > > accelerated code paths on systems that can make use of it at run time.  If We
> > > > > > > use the __builtin compiler functions, we will either:
> > > > > > > 
> > > > > > > 1) Build those code paths with advanced instructions that won't work on older
> > > > > > > systems (i.e. crash)
> > > > > > > 
> > > > > > > 2) Build those code paths with less advanced instructions, meaning that we won't
> > > > > > > speedup execution on systems that are capable of using the more advanced
> > > > > > > instructions.
> > > > > > > 
> > > > > > > Using this run time check, we can, at least in these situations, make use of the
> > > > > > > accelerated paths when the instructions are available, and ignore them when
> > > > > > > they're not, at run time.
> > > > > > > 
> > > > > > > What would be ideal, would be an alternative type macro, like the linux kernel
> > > > > > > employs, but implementing that would require some pretty significant work and
> > > > > > > testing.  This seems like a much simpler approach.
> > > > 
> > > > [...]
> > > > 
> > > > > Now, a macro that selected an instruction optimized or generic path is fine, as
> > > > > long as it can happen at run time.  The Linux kernel has such a feature, called
> > > > > alternatives.  But its a complex subsystem that does run time replacement of
> > > > > instructions based on cpu feature flags.  It would be great to have in the DPDK,
> > > > > but its a significant code base and difficult to maintain, which goes against
> > > > > your desire to reduce code.
> > > > 
> > > > [...]
> > > > 
> > > > > > Even though the code is written using intrinsics which correspond to SSE
> > > > > > operations, the compiler is free to use AVX instructions where necessary
> > > > > Not if you use the default machine target.
> > > > > 
> > > > > > to improve performance. Therefore, if we go down this road, we need to
> > > > > > look to compile up the code for all microarchitectures, rather than just
> > > > > > assuming that we will get equivalent performance to "native" by turning
> > > > > > on the instruction set indicated by the primitives in the code. This is
> > > > > No, you compile for the least common demonitor system, and enable more
> > > > > performant paths opportunistically as run time checks allow.
> > > > > 
> > > > > > where having one codepath recompiled multiple times will work far better
> > > > > > than having multiple code paths.
> > > > > Only if you're only concern is performance.  As noted above, my goal is more
> > > > > than just performance, its compatibility accross systems.  Multiple builds for
> > > > > multiple cpu flag availability is simply a non-starter for a generic
> > > > > distribution.
> > > > 
> > > > Neil, we are mixing 2 different problems here.
> > > > 1) we have to fix default build (without SSE-4.2)
> > > Thats nothing to fix, thats a configuration issue.  Just build for a lesser
> > > machine.  I've already done that in the fedora build, using the defalut machine
> > > target.  What exactly is missing from that?
> > > 
> > Re-reading this, I'm wondering if I missed what you were trying to say, if so I
> > apologize.  Were you trying to assert that the right thing to do here is to
> > adjust the ixgbe and acl code paths to not use the sse4.2 intrinsics so that
> > they are buildable on the default platform?  If so, I agree, thats a nice idea,
> > and am supportive of it, though I don't think that fully solves teh problem.  In
> > the case of the ixgbe pmd, what we have is 2 code paths, a generic code path,
> > and an optimized code path using sse4.2 intrinsics.  In this case, I don't think
> > theres anything to fix, in that I'm fine with the optimized path needing sse4.2
> > to execute.  There I just want to be able to do a run time check and use the
> > optimized path if the cpu supports it, and just use the default path otherwise.
> > In effect we already have exactly what you are looking for there.
> > 
> > As far as the ACL library goes, yes, thats more complex.  The use of sse4.2
> > intrinsics there is done througout the code, so theres no easy way to select a
> > path.  we're just left with either using the code or returning an error at run
> > time, as my patch does.  Certainly we can build some macros that either use the
> > intrinsics for sse4.2 or code up some C-level variants of those instructions
> > based on generic code, and build for the least common demoniator, or compile the
> > code twice (once without sse4.2 support, and once with), and do a runtime
> > selection between the two.  Either way, thats going to be a useful, though
> > significant effort.
> 
> I think a good first step here that I can't see anyone objecting to is
> to enable the ixgbe driver to use the vector code path for a generic
> x86_64 build. I've run a quick test here, and changing "_mm_popcnt_u64"
> to "__builtin_popcountll" [and the include from nmmintrin to tmmintrin]
> allows a compile for machine type default, and testpmd can still forward
> packets at a good rate (roughly perf down about 10% vs native compile on
> SNB).
> The ACL is a tougher nut to crack, but anyone see any issues with that
> two-line change to ixgbe_rxtx_vec.c? [Neil, since you started the patch
> set thread, do you want to submit an official patch here, or would you prefer I
> do so?]
> 

I'm happy to do so, Though 10% performance degradation vs. using the sse4.2
instructions in that path seems significant, isn't it? Given that performance
delta, it seems like it would still be preferable to have a path that used the
sse4.2 instructions when they're available.  Or am I misreading what you mean
when you say down 10%

Neil
 
> > 
> > > > 2) we could try to have performance with default build
> > > > 
> > > Yes, we can, thats what this patch does.  It doesn't address every code path,
> > > no, but it addresses two paths that are low hanging fruit for doing so, and we
> > > can incrementally build on that
> > > 
> > > > Please, let's focus on the first item and we could discuss about performance
> > > > later. Having some different code path choosed at runtime is a big rework and
> > > > imply changing the compilation model (RFC welcome).
> > > > 
> > Even if I misinterpreted your statement above, I'm still not sure why your
> > asserting this. Fixing the build to work with the default target machine is
> > good, and should be undertaken, and I'll happily do so, but why reject the
> > solution in front of you to wait for it?  Even if I write macros to fix up the
> > ACL library, I'd still like to be able to do a run time check and select the
> > optimized version or the generic version based on cpu support.  Just doing a
> > compile time check to determine if sse4.2 is available really isn't going to cut
> > it for me, as I don't want the fedora dpdk to have pessimal performance if it
> > doesn't have to.
> > 
> > Regards
> > Neil
> > 
> 
> With regards to the general approach for runtime detection of software
> functions, I wonder if something like this can be handled by the
> packaging system? Is it possible to ship out a set of shared libs
> compiled up for different instruction sets, and then at rpm install
> time, symlink the appropriate library? This would push the whole issue
> of detection of code paths outside of code, work across all our
> libraries and ensure each user got the best performance they could get
> form a binary?
> Has something like this been done before? The building of all the
> libraries could be scripted easy enough, just do multiple builds using
> different EXTRA_CFLAGS each time, and move and rename the .so's after
> each run.
> 
> /Bruce
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-31 18:36               ` Bruce Richardson
  2014-07-31 19:01                 ` Neil Horman
@ 2014-07-31 19:58                 ` John W. Linville
  2014-07-31 20:20                   ` Bruce Richardson
  2014-07-31 20:10                 ` Neil Horman
  2 siblings, 1 reply; 45+ messages in thread
From: John W. Linville @ 2014-07-31 19:58 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote:

> With regards to the general approach for runtime detection of software
> functions, I wonder if something like this can be handled by the
> packaging system? Is it possible to ship out a set of shared libs
> compiled up for different instruction sets, and then at rpm install
> time, symlink the appropriate library? This would push the whole issue
> of detection of code paths outside of code, work across all our
> libraries and ensure each user got the best performance they could get
> form a binary?
> Has something like this been done before? The building of all the
> libraries could be scripted easy enough, just do multiple builds using
> different EXTRA_CFLAGS each time, and move and rename the .so's after
> each run.

I'm not aware of a package that does anything like that.  It probably
is possible, but I imagine that it would provoke a lot of debate
and consternation in FESCO...

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-31 18:36               ` Bruce Richardson
  2014-07-31 19:01                 ` Neil Horman
  2014-07-31 19:58                 ` John W. Linville
@ 2014-07-31 20:10                 ` Neil Horman
  2014-07-31 20:25                   ` Bruce Richardson
  2 siblings, 1 reply; 45+ messages in thread
From: Neil Horman @ 2014-07-31 20:10 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote:
> Thu, Jul 31, 2014 at 02:10:32PM -0400, Neil Horman wrote:
> > On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote:
> > > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote:
> > > > 2014-07-31 09:13, Neil Horman:
> > > > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote:
> > > > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote:
> > > > > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote:
> > > > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote:
> > > > > > > > > Hey all-
> 
> With regards to the general approach for runtime detection of software
> functions, I wonder if something like this can be handled by the
> packaging system? Is it possible to ship out a set of shared libs
> compiled up for different instruction sets, and then at rpm install
> time, symlink the appropriate library? This would push the whole issue
> of detection of code paths outside of code, work across all our
> libraries and ensure each user got the best performance they could get
> form a binary?
> Has something like this been done before? The building of all the
> libraries could be scripted easy enough, just do multiple builds using
> different EXTRA_CFLAGS each time, and move and rename the .so's after
> each run.
> 

Sorry, I missed this in my last reply.

In answer to your question, the short version is that such a thing is roughly
possible from a packaging standpoint, but completely unworkable from a
distribution standpoint.  We could certainly build the dpdk multiple times and
rename all the shared objects to some variant name representative of the
optimzations we build in for certain cpu flags, but then we woudl be shipping X
versions of the dpdk, and any appilcation (say OVS that made use of the dpdk
would need to provide a version linked against each variant to be useful when
making a product, and each end user would need to manually select (or run a
script to select) which variant is most optimized for the system at hand.  Its
just not a reasonable way to package a library.

When pacaging software, the only consideration given to code variance at pacakge
time is architecture (x86/x86_64/ppc/s390/etc).  If you install a package for
your a given architecture, its expected to run on that architecture.  Optional
code paths are just that, optional, and executed based on run time tests.  Its a
requirement that we build for the lowest common demoniator system that is
supported, and enable accelerative code paths optionally at run time when the
cpu indicates support for them.

Neil

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-31 19:01                 ` Neil Horman
@ 2014-07-31 20:19                   ` Bruce Richardson
  2014-08-01 13:36                     ` Neil Horman
  0 siblings, 1 reply; 45+ messages in thread
From: Bruce Richardson @ 2014-07-31 20:19 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

On Thu, Jul 31, 2014 at 03:01:17PM -0400, Neil Horman wrote:
> On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote:
> > 
> > I think a good first step here that I can't see anyone objecting to is
> > to enable the ixgbe driver to use the vector code path for a generic
> > x86_64 build. I've run a quick test here, and changing "_mm_popcnt_u64"
> > to "__builtin_popcountll" [and the include from nmmintrin to tmmintrin]
> > allows a compile for machine type default, and testpmd can still forward
> > packets at a good rate (roughly perf down about 10% vs native compile on
> > SNB).
> > The ACL is a tougher nut to crack, but anyone see any issues with that
> > two-line change to ixgbe_rxtx_vec.c? [Neil, since you started the patch
> > set thread, do you want to submit an official patch here, or would you prefer I
> > do so?]
> > 
> 
> I'm happy to do so, Though 10% performance degradation vs. using the sse4.2
> instructions in that path seems significant, isn't it? Given that performance
> delta, it seems like it would still be preferable to have a path that used the
> sse4.2 instructions when they're available.  Or am I misreading what you mean
> when you say down 10%
> 
> Neil
>
Ok, I did a little bit more testing here. Using the vector pmd compiled
for generic x86_64 and using __builtin_popcountll is approx 35% faster
for packet IO than the existing fast-path functions. It is also 7% (a
bit lower than ~10% as I originally stated) slower than the existing
native-compiled vpmd on a Sandy Bridge platform.

I then ran an extra test, using EXTRA_CFLAGS='-msse4.2' to turn on the
extra instructions. The ~7% performance drop went to ~3%, so we would
gain a little more with using SSE4.2, but compared to the gain from
having the vector driver at all, it's not that much. [I don't have a
system handy with AVX2 support to see what boosts might come from
compiling with that instruction set enabled.]

Because of this, I'd take the ~35% speed boost for now, and try and find
what would be the best general way to solve this problem across all
libraries. Also, I think that anyone who needs that extra 4% performance
probably wants the other 3% too, and so will compile up the code from
source using the "native" compilation target. :-)

/Bruce

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-31 19:58                 ` John W. Linville
@ 2014-07-31 20:20                   ` Bruce Richardson
  2014-07-31 20:32                     ` John W. Linville
  0 siblings, 1 reply; 45+ messages in thread
From: Bruce Richardson @ 2014-07-31 20:20 UTC (permalink / raw)
  To: John W. Linville; +Cc: dev

On Thu, Jul 31, 2014 at 03:58:30PM -0400, John W. Linville wrote:
> On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote:
> 
> > With regards to the general approach for runtime detection of software
> > functions, I wonder if something like this can be handled by the
> > packaging system? Is it possible to ship out a set of shared libs
> > compiled up for different instruction sets, and then at rpm install
> > time, symlink the appropriate library? This would push the whole issue
> > of detection of code paths outside of code, work across all our
> > libraries and ensure each user got the best performance they could get
> > form a binary?
> > Has something like this been done before? The building of all the
> > libraries could be scripted easy enough, just do multiple builds using
> > different EXTRA_CFLAGS each time, and move and rename the .so's after
> > each run.
> 
> I'm not aware of a package that does anything like that.  It probably
> is possible, but I imagine that it would provoke a lot of debate
> and consternation in FESCO...

Nothing like a bit of consternation to get the adrenaline pumping, right
:-)
BTW: what is FESCO?

/Bruce

> 
> -- 
> John W. Linville		Someday the world will need a hero, and you
> linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-31 20:10                 ` Neil Horman
@ 2014-07-31 20:25                   ` Bruce Richardson
  2014-08-01 15:06                     ` Neil Horman
  0 siblings, 1 reply; 45+ messages in thread
From: Bruce Richardson @ 2014-07-31 20:25 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

On Thu, Jul 31, 2014 at 04:10:18PM -0400, Neil Horman wrote:
> On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote:
> > Thu, Jul 31, 2014 at 02:10:32PM -0400, Neil Horman wrote:
> > > On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote:
> > > > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote:
> > > > > 2014-07-31 09:13, Neil Horman:
> > > > > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote:
> > > > > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote:
> > > > > > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote:
> > > > > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote:
> > > > > > > > > > Hey all-
> > 
> > With regards to the general approach for runtime detection of software
> > functions, I wonder if something like this can be handled by the
> > packaging system? Is it possible to ship out a set of shared libs
> > compiled up for different instruction sets, and then at rpm install
> > time, symlink the appropriate library? This would push the whole issue
> > of detection of code paths outside of code, work across all our
> > libraries and ensure each user got the best performance they could get
> > form a binary?
> > Has something like this been done before? The building of all the
> > libraries could be scripted easy enough, just do multiple builds using
> > different EXTRA_CFLAGS each time, and move and rename the .so's after
> > each run.
> > 
> 
> Sorry, I missed this in my last reply.
> 
> In answer to your question, the short version is that such a thing is roughly
> possible from a packaging standpoint, but completely unworkable from a
> distribution standpoint.  We could certainly build the dpdk multiple times and
> rename all the shared objects to some variant name representative of the
> optimzations we build in for certain cpu flags, but then we woudl be shipping X
> versions of the dpdk, and any appilcation (say OVS that made use of the dpdk
> would need to provide a version linked against each variant to be useful when
> making a product, and each end user would need to manually select (or run a
> script to select) which variant is most optimized for the system at hand.  Its
> just not a reasonable way to package a library.

Sorry, perhaps I was not clear, having the user have to select the
appropriate library was not what I was suggesting. Instead, I was
suggesting that the rpm install "librte_pmd_ixgbe.so.generic",
"librte_pmd_ixgbe.so.sse42" and "librte_pmd_ixgbe.so.avx". Then the rpm
post-install script would look at the cpuflags in cpuinfo and then
symlink librte_pmd_ixgbe.so to the best-match version. That way the user
only has to link against "librte_pmd_ixgbe.so" and depending on the
system its run on, the loader will automatically resolve the symbols
from the appropriate instruction-set specific .so file.

> 
> When pacaging software, the only consideration given to code variance at pacakge
> time is architecture (x86/x86_64/ppc/s390/etc).  If you install a package for
> your a given architecture, its expected to run on that architecture.  Optional
> code paths are just that, optional, and executed based on run time tests.  Its a
> requirement that we build for the lowest common demoniator system that is
> supported, and enable accelerative code paths optionally at run time when the
> cpu indicates support for them.
> 
> Neil
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-31 20:20                   ` Bruce Richardson
@ 2014-07-31 20:32                     ` John W. Linville
  2014-08-01  8:46                       ` Vincent JARDIN
  0 siblings, 1 reply; 45+ messages in thread
From: John W. Linville @ 2014-07-31 20:32 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

On Thu, Jul 31, 2014 at 01:20:42PM -0700, Bruce Richardson wrote:
> On Thu, Jul 31, 2014 at 03:58:30PM -0400, John W. Linville wrote:
> > On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote:
> > 
> > > With regards to the general approach for runtime detection of software
> > > functions, I wonder if something like this can be handled by the
> > > packaging system? Is it possible to ship out a set of shared libs
> > > compiled up for different instruction sets, and then at rpm install
> > > time, symlink the appropriate library? This would push the whole issue
> > > of detection of code paths outside of code, work across all our
> > > libraries and ensure each user got the best performance they could get
> > > form a binary?
> > > Has something like this been done before? The building of all the
> > > libraries could be scripted easy enough, just do multiple builds using
> > > different EXTRA_CFLAGS each time, and move and rename the .so's after
> > > each run.
> > 
> > I'm not aware of a package that does anything like that.  It probably
> > is possible, but I imagine that it would provoke a lot of debate
> > and consternation in FESCO...
> 
> Nothing like a bit of consternation to get the adrenaline pumping, right
> :-)
> BTW: what is FESCO?

Fedora Engineering Steering Committee

Neil and I have already felt the hot breath of FESCO on our necks
regarding the Fedora DPDK package...

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-31 14:32           ` Neil Horman
  2014-07-31 18:10             ` Neil Horman
@ 2014-07-31 21:25             ` Thomas Monjalon
  1 sibling, 0 replies; 45+ messages in thread
From: Thomas Monjalon @ 2014-07-31 21:25 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

2014-07-31 10:32, Neil Horman:
> On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote:
> > Neil, we are mixing 2 different problems here.
> > 1) we have to fix default build (without SSE-4.2)
> 
> Thats nothing to fix, thats a configuration issue.  Just build for a lesser
> machine.  I've already done that in the fedora build, using the defalut machine
> target.  What exactly is missing from that?

I mean that building with RTE_MACHINE=default is broken in ixgbe and acl.
So the main priority is to fix it.
But performance of the native build must not be degraded.
And if we can have good performance with a default build, it's fine.

-- 
Thomas

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-31 18:10             ` Neil Horman
  2014-07-31 18:36               ` Bruce Richardson
@ 2014-07-31 21:53               ` Thomas Monjalon
  1 sibling, 0 replies; 45+ messages in thread
From: Thomas Monjalon @ 2014-07-31 21:53 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

2014-07-31 14:10, Neil Horman:
> On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote:
> > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote:
> > Please, let's focus on the first item and we could discuss about performance
> > > later. Having some different code path choosed at runtime is a big rework and
> > > imply changing the compilation model (RFC welcome).
> > > 
> Even if I misinterpreted your statement above, I'm still not sure why your
> asserting this. Fixing the build to work with the default target machine is
> good, and should be undertaken, and I'll happily do so, but why reject the
> solution in front of you to wait for it?

I'm not rejecting the solution.
Let's try to improve performance of the default build with runtime checks.
Seeing patches and benchmarks will be interesting.
You're opening a door and I don't know if we'll see the ceiling of this room :)

-- 
Thomas

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-31 20:32                     ` John W. Linville
@ 2014-08-01  8:46                       ` Vincent JARDIN
  2014-08-01 14:06                         ` Neil Horman
  0 siblings, 1 reply; 45+ messages in thread
From: Vincent JARDIN @ 2014-08-01  8:46 UTC (permalink / raw)
  To: dev

On 31/07/2014 22:32, John W. Linville wrote:
>> BTW: what is FESCO?
> Fedora Engineering Steering Committee
>
> Neil and I have already felt the hot breath of FESCO on our necks
> regarding the Fedora DPDK package...
>

I do confirm and feel that we should go step by step.

Having multiple library as Bruce suggest could be an option, I like this 
idea.

Another one could be to get a new ELF format (like executables on Mac or 
Windows) that allows to support multiple binaries optimized for each 
CPUs. I am not aware of such options with Linux loader. But here, as a 
DPDK community, we cannot push it. Any one at Fedora?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-31 20:19                   ` Bruce Richardson
@ 2014-08-01 13:36                     ` Neil Horman
  2014-08-01 13:56                       ` Ananyev, Konstantin
  0 siblings, 1 reply; 45+ messages in thread
From: Neil Horman @ 2014-08-01 13:36 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

On Thu, Jul 31, 2014 at 01:19:50PM -0700, Bruce Richardson wrote:
> On Thu, Jul 31, 2014 at 03:01:17PM -0400, Neil Horman wrote:
> > On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote:
> > > 
> > > I think a good first step here that I can't see anyone objecting to is
> > > to enable the ixgbe driver to use the vector code path for a generic
> > > x86_64 build. I've run a quick test here, and changing "_mm_popcnt_u64"
> > > to "__builtin_popcountll" [and the include from nmmintrin to tmmintrin]
> > > allows a compile for machine type default, and testpmd can still forward
> > > packets at a good rate (roughly perf down about 10% vs native compile on
> > > SNB).
> > > The ACL is a tougher nut to crack, but anyone see any issues with that
> > > two-line change to ixgbe_rxtx_vec.c? [Neil, since you started the patch
> > > set thread, do you want to submit an official patch here, or would you prefer I
> > > do so?]
> > > 
> > 
> > I'm happy to do so, Though 10% performance degradation vs. using the sse4.2
> > instructions in that path seems significant, isn't it? Given that performance
> > delta, it seems like it would still be preferable to have a path that used the
> > sse4.2 instructions when they're available.  Or am I misreading what you mean
> > when you say down 10%
> > 
> > Neil
> >
> Ok, I did a little bit more testing here. Using the vector pmd compiled
> for generic x86_64 and using __builtin_popcountll is approx 35% faster
> for packet IO than the existing fast-path functions. It is also 7% (a
> bit lower than ~10% as I originally stated) slower than the existing
> native-compiled vpmd on a Sandy Bridge platform.
> 
> I then ran an extra test, using EXTRA_CFLAGS='-msse4.2' to turn on the
> extra instructions. The ~7% performance drop went to ~3%, so we would
> gain a little more with using SSE4.2, but compared to the gain from
> having the vector driver at all, it's not that much. [I don't have a
> system handy with AVX2 support to see what boosts might come from
> compiling with that instruction set enabled.]
> 
> Because of this, I'd take the ~35% speed boost for now, and try and find
> what would be the best general way to solve this problem across all
> libraries. Also, I think that anyone who needs that extra 4% performance
> probably wants the other 3% too, and so will compile up the code from
> source using the "native" compilation target. :-)
> 


Wait a moment, I'm not entirely sure what you did here.  I understand that you
replaced the _mm_popcnt_u64 call in the ixgbe pmd vector receive path with
__builtin_popcnt, which is good, but ixgbe also uses the __mm_shuffle_epi8
intrinsic which is only available with sse4.2 from what I can see. did you
replace those calls with a __builtin_shuffle variant?  Otherwise, how did you
get the pmd to build?  I'm asking because this is what I tried in the first pass
and Konstantin gave some pretty convicing evidence that this was an unworkable
solution:
http://dpdk.org/ml/archives/dev/2014-July/004443.html

Neil

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-08-01 13:36                     ` Neil Horman
@ 2014-08-01 13:56                       ` Ananyev, Konstantin
  2014-08-01 14:26                         ` Venkatesan, Venky
  2014-08-01 14:27                         ` Neil Horman
  0 siblings, 2 replies; 45+ messages in thread
From: Ananyev, Konstantin @ 2014-08-01 13:56 UTC (permalink / raw)
  To: Neil Horman, Richardson, Bruce; +Cc: dev

> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Neil Horman
> Sent: Friday, August 01, 2014 2:37 PM
> To: Richardson, Bruce
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
> 
> On Thu, Jul 31, 2014 at 01:19:50PM -0700, Bruce Richardson wrote:
> > On Thu, Jul 31, 2014 at 03:01:17PM -0400, Neil Horman wrote:
> > > On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote:
> > > >
> > > > I think a good first step here that I can't see anyone objecting to is
> > > > to enable the ixgbe driver to use the vector code path for a generic
> > > > x86_64 build. I've run a quick test here, and changing "_mm_popcnt_u64"
> > > > to "__builtin_popcountll" [and the include from nmmintrin to tmmintrin]
> > > > allows a compile for machine type default, and testpmd can still forward
> > > > packets at a good rate (roughly perf down about 10% vs native compile on
> > > > SNB).
> > > > The ACL is a tougher nut to crack, but anyone see any issues with that
> > > > two-line change to ixgbe_rxtx_vec.c? [Neil, since you started the patch
> > > > set thread, do you want to submit an official patch here, or would you prefer I
> > > > do so?]
> > > >
> > >
> > > I'm happy to do so, Though 10% performance degradation vs. using the sse4.2
> > > instructions in that path seems significant, isn't it? Given that performance
> > > delta, it seems like it would still be preferable to have a path that used the
> > > sse4.2 instructions when they're available.  Or am I misreading what you mean
> > > when you say down 10%
> > >
> > > Neil
> > >
> > Ok, I did a little bit more testing here. Using the vector pmd compiled
> > for generic x86_64 and using __builtin_popcountll is approx 35% faster
> > for packet IO than the existing fast-path functions. It is also 7% (a
> > bit lower than ~10% as I originally stated) slower than the existing
> > native-compiled vpmd on a Sandy Bridge platform.
> >
> > I then ran an extra test, using EXTRA_CFLAGS='-msse4.2' to turn on the
> > extra instructions. The ~7% performance drop went to ~3%, so we would
> > gain a little more with using SSE4.2, but compared to the gain from
> > having the vector driver at all, it's not that much. [I don't have a
> > system handy with AVX2 support to see what boosts might come from
> > compiling with that instruction set enabled.]
> >
> > Because of this, I'd take the ~35% speed boost for now, and try and find
> > what would be the best general way to solve this problem across all
> > libraries. Also, I think that anyone who needs that extra 4% performance
> > probably wants the other 3% too, and so will compile up the code from
> > source using the "native" compilation target. :-)
> >
> 
> 
> Wait a moment, I'm not entirely sure what you did here.  I understand that you
> replaced the _mm_popcnt_u64 call in the ixgbe pmd vector receive path with
> __builtin_popcnt, which is good, but ixgbe also uses the __mm_shuffle_epi8
> intrinsic which is only available with sse4.2 from what I can see. did you
> replace those calls with a __builtin_shuffle variant?  Otherwise, how did you
> get the pmd to build?  I'm asking because this is what I tried in the first pass
> and Konstantin gave some pretty convicing evidence that this was an unworkable
> solution:
> http://dpdk.org/ml/archives/dev/2014-July/004443.html
> 

I think that _mm_shuffle_epi8 (PSHUFB) is available starting from SSE3.
So I presume, there is no need for replacement.
Konstantin

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-08-01  8:46                       ` Vincent JARDIN
@ 2014-08-01 14:06                         ` Neil Horman
  2014-08-01 14:57                           ` Vincent JARDIN
  0 siblings, 1 reply; 45+ messages in thread
From: Neil Horman @ 2014-08-01 14:06 UTC (permalink / raw)
  To: Vincent JARDIN; +Cc: dev

On Fri, Aug 01, 2014 at 10:46:33AM +0200, Vincent JARDIN wrote:
> On 31/07/2014 22:32, John W. Linville wrote:
> >>BTW: what is FESCO?
> >Fedora Engineering Steering Committee
> >
> >Neil and I have already felt the hot breath of FESCO on our necks
> >regarding the Fedora DPDK package...
> >
> 
> I do confirm and feel that we should go step by step.
> 
> Having multiple library as Bruce suggest could be an option, I like this
> idea.
> 
Its not an option (reasons described further down in the thread).

> Another one could be to get a new ELF format (like executables on Mac or
> Windows) that allows to support multiple binaries optimized for each CPUs. I
> am not aware of such options with Linux loader. But here, as a DPDK
> community, we cannot push it. Any one at Fedora?
> 
This is definately not an option, at least not without significant justification
or need.  What you're asking for here is the development of an entirely new
binary file format, the kernel and glibc support to interpret and execute it,
and the compiler tooling to emit code in that format.  Thats a huge undertaking,
its not going to be done just because a single library would like to ship
multiple binaries to be optimized for different cpu variants within the same
family.  Thats a multi year effort, and not something I'm prepared to even
consider undertaking.

Neil

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-08-01 13:56                       ` Ananyev, Konstantin
@ 2014-08-01 14:26                         ` Venkatesan, Venky
  2014-08-01 14:27                         ` Neil Horman
  1 sibling, 0 replies; 45+ messages in thread
From: Venkatesan, Venky @ 2014-08-01 14:26 UTC (permalink / raw)
  To: dev


On 8/1/2014 6:56 AM, Ananyev, Konstantin wrote:
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Neil Horman
>> Sent: Friday, August 01, 2014 2:37 PM
>> To: Richardson, Bruce
>> Cc: dev@dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
>>
>> On Thu, Jul 31, 2014 at 01:19:50PM -0700, Bruce Richardson wrote:
>>> On Thu, Jul 31, 2014 at 03:01:17PM -0400, Neil Horman wrote:
>>>> On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote:
>>>>> I think a good first step here that I can't see anyone objecting to is
>>>>> to enable the ixgbe driver to use the vector code path for a generic
>>>>> x86_64 build. I've run a quick test here, and changing "_mm_popcnt_u64"
>>>>> to "__builtin_popcountll" [and the include from nmmintrin to tmmintrin]
>>>>> allows a compile for machine type default, and testpmd can still forward
>>>>> packets at a good rate (roughly perf down about 10% vs native compile on
>>>>> SNB).
>>>>> The ACL is a tougher nut to crack, but anyone see any issues with that
>>>>> two-line change to ixgbe_rxtx_vec.c? [Neil, since you started the patch
>>>>> set thread, do you want to submit an official patch here, or would you prefer I
>>>>> do so?]
>>>>>
>>>> I'm happy to do so, Though 10% performance degradation vs. using the sse4.2
>>>> instructions in that path seems significant, isn't it? Given that performance
>>>> delta, it seems like it would still be preferable to have a path that used the
>>>> sse4.2 instructions when they're available.  Or am I misreading what you mean
>>>> when you say down 10%
>>>>
>>>> Neil
>>>>
>>> Ok, I did a little bit more testing here. Using the vector pmd compiled
>>> for generic x86_64 and using __builtin_popcountll is approx 35% faster
>>> for packet IO than the existing fast-path functions. It is also 7% (a
>>> bit lower than ~10% as I originally stated) slower than the existing
>>> native-compiled vpmd on a Sandy Bridge platform.
>>>
>>> I then ran an extra test, using EXTRA_CFLAGS='-msse4.2' to turn on the
>>> extra instructions. The ~7% performance drop went to ~3%, so we would
>>> gain a little more with using SSE4.2, but compared to the gain from
>>> having the vector driver at all, it's not that much. [I don't have a
>>> system handy with AVX2 support to see what boosts might come from
>>> compiling with that instruction set enabled.]
>>>
>>> Because of this, I'd take the ~35% speed boost for now, and try and find
>>> what would be the best general way to solve this problem across all
>>> libraries. Also, I think that anyone who needs that extra 4% performance
>>> probably wants the other 3% too, and so will compile up the code from
>>> source using the "native" compilation target. :-)
So if I read this right, the fast path scalar to the new "generic" 
vector implementation is 35%? That's is a bit higher than anticipated, 
but great!! One caution - we should probably get a performance read on 
Atom cores before we remove the scalar fast path completely.
>>
>> Wait a moment, I'm not entirely sure what you did here.  I understand that you
>> replaced the _mm_popcnt_u64 call in the ixgbe pmd vector receive path with
>> __builtin_popcnt, which is good, but ixgbe also uses the __mm_shuffle_epi8
>> intrinsic which is only available with sse4.2 from what I can see. did you
>> replace those calls with a __builtin_shuffle variant?  Otherwise, how did you
>> get the pmd to build?  I'm asking because this is what I tried in the first pass
>> and Konstantin gave some pretty convicing evidence that this was an unworkable
>> solution:
>> http://dpdk.org/ml/archives/dev/2014-July/004443.html
>>
> I think that _mm_shuffle_epi8 (PSHUFB) is available starting from SSE3.
> So I presume, there is no need for replacement.
> Konstantin

The change is really to keep the __mm_shuffle_epi8 and replace the 
_mm_popcnt_u64 with the builtin variant. That should allow compilation 
all the way up from SSSE3.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-08-01 13:56                       ` Ananyev, Konstantin
  2014-08-01 14:26                         ` Venkatesan, Venky
@ 2014-08-01 14:27                         ` Neil Horman
  1 sibling, 0 replies; 45+ messages in thread
From: Neil Horman @ 2014-08-01 14:27 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev

On Fri, Aug 01, 2014 at 01:56:24PM +0000, Ananyev, Konstantin wrote:
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Neil Horman
> > Sent: Friday, August 01, 2014 2:37 PM
> > To: Richardson, Bruce
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
> > 
> > On Thu, Jul 31, 2014 at 01:19:50PM -0700, Bruce Richardson wrote:
> > > On Thu, Jul 31, 2014 at 03:01:17PM -0400, Neil Horman wrote:
> > > > On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote:
> > > > >
> > > > > I think a good first step here that I can't see anyone objecting to is
> > > > > to enable the ixgbe driver to use the vector code path for a generic
> > > > > x86_64 build. I've run a quick test here, and changing "_mm_popcnt_u64"
> > > > > to "__builtin_popcountll" [and the include from nmmintrin to tmmintrin]
> > > > > allows a compile for machine type default, and testpmd can still forward
> > > > > packets at a good rate (roughly perf down about 10% vs native compile on
> > > > > SNB).
> > > > > The ACL is a tougher nut to crack, but anyone see any issues with that
> > > > > two-line change to ixgbe_rxtx_vec.c? [Neil, since you started the patch
> > > > > set thread, do you want to submit an official patch here, or would you prefer I
> > > > > do so?]
> > > > >
> > > >
> > > > I'm happy to do so, Though 10% performance degradation vs. using the sse4.2
> > > > instructions in that path seems significant, isn't it? Given that performance
> > > > delta, it seems like it would still be preferable to have a path that used the
> > > > sse4.2 instructions when they're available.  Or am I misreading what you mean
> > > > when you say down 10%
> > > >
> > > > Neil
> > > >
> > > Ok, I did a little bit more testing here. Using the vector pmd compiled
> > > for generic x86_64 and using __builtin_popcountll is approx 35% faster
> > > for packet IO than the existing fast-path functions. It is also 7% (a
> > > bit lower than ~10% as I originally stated) slower than the existing
> > > native-compiled vpmd on a Sandy Bridge platform.
> > >
> > > I then ran an extra test, using EXTRA_CFLAGS='-msse4.2' to turn on the
> > > extra instructions. The ~7% performance drop went to ~3%, so we would
> > > gain a little more with using SSE4.2, but compared to the gain from
> > > having the vector driver at all, it's not that much. [I don't have a
> > > system handy with AVX2 support to see what boosts might come from
> > > compiling with that instruction set enabled.]
> > >
> > > Because of this, I'd take the ~35% speed boost for now, and try and find
> > > what would be the best general way to solve this problem across all
> > > libraries. Also, I think that anyone who needs that extra 4% performance
> > > probably wants the other 3% too, and so will compile up the code from
> > > source using the "native" compilation target. :-)
> > >
> > 
> > 
> > Wait a moment, I'm not entirely sure what you did here.  I understand that you
> > replaced the _mm_popcnt_u64 call in the ixgbe pmd vector receive path with
> > __builtin_popcnt, which is good, but ixgbe also uses the __mm_shuffle_epi8
> > intrinsic which is only available with sse4.2 from what I can see. did you
> > replace those calls with a __builtin_shuffle variant?  Otherwise, how did you
> > get the pmd to build?  I'm asking because this is what I tried in the first pass
> > and Konstantin gave some pretty convicing evidence that this was an unworkable
> > solution:
> > http://dpdk.org/ml/archives/dev/2014-July/004443.html
> > 
> 
> I think that _mm_shuffle_epi8 (PSHUFB) is available starting from SSE3.
> So I presume, there is no need for replacement.
Ah, I see, its just because we're using the nmmintrinsic.h header.  We need to
replace the popcount instruction and change the include header to tmmintrins.h
to avoid the #error from the failed sse4.2 check

Thanks!
Neil

> Konstantin
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-08-01 14:06                         ` Neil Horman
@ 2014-08-01 14:57                           ` Vincent JARDIN
  2014-08-01 15:19                             ` Neil Horman
  0 siblings, 1 reply; 45+ messages in thread
From: Vincent JARDIN @ 2014-08-01 14:57 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

On 01/08/2014 16:06, Neil Horman wrote:
> Thats a multi year effort, and not something I'm prepared to even
> consider undertaking.

Sorry: I am not pushing you, it was just an open comment. I do agree 
that it is a multi year effort to get it down into a wide "agreed" 
community. DPDK community cannot manage it.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-07-31 20:25                   ` Bruce Richardson
@ 2014-08-01 15:06                     ` Neil Horman
  2014-08-01 19:22                       ` Bruce Richardson
  0 siblings, 1 reply; 45+ messages in thread
From: Neil Horman @ 2014-08-01 15:06 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

On Thu, Jul 31, 2014 at 01:25:06PM -0700, Bruce Richardson wrote:
> On Thu, Jul 31, 2014 at 04:10:18PM -0400, Neil Horman wrote:
> > On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote:
> > > Thu, Jul 31, 2014 at 02:10:32PM -0400, Neil Horman wrote:
> > > > On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote:
> > > > > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote:
> > > > > > 2014-07-31 09:13, Neil Horman:
> > > > > > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote:
> > > > > > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote:
> > > > > > > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote:
> > > > > > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote:
> > > > > > > > > > > Hey all-
> > > 
> > > With regards to the general approach for runtime detection of software
> > > functions, I wonder if something like this can be handled by the
> > > packaging system? Is it possible to ship out a set of shared libs
> > > compiled up for different instruction sets, and then at rpm install
> > > time, symlink the appropriate library? This would push the whole issue
> > > of detection of code paths outside of code, work across all our
> > > libraries and ensure each user got the best performance they could get
> > > form a binary?
> > > Has something like this been done before? The building of all the
> > > libraries could be scripted easy enough, just do multiple builds using
> > > different EXTRA_CFLAGS each time, and move and rename the .so's after
> > > each run.
> > > 
> > 
> > Sorry, I missed this in my last reply.
> > 
> > In answer to your question, the short version is that such a thing is roughly
> > possible from a packaging standpoint, but completely unworkable from a
> > distribution standpoint.  We could certainly build the dpdk multiple times and
> > rename all the shared objects to some variant name representative of the
> > optimzations we build in for certain cpu flags, but then we woudl be shipping X
> > versions of the dpdk, and any appilcation (say OVS that made use of the dpdk
> > would need to provide a version linked against each variant to be useful when
> > making a product, and each end user would need to manually select (or run a
> > script to select) which variant is most optimized for the system at hand.  Its
> > just not a reasonable way to package a library.
> 
> Sorry, perhaps I was not clear, having the user have to select the
> appropriate library was not what I was suggesting. Instead, I was
> suggesting that the rpm install "librte_pmd_ixgbe.so.generic",
> "librte_pmd_ixgbe.so.sse42" and "librte_pmd_ixgbe.so.avx". Then the rpm
> post-install script would look at the cpuflags in cpuinfo and then
> symlink librte_pmd_ixgbe.so to the best-match version. That way the user
> only has to link against "librte_pmd_ixgbe.so" and depending on the
> system its run on, the loader will automatically resolve the symbols
> from the appropriate instruction-set specific .so file.
> 

This is an absolute packaging nightmare, it will potentially break all sorts of
corner cases, and support processes.  To cite a few examples:

1) Upgrade support - What if the minimum cpu requirements for dpdk are advanced
at some point in the future?  The above strategy has no way to know that a given
update has more advanced requirements than a previous update, and when the
update is installed, the previously linked library for the old base will
dissappear, leaving broken applications behind.

2) Debugging - Its going to be near impossible to support an application built
with a package put together this way, because you'll never be sure as to which
version of the library was running when the crash occured.  You can figure it
out for certain, but for support/development people to need to remember to
figure this out is going to be a major turn off for them, and the result will be
that they simply won't use the dpdk.  Its Anathema to the expectations of linux
user space.

3) QA - Building multiple versions of a library means needing to QA multiple
versions of a library.  If you have to have 4 builds to support different levels
of optimization, you've created a 4x increase in the amount of testing you need
to do to ensure consistent behavior.  You need to be aware of how many different
builds are available in the single rpm at all times, and find systems on which
to QA which will ensure that all of the builds get tested (as they are in fact,
unique builds).  While you may not hit all code paths in a single build, you
will at least test all the common paths.

The bottom line is that Distribution packaging is all about consistency and
commonality.  If you install something for an arch on multiple systems, its the
same thing on each system, and it works in the same way, all the time.  This
strategy breaks that.  Thats why we do run time checks for things.

Neil

> > 
> > When pacaging software, the only consideration given to code variance at pacakge
> > time is architecture (x86/x86_64/ppc/s390/etc).  If you install a package for
> > your a given architecture, its expected to run on that architecture.  Optional
> > code paths are just that, optional, and executed based on run time tests.  Its a
> > requirement that we build for the lowest common demoniator system that is
> > supported, and enable accelerative code paths optionally at run time when the
> > cpu indicates support for them.
> > 
> > Neil
> > 
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-08-01 14:57                           ` Vincent JARDIN
@ 2014-08-01 15:19                             ` Neil Horman
  0 siblings, 0 replies; 45+ messages in thread
From: Neil Horman @ 2014-08-01 15:19 UTC (permalink / raw)
  To: Vincent JARDIN; +Cc: dev

On Fri, Aug 01, 2014 at 04:57:56PM +0200, Vincent JARDIN wrote:
> On 01/08/2014 16:06, Neil Horman wrote:
> >Thats a multi year effort, and not something I'm prepared to even
> >consider undertaking.
> 
> Sorry: I am not pushing you, it was just an open comment. I do agree that it
> is a multi year effort to get it down into a wide "agreed" community. DPDK
> community cannot manage it.
> 

I understand, but I still don't think it makes sense to do.  The fat elf format
was origionally intended to allow transparent movement bewteen major
architectures.  While it certainly could be used to migrate between a smaller
granularity within the same arch family, it seems like a lousy tradeoff to me,
especially as the fanout for variants increases.  Right now I think there are at
least 4 variants within the dpdk (sse3, sse4.2, avx, and avx512).  Thats a 4x
increase in binary size.  While that might make sense for one highly optimized
application, it doesn't seem like it would make sense as a general purpose
utility, as the disk space / speedup tradeoff is not super great.

Neil

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-08-01 15:06                     ` Neil Horman
@ 2014-08-01 19:22                       ` Bruce Richardson
  2014-08-01 20:43                         ` Neil Horman
  0 siblings, 1 reply; 45+ messages in thread
From: Bruce Richardson @ 2014-08-01 19:22 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

On Fri, Aug 01, 2014 at 11:06:29AM -0400, Neil Horman wrote:
> On Thu, Jul 31, 2014 at 01:25:06PM -0700, Bruce Richardson wrote:
> > On Thu, Jul 31, 2014 at 04:10:18PM -0400, Neil Horman wrote:
> > > On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote:
> > > > Thu, Jul 31, 2014 at 02:10:32PM -0400, Neil Horman wrote:
> > > > > On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote:
> > > > > > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote:
> > > > > > > 2014-07-31 09:13, Neil Horman:
> > > > > > > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote:
> > > > > > > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote:
> > > > > > > > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote:
> > > > > > > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote:
> > > > > > > > > > > > Hey all-
> > > > 
> > > > With regards to the general approach for runtime detection of software
> > > > functions, I wonder if something like this can be handled by the
> > > > packaging system? Is it possible to ship out a set of shared libs
> > > > compiled up for different instruction sets, and then at rpm install
> > > > time, symlink the appropriate library? This would push the whole issue
> > > > of detection of code paths outside of code, work across all our
> > > > libraries and ensure each user got the best performance they could get
> > > > form a binary?
> > > > Has something like this been done before? The building of all the
> > > > libraries could be scripted easy enough, just do multiple builds using
> > > > different EXTRA_CFLAGS each time, and move and rename the .so's after
> > > > each run.
> > > > 
> > > 
> > > Sorry, I missed this in my last reply.
> > > 
> > > In answer to your question, the short version is that such a thing is roughly
> > > possible from a packaging standpoint, but completely unworkable from a
> > > distribution standpoint.  We could certainly build the dpdk multiple times and
> > > rename all the shared objects to some variant name representative of the
> > > optimzations we build in for certain cpu flags, but then we woudl be shipping X
> > > versions of the dpdk, and any appilcation (say OVS that made use of the dpdk
> > > would need to provide a version linked against each variant to be useful when
> > > making a product, and each end user would need to manually select (or run a
> > > script to select) which variant is most optimized for the system at hand.  Its
> > > just not a reasonable way to package a library.
> > 
> > Sorry, perhaps I was not clear, having the user have to select the
> > appropriate library was not what I was suggesting. Instead, I was
> > suggesting that the rpm install "librte_pmd_ixgbe.so.generic",
> > "librte_pmd_ixgbe.so.sse42" and "librte_pmd_ixgbe.so.avx". Then the rpm
> > post-install script would look at the cpuflags in cpuinfo and then
> > symlink librte_pmd_ixgbe.so to the best-match version. That way the user
> > only has to link against "librte_pmd_ixgbe.so" and depending on the
> > system its run on, the loader will automatically resolve the symbols
> > from the appropriate instruction-set specific .so file.
> > 
> 
> This is an absolute packaging nightmare, it will potentially break all sorts of
> corner cases, and support processes.  To cite a few examples:
> 
> 1) Upgrade support - What if the minimum cpu requirements for dpdk are advanced
> at some point in the future?  The above strategy has no way to know that a given
> update has more advanced requirements than a previous update, and when the
> update is installed, the previously linked library for the old base will
> dissappear, leaving broken applications behind.

Firstly, I didn't know we could actually specify minimum cpu
requirements for packaging, that is something that could be useful :-)
Secondly, what is the normal case for handling something like this,
where an upgrade has enhanced requirements compared to the previous
version? Presumably you either need to prevent the upgrade from
happening or else accept a broken app. Can the same mechanism not also
be used to prevent upgrades using a multi-lib scheme?

> 
> 2) Debugging - Its going to be near impossible to support an application built
> with a package put together this way, because you'll never be sure as to which
> version of the library was running when the crash occured.  You can figure it
> out for certain, but for support/development people to need to remember to
> figure this out is going to be a major turn off for them, and the result will be
> that they simply won't use the dpdk.  Its Anathema to the expectations of linux
> user space.

Sorry, I just don't see this as being any harder to support than
multiple code paths for the same functionality. In fact, it will surely make
debugging easier, since you only have the one code path, just compiled
up in different ways.

> 
> 3) QA - Building multiple versions of a library means needing to QA multiple
> versions of a library.  If you have to have 4 builds to support different levels
> of optimization, you've created a 4x increase in the amount of testing you need
> to do to ensure consistent behavior.  You need to be aware of how many different
> builds are available in the single rpm at all times, and find systems on which
> to QA which will ensure that all of the builds get tested (as they are in fact,
> unique builds).  While you may not hit all code paths in a single build, you
> will at least test all the common paths.

Again, the exact same QA conditions will also apply to an approach using
multiple code paths bundled into the same library. Given a choice
between one code path with multiple compiles, vs multiple code paths
each compiled only once, the multiple code paths option leaves far
greater scope for bugs, and when bugs do occur means that you always
have to find out what specific hardware it was being run on. Using the
exact same code multiply compiled, the vast, vast majority of bugs are
going to occur across all platforms and systems so you should rarely
need to ask what the specific platform being used is.

> 
> The bottom line is that Distribution packaging is all about consistency and
> commonality.  If you install something for an arch on multiple systems, its the
> same thing on each system, and it works in the same way, all the time.  This
> strategy breaks that.  Thats why we do run time checks for things.

If you want to have the best tuned code running for each instruction
set, then commonality and consistency goes out the window anyway,
because two different machines calling the same function are going to
execute different sets of instructions. The decision then becomes:
a) whether you need multiple sets of instructions - if no then you pay
with lack of performance
b) how you get those multiple sets of instructions
c) how you validate those multiple sets of instructions.

As is clear by now :-), my preference by far is to have multiple sets of
instructions come from a single code base, as less code means less
maintenance, and above all, fewer bugs. If that can't be done, then we
need to look carefully at each code path being added and do a
cost-benefit analysis on it.

Regards,
/Bruce

> 
> Neil
> 
> > > 
> > > When pacaging software, the only consideration given to code variance at pacakge
> > > time is architecture (x86/x86_64/ppc/s390/etc).  If you install a package for
> > > your a given architecture, its expected to run on that architecture.  Optional
> > > code paths are just that, optional, and executed based on run time tests.  Its a
> > > requirement that we build for the lowest common demoniator system that is
> > > supported, and enable accelerative code paths optionally at run time when the
> > > cpu indicates support for them.
> > > 
> > > Neil
> > > 
> > 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-08-01 19:22                       ` Bruce Richardson
@ 2014-08-01 20:43                         ` Neil Horman
  2014-08-01 21:08                           ` Bruce Richardson
  0 siblings, 1 reply; 45+ messages in thread
From: Neil Horman @ 2014-08-01 20:43 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

On Fri, Aug 01, 2014 at 12:22:22PM -0700, Bruce Richardson wrote:
> On Fri, Aug 01, 2014 at 11:06:29AM -0400, Neil Horman wrote:
> > On Thu, Jul 31, 2014 at 01:25:06PM -0700, Bruce Richardson wrote:
> > > On Thu, Jul 31, 2014 at 04:10:18PM -0400, Neil Horman wrote:
> > > > On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote:
> > > > > Thu, Jul 31, 2014 at 02:10:32PM -0400, Neil Horman wrote:
> > > > > > On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote:
> > > > > > > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote:
> > > > > > > > 2014-07-31 09:13, Neil Horman:
> > > > > > > > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote:
> > > > > > > > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote:
> > > > > > > > > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote:
> > > > > > > > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote:
> > > > > > > > > > > > > Hey all-
> > > > > 
> > > > > With regards to the general approach for runtime detection of software
> > > > > functions, I wonder if something like this can be handled by the
> > > > > packaging system? Is it possible to ship out a set of shared libs
> > > > > compiled up for different instruction sets, and then at rpm install
> > > > > time, symlink the appropriate library? This would push the whole issue
> > > > > of detection of code paths outside of code, work across all our
> > > > > libraries and ensure each user got the best performance they could get
> > > > > form a binary?
> > > > > Has something like this been done before? The building of all the
> > > > > libraries could be scripted easy enough, just do multiple builds using
> > > > > different EXTRA_CFLAGS each time, and move and rename the .so's after
> > > > > each run.
> > > > > 
> > > > 
> > > > Sorry, I missed this in my last reply.
> > > > 
> > > > In answer to your question, the short version is that such a thing is roughly
> > > > possible from a packaging standpoint, but completely unworkable from a
> > > > distribution standpoint.  We could certainly build the dpdk multiple times and
> > > > rename all the shared objects to some variant name representative of the
> > > > optimzations we build in for certain cpu flags, but then we woudl be shipping X
> > > > versions of the dpdk, and any appilcation (say OVS that made use of the dpdk
> > > > would need to provide a version linked against each variant to be useful when
> > > > making a product, and each end user would need to manually select (or run a
> > > > script to select) which variant is most optimized for the system at hand.  Its
> > > > just not a reasonable way to package a library.
> > > 
> > > Sorry, perhaps I was not clear, having the user have to select the
> > > appropriate library was not what I was suggesting. Instead, I was
> > > suggesting that the rpm install "librte_pmd_ixgbe.so.generic",
> > > "librte_pmd_ixgbe.so.sse42" and "librte_pmd_ixgbe.so.avx". Then the rpm
> > > post-install script would look at the cpuflags in cpuinfo and then
> > > symlink librte_pmd_ixgbe.so to the best-match version. That way the user
> > > only has to link against "librte_pmd_ixgbe.so" and depending on the
> > > system its run on, the loader will automatically resolve the symbols
> > > from the appropriate instruction-set specific .so file.
> > > 
> > 
> > This is an absolute packaging nightmare, it will potentially break all sorts of
> > corner cases, and support processes.  To cite a few examples:
> > 
> > 1) Upgrade support - What if the minimum cpu requirements for dpdk are advanced
> > at some point in the future?  The above strategy has no way to know that a given
> > update has more advanced requirements than a previous update, and when the
> > update is installed, the previously linked library for the old base will
> > dissappear, leaving broken applications behind.
> 
> Firstly, I didn't know we could actually specify minimum cpu
> requirements for packaging, that is something that could be useful :-)
You misread my comment :).  I didn't say we could specify minimum cpu
requirements at packaging (you can't, beyond general arch), I said "what if the
dpdk's cpu requriements were raised?".  Completely different thing.  Currently
teh default, lowest common denominator system that dpdk appears to build for is
core2 (as listed in the old default config).  What if at some point you raise
those requirements and decide that SSE4.2 really is required to achieve maximum
performance. Using the above strategy any system that doesn't meet the new
requirements will silently break on such an update.  Thats not acceptable.

> Secondly, what is the normal case for handling something like this,
> where an upgrade has enhanced requirements compared to the previous
> version? Presumably you either need to prevent the upgrade from
> happening or else accept a broken app. Can the same mechanism not also
> be used to prevent upgrades using a multi-lib scheme?
> 
The case for handling something like this is: Don't do it.  When you package
something for Fedora (or any distro), you provide an implicit guaratee that it
will run (or fail gracefully) on all supported systems.  You can add support for
systems as you go forward, but you can't deprecate support for systems within a
major release.  That is to say, if something runs on F20 now, its got to keep
running on F20 for the lifetime of F20.  If it stops running, thats a
regression, the user opens a bug and you fix it.

The DPDK is way off the reservation in regards to this.  Application packages,
as a general rule don't build with specific cpu features in mind, because
performance, while important isn't on the same scale as what you're trying to do
in the dpdk.  A process getting scheduled off the cpu while we handle an
interrupt wipes out any speedup gains made by any micro-optimizations, so theres
no point in doing so.  The DPDK is different, I understand that, but the
drawback is that it (the DPDK) needs to make optimizations that really aren't
considered particularly important to the rest of user space.  I'm trying to
opportunistically make the DPDK as fast as possible, but I need to do it in a
single binary, that works on a lowest common demoninator system.

> > 
> > 2) Debugging - Its going to be near impossible to support an application built
> > with a package put together this way, because you'll never be sure as to which
> > version of the library was running when the crash occured.  You can figure it
> > out for certain, but for support/development people to need to remember to
> > figure this out is going to be a major turn off for them, and the result will be
> > that they simply won't use the dpdk.  Its Anathema to the expectations of linux
> > user space.
> 
> Sorry, I just don't see this as being any harder to support than
> multiple code paths for the same functionality. In fact, it will surely make
> debugging easier, since you only have the one code path, just compiled
> up in different ways.
> 

Well, then by all means, become a Fedora packager, an you can take over the DPDK
maintenece there :).  Until then, you'll just have to trust me.  If you have
multiple optional code paths (especialy if they're limited to isolated features) 
its manageable.  But regardless of how you look at it, building the same
source multiple times with different cpu support means completely different
binaries.  The assembly and optimization are just plain different.  They may be
close, but they're not the same, and they need to be QA-ed independently.  With
a single build and optional code paths, all the common code is executed no
matter what system you're running on, and its always the same. Multiple builds
with different instruction support means that code that is identical at a source
level may well be significantly different at a binary level, and thats not
something I can sanely manage in a general purpose environment.

> 
> > 3) QA - Building multiple versions of a library means needing to QA multiple
> > versions of a library.  If you have to have 4 builds to support different levels
> > of optimization, you've created a 4x increase in the amount of testing you need
> > to do to ensure consistent behavior.  You need to be aware of how many different
> > builds are available in the single rpm at all times, and find systems on which
> > to QA which will ensure that all of the builds get tested (as they are in fact,
> > unique builds).  While you may not hit all code paths in a single build, you
> > will at least test all the common paths.
> 
> Again, the exact same QA conditions will also apply to an approach using
> multiple code paths bundled into the same library. Given a choice
> between one code path with multiple compiles, vs multiple code paths
> each compiled only once, the multiple code paths option leaves far
> greater scope for bugs, and when bugs do occur means that you always
> have to find out what specific hardware it was being run on. Using the
> exact same code multiply compiled, the vast, vast majority of bugs are
> going to occur across all platforms and systems so you should rarely
> need to ask what the specific platform being used is.
> 

No, they won't (see above).  Enabling insructions will enable the compiler to
emit and optimize common paths differently, so identical source code will lead
to different binary code.  I need to have a single binary so that I know what
I'm working with when someone opens a bug.  I don't have that using a multiple
binary approach.  At least with multiple runtime paths (especially/specifically
with the run time paths we've been discussing, the igbe rx vector path and the
acl library, which are isolated), I know that, if I get a bug report and the
backtrace ends in either location, I know I'm sepecifically dealing with that
code.  With your multiple binary approach, if I get a crash in, say
rte_eal_init, I need to figure out if this crash happened in the sse3 compiled
binary, the ss4.2 compiled binary, the avx binary, the avx512 binary, or the
core2 binary.  You can say thats easy, but its easy to say that when you're not
the one that has to support it.

> > 
> > The bottom line is that Distribution packaging is all about consistency and
> > commonality.  If you install something for an arch on multiple systems, its the
> > same thing on each system, and it works in the same way, all the time.  This
> > strategy breaks that.  Thats why we do run time checks for things.
> 
> If you want to have the best tuned code running for each instruction
> set, then commonality and consistency goes out the window anyway,
So, this is perhaps where communication is breaking down.  I don't want to have the
best tuned code running for each instruction set.  What I want is for the dpdk
to run on a lowest common denominator platform, and be able to opportunistically
take advantage of accelerated code paths that require advanced cpu featrues.


Lets take the ixgbe code as an example. Note I didn't add any code paths there,
at all (in fact I didn't add any anywhere).  The ixgbe rx_burst method gets set
according to compile time configuration.  You can pick the bulk_alloc rx method,
or the vectorized rx method at compile time (or some others I think, but thats
not relevant). As it happened the vectorized rx path option had an implicit
dependency on SSE4.2.  Instead of requiring that all cpus that run the dpdk have
SSE4.2, I instead chose to move that compile time decision to a run tmie
decision, by building only the vectorized path with sse4.2 and only using it if
we see that the cpu supports sse4.2 at run time.  No new paths created, no new
support requirements, you're still supporting the same options upstream, the
only difference is I was able to include them both in a single binary.  Thats
better for our end users because the single binary still works everywhere.
Thats better for our QA group because For whatever set of tests they perform,
they only need an sse4.2 enabled system to test the one isolated path for that
vector rx code.  The rest of their tests can be conducted once, on any system,
because the binary is exactly the same.  If we compile multiple binaries,
testing on one system doesn't mean we've tested all the code.

> because two different machines calling the same function are going to
> execute different sets of instructions. The decision then becomes:
But thats not at all what I wanted.  I want two different machines calling the
same function to execute the same instructions 99.9999% of the time.  The only
time I want to diverge from that is in isolated paths where we can take
advantage of a feature that we otherwise could not (i.e. the ixgbe and acl
code).  I look at it like the alternatives code in linux.  There are these
isolated areas where you have limited bits of code that at run time are
re-written to use available cpu features.  99.9% of the code is identical, but
in these little spots its ok to diverge from simmilarity because they'er isolated,
easily identifiable

> a) whether you need multiple sets of instructions - if no then you pay
> with lack of performance
> b) how you get those multiple sets of instructions
> c) how you validate those multiple sets of instructions.
> 
> As is clear by now :-), my preference by far is to have multiple sets of
> instructions come from a single code base, as less code means less
> maintenance, and above all, fewer bugs. If that can't be done, then we
> need to look carefully at each code path being added and do a
> cost-benefit analysis on it.
> 

Yes, its quite clear :), I think its equally clear that I need a single binary,
and would like to opportunisitcally enhance it where possible without losing the
fact that its a single binary.

I suppose its all somewhat moot at this point though,  The reduction to sse3 for
ixgbe seems agreeable to everyone, and it lets me preserve single binary builds
there.  I'm currently working on the ACL library, as you noted thats a tougher
nut to crack.  I think I'll have it done early next week (though i'm sure my
translation of the instruction set reference to C will need some through testing
:)).  I'll post it when its ready.

> Regards,
> /Bruce
> 
> > 
> > Neil
> > 
> > > > 
> > > > When pacaging software, the only consideration given to code variance at pacakge
> > > > time is architecture (x86/x86_64/ppc/s390/etc).  If you install a package for
> > > > your a given architecture, its expected to run on that architecture.  Optional
> > > > code paths are just that, optional, and executed based on run time tests.  Its a
> > > > requirement that we build for the lowest common demoniator system that is
> > > > supported, and enable accelerative code paths optionally at run time when the
> > > > cpu indicates support for them.
> > > > 
> > > > Neil
> > > > 
> > > 
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-08-01 20:43                         ` Neil Horman
@ 2014-08-01 21:08                           ` Bruce Richardson
  2014-08-02 12:56                             ` Neil Horman
  0 siblings, 1 reply; 45+ messages in thread
From: Bruce Richardson @ 2014-08-01 21:08 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

On Fri, Aug 01, 2014 at 04:43:52PM -0400, Neil Horman wrote:
> On Fri, Aug 01, 2014 at 12:22:22PM -0700, Bruce Richardson wrote:
> > On Fri, Aug 01, 2014 at 11:06:29AM -0400, Neil Horman wrote:
> > > On Thu, Jul 31, 2014 at 01:25:06PM -0700, Bruce Richardson wrote:
> > > > On Thu, Jul 31, 2014 at 04:10:18PM -0400, Neil Horman wrote:
> > > > > On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote:
> > > > > > Thu, Jul 31, 2014 at 02:10:32PM -0400, Neil Horman wrote:
> > > > > > > On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote:
> > > > > > > > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote:
> > > > > > > > > 2014-07-31 09:13, Neil Horman:
> > > > > > > > > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote:
> > > > > > > > > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote:
> > > > > > > > > > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote:
> > > > > > > > > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote:
> > > > > > > > > > > > > > Hey all-
> > > > > > 
> > > > > > With regards to the general approach for runtime detection of software
> > > > > > functions, I wonder if something like this can be handled by the
> > > > > > packaging system? Is it possible to ship out a set of shared libs
> > > > > > compiled up for different instruction sets, and then at rpm install
> > > > > > time, symlink the appropriate library? This would push the whole issue
> > > > > > of detection of code paths outside of code, work across all our
> > > > > > libraries and ensure each user got the best performance they could get
> > > > > > form a binary?
> > > > > > Has something like this been done before? The building of all the
> > > > > > libraries could be scripted easy enough, just do multiple builds using
> > > > > > different EXTRA_CFLAGS each time, and move and rename the .so's after
> > > > > > each run.
> > > > > > 
> > > > > 
> > > > > Sorry, I missed this in my last reply.
> > > > > 
> > > > > In answer to your question, the short version is that such a thing is roughly
> > > > > possible from a packaging standpoint, but completely unworkable from a
> > > > > distribution standpoint.  We could certainly build the dpdk multiple times and
> > > > > rename all the shared objects to some variant name representative of the
> > > > > optimzations we build in for certain cpu flags, but then we woudl be shipping X
> > > > > versions of the dpdk, and any appilcation (say OVS that made use of the dpdk
> > > > > would need to provide a version linked against each variant to be useful when
> > > > > making a product, and each end user would need to manually select (or run a
> > > > > script to select) which variant is most optimized for the system at hand.  Its
> > > > > just not a reasonable way to package a library.
> > > > 
> > > > Sorry, perhaps I was not clear, having the user have to select the
> > > > appropriate library was not what I was suggesting. Instead, I was
> > > > suggesting that the rpm install "librte_pmd_ixgbe.so.generic",
> > > > "librte_pmd_ixgbe.so.sse42" and "librte_pmd_ixgbe.so.avx". Then the rpm
> > > > post-install script would look at the cpuflags in cpuinfo and then
> > > > symlink librte_pmd_ixgbe.so to the best-match version. That way the user
> > > > only has to link against "librte_pmd_ixgbe.so" and depending on the
> > > > system its run on, the loader will automatically resolve the symbols
> > > > from the appropriate instruction-set specific .so file.
> > > > 
> > > 
> > > This is an absolute packaging nightmare, it will potentially break all sorts of
> > > corner cases, and support processes.  To cite a few examples:
> > > 
> > > 1) Upgrade support - What if the minimum cpu requirements for dpdk are advanced
> > > at some point in the future?  The above strategy has no way to know that a given
> > > update has more advanced requirements than a previous update, and when the
> > > update is installed, the previously linked library for the old base will
> > > dissappear, leaving broken applications behind.
> > 
> > Firstly, I didn't know we could actually specify minimum cpu
> > requirements for packaging, that is something that could be useful :-)
> You misread my comment :).  I didn't say we could specify minimum cpu
> requirements at packaging (you can't, beyond general arch), I said "what if the
> dpdk's cpu requriements were raised?".  Completely different thing.  Currently
> teh default, lowest common denominator system that dpdk appears to build for is
> core2 (as listed in the old default config).  What if at some point you raise
> those requirements and decide that SSE4.2 really is required to achieve maximum
> performance. Using the above strategy any system that doesn't meet the new
> requirements will silently break on such an update.  Thats not acceptable.

Core2 was the first Intel set of chips that had x86_64 instruction set
before (Core microarchitecture), so that's why it's listed as the
minimum - it's the same thing as generic x86_64 support. :-)

> 
> > Secondly, what is the normal case for handling something like this,
> > where an upgrade has enhanced requirements compared to the previous
> > version? Presumably you either need to prevent the upgrade from
> > happening or else accept a broken app. Can the same mechanism not also
> > be used to prevent upgrades using a multi-lib scheme?
> > 
> The case for handling something like this is: Don't do it.  When you package
> something for Fedora (or any distro), you provide an implicit guaratee that it
> will run (or fail gracefully) on all supported systems.  You can add support for
> systems as you go forward, but you can't deprecate support for systems within a
> major release.  That is to say, if something runs on F20 now, its got to keep
> running on F20 for the lifetime of F20.  If it stops running, thats a
> regression, the user opens a bug and you fix it.
> 
> The DPDK is way off the reservation in regards to this.  Application packages,
> as a general rule don't build with specific cpu features in mind, because
> performance, while important isn't on the same scale as what you're trying to do
> in the dpdk.  A process getting scheduled off the cpu while we handle an
> interrupt wipes out any speedup gains made by any micro-optimizations, so theres
> no point in doing so.  The DPDK is different, I understand that, but the
> drawback is that it (the DPDK) needs to make optimizations that really aren't
> considered particularly important to the rest of user space.  I'm trying to
> opportunistically make the DPDK as fast as possible, but I need to do it in a
> single binary, that works on a lowest common demoninator system.
> 
> > > 
> > > 2) Debugging - Its going to be near impossible to support an application built
> > > with a package put together this way, because you'll never be sure as to which
> > > version of the library was running when the crash occured.  You can figure it
> > > out for certain, but for support/development people to need to remember to
> > > figure this out is going to be a major turn off for them, and the result will be
> > > that they simply won't use the dpdk.  Its Anathema to the expectations of linux
> > > user space.
> > 
> > Sorry, I just don't see this as being any harder to support than
> > multiple code paths for the same functionality. In fact, it will surely make
> > debugging easier, since you only have the one code path, just compiled
> > up in different ways.
> > 
> 
> Well, then by all means, become a Fedora packager, an you can take over the DPDK
> maintenece there :).  Until then, you'll just have to trust me.  If you have
> multiple optional code paths (especialy if they're limited to isolated features) 
> its manageable.  But regardless of how you look at it, building the same
> source multiple times with different cpu support means completely different
> binaries.  The assembly and optimization are just plain different.  They may be
> close, but they're not the same, and they need to be QA-ed independently.  With
> a single build and optional code paths, all the common code is executed no
> matter what system you're running on, and its always the same. Multiple builds
> with different instruction support means that code that is identical at a source
> level may well be significantly different at a binary level, and thats not
> something I can sanely manage in a general purpose environment.
> 
> > 
> > > 3) QA - Building multiple versions of a library means needing to QA multiple
> > > versions of a library.  If you have to have 4 builds to support different levels
> > > of optimization, you've created a 4x increase in the amount of testing you need
> > > to do to ensure consistent behavior.  You need to be aware of how many different
> > > builds are available in the single rpm at all times, and find systems on which
> > > to QA which will ensure that all of the builds get tested (as they are in fact,
> > > unique builds).  While you may not hit all code paths in a single build, you
> > > will at least test all the common paths.
> > 
> > Again, the exact same QA conditions will also apply to an approach using
> > multiple code paths bundled into the same library. Given a choice
> > between one code path with multiple compiles, vs multiple code paths
> > each compiled only once, the multiple code paths option leaves far
> > greater scope for bugs, and when bugs do occur means that you always
> > have to find out what specific hardware it was being run on. Using the
> > exact same code multiply compiled, the vast, vast majority of bugs are
> > going to occur across all platforms and systems so you should rarely
> > need to ask what the specific platform being used is.
> > 
> 
> No, they won't (see above).  Enabling insructions will enable the compiler to
> emit and optimize common paths differently, so identical source code will lead
> to different binary code.  I need to have a single binary so that I know what
> I'm working with when someone opens a bug.  I don't have that using a multiple
> binary approach.  At least with multiple runtime paths (especially/specifically
> with the run time paths we've been discussing, the igbe rx vector path and the
> acl library, which are isolated), I know that, if I get a bug report and the
> backtrace ends in either location, I know I'm sepecifically dealing with that
> code.  With your multiple binary approach, if I get a crash in, say
> rte_eal_init, I need to figure out if this crash happened in the sse3 compiled
> binary, the ss4.2 compiled binary, the avx binary, the avx512 binary, or the
> core2 binary.  You can say thats easy, but its easy to say that when you're not
> the one that has to support it.
> 
> > > 
> > > The bottom line is that Distribution packaging is all about consistency and
> > > commonality.  If you install something for an arch on multiple systems, its the
> > > same thing on each system, and it works in the same way, all the time.  This
> > > strategy breaks that.  Thats why we do run time checks for things.
> > 
> > If you want to have the best tuned code running for each instruction
> > set, then commonality and consistency goes out the window anyway,
> So, this is perhaps where communication is breaking down.  I don't want to have the
> best tuned code running for each instruction set.  What I want is for the dpdk
> to run on a lowest common denominator platform, and be able to opportunistically
> take advantage of accelerated code paths that require advanced cpu featrues.
> 
> 
> Lets take the ixgbe code as an example. Note I didn't add any code paths there,
> at all (in fact I didn't add any anywhere).  The ixgbe rx_burst method gets set
> according to compile time configuration.  You can pick the bulk_alloc rx method,
> or the vectorized rx method at compile time (or some others I think, but thats
> not relevant). As it happened the vectorized rx path option had an implicit
> dependency on SSE4.2.  Instead of requiring that all cpus that run the dpdk have
> SSE4.2, I instead chose to move that compile time decision to a run tmie
> decision, by building only the vectorized path with sse4.2 and only using it if
> we see that the cpu supports sse4.2 at run time.  No new paths created, no new
> support requirements, you're still supporting the same options upstream, the
> only difference is I was able to include them both in a single binary.  Thats
> better for our end users because the single binary still works everywhere.
> Thats better for our QA group because For whatever set of tests they perform,
> they only need an sse4.2 enabled system to test the one isolated path for that
> vector rx code.  The rest of their tests can be conducted once, on any system,
> because the binary is exactly the same.  If we compile multiple binaries,
> testing on one system doesn't mean we've tested all the code.
> 
> > because two different machines calling the same function are going to
> > execute different sets of instructions. The decision then becomes:
> But thats not at all what I wanted.  I want two different machines calling the
> same function to execute the same instructions 99.9999% of the time.  The only
> time I want to diverge from that is in isolated paths where we can take
> advantage of a feature that we otherwise could not (i.e. the ixgbe and acl
> code).  I look at it like the alternatives code in linux.  There are these
> isolated areas where you have limited bits of code that at run time are
> re-written to use available cpu features.  99.9% of the code is identical, but
> in these little spots its ok to diverge from simmilarity because they'er isolated,
> easily identifiable
> 
> > a) whether you need multiple sets of instructions - if no then you pay
> > with lack of performance
> > b) how you get those multiple sets of instructions
> > c) how you validate those multiple sets of instructions.
> > 
> > As is clear by now :-), my preference by far is to have multiple sets of
> > instructions come from a single code base, as less code means less
> > maintenance, and above all, fewer bugs. If that can't be done, then we
> > need to look carefully at each code path being added and do a
> > cost-benefit analysis on it.
> > 
> 
> Yes, its quite clear :), I think its equally clear that I need a single binary,
> and would like to opportunisitcally enhance it where possible without losing the
> fact that its a single binary.
> 
> I suppose its all somewhat moot at this point though,  The reduction to sse3 for
> ixgbe seems agreeable to everyone, and it lets me preserve single binary builds
> there.  I'm currently working on the ACL library, as you noted thats a tougher
> nut to crack.  I think I'll have it done early next week (though i'm sure my
> translation of the instruction set reference to C will need some through testing
> :)).  I'll post it when its ready.
> 

Agreed, lets get everything working to a common baseline anyway. In
terms of the number of RX and TX functions, you mentioned, I'd hope in future 
we could cut the number of them down a bit as we make the vector versions more
generally applicable, but that's a whole discussion for another day.

/Bruce

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features
  2014-08-01 21:08                           ` Bruce Richardson
@ 2014-08-02 12:56                             ` Neil Horman
  0 siblings, 0 replies; 45+ messages in thread
From: Neil Horman @ 2014-08-02 12:56 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

On Fri, Aug 01, 2014 at 02:08:22PM -0700, Bruce Richardson wrote:
> On Fri, Aug 01, 2014 at 04:43:52PM -0400, Neil Horman wrote:
> > On Fri, Aug 01, 2014 at 12:22:22PM -0700, Bruce Richardson wrote:
> > > On Fri, Aug 01, 2014 at 11:06:29AM -0400, Neil Horman wrote:
> > > > On Thu, Jul 31, 2014 at 01:25:06PM -0700, Bruce Richardson wrote:
> > > > > On Thu, Jul 31, 2014 at 04:10:18PM -0400, Neil Horman wrote:
> > > > > > On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote:
> > > > > > > Thu, Jul 31, 2014 at 02:10:32PM -0400, Neil Horman wrote:
> > > > > > > > On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote:
> > > > > > > > > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote:
> > > > > > > > > > 2014-07-31 09:13, Neil Horman:
> > > > > > > > > > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote:
> > > > > > > > > > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote:
> > > > > > > > > > > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote:
> > > > > > > > > > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote:
> > > > > > > > > > > > > > > Hey all-
> > > > > > > 
> > > > > > > With regards to the general approach for runtime detection of software
> > > > > > > functions, I wonder if something like this can be handled by the
> > > > > > > packaging system? Is it possible to ship out a set of shared libs
> > > > > > > compiled up for different instruction sets, and then at rpm install
> > > > > > > time, symlink the appropriate library? This would push the whole issue
> > > > > > > of detection of code paths outside of code, work across all our
> > > > > > > libraries and ensure each user got the best performance they could get
> > > > > > > form a binary?
> > > > > > > Has something like this been done before? The building of all the
> > > > > > > libraries could be scripted easy enough, just do multiple builds using
> > > > > > > different EXTRA_CFLAGS each time, and move and rename the .so's after
> > > > > > > each run.
> > > > > > > 
> > > > > > 
> > > > > > Sorry, I missed this in my last reply.
> > > > > > 
> > > > > > In answer to your question, the short version is that such a thing is roughly
> > > > > > possible from a packaging standpoint, but completely unworkable from a
> > > > > > distribution standpoint.  We could certainly build the dpdk multiple times and
> > > > > > rename all the shared objects to some variant name representative of the
> > > > > > optimzations we build in for certain cpu flags, but then we woudl be shipping X
> > > > > > versions of the dpdk, and any appilcation (say OVS that made use of the dpdk
> > > > > > would need to provide a version linked against each variant to be useful when
> > > > > > making a product, and each end user would need to manually select (or run a
> > > > > > script to select) which variant is most optimized for the system at hand.  Its
> > > > > > just not a reasonable way to package a library.
> > > > > 
> > > > > Sorry, perhaps I was not clear, having the user have to select the
> > > > > appropriate library was not what I was suggesting. Instead, I was
> > > > > suggesting that the rpm install "librte_pmd_ixgbe.so.generic",
> > > > > "librte_pmd_ixgbe.so.sse42" and "librte_pmd_ixgbe.so.avx". Then the rpm
> > > > > post-install script would look at the cpuflags in cpuinfo and then
> > > > > symlink librte_pmd_ixgbe.so to the best-match version. That way the user
> > > > > only has to link against "librte_pmd_ixgbe.so" and depending on the
> > > > > system its run on, the loader will automatically resolve the symbols
> > > > > from the appropriate instruction-set specific .so file.
> > > > > 
> > > > 
> > > > This is an absolute packaging nightmare, it will potentially break all sorts of
> > > > corner cases, and support processes.  To cite a few examples:
> > > > 
> > > > 1) Upgrade support - What if the minimum cpu requirements for dpdk are advanced
> > > > at some point in the future?  The above strategy has no way to know that a given
> > > > update has more advanced requirements than a previous update, and when the
> > > > update is installed, the previously linked library for the old base will
> > > > dissappear, leaving broken applications behind.
> > > 
> > > Firstly, I didn't know we could actually specify minimum cpu
> > > requirements for packaging, that is something that could be useful :-)
> > You misread my comment :).  I didn't say we could specify minimum cpu
> > requirements at packaging (you can't, beyond general arch), I said "what if the
> > dpdk's cpu requriements were raised?".  Completely different thing.  Currently
> > teh default, lowest common denominator system that dpdk appears to build for is
> > core2 (as listed in the old default config).  What if at some point you raise
> > those requirements and decide that SSE4.2 really is required to achieve maximum
> > performance. Using the above strategy any system that doesn't meet the new
> > requirements will silently break on such an update.  Thats not acceptable.
> 
> Core2 was the first Intel set of chips that had x86_64 instruction set
> before (Core microarchitecture), so that's why it's listed as the
> minimum - it's the same thing as generic x86_64 support. :-)
> 
Yes, and that matches with the base cpu that Fedora supports for x86_64.

> > 
> > > Secondly, what is the normal case for handling something like this,
> > > where an upgrade has enhanced requirements compared to the previous
> > > version? Presumably you either need to prevent the upgrade from
> > > happening or else accept a broken app. Can the same mechanism not also
> > > be used to prevent upgrades using a multi-lib scheme?
> > > 
> > The case for handling something like this is: Don't do it.  When you package
> > something for Fedora (or any distro), you provide an implicit guaratee that it
> > will run (or fail gracefully) on all supported systems.  You can add support for
> > systems as you go forward, but you can't deprecate support for systems within a
> > major release.  That is to say, if something runs on F20 now, its got to keep
> > running on F20 for the lifetime of F20.  If it stops running, thats a
> > regression, the user opens a bug and you fix it.
> > 
> > The DPDK is way off the reservation in regards to this.  Application packages,
> > as a general rule don't build with specific cpu features in mind, because
> > performance, while important isn't on the same scale as what you're trying to do
> > in the dpdk.  A process getting scheduled off the cpu while we handle an
> > interrupt wipes out any speedup gains made by any micro-optimizations, so theres
> > no point in doing so.  The DPDK is different, I understand that, but the
> > drawback is that it (the DPDK) needs to make optimizations that really aren't
> > considered particularly important to the rest of user space.  I'm trying to
> > opportunistically make the DPDK as fast as possible, but I need to do it in a
> > single binary, that works on a lowest common demoninator system.
> > 
> > > > 
> > > > 2) Debugging - Its going to be near impossible to support an application built
> > > > with a package put together this way, because you'll never be sure as to which
> > > > version of the library was running when the crash occured.  You can figure it
> > > > out for certain, but for support/development people to need to remember to
> > > > figure this out is going to be a major turn off for them, and the result will be
> > > > that they simply won't use the dpdk.  Its Anathema to the expectations of linux
> > > > user space.
> > > 
> > > Sorry, I just don't see this as being any harder to support than
> > > multiple code paths for the same functionality. In fact, it will surely make
> > > debugging easier, since you only have the one code path, just compiled
> > > up in different ways.
> > > 
> > 
> > Well, then by all means, become a Fedora packager, an you can take over the DPDK
> > maintenece there :).  Until then, you'll just have to trust me.  If you have
> > multiple optional code paths (especialy if they're limited to isolated features) 
> > its manageable.  But regardless of how you look at it, building the same
> > source multiple times with different cpu support means completely different
> > binaries.  The assembly and optimization are just plain different.  They may be
> > close, but they're not the same, and they need to be QA-ed independently.  With
> > a single build and optional code paths, all the common code is executed no
> > matter what system you're running on, and its always the same. Multiple builds
> > with different instruction support means that code that is identical at a source
> > level may well be significantly different at a binary level, and thats not
> > something I can sanely manage in a general purpose environment.
> > 
> > > 
> > > > 3) QA - Building multiple versions of a library means needing to QA multiple
> > > > versions of a library.  If you have to have 4 builds to support different levels
> > > > of optimization, you've created a 4x increase in the amount of testing you need
> > > > to do to ensure consistent behavior.  You need to be aware of how many different
> > > > builds are available in the single rpm at all times, and find systems on which
> > > > to QA which will ensure that all of the builds get tested (as they are in fact,
> > > > unique builds).  While you may not hit all code paths in a single build, you
> > > > will at least test all the common paths.
> > > 
> > > Again, the exact same QA conditions will also apply to an approach using
> > > multiple code paths bundled into the same library. Given a choice
> > > between one code path with multiple compiles, vs multiple code paths
> > > each compiled only once, the multiple code paths option leaves far
> > > greater scope for bugs, and when bugs do occur means that you always
> > > have to find out what specific hardware it was being run on. Using the
> > > exact same code multiply compiled, the vast, vast majority of bugs are
> > > going to occur across all platforms and systems so you should rarely
> > > need to ask what the specific platform being used is.
> > > 
> > 
> > No, they won't (see above).  Enabling insructions will enable the compiler to
> > emit and optimize common paths differently, so identical source code will lead
> > to different binary code.  I need to have a single binary so that I know what
> > I'm working with when someone opens a bug.  I don't have that using a multiple
> > binary approach.  At least with multiple runtime paths (especially/specifically
> > with the run time paths we've been discussing, the igbe rx vector path and the
> > acl library, which are isolated), I know that, if I get a bug report and the
> > backtrace ends in either location, I know I'm sepecifically dealing with that
> > code.  With your multiple binary approach, if I get a crash in, say
> > rte_eal_init, I need to figure out if this crash happened in the sse3 compiled
> > binary, the ss4.2 compiled binary, the avx binary, the avx512 binary, or the
> > core2 binary.  You can say thats easy, but its easy to say that when you're not
> > the one that has to support it.
> > 
> > > > 
> > > > The bottom line is that Distribution packaging is all about consistency and
> > > > commonality.  If you install something for an arch on multiple systems, its the
> > > > same thing on each system, and it works in the same way, all the time.  This
> > > > strategy breaks that.  Thats why we do run time checks for things.
> > > 
> > > If you want to have the best tuned code running for each instruction
> > > set, then commonality and consistency goes out the window anyway,
> > So, this is perhaps where communication is breaking down.  I don't want to have the
> > best tuned code running for each instruction set.  What I want is for the dpdk
> > to run on a lowest common denominator platform, and be able to opportunistically
> > take advantage of accelerated code paths that require advanced cpu featrues.
> > 
> > 
> > Lets take the ixgbe code as an example. Note I didn't add any code paths there,
> > at all (in fact I didn't add any anywhere).  The ixgbe rx_burst method gets set
> > according to compile time configuration.  You can pick the bulk_alloc rx method,
> > or the vectorized rx method at compile time (or some others I think, but thats
> > not relevant). As it happened the vectorized rx path option had an implicit
> > dependency on SSE4.2.  Instead of requiring that all cpus that run the dpdk have
> > SSE4.2, I instead chose to move that compile time decision to a run tmie
> > decision, by building only the vectorized path with sse4.2 and only using it if
> > we see that the cpu supports sse4.2 at run time.  No new paths created, no new
> > support requirements, you're still supporting the same options upstream, the
> > only difference is I was able to include them both in a single binary.  Thats
> > better for our end users because the single binary still works everywhere.
> > Thats better for our QA group because For whatever set of tests they perform,
> > they only need an sse4.2 enabled system to test the one isolated path for that
> > vector rx code.  The rest of their tests can be conducted once, on any system,
> > because the binary is exactly the same.  If we compile multiple binaries,
> > testing on one system doesn't mean we've tested all the code.
> > 
> > > because two different machines calling the same function are going to
> > > execute different sets of instructions. The decision then becomes:
> > But thats not at all what I wanted.  I want two different machines calling the
> > same function to execute the same instructions 99.9999% of the time.  The only
> > time I want to diverge from that is in isolated paths where we can take
> > advantage of a feature that we otherwise could not (i.e. the ixgbe and acl
> > code).  I look at it like the alternatives code in linux.  There are these
> > isolated areas where you have limited bits of code that at run time are
> > re-written to use available cpu features.  99.9% of the code is identical, but
> > in these little spots its ok to diverge from simmilarity because they'er isolated,
> > easily identifiable
> > 
> > > a) whether you need multiple sets of instructions - if no then you pay
> > > with lack of performance
> > > b) how you get those multiple sets of instructions
> > > c) how you validate those multiple sets of instructions.
> > > 
> > > As is clear by now :-), my preference by far is to have multiple sets of
> > > instructions come from a single code base, as less code means less
> > > maintenance, and above all, fewer bugs. If that can't be done, then we
> > > need to look carefully at each code path being added and do a
> > > cost-benefit analysis on it.
> > > 
> > 
> > Yes, its quite clear :), I think its equally clear that I need a single binary,
> > and would like to opportunisitcally enhance it where possible without losing the
> > fact that its a single binary.
> > 
> > I suppose its all somewhat moot at this point though,  The reduction to sse3 for
> > ixgbe seems agreeable to everyone, and it lets me preserve single binary builds
> > there.  I'm currently working on the ACL library, as you noted thats a tougher
> > nut to crack.  I think I'll have it done early next week (though i'm sure my
> > translation of the instruction set reference to C will need some through testing
> > :)).  I'll post it when its ready.
> > 
> 
> Agreed, lets get everything working to a common baseline anyway. In
> terms of the number of RX and TX functions, you mentioned, I'd hope in future 
> we could cut the number of them down a bit as we make the vector versions more
> generally applicable, but that's a whole discussion for another day.
> 
As long as all paths can continue to work with a generic x86_64 compilation, I'm
good with that.  I'd still like to see some options be run time, rather than
compile time selectable, but we can deal with that much later.

Best
Neil

> /Bruce
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2014-08-02 12:54 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-29 20:24 [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features Neil Horman
2014-07-29 20:24 ` [dpdk-dev] [PATCH 1/2] ixgbe: test sse4.2 support at runtime for vectorized receive operations Neil Horman
2014-07-29 20:24 ` [dpdk-dev] [PATCH 2/2] acl: Preform dynamic sse4.2 support check Neil Horman
2014-07-30 12:07 ` [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features Ananyev, Konstantin
2014-07-30 13:01   ` Neil Horman
2014-07-30 13:44     ` Ananyev, Konstantin
2014-07-30 14:49 ` [dpdk-dev] [PATCH v2 " Neil Horman
2014-07-30 14:49   ` [dpdk-dev] [PATCH v2 1/2] ixgbe: test sse4.2 support at runtime for vectorized receive operations Neil Horman
2014-07-30 14:49   ` [dpdk-dev] [PATCH v2 2/2] acl: Preform dynamic sse4.2 support check Neil Horman
2014-07-30 15:36   ` [dpdk-dev] [PATCH v2 0/2] dpdk: Allow for dynamic enablement of some isolated features Ananyev, Konstantin
2014-07-30 19:03   ` Venky Venkatesan
2014-07-30 19:17     ` Neil Horman
2014-07-30 19:34     ` Neil Horman
2014-07-30 18:59 ` [dpdk-dev] [PATCH " Bruce Richardson
2014-07-30 19:28   ` Neil Horman
2014-07-30 21:09     ` Bruce Richardson
2014-07-31  9:30       ` Thomas Monjalon
2014-07-31 11:36       ` Ananyev, Konstantin
2014-07-31 13:13       ` Neil Horman
2014-07-31 13:26         ` Thomas Monjalon
2014-07-31 14:32           ` Neil Horman
2014-07-31 18:10             ` Neil Horman
2014-07-31 18:36               ` Bruce Richardson
2014-07-31 19:01                 ` Neil Horman
2014-07-31 20:19                   ` Bruce Richardson
2014-08-01 13:36                     ` Neil Horman
2014-08-01 13:56                       ` Ananyev, Konstantin
2014-08-01 14:26                         ` Venkatesan, Venky
2014-08-01 14:27                         ` Neil Horman
2014-07-31 19:58                 ` John W. Linville
2014-07-31 20:20                   ` Bruce Richardson
2014-07-31 20:32                     ` John W. Linville
2014-08-01  8:46                       ` Vincent JARDIN
2014-08-01 14:06                         ` Neil Horman
2014-08-01 14:57                           ` Vincent JARDIN
2014-08-01 15:19                             ` Neil Horman
2014-07-31 20:10                 ` Neil Horman
2014-07-31 20:25                   ` Bruce Richardson
2014-08-01 15:06                     ` Neil Horman
2014-08-01 19:22                       ` Bruce Richardson
2014-08-01 20:43                         ` Neil Horman
2014-08-01 21:08                           ` Bruce Richardson
2014-08-02 12:56                             ` Neil Horman
2014-07-31 21:53               ` Thomas Monjalon
2014-07-31 21:25             ` Thomas Monjalon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).