* [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features @ 2014-07-29 20:24 Neil Horman 2014-07-29 20:24 ` [dpdk-dev] [PATCH 1/2] ixgbe: test sse4.2 support at runtime for vectorized receive operations Neil Horman ` (4 more replies) 0 siblings, 5 replies; 45+ messages in thread From: Neil Horman @ 2014-07-29 20:24 UTC (permalink / raw) To: dev Hey all- I've been trying to update the fedora dpdk package to support VFIO enabled drivers and ran into a problem in which ixgbe didn't compile because the rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the default config I have. I tried to remedy this by replacing the intrinsics with the __builtin macros, but it was pointed out (correctly), that this doesn't work properly. So this is my second attempt, which I actually like a bit better. I noted that code that uses intrinsics (ixgbe and the acl library), don't need to have those instructions turned on build-wide. Rather, we can just enable the instructions in the specific code we want to build with support for that, and test for instruction support dynamically at run time. This allows me to build the dpdk for a generic platform, but in such a way that some optimizations can be used if the executing cpu supports them at run time. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Thomas Monjalon <thomas.monjalon@6wind.com> ^ permalink raw reply [flat|nested] 45+ messages in thread
* [dpdk-dev] [PATCH 1/2] ixgbe: test sse4.2 support at runtime for vectorized receive operations 2014-07-29 20:24 [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features Neil Horman @ 2014-07-29 20:24 ` Neil Horman 2014-07-29 20:24 ` [dpdk-dev] [PATCH 2/2] acl: Preform dynamic sse4.2 support check Neil Horman ` (3 subsequent siblings) 4 siblings, 0 replies; 45+ messages in thread From: Neil Horman @ 2014-07-29 20:24 UTC (permalink / raw) To: dev The ixgbe vector receive code uses the sse4.2 intrinsics directly. Instead of requiring that they be enabled at build time for the entire dpdk library, just enable sse4.2 for the ixgbe_rxtx_vec.c file, and test for support at runtime when using it Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Thomas Monjalon <thomas.monjalon@6wind.com> --- lib/librte_pmd_ixgbe/Makefile | 6 ++++++ lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c | 3 +++ 2 files changed, 9 insertions(+) diff --git a/lib/librte_pmd_ixgbe/Makefile b/lib/librte_pmd_ixgbe/Makefile index 00ccedb..c002239 100644 --- a/lib/librte_pmd_ixgbe/Makefile +++ b/lib/librte_pmd_ixgbe/Makefile @@ -39,6 +39,12 @@ LIB = librte_pmd_ixgbe.a CFLAGS += -O3 CFLAGS += $(WERROR_FLAGS) +# +# the vectorized recieve functions need sse4.2 instruction +# intrinsics, make sure we emit them from the compiler +# +CFLAGS_ixgbe_rxtx_vec.o += -msse4.2 + ifeq ($(CC), icc) # # CFLAGS for icc diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c index 09e19a3..ae615bc 100644 --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c @@ -34,6 +34,7 @@ #include <stdint.h> #include <rte_ethdev.h> #include <rte_malloc.h> +#include <rte_cpuflags.h> #include "ixgbe_ethdev.h" #include "ixgbe_rxtx.h" @@ -679,6 +680,8 @@ int ixgbe_rx_vec_condition_check(struct rte_eth_dev *dev) struct rte_eth_rxmode *rxmode = &dev->data->dev_conf.rxmode; struct rte_fdir_conf *fconf = &dev->data->dev_conf.fdir_conf; + if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_2)) + return -1; #ifndef RTE_IXGBE_RX_OLFLAGS_ENABLE /* whithout rx ol_flags, no VP flag report */ if (rxmode->hw_vlan_strip != 0 || -- 1.8.3.1 ^ permalink raw reply [flat|nested] 45+ messages in thread
* [dpdk-dev] [PATCH 2/2] acl: Preform dynamic sse4.2 support check 2014-07-29 20:24 [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features Neil Horman 2014-07-29 20:24 ` [dpdk-dev] [PATCH 1/2] ixgbe: test sse4.2 support at runtime for vectorized receive operations Neil Horman @ 2014-07-29 20:24 ` Neil Horman 2014-07-30 12:07 ` [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features Ananyev, Konstantin ` (2 subsequent siblings) 4 siblings, 0 replies; 45+ messages in thread From: Neil Horman @ 2014-07-29 20:24 UTC (permalink / raw) To: dev The ACL library relies on sse4.2 intrinsics to operate, but we don't have to enable sse4.2 in the entire build. Instead enable it for the ACL library alone, and use rte_acl_create as a choke point, at which we can test for sse4.2 support, and return NULL. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Thomas Monjalon <thomas.monjalon@6wind.com> --- lib/librte_acl/Makefile | 2 +- lib/librte_acl/rte_acl.c | 9 +++++++++ 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile index 4fe4593..3646439 100644 --- a/lib/librte_acl/Makefile +++ b/lib/librte_acl/Makefile @@ -35,7 +35,7 @@ include $(RTE_SDK)/mk/rte.vars.mk LIB = librte_acl.a CFLAGS += -O3 -CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) +CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -msse4.2 # all source are stored in SRCS-y SRCS-$(CONFIG_RTE_LIBRTE_ACL) += tb_mem.c diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c index 7c288bd..14ea7e0 100644 --- a/lib/librte_acl/rte_acl.c +++ b/lib/librte_acl/rte_acl.c @@ -112,6 +112,15 @@ rte_acl_create(const struct rte_acl_param *param) struct rte_acl_list *acl_list; struct rte_tailq_entry *te; char name[sizeof(ctx->name)]; + static int acl_supported = -1; + + if (acl_supported == -1) + acl_supported = rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_2); + + if (!acl_supported) { + rte_errno = EOPNOTSUPP; + return NULL; + } /* check that we have an initialised tail queue */ acl_list = RTE_TAILQ_LOOKUP_BY_IDX(RTE_TAILQ_ACL, rte_acl_list); -- 1.8.3.1 ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-29 20:24 [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features Neil Horman 2014-07-29 20:24 ` [dpdk-dev] [PATCH 1/2] ixgbe: test sse4.2 support at runtime for vectorized receive operations Neil Horman 2014-07-29 20:24 ` [dpdk-dev] [PATCH 2/2] acl: Preform dynamic sse4.2 support check Neil Horman @ 2014-07-30 12:07 ` Ananyev, Konstantin 2014-07-30 13:01 ` Neil Horman 2014-07-30 14:49 ` [dpdk-dev] [PATCH v2 " Neil Horman 2014-07-30 18:59 ` [dpdk-dev] [PATCH " Bruce Richardson 4 siblings, 1 reply; 45+ messages in thread From: Ananyev, Konstantin @ 2014-07-30 12:07 UTC (permalink / raw) To: Neil Horman, dev Hi Neil, > Hey all- > I've been trying to update the fedora dpdk package to support VFIO > enabled drivers and ran into a problem in which ixgbe didn't compile because the > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the > default config I have. I tried to remedy this by replacing the intrinsics with > the __builtin macros, but it was pointed out (correctly), that this doesn't work > properly. So this is my second attempt, which I actually like a bit better. I > noted that code that uses intrinsics (ixgbe and the acl library), don't need to > have those instructions turned on build-wide. Rather, we can just enable the > instructions in the specific code we want to build with support for that, and > test for instruction support dynamically at run time. This allows me to build > the dpdk for a generic platform, but in such a way that some optimizations can > be used if the executing cpu supports them at run time. Indeed it looks much better to me too. Just few nits from me: 1. > @@ -112,6 +112,15 @@ rte_acl_create(const struct rte_acl_param *param) > struct rte_acl_list *acl_list; > struct rte_tailq_entry *te; > char name[sizeof(ctx->name)]; > + static int acl_supported = -1; > + > + if (acl_supported == -1) > + acl_supported = rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_2); Do we really need acl_supported here? It seems not a big deal to just always call rte_cpu_get_flag_enabled(). After all it is a create function, and no-one expects it to be extremely fast. 2. Can you add RTE_LOG(ERR, ...) for re_acl_create() and ixgbe_rx_vec_condition_check() if sse4.2 is not supported? Konstantin ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-30 12:07 ` [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features Ananyev, Konstantin @ 2014-07-30 13:01 ` Neil Horman 2014-07-30 13:44 ` Ananyev, Konstantin 0 siblings, 1 reply; 45+ messages in thread From: Neil Horman @ 2014-07-30 13:01 UTC (permalink / raw) To: Ananyev, Konstantin; +Cc: dev On Wed, Jul 30, 2014 at 12:07:39PM +0000, Ananyev, Konstantin wrote: > > > Hi Neil, > > > Hey all- > > I've been trying to update the fedora dpdk package to support VFIO > > enabled drivers and ran into a problem in which ixgbe didn't compile because the > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the > > default config I have. I tried to remedy this by replacing the intrinsics with > > the __builtin macros, but it was pointed out (correctly), that this doesn't work > > properly. So this is my second attempt, which I actually like a bit better. I > > noted that code that uses intrinsics (ixgbe and the acl library), don't need to > > have those instructions turned on build-wide. Rather, we can just enable the > > instructions in the specific code we want to build with support for that, and > > test for instruction support dynamically at run time. This allows me to build > > the dpdk for a generic platform, but in such a way that some optimizations can > > be used if the executing cpu supports them at run time. > > Indeed it looks much better to me too. > Just few nits from me: > > 1. > @@ -112,6 +112,15 @@ rte_acl_create(const struct rte_acl_param *param) > > struct rte_acl_list *acl_list; > > struct rte_tailq_entry *te; > > char name[sizeof(ctx->name)]; > > + static int acl_supported = -1; > > + > > + if (acl_supported == -1) > > + acl_supported = rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_2); > > Do we really need acl_supported here? > It seems not a big deal to just always call rte_cpu_get_flag_enabled(). > After all it is a create function, and no-one expects it to be extremely fast. > Need, no. My only thought was that some poorly behaved application will call rte_acl_create multiple times regardless of the error returned, and doing so will cause large volumes of calls to cpuid, which evicts several high-use registers, so I didn't want to call it more than needed. If you think its ok to call it multiple times though, I'm fine with removing it. > 2. Can you add RTE_LOG(ERR, ...) for re_acl_create() and ixgbe_rx_vec_condition_check() if sse4.2 is not supported? > Absolutely, v2 shortly. Neil > Konstantin > > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-30 13:01 ` Neil Horman @ 2014-07-30 13:44 ` Ananyev, Konstantin 0 siblings, 0 replies; 45+ messages in thread From: Ananyev, Konstantin @ 2014-07-30 13:44 UTC (permalink / raw) To: Neil Horman; +Cc: dev > -----Original Message----- > From: Neil Horman [mailto:nhorman@tuxdriver.com] > Sent: Wednesday, July 30, 2014 2:01 PM > To: Ananyev, Konstantin > Cc: dev@dpdk.org > Subject: Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features > > On Wed, Jul 30, 2014 at 12:07:39PM +0000, Ananyev, Konstantin wrote: > > > > > > Hi Neil, > > > > > Hey all- > > > I've been trying to update the fedora dpdk package to support VFIO > > > enabled drivers and ran into a problem in which ixgbe didn't compile because the > > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the > > > default config I have. I tried to remedy this by replacing the intrinsics with > > > the __builtin macros, but it was pointed out (correctly), that this doesn't work > > > properly. So this is my second attempt, which I actually like a bit better. I > > > noted that code that uses intrinsics (ixgbe and the acl library), don't need to > > > have those instructions turned on build-wide. Rather, we can just enable the > > > instructions in the specific code we want to build with support for that, and > > > test for instruction support dynamically at run time. This allows me to build > > > the dpdk for a generic platform, but in such a way that some optimizations can > > > be used if the executing cpu supports them at run time. > > > > Indeed it looks much better to me too. > > Just few nits from me: > > > > 1. > @@ -112,6 +112,15 @@ rte_acl_create(const struct rte_acl_param *param) > > > struct rte_acl_list *acl_list; > > > struct rte_tailq_entry *te; > > > char name[sizeof(ctx->name)]; > > > + static int acl_supported = -1; > > > + > > > + if (acl_supported == -1) > > > + acl_supported = rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_2); > > > > Do we really need acl_supported here? > > It seems not a big deal to just always call rte_cpu_get_flag_enabled(). > > After all it is a create function, and no-one expects it to be extremely fast. > > > Need, no. My only thought was that some poorly behaved application will call > rte_acl_create multiple times regardless of the error returned, and doing so > will cause large volumes of calls to cpuid, which evicts several high-use > registers, so I didn't want to call it more than needed. If you think its ok to > call it multiple times though, I'm fine with removing it. >From my thought rte_acl_create() is not supposed to be called in the middle packet processing. It is sort of setup function. That's why I think nothing wrong would happen even if cpuid would be called several times. Again ixgbe_rx_vec_condition_check() would probably be called much more often (for each ixgbe rx queue we are going to use). > > 2. Can you add RTE_LOG(ERR, ...) for re_acl_create() and ixgbe_rx_vec_condition_check() if sse4.2 is not supported? > > > Absolutely, v2 shortly. > Neil > > > Konstantin > > > > ^ permalink raw reply [flat|nested] 45+ messages in thread
* [dpdk-dev] [PATCH v2 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-29 20:24 [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features Neil Horman ` (2 preceding siblings ...) 2014-07-30 12:07 ` [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features Ananyev, Konstantin @ 2014-07-30 14:49 ` Neil Horman 2014-07-30 14:49 ` [dpdk-dev] [PATCH v2 1/2] ixgbe: test sse4.2 support at runtime for vectorized receive operations Neil Horman ` (3 more replies) 2014-07-30 18:59 ` [dpdk-dev] [PATCH " Bruce Richardson 4 siblings, 4 replies; 45+ messages in thread From: Neil Horman @ 2014-07-30 14:49 UTC (permalink / raw) To: dev Hey all- I've been trying to update the fedora dpdk package to support VFIO enabled drivers and ran into a problem in which ixgbe didn't compile because the rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the default config I have. I tried to remedy this by replacing the intrinsics with the __builtin macros, but it was pointed out (correctly), that this doesn't work properly. So this is my second attempt, which I actually like a bit better. I noted that code that uses intrinsics (ixgbe and the acl library), don't need to have those instructions turned on build-wide. Rather, we can just enable the instructions in the specific code we want to build with support for that, and test for instruction support dynamically at run time. This allows me to build the dpdk for a generic platform, but in such a way that some optimizations can be used if the executing cpu supports them at run time. Change notes: v2) * Added Log messages to run time check failures per Konstantin * Removed run time check caching in acl per Konstantin Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Thomas Monjalon <thomas.monjalon@6wind.com> ^ permalink raw reply [flat|nested] 45+ messages in thread
* [dpdk-dev] [PATCH v2 1/2] ixgbe: test sse4.2 support at runtime for vectorized receive operations 2014-07-30 14:49 ` [dpdk-dev] [PATCH v2 " Neil Horman @ 2014-07-30 14:49 ` Neil Horman 2014-07-30 14:49 ` [dpdk-dev] [PATCH v2 2/2] acl: Preform dynamic sse4.2 support check Neil Horman ` (2 subsequent siblings) 3 siblings, 0 replies; 45+ messages in thread From: Neil Horman @ 2014-07-30 14:49 UTC (permalink / raw) To: dev The ixgbe vector receive code uses the sse4.2 intrinsics directly. Instead of requiring that they be enabled at build time for the entire dpdk library, just enable sse4.2 for the ixgbe_rxtx_vec.c file, and test for support at runtime when using it Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Thomas Monjalon <thomas.monjalon@6wind.com> --- lib/librte_pmd_ixgbe/Makefile | 6 ++++++ lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c | 5 +++++ 2 files changed, 11 insertions(+) diff --git a/lib/librte_pmd_ixgbe/Makefile b/lib/librte_pmd_ixgbe/Makefile index 00ccedb..c002239 100644 --- a/lib/librte_pmd_ixgbe/Makefile +++ b/lib/librte_pmd_ixgbe/Makefile @@ -39,6 +39,12 @@ LIB = librte_pmd_ixgbe.a CFLAGS += -O3 CFLAGS += $(WERROR_FLAGS) +# +# the vectorized recieve functions need sse4.2 instruction +# intrinsics, make sure we emit them from the compiler +# +CFLAGS_ixgbe_rxtx_vec.o += -msse4.2 + ifeq ($(CC), icc) # # CFLAGS for icc diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c index 09e19a3..18e8bfe 100644 --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c @@ -34,6 +34,7 @@ #include <stdint.h> #include <rte_ethdev.h> #include <rte_malloc.h> +#include <rte_cpuflags.h> #include "ixgbe_ethdev.h" #include "ixgbe_rxtx.h" @@ -679,6 +680,10 @@ int ixgbe_rx_vec_condition_check(struct rte_eth_dev *dev) struct rte_eth_rxmode *rxmode = &dev->data->dev_conf.rxmode; struct rte_fdir_conf *fconf = &dev->data->dev_conf.fdir_conf; + if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_2)) { + RTE_LOG(ERR, PMD, "ixgbe vector rx needs sse4.2!\n"); + return -1; + } #ifndef RTE_IXGBE_RX_OLFLAGS_ENABLE /* whithout rx ol_flags, no VP flag report */ if (rxmode->hw_vlan_strip != 0 || -- 1.8.3.1 ^ permalink raw reply [flat|nested] 45+ messages in thread
* [dpdk-dev] [PATCH v2 2/2] acl: Preform dynamic sse4.2 support check 2014-07-30 14:49 ` [dpdk-dev] [PATCH v2 " Neil Horman 2014-07-30 14:49 ` [dpdk-dev] [PATCH v2 1/2] ixgbe: test sse4.2 support at runtime for vectorized receive operations Neil Horman @ 2014-07-30 14:49 ` Neil Horman 2014-07-30 15:36 ` [dpdk-dev] [PATCH v2 0/2] dpdk: Allow for dynamic enablement of some isolated features Ananyev, Konstantin 2014-07-30 19:03 ` Venky Venkatesan 3 siblings, 0 replies; 45+ messages in thread From: Neil Horman @ 2014-07-30 14:49 UTC (permalink / raw) To: dev The ACL library relies on sse4.2 intrinsics to operate, but we don't have to enable sse4.2 in the entire build. Instead enable it for the ACL library alone, and use rte_acl_create as a choke point, at which we can test for sse4.2 support, and return NULL. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Thomas Monjalon <thomas.monjalon@6wind.com> --- lib/librte_acl/Makefile | 2 +- lib/librte_acl/rte_acl.c | 6 ++++++ 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile index 4fe4593..3646439 100644 --- a/lib/librte_acl/Makefile +++ b/lib/librte_acl/Makefile @@ -35,7 +35,7 @@ include $(RTE_SDK)/mk/rte.vars.mk LIB = librte_acl.a CFLAGS += -O3 -CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) +CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -msse4.2 # all source are stored in SRCS-y SRCS-$(CONFIG_RTE_LIBRTE_ACL) += tb_mem.c diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c index 7c288bd..37a590b 100644 --- a/lib/librte_acl/rte_acl.c +++ b/lib/librte_acl/rte_acl.c @@ -112,6 +112,12 @@ rte_acl_create(const struct rte_acl_param *param) struct rte_acl_list *acl_list; struct rte_tailq_entry *te; char name[sizeof(ctx->name)]; + + if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_2)) { + RTE_LOG(ERR, ACL, "librte_acl requires sse4.2 instructions!\n"); + rte_errno = EOPNOTSUPP; + return NULL; + } /* check that we have an initialised tail queue */ acl_list = RTE_TAILQ_LOOKUP_BY_IDX(RTE_TAILQ_ACL, rte_acl_list); -- 1.8.3.1 ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-30 14:49 ` [dpdk-dev] [PATCH v2 " Neil Horman 2014-07-30 14:49 ` [dpdk-dev] [PATCH v2 1/2] ixgbe: test sse4.2 support at runtime for vectorized receive operations Neil Horman 2014-07-30 14:49 ` [dpdk-dev] [PATCH v2 2/2] acl: Preform dynamic sse4.2 support check Neil Horman @ 2014-07-30 15:36 ` Ananyev, Konstantin 2014-07-30 19:03 ` Venky Venkatesan 3 siblings, 0 replies; 45+ messages in thread From: Ananyev, Konstantin @ 2014-07-30 15:36 UTC (permalink / raw) To: Neil Horman, dev > -----Original Message----- > From: Neil Horman [mailto:nhorman@tuxdriver.com] > Sent: Wednesday, July 30, 2014 3:49 PM > To: dev@dpdk.org > Cc: Ananyev, Konstantin; Neil Horman; Thomas Monjalon > Subject: [PATCH v2 0/2] dpdk: Allow for dynamic enablement of some isolated features > > Hey all- > I've been trying to update the fedora dpdk package to support VFIO > enabled drivers and ran into a problem in which ixgbe didn't compile because the > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the > default config I have. I tried to remedy this by replacing the intrinsics with > the __builtin macros, but it was pointed out (correctly), that this doesn't work > properly. So this is my second attempt, which I actually like a bit better. I > noted that code that uses intrinsics (ixgbe and the acl library), don't need to > have those instructions turned on build-wide. Rather, we can just enable the > instructions in the specific code we want to build with support for that, and > test for instruction support dynamically at run time. This allows me to build > the dpdk for a generic platform, but in such a way that some optimizations can > be used if the executing cpu supports them at run time. > > Change notes: > > v2) > * Added Log messages to run time check failures per Konstantin > * Removed run time check caching in acl per Konstantin > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > CC: Thomas Monjalon <thomas.monjalon@6wind.com> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com> ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-30 14:49 ` [dpdk-dev] [PATCH v2 " Neil Horman ` (2 preceding siblings ...) 2014-07-30 15:36 ` [dpdk-dev] [PATCH v2 0/2] dpdk: Allow for dynamic enablement of some isolated features Ananyev, Konstantin @ 2014-07-30 19:03 ` Venky Venkatesan 2014-07-30 19:17 ` Neil Horman 2014-07-30 19:34 ` Neil Horman 3 siblings, 2 replies; 45+ messages in thread From: Venky Venkatesan @ 2014-07-30 19:03 UTC (permalink / raw) To: dev Neil, > Hey all- > I've been trying to update the fedora dpdk package to support VFIO > enabled drivers and ran into a problem in which ixgbe didn't compile because the > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the > default config I have. I tried to remedy this by replacing the intrinsics with > the __builtin macros, but it was pointed out (correctly), that this doesn't work > properly. So this is my second attempt, which I actually like a bit better. I > noted that code that uses intrinsics (ixgbe and the acl library), don't need to > have those instructions turned on build-wide. Rather, we can just enable the > instructions in the specific code we want to build with support for that, and > test for instruction support dynamically at run time. This allows me to build > the dpdk for a generic platform, but in such a way that some optimizations can > be used if the executing cpu supports them at run time. > > Change notes: > > v2) > * Added Log messages to run time check failures per Konstantin > * Removed run time check caching in acl per Konstantin > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > CC: Thomas Monjalon <thomas.monjalon@6wind.com> > One possible change to look at - the _mm_shuffle_epi8 is something that is available all the way from SSSE3, so you should be able to leave that in. The SSE4.2 dependency is only for the popcount (and __builtin_popcountll will work for that anyway). That should make things simpler. -Venky ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-30 19:03 ` Venky Venkatesan @ 2014-07-30 19:17 ` Neil Horman 2014-07-30 19:34 ` Neil Horman 1 sibling, 0 replies; 45+ messages in thread From: Neil Horman @ 2014-07-30 19:17 UTC (permalink / raw) To: Venky Venkatesan; +Cc: dev On Wed, Jul 30, 2014 at 12:03:38PM -0700, Venky Venkatesan wrote: > Neil, > > >Hey all- > > I've been trying to update the fedora dpdk package to support VFIO > >enabled drivers and ran into a problem in which ixgbe didn't compile because the > >rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the > >default config I have. I tried to remedy this by replacing the intrinsics with > >the __builtin macros, but it was pointed out (correctly), that this doesn't work > >properly. So this is my second attempt, which I actually like a bit better. I > >noted that code that uses intrinsics (ixgbe and the acl library), don't need to > >have those instructions turned on build-wide. Rather, we can just enable the > >instructions in the specific code we want to build with support for that, and > >test for instruction support dynamically at run time. This allows me to build > >the dpdk for a generic platform, but in such a way that some optimizations can > >be used if the executing cpu supports them at run time. > > > >Change notes: > > > >v2) > > * Added Log messages to run time check failures per Konstantin > > * Removed run time check caching in acl per Konstantin > > > >Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > >CC: Thomas Monjalon <thomas.monjalon@6wind.com> > > > > One possible change to look at - the _mm_shuffle_epi8 is something that is > available all the way from SSSE3, so you should be able to leave that in. > The SSE4.2 dependency is only for the popcount (and __builtin_popcountll > will work for that anyway). That should make things simpler. > Hmm, I'd just as soon leave the use of the __builtin_* intrinsics alone at this point, as I'd like to focus on the optional enablement of these features at run time more than making them work in all cases (at least for right now), as I'd like a distribution build to take advantage of the accelerated instructions when they're available at run time.. But if we can reduce the intrinsic requirement to ssse3 in some locations that would be good. Let me experiment and I'll post a v3 Neil > -Venky > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-30 19:03 ` Venky Venkatesan 2014-07-30 19:17 ` Neil Horman @ 2014-07-30 19:34 ` Neil Horman 1 sibling, 0 replies; 45+ messages in thread From: Neil Horman @ 2014-07-30 19:34 UTC (permalink / raw) To: Venky Venkatesan; +Cc: dev On Wed, Jul 30, 2014 at 12:03:38PM -0700, Venky Venkatesan wrote: > Neil, > > >Hey all- > > I've been trying to update the fedora dpdk package to support VFIO > >enabled drivers and ran into a problem in which ixgbe didn't compile because the > >rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the > >default config I have. I tried to remedy this by replacing the intrinsics with > >the __builtin macros, but it was pointed out (correctly), that this doesn't work > >properly. So this is my second attempt, which I actually like a bit better. I > >noted that code that uses intrinsics (ixgbe and the acl library), don't need to > >have those instructions turned on build-wide. Rather, we can just enable the > >instructions in the specific code we want to build with support for that, and > >test for instruction support dynamically at run time. This allows me to build > >the dpdk for a generic platform, but in such a way that some optimizations can > >be used if the executing cpu supports them at run time. > > > >Change notes: > > > >v2) > > * Added Log messages to run time check failures per Konstantin > > * Removed run time check caching in acl per Konstantin > > > >Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > >CC: Thomas Monjalon <thomas.monjalon@6wind.com> > > > > One possible change to look at - the _mm_shuffle_epi8 is something that is > available all the way from SSSE3, so you should be able to leave that in. > The SSE4.2 dependency is only for the popcount (and __builtin_popcountll > will work for that anyway). That should make things simpler. > > -Venky > ugh, actually both the ixgbe and acl libraries use the popcnt instruction, so I think I'd rather just keep the sse4.2 instruction set as it is, unless you think the __builtin_popcnt is equally performant. Neil ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-29 20:24 [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features Neil Horman ` (3 preceding siblings ...) 2014-07-30 14:49 ` [dpdk-dev] [PATCH v2 " Neil Horman @ 2014-07-30 18:59 ` Bruce Richardson 2014-07-30 19:28 ` Neil Horman 4 siblings, 1 reply; 45+ messages in thread From: Bruce Richardson @ 2014-07-30 18:59 UTC (permalink / raw) To: Neil Horman; +Cc: dev On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote: > Hey all- > I've been trying to update the fedora dpdk package to support VFIO > enabled drivers and ran into a problem in which ixgbe didn't compile because the > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the > default config I have. I tried to remedy this by replacing the intrinsics with > the __builtin macros, but it was pointed out (correctly), that this doesn't work > properly. So this is my second attempt, which I actually like a bit better. I > noted that code that uses intrinsics (ixgbe and the acl library), don't need to > have those instructions turned on build-wide. Rather, we can just enable the > instructions in the specific code we want to build with support for that, and > test for instruction support dynamically at run time. This allows me to build > the dpdk for a generic platform, but in such a way that some optimizations can > be used if the executing cpu supports them at run time. > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > CC: Thomas Monjalon <thomas.monjalon@6wind.com> > I'd prefer if a solution could be found based off your original patch set, as it gives us more chance to deprecate the older code paths in future. Looking at the Intel Intrinsics Guide site online, it shows that the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x, and so should be available on all 64-bit systems, I believe. The popcount intrinsic is newer, but it's a much more basic instruction so hopefully the __builtin should work for that. Regards, /Bruce ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-30 18:59 ` [dpdk-dev] [PATCH " Bruce Richardson @ 2014-07-30 19:28 ` Neil Horman 2014-07-30 21:09 ` Bruce Richardson 0 siblings, 1 reply; 45+ messages in thread From: Neil Horman @ 2014-07-30 19:28 UTC (permalink / raw) To: Bruce Richardson; +Cc: dev On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote: > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote: > > Hey all- > > I've been trying to update the fedora dpdk package to support VFIO > > enabled drivers and ran into a problem in which ixgbe didn't compile because the > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the > > default config I have. I tried to remedy this by replacing the intrinsics with > > the __builtin macros, but it was pointed out (correctly), that this doesn't work > > properly. So this is my second attempt, which I actually like a bit better. I > > noted that code that uses intrinsics (ixgbe and the acl library), don't need to > > have those instructions turned on build-wide. Rather, we can just enable the > > instructions in the specific code we want to build with support for that, and > > test for instruction support dynamically at run time. This allows me to build > > the dpdk for a generic platform, but in such a way that some optimizations can > > be used if the executing cpu supports them at run time. > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > > CC: Thomas Monjalon <thomas.monjalon@6wind.com> > > > I'd prefer if a solution could be found based off your original patch > set, as it gives us more chance to deprecate the older code paths in > future. Looking at the Intel Intrinsics Guide site online, it shows that > the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x, > and so should be available on all 64-bit systems, I believe. The > popcount intrinsic is newer, but it's a much more basic instruction so > hopefully the __builtin should work for that. > Yes, but as I look at it, thats somewhat counter to my goal, which is to offer accelerated code paths on systems that can make use of it at run time. If We use the __builtin compiler functions, we will either: 1) Build those code paths with advanced instructions that won't work on older systems (i.e. crash) 2) Build those code paths with less advanced instructions, meaning that we won't speedup execution on systems that are capable of using the more advanced instructions. Using this run time check, we can, at least in these situations, make use of the accelerated paths when the instructions are available, and ignore them when they're not, at run time. What would be ideal, would be an alternative type macro, like the linux kernel employs, but implementing that would require some pretty significant work and testing. This seems like a much simpler approach. Neil > Regards, > /Bruce > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-30 19:28 ` Neil Horman @ 2014-07-30 21:09 ` Bruce Richardson 2014-07-31 9:30 ` Thomas Monjalon ` (2 more replies) 0 siblings, 3 replies; 45+ messages in thread From: Bruce Richardson @ 2014-07-30 21:09 UTC (permalink / raw) To: Neil Horman; +Cc: dev On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote: > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote: > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote: > > > Hey all- > > > I've been trying to update the fedora dpdk package to support VFIO > > > enabled drivers and ran into a problem in which ixgbe didn't compile because the > > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the > > > default config I have. I tried to remedy this by replacing the intrinsics with > > > the __builtin macros, but it was pointed out (correctly), that this doesn't work > > > properly. So this is my second attempt, which I actually like a bit better. I > > > noted that code that uses intrinsics (ixgbe and the acl library), don't need to > > > have those instructions turned on build-wide. Rather, we can just enable the > > > instructions in the specific code we want to build with support for that, and > > > test for instruction support dynamically at run time. This allows me to build > > > the dpdk for a generic platform, but in such a way that some optimizations can > > > be used if the executing cpu supports them at run time. > > > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > > > CC: Thomas Monjalon <thomas.monjalon@6wind.com> > > > > > I'd prefer if a solution could be found based off your original patch > > set, as it gives us more chance to deprecate the older code paths in > > future. Looking at the Intel Intrinsics Guide site online, it shows that > > the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x, > > and so should be available on all 64-bit systems, I believe. The > > popcount intrinsic is newer, but it's a much more basic instruction so > > hopefully the __builtin should work for that. > > > Yes, but as I look at it, thats somewhat counter to my goal, which is to offer > accelerated code paths on systems that can make use of it at run time. If We > use the __builtin compiler functions, we will either: > > 1) Build those code paths with advanced instructions that won't work on older > systems (i.e. crash) > > 2) Build those code paths with less advanced instructions, meaning that we won't > speedup execution on systems that are capable of using the more advanced > instructions. > > Using this run time check, we can, at least in these situations, make use of the > accelerated paths when the instructions are available, and ignore them when > they're not, at run time. > > What would be ideal, would be an alternative type macro, like the linux kernel > employs, but implementing that would require some pretty significant work and > testing. This seems like a much simpler approach. > Ok, I understand where you are coming from indeed. However, within that, I'd like to see us reduce the amount of code that's needed for maintenance. What we should really aim for, is to have common code, with perhaps some small ifdefs or __builtins, and then compile that code multiple times for multiple different architectures. So in this case, it would be nice to use the __builtin, and then compile that code up with and without SSE and select at runtime the code path to be used. Ideally, this could be done at the driver level. However, once you get down this path, you are dealing with more than just SSE. If I compile up the PMD on my system, which has a chip based on Sandy Bridge uarch, I find that there are multiple instructions starting with "vp" which means that they are actually AVX instructions. Even though the code is written using intrinsics which correspond to SSE operations, the compiler is free to use AVX instructions where necessary to improve performance. Therefore, if we go down this road, we need to look to compile up the code for all microarchitectures, rather than just assuming that we will get equivalent performance to "native" by turning on the instruction set indicated by the primitives in the code. This is where having one codepath recompiled multiple times will work far better than having multiple code paths. Regards, /Bruce ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-30 21:09 ` Bruce Richardson @ 2014-07-31 9:30 ` Thomas Monjalon 2014-07-31 11:36 ` Ananyev, Konstantin 2014-07-31 13:13 ` Neil Horman 2 siblings, 0 replies; 45+ messages in thread From: Thomas Monjalon @ 2014-07-31 9:30 UTC (permalink / raw) To: Bruce Richardson, Neil Horman; +Cc: dev 2014-07-30 14:09, Bruce Richardson: > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote: > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote: > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote: > > > > Hey all- > > > > I've been trying to update the fedora dpdk package to support VFIO > > > > enabled drivers and ran into a problem in which ixgbe didn't compile because the > > > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the > > > > default config I have. I tried to remedy this by replacing the intrinsics with > > > > the __builtin macros, but it was pointed out (correctly), that this doesn't work > > > > properly. So this is my second attempt, which I actually like a bit better. I > > > > noted that code that uses intrinsics (ixgbe and the acl library), don't need to > > > > have those instructions turned on build-wide. Rather, we can just enable the > > > > instructions in the specific code we want to build with support for that, and > > > > test for instruction support dynamically at run time. This allows me to build > > > > the dpdk for a generic platform, but in such a way that some optimizations can > > > > be used if the executing cpu supports them at run time. > > > > > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > > > > CC: Thomas Monjalon <thomas.monjalon@6wind.com> > > > > > > > I'd prefer if a solution could be found based off your original patch > > > set, as it gives us more chance to deprecate the older code paths in > > > future. Looking at the Intel Intrinsics Guide site online, it shows that > > > the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x, > > > and so should be available on all 64-bit systems, I believe. The > > > popcount intrinsic is newer, but it's a much more basic instruction so > > > hopefully the __builtin should work for that. > > > > > Yes, but as I look at it, thats somewhat counter to my goal, which is to offer > > accelerated code paths on systems that can make use of it at run time. If We > > use the __builtin compiler functions, we will either: > > > > 1) Build those code paths with advanced instructions that won't work on older > > systems (i.e. crash) > > > > 2) Build those code paths with less advanced instructions, meaning that we won't > > speedup execution on systems that are capable of using the more advanced > > instructions. > > > > Using this run time check, we can, at least in these situations, make use of the > > accelerated paths when the instructions are available, and ignore them when > > they're not, at run time. > > > > What would be ideal, would be an alternative type macro, like the linux kernel > > employs, but implementing that would require some pretty significant work and > > testing. This seems like a much simpler approach. > > > > Ok, I understand where you are coming from indeed. However, within that, > I'd like to see us reduce the amount of code that's needed for > maintenance. > > What we should really aim for, is to have common code, with perhaps some > small ifdefs or __builtins, and then compile that code multiple times > for multiple different architectures. So in this case, it would be nice > to use the __builtin, and then compile that code up with and without SSE > and select at runtime the code path to be used. Ideally, this could be > done at the driver level. Yes, having a runtime fallback in the driver to make it work efficiently on all architectures seems a good idea. We should keep in mind that it's very convenient to functionnaly test a code path on a basic machine. > However, once you get down this path, you are dealing with more than > just SSE. If I compile up the PMD on my system, which has a chip based > on Sandy Bridge uarch, I find that there are multiple instructions > starting with "vp" which means that they are actually AVX instructions. > Even though the code is written using intrinsics which correspond to SSE > operations, the compiler is free to use AVX instructions where necessary > to improve performance. Therefore, if we go down this road, we need to > look to compile up the code for all microarchitectures, rather than just > assuming that we will get equivalent performance to "native" by turning > on the instruction set indicated by the primitives in the code. This is > where having one codepath recompiled multiple times will work far better > than having multiple code paths. Choosing the best instructions for a task is the work of the compiler. Making this choice at runtime is a big task and probably not desirable. For performance, compilation must be done in "native mode" to let compiler make the right decisions. For compatibility, compilation must be done in "default mode". Here the problem is that "default mode" compilation is broken for ixgbe and acl. So we must fix it with, at least, an ifdef against SSE4. Having a runtime decision is better because it brings some performance with a default compilation but it's not really required today. -- Thomas ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-30 21:09 ` Bruce Richardson 2014-07-31 9:30 ` Thomas Monjalon @ 2014-07-31 11:36 ` Ananyev, Konstantin 2014-07-31 13:13 ` Neil Horman 2 siblings, 0 replies; 45+ messages in thread From: Ananyev, Konstantin @ 2014-07-31 11:36 UTC (permalink / raw) To: Richardson, Bruce, Neil Horman; +Cc: dev Hi Bruce, > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson > Sent: Wednesday, July 30, 2014 10:09 PM > To: Neil Horman > Cc: dev@dpdk.org > Subject: Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote: > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote: > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote: > > > > Hey all- > > > > I've been trying to update the fedora dpdk package to support VFIO > > > > enabled drivers and ran into a problem in which ixgbe didn't compile because the > > > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the > > > > default config I have. I tried to remedy this by replacing the intrinsics with > > > > the __builtin macros, but it was pointed out (correctly), that this doesn't work > > > > properly. So this is my second attempt, which I actually like a bit better. I > > > > noted that code that uses intrinsics (ixgbe and the acl library), don't need to > > > > have those instructions turned on build-wide. Rather, we can just enable the > > > > instructions in the specific code we want to build with support for that, and > > > > test for instruction support dynamically at run time. This allows me to build > > > > the dpdk for a generic platform, but in such a way that some optimizations can > > > > be used if the executing cpu supports them at run time. > > > > > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > > > > CC: Thomas Monjalon <thomas.monjalon@6wind.com> > > > > > > > I'd prefer if a solution could be found based off your original patch > > > set, as it gives us more chance to deprecate the older code paths in > > > future. Looking at the Intel Intrinsics Guide site online, it shows that > > > the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x, > > > and so should be available on all 64-bit systems, I believe. The > > > popcount intrinsic is newer, but it's a much more basic instruction so > > > hopefully the __builtin should work for that. > > > > > Yes, but as I look at it, thats somewhat counter to my goal, which is to offer > > accelerated code paths on systems that can make use of it at run time. If We > > use the __builtin compiler functions, we will either: > > > > 1) Build those code paths with advanced instructions that won't work on older > > systems (i.e. crash) > > > > 2) Build those code paths with less advanced instructions, meaning that we won't > > speedup execution on systems that are capable of using the more advanced > > instructions. > > > > Using this run time check, we can, at least in these situations, make use of the > > accelerated paths when the instructions are available, and ignore them when > > they're not, at run time. > > > > What would be ideal, would be an alternative type macro, like the linux kernel > > employs, but implementing that would require some pretty significant work and > > testing. This seems like a much simpler approach. > > > > Ok, I understand where you are coming from indeed. However, within that, > I'd like to see us reduce the amount of code that's needed for > maintenance. > > What we should really aim for, is to have common code, with perhaps some > small ifdefs or __builtins, and then compile that code multiple times > for multiple different architectures. So in this case, it would be nice > to use the __builtin, and then compile that code up with and without SSE > and select at runtime the code path to be used. Ideally, this could be > done at the driver level. > > However, once you get down this path, you are dealing with more than > just SSE. If I compile up the PMD on my system, which has a chip based > on Sandy Bridge uarch, I find that there are multiple instructions > starting with "vp" which means that they are actually AVX instructions. > Even though the code is written using intrinsics which correspond to SSE > operations, the compiler is free to use AVX instructions where necessary > to improve performance. > Therefore, if we go down this road, we need to > look to compile up the code for all microarchitectures, rather than just > assuming that we will get equivalent performance to "native" by turning > on the instruction set indicated by the primitives in the code. This is > where having one codepath recompiled multiple times will work far better > than having multiple code paths. Using your example - as long as we specify '-mavx' compiler can (and does) use AVX instructions even for 'scalar' code (code without any SIMD instrincts). And yes, that probably affects performance. So, as I understand your suggestion, we'll then need to divide our code into: - generic one - compiled to run on all supported platforms - performance critical that will be recompiled for each supported platform. Then generic code would have to make decision at run-time what particular version of recompiled code to use. And that for each PMD and all others performance-critical DPDK libraries. Looks like too much hassle to me. After all - if someone needs a package with binaries optimised for different architectures, he can provide multiple DPDK binaries (build for different architectures) and small install script, that would decide which binary is more appropriate for the given platform. Konstantin ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-30 21:09 ` Bruce Richardson 2014-07-31 9:30 ` Thomas Monjalon 2014-07-31 11:36 ` Ananyev, Konstantin @ 2014-07-31 13:13 ` Neil Horman 2014-07-31 13:26 ` Thomas Monjalon 2 siblings, 1 reply; 45+ messages in thread From: Neil Horman @ 2014-07-31 13:13 UTC (permalink / raw) To: Bruce Richardson; +Cc: dev On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote: > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote: > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote: > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote: > > > > Hey all- > > > > I've been trying to update the fedora dpdk package to support VFIO > > > > enabled drivers and ran into a problem in which ixgbe didn't compile because the > > > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the > > > > default config I have. I tried to remedy this by replacing the intrinsics with > > > > the __builtin macros, but it was pointed out (correctly), that this doesn't work > > > > properly. So this is my second attempt, which I actually like a bit better. I > > > > noted that code that uses intrinsics (ixgbe and the acl library), don't need to > > > > have those instructions turned on build-wide. Rather, we can just enable the > > > > instructions in the specific code we want to build with support for that, and > > > > test for instruction support dynamically at run time. This allows me to build > > > > the dpdk for a generic platform, but in such a way that some optimizations can > > > > be used if the executing cpu supports them at run time. > > > > > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > > > > CC: Thomas Monjalon <thomas.monjalon@6wind.com> > > > > > > > I'd prefer if a solution could be found based off your original patch > > > set, as it gives us more chance to deprecate the older code paths in > > > future. Looking at the Intel Intrinsics Guide site online, it shows that > > > the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x, > > > and so should be available on all 64-bit systems, I believe. The > > > popcount intrinsic is newer, but it's a much more basic instruction so > > > hopefully the __builtin should work for that. > > > > > Yes, but as I look at it, thats somewhat counter to my goal, which is to offer > > accelerated code paths on systems that can make use of it at run time. If We > > use the __builtin compiler functions, we will either: > > > > 1) Build those code paths with advanced instructions that won't work on older > > systems (i.e. crash) > > > > 2) Build those code paths with less advanced instructions, meaning that we won't > > speedup execution on systems that are capable of using the more advanced > > instructions. > > > > Using this run time check, we can, at least in these situations, make use of the > > accelerated paths when the instructions are available, and ignore them when > > they're not, at run time. > > > > What would be ideal, would be an alternative type macro, like the linux kernel > > employs, but implementing that would require some pretty significant work and > > testing. This seems like a much simpler approach. > > > > Ok, I understand where you are coming from indeed. However, within that, > I'd like to see us reduce the amount of code that's needed for > maintenance. > Ok, but that seems orthogonal to what I've done here. I've added about 10 lines of easily understandable code, which seems reasonable to me. > What we should really aim for, is to have common code, with perhaps some > small ifdefs or __builtins, and then compile that code multiple times > for multiple different architectures. So in this case, it would be nice > to use the __builtin, and then compile that code up with and without SSE > and select at runtime the code path to be used. Ideally, this could be > done at the driver level. > No, that is in direct conflict with what I'm trying to do here. My goal is to enable a build for a least common denominator system, but include those code paths that allow some performance benefit on systems which support the feature, to be determined at run time. My goal is improving performance in the Fedora dpdk package, which we build once for all supported systems on an arch. Building multiple libraries for multiple system configurations is simply an unmaintainable solution. Now, a macro that selected an instruction optimized or generic path is fine, as long as it can happen at run time. The Linux kernel has such a feature, called alternatives. But its a complex subsystem that does run time replacement of instructions based on cpu feature flags. It would be great to have in the DPDK, but its a significant code base and difficult to maintain, which goes against your desire to reduce code. > However, once you get down this path, you are dealing with more than > just SSE. If I compile up the PMD on my system, which has a chip based > on Sandy Bridge uarch, I find that there are multiple instructions > starting with "vp" which means that they are actually AVX instructions. You've done it wrong, you're building for the native machine target. For fedora I build for the default machine target (core2) which is the minimal system that fedora supports. So all the code emitted for the dpdk is functional for all fedora systems. This patch set then just enables the sse4.2 intrinsics for the ixgbe vec rx/tx path and the acl library, and inculdes a run time check for SSE4.2 before either of those paths can be used. > Even though the code is written using intrinsics which correspond to SSE > operations, the compiler is free to use AVX instructions where necessary Not if you use the default machine target. > to improve performance. Therefore, if we go down this road, we need to > look to compile up the code for all microarchitectures, rather than just > assuming that we will get equivalent performance to "native" by turning > on the instruction set indicated by the primitives in the code. This is No, you compile for the least common demonitor system, and enable more performant paths opportunistically as run time checks allow. > where having one codepath recompiled multiple times will work far better > than having multiple code paths. Only if you're only concern is performance. As noted above, my goal is more than just performance, its compatibility accross systems. Multiple builds for multiple cpu flag availability is simply a non-starter for a generic distribution. Regards Neil ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-31 13:13 ` Neil Horman @ 2014-07-31 13:26 ` Thomas Monjalon 2014-07-31 14:32 ` Neil Horman 0 siblings, 1 reply; 45+ messages in thread From: Thomas Monjalon @ 2014-07-31 13:26 UTC (permalink / raw) To: Neil Horman; +Cc: dev 2014-07-31 09:13, Neil Horman: > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote: > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote: > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote: > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote: > > > > > Hey all- > > > > > I've been trying to update the fedora dpdk package to support VFIO > > > > > enabled drivers and ran into a problem in which ixgbe didn't compile because the > > > > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the > > > > > default config I have. I tried to remedy this by replacing the intrinsics with > > > > > the __builtin macros, but it was pointed out (correctly), that this doesn't work > > > > > properly. So this is my second attempt, which I actually like a bit better. I > > > > > noted that code that uses intrinsics (ixgbe and the acl library), don't need to > > > > > have those instructions turned on build-wide. Rather, we can just enable the > > > > > instructions in the specific code we want to build with support for that, and > > > > > test for instruction support dynamically at run time. This allows me to build > > > > > the dpdk for a generic platform, but in such a way that some optimizations can > > > > > be used if the executing cpu supports them at run time. > > > > > > > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > > > > > CC: Thomas Monjalon <thomas.monjalon@6wind.com> > > > > > > > > > I'd prefer if a solution could be found based off your original patch > > > > set, as it gives us more chance to deprecate the older code paths in > > > > future. Looking at the Intel Intrinsics Guide site online, it shows that > > > > the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x, > > > > and so should be available on all 64-bit systems, I believe. The > > > > popcount intrinsic is newer, but it's a much more basic instruction so > > > > hopefully the __builtin should work for that. > > > > > > > Yes, but as I look at it, thats somewhat counter to my goal, which is to offer > > > accelerated code paths on systems that can make use of it at run time. If We > > > use the __builtin compiler functions, we will either: > > > > > > 1) Build those code paths with advanced instructions that won't work on older > > > systems (i.e. crash) > > > > > > 2) Build those code paths with less advanced instructions, meaning that we won't > > > speedup execution on systems that are capable of using the more advanced > > > instructions. > > > > > > Using this run time check, we can, at least in these situations, make use of the > > > accelerated paths when the instructions are available, and ignore them when > > > they're not, at run time. > > > > > > What would be ideal, would be an alternative type macro, like the linux kernel > > > employs, but implementing that would require some pretty significant work and > > > testing. This seems like a much simpler approach. [...] > Now, a macro that selected an instruction optimized or generic path is fine, as > long as it can happen at run time. The Linux kernel has such a feature, called > alternatives. But its a complex subsystem that does run time replacement of > instructions based on cpu feature flags. It would be great to have in the DPDK, > but its a significant code base and difficult to maintain, which goes against > your desire to reduce code. [...] > > Even though the code is written using intrinsics which correspond to SSE > > operations, the compiler is free to use AVX instructions where necessary > Not if you use the default machine target. > > > to improve performance. Therefore, if we go down this road, we need to > > look to compile up the code for all microarchitectures, rather than just > > assuming that we will get equivalent performance to "native" by turning > > on the instruction set indicated by the primitives in the code. This is > No, you compile for the least common demonitor system, and enable more > performant paths opportunistically as run time checks allow. > > > where having one codepath recompiled multiple times will work far better > > than having multiple code paths. > Only if you're only concern is performance. As noted above, my goal is more > than just performance, its compatibility accross systems. Multiple builds for > multiple cpu flag availability is simply a non-starter for a generic > distribution. Neil, we are mixing 2 different problems here. 1) we have to fix default build (without SSE-4.2) 2) we could try to have performance with default build Please, let's focus on the first item and we could discuss about performance later. Having some different code path choosed at runtime is a big rework and imply changing the compilation model (RFC welcome). -- Thomas ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-31 13:26 ` Thomas Monjalon @ 2014-07-31 14:32 ` Neil Horman 2014-07-31 18:10 ` Neil Horman 2014-07-31 21:25 ` Thomas Monjalon 0 siblings, 2 replies; 45+ messages in thread From: Neil Horman @ 2014-07-31 14:32 UTC (permalink / raw) To: Thomas Monjalon; +Cc: dev On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote: > 2014-07-31 09:13, Neil Horman: > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote: > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote: > > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote: > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote: > > > > > > Hey all- > > > > > > I've been trying to update the fedora dpdk package to support VFIO > > > > > > enabled drivers and ran into a problem in which ixgbe didn't compile because the > > > > > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the > > > > > > default config I have. I tried to remedy this by replacing the intrinsics with > > > > > > the __builtin macros, but it was pointed out (correctly), that this doesn't work > > > > > > properly. So this is my second attempt, which I actually like a bit better. I > > > > > > noted that code that uses intrinsics (ixgbe and the acl library), don't need to > > > > > > have those instructions turned on build-wide. Rather, we can just enable the > > > > > > instructions in the specific code we want to build with support for that, and > > > > > > test for instruction support dynamically at run time. This allows me to build > > > > > > the dpdk for a generic platform, but in such a way that some optimizations can > > > > > > be used if the executing cpu supports them at run time. > > > > > > > > > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > > > > > > CC: Thomas Monjalon <thomas.monjalon@6wind.com> > > > > > > > > > > > I'd prefer if a solution could be found based off your original patch > > > > > set, as it gives us more chance to deprecate the older code paths in > > > > > future. Looking at the Intel Intrinsics Guide site online, it shows that > > > > > the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x, > > > > > and so should be available on all 64-bit systems, I believe. The > > > > > popcount intrinsic is newer, but it's a much more basic instruction so > > > > > hopefully the __builtin should work for that. > > > > > > > > > Yes, but as I look at it, thats somewhat counter to my goal, which is to offer > > > > accelerated code paths on systems that can make use of it at run time. If We > > > > use the __builtin compiler functions, we will either: > > > > > > > > 1) Build those code paths with advanced instructions that won't work on older > > > > systems (i.e. crash) > > > > > > > > 2) Build those code paths with less advanced instructions, meaning that we won't > > > > speedup execution on systems that are capable of using the more advanced > > > > instructions. > > > > > > > > Using this run time check, we can, at least in these situations, make use of the > > > > accelerated paths when the instructions are available, and ignore them when > > > > they're not, at run time. > > > > > > > > What would be ideal, would be an alternative type macro, like the linux kernel > > > > employs, but implementing that would require some pretty significant work and > > > > testing. This seems like a much simpler approach. > > [...] > > > Now, a macro that selected an instruction optimized or generic path is fine, as > > long as it can happen at run time. The Linux kernel has such a feature, called > > alternatives. But its a complex subsystem that does run time replacement of > > instructions based on cpu feature flags. It would be great to have in the DPDK, > > but its a significant code base and difficult to maintain, which goes against > > your desire to reduce code. > > [...] > > > > Even though the code is written using intrinsics which correspond to SSE > > > operations, the compiler is free to use AVX instructions where necessary > > Not if you use the default machine target. > > > > > to improve performance. Therefore, if we go down this road, we need to > > > look to compile up the code for all microarchitectures, rather than just > > > assuming that we will get equivalent performance to "native" by turning > > > on the instruction set indicated by the primitives in the code. This is > > No, you compile for the least common demonitor system, and enable more > > performant paths opportunistically as run time checks allow. > > > > > where having one codepath recompiled multiple times will work far better > > > than having multiple code paths. > > Only if you're only concern is performance. As noted above, my goal is more > > than just performance, its compatibility accross systems. Multiple builds for > > multiple cpu flag availability is simply a non-starter for a generic > > distribution. > > Neil, we are mixing 2 different problems here. > 1) we have to fix default build (without SSE-4.2) Thats nothing to fix, thats a configuration issue. Just build for a lesser machine. I've already done that in the fedora build, using the defalut machine target. What exactly is missing from that? > 2) we could try to have performance with default build > Yes, we can, thats what this patch does. It doesn't address every code path, no, but it addresses two paths that are low hanging fruit for doing so, and we can incrementally build on that > Please, let's focus on the first item and we could discuss about performance > later. Having some different code path choosed at runtime is a big rework and > imply changing the compilation model (RFC welcome). > Are you referring to the use of the Alternatives code here? Yes, I absolutely agree, thats a huge rework, and not worth considering yet (though it would be great if we had it). I'm not proposing that though, I'm just proposing that we allow a few isolated paths to be selected at run time if and only if the cpu supports doing so. The check for both of these cases is in the setup path, and so the check itself should not have any significant impact on performance. I really don't see what the conflict is here. Neil > -- > Thomas > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-31 14:32 ` Neil Horman @ 2014-07-31 18:10 ` Neil Horman 2014-07-31 18:36 ` Bruce Richardson 2014-07-31 21:53 ` Thomas Monjalon 2014-07-31 21:25 ` Thomas Monjalon 1 sibling, 2 replies; 45+ messages in thread From: Neil Horman @ 2014-07-31 18:10 UTC (permalink / raw) To: Thomas Monjalon; +Cc: dev On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote: > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote: > > 2014-07-31 09:13, Neil Horman: > > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote: > > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote: > > > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote: > > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote: > > > > > > > Hey all- > > > > > > > I've been trying to update the fedora dpdk package to support VFIO > > > > > > > enabled drivers and ran into a problem in which ixgbe didn't compile because the > > > > > > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the > > > > > > > default config I have. I tried to remedy this by replacing the intrinsics with > > > > > > > the __builtin macros, but it was pointed out (correctly), that this doesn't work > > > > > > > properly. So this is my second attempt, which I actually like a bit better. I > > > > > > > noted that code that uses intrinsics (ixgbe and the acl library), don't need to > > > > > > > have those instructions turned on build-wide. Rather, we can just enable the > > > > > > > instructions in the specific code we want to build with support for that, and > > > > > > > test for instruction support dynamically at run time. This allows me to build > > > > > > > the dpdk for a generic platform, but in such a way that some optimizations can > > > > > > > be used if the executing cpu supports them at run time. > > > > > > > > > > > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > > > > > > > CC: Thomas Monjalon <thomas.monjalon@6wind.com> > > > > > > > > > > > > > I'd prefer if a solution could be found based off your original patch > > > > > > set, as it gives us more chance to deprecate the older code paths in > > > > > > future. Looking at the Intel Intrinsics Guide site online, it shows that > > > > > > the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x, > > > > > > and so should be available on all 64-bit systems, I believe. The > > > > > > popcount intrinsic is newer, but it's a much more basic instruction so > > > > > > hopefully the __builtin should work for that. > > > > > > > > > > > Yes, but as I look at it, thats somewhat counter to my goal, which is to offer > > > > > accelerated code paths on systems that can make use of it at run time. If We > > > > > use the __builtin compiler functions, we will either: > > > > > > > > > > 1) Build those code paths with advanced instructions that won't work on older > > > > > systems (i.e. crash) > > > > > > > > > > 2) Build those code paths with less advanced instructions, meaning that we won't > > > > > speedup execution on systems that are capable of using the more advanced > > > > > instructions. > > > > > > > > > > Using this run time check, we can, at least in these situations, make use of the > > > > > accelerated paths when the instructions are available, and ignore them when > > > > > they're not, at run time. > > > > > > > > > > What would be ideal, would be an alternative type macro, like the linux kernel > > > > > employs, but implementing that would require some pretty significant work and > > > > > testing. This seems like a much simpler approach. > > > > [...] > > > > > Now, a macro that selected an instruction optimized or generic path is fine, as > > > long as it can happen at run time. The Linux kernel has such a feature, called > > > alternatives. But its a complex subsystem that does run time replacement of > > > instructions based on cpu feature flags. It would be great to have in the DPDK, > > > but its a significant code base and difficult to maintain, which goes against > > > your desire to reduce code. > > > > [...] > > > > > > Even though the code is written using intrinsics which correspond to SSE > > > > operations, the compiler is free to use AVX instructions where necessary > > > Not if you use the default machine target. > > > > > > > to improve performance. Therefore, if we go down this road, we need to > > > > look to compile up the code for all microarchitectures, rather than just > > > > assuming that we will get equivalent performance to "native" by turning > > > > on the instruction set indicated by the primitives in the code. This is > > > No, you compile for the least common demonitor system, and enable more > > > performant paths opportunistically as run time checks allow. > > > > > > > where having one codepath recompiled multiple times will work far better > > > > than having multiple code paths. > > > Only if you're only concern is performance. As noted above, my goal is more > > > than just performance, its compatibility accross systems. Multiple builds for > > > multiple cpu flag availability is simply a non-starter for a generic > > > distribution. > > > > Neil, we are mixing 2 different problems here. > > 1) we have to fix default build (without SSE-4.2) > Thats nothing to fix, thats a configuration issue. Just build for a lesser > machine. I've already done that in the fedora build, using the defalut machine > target. What exactly is missing from that? > Re-reading this, I'm wondering if I missed what you were trying to say, if so I apologize. Were you trying to assert that the right thing to do here is to adjust the ixgbe and acl code paths to not use the sse4.2 intrinsics so that they are buildable on the default platform? If so, I agree, thats a nice idea, and am supportive of it, though I don't think that fully solves teh problem. In the case of the ixgbe pmd, what we have is 2 code paths, a generic code path, and an optimized code path using sse4.2 intrinsics. In this case, I don't think theres anything to fix, in that I'm fine with the optimized path needing sse4.2 to execute. There I just want to be able to do a run time check and use the optimized path if the cpu supports it, and just use the default path otherwise. In effect we already have exactly what you are looking for there. As far as the ACL library goes, yes, thats more complex. The use of sse4.2 intrinsics there is done througout the code, so theres no easy way to select a path. we're just left with either using the code or returning an error at run time, as my patch does. Certainly we can build some macros that either use the intrinsics for sse4.2 or code up some C-level variants of those instructions based on generic code, and build for the least common demoniator, or compile the code twice (once without sse4.2 support, and once with), and do a runtime selection between the two. Either way, thats going to be a useful, though significant effort. > > 2) we could try to have performance with default build > > > Yes, we can, thats what this patch does. It doesn't address every code path, > no, but it addresses two paths that are low hanging fruit for doing so, and we > can incrementally build on that > > > Please, let's focus on the first item and we could discuss about performance > > later. Having some different code path choosed at runtime is a big rework and > > imply changing the compilation model (RFC welcome). > > Even if I misinterpreted your statement above, I'm still not sure why your asserting this. Fixing the build to work with the default target machine is good, and should be undertaken, and I'll happily do so, but why reject the solution in front of you to wait for it? Even if I write macros to fix up the ACL library, I'd still like to be able to do a run time check and select the optimized version or the generic version based on cpu support. Just doing a compile time check to determine if sse4.2 is available really isn't going to cut it for me, as I don't want the fedora dpdk to have pessimal performance if it doesn't have to. Regards Neil ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-31 18:10 ` Neil Horman @ 2014-07-31 18:36 ` Bruce Richardson 2014-07-31 19:01 ` Neil Horman ` (2 more replies) 2014-07-31 21:53 ` Thomas Monjalon 1 sibling, 3 replies; 45+ messages in thread From: Bruce Richardson @ 2014-07-31 18:36 UTC (permalink / raw) To: Neil Horman; +Cc: dev Thu, Jul 31, 2014 at 02:10:32PM -0400, Neil Horman wrote: > On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote: > > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote: > > > 2014-07-31 09:13, Neil Horman: > > > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote: > > > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote: > > > > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote: > > > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote: > > > > > > > > Hey all- > > > > > > > > I've been trying to update the fedora dpdk package to support VFIO > > > > > > > > enabled drivers and ran into a problem in which ixgbe didn't compile because the > > > > > > > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the > > > > > > > > default config I have. I tried to remedy this by replacing the intrinsics with > > > > > > > > the __builtin macros, but it was pointed out (correctly), that this doesn't work > > > > > > > > properly. So this is my second attempt, which I actually like a bit better. I > > > > > > > > noted that code that uses intrinsics (ixgbe and the acl library), don't need to > > > > > > > > have those instructions turned on build-wide. Rather, we can just enable the > > > > > > > > instructions in the specific code we want to build with support for that, and > > > > > > > > test for instruction support dynamically at run time. This allows me to build > > > > > > > > the dpdk for a generic platform, but in such a way that some optimizations can > > > > > > > > be used if the executing cpu supports them at run time. > > > > > > > > > > > > > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > > > > > > > > CC: Thomas Monjalon <thomas.monjalon@6wind.com> > > > > > > > > > > > > > > > I'd prefer if a solution could be found based off your original patch > > > > > > > set, as it gives us more chance to deprecate the older code paths in > > > > > > > future. Looking at the Intel Intrinsics Guide site online, it shows that > > > > > > > the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x, > > > > > > > and so should be available on all 64-bit systems, I believe. The > > > > > > > popcount intrinsic is newer, but it's a much more basic instruction so > > > > > > > hopefully the __builtin should work for that. > > > > > > > > > > > > > Yes, but as I look at it, thats somewhat counter to my goal, which is to offer > > > > > > accelerated code paths on systems that can make use of it at run time. If We > > > > > > use the __builtin compiler functions, we will either: > > > > > > > > > > > > 1) Build those code paths with advanced instructions that won't work on older > > > > > > systems (i.e. crash) > > > > > > > > > > > > 2) Build those code paths with less advanced instructions, meaning that we won't > > > > > > speedup execution on systems that are capable of using the more advanced > > > > > > instructions. > > > > > > > > > > > > Using this run time check, we can, at least in these situations, make use of the > > > > > > accelerated paths when the instructions are available, and ignore them when > > > > > > they're not, at run time. > > > > > > > > > > > > What would be ideal, would be an alternative type macro, like the linux kernel > > > > > > employs, but implementing that would require some pretty significant work and > > > > > > testing. This seems like a much simpler approach. > > > > > > [...] > > > > > > > Now, a macro that selected an instruction optimized or generic path is fine, as > > > > long as it can happen at run time. The Linux kernel has such a feature, called > > > > alternatives. But its a complex subsystem that does run time replacement of > > > > instructions based on cpu feature flags. It would be great to have in the DPDK, > > > > but its a significant code base and difficult to maintain, which goes against > > > > your desire to reduce code. > > > > > > [...] > > > > > > > > Even though the code is written using intrinsics which correspond to SSE > > > > > operations, the compiler is free to use AVX instructions where necessary > > > > Not if you use the default machine target. > > > > > > > > > to improve performance. Therefore, if we go down this road, we need to > > > > > look to compile up the code for all microarchitectures, rather than just > > > > > assuming that we will get equivalent performance to "native" by turning > > > > > on the instruction set indicated by the primitives in the code. This is > > > > No, you compile for the least common demonitor system, and enable more > > > > performant paths opportunistically as run time checks allow. > > > > > > > > > where having one codepath recompiled multiple times will work far better > > > > > than having multiple code paths. > > > > Only if you're only concern is performance. As noted above, my goal is more > > > > than just performance, its compatibility accross systems. Multiple builds for > > > > multiple cpu flag availability is simply a non-starter for a generic > > > > distribution. > > > > > > Neil, we are mixing 2 different problems here. > > > 1) we have to fix default build (without SSE-4.2) > > Thats nothing to fix, thats a configuration issue. Just build for a lesser > > machine. I've already done that in the fedora build, using the defalut machine > > target. What exactly is missing from that? > > > Re-reading this, I'm wondering if I missed what you were trying to say, if so I > apologize. Were you trying to assert that the right thing to do here is to > adjust the ixgbe and acl code paths to not use the sse4.2 intrinsics so that > they are buildable on the default platform? If so, I agree, thats a nice idea, > and am supportive of it, though I don't think that fully solves teh problem. In > the case of the ixgbe pmd, what we have is 2 code paths, a generic code path, > and an optimized code path using sse4.2 intrinsics. In this case, I don't think > theres anything to fix, in that I'm fine with the optimized path needing sse4.2 > to execute. There I just want to be able to do a run time check and use the > optimized path if the cpu supports it, and just use the default path otherwise. > In effect we already have exactly what you are looking for there. > > As far as the ACL library goes, yes, thats more complex. The use of sse4.2 > intrinsics there is done througout the code, so theres no easy way to select a > path. we're just left with either using the code or returning an error at run > time, as my patch does. Certainly we can build some macros that either use the > intrinsics for sse4.2 or code up some C-level variants of those instructions > based on generic code, and build for the least common demoniator, or compile the > code twice (once without sse4.2 support, and once with), and do a runtime > selection between the two. Either way, thats going to be a useful, though > significant effort. I think a good first step here that I can't see anyone objecting to is to enable the ixgbe driver to use the vector code path for a generic x86_64 build. I've run a quick test here, and changing "_mm_popcnt_u64" to "__builtin_popcountll" [and the include from nmmintrin to tmmintrin] allows a compile for machine type default, and testpmd can still forward packets at a good rate (roughly perf down about 10% vs native compile on SNB). The ACL is a tougher nut to crack, but anyone see any issues with that two-line change to ixgbe_rxtx_vec.c? [Neil, since you started the patch set thread, do you want to submit an official patch here, or would you prefer I do so?] > > > > 2) we could try to have performance with default build > > > > > Yes, we can, thats what this patch does. It doesn't address every code path, > > no, but it addresses two paths that are low hanging fruit for doing so, and we > > can incrementally build on that > > > > > Please, let's focus on the first item and we could discuss about performance > > > later. Having some different code path choosed at runtime is a big rework and > > > imply changing the compilation model (RFC welcome). > > > > Even if I misinterpreted your statement above, I'm still not sure why your > asserting this. Fixing the build to work with the default target machine is > good, and should be undertaken, and I'll happily do so, but why reject the > solution in front of you to wait for it? Even if I write macros to fix up the > ACL library, I'd still like to be able to do a run time check and select the > optimized version or the generic version based on cpu support. Just doing a > compile time check to determine if sse4.2 is available really isn't going to cut > it for me, as I don't want the fedora dpdk to have pessimal performance if it > doesn't have to. > > Regards > Neil > With regards to the general approach for runtime detection of software functions, I wonder if something like this can be handled by the packaging system? Is it possible to ship out a set of shared libs compiled up for different instruction sets, and then at rpm install time, symlink the appropriate library? This would push the whole issue of detection of code paths outside of code, work across all our libraries and ensure each user got the best performance they could get form a binary? Has something like this been done before? The building of all the libraries could be scripted easy enough, just do multiple builds using different EXTRA_CFLAGS each time, and move and rename the .so's after each run. /Bruce ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-31 18:36 ` Bruce Richardson @ 2014-07-31 19:01 ` Neil Horman 2014-07-31 20:19 ` Bruce Richardson 2014-07-31 19:58 ` John W. Linville 2014-07-31 20:10 ` Neil Horman 2 siblings, 1 reply; 45+ messages in thread From: Neil Horman @ 2014-07-31 19:01 UTC (permalink / raw) To: Bruce Richardson; +Cc: dev On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote: > Thu, Jul 31, 2014 at 02:10:32PM -0400, Neil Horman wrote: > > On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote: > > > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote: > > > > 2014-07-31 09:13, Neil Horman: > > > > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote: > > > > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote: > > > > > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote: > > > > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote: > > > > > > > > > Hey all- > > > > > > > > > I've been trying to update the fedora dpdk package to support VFIO > > > > > > > > > enabled drivers and ran into a problem in which ixgbe didn't compile because the > > > > > > > > > rxtx_vec code uses sse4.2 instruction intrinsics, which aren't supported in the > > > > > > > > > default config I have. I tried to remedy this by replacing the intrinsics with > > > > > > > > > the __builtin macros, but it was pointed out (correctly), that this doesn't work > > > > > > > > > properly. So this is my second attempt, which I actually like a bit better. I > > > > > > > > > noted that code that uses intrinsics (ixgbe and the acl library), don't need to > > > > > > > > > have those instructions turned on build-wide. Rather, we can just enable the > > > > > > > > > instructions in the specific code we want to build with support for that, and > > > > > > > > > test for instruction support dynamically at run time. This allows me to build > > > > > > > > > the dpdk for a generic platform, but in such a way that some optimizations can > > > > > > > > > be used if the executing cpu supports them at run time. > > > > > > > > > > > > > > > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > > > > > > > > > CC: Thomas Monjalon <thomas.monjalon@6wind.com> > > > > > > > > > > > > > > > > > I'd prefer if a solution could be found based off your original patch > > > > > > > > set, as it gives us more chance to deprecate the older code paths in > > > > > > > > future. Looking at the Intel Intrinsics Guide site online, it shows that > > > > > > > > the _mm_shuffle_epi8 intrinsic came in with SSSE3, rather than SSE4.x, > > > > > > > > and so should be available on all 64-bit systems, I believe. The > > > > > > > > popcount intrinsic is newer, but it's a much more basic instruction so > > > > > > > > hopefully the __builtin should work for that. > > > > > > > > > > > > > > > Yes, but as I look at it, thats somewhat counter to my goal, which is to offer > > > > > > > accelerated code paths on systems that can make use of it at run time. If We > > > > > > > use the __builtin compiler functions, we will either: > > > > > > > > > > > > > > 1) Build those code paths with advanced instructions that won't work on older > > > > > > > systems (i.e. crash) > > > > > > > > > > > > > > 2) Build those code paths with less advanced instructions, meaning that we won't > > > > > > > speedup execution on systems that are capable of using the more advanced > > > > > > > instructions. > > > > > > > > > > > > > > Using this run time check, we can, at least in these situations, make use of the > > > > > > > accelerated paths when the instructions are available, and ignore them when > > > > > > > they're not, at run time. > > > > > > > > > > > > > > What would be ideal, would be an alternative type macro, like the linux kernel > > > > > > > employs, but implementing that would require some pretty significant work and > > > > > > > testing. This seems like a much simpler approach. > > > > > > > > [...] > > > > > > > > > Now, a macro that selected an instruction optimized or generic path is fine, as > > > > > long as it can happen at run time. The Linux kernel has such a feature, called > > > > > alternatives. But its a complex subsystem that does run time replacement of > > > > > instructions based on cpu feature flags. It would be great to have in the DPDK, > > > > > but its a significant code base and difficult to maintain, which goes against > > > > > your desire to reduce code. > > > > > > > > [...] > > > > > > > > > > Even though the code is written using intrinsics which correspond to SSE > > > > > > operations, the compiler is free to use AVX instructions where necessary > > > > > Not if you use the default machine target. > > > > > > > > > > > to improve performance. Therefore, if we go down this road, we need to > > > > > > look to compile up the code for all microarchitectures, rather than just > > > > > > assuming that we will get equivalent performance to "native" by turning > > > > > > on the instruction set indicated by the primitives in the code. This is > > > > > No, you compile for the least common demonitor system, and enable more > > > > > performant paths opportunistically as run time checks allow. > > > > > > > > > > > where having one codepath recompiled multiple times will work far better > > > > > > than having multiple code paths. > > > > > Only if you're only concern is performance. As noted above, my goal is more > > > > > than just performance, its compatibility accross systems. Multiple builds for > > > > > multiple cpu flag availability is simply a non-starter for a generic > > > > > distribution. > > > > > > > > Neil, we are mixing 2 different problems here. > > > > 1) we have to fix default build (without SSE-4.2) > > > Thats nothing to fix, thats a configuration issue. Just build for a lesser > > > machine. I've already done that in the fedora build, using the defalut machine > > > target. What exactly is missing from that? > > > > > Re-reading this, I'm wondering if I missed what you were trying to say, if so I > > apologize. Were you trying to assert that the right thing to do here is to > > adjust the ixgbe and acl code paths to not use the sse4.2 intrinsics so that > > they are buildable on the default platform? If so, I agree, thats a nice idea, > > and am supportive of it, though I don't think that fully solves teh problem. In > > the case of the ixgbe pmd, what we have is 2 code paths, a generic code path, > > and an optimized code path using sse4.2 intrinsics. In this case, I don't think > > theres anything to fix, in that I'm fine with the optimized path needing sse4.2 > > to execute. There I just want to be able to do a run time check and use the > > optimized path if the cpu supports it, and just use the default path otherwise. > > In effect we already have exactly what you are looking for there. > > > > As far as the ACL library goes, yes, thats more complex. The use of sse4.2 > > intrinsics there is done througout the code, so theres no easy way to select a > > path. we're just left with either using the code or returning an error at run > > time, as my patch does. Certainly we can build some macros that either use the > > intrinsics for sse4.2 or code up some C-level variants of those instructions > > based on generic code, and build for the least common demoniator, or compile the > > code twice (once without sse4.2 support, and once with), and do a runtime > > selection between the two. Either way, thats going to be a useful, though > > significant effort. > > I think a good first step here that I can't see anyone objecting to is > to enable the ixgbe driver to use the vector code path for a generic > x86_64 build. I've run a quick test here, and changing "_mm_popcnt_u64" > to "__builtin_popcountll" [and the include from nmmintrin to tmmintrin] > allows a compile for machine type default, and testpmd can still forward > packets at a good rate (roughly perf down about 10% vs native compile on > SNB). > The ACL is a tougher nut to crack, but anyone see any issues with that > two-line change to ixgbe_rxtx_vec.c? [Neil, since you started the patch > set thread, do you want to submit an official patch here, or would you prefer I > do so?] > I'm happy to do so, Though 10% performance degradation vs. using the sse4.2 instructions in that path seems significant, isn't it? Given that performance delta, it seems like it would still be preferable to have a path that used the sse4.2 instructions when they're available. Or am I misreading what you mean when you say down 10% Neil > > > > > > 2) we could try to have performance with default build > > > > > > > Yes, we can, thats what this patch does. It doesn't address every code path, > > > no, but it addresses two paths that are low hanging fruit for doing so, and we > > > can incrementally build on that > > > > > > > Please, let's focus on the first item and we could discuss about performance > > > > later. Having some different code path choosed at runtime is a big rework and > > > > imply changing the compilation model (RFC welcome). > > > > > > Even if I misinterpreted your statement above, I'm still not sure why your > > asserting this. Fixing the build to work with the default target machine is > > good, and should be undertaken, and I'll happily do so, but why reject the > > solution in front of you to wait for it? Even if I write macros to fix up the > > ACL library, I'd still like to be able to do a run time check and select the > > optimized version or the generic version based on cpu support. Just doing a > > compile time check to determine if sse4.2 is available really isn't going to cut > > it for me, as I don't want the fedora dpdk to have pessimal performance if it > > doesn't have to. > > > > Regards > > Neil > > > > With regards to the general approach for runtime detection of software > functions, I wonder if something like this can be handled by the > packaging system? Is it possible to ship out a set of shared libs > compiled up for different instruction sets, and then at rpm install > time, symlink the appropriate library? This would push the whole issue > of detection of code paths outside of code, work across all our > libraries and ensure each user got the best performance they could get > form a binary? > Has something like this been done before? The building of all the > libraries could be scripted easy enough, just do multiple builds using > different EXTRA_CFLAGS each time, and move and rename the .so's after > each run. > > /Bruce > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-31 19:01 ` Neil Horman @ 2014-07-31 20:19 ` Bruce Richardson 2014-08-01 13:36 ` Neil Horman 0 siblings, 1 reply; 45+ messages in thread From: Bruce Richardson @ 2014-07-31 20:19 UTC (permalink / raw) To: Neil Horman; +Cc: dev On Thu, Jul 31, 2014 at 03:01:17PM -0400, Neil Horman wrote: > On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote: > > > > I think a good first step here that I can't see anyone objecting to is > > to enable the ixgbe driver to use the vector code path for a generic > > x86_64 build. I've run a quick test here, and changing "_mm_popcnt_u64" > > to "__builtin_popcountll" [and the include from nmmintrin to tmmintrin] > > allows a compile for machine type default, and testpmd can still forward > > packets at a good rate (roughly perf down about 10% vs native compile on > > SNB). > > The ACL is a tougher nut to crack, but anyone see any issues with that > > two-line change to ixgbe_rxtx_vec.c? [Neil, since you started the patch > > set thread, do you want to submit an official patch here, or would you prefer I > > do so?] > > > > I'm happy to do so, Though 10% performance degradation vs. using the sse4.2 > instructions in that path seems significant, isn't it? Given that performance > delta, it seems like it would still be preferable to have a path that used the > sse4.2 instructions when they're available. Or am I misreading what you mean > when you say down 10% > > Neil > Ok, I did a little bit more testing here. Using the vector pmd compiled for generic x86_64 and using __builtin_popcountll is approx 35% faster for packet IO than the existing fast-path functions. It is also 7% (a bit lower than ~10% as I originally stated) slower than the existing native-compiled vpmd on a Sandy Bridge platform. I then ran an extra test, using EXTRA_CFLAGS='-msse4.2' to turn on the extra instructions. The ~7% performance drop went to ~3%, so we would gain a little more with using SSE4.2, but compared to the gain from having the vector driver at all, it's not that much. [I don't have a system handy with AVX2 support to see what boosts might come from compiling with that instruction set enabled.] Because of this, I'd take the ~35% speed boost for now, and try and find what would be the best general way to solve this problem across all libraries. Also, I think that anyone who needs that extra 4% performance probably wants the other 3% too, and so will compile up the code from source using the "native" compilation target. :-) /Bruce ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-31 20:19 ` Bruce Richardson @ 2014-08-01 13:36 ` Neil Horman 2014-08-01 13:56 ` Ananyev, Konstantin 0 siblings, 1 reply; 45+ messages in thread From: Neil Horman @ 2014-08-01 13:36 UTC (permalink / raw) To: Bruce Richardson; +Cc: dev On Thu, Jul 31, 2014 at 01:19:50PM -0700, Bruce Richardson wrote: > On Thu, Jul 31, 2014 at 03:01:17PM -0400, Neil Horman wrote: > > On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote: > > > > > > I think a good first step here that I can't see anyone objecting to is > > > to enable the ixgbe driver to use the vector code path for a generic > > > x86_64 build. I've run a quick test here, and changing "_mm_popcnt_u64" > > > to "__builtin_popcountll" [and the include from nmmintrin to tmmintrin] > > > allows a compile for machine type default, and testpmd can still forward > > > packets at a good rate (roughly perf down about 10% vs native compile on > > > SNB). > > > The ACL is a tougher nut to crack, but anyone see any issues with that > > > two-line change to ixgbe_rxtx_vec.c? [Neil, since you started the patch > > > set thread, do you want to submit an official patch here, or would you prefer I > > > do so?] > > > > > > > I'm happy to do so, Though 10% performance degradation vs. using the sse4.2 > > instructions in that path seems significant, isn't it? Given that performance > > delta, it seems like it would still be preferable to have a path that used the > > sse4.2 instructions when they're available. Or am I misreading what you mean > > when you say down 10% > > > > Neil > > > Ok, I did a little bit more testing here. Using the vector pmd compiled > for generic x86_64 and using __builtin_popcountll is approx 35% faster > for packet IO than the existing fast-path functions. It is also 7% (a > bit lower than ~10% as I originally stated) slower than the existing > native-compiled vpmd on a Sandy Bridge platform. > > I then ran an extra test, using EXTRA_CFLAGS='-msse4.2' to turn on the > extra instructions. The ~7% performance drop went to ~3%, so we would > gain a little more with using SSE4.2, but compared to the gain from > having the vector driver at all, it's not that much. [I don't have a > system handy with AVX2 support to see what boosts might come from > compiling with that instruction set enabled.] > > Because of this, I'd take the ~35% speed boost for now, and try and find > what would be the best general way to solve this problem across all > libraries. Also, I think that anyone who needs that extra 4% performance > probably wants the other 3% too, and so will compile up the code from > source using the "native" compilation target. :-) > Wait a moment, I'm not entirely sure what you did here. I understand that you replaced the _mm_popcnt_u64 call in the ixgbe pmd vector receive path with __builtin_popcnt, which is good, but ixgbe also uses the __mm_shuffle_epi8 intrinsic which is only available with sse4.2 from what I can see. did you replace those calls with a __builtin_shuffle variant? Otherwise, how did you get the pmd to build? I'm asking because this is what I tried in the first pass and Konstantin gave some pretty convicing evidence that this was an unworkable solution: http://dpdk.org/ml/archives/dev/2014-July/004443.html Neil ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-08-01 13:36 ` Neil Horman @ 2014-08-01 13:56 ` Ananyev, Konstantin 2014-08-01 14:26 ` Venkatesan, Venky 2014-08-01 14:27 ` Neil Horman 0 siblings, 2 replies; 45+ messages in thread From: Ananyev, Konstantin @ 2014-08-01 13:56 UTC (permalink / raw) To: Neil Horman, Richardson, Bruce; +Cc: dev > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Neil Horman > Sent: Friday, August 01, 2014 2:37 PM > To: Richardson, Bruce > Cc: dev@dpdk.org > Subject: Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features > > On Thu, Jul 31, 2014 at 01:19:50PM -0700, Bruce Richardson wrote: > > On Thu, Jul 31, 2014 at 03:01:17PM -0400, Neil Horman wrote: > > > On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote: > > > > > > > > I think a good first step here that I can't see anyone objecting to is > > > > to enable the ixgbe driver to use the vector code path for a generic > > > > x86_64 build. I've run a quick test here, and changing "_mm_popcnt_u64" > > > > to "__builtin_popcountll" [and the include from nmmintrin to tmmintrin] > > > > allows a compile for machine type default, and testpmd can still forward > > > > packets at a good rate (roughly perf down about 10% vs native compile on > > > > SNB). > > > > The ACL is a tougher nut to crack, but anyone see any issues with that > > > > two-line change to ixgbe_rxtx_vec.c? [Neil, since you started the patch > > > > set thread, do you want to submit an official patch here, or would you prefer I > > > > do so?] > > > > > > > > > > I'm happy to do so, Though 10% performance degradation vs. using the sse4.2 > > > instructions in that path seems significant, isn't it? Given that performance > > > delta, it seems like it would still be preferable to have a path that used the > > > sse4.2 instructions when they're available. Or am I misreading what you mean > > > when you say down 10% > > > > > > Neil > > > > > Ok, I did a little bit more testing here. Using the vector pmd compiled > > for generic x86_64 and using __builtin_popcountll is approx 35% faster > > for packet IO than the existing fast-path functions. It is also 7% (a > > bit lower than ~10% as I originally stated) slower than the existing > > native-compiled vpmd on a Sandy Bridge platform. > > > > I then ran an extra test, using EXTRA_CFLAGS='-msse4.2' to turn on the > > extra instructions. The ~7% performance drop went to ~3%, so we would > > gain a little more with using SSE4.2, but compared to the gain from > > having the vector driver at all, it's not that much. [I don't have a > > system handy with AVX2 support to see what boosts might come from > > compiling with that instruction set enabled.] > > > > Because of this, I'd take the ~35% speed boost for now, and try and find > > what would be the best general way to solve this problem across all > > libraries. Also, I think that anyone who needs that extra 4% performance > > probably wants the other 3% too, and so will compile up the code from > > source using the "native" compilation target. :-) > > > > > Wait a moment, I'm not entirely sure what you did here. I understand that you > replaced the _mm_popcnt_u64 call in the ixgbe pmd vector receive path with > __builtin_popcnt, which is good, but ixgbe also uses the __mm_shuffle_epi8 > intrinsic which is only available with sse4.2 from what I can see. did you > replace those calls with a __builtin_shuffle variant? Otherwise, how did you > get the pmd to build? I'm asking because this is what I tried in the first pass > and Konstantin gave some pretty convicing evidence that this was an unworkable > solution: > http://dpdk.org/ml/archives/dev/2014-July/004443.html > I think that _mm_shuffle_epi8 (PSHUFB) is available starting from SSE3. So I presume, there is no need for replacement. Konstantin ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-08-01 13:56 ` Ananyev, Konstantin @ 2014-08-01 14:26 ` Venkatesan, Venky 2014-08-01 14:27 ` Neil Horman 1 sibling, 0 replies; 45+ messages in thread From: Venkatesan, Venky @ 2014-08-01 14:26 UTC (permalink / raw) To: dev On 8/1/2014 6:56 AM, Ananyev, Konstantin wrote: >> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Neil Horman >> Sent: Friday, August 01, 2014 2:37 PM >> To: Richardson, Bruce >> Cc: dev@dpdk.org >> Subject: Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features >> >> On Thu, Jul 31, 2014 at 01:19:50PM -0700, Bruce Richardson wrote: >>> On Thu, Jul 31, 2014 at 03:01:17PM -0400, Neil Horman wrote: >>>> On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote: >>>>> I think a good first step here that I can't see anyone objecting to is >>>>> to enable the ixgbe driver to use the vector code path for a generic >>>>> x86_64 build. I've run a quick test here, and changing "_mm_popcnt_u64" >>>>> to "__builtin_popcountll" [and the include from nmmintrin to tmmintrin] >>>>> allows a compile for machine type default, and testpmd can still forward >>>>> packets at a good rate (roughly perf down about 10% vs native compile on >>>>> SNB). >>>>> The ACL is a tougher nut to crack, but anyone see any issues with that >>>>> two-line change to ixgbe_rxtx_vec.c? [Neil, since you started the patch >>>>> set thread, do you want to submit an official patch here, or would you prefer I >>>>> do so?] >>>>> >>>> I'm happy to do so, Though 10% performance degradation vs. using the sse4.2 >>>> instructions in that path seems significant, isn't it? Given that performance >>>> delta, it seems like it would still be preferable to have a path that used the >>>> sse4.2 instructions when they're available. Or am I misreading what you mean >>>> when you say down 10% >>>> >>>> Neil >>>> >>> Ok, I did a little bit more testing here. Using the vector pmd compiled >>> for generic x86_64 and using __builtin_popcountll is approx 35% faster >>> for packet IO than the existing fast-path functions. It is also 7% (a >>> bit lower than ~10% as I originally stated) slower than the existing >>> native-compiled vpmd on a Sandy Bridge platform. >>> >>> I then ran an extra test, using EXTRA_CFLAGS='-msse4.2' to turn on the >>> extra instructions. The ~7% performance drop went to ~3%, so we would >>> gain a little more with using SSE4.2, but compared to the gain from >>> having the vector driver at all, it's not that much. [I don't have a >>> system handy with AVX2 support to see what boosts might come from >>> compiling with that instruction set enabled.] >>> >>> Because of this, I'd take the ~35% speed boost for now, and try and find >>> what would be the best general way to solve this problem across all >>> libraries. Also, I think that anyone who needs that extra 4% performance >>> probably wants the other 3% too, and so will compile up the code from >>> source using the "native" compilation target. :-) So if I read this right, the fast path scalar to the new "generic" vector implementation is 35%? That's is a bit higher than anticipated, but great!! One caution - we should probably get a performance read on Atom cores before we remove the scalar fast path completely. >> >> Wait a moment, I'm not entirely sure what you did here. I understand that you >> replaced the _mm_popcnt_u64 call in the ixgbe pmd vector receive path with >> __builtin_popcnt, which is good, but ixgbe also uses the __mm_shuffle_epi8 >> intrinsic which is only available with sse4.2 from what I can see. did you >> replace those calls with a __builtin_shuffle variant? Otherwise, how did you >> get the pmd to build? I'm asking because this is what I tried in the first pass >> and Konstantin gave some pretty convicing evidence that this was an unworkable >> solution: >> http://dpdk.org/ml/archives/dev/2014-July/004443.html >> > I think that _mm_shuffle_epi8 (PSHUFB) is available starting from SSE3. > So I presume, there is no need for replacement. > Konstantin The change is really to keep the __mm_shuffle_epi8 and replace the _mm_popcnt_u64 with the builtin variant. That should allow compilation all the way up from SSSE3. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-08-01 13:56 ` Ananyev, Konstantin 2014-08-01 14:26 ` Venkatesan, Venky @ 2014-08-01 14:27 ` Neil Horman 1 sibling, 0 replies; 45+ messages in thread From: Neil Horman @ 2014-08-01 14:27 UTC (permalink / raw) To: Ananyev, Konstantin; +Cc: dev On Fri, Aug 01, 2014 at 01:56:24PM +0000, Ananyev, Konstantin wrote: > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Neil Horman > > Sent: Friday, August 01, 2014 2:37 PM > > To: Richardson, Bruce > > Cc: dev@dpdk.org > > Subject: Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features > > > > On Thu, Jul 31, 2014 at 01:19:50PM -0700, Bruce Richardson wrote: > > > On Thu, Jul 31, 2014 at 03:01:17PM -0400, Neil Horman wrote: > > > > On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote: > > > > > > > > > > I think a good first step here that I can't see anyone objecting to is > > > > > to enable the ixgbe driver to use the vector code path for a generic > > > > > x86_64 build. I've run a quick test here, and changing "_mm_popcnt_u64" > > > > > to "__builtin_popcountll" [and the include from nmmintrin to tmmintrin] > > > > > allows a compile for machine type default, and testpmd can still forward > > > > > packets at a good rate (roughly perf down about 10% vs native compile on > > > > > SNB). > > > > > The ACL is a tougher nut to crack, but anyone see any issues with that > > > > > two-line change to ixgbe_rxtx_vec.c? [Neil, since you started the patch > > > > > set thread, do you want to submit an official patch here, or would you prefer I > > > > > do so?] > > > > > > > > > > > > > I'm happy to do so, Though 10% performance degradation vs. using the sse4.2 > > > > instructions in that path seems significant, isn't it? Given that performance > > > > delta, it seems like it would still be preferable to have a path that used the > > > > sse4.2 instructions when they're available. Or am I misreading what you mean > > > > when you say down 10% > > > > > > > > Neil > > > > > > > Ok, I did a little bit more testing here. Using the vector pmd compiled > > > for generic x86_64 and using __builtin_popcountll is approx 35% faster > > > for packet IO than the existing fast-path functions. It is also 7% (a > > > bit lower than ~10% as I originally stated) slower than the existing > > > native-compiled vpmd on a Sandy Bridge platform. > > > > > > I then ran an extra test, using EXTRA_CFLAGS='-msse4.2' to turn on the > > > extra instructions. The ~7% performance drop went to ~3%, so we would > > > gain a little more with using SSE4.2, but compared to the gain from > > > having the vector driver at all, it's not that much. [I don't have a > > > system handy with AVX2 support to see what boosts might come from > > > compiling with that instruction set enabled.] > > > > > > Because of this, I'd take the ~35% speed boost for now, and try and find > > > what would be the best general way to solve this problem across all > > > libraries. Also, I think that anyone who needs that extra 4% performance > > > probably wants the other 3% too, and so will compile up the code from > > > source using the "native" compilation target. :-) > > > > > > > > > Wait a moment, I'm not entirely sure what you did here. I understand that you > > replaced the _mm_popcnt_u64 call in the ixgbe pmd vector receive path with > > __builtin_popcnt, which is good, but ixgbe also uses the __mm_shuffle_epi8 > > intrinsic which is only available with sse4.2 from what I can see. did you > > replace those calls with a __builtin_shuffle variant? Otherwise, how did you > > get the pmd to build? I'm asking because this is what I tried in the first pass > > and Konstantin gave some pretty convicing evidence that this was an unworkable > > solution: > > http://dpdk.org/ml/archives/dev/2014-July/004443.html > > > > I think that _mm_shuffle_epi8 (PSHUFB) is available starting from SSE3. > So I presume, there is no need for replacement. Ah, I see, its just because we're using the nmmintrinsic.h header. We need to replace the popcount instruction and change the include header to tmmintrins.h to avoid the #error from the failed sse4.2 check Thanks! Neil > Konstantin > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-31 18:36 ` Bruce Richardson 2014-07-31 19:01 ` Neil Horman @ 2014-07-31 19:58 ` John W. Linville 2014-07-31 20:20 ` Bruce Richardson 2014-07-31 20:10 ` Neil Horman 2 siblings, 1 reply; 45+ messages in thread From: John W. Linville @ 2014-07-31 19:58 UTC (permalink / raw) To: Bruce Richardson; +Cc: dev On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote: > With regards to the general approach for runtime detection of software > functions, I wonder if something like this can be handled by the > packaging system? Is it possible to ship out a set of shared libs > compiled up for different instruction sets, and then at rpm install > time, symlink the appropriate library? This would push the whole issue > of detection of code paths outside of code, work across all our > libraries and ensure each user got the best performance they could get > form a binary? > Has something like this been done before? The building of all the > libraries could be scripted easy enough, just do multiple builds using > different EXTRA_CFLAGS each time, and move and rename the .so's after > each run. I'm not aware of a package that does anything like that. It probably is possible, but I imagine that it would provoke a lot of debate and consternation in FESCO... -- John W. Linville Someday the world will need a hero, and you linville@tuxdriver.com might be all we have. Be ready. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-31 19:58 ` John W. Linville @ 2014-07-31 20:20 ` Bruce Richardson 2014-07-31 20:32 ` John W. Linville 0 siblings, 1 reply; 45+ messages in thread From: Bruce Richardson @ 2014-07-31 20:20 UTC (permalink / raw) To: John W. Linville; +Cc: dev On Thu, Jul 31, 2014 at 03:58:30PM -0400, John W. Linville wrote: > On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote: > > > With regards to the general approach for runtime detection of software > > functions, I wonder if something like this can be handled by the > > packaging system? Is it possible to ship out a set of shared libs > > compiled up for different instruction sets, and then at rpm install > > time, symlink the appropriate library? This would push the whole issue > > of detection of code paths outside of code, work across all our > > libraries and ensure each user got the best performance they could get > > form a binary? > > Has something like this been done before? The building of all the > > libraries could be scripted easy enough, just do multiple builds using > > different EXTRA_CFLAGS each time, and move and rename the .so's after > > each run. > > I'm not aware of a package that does anything like that. It probably > is possible, but I imagine that it would provoke a lot of debate > and consternation in FESCO... Nothing like a bit of consternation to get the adrenaline pumping, right :-) BTW: what is FESCO? /Bruce > > -- > John W. Linville Someday the world will need a hero, and you > linville@tuxdriver.com might be all we have. Be ready. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-31 20:20 ` Bruce Richardson @ 2014-07-31 20:32 ` John W. Linville 2014-08-01 8:46 ` Vincent JARDIN 0 siblings, 1 reply; 45+ messages in thread From: John W. Linville @ 2014-07-31 20:32 UTC (permalink / raw) To: Bruce Richardson; +Cc: dev On Thu, Jul 31, 2014 at 01:20:42PM -0700, Bruce Richardson wrote: > On Thu, Jul 31, 2014 at 03:58:30PM -0400, John W. Linville wrote: > > On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote: > > > > > With regards to the general approach for runtime detection of software > > > functions, I wonder if something like this can be handled by the > > > packaging system? Is it possible to ship out a set of shared libs > > > compiled up for different instruction sets, and then at rpm install > > > time, symlink the appropriate library? This would push the whole issue > > > of detection of code paths outside of code, work across all our > > > libraries and ensure each user got the best performance they could get > > > form a binary? > > > Has something like this been done before? The building of all the > > > libraries could be scripted easy enough, just do multiple builds using > > > different EXTRA_CFLAGS each time, and move and rename the .so's after > > > each run. > > > > I'm not aware of a package that does anything like that. It probably > > is possible, but I imagine that it would provoke a lot of debate > > and consternation in FESCO... > > Nothing like a bit of consternation to get the adrenaline pumping, right > :-) > BTW: what is FESCO? Fedora Engineering Steering Committee Neil and I have already felt the hot breath of FESCO on our necks regarding the Fedora DPDK package... John -- John W. Linville Someday the world will need a hero, and you linville@tuxdriver.com might be all we have. Be ready. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-31 20:32 ` John W. Linville @ 2014-08-01 8:46 ` Vincent JARDIN 2014-08-01 14:06 ` Neil Horman 0 siblings, 1 reply; 45+ messages in thread From: Vincent JARDIN @ 2014-08-01 8:46 UTC (permalink / raw) To: dev On 31/07/2014 22:32, John W. Linville wrote: >> BTW: what is FESCO? > Fedora Engineering Steering Committee > > Neil and I have already felt the hot breath of FESCO on our necks > regarding the Fedora DPDK package... > I do confirm and feel that we should go step by step. Having multiple library as Bruce suggest could be an option, I like this idea. Another one could be to get a new ELF format (like executables on Mac or Windows) that allows to support multiple binaries optimized for each CPUs. I am not aware of such options with Linux loader. But here, as a DPDK community, we cannot push it. Any one at Fedora? ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-08-01 8:46 ` Vincent JARDIN @ 2014-08-01 14:06 ` Neil Horman 2014-08-01 14:57 ` Vincent JARDIN 0 siblings, 1 reply; 45+ messages in thread From: Neil Horman @ 2014-08-01 14:06 UTC (permalink / raw) To: Vincent JARDIN; +Cc: dev On Fri, Aug 01, 2014 at 10:46:33AM +0200, Vincent JARDIN wrote: > On 31/07/2014 22:32, John W. Linville wrote: > >>BTW: what is FESCO? > >Fedora Engineering Steering Committee > > > >Neil and I have already felt the hot breath of FESCO on our necks > >regarding the Fedora DPDK package... > > > > I do confirm and feel that we should go step by step. > > Having multiple library as Bruce suggest could be an option, I like this > idea. > Its not an option (reasons described further down in the thread). > Another one could be to get a new ELF format (like executables on Mac or > Windows) that allows to support multiple binaries optimized for each CPUs. I > am not aware of such options with Linux loader. But here, as a DPDK > community, we cannot push it. Any one at Fedora? > This is definately not an option, at least not without significant justification or need. What you're asking for here is the development of an entirely new binary file format, the kernel and glibc support to interpret and execute it, and the compiler tooling to emit code in that format. Thats a huge undertaking, its not going to be done just because a single library would like to ship multiple binaries to be optimized for different cpu variants within the same family. Thats a multi year effort, and not something I'm prepared to even consider undertaking. Neil ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-08-01 14:06 ` Neil Horman @ 2014-08-01 14:57 ` Vincent JARDIN 2014-08-01 15:19 ` Neil Horman 0 siblings, 1 reply; 45+ messages in thread From: Vincent JARDIN @ 2014-08-01 14:57 UTC (permalink / raw) To: Neil Horman; +Cc: dev On 01/08/2014 16:06, Neil Horman wrote: > Thats a multi year effort, and not something I'm prepared to even > consider undertaking. Sorry: I am not pushing you, it was just an open comment. I do agree that it is a multi year effort to get it down into a wide "agreed" community. DPDK community cannot manage it. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-08-01 14:57 ` Vincent JARDIN @ 2014-08-01 15:19 ` Neil Horman 0 siblings, 0 replies; 45+ messages in thread From: Neil Horman @ 2014-08-01 15:19 UTC (permalink / raw) To: Vincent JARDIN; +Cc: dev On Fri, Aug 01, 2014 at 04:57:56PM +0200, Vincent JARDIN wrote: > On 01/08/2014 16:06, Neil Horman wrote: > >Thats a multi year effort, and not something I'm prepared to even > >consider undertaking. > > Sorry: I am not pushing you, it was just an open comment. I do agree that it > is a multi year effort to get it down into a wide "agreed" community. DPDK > community cannot manage it. > I understand, but I still don't think it makes sense to do. The fat elf format was origionally intended to allow transparent movement bewteen major architectures. While it certainly could be used to migrate between a smaller granularity within the same arch family, it seems like a lousy tradeoff to me, especially as the fanout for variants increases. Right now I think there are at least 4 variants within the dpdk (sse3, sse4.2, avx, and avx512). Thats a 4x increase in binary size. While that might make sense for one highly optimized application, it doesn't seem like it would make sense as a general purpose utility, as the disk space / speedup tradeoff is not super great. Neil ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-31 18:36 ` Bruce Richardson 2014-07-31 19:01 ` Neil Horman 2014-07-31 19:58 ` John W. Linville @ 2014-07-31 20:10 ` Neil Horman 2014-07-31 20:25 ` Bruce Richardson 2 siblings, 1 reply; 45+ messages in thread From: Neil Horman @ 2014-07-31 20:10 UTC (permalink / raw) To: Bruce Richardson; +Cc: dev On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote: > Thu, Jul 31, 2014 at 02:10:32PM -0400, Neil Horman wrote: > > On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote: > > > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote: > > > > 2014-07-31 09:13, Neil Horman: > > > > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote: > > > > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote: > > > > > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote: > > > > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote: > > > > > > > > > Hey all- > > With regards to the general approach for runtime detection of software > functions, I wonder if something like this can be handled by the > packaging system? Is it possible to ship out a set of shared libs > compiled up for different instruction sets, and then at rpm install > time, symlink the appropriate library? This would push the whole issue > of detection of code paths outside of code, work across all our > libraries and ensure each user got the best performance they could get > form a binary? > Has something like this been done before? The building of all the > libraries could be scripted easy enough, just do multiple builds using > different EXTRA_CFLAGS each time, and move and rename the .so's after > each run. > Sorry, I missed this in my last reply. In answer to your question, the short version is that such a thing is roughly possible from a packaging standpoint, but completely unworkable from a distribution standpoint. We could certainly build the dpdk multiple times and rename all the shared objects to some variant name representative of the optimzations we build in for certain cpu flags, but then we woudl be shipping X versions of the dpdk, and any appilcation (say OVS that made use of the dpdk would need to provide a version linked against each variant to be useful when making a product, and each end user would need to manually select (or run a script to select) which variant is most optimized for the system at hand. Its just not a reasonable way to package a library. When pacaging software, the only consideration given to code variance at pacakge time is architecture (x86/x86_64/ppc/s390/etc). If you install a package for your a given architecture, its expected to run on that architecture. Optional code paths are just that, optional, and executed based on run time tests. Its a requirement that we build for the lowest common demoniator system that is supported, and enable accelerative code paths optionally at run time when the cpu indicates support for them. Neil ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-31 20:10 ` Neil Horman @ 2014-07-31 20:25 ` Bruce Richardson 2014-08-01 15:06 ` Neil Horman 0 siblings, 1 reply; 45+ messages in thread From: Bruce Richardson @ 2014-07-31 20:25 UTC (permalink / raw) To: Neil Horman; +Cc: dev On Thu, Jul 31, 2014 at 04:10:18PM -0400, Neil Horman wrote: > On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote: > > Thu, Jul 31, 2014 at 02:10:32PM -0400, Neil Horman wrote: > > > On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote: > > > > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote: > > > > > 2014-07-31 09:13, Neil Horman: > > > > > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote: > > > > > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote: > > > > > > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote: > > > > > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote: > > > > > > > > > > Hey all- > > > > With regards to the general approach for runtime detection of software > > functions, I wonder if something like this can be handled by the > > packaging system? Is it possible to ship out a set of shared libs > > compiled up for different instruction sets, and then at rpm install > > time, symlink the appropriate library? This would push the whole issue > > of detection of code paths outside of code, work across all our > > libraries and ensure each user got the best performance they could get > > form a binary? > > Has something like this been done before? The building of all the > > libraries could be scripted easy enough, just do multiple builds using > > different EXTRA_CFLAGS each time, and move and rename the .so's after > > each run. > > > > Sorry, I missed this in my last reply. > > In answer to your question, the short version is that such a thing is roughly > possible from a packaging standpoint, but completely unworkable from a > distribution standpoint. We could certainly build the dpdk multiple times and > rename all the shared objects to some variant name representative of the > optimzations we build in for certain cpu flags, but then we woudl be shipping X > versions of the dpdk, and any appilcation (say OVS that made use of the dpdk > would need to provide a version linked against each variant to be useful when > making a product, and each end user would need to manually select (or run a > script to select) which variant is most optimized for the system at hand. Its > just not a reasonable way to package a library. Sorry, perhaps I was not clear, having the user have to select the appropriate library was not what I was suggesting. Instead, I was suggesting that the rpm install "librte_pmd_ixgbe.so.generic", "librte_pmd_ixgbe.so.sse42" and "librte_pmd_ixgbe.so.avx". Then the rpm post-install script would look at the cpuflags in cpuinfo and then symlink librte_pmd_ixgbe.so to the best-match version. That way the user only has to link against "librte_pmd_ixgbe.so" and depending on the system its run on, the loader will automatically resolve the symbols from the appropriate instruction-set specific .so file. > > When pacaging software, the only consideration given to code variance at pacakge > time is architecture (x86/x86_64/ppc/s390/etc). If you install a package for > your a given architecture, its expected to run on that architecture. Optional > code paths are just that, optional, and executed based on run time tests. Its a > requirement that we build for the lowest common demoniator system that is > supported, and enable accelerative code paths optionally at run time when the > cpu indicates support for them. > > Neil > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-31 20:25 ` Bruce Richardson @ 2014-08-01 15:06 ` Neil Horman 2014-08-01 19:22 ` Bruce Richardson 0 siblings, 1 reply; 45+ messages in thread From: Neil Horman @ 2014-08-01 15:06 UTC (permalink / raw) To: Bruce Richardson; +Cc: dev On Thu, Jul 31, 2014 at 01:25:06PM -0700, Bruce Richardson wrote: > On Thu, Jul 31, 2014 at 04:10:18PM -0400, Neil Horman wrote: > > On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote: > > > Thu, Jul 31, 2014 at 02:10:32PM -0400, Neil Horman wrote: > > > > On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote: > > > > > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote: > > > > > > 2014-07-31 09:13, Neil Horman: > > > > > > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote: > > > > > > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote: > > > > > > > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote: > > > > > > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote: > > > > > > > > > > > Hey all- > > > > > > With regards to the general approach for runtime detection of software > > > functions, I wonder if something like this can be handled by the > > > packaging system? Is it possible to ship out a set of shared libs > > > compiled up for different instruction sets, and then at rpm install > > > time, symlink the appropriate library? This would push the whole issue > > > of detection of code paths outside of code, work across all our > > > libraries and ensure each user got the best performance they could get > > > form a binary? > > > Has something like this been done before? The building of all the > > > libraries could be scripted easy enough, just do multiple builds using > > > different EXTRA_CFLAGS each time, and move and rename the .so's after > > > each run. > > > > > > > Sorry, I missed this in my last reply. > > > > In answer to your question, the short version is that such a thing is roughly > > possible from a packaging standpoint, but completely unworkable from a > > distribution standpoint. We could certainly build the dpdk multiple times and > > rename all the shared objects to some variant name representative of the > > optimzations we build in for certain cpu flags, but then we woudl be shipping X > > versions of the dpdk, and any appilcation (say OVS that made use of the dpdk > > would need to provide a version linked against each variant to be useful when > > making a product, and each end user would need to manually select (or run a > > script to select) which variant is most optimized for the system at hand. Its > > just not a reasonable way to package a library. > > Sorry, perhaps I was not clear, having the user have to select the > appropriate library was not what I was suggesting. Instead, I was > suggesting that the rpm install "librte_pmd_ixgbe.so.generic", > "librte_pmd_ixgbe.so.sse42" and "librte_pmd_ixgbe.so.avx". Then the rpm > post-install script would look at the cpuflags in cpuinfo and then > symlink librte_pmd_ixgbe.so to the best-match version. That way the user > only has to link against "librte_pmd_ixgbe.so" and depending on the > system its run on, the loader will automatically resolve the symbols > from the appropriate instruction-set specific .so file. > This is an absolute packaging nightmare, it will potentially break all sorts of corner cases, and support processes. To cite a few examples: 1) Upgrade support - What if the minimum cpu requirements for dpdk are advanced at some point in the future? The above strategy has no way to know that a given update has more advanced requirements than a previous update, and when the update is installed, the previously linked library for the old base will dissappear, leaving broken applications behind. 2) Debugging - Its going to be near impossible to support an application built with a package put together this way, because you'll never be sure as to which version of the library was running when the crash occured. You can figure it out for certain, but for support/development people to need to remember to figure this out is going to be a major turn off for them, and the result will be that they simply won't use the dpdk. Its Anathema to the expectations of linux user space. 3) QA - Building multiple versions of a library means needing to QA multiple versions of a library. If you have to have 4 builds to support different levels of optimization, you've created a 4x increase in the amount of testing you need to do to ensure consistent behavior. You need to be aware of how many different builds are available in the single rpm at all times, and find systems on which to QA which will ensure that all of the builds get tested (as they are in fact, unique builds). While you may not hit all code paths in a single build, you will at least test all the common paths. The bottom line is that Distribution packaging is all about consistency and commonality. If you install something for an arch on multiple systems, its the same thing on each system, and it works in the same way, all the time. This strategy breaks that. Thats why we do run time checks for things. Neil > > > > When pacaging software, the only consideration given to code variance at pacakge > > time is architecture (x86/x86_64/ppc/s390/etc). If you install a package for > > your a given architecture, its expected to run on that architecture. Optional > > code paths are just that, optional, and executed based on run time tests. Its a > > requirement that we build for the lowest common demoniator system that is > > supported, and enable accelerative code paths optionally at run time when the > > cpu indicates support for them. > > > > Neil > > > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-08-01 15:06 ` Neil Horman @ 2014-08-01 19:22 ` Bruce Richardson 2014-08-01 20:43 ` Neil Horman 0 siblings, 1 reply; 45+ messages in thread From: Bruce Richardson @ 2014-08-01 19:22 UTC (permalink / raw) To: Neil Horman; +Cc: dev On Fri, Aug 01, 2014 at 11:06:29AM -0400, Neil Horman wrote: > On Thu, Jul 31, 2014 at 01:25:06PM -0700, Bruce Richardson wrote: > > On Thu, Jul 31, 2014 at 04:10:18PM -0400, Neil Horman wrote: > > > On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote: > > > > Thu, Jul 31, 2014 at 02:10:32PM -0400, Neil Horman wrote: > > > > > On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote: > > > > > > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote: > > > > > > > 2014-07-31 09:13, Neil Horman: > > > > > > > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote: > > > > > > > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote: > > > > > > > > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote: > > > > > > > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote: > > > > > > > > > > > > Hey all- > > > > > > > > With regards to the general approach for runtime detection of software > > > > functions, I wonder if something like this can be handled by the > > > > packaging system? Is it possible to ship out a set of shared libs > > > > compiled up for different instruction sets, and then at rpm install > > > > time, symlink the appropriate library? This would push the whole issue > > > > of detection of code paths outside of code, work across all our > > > > libraries and ensure each user got the best performance they could get > > > > form a binary? > > > > Has something like this been done before? The building of all the > > > > libraries could be scripted easy enough, just do multiple builds using > > > > different EXTRA_CFLAGS each time, and move and rename the .so's after > > > > each run. > > > > > > > > > > Sorry, I missed this in my last reply. > > > > > > In answer to your question, the short version is that such a thing is roughly > > > possible from a packaging standpoint, but completely unworkable from a > > > distribution standpoint. We could certainly build the dpdk multiple times and > > > rename all the shared objects to some variant name representative of the > > > optimzations we build in for certain cpu flags, but then we woudl be shipping X > > > versions of the dpdk, and any appilcation (say OVS that made use of the dpdk > > > would need to provide a version linked against each variant to be useful when > > > making a product, and each end user would need to manually select (or run a > > > script to select) which variant is most optimized for the system at hand. Its > > > just not a reasonable way to package a library. > > > > Sorry, perhaps I was not clear, having the user have to select the > > appropriate library was not what I was suggesting. Instead, I was > > suggesting that the rpm install "librte_pmd_ixgbe.so.generic", > > "librte_pmd_ixgbe.so.sse42" and "librte_pmd_ixgbe.so.avx". Then the rpm > > post-install script would look at the cpuflags in cpuinfo and then > > symlink librte_pmd_ixgbe.so to the best-match version. That way the user > > only has to link against "librte_pmd_ixgbe.so" and depending on the > > system its run on, the loader will automatically resolve the symbols > > from the appropriate instruction-set specific .so file. > > > > This is an absolute packaging nightmare, it will potentially break all sorts of > corner cases, and support processes. To cite a few examples: > > 1) Upgrade support - What if the minimum cpu requirements for dpdk are advanced > at some point in the future? The above strategy has no way to know that a given > update has more advanced requirements than a previous update, and when the > update is installed, the previously linked library for the old base will > dissappear, leaving broken applications behind. Firstly, I didn't know we could actually specify minimum cpu requirements for packaging, that is something that could be useful :-) Secondly, what is the normal case for handling something like this, where an upgrade has enhanced requirements compared to the previous version? Presumably you either need to prevent the upgrade from happening or else accept a broken app. Can the same mechanism not also be used to prevent upgrades using a multi-lib scheme? > > 2) Debugging - Its going to be near impossible to support an application built > with a package put together this way, because you'll never be sure as to which > version of the library was running when the crash occured. You can figure it > out for certain, but for support/development people to need to remember to > figure this out is going to be a major turn off for them, and the result will be > that they simply won't use the dpdk. Its Anathema to the expectations of linux > user space. Sorry, I just don't see this as being any harder to support than multiple code paths for the same functionality. In fact, it will surely make debugging easier, since you only have the one code path, just compiled up in different ways. > > 3) QA - Building multiple versions of a library means needing to QA multiple > versions of a library. If you have to have 4 builds to support different levels > of optimization, you've created a 4x increase in the amount of testing you need > to do to ensure consistent behavior. You need to be aware of how many different > builds are available in the single rpm at all times, and find systems on which > to QA which will ensure that all of the builds get tested (as they are in fact, > unique builds). While you may not hit all code paths in a single build, you > will at least test all the common paths. Again, the exact same QA conditions will also apply to an approach using multiple code paths bundled into the same library. Given a choice between one code path with multiple compiles, vs multiple code paths each compiled only once, the multiple code paths option leaves far greater scope for bugs, and when bugs do occur means that you always have to find out what specific hardware it was being run on. Using the exact same code multiply compiled, the vast, vast majority of bugs are going to occur across all platforms and systems so you should rarely need to ask what the specific platform being used is. > > The bottom line is that Distribution packaging is all about consistency and > commonality. If you install something for an arch on multiple systems, its the > same thing on each system, and it works in the same way, all the time. This > strategy breaks that. Thats why we do run time checks for things. If you want to have the best tuned code running for each instruction set, then commonality and consistency goes out the window anyway, because two different machines calling the same function are going to execute different sets of instructions. The decision then becomes: a) whether you need multiple sets of instructions - if no then you pay with lack of performance b) how you get those multiple sets of instructions c) how you validate those multiple sets of instructions. As is clear by now :-), my preference by far is to have multiple sets of instructions come from a single code base, as less code means less maintenance, and above all, fewer bugs. If that can't be done, then we need to look carefully at each code path being added and do a cost-benefit analysis on it. Regards, /Bruce > > Neil > > > > > > > When pacaging software, the only consideration given to code variance at pacakge > > > time is architecture (x86/x86_64/ppc/s390/etc). If you install a package for > > > your a given architecture, its expected to run on that architecture. Optional > > > code paths are just that, optional, and executed based on run time tests. Its a > > > requirement that we build for the lowest common demoniator system that is > > > supported, and enable accelerative code paths optionally at run time when the > > > cpu indicates support for them. > > > > > > Neil > > > > > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-08-01 19:22 ` Bruce Richardson @ 2014-08-01 20:43 ` Neil Horman 2014-08-01 21:08 ` Bruce Richardson 0 siblings, 1 reply; 45+ messages in thread From: Neil Horman @ 2014-08-01 20:43 UTC (permalink / raw) To: Bruce Richardson; +Cc: dev On Fri, Aug 01, 2014 at 12:22:22PM -0700, Bruce Richardson wrote: > On Fri, Aug 01, 2014 at 11:06:29AM -0400, Neil Horman wrote: > > On Thu, Jul 31, 2014 at 01:25:06PM -0700, Bruce Richardson wrote: > > > On Thu, Jul 31, 2014 at 04:10:18PM -0400, Neil Horman wrote: > > > > On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote: > > > > > Thu, Jul 31, 2014 at 02:10:32PM -0400, Neil Horman wrote: > > > > > > On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote: > > > > > > > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote: > > > > > > > > 2014-07-31 09:13, Neil Horman: > > > > > > > > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote: > > > > > > > > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote: > > > > > > > > > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote: > > > > > > > > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote: > > > > > > > > > > > > > Hey all- > > > > > > > > > > With regards to the general approach for runtime detection of software > > > > > functions, I wonder if something like this can be handled by the > > > > > packaging system? Is it possible to ship out a set of shared libs > > > > > compiled up for different instruction sets, and then at rpm install > > > > > time, symlink the appropriate library? This would push the whole issue > > > > > of detection of code paths outside of code, work across all our > > > > > libraries and ensure each user got the best performance they could get > > > > > form a binary? > > > > > Has something like this been done before? The building of all the > > > > > libraries could be scripted easy enough, just do multiple builds using > > > > > different EXTRA_CFLAGS each time, and move and rename the .so's after > > > > > each run. > > > > > > > > > > > > > Sorry, I missed this in my last reply. > > > > > > > > In answer to your question, the short version is that such a thing is roughly > > > > possible from a packaging standpoint, but completely unworkable from a > > > > distribution standpoint. We could certainly build the dpdk multiple times and > > > > rename all the shared objects to some variant name representative of the > > > > optimzations we build in for certain cpu flags, but then we woudl be shipping X > > > > versions of the dpdk, and any appilcation (say OVS that made use of the dpdk > > > > would need to provide a version linked against each variant to be useful when > > > > making a product, and each end user would need to manually select (or run a > > > > script to select) which variant is most optimized for the system at hand. Its > > > > just not a reasonable way to package a library. > > > > > > Sorry, perhaps I was not clear, having the user have to select the > > > appropriate library was not what I was suggesting. Instead, I was > > > suggesting that the rpm install "librte_pmd_ixgbe.so.generic", > > > "librte_pmd_ixgbe.so.sse42" and "librte_pmd_ixgbe.so.avx". Then the rpm > > > post-install script would look at the cpuflags in cpuinfo and then > > > symlink librte_pmd_ixgbe.so to the best-match version. That way the user > > > only has to link against "librte_pmd_ixgbe.so" and depending on the > > > system its run on, the loader will automatically resolve the symbols > > > from the appropriate instruction-set specific .so file. > > > > > > > This is an absolute packaging nightmare, it will potentially break all sorts of > > corner cases, and support processes. To cite a few examples: > > > > 1) Upgrade support - What if the minimum cpu requirements for dpdk are advanced > > at some point in the future? The above strategy has no way to know that a given > > update has more advanced requirements than a previous update, and when the > > update is installed, the previously linked library for the old base will > > dissappear, leaving broken applications behind. > > Firstly, I didn't know we could actually specify minimum cpu > requirements for packaging, that is something that could be useful :-) You misread my comment :). I didn't say we could specify minimum cpu requirements at packaging (you can't, beyond general arch), I said "what if the dpdk's cpu requriements were raised?". Completely different thing. Currently teh default, lowest common denominator system that dpdk appears to build for is core2 (as listed in the old default config). What if at some point you raise those requirements and decide that SSE4.2 really is required to achieve maximum performance. Using the above strategy any system that doesn't meet the new requirements will silently break on such an update. Thats not acceptable. > Secondly, what is the normal case for handling something like this, > where an upgrade has enhanced requirements compared to the previous > version? Presumably you either need to prevent the upgrade from > happening or else accept a broken app. Can the same mechanism not also > be used to prevent upgrades using a multi-lib scheme? > The case for handling something like this is: Don't do it. When you package something for Fedora (or any distro), you provide an implicit guaratee that it will run (or fail gracefully) on all supported systems. You can add support for systems as you go forward, but you can't deprecate support for systems within a major release. That is to say, if something runs on F20 now, its got to keep running on F20 for the lifetime of F20. If it stops running, thats a regression, the user opens a bug and you fix it. The DPDK is way off the reservation in regards to this. Application packages, as a general rule don't build with specific cpu features in mind, because performance, while important isn't on the same scale as what you're trying to do in the dpdk. A process getting scheduled off the cpu while we handle an interrupt wipes out any speedup gains made by any micro-optimizations, so theres no point in doing so. The DPDK is different, I understand that, but the drawback is that it (the DPDK) needs to make optimizations that really aren't considered particularly important to the rest of user space. I'm trying to opportunistically make the DPDK as fast as possible, but I need to do it in a single binary, that works on a lowest common demoninator system. > > > > 2) Debugging - Its going to be near impossible to support an application built > > with a package put together this way, because you'll never be sure as to which > > version of the library was running when the crash occured. You can figure it > > out for certain, but for support/development people to need to remember to > > figure this out is going to be a major turn off for them, and the result will be > > that they simply won't use the dpdk. Its Anathema to the expectations of linux > > user space. > > Sorry, I just don't see this as being any harder to support than > multiple code paths for the same functionality. In fact, it will surely make > debugging easier, since you only have the one code path, just compiled > up in different ways. > Well, then by all means, become a Fedora packager, an you can take over the DPDK maintenece there :). Until then, you'll just have to trust me. If you have multiple optional code paths (especialy if they're limited to isolated features) its manageable. But regardless of how you look at it, building the same source multiple times with different cpu support means completely different binaries. The assembly and optimization are just plain different. They may be close, but they're not the same, and they need to be QA-ed independently. With a single build and optional code paths, all the common code is executed no matter what system you're running on, and its always the same. Multiple builds with different instruction support means that code that is identical at a source level may well be significantly different at a binary level, and thats not something I can sanely manage in a general purpose environment. > > > 3) QA - Building multiple versions of a library means needing to QA multiple > > versions of a library. If you have to have 4 builds to support different levels > > of optimization, you've created a 4x increase in the amount of testing you need > > to do to ensure consistent behavior. You need to be aware of how many different > > builds are available in the single rpm at all times, and find systems on which > > to QA which will ensure that all of the builds get tested (as they are in fact, > > unique builds). While you may not hit all code paths in a single build, you > > will at least test all the common paths. > > Again, the exact same QA conditions will also apply to an approach using > multiple code paths bundled into the same library. Given a choice > between one code path with multiple compiles, vs multiple code paths > each compiled only once, the multiple code paths option leaves far > greater scope for bugs, and when bugs do occur means that you always > have to find out what specific hardware it was being run on. Using the > exact same code multiply compiled, the vast, vast majority of bugs are > going to occur across all platforms and systems so you should rarely > need to ask what the specific platform being used is. > No, they won't (see above). Enabling insructions will enable the compiler to emit and optimize common paths differently, so identical source code will lead to different binary code. I need to have a single binary so that I know what I'm working with when someone opens a bug. I don't have that using a multiple binary approach. At least with multiple runtime paths (especially/specifically with the run time paths we've been discussing, the igbe rx vector path and the acl library, which are isolated), I know that, if I get a bug report and the backtrace ends in either location, I know I'm sepecifically dealing with that code. With your multiple binary approach, if I get a crash in, say rte_eal_init, I need to figure out if this crash happened in the sse3 compiled binary, the ss4.2 compiled binary, the avx binary, the avx512 binary, or the core2 binary. You can say thats easy, but its easy to say that when you're not the one that has to support it. > > > > The bottom line is that Distribution packaging is all about consistency and > > commonality. If you install something for an arch on multiple systems, its the > > same thing on each system, and it works in the same way, all the time. This > > strategy breaks that. Thats why we do run time checks for things. > > If you want to have the best tuned code running for each instruction > set, then commonality and consistency goes out the window anyway, So, this is perhaps where communication is breaking down. I don't want to have the best tuned code running for each instruction set. What I want is for the dpdk to run on a lowest common denominator platform, and be able to opportunistically take advantage of accelerated code paths that require advanced cpu featrues. Lets take the ixgbe code as an example. Note I didn't add any code paths there, at all (in fact I didn't add any anywhere). The ixgbe rx_burst method gets set according to compile time configuration. You can pick the bulk_alloc rx method, or the vectorized rx method at compile time (or some others I think, but thats not relevant). As it happened the vectorized rx path option had an implicit dependency on SSE4.2. Instead of requiring that all cpus that run the dpdk have SSE4.2, I instead chose to move that compile time decision to a run tmie decision, by building only the vectorized path with sse4.2 and only using it if we see that the cpu supports sse4.2 at run time. No new paths created, no new support requirements, you're still supporting the same options upstream, the only difference is I was able to include them both in a single binary. Thats better for our end users because the single binary still works everywhere. Thats better for our QA group because For whatever set of tests they perform, they only need an sse4.2 enabled system to test the one isolated path for that vector rx code. The rest of their tests can be conducted once, on any system, because the binary is exactly the same. If we compile multiple binaries, testing on one system doesn't mean we've tested all the code. > because two different machines calling the same function are going to > execute different sets of instructions. The decision then becomes: But thats not at all what I wanted. I want two different machines calling the same function to execute the same instructions 99.9999% of the time. The only time I want to diverge from that is in isolated paths where we can take advantage of a feature that we otherwise could not (i.e. the ixgbe and acl code). I look at it like the alternatives code in linux. There are these isolated areas where you have limited bits of code that at run time are re-written to use available cpu features. 99.9% of the code is identical, but in these little spots its ok to diverge from simmilarity because they'er isolated, easily identifiable > a) whether you need multiple sets of instructions - if no then you pay > with lack of performance > b) how you get those multiple sets of instructions > c) how you validate those multiple sets of instructions. > > As is clear by now :-), my preference by far is to have multiple sets of > instructions come from a single code base, as less code means less > maintenance, and above all, fewer bugs. If that can't be done, then we > need to look carefully at each code path being added and do a > cost-benefit analysis on it. > Yes, its quite clear :), I think its equally clear that I need a single binary, and would like to opportunisitcally enhance it where possible without losing the fact that its a single binary. I suppose its all somewhat moot at this point though, The reduction to sse3 for ixgbe seems agreeable to everyone, and it lets me preserve single binary builds there. I'm currently working on the ACL library, as you noted thats a tougher nut to crack. I think I'll have it done early next week (though i'm sure my translation of the instruction set reference to C will need some through testing :)). I'll post it when its ready. > Regards, > /Bruce > > > > > Neil > > > > > > > > > > When pacaging software, the only consideration given to code variance at pacakge > > > > time is architecture (x86/x86_64/ppc/s390/etc). If you install a package for > > > > your a given architecture, its expected to run on that architecture. Optional > > > > code paths are just that, optional, and executed based on run time tests. Its a > > > > requirement that we build for the lowest common demoniator system that is > > > > supported, and enable accelerative code paths optionally at run time when the > > > > cpu indicates support for them. > > > > > > > > Neil > > > > > > > > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-08-01 20:43 ` Neil Horman @ 2014-08-01 21:08 ` Bruce Richardson 2014-08-02 12:56 ` Neil Horman 0 siblings, 1 reply; 45+ messages in thread From: Bruce Richardson @ 2014-08-01 21:08 UTC (permalink / raw) To: Neil Horman; +Cc: dev On Fri, Aug 01, 2014 at 04:43:52PM -0400, Neil Horman wrote: > On Fri, Aug 01, 2014 at 12:22:22PM -0700, Bruce Richardson wrote: > > On Fri, Aug 01, 2014 at 11:06:29AM -0400, Neil Horman wrote: > > > On Thu, Jul 31, 2014 at 01:25:06PM -0700, Bruce Richardson wrote: > > > > On Thu, Jul 31, 2014 at 04:10:18PM -0400, Neil Horman wrote: > > > > > On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote: > > > > > > Thu, Jul 31, 2014 at 02:10:32PM -0400, Neil Horman wrote: > > > > > > > On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote: > > > > > > > > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote: > > > > > > > > > 2014-07-31 09:13, Neil Horman: > > > > > > > > > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote: > > > > > > > > > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote: > > > > > > > > > > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote: > > > > > > > > > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote: > > > > > > > > > > > > > > Hey all- > > > > > > > > > > > > With regards to the general approach for runtime detection of software > > > > > > functions, I wonder if something like this can be handled by the > > > > > > packaging system? Is it possible to ship out a set of shared libs > > > > > > compiled up for different instruction sets, and then at rpm install > > > > > > time, symlink the appropriate library? This would push the whole issue > > > > > > of detection of code paths outside of code, work across all our > > > > > > libraries and ensure each user got the best performance they could get > > > > > > form a binary? > > > > > > Has something like this been done before? The building of all the > > > > > > libraries could be scripted easy enough, just do multiple builds using > > > > > > different EXTRA_CFLAGS each time, and move and rename the .so's after > > > > > > each run. > > > > > > > > > > > > > > > > Sorry, I missed this in my last reply. > > > > > > > > > > In answer to your question, the short version is that such a thing is roughly > > > > > possible from a packaging standpoint, but completely unworkable from a > > > > > distribution standpoint. We could certainly build the dpdk multiple times and > > > > > rename all the shared objects to some variant name representative of the > > > > > optimzations we build in for certain cpu flags, but then we woudl be shipping X > > > > > versions of the dpdk, and any appilcation (say OVS that made use of the dpdk > > > > > would need to provide a version linked against each variant to be useful when > > > > > making a product, and each end user would need to manually select (or run a > > > > > script to select) which variant is most optimized for the system at hand. Its > > > > > just not a reasonable way to package a library. > > > > > > > > Sorry, perhaps I was not clear, having the user have to select the > > > > appropriate library was not what I was suggesting. Instead, I was > > > > suggesting that the rpm install "librte_pmd_ixgbe.so.generic", > > > > "librte_pmd_ixgbe.so.sse42" and "librte_pmd_ixgbe.so.avx". Then the rpm > > > > post-install script would look at the cpuflags in cpuinfo and then > > > > symlink librte_pmd_ixgbe.so to the best-match version. That way the user > > > > only has to link against "librte_pmd_ixgbe.so" and depending on the > > > > system its run on, the loader will automatically resolve the symbols > > > > from the appropriate instruction-set specific .so file. > > > > > > > > > > This is an absolute packaging nightmare, it will potentially break all sorts of > > > corner cases, and support processes. To cite a few examples: > > > > > > 1) Upgrade support - What if the minimum cpu requirements for dpdk are advanced > > > at some point in the future? The above strategy has no way to know that a given > > > update has more advanced requirements than a previous update, and when the > > > update is installed, the previously linked library for the old base will > > > dissappear, leaving broken applications behind. > > > > Firstly, I didn't know we could actually specify minimum cpu > > requirements for packaging, that is something that could be useful :-) > You misread my comment :). I didn't say we could specify minimum cpu > requirements at packaging (you can't, beyond general arch), I said "what if the > dpdk's cpu requriements were raised?". Completely different thing. Currently > teh default, lowest common denominator system that dpdk appears to build for is > core2 (as listed in the old default config). What if at some point you raise > those requirements and decide that SSE4.2 really is required to achieve maximum > performance. Using the above strategy any system that doesn't meet the new > requirements will silently break on such an update. Thats not acceptable. Core2 was the first Intel set of chips that had x86_64 instruction set before (Core microarchitecture), so that's why it's listed as the minimum - it's the same thing as generic x86_64 support. :-) > > > Secondly, what is the normal case for handling something like this, > > where an upgrade has enhanced requirements compared to the previous > > version? Presumably you either need to prevent the upgrade from > > happening or else accept a broken app. Can the same mechanism not also > > be used to prevent upgrades using a multi-lib scheme? > > > The case for handling something like this is: Don't do it. When you package > something for Fedora (or any distro), you provide an implicit guaratee that it > will run (or fail gracefully) on all supported systems. You can add support for > systems as you go forward, but you can't deprecate support for systems within a > major release. That is to say, if something runs on F20 now, its got to keep > running on F20 for the lifetime of F20. If it stops running, thats a > regression, the user opens a bug and you fix it. > > The DPDK is way off the reservation in regards to this. Application packages, > as a general rule don't build with specific cpu features in mind, because > performance, while important isn't on the same scale as what you're trying to do > in the dpdk. A process getting scheduled off the cpu while we handle an > interrupt wipes out any speedup gains made by any micro-optimizations, so theres > no point in doing so. The DPDK is different, I understand that, but the > drawback is that it (the DPDK) needs to make optimizations that really aren't > considered particularly important to the rest of user space. I'm trying to > opportunistically make the DPDK as fast as possible, but I need to do it in a > single binary, that works on a lowest common demoninator system. > > > > > > > 2) Debugging - Its going to be near impossible to support an application built > > > with a package put together this way, because you'll never be sure as to which > > > version of the library was running when the crash occured. You can figure it > > > out for certain, but for support/development people to need to remember to > > > figure this out is going to be a major turn off for them, and the result will be > > > that they simply won't use the dpdk. Its Anathema to the expectations of linux > > > user space. > > > > Sorry, I just don't see this as being any harder to support than > > multiple code paths for the same functionality. In fact, it will surely make > > debugging easier, since you only have the one code path, just compiled > > up in different ways. > > > > Well, then by all means, become a Fedora packager, an you can take over the DPDK > maintenece there :). Until then, you'll just have to trust me. If you have > multiple optional code paths (especialy if they're limited to isolated features) > its manageable. But regardless of how you look at it, building the same > source multiple times with different cpu support means completely different > binaries. The assembly and optimization are just plain different. They may be > close, but they're not the same, and they need to be QA-ed independently. With > a single build and optional code paths, all the common code is executed no > matter what system you're running on, and its always the same. Multiple builds > with different instruction support means that code that is identical at a source > level may well be significantly different at a binary level, and thats not > something I can sanely manage in a general purpose environment. > > > > > > 3) QA - Building multiple versions of a library means needing to QA multiple > > > versions of a library. If you have to have 4 builds to support different levels > > > of optimization, you've created a 4x increase in the amount of testing you need > > > to do to ensure consistent behavior. You need to be aware of how many different > > > builds are available in the single rpm at all times, and find systems on which > > > to QA which will ensure that all of the builds get tested (as they are in fact, > > > unique builds). While you may not hit all code paths in a single build, you > > > will at least test all the common paths. > > > > Again, the exact same QA conditions will also apply to an approach using > > multiple code paths bundled into the same library. Given a choice > > between one code path with multiple compiles, vs multiple code paths > > each compiled only once, the multiple code paths option leaves far > > greater scope for bugs, and when bugs do occur means that you always > > have to find out what specific hardware it was being run on. Using the > > exact same code multiply compiled, the vast, vast majority of bugs are > > going to occur across all platforms and systems so you should rarely > > need to ask what the specific platform being used is. > > > > No, they won't (see above). Enabling insructions will enable the compiler to > emit and optimize common paths differently, so identical source code will lead > to different binary code. I need to have a single binary so that I know what > I'm working with when someone opens a bug. I don't have that using a multiple > binary approach. At least with multiple runtime paths (especially/specifically > with the run time paths we've been discussing, the igbe rx vector path and the > acl library, which are isolated), I know that, if I get a bug report and the > backtrace ends in either location, I know I'm sepecifically dealing with that > code. With your multiple binary approach, if I get a crash in, say > rte_eal_init, I need to figure out if this crash happened in the sse3 compiled > binary, the ss4.2 compiled binary, the avx binary, the avx512 binary, or the > core2 binary. You can say thats easy, but its easy to say that when you're not > the one that has to support it. > > > > > > > The bottom line is that Distribution packaging is all about consistency and > > > commonality. If you install something for an arch on multiple systems, its the > > > same thing on each system, and it works in the same way, all the time. This > > > strategy breaks that. Thats why we do run time checks for things. > > > > If you want to have the best tuned code running for each instruction > > set, then commonality and consistency goes out the window anyway, > So, this is perhaps where communication is breaking down. I don't want to have the > best tuned code running for each instruction set. What I want is for the dpdk > to run on a lowest common denominator platform, and be able to opportunistically > take advantage of accelerated code paths that require advanced cpu featrues. > > > Lets take the ixgbe code as an example. Note I didn't add any code paths there, > at all (in fact I didn't add any anywhere). The ixgbe rx_burst method gets set > according to compile time configuration. You can pick the bulk_alloc rx method, > or the vectorized rx method at compile time (or some others I think, but thats > not relevant). As it happened the vectorized rx path option had an implicit > dependency on SSE4.2. Instead of requiring that all cpus that run the dpdk have > SSE4.2, I instead chose to move that compile time decision to a run tmie > decision, by building only the vectorized path with sse4.2 and only using it if > we see that the cpu supports sse4.2 at run time. No new paths created, no new > support requirements, you're still supporting the same options upstream, the > only difference is I was able to include them both in a single binary. Thats > better for our end users because the single binary still works everywhere. > Thats better for our QA group because For whatever set of tests they perform, > they only need an sse4.2 enabled system to test the one isolated path for that > vector rx code. The rest of their tests can be conducted once, on any system, > because the binary is exactly the same. If we compile multiple binaries, > testing on one system doesn't mean we've tested all the code. > > > because two different machines calling the same function are going to > > execute different sets of instructions. The decision then becomes: > But thats not at all what I wanted. I want two different machines calling the > same function to execute the same instructions 99.9999% of the time. The only > time I want to diverge from that is in isolated paths where we can take > advantage of a feature that we otherwise could not (i.e. the ixgbe and acl > code). I look at it like the alternatives code in linux. There are these > isolated areas where you have limited bits of code that at run time are > re-written to use available cpu features. 99.9% of the code is identical, but > in these little spots its ok to diverge from simmilarity because they'er isolated, > easily identifiable > > > a) whether you need multiple sets of instructions - if no then you pay > > with lack of performance > > b) how you get those multiple sets of instructions > > c) how you validate those multiple sets of instructions. > > > > As is clear by now :-), my preference by far is to have multiple sets of > > instructions come from a single code base, as less code means less > > maintenance, and above all, fewer bugs. If that can't be done, then we > > need to look carefully at each code path being added and do a > > cost-benefit analysis on it. > > > > Yes, its quite clear :), I think its equally clear that I need a single binary, > and would like to opportunisitcally enhance it where possible without losing the > fact that its a single binary. > > I suppose its all somewhat moot at this point though, The reduction to sse3 for > ixgbe seems agreeable to everyone, and it lets me preserve single binary builds > there. I'm currently working on the ACL library, as you noted thats a tougher > nut to crack. I think I'll have it done early next week (though i'm sure my > translation of the instruction set reference to C will need some through testing > :)). I'll post it when its ready. > Agreed, lets get everything working to a common baseline anyway. In terms of the number of RX and TX functions, you mentioned, I'd hope in future we could cut the number of them down a bit as we make the vector versions more generally applicable, but that's a whole discussion for another day. /Bruce ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-08-01 21:08 ` Bruce Richardson @ 2014-08-02 12:56 ` Neil Horman 0 siblings, 0 replies; 45+ messages in thread From: Neil Horman @ 2014-08-02 12:56 UTC (permalink / raw) To: Bruce Richardson; +Cc: dev On Fri, Aug 01, 2014 at 02:08:22PM -0700, Bruce Richardson wrote: > On Fri, Aug 01, 2014 at 04:43:52PM -0400, Neil Horman wrote: > > On Fri, Aug 01, 2014 at 12:22:22PM -0700, Bruce Richardson wrote: > > > On Fri, Aug 01, 2014 at 11:06:29AM -0400, Neil Horman wrote: > > > > On Thu, Jul 31, 2014 at 01:25:06PM -0700, Bruce Richardson wrote: > > > > > On Thu, Jul 31, 2014 at 04:10:18PM -0400, Neil Horman wrote: > > > > > > On Thu, Jul 31, 2014 at 11:36:32AM -0700, Bruce Richardson wrote: > > > > > > > Thu, Jul 31, 2014 at 02:10:32PM -0400, Neil Horman wrote: > > > > > > > > On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote: > > > > > > > > > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote: > > > > > > > > > > 2014-07-31 09:13, Neil Horman: > > > > > > > > > > > On Wed, Jul 30, 2014 at 02:09:20PM -0700, Bruce Richardson wrote: > > > > > > > > > > > > On Wed, Jul 30, 2014 at 03:28:44PM -0400, Neil Horman wrote: > > > > > > > > > > > > > On Wed, Jul 30, 2014 at 11:59:03AM -0700, Bruce Richardson wrote: > > > > > > > > > > > > > > On Tue, Jul 29, 2014 at 04:24:24PM -0400, Neil Horman wrote: > > > > > > > > > > > > > > > Hey all- > > > > > > > > > > > > > > With regards to the general approach for runtime detection of software > > > > > > > functions, I wonder if something like this can be handled by the > > > > > > > packaging system? Is it possible to ship out a set of shared libs > > > > > > > compiled up for different instruction sets, and then at rpm install > > > > > > > time, symlink the appropriate library? This would push the whole issue > > > > > > > of detection of code paths outside of code, work across all our > > > > > > > libraries and ensure each user got the best performance they could get > > > > > > > form a binary? > > > > > > > Has something like this been done before? The building of all the > > > > > > > libraries could be scripted easy enough, just do multiple builds using > > > > > > > different EXTRA_CFLAGS each time, and move and rename the .so's after > > > > > > > each run. > > > > > > > > > > > > > > > > > > > Sorry, I missed this in my last reply. > > > > > > > > > > > > In answer to your question, the short version is that such a thing is roughly > > > > > > possible from a packaging standpoint, but completely unworkable from a > > > > > > distribution standpoint. We could certainly build the dpdk multiple times and > > > > > > rename all the shared objects to some variant name representative of the > > > > > > optimzations we build in for certain cpu flags, but then we woudl be shipping X > > > > > > versions of the dpdk, and any appilcation (say OVS that made use of the dpdk > > > > > > would need to provide a version linked against each variant to be useful when > > > > > > making a product, and each end user would need to manually select (or run a > > > > > > script to select) which variant is most optimized for the system at hand. Its > > > > > > just not a reasonable way to package a library. > > > > > > > > > > Sorry, perhaps I was not clear, having the user have to select the > > > > > appropriate library was not what I was suggesting. Instead, I was > > > > > suggesting that the rpm install "librte_pmd_ixgbe.so.generic", > > > > > "librte_pmd_ixgbe.so.sse42" and "librte_pmd_ixgbe.so.avx". Then the rpm > > > > > post-install script would look at the cpuflags in cpuinfo and then > > > > > symlink librte_pmd_ixgbe.so to the best-match version. That way the user > > > > > only has to link against "librte_pmd_ixgbe.so" and depending on the > > > > > system its run on, the loader will automatically resolve the symbols > > > > > from the appropriate instruction-set specific .so file. > > > > > > > > > > > > > This is an absolute packaging nightmare, it will potentially break all sorts of > > > > corner cases, and support processes. To cite a few examples: > > > > > > > > 1) Upgrade support - What if the minimum cpu requirements for dpdk are advanced > > > > at some point in the future? The above strategy has no way to know that a given > > > > update has more advanced requirements than a previous update, and when the > > > > update is installed, the previously linked library for the old base will > > > > dissappear, leaving broken applications behind. > > > > > > Firstly, I didn't know we could actually specify minimum cpu > > > requirements for packaging, that is something that could be useful :-) > > You misread my comment :). I didn't say we could specify minimum cpu > > requirements at packaging (you can't, beyond general arch), I said "what if the > > dpdk's cpu requriements were raised?". Completely different thing. Currently > > teh default, lowest common denominator system that dpdk appears to build for is > > core2 (as listed in the old default config). What if at some point you raise > > those requirements and decide that SSE4.2 really is required to achieve maximum > > performance. Using the above strategy any system that doesn't meet the new > > requirements will silently break on such an update. Thats not acceptable. > > Core2 was the first Intel set of chips that had x86_64 instruction set > before (Core microarchitecture), so that's why it's listed as the > minimum - it's the same thing as generic x86_64 support. :-) > Yes, and that matches with the base cpu that Fedora supports for x86_64. > > > > > Secondly, what is the normal case for handling something like this, > > > where an upgrade has enhanced requirements compared to the previous > > > version? Presumably you either need to prevent the upgrade from > > > happening or else accept a broken app. Can the same mechanism not also > > > be used to prevent upgrades using a multi-lib scheme? > > > > > The case for handling something like this is: Don't do it. When you package > > something for Fedora (or any distro), you provide an implicit guaratee that it > > will run (or fail gracefully) on all supported systems. You can add support for > > systems as you go forward, but you can't deprecate support for systems within a > > major release. That is to say, if something runs on F20 now, its got to keep > > running on F20 for the lifetime of F20. If it stops running, thats a > > regression, the user opens a bug and you fix it. > > > > The DPDK is way off the reservation in regards to this. Application packages, > > as a general rule don't build with specific cpu features in mind, because > > performance, while important isn't on the same scale as what you're trying to do > > in the dpdk. A process getting scheduled off the cpu while we handle an > > interrupt wipes out any speedup gains made by any micro-optimizations, so theres > > no point in doing so. The DPDK is different, I understand that, but the > > drawback is that it (the DPDK) needs to make optimizations that really aren't > > considered particularly important to the rest of user space. I'm trying to > > opportunistically make the DPDK as fast as possible, but I need to do it in a > > single binary, that works on a lowest common demoninator system. > > > > > > > > > > 2) Debugging - Its going to be near impossible to support an application built > > > > with a package put together this way, because you'll never be sure as to which > > > > version of the library was running when the crash occured. You can figure it > > > > out for certain, but for support/development people to need to remember to > > > > figure this out is going to be a major turn off for them, and the result will be > > > > that they simply won't use the dpdk. Its Anathema to the expectations of linux > > > > user space. > > > > > > Sorry, I just don't see this as being any harder to support than > > > multiple code paths for the same functionality. In fact, it will surely make > > > debugging easier, since you only have the one code path, just compiled > > > up in different ways. > > > > > > > Well, then by all means, become a Fedora packager, an you can take over the DPDK > > maintenece there :). Until then, you'll just have to trust me. If you have > > multiple optional code paths (especialy if they're limited to isolated features) > > its manageable. But regardless of how you look at it, building the same > > source multiple times with different cpu support means completely different > > binaries. The assembly and optimization are just plain different. They may be > > close, but they're not the same, and they need to be QA-ed independently. With > > a single build and optional code paths, all the common code is executed no > > matter what system you're running on, and its always the same. Multiple builds > > with different instruction support means that code that is identical at a source > > level may well be significantly different at a binary level, and thats not > > something I can sanely manage in a general purpose environment. > > > > > > > > > 3) QA - Building multiple versions of a library means needing to QA multiple > > > > versions of a library. If you have to have 4 builds to support different levels > > > > of optimization, you've created a 4x increase in the amount of testing you need > > > > to do to ensure consistent behavior. You need to be aware of how many different > > > > builds are available in the single rpm at all times, and find systems on which > > > > to QA which will ensure that all of the builds get tested (as they are in fact, > > > > unique builds). While you may not hit all code paths in a single build, you > > > > will at least test all the common paths. > > > > > > Again, the exact same QA conditions will also apply to an approach using > > > multiple code paths bundled into the same library. Given a choice > > > between one code path with multiple compiles, vs multiple code paths > > > each compiled only once, the multiple code paths option leaves far > > > greater scope for bugs, and when bugs do occur means that you always > > > have to find out what specific hardware it was being run on. Using the > > > exact same code multiply compiled, the vast, vast majority of bugs are > > > going to occur across all platforms and systems so you should rarely > > > need to ask what the specific platform being used is. > > > > > > > No, they won't (see above). Enabling insructions will enable the compiler to > > emit and optimize common paths differently, so identical source code will lead > > to different binary code. I need to have a single binary so that I know what > > I'm working with when someone opens a bug. I don't have that using a multiple > > binary approach. At least with multiple runtime paths (especially/specifically > > with the run time paths we've been discussing, the igbe rx vector path and the > > acl library, which are isolated), I know that, if I get a bug report and the > > backtrace ends in either location, I know I'm sepecifically dealing with that > > code. With your multiple binary approach, if I get a crash in, say > > rte_eal_init, I need to figure out if this crash happened in the sse3 compiled > > binary, the ss4.2 compiled binary, the avx binary, the avx512 binary, or the > > core2 binary. You can say thats easy, but its easy to say that when you're not > > the one that has to support it. > > > > > > > > > > The bottom line is that Distribution packaging is all about consistency and > > > > commonality. If you install something for an arch on multiple systems, its the > > > > same thing on each system, and it works in the same way, all the time. This > > > > strategy breaks that. Thats why we do run time checks for things. > > > > > > If you want to have the best tuned code running for each instruction > > > set, then commonality and consistency goes out the window anyway, > > So, this is perhaps where communication is breaking down. I don't want to have the > > best tuned code running for each instruction set. What I want is for the dpdk > > to run on a lowest common denominator platform, and be able to opportunistically > > take advantage of accelerated code paths that require advanced cpu featrues. > > > > > > Lets take the ixgbe code as an example. Note I didn't add any code paths there, > > at all (in fact I didn't add any anywhere). The ixgbe rx_burst method gets set > > according to compile time configuration. You can pick the bulk_alloc rx method, > > or the vectorized rx method at compile time (or some others I think, but thats > > not relevant). As it happened the vectorized rx path option had an implicit > > dependency on SSE4.2. Instead of requiring that all cpus that run the dpdk have > > SSE4.2, I instead chose to move that compile time decision to a run tmie > > decision, by building only the vectorized path with sse4.2 and only using it if > > we see that the cpu supports sse4.2 at run time. No new paths created, no new > > support requirements, you're still supporting the same options upstream, the > > only difference is I was able to include them both in a single binary. Thats > > better for our end users because the single binary still works everywhere. > > Thats better for our QA group because For whatever set of tests they perform, > > they only need an sse4.2 enabled system to test the one isolated path for that > > vector rx code. The rest of their tests can be conducted once, on any system, > > because the binary is exactly the same. If we compile multiple binaries, > > testing on one system doesn't mean we've tested all the code. > > > > > because two different machines calling the same function are going to > > > execute different sets of instructions. The decision then becomes: > > But thats not at all what I wanted. I want two different machines calling the > > same function to execute the same instructions 99.9999% of the time. The only > > time I want to diverge from that is in isolated paths where we can take > > advantage of a feature that we otherwise could not (i.e. the ixgbe and acl > > code). I look at it like the alternatives code in linux. There are these > > isolated areas where you have limited bits of code that at run time are > > re-written to use available cpu features. 99.9% of the code is identical, but > > in these little spots its ok to diverge from simmilarity because they'er isolated, > > easily identifiable > > > > > a) whether you need multiple sets of instructions - if no then you pay > > > with lack of performance > > > b) how you get those multiple sets of instructions > > > c) how you validate those multiple sets of instructions. > > > > > > As is clear by now :-), my preference by far is to have multiple sets of > > > instructions come from a single code base, as less code means less > > > maintenance, and above all, fewer bugs. If that can't be done, then we > > > need to look carefully at each code path being added and do a > > > cost-benefit analysis on it. > > > > > > > Yes, its quite clear :), I think its equally clear that I need a single binary, > > and would like to opportunisitcally enhance it where possible without losing the > > fact that its a single binary. > > > > I suppose its all somewhat moot at this point though, The reduction to sse3 for > > ixgbe seems agreeable to everyone, and it lets me preserve single binary builds > > there. I'm currently working on the ACL library, as you noted thats a tougher > > nut to crack. I think I'll have it done early next week (though i'm sure my > > translation of the instruction set reference to C will need some through testing > > :)). I'll post it when its ready. > > > > Agreed, lets get everything working to a common baseline anyway. In > terms of the number of RX and TX functions, you mentioned, I'd hope in future > we could cut the number of them down a bit as we make the vector versions more > generally applicable, but that's a whole discussion for another day. > As long as all paths can continue to work with a generic x86_64 compilation, I'm good with that. I'd still like to see some options be run time, rather than compile time selectable, but we can deal with that much later. Best Neil > /Bruce > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-31 18:10 ` Neil Horman 2014-07-31 18:36 ` Bruce Richardson @ 2014-07-31 21:53 ` Thomas Monjalon 1 sibling, 0 replies; 45+ messages in thread From: Thomas Monjalon @ 2014-07-31 21:53 UTC (permalink / raw) To: Neil Horman; +Cc: dev 2014-07-31 14:10, Neil Horman: > On Thu, Jul 31, 2014 at 10:32:28AM -0400, Neil Horman wrote: > > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote: > > Please, let's focus on the first item and we could discuss about performance > > > later. Having some different code path choosed at runtime is a big rework and > > > imply changing the compilation model (RFC welcome). > > > > Even if I misinterpreted your statement above, I'm still not sure why your > asserting this. Fixing the build to work with the default target machine is > good, and should be undertaken, and I'll happily do so, but why reject the > solution in front of you to wait for it? I'm not rejecting the solution. Let's try to improve performance of the default build with runtime checks. Seeing patches and benchmarks will be interesting. You're opening a door and I don't know if we'll see the ceiling of this room :) -- Thomas ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features 2014-07-31 14:32 ` Neil Horman 2014-07-31 18:10 ` Neil Horman @ 2014-07-31 21:25 ` Thomas Monjalon 1 sibling, 0 replies; 45+ messages in thread From: Thomas Monjalon @ 2014-07-31 21:25 UTC (permalink / raw) To: Neil Horman; +Cc: dev 2014-07-31 10:32, Neil Horman: > On Thu, Jul 31, 2014 at 03:26:45PM +0200, Thomas Monjalon wrote: > > Neil, we are mixing 2 different problems here. > > 1) we have to fix default build (without SSE-4.2) > > Thats nothing to fix, thats a configuration issue. Just build for a lesser > machine. I've already done that in the fedora build, using the defalut machine > target. What exactly is missing from that? I mean that building with RTE_MACHINE=default is broken in ixgbe and acl. So the main priority is to fix it. But performance of the native build must not be degraded. And if we can have good performance with a default build, it's fine. -- Thomas ^ permalink raw reply [flat|nested] 45+ messages in thread
end of thread, other threads:[~2014-08-02 12:54 UTC | newest] Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2014-07-29 20:24 [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features Neil Horman 2014-07-29 20:24 ` [dpdk-dev] [PATCH 1/2] ixgbe: test sse4.2 support at runtime for vectorized receive operations Neil Horman 2014-07-29 20:24 ` [dpdk-dev] [PATCH 2/2] acl: Preform dynamic sse4.2 support check Neil Horman 2014-07-30 12:07 ` [dpdk-dev] [PATCH 0/2] dpdk: Allow for dynamic enablement of some isolated features Ananyev, Konstantin 2014-07-30 13:01 ` Neil Horman 2014-07-30 13:44 ` Ananyev, Konstantin 2014-07-30 14:49 ` [dpdk-dev] [PATCH v2 " Neil Horman 2014-07-30 14:49 ` [dpdk-dev] [PATCH v2 1/2] ixgbe: test sse4.2 support at runtime for vectorized receive operations Neil Horman 2014-07-30 14:49 ` [dpdk-dev] [PATCH v2 2/2] acl: Preform dynamic sse4.2 support check Neil Horman 2014-07-30 15:36 ` [dpdk-dev] [PATCH v2 0/2] dpdk: Allow for dynamic enablement of some isolated features Ananyev, Konstantin 2014-07-30 19:03 ` Venky Venkatesan 2014-07-30 19:17 ` Neil Horman 2014-07-30 19:34 ` Neil Horman 2014-07-30 18:59 ` [dpdk-dev] [PATCH " Bruce Richardson 2014-07-30 19:28 ` Neil Horman 2014-07-30 21:09 ` Bruce Richardson 2014-07-31 9:30 ` Thomas Monjalon 2014-07-31 11:36 ` Ananyev, Konstantin 2014-07-31 13:13 ` Neil Horman 2014-07-31 13:26 ` Thomas Monjalon 2014-07-31 14:32 ` Neil Horman 2014-07-31 18:10 ` Neil Horman 2014-07-31 18:36 ` Bruce Richardson 2014-07-31 19:01 ` Neil Horman 2014-07-31 20:19 ` Bruce Richardson 2014-08-01 13:36 ` Neil Horman 2014-08-01 13:56 ` Ananyev, Konstantin 2014-08-01 14:26 ` Venkatesan, Venky 2014-08-01 14:27 ` Neil Horman 2014-07-31 19:58 ` John W. Linville 2014-07-31 20:20 ` Bruce Richardson 2014-07-31 20:32 ` John W. Linville 2014-08-01 8:46 ` Vincent JARDIN 2014-08-01 14:06 ` Neil Horman 2014-08-01 14:57 ` Vincent JARDIN 2014-08-01 15:19 ` Neil Horman 2014-07-31 20:10 ` Neil Horman 2014-07-31 20:25 ` Bruce Richardson 2014-08-01 15:06 ` Neil Horman 2014-08-01 19:22 ` Bruce Richardson 2014-08-01 20:43 ` Neil Horman 2014-08-01 21:08 ` Bruce Richardson 2014-08-02 12:56 ` Neil Horman 2014-07-31 21:53 ` Thomas Monjalon 2014-07-31 21:25 ` Thomas Monjalon
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).