* [dpdk-dev] [PATCH] acl: fix build issue with some arm64 compiler @ 2019-06-06 14:50 jerinj 2019-06-06 15:55 ` Michael Santana Francisco ` (3 more replies) 0 siblings, 4 replies; 17+ messages in thread From: jerinj @ 2019-06-06 14:50 UTC (permalink / raw) To: dev Cc: thomas, gavin.hu, honnappa.nagarahalli, msantana, aconole, Jerin Jacob, stable From: Jerin Jacob <jerinj@marvell.com> Some compilers reporting the following error, though the existing code doesn't have any uninitialized variable case. Just to make compiler happy, initialize the int32x4_t variable one shot in C language. ../lib/librte_acl/acl_run_neon.h: In function 'search_neon_4' ../lib/librte_acl/acl_run_neon.h:230:12: error: 'input' may be used uninitialized in this function [-Werror=maybe-uninitialized] int32x4_t input; Fixes: 34fa6c27c156 ("acl: add NEON optimization for ARMv8") Cc: stable@dpdk.org Signed-off-by: Jerin Jacob <jerinj@marvell.com> --- lib/librte_acl/acl_run_neon.h | 29 ++++++++++++----------------- 1 file changed, 12 insertions(+), 17 deletions(-) diff --git a/lib/librte_acl/acl_run_neon.h b/lib/librte_acl/acl_run_neon.h index 01b9766d8..dc9e9efe9 100644 --- a/lib/librte_acl/acl_run_neon.h +++ b/lib/librte_acl/acl_run_neon.h @@ -165,7 +165,6 @@ search_neon_8(const struct rte_acl_ctx *ctx, const uint8_t **data, uint64_t index_array[8]; struct completion cmplt[8]; struct parms parms[8]; - int32x4_t input0, input1; acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, total_packets, categories, ctx->trans_table); @@ -181,17 +180,14 @@ search_neon_8(const struct rte_acl_ctx *ctx, const uint8_t **data, while (flows.started > 0) { /* Gather 4 bytes of input data for each stream. */ - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input0, 0); - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 4), input1, 0); - - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input0, 1); - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 5), input1, 1); - - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input0, 2); - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 6), input1, 2); - - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input0, 3); - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 7), input1, 3); + int32x4_t input0 = {GET_NEXT_4BYTES(parms, 0), + GET_NEXT_4BYTES(parms, 1), + GET_NEXT_4BYTES(parms, 2), + GET_NEXT_4BYTES(parms, 3)}; + int32x4_t input1 = {GET_NEXT_4BYTES(parms, 4), + GET_NEXT_4BYTES(parms, 5), + GET_NEXT_4BYTES(parms, 6), + GET_NEXT_4BYTES(parms, 7)}; /* Process the 4 bytes of input on each stream. */ @@ -227,7 +223,6 @@ search_neon_4(const struct rte_acl_ctx *ctx, const uint8_t **data, uint64_t index_array[4]; struct completion cmplt[4]; struct parms parms[4]; - int32x4_t input; acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, total_packets, categories, ctx->trans_table); @@ -242,10 +237,10 @@ search_neon_4(const struct rte_acl_ctx *ctx, const uint8_t **data, while (flows.started > 0) { /* Gather 4 bytes of input data for each stream. */ - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input, 0); - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input, 1); - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input, 2); - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input, 3); + int32x4_t input = {GET_NEXT_4BYTES(parms, 0), + GET_NEXT_4BYTES(parms, 1), + GET_NEXT_4BYTES(parms, 2), + GET_NEXT_4BYTES(parms, 3)}; /* Process the 4 bytes of input on each stream. */ input = transition4(input, flows.trans, index_array); -- 2.21.0 ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 compiler 2019-06-06 14:50 [dpdk-dev] [PATCH] acl: fix build issue with some arm64 compiler jerinj @ 2019-06-06 15:55 ` Michael Santana Francisco 2019-06-07 5:42 ` Honnappa Nagarahalli 2019-06-07 5:35 ` Honnappa Nagarahalli ` (2 subsequent siblings) 3 siblings, 1 reply; 17+ messages in thread From: Michael Santana Francisco @ 2019-06-06 15:55 UTC (permalink / raw) To: jerinj, dev; +Cc: thomas, gavin.hu, honnappa.nagarahalli, aconole, stable On 6/6/19 10:50 AM, jerinj@marvell.com wrote: > From: Jerin Jacob <jerinj@marvell.com> > > Some compilers reporting the following error, though the existing > code doesn't have any uninitialized variable case. > Just to make compiler happy, initialize the int32x4_t variable > one shot in C language. > > ../lib/librte_acl/acl_run_neon.h: In function 'search_neon_4' > ../lib/librte_acl/acl_run_neon.h:230:12: error: 'input' may be > used uninitialized in this function [-Werror=maybe-uninitialized] > int32x4_t input; > > Fixes: 34fa6c27c156 ("acl: add NEON optimization for ARMv8") > Cc: stable@dpdk.org > > Signed-off-by: Jerin Jacob <jerinj@marvell.com> > --- > lib/librte_acl/acl_run_neon.h | 29 ++++++++++++----------------- > 1 file changed, 12 insertions(+), 17 deletions(-) > > diff --git a/lib/librte_acl/acl_run_neon.h b/lib/librte_acl/acl_run_neon.h > index 01b9766d8..dc9e9efe9 100644 > --- a/lib/librte_acl/acl_run_neon.h > +++ b/lib/librte_acl/acl_run_neon.h > @@ -165,7 +165,6 @@ search_neon_8(const struct rte_acl_ctx *ctx, const uint8_t **data, > uint64_t index_array[8]; > struct completion cmplt[8]; > struct parms parms[8]; > - int32x4_t input0, input1; > > acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, > total_packets, categories, ctx->trans_table); > @@ -181,17 +180,14 @@ search_neon_8(const struct rte_acl_ctx *ctx, const uint8_t **data, > > while (flows.started > 0) { > /* Gather 4 bytes of input data for each stream. */ > - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input0, 0); > - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 4), input1, 0); > - > - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input0, 1); > - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 5), input1, 1); > - > - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input0, 2); > - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 6), input1, 2); > - > - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input0, 3); > - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 7), input1, 3); > + int32x4_t input0 = {GET_NEXT_4BYTES(parms, 0), > + GET_NEXT_4BYTES(parms, 1), > + GET_NEXT_4BYTES(parms, 2), > + GET_NEXT_4BYTES(parms, 3)}; > + int32x4_t input1 = {GET_NEXT_4BYTES(parms, 4), > + GET_NEXT_4BYTES(parms, 5), > + GET_NEXT_4BYTES(parms, 6), > + GET_NEXT_4BYTES(parms, 7)}; > > /* Process the 4 bytes of input on each stream. */ > > @@ -227,7 +223,6 @@ search_neon_4(const struct rte_acl_ctx *ctx, const uint8_t **data, > uint64_t index_array[4]; > struct completion cmplt[4]; > struct parms parms[4]; > - int32x4_t input; > > acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, > total_packets, categories, ctx->trans_table); > @@ -242,10 +237,10 @@ search_neon_4(const struct rte_acl_ctx *ctx, const uint8_t **data, > > while (flows.started > 0) { > /* Gather 4 bytes of input data for each stream. */ > - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input, 0); > - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input, 1); > - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input, 2); > - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input, 3); > + int32x4_t input = {GET_NEXT_4BYTES(parms, 0), > + GET_NEXT_4BYTES(parms, 1), > + GET_NEXT_4BYTES(parms, 2), > + GET_NEXT_4BYTES(parms, 3)}; > > /* Process the 4 bytes of input on each stream. */ > input = transition4(input, flows.trans, index_array); Fixed on travis: https://travis-ci.com/Maickii/dpdk-2/builds/114612090 Acked-by: Michael Santana <msantana@redhat.com> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 compiler 2019-06-06 15:55 ` Michael Santana Francisco @ 2019-06-07 5:42 ` Honnappa Nagarahalli 0 siblings, 0 replies; 17+ messages in thread From: Honnappa Nagarahalli @ 2019-06-07 5:42 UTC (permalink / raw) To: msantana, jerinj, dev Cc: thomas, Gavin Hu (Arm Technology China), aconole, stable, Honnappa Nagarahalli, nd, nd On 6/6/19 10:50 AM, mailto:jerinj@marvell.com wrote: From: Jerin Jacob mailto:jerinj@marvell.com Some compilers reporting the following error, though the existing code doesn't have any uninitialized variable case. Just to make compiler happy, initialize the int32x4_t variable one shot in C language. ../lib/librte_acl/acl_run_neon.h: In function 'search_neon_4' ../lib/librte_acl/acl_run_neon.h:230:12: error: 'input' may be used uninitialized in this function [-Werror=maybe-uninitialized] int32x4_t input; Fixes: 34fa6c27c156 ("acl: add NEON optimization for ARMv8") Cc: mailto:stable@dpdk.org Signed-off-by: Jerin Jacob mailto:jerinj@marvell.com --- lib/librte_acl/acl_run_neon.h | 29 ++++++++++++----------------- 1 file changed, 12 insertions(+), 17 deletions(-) diff --git a/lib/librte_acl/acl_run_neon.h b/lib/librte_acl/acl_run_neon.h index 01b9766d8..dc9e9efe9 100644 --- a/lib/librte_acl/acl_run_neon.h +++ b/lib/librte_acl/acl_run_neon.h @@ -165,7 +165,6 @@ search_neon_8(const struct rte_acl_ctx *ctx, const uint8_t **data, uint64_t index_array[8]; struct completion cmplt[8]; struct parms parms[8]; - int32x4_t input0, input1; acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, total_packets, categories, ctx->trans_table); @@ -181,17 +180,14 @@ search_neon_8(const struct rte_acl_ctx *ctx, const uint8_t **data, while (flows.started > 0) { /* Gather 4 bytes of input data for each stream. */ - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input0, 0); - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 4), input1, 0); - - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input0, 1); - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 5), input1, 1); - - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input0, 2); - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 6), input1, 2); - - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input0, 3); - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 7), input1, 3); + int32x4_t input0 = {GET_NEXT_4BYTES(parms, 0), + GET_NEXT_4BYTES(parms, 1), + GET_NEXT_4BYTES(parms, 2), + GET_NEXT_4BYTES(parms, 3)}; + int32x4_t input1 = {GET_NEXT_4BYTES(parms, 4), + GET_NEXT_4BYTES(parms, 5), + GET_NEXT_4BYTES(parms, 6), + GET_NEXT_4BYTES(parms, 7)}; /* Process the 4 bytes of input on each stream. */ @@ -227,7 +223,6 @@ search_neon_4(const struct rte_acl_ctx *ctx, const uint8_t **data, uint64_t index_array[4]; struct completion cmplt[4]; struct parms parms[4]; - int32x4_t input; acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, total_packets, categories, ctx->trans_table); @@ -242,10 +237,10 @@ search_neon_4(const struct rte_acl_ctx *ctx, const uint8_t **data, while (flows.started > 0) { /* Gather 4 bytes of input data for each stream. */ - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input, 0); - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input, 1); - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input, 2); - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input, 3); + int32x4_t input = {GET_NEXT_4BYTES(parms, 0), + GET_NEXT_4BYTES(parms, 1), + GET_NEXT_4BYTES(parms, 2), + GET_NEXT_4BYTES(parms, 3)}; /* Process the 4 bytes of input on each stream. */ input = transition4(input, flows.trans, index_array); Fixed on travis: https://travis-ci.com/Maickii/dpdk-2/builds/114612090 Acked-by: Michael Santana mailto:msantana@redhat.com [Honnappa] Prefer to go with Aaron's patch with a temp variable for setting the first lane. Mixing of NEON intrinsics and GCC vector extensions is not recommended as per Arm C Language Extensions guide 12.2.6 ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 compiler 2019-06-06 14:50 [dpdk-dev] [PATCH] acl: fix build issue with some arm64 compiler jerinj 2019-06-06 15:55 ` Michael Santana Francisco @ 2019-06-07 5:35 ` Honnappa Nagarahalli 2019-06-07 6:21 ` Jerin Jacob Kollanukkaran 2019-06-10 12:10 ` Aaron Conole 2019-06-11 14:15 ` [dpdk-dev] [PATCH v2] " jerinj 3 siblings, 1 reply; 17+ messages in thread From: Honnappa Nagarahalli @ 2019-06-07 5:35 UTC (permalink / raw) To: jerinj, dev Cc: thomas, Gavin Hu (Arm Technology China), msantana, aconole, jerinj, Honnappa Nagarahalli, stable, nd > Subject: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 compiler > > From: Jerin Jacob <jerinj@marvell.com> > > Some compilers reporting the following error, though the existing code > doesn't have any uninitialized variable case. > Just to make compiler happy, initialize the int32x4_t variable one shot in C > language. > > ../lib/librte_acl/acl_run_neon.h: In function 'search_neon_4' > ../lib/librte_acl/acl_run_neon.h:230:12: error: 'input' may be used > uninitialized in this function [-Werror=maybe-uninitialized] > int32x4_t input; > > Fixes: 34fa6c27c156 ("acl: add NEON optimization for ARMv8") > Cc: stable@dpdk.org > > Signed-off-by: Jerin Jacob <jerinj@marvell.com> > --- > lib/librte_acl/acl_run_neon.h | 29 ++++++++++++----------------- > 1 file changed, 12 insertions(+), 17 deletions(-) > > diff --git a/lib/librte_acl/acl_run_neon.h b/lib/librte_acl/acl_run_neon.h > index 01b9766d8..dc9e9efe9 100644 > --- a/lib/librte_acl/acl_run_neon.h > +++ b/lib/librte_acl/acl_run_neon.h > @@ -165,7 +165,6 @@ search_neon_8(const struct rte_acl_ctx *ctx, const > uint8_t **data, > uint64_t index_array[8]; > struct completion cmplt[8]; > struct parms parms[8]; > - int32x4_t input0, input1; > > acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, > total_packets, categories, ctx->trans_table); @@ -181,17 > +180,14 @@ search_neon_8(const struct rte_acl_ctx *ctx, const uint8_t > **data, > > while (flows.started > 0) { > /* Gather 4 bytes of input data for each stream. */ > - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), > input0, 0); > - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 4), > input1, 0); > - > - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), > input0, 1); > - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 5), > input1, 1); > - > - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), > input0, 2); > - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 6), > input1, 2); > - > - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), > input0, 3); > - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 7), > input1, 3); > + int32x4_t input0 = {GET_NEXT_4BYTES(parms, 0), > + GET_NEXT_4BYTES(parms, 1), > + GET_NEXT_4BYTES(parms, 2), > + GET_NEXT_4BYTES(parms, 3)}; > + int32x4_t input1 = {GET_NEXT_4BYTES(parms, 4), > + GET_NEXT_4BYTES(parms, 5), > + GET_NEXT_4BYTES(parms, 6), > + GET_NEXT_4BYTES(parms, 7)}; > This mixes the use of NEON intrinsics with GCC vector extensions. ACLE (Arm C Language Extensions) specifically recommends not to mix the two methods in section 12.2.6. IMO, Aaron's suggestion of using a temp vector should be good. > /* Process the 4 bytes of input on each stream. */ > > @@ -227,7 +223,6 @@ search_neon_4(const struct rte_acl_ctx *ctx, const > uint8_t **data, > uint64_t index_array[4]; > struct completion cmplt[4]; > struct parms parms[4]; > - int32x4_t input; > > acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, > total_packets, categories, ctx->trans_table); @@ -242,10 > +237,10 @@ search_neon_4(const struct rte_acl_ctx *ctx, const uint8_t > **data, > > while (flows.started > 0) { > /* Gather 4 bytes of input data for each stream. */ > - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input, > 0); > - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input, > 1); > - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input, > 2); > - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input, > 3); > + int32x4_t input = {GET_NEXT_4BYTES(parms, 0), > + GET_NEXT_4BYTES(parms, 1), > + GET_NEXT_4BYTES(parms, 2), > + GET_NEXT_4BYTES(parms, 3)}; > > /* Process the 4 bytes of input on each stream. */ > input = transition4(input, flows.trans, index_array); > -- > 2.21.0 ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 compiler 2019-06-07 5:35 ` Honnappa Nagarahalli @ 2019-06-07 6:21 ` Jerin Jacob Kollanukkaran 2019-06-10 5:29 ` Honnappa Nagarahalli 0 siblings, 1 reply; 17+ messages in thread From: Jerin Jacob Kollanukkaran @ 2019-06-07 6:21 UTC (permalink / raw) To: Honnappa Nagarahalli, dev Cc: thomas, Gavin Hu (Arm Technology China), msantana, aconole, stable, nd > -----Original Message----- > From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> > Sent: Friday, June 7, 2019 11:05 AM > To: Jerin Jacob Kollanukkaran <jerinj@marvell.com>; dev@dpdk.org > Cc: thomas@monjalon.net; Gavin Hu (Arm Technology China) > <Gavin.Hu@arm.com>; msantana@redhat.com; aconole@redhat.com; Jerin > Jacob Kollanukkaran <jerinj@marvell.com>; Honnappa Nagarahalli > <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org; nd <nd@arm.com> > Subject: [EXT] RE: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 > compiler > > ---------------------------------------------------------------------- > > Subject: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 > > compiler > > > > From: Jerin Jacob <jerinj@marvell.com> > > > > Some compilers reporting the following error, though the existing code > > doesn't have any uninitialized variable case. > > Just to make compiler happy, initialize the int32x4_t variable one > > shot in C language. > > > > ../lib/librte_acl/acl_run_neon.h: In function 'search_neon_4' > > ../lib/librte_acl/acl_run_neon.h:230:12: error: 'input' may be used > > uninitialized in this function [-Werror=maybe-uninitialized] > > int32x4_t input; > > > > Fixes: 34fa6c27c156 ("acl: add NEON optimization for ARMv8") > > Cc: stable@dpdk.org > > > > Signed-off-by: Jerin Jacob <jerinj@marvell.com> > > --- > > lib/librte_acl/acl_run_neon.h | 29 ++++++++++++----------------- > > 1 file changed, 12 insertions(+), 17 deletions(-) > > > > diff --git a/lib/librte_acl/acl_run_neon.h > > b/lib/librte_acl/acl_run_neon.h index 01b9766d8..dc9e9efe9 100644 > > --- a/lib/librte_acl/acl_run_neon.h > > +++ b/lib/librte_acl/acl_run_neon.h > > @@ -165,7 +165,6 @@ search_neon_8(const struct rte_acl_ctx *ctx, const > > uint8_t **data, > > uint64_t index_array[8]; > > struct completion cmplt[8]; > > struct parms parms[8]; > > - int32x4_t input0, input1; > > > > acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, > > total_packets, categories, ctx->trans_table); @@ -181,17 > > +180,14 @@ search_neon_8(const struct rte_acl_ctx *ctx, const uint8_t > > **data, > > > > while (flows.started > 0) { > > /* Gather 4 bytes of input data for each stream. */ > > - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), > > input0, 0); > > - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 4), > > input1, 0); > > - > > - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), > > input0, 1); > > - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 5), > > input1, 1); > > - > > - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), > > input0, 2); > > - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 6), > > input1, 2); > > - > > - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), > > input0, 3); > > - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 7), > > input1, 3); > > + int32x4_t input0 = {GET_NEXT_4BYTES(parms, 0), > > + GET_NEXT_4BYTES(parms, 1), > > + GET_NEXT_4BYTES(parms, 2), > > + GET_NEXT_4BYTES(parms, 3)}; > > + int32x4_t input1 = {GET_NEXT_4BYTES(parms, 4), > > + GET_NEXT_4BYTES(parms, 5), > > + GET_NEXT_4BYTES(parms, 6), > > + GET_NEXT_4BYTES(parms, 7)}; > > > This mixes the use of NEON intrinsics with GCC vector extensions. ACLE (Arm C > Language Extensions) specifically recommends not to mix the two methods in > section 12.2.6. IMO, Aaron's suggestion of using a temp vector should be good. We are using this pattern across DPDK and SSE for x86 as well. https://git.dpdk.org/dpdk/tree/drivers/net/i40e/i40e_rxtx_vec_neon.c#n91 Since it used in fastpath, a temp variable would be additional cost for no reason. If GCC supports it then I think it is fine, I think, above usage matters with C++ portability. > > > /* Process the 4 bytes of input on each stream. */ > > > > @@ -227,7 +223,6 @@ search_neon_4(const struct rte_acl_ctx *ctx, const > > uint8_t **data, > > uint64_t index_array[4]; > > struct completion cmplt[4]; > > struct parms parms[4]; > > - int32x4_t input; > > > > acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, > > total_packets, categories, ctx->trans_table); @@ -242,10 > > +237,10 @@ search_neon_4(const struct rte_acl_ctx *ctx, const uint8_t > > **data, > > > > while (flows.started > 0) { > > /* Gather 4 bytes of input data for each stream. */ > > - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input, > > 0); > > - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input, > > 1); > > - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input, > > 2); > > - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input, > > 3); > > + int32x4_t input = {GET_NEXT_4BYTES(parms, 0), > > + GET_NEXT_4BYTES(parms, 1), > > + GET_NEXT_4BYTES(parms, 2), > > + GET_NEXT_4BYTES(parms, 3)}; > > > > /* Process the 4 bytes of input on each stream. */ > > input = transition4(input, flows.trans, index_array); > > -- > > 2.21.0 ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 compiler 2019-06-07 6:21 ` Jerin Jacob Kollanukkaran @ 2019-06-10 5:29 ` Honnappa Nagarahalli 2019-06-10 9:39 ` Jerin Jacob Kollanukkaran 0 siblings, 1 reply; 17+ messages in thread From: Honnappa Nagarahalli @ 2019-06-10 5:29 UTC (permalink / raw) To: jerinj, dev Cc: thomas, Gavin Hu (Arm Technology China), msantana, aconole, stable, Honnappa Nagarahalli, nd, nd > > > > ---------------------------------------------------------------------- > > > Subject: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 > > > compiler > > > > > > From: Jerin Jacob <jerinj@marvell.com> > > > > > > Some compilers reporting the following error, though the existing > > > code doesn't have any uninitialized variable case. > > > Just to make compiler happy, initialize the int32x4_t variable one > > > shot in C language. > > > > > > ../lib/librte_acl/acl_run_neon.h: In function 'search_neon_4' > > > ../lib/librte_acl/acl_run_neon.h:230:12: error: 'input' may be used > > > uninitialized in this function [-Werror=maybe-uninitialized] > > > int32x4_t input; > > > > > > Fixes: 34fa6c27c156 ("acl: add NEON optimization for ARMv8") > > > Cc: stable@dpdk.org > > > > > > Signed-off-by: Jerin Jacob <jerinj@marvell.com> > > > --- > > > lib/librte_acl/acl_run_neon.h | 29 ++++++++++++----------------- > > > 1 file changed, 12 insertions(+), 17 deletions(-) > > > > > > diff --git a/lib/librte_acl/acl_run_neon.h > > > b/lib/librte_acl/acl_run_neon.h index 01b9766d8..dc9e9efe9 100644 > > > --- a/lib/librte_acl/acl_run_neon.h > > > +++ b/lib/librte_acl/acl_run_neon.h > > > @@ -165,7 +165,6 @@ search_neon_8(const struct rte_acl_ctx *ctx, > > > const uint8_t **data, > > > uint64_t index_array[8]; > > > struct completion cmplt[8]; > > > struct parms parms[8]; > > > - int32x4_t input0, input1; > > > > > > acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, > > > total_packets, categories, ctx->trans_table); @@ -181,17 > > > +180,14 @@ search_neon_8(const struct rte_acl_ctx *ctx, const > > > +uint8_t > > > **data, > > > > > > while (flows.started > 0) { > > > /* Gather 4 bytes of input data for each stream. */ > > > - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), > > > input0, 0); > > > - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 4), > > > input1, 0); > > > - > > > - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), > > > input0, 1); > > > - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 5), > > > input1, 1); > > > - > > > - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), > > > input0, 2); > > > - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 6), > > > input1, 2); > > > - > > > - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), > > > input0, 3); > > > - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 7), > > > input1, 3); > > > + int32x4_t input0 = {GET_NEXT_4BYTES(parms, 0), > > > + GET_NEXT_4BYTES(parms, 1), > > > + GET_NEXT_4BYTES(parms, 2), > > > + GET_NEXT_4BYTES(parms, 3)}; > > > + int32x4_t input1 = {GET_NEXT_4BYTES(parms, 4), > > > + GET_NEXT_4BYTES(parms, 5), > > > + GET_NEXT_4BYTES(parms, 6), > > > + GET_NEXT_4BYTES(parms, 7)}; > > > > > This mixes the use of NEON intrinsics with GCC vector extensions. ACLE > > (Arm C Language Extensions) specifically recommends not to mix the two > > methods in section 12.2.6. IMO, Aaron's suggestion of using a temp vector > should be good. > > We are using this pattern across DPDK and SSE for x86 as well. > https://git.dpdk.org/dpdk/tree/drivers/net/i40e/i40e_rxtx_vec_neon.c#n91 I am not sure about x86, I have not looked at a document similar to ACLE for x86. IMO, it is not relevant here as this is Arm specific code. > > Since it used in fastpath, a temp variable would be additional cost for no > reason. Then, I would suggest we can go with using 'vdupq_n_s32'. > If GCC supports it then I think it is fine, I think, above usage matters with C++ > portability. I did not understand the C++ portability part. Can you elaborate more? > > > > > > > /* Process the 4 bytes of input on each stream. */ > > > > > > @@ -227,7 +223,6 @@ search_neon_4(const struct rte_acl_ctx *ctx, > > > const uint8_t **data, > > > uint64_t index_array[4]; > > > struct completion cmplt[4]; > > > struct parms parms[4]; > > > - int32x4_t input; > > > > > > acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, > > > total_packets, categories, ctx->trans_table); @@ -242,10 > > > +237,10 @@ search_neon_4(const struct rte_acl_ctx *ctx, const > > > +uint8_t > > > **data, > > > > > > while (flows.started > 0) { > > > /* Gather 4 bytes of input data for each stream. */ > > > - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input, > > > 0); > > > - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input, > > > 1); > > > - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input, > > > 2); > > > - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input, > > > 3); > > > + int32x4_t input = {GET_NEXT_4BYTES(parms, 0), > > > + GET_NEXT_4BYTES(parms, 1), > > > + GET_NEXT_4BYTES(parms, 2), > > > + GET_NEXT_4BYTES(parms, 3)}; > > > > > > /* Process the 4 bytes of input on each stream. */ > > > input = transition4(input, flows.trans, index_array); > > > -- > > > 2.21.0 ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 compiler 2019-06-10 5:29 ` Honnappa Nagarahalli @ 2019-06-10 9:39 ` Jerin Jacob Kollanukkaran 2019-06-11 1:27 ` Honnappa Nagarahalli 0 siblings, 1 reply; 17+ messages in thread From: Jerin Jacob Kollanukkaran @ 2019-06-10 9:39 UTC (permalink / raw) To: Honnappa Nagarahalli, dev Cc: thomas, Gavin Hu (Arm Technology China), msantana, aconole, stable, nd, nd > -----Original Message----- > From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> > Sent: Monday, June 10, 2019 11:00 AM > To: Jerin Jacob Kollanukkaran <jerinj@marvell.com>; dev@dpdk.org > Cc: thomas@monjalon.net; Gavin Hu (Arm Technology China) > <Gavin.Hu@arm.com>; msantana@redhat.com; aconole@redhat.com; > stable@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; > nd <nd@arm.com>; nd <nd@arm.com> > Subject: [EXT] RE: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 > compiler > > > > -- > > > > Subject: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 > > > > compiler > > > > > > > > From: Jerin Jacob <jerinj@marvell.com> > > > > > > > > Some compilers reporting the following error, though the existing > > > > code doesn't have any uninitialized variable case. > > > > Just to make compiler happy, initialize the int32x4_t variable one > > > > shot in C language. > > > > > > > > ../lib/librte_acl/acl_run_neon.h: In function 'search_neon_4' > > > > ../lib/librte_acl/acl_run_neon.h:230:12: error: 'input' may be > > > > used uninitialized in this function [-Werror=maybe-uninitialized] > > > > int32x4_t input; > > > > > > > > Fixes: 34fa6c27c156 ("acl: add NEON optimization for ARMv8") > > > > Cc: stable@dpdk.org > > > > > > > > Signed-off-by: Jerin Jacob <jerinj@marvell.com> > > > > --- > > > > lib/librte_acl/acl_run_neon.h | 29 ++++++++++++----------------- > > > > 1 file changed, 12 insertions(+), 17 deletions(-) > > > > > > > > diff --git a/lib/librte_acl/acl_run_neon.h > > > > b/lib/librte_acl/acl_run_neon.h index 01b9766d8..dc9e9efe9 100644 > > > > --- a/lib/librte_acl/acl_run_neon.h > > > > +++ b/lib/librte_acl/acl_run_neon.h > > > > @@ -165,7 +165,6 @@ search_neon_8(const struct rte_acl_ctx *ctx, > > > > const uint8_t **data, > > > > uint64_t index_array[8]; > > > > struct completion cmplt[8]; > > > > struct parms parms[8]; > > > > - int32x4_t input0, input1; > > > > > > > > acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, > > > > total_packets, categories, ctx->trans_table); @@ -181,17 > > > > +180,14 @@ search_neon_8(const struct rte_acl_ctx *ctx, const > > > > +uint8_t > > > > **data, > > > > > > > > while (flows.started > 0) { > > > > /* Gather 4 bytes of input data for each stream. */ > > > > - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), > > > > input0, 0); > > > > - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 4), > > > > input1, 0); > > > > - > > > > - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), > > > > input0, 1); > > > > - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 5), > > > > input1, 1); > > > > - > > > > - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), > > > > input0, 2); > > > > - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 6), > > > > input1, 2); > > > > - > > > > - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), > > > > input0, 3); > > > > - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 7), > > > > input1, 3); > > > > + int32x4_t input0 = {GET_NEXT_4BYTES(parms, 0), > > > > + GET_NEXT_4BYTES(parms, 1), > > > > + GET_NEXT_4BYTES(parms, 2), > > > > + GET_NEXT_4BYTES(parms, 3)}; > > > > + int32x4_t input1 = {GET_NEXT_4BYTES(parms, 4), > > > > + GET_NEXT_4BYTES(parms, 5), > > > > + GET_NEXT_4BYTES(parms, 6), > > > > + GET_NEXT_4BYTES(parms, 7)}; > > > > > > > This mixes the use of NEON intrinsics with GCC vector extensions. > > > ACLE (Arm C Language Extensions) specifically recommends not to mix > > > the two methods in section 12.2.6. IMO, Aaron's suggestion of using > > > a temp vector > > should be good. > > > > We are using this pattern across DPDK and SSE for x86 as well. > > https://git.dpdk.org/dpdk/tree/drivers/net/i40e/i40e_rxtx_vec_neon.c#n > > 91 > I am not sure about x86, I have not looked at a document similar to ACLE for > x86. IMO, it is not relevant here as this is Arm specific code. What I meant was its been already used in DPDK for arm64. https://git.dpdk.org/dpdk/tree/drivers/net/i40e/i40e_rxtx_vec_neon.c#n91 Please see offial page vector gcc gcc documentation. The examples are using this scheme. https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html This is to just create 'input' variable. I am fine to use any other scheme with out additional cost of instructions. > > > > > Since it used in fastpath, a temp variable would be additional cost > > for no reason. > Then, I would suggest we can go with using 'vdupq_n_s32'. We have to form uint64x2_t with 4 x uint32_t variable, How does 'vdupq_n_s32' help here? Can you share code snippet without any temp variable? > > > If GCC supports it then I think it is fine, I think, above usage > > matters with C++ portability. > I did not understand the C++ portability part. Can you elaborate more? > > > > > > > > > > > > /* Process the 4 bytes of input on each stream. */ > > > > > > > > @@ -227,7 +223,6 @@ search_neon_4(const struct rte_acl_ctx *ctx, > > > > const uint8_t **data, > > > > uint64_t index_array[4]; > > > > struct completion cmplt[4]; > > > > struct parms parms[4]; > > > > - int32x4_t input; > > > > > > > > acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, > > > > total_packets, categories, ctx->trans_table); @@ -242,10 > > > > +237,10 @@ search_neon_4(const struct rte_acl_ctx *ctx, const > > > > +uint8_t > > > > **data, > > > > > > > > while (flows.started > 0) { > > > > /* Gather 4 bytes of input data for each stream. */ > > > > - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input, > > > > 0); > > > > - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input, > > > > 1); > > > > - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input, > > > > 2); > > > > - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input, > > > > 3); > > > > + int32x4_t input = {GET_NEXT_4BYTES(parms, 0), > > > > + GET_NEXT_4BYTES(parms, 1), > > > > + GET_NEXT_4BYTES(parms, 2), > > > > + GET_NEXT_4BYTES(parms, 3)}; > > > > > > > > /* Process the 4 bytes of input on each stream. */ > > > > input = transition4(input, flows.trans, index_array); > > > > -- > > > > 2.21.0 ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 compiler 2019-06-10 9:39 ` Jerin Jacob Kollanukkaran @ 2019-06-11 1:27 ` Honnappa Nagarahalli 2019-06-11 14:24 ` Jerin Jacob Kollanukkaran 0 siblings, 1 reply; 17+ messages in thread From: Honnappa Nagarahalli @ 2019-06-11 1:27 UTC (permalink / raw) To: jerinj, dev Cc: thomas, Gavin Hu (Arm Technology China), msantana, aconole, stable, Honnappa Nagarahalli, nd, nd > > > > -- > > > > > Subject: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 > > > > > compiler > > > > > > > > > > From: Jerin Jacob <jerinj@marvell.com> > > > > > > > > > > Some compilers reporting the following error, though the > > > > > existing code doesn't have any uninitialized variable case. > > > > > Just to make compiler happy, initialize the int32x4_t variable > > > > > one shot in C language. > > > > > > > > > > ../lib/librte_acl/acl_run_neon.h: In function 'search_neon_4' > > > > > ../lib/librte_acl/acl_run_neon.h:230:12: error: 'input' may be > > > > > used uninitialized in this function [-Werror=maybe-uninitialized] > > > > > int32x4_t input; > > > > > > > > > > Fixes: 34fa6c27c156 ("acl: add NEON optimization for ARMv8") > > > > > Cc: stable@dpdk.org > > > > > > > > > > Signed-off-by: Jerin Jacob <jerinj@marvell.com> > > > > > --- > > > > > lib/librte_acl/acl_run_neon.h | 29 > > > > > ++++++++++++----------------- > > > > > 1 file changed, 12 insertions(+), 17 deletions(-) > > > > > > > > > > diff --git a/lib/librte_acl/acl_run_neon.h > > > > > b/lib/librte_acl/acl_run_neon.h index 01b9766d8..dc9e9efe9 > > > > > 100644 > > > > > --- a/lib/librte_acl/acl_run_neon.h > > > > > +++ b/lib/librte_acl/acl_run_neon.h > > > > > @@ -165,7 +165,6 @@ search_neon_8(const struct rte_acl_ctx *ctx, > > > > > const uint8_t **data, > > > > > uint64_t index_array[8]; > > > > > struct completion cmplt[8]; > > > > > struct parms parms[8]; > > > > > - int32x4_t input0, input1; > > > > > > > > > > acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, > > > > > total_packets, categories, ctx->trans_table); @@ -181,17 > > > > > +180,14 @@ search_neon_8(const struct rte_acl_ctx *ctx, const > > > > > +uint8_t > > > > > **data, > > > > > > > > > > while (flows.started > 0) { > > > > > /* Gather 4 bytes of input data for each stream. */ > > > > > - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, > 0), > > > > > input0, 0); > > > > > - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, > 4), > > > > > input1, 0); > > > > > - > > > > > - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, > 1), > > > > > input0, 1); > > > > > - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, > 5), > > > > > input1, 1); > > > > > - > > > > > - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, > 2), > > > > > input0, 2); > > > > > - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, > 6), > > > > > input1, 2); > > > > > - > > > > > - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, > 3), > > > > > input0, 3); > > > > > - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, > 7), > > > > > input1, 3); > > > > > + int32x4_t input0 = {GET_NEXT_4BYTES(parms, 0), > > > > > + GET_NEXT_4BYTES(parms, 1), > > > > > + GET_NEXT_4BYTES(parms, 2), > > > > > + GET_NEXT_4BYTES(parms, 3)}; > > > > > + int32x4_t input1 = {GET_NEXT_4BYTES(parms, 4), > > > > > + GET_NEXT_4BYTES(parms, 5), > > > > > + GET_NEXT_4BYTES(parms, 6), > > > > > + GET_NEXT_4BYTES(parms, 7)}; > > > > > > > > > This mixes the use of NEON intrinsics with GCC vector extensions. > > > > ACLE (Arm C Language Extensions) specifically recommends not to > > > > mix the two methods in section 12.2.6. IMO, Aaron's suggestion of > > > > using a temp vector > > > should be good. > > > > > > We are using this pattern across DPDK and SSE for x86 as well. > > > https://git.dpdk.org/dpdk/tree/drivers/net/i40e/i40e_rxtx_vec_neon.c > > > #n > > > 91 > > I am not sure about x86, I have not looked at a document similar to > > ACLE for x86. IMO, it is not relevant here as this is Arm specific code. > > What I meant was its been already used in DPDK for arm64. > https://git.dpdk.org/dpdk/tree/drivers/net/i40e/i40e_rxtx_vec_neon.c#n91 Ok, got it. I have had discussion with compiler folks at Arm with mixing vector programming models and the recommendation has been to use NEON exclusively. I have had this discussion with Marvel compiler folks too some time back. > > Please see offial page vector gcc gcc documentation. The examples are using > this scheme. > https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html > > This is to just create 'input' variable. I am fine to use any other scheme with > out additional cost of instructions. > > > > > > > > > Since it used in fastpath, a temp variable would be additional cost > > > for no reason. > > Then, I would suggest we can go with using 'vdupq_n_s32'. > > We have to form uint64x2_t with 4 x uint32_t variable, How does > 'vdupq_n_s32' help here? We would use 'vdupq_n_s32' only for the first initialization, the rest of the code remains the same (see the diff below) > Can you share code snippet without any temp variable? diff --git a/lib/librte_acl/acl_run_neon.h b/lib/librte_acl/acl_run_neon.h index 01b9766d8..b3196cd12 100644 --- a/lib/librte_acl/acl_run_neon.h +++ b/lib/librte_acl/acl_run_neon.h @@ -181,8 +181,8 @@ search_neon_8(const struct rte_acl_ctx *ctx, const uint8_t **data, while (flows.started > 0) { /* Gather 4 bytes of input data for each stream. */ - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input0, 0); - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 4), input1, 0); + input0 = vdupq_n_s32(GET_NEXT_4BYTES(parms, 0)); + input1 = vdupq_n_s32(GET_NEXT_4BYTES(parms, 4)); input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input0, 1); input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 5), input1, 1); @@ -242,7 +242,7 @@ search_neon_4(const struct rte_acl_ctx *ctx, const uint8_t **data, while (flows.started > 0) { /* Gather 4 bytes of input data for each stream. */ - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input, 0); + input = vdupq_n_s32(GET_NEXT_4BYTES(parms, 0)); input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input, 1); input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input, 2); input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input, 3); My understanding is that the generated code for both your patch and my changes above is the same. Above suggested changes will conform to ACLE recommendation. > > > > > > If GCC supports it then I think it is fine, I think, above usage > > > matters with C++ portability. > > I did not understand the C++ portability part. Can you elaborate more? > > > > > > > > > > > > > > > > > /* Process the 4 bytes of input on each stream. */ > > > > > > > > > > @@ -227,7 +223,6 @@ search_neon_4(const struct rte_acl_ctx *ctx, > > > > > const uint8_t **data, > > > > > uint64_t index_array[4]; > > > > > struct completion cmplt[4]; > > > > > struct parms parms[4]; > > > > > - int32x4_t input; > > > > > > > > > > acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, > > > > > total_packets, categories, ctx->trans_table); @@ -242,10 > > > > > +237,10 @@ search_neon_4(const struct rte_acl_ctx *ctx, const > > > > > +uint8_t > > > > > **data, > > > > > > > > > > while (flows.started > 0) { > > > > > /* Gather 4 bytes of input data for each stream. */ > > > > > - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), > input, > > > > > 0); > > > > > - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), > input, > > > > > 1); > > > > > - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), > input, > > > > > 2); > > > > > - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), > input, > > > > > 3); > > > > > + int32x4_t input = {GET_NEXT_4BYTES(parms, 0), > > > > > + GET_NEXT_4BYTES(parms, 1), > > > > > + GET_NEXT_4BYTES(parms, 2), > > > > > + GET_NEXT_4BYTES(parms, 3)}; > > > > > > > > > > /* Process the 4 bytes of input on each stream. */ > > > > > input = transition4(input, flows.trans, index_array); > > > > > -- > > > > > 2.21.0 ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 compiler 2019-06-11 1:27 ` Honnappa Nagarahalli @ 2019-06-11 14:24 ` Jerin Jacob Kollanukkaran 2019-06-11 19:48 ` Honnappa Nagarahalli 0 siblings, 1 reply; 17+ messages in thread From: Jerin Jacob Kollanukkaran @ 2019-06-11 14:24 UTC (permalink / raw) To: Honnappa Nagarahalli, dev Cc: thomas, Gavin Hu (Arm Technology China), msantana, aconole, stable, nd, nd > -----Original Message----- > From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> > Sent: Tuesday, June 11, 2019 6:58 AM > To: Jerin Jacob Kollanukkaran <jerinj@marvell.com>; dev@dpdk.org > Cc: thomas@monjalon.net; Gavin Hu (Arm Technology China) > <Gavin.Hu@arm.com>; msantana@redhat.com; aconole@redhat.com; > stable@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; > nd <nd@arm.com>; nd <nd@arm.com> > Subject: [EXT] RE: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 > compiler > > > > > > > > > > > > Since it used in fastpath, a temp variable would be additional > > > > cost for no reason. > > > Then, I would suggest we can go with using 'vdupq_n_s32'. > > > > We have to form uint64x2_t with 4 x uint32_t variable, How does > > 'vdupq_n_s32' help here? > We would use 'vdupq_n_s32' only for the first initialization, the rest of the code > remains the same (see the diff below) > > > Can you share code snippet without any temp variable? > diff --git a/lib/librte_acl/acl_run_neon.h b/lib/librte_acl/acl_run_neon.h index > 01b9766d8..b3196cd12 100644 > --- a/lib/librte_acl/acl_run_neon.h > +++ b/lib/librte_acl/acl_run_neon.h > @@ -181,8 +181,8 @@ search_neon_8(const struct rte_acl_ctx *ctx, const > uint8_t **data, > > while (flows.started > 0) { > /* Gather 4 bytes of input data for each stream. */ > - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input0, 0); > - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 4), input1, 0); > + input0 = vdupq_n_s32(GET_NEXT_4BYTES(parms, 0)); > + input1 = vdupq_n_s32(GET_NEXT_4BYTES(parms, 4)); > > input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input0, 1); > input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 5), input1, 1); @@ - > 242,7 +242,7 @@ search_neon_4(const struct rte_acl_ctx *ctx, const uint8_t > **data, > > while (flows.started > 0) { > /* Gather 4 bytes of input data for each stream. */ > - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input, 0); > + input = vdupq_n_s32(GET_NEXT_4BYTES(parms, 0)); > input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input, 1); > input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input, 2); > input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input, 3); > > My understanding is that the generated code for both your patch and my > changes above is the same. Above suggested changes will conform to ACLE > recommendation. Though instructions are different. Effective cycles are same even though First dup updates the four positions. To make forward progress send the v2 based on the updated logic just to make ACLE Spec happy, I don’t see any real reason to do it though 😊 http://patches.dpdk.org/patch/54656/ ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 compiler 2019-06-11 14:24 ` Jerin Jacob Kollanukkaran @ 2019-06-11 19:48 ` Honnappa Nagarahalli 2019-06-12 2:41 ` Jerin Jacob Kollanukkaran 0 siblings, 1 reply; 17+ messages in thread From: Honnappa Nagarahalli @ 2019-06-11 19:48 UTC (permalink / raw) To: jerinj, dev; +Cc: thomas, Gavin Hu (Arm Technology China), nd, nd Reduced the CC list (changing the topic slightly) > > > > My understanding is that the generated code for both your patch and my > > changes above is the same. Above suggested changes will conform to > > ACLE recommendation. > > Though instructions are different. Effective cycles are same even though First > dup updates the four positions. Can you elaborate on how the instructions are different? I wrote the following code with both the methods: uint32x4_t u32x4_gather_gcc (uint32_t *p0, uint32_t *p1, uint32_t *p2, uint32_t *p3) { uint32x4_t r = {*p0, *p1, *p2, *p3}; return r; } uint32x4_t u32x4_gather_acle (uint32_t *p0, uint32_t *p1, uint32_t *p2, uint32_t *p3) { uint32x4_t r; r = vdupq_n_u32 (* p0); r = vsetq_lane_u32 (*p1, r, 1); r = vsetq_lane_u32 (*p2, r, 2); r = vsetq_lane_u32 (*p3, r, 3); return r; } The generated code has the same instructions for both (omitted the unwanted parts): u32x4_gather_gcc: ld1r {v0.4s}, [x0] ld1 {v0.s}[1], [x1] ld1 {v0.s}[2], [x2] ld1 {v0.s}[3], [x3] ret u32x4_gather_acle: ld1r {v0.4s}, [x0] ld1 {v0.s}[1], [x1] ld1 {v0.s}[2], [x2] ld1 {v0.s}[3], [x3] ret The first 'ld1r' updates all the lanes in both the cases. > To make forward progress send the v2 based on the updated logic just to > make ACLE Spec happy, I don’t see any real reason to do it though 😊 Thanks for the patch, it was important to make forward progress. But, I think we should carry forward the discussion as I plan to change other parts of DPDK on similar lines. I want to understand why you think there is no real reason. The ACLE recommendation mentions the reasoning. > > http://patches.dpdk.org/patch/54656/ > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 compiler 2019-06-11 19:48 ` Honnappa Nagarahalli @ 2019-06-12 2:41 ` Jerin Jacob Kollanukkaran 2019-06-17 0:48 ` Honnappa Nagarahalli 0 siblings, 1 reply; 17+ messages in thread From: Jerin Jacob Kollanukkaran @ 2019-06-12 2:41 UTC (permalink / raw) To: Honnappa Nagarahalli, dev; +Cc: thomas, Gavin Hu (Arm Technology China), nd, nd > -----Original Message----- > From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> > Sent: Wednesday, June 12, 2019 1:18 AM > To: Jerin Jacob Kollanukkaran <jerinj@marvell.com>; dev@dpdk.org > Cc: thomas@monjalon.net; Gavin Hu (Arm Technology China) > <Gavin.Hu@arm.com>; nd <nd@arm.com>; nd <nd@arm.com> > Subject: [EXT] RE: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 > compiler > > Reduced the CC list (changing the topic slightly) > > > > > > > My understanding is that the generated code for both your patch and > > > my changes above is the same. Above suggested changes will conform > > > to ACLE recommendation. > > > > Though instructions are different. Effective cycles are same even > > though First dup updates the four positions. > Can you elaborate on how the instructions are different? > I wrote the following code with both the methods: > > uint32x4_t u32x4_gather_gcc (uint32_t *p0, uint32_t *p1, uint32_t *p2, > uint32_t *p3) { > uint32x4_t r = {*p0, *p1, *p2, *p3}; > > return r; > } > > uint32x4_t u32x4_gather_acle (uint32_t *p0, uint32_t *p1, uint32_t *p2, > uint32_t *p3) { > uint32x4_t r; > > r = vdupq_n_u32 (* p0); > r = vsetq_lane_u32 (*p1, r, 1); > r = vsetq_lane_u32 (*p2, r, 2); > r = vsetq_lane_u32 (*p3, r, 3); > > return r; > } > > The generated code has the same instructions for both (omitted the unwanted > parts): > > u32x4_gather_gcc: > ld1r {v0.4s}, [x0] > ld1 {v0.s}[1], [x1] > ld1 {v0.s}[2], [x2] > ld1 {v0.s}[3], [x3] > ret > > u32x4_gather_acle: > ld1r {v0.4s}, [x0] > ld1 {v0.s}[1], [x1] > ld1 {v0.s}[2], [x2] > ld1 {v0.s}[3], [x3] > ret > > The first 'ld1r' updates all the lanes in both the cases. Please check actual generated code for ACL case. We can see difference 0x00000000005cc1dc <+1884>: 80 6a 65 bc ldr s0, [x20, x5] vs 0x00000000005cc1dc <+1884>: 9e 6a 65 b8 ldr w30, [x20, x5] With patch: 244 /* Gather 4 bytes of input data for each stream. */ 245 input = vdupq_n_s32(GET_NEXT_4BYTES(parms, 0)); 0x00000000005cc1c8 <+1864>: b4 4f 46 a9 ldp x20, x19, [x29, #96] 0x00000000005cc1d8 <+1880>: 65 02 40 b9 ldr w5, [x19] 0x00000000005cc1dc <+1884>: 80 6a 65 bc ldr s0, [x20, x5] 0x00000000005cc26c <+2028>: 73 12 00 91 add x19, x19, #0x4 0x00000000005cc2ac <+2092>: b3 37 00 f9 str x19, [x29, #104] 246 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input, 1); 0x00000000005cc1d0 <+1872>: a6 9f 47 a9 ldp x6, x7, [x29, #120] 0x00000000005cc1ec <+1900>: e5 00 40 b9 ldr w5, [x7] 0x00000000005cc1f0 <+1904>: d6 68 65 b8 ldr w22, [x6, x5] 0x00000000005cc21c <+1948>: e7 10 00 91 add x7, x7, #0x4 0x00000000005cc260 <+2016>: a7 43 00 f9 str x7, [x29, #128] 247 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input, 2); 0x00000000005cc1d4 <+1876>: b5 4b 40 f9 ldr x21, [x29, #144] 0x00000000005cc1f4 <+1908>: a6 4f 40 f9 ldr x6, [x29, #152] 0x00000000005cc1f8 <+1912>: d4 00 40 b9 ldr w20, [x6] 0x00000000005cc1fc <+1916>: b5 6a 74 b8 ldr w21, [x21, x20] 0x00000000005cc224 <+1956>: c6 10 00 91 add x6, x6, #0x4 0x00000000005cc264 <+2020>: a6 4f 00 f9 str x6, [x29, #152] 248 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input, 3); 0x00000000005cc200 <+1920>: a5 5b 40 f9 ldr x5, [x29, #176] 0x00000000005cc204 <+1924>: b4 00 40 b9 ldr w20, [x5] 0x00000000005cc208 <+1928>: a5 10 00 91 add x5, x5, #0x4 0x00000000005cc218 <+1944>: b7 57 40 f9 ldr x23, [x29, #168] 0x00000000005cc220 <+1952>: f4 6a 74 b8 ldr w20, [x23, x20] 0x00000000005cc228 <+1960>: a5 5b 00 f9 str x5, [x29, #176] With out patch: 245 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input, 0); 0x00000000005cc1c8 <+1864>: b4 4f 46 a9 ldp x20, x19, [x29, #96] 0x00000000005cc1d8 <+1880>: 65 02 40 b9 ldr w5, [x19] 0x00000000005cc1dc <+1884>: 9e 6a 65 b8 ldr w30, [x20, x5] 0x00000000005cc248 <+1992>: 73 12 00 91 add x19, x19, #0x4 0x00000000005cc24c <+1996>: b3 37 00 f9 str x19, [x29, #104] 246 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input, 1); 0x00000000005cc1d0 <+1872>: a6 9f 47 a9 ldp x6, x7, [x29, #120] 0x00000000005cc1ec <+1900>: e5 00 40 b9 ldr w5, [x7] 0x00000000005cc1f0 <+1904>: d6 68 65 b8 ldr w22, [x6, x5] 0x00000000005cc228 <+1960>: e7 10 00 91 add x7, x7, #0x4 0x00000000005cc240 <+1984>: a7 43 00 f9 str x7, [x29, #128] 247 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input, 2); 0x00000000005cc1d4 <+1876>: b5 4b 40 f9 ldr x21, [x29, #144] 0x00000000005cc1f4 <+1908>: a6 4f 40 f9 ldr x6, [x29, #152] 0x00000000005cc1f8 <+1912>: d4 00 40 b9 ldr w20, [x6] 0x00000000005cc1fc <+1916>: b5 6a 74 b8 ldr w21, [x21, x20] 0x00000000005cc22c <+1964>: c6 10 00 91 add x6, x6, #0x4 0x00000000005cc244 <+1988>: a6 4f 00 f9 str x6, [x29, #152] 248 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input, 3); 0x00000000005cc200 <+1920>: a5 5b 40 f9 ldr x5, [x29, #176] 0x00000000005cc204 <+1924>: b4 00 40 b9 ldr w20, [x5] 0x00000000005cc208 <+1928>: a5 10 00 91 add x5, x5, #0x4 0x00000000005cc21c <+1948>: b7 57 40 f9 ldr x23, [x29, #168] 0x00000000005cc224 <+1956>: f4 6a 74 b8 ldr w20, [x23, x20] 0x00000000005cc230 <+1968>: a5 5b 00 f9 str x5, [x29, #176] > > > To make forward progress send the v2 based on the updated logic just > > to make ACLE Spec happy, I don’t see any real reason to do it though > > 😊 > Thanks for the patch, it was important to make forward progress. > But, I think we should carry forward the discussion as I plan to change other > parts of DPDK on similar lines. I want to understand why you think there is no > real reason. The ACLE recommendation mentions the reasoning. # I see following in the ACLE spec. What is the actual reasoning? " ACLE does not define static construction of vector types. E.g. int32x4_t x = { 1, 2, 3, 4 }; Is not portable. Use the vcreate or vdup intrinsics to construct values from scalars. " # Why does compiler(gcc) allows if it not indented to use? # I think, it may be time to introduce UndefinedBehaviorSanitizer (UBSan) Gcc feature to DPDK to detect undefined behavior checks to detect such case > > > > > http://patches.dpdk.org/patch/54656/ > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 compiler 2019-06-12 2:41 ` Jerin Jacob Kollanukkaran @ 2019-06-17 0:48 ` Honnappa Nagarahalli 2019-06-17 6:52 ` Jerin Jacob Kollanukkaran 0 siblings, 1 reply; 17+ messages in thread From: Honnappa Nagarahalli @ 2019-06-17 0:48 UTC (permalink / raw) To: jerinj, dev Cc: thomas, Gavin Hu (Arm Technology China), Honnappa Nagarahalli, nd, nd > > > > Reduced the CC list (changing the topic slightly) > > > > > > > > > > My understanding is that the generated code for both your patch > > > > and my changes above is the same. Above suggested changes will > > > > conform to ACLE recommendation. > > > > > > Though instructions are different. Effective cycles are same even > > > though First dup updates the four positions. > > Can you elaborate on how the instructions are different? > > I wrote the following code with both the methods: > > > > uint32x4_t u32x4_gather_gcc (uint32_t *p0, uint32_t *p1, uint32_t *p2, > > uint32_t *p3) { > > uint32x4_t r = {*p0, *p1, *p2, *p3}; > > > > return r; > > } > > > > uint32x4_t u32x4_gather_acle (uint32_t *p0, uint32_t *p1, uint32_t > > *p2, uint32_t *p3) { > > uint32x4_t r; > > > > r = vdupq_n_u32 (* p0); > > r = vsetq_lane_u32 (*p1, r, 1); > > r = vsetq_lane_u32 (*p2, r, 2); > > r = vsetq_lane_u32 (*p3, r, 3); > > > > return r; > > } > > > > The generated code has the same instructions for both (omitted the > > unwanted > > parts): > > > > u32x4_gather_gcc: > > ld1r {v0.4s}, [x0] > > ld1 {v0.s}[1], [x1] > > ld1 {v0.s}[2], [x2] > > ld1 {v0.s}[3], [x3] > > ret > > > > u32x4_gather_acle: > > ld1r {v0.4s}, [x0] > > ld1 {v0.s}[1], [x1] > > ld1 {v0.s}[2], [x2] > > ld1 {v0.s}[3], [x3] > > ret > > > > The first 'ld1r' updates all the lanes in both the cases. > > > Please check actual generated code for ACL case. We can see difference I think there is something wrong with the way you are looking at the generated code. Please see comments below. > 0x00000000005cc1dc <+1884>: 80 6a 65 bc ldr s0, [x20, x5] > vs > 0x00000000005cc1dc <+1884>: 9e 6a 65 b8 ldr w30, [x20, x5] The register W30 is a scalar register. > > With patch: > > 244 /* Gather 4 bytes of input data for each stream. */ > 245 input = vdupq_n_s32(GET_NEXT_4BYTES(parms, 0)); > 0x00000000005cc1c8 <+1864>: b4 4f 46 a9 ldp x20, x19, [x29, #96] > 0x00000000005cc1d8 <+1880>: 65 02 40 b9 ldr w5, [x19] > 0x00000000005cc1dc <+1884>: 80 6a 65 bc ldr s0, [x20, x5] > 0x00000000005cc26c <+2028>: 73 12 00 91 add x19, x19, #0x4 > 0x00000000005cc2ac <+2092>: b3 37 00 f9 str x19, [x29, #104] > > 246 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input, This one and below ones are not containing any vector instructions. > 1); > 0x00000000005cc1d0 <+1872>: a6 9f 47 a9 ldp x6, x7, [x29, #120] > 0x00000000005cc1ec <+1900>: e5 00 40 b9 ldr w5, [x7] > 0x00000000005cc1f0 <+1904>: d6 68 65 b8 ldr w22, [x6, x5] > 0x00000000005cc21c <+1948>: e7 10 00 91 add x7, x7, #0x4 > 0x00000000005cc260 <+2016>: a7 43 00 f9 str x7, [x29, #128] > > 247 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input, > 2); > 0x00000000005cc1d4 <+1876>: b5 4b 40 f9 ldr x21, [x29, #144] > 0x00000000005cc1f4 <+1908>: a6 4f 40 f9 ldr x6, [x29, #152] > 0x00000000005cc1f8 <+1912>: d4 00 40 b9 ldr w20, [x6] > 0x00000000005cc1fc <+1916>: b5 6a 74 b8 ldr w21, [x21, x20] > 0x00000000005cc224 <+1956>: c6 10 00 91 add x6, x6, #0x4 > 0x00000000005cc264 <+2020>: a6 4f 00 f9 str x6, [x29, #152] > > 248 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input, > 3); > 0x00000000005cc200 <+1920>: a5 5b 40 f9 ldr x5, [x29, #176] > 0x00000000005cc204 <+1924>: b4 00 40 b9 ldr w20, [x5] > 0x00000000005cc208 <+1928>: a5 10 00 91 add x5, x5, #0x4 > 0x00000000005cc218 <+1944>: b7 57 40 f9 ldr x23, [x29, #168] > 0x00000000005cc220 <+1952>: f4 6a 74 b8 ldr w20, [x23, x20] > 0x00000000005cc228 <+1960>: a5 5b 00 f9 str x5, [x29, #176] > > With out patch: This generated code does not contain any vector instructions. Can you please check? I changed the code to be similar to ACL code, please look at [1], the generated code is the same. [1] https://gcc.godbolt.org/z/p1sQNA > > 245 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input, > 0); > 0x00000000005cc1c8 <+1864>: b4 4f 46 a9 ldp x20, x19, [x29, #96] > 0x00000000005cc1d8 <+1880>: 65 02 40 b9 ldr w5, [x19] > 0x00000000005cc1dc <+1884>: 9e 6a 65 b8 ldr w30, [x20, x5] > 0x00000000005cc248 <+1992>: 73 12 00 91 add x19, x19, #0x4 > 0x00000000005cc24c <+1996>: b3 37 00 f9 str x19, [x29, #104] > > 246 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input, > 1); > 0x00000000005cc1d0 <+1872>: a6 9f 47 a9 ldp x6, x7, [x29, #120] > 0x00000000005cc1ec <+1900>: e5 00 40 b9 ldr w5, [x7] > 0x00000000005cc1f0 <+1904>: d6 68 65 b8 ldr w22, [x6, x5] > 0x00000000005cc228 <+1960>: e7 10 00 91 add x7, x7, #0x4 > 0x00000000005cc240 <+1984>: a7 43 00 f9 str x7, [x29, #128] > > 247 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input, > 2); > 0x00000000005cc1d4 <+1876>: b5 4b 40 f9 ldr x21, [x29, #144] > 0x00000000005cc1f4 <+1908>: a6 4f 40 f9 ldr x6, [x29, #152] > 0x00000000005cc1f8 <+1912>: d4 00 40 b9 ldr w20, [x6] > 0x00000000005cc1fc <+1916>: b5 6a 74 b8 ldr w21, [x21, x20] > 0x00000000005cc22c <+1964>: c6 10 00 91 add x6, x6, #0x4 > 0x00000000005cc244 <+1988>: a6 4f 00 f9 str x6, [x29, #152] > > 248 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input, > 3); > 0x00000000005cc200 <+1920>: a5 5b 40 f9 ldr x5, [x29, #176] > 0x00000000005cc204 <+1924>: b4 00 40 b9 ldr w20, [x5] > 0x00000000005cc208 <+1928>: a5 10 00 91 add x5, x5, #0x4 > 0x00000000005cc21c <+1948>: b7 57 40 f9 ldr x23, [x29, #168] > 0x00000000005cc224 <+1956>: f4 6a 74 b8 ldr w20, [x23, x20] > 0x00000000005cc230 <+1968>: a5 5b 00 f9 str x5, [x29, #176] > > > > > > > To make forward progress send the v2 based on the updated logic > > > just to make ACLE Spec happy, I don’t see any real reason to do it > > > though > > > 😊 > > Thanks for the patch, it was important to make forward progress. > > But, I think we should carry forward the discussion as I plan to > > change other parts of DPDK on similar lines. I want to understand why > > you think there is no real reason. The ACLE recommendation mentions the > reasoning. > > # I see following in the ACLE spec. What is the actual reasoning? > " > ACLE does not define static construction of vector types. E.g. > int32x4_t x = { 1, 2, 3, 4 }; > Is not portable. Use the vcreate or vdup intrinsics to construct values from > scalars. > " Here is the complete text from ACLE 2.1 12.2.6 Compatibility with other vector programming models Programmers should take particular care when combining the Neon Intrinsics API with alternative vector programming models; ACLE does not specify how the NEON Intrinsics API interoperates with them. For instance, the GCC vector extension permits include “arm_neon.h” ... uint32x2_t x = {0, 1}; // GCC extension. uint32_t y = vget_lane_s32 (x, 0); // ACLE NEON Intrinsic. But with this code the value stored in ‘y’ will depend on both the target architecture (AArch32 or AArch64) and whether the program is running in big- or little-endian mode. It is recommended that NEON Intrinsics be used consistently: include “arm_neon.h” ... const int temp[2] = {0, 1}; uint32x2_t x = vld1_s32 (temp); uint32_t y = vget_lane_s32 (x, 0); > > # Why does compiler(gcc) allows if it not indented to use? I do not have an answer. This is a recommendation and all that I am trying to say is, following the recommendation does not cost us anything in performance. > > # I think, it may be time to introduce UndefinedBehaviorSanitizer (UBSan) > Gcc feature to DPDK to detect undefined behavior checks to detect such case I am not sure if it helps here. > > > > > > > > > > http://patches.dpdk.org/patch/54656/ > > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 compiler 2019-06-17 0:48 ` Honnappa Nagarahalli @ 2019-06-17 6:52 ` Jerin Jacob Kollanukkaran 0 siblings, 0 replies; 17+ messages in thread From: Jerin Jacob Kollanukkaran @ 2019-06-17 6:52 UTC (permalink / raw) To: Honnappa Nagarahalli, dev; +Cc: thomas, Gavin Hu (Arm Technology China), nd, nd > -----Original Message----- > From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> > Sent: Monday, June 17, 2019 6:19 AM > To: Jerin Jacob Kollanukkaran <jerinj@marvell.com>; dev@dpdk.org > Cc: thomas@monjalon.net; Gavin Hu (Arm Technology China) > <Gavin.Hu@arm.com>; Honnappa Nagarahalli > <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>; nd <nd@arm.com> > Subject: [EXT] RE: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 > compiler > > External Email > > ---------------------------------------------------------------------- > > > > > > Reduced the CC list (changing the topic slightly) > > > > > > > > > > > > > My understanding is that the generated code for both your patch > > > > > and my changes above is the same. Above suggested changes will > > > > > conform to ACLE recommendation. > > > > > > > > Though instructions are different. Effective cycles are same even > > > > though First dup updates the four positions. > > > Can you elaborate on how the instructions are different? > > > I wrote the following code with both the methods: > > > > > > uint32x4_t u32x4_gather_gcc (uint32_t *p0, uint32_t *p1, uint32_t > > > *p2, uint32_t *p3) { > > > uint32x4_t r = {*p0, *p1, *p2, *p3}; > > > > > > return r; > > > } > > > > > > uint32x4_t u32x4_gather_acle (uint32_t *p0, uint32_t *p1, uint32_t > > > *p2, uint32_t *p3) { > > > uint32x4_t r; > > > > > > r = vdupq_n_u32 (* p0); > > > r = vsetq_lane_u32 (*p1, r, 1); > > > r = vsetq_lane_u32 (*p2, r, 2); > > > r = vsetq_lane_u32 (*p3, r, 3); > > > > > > return r; > > > } > > > > > > The generated code has the same instructions for both (omitted the > > > unwanted > > > parts): > > > > > > u32x4_gather_gcc: > > > ld1r {v0.4s}, [x0] > > > ld1 {v0.s}[1], [x1] > > > ld1 {v0.s}[2], [x2] > > > ld1 {v0.s}[3], [x3] > > > ret > > > > > > u32x4_gather_acle: > > > ld1r {v0.4s}, [x0] > > > ld1 {v0.s}[1], [x1] > > > ld1 {v0.s}[2], [x2] > > > ld1 {v0.s}[3], [x3] > > > ret > > > > > > The first 'ld1r' updates all the lanes in both the cases. > > > > > > Please check actual generated code for ACL case. We can see difference > I think there is something wrong with the way you are looking at the > generated code. Please see comments below. I am generating the dis assembly like below. gdb -batch -ex 'file build/app/test ' -ex 'disassemble /rm search_neon_4' You can try it out. > > > > > To make forward progress send the v2 based on the updated logic > > > > just to make ACLE Spec happy, I don’t see any real reason to do > > > > it though > > > > 😊 > > > Thanks for the patch, it was important to make forward progress. > > > But, I think we should carry forward the discussion as I plan to > > > change other parts of DPDK on similar lines. I want to understand > > > why you think there is no real reason. The ACLE recommendation > > > mentions the > > reasoning. > > > > # I see following in the ACLE spec. What is the actual reasoning? > > " > > ACLE does not define static construction of vector types. E.g. > > int32x4_t x = { 1, 2, 3, 4 }; > > Is not portable. Use the vcreate or vdup intrinsics to construct > > values from scalars. > > " > Here is the complete text from ACLE 2.1 > > 12.2.6 Compatibility with other vector programming models Programmers > should take particular care when combining the Neon Intrinsics API with > alternative vector programming models; ACLE does not specify how the > NEON Intrinsics API interoperates with them. > For instance, the GCC vector extension permits include “arm_neon.h” > ... > uint32x2_t x = {0, 1}; // GCC extension. > uint32_t y = vget_lane_s32 (x, 0); // ACLE NEON Intrinsic. > But with this code the value stored in ‘y’ will depend on both the target > architecture (AArch32 or AArch64) and whether the program is running in > big- or little-endian mode. I don’t have a big endian machine to test. I would be interesting to see The output in bigendian. > It is recommended that NEON Intrinsics be used consistently: > include “arm_neon.h” > ... > const int temp[2] = {0, 1}; > uint32x2_t x = vld1_s32 (temp); > uint32_t y = vget_lane_s32 (x, 0); > > > > > # Why does compiler(gcc) allows if it not indented to use? > I do not have an answer. This is a recommendation and all that I am trying to > say is, following the recommendation does not cost us anything in > performance. If there is no performance regression then no issue in changing to this format. > > > > > # I think, it may be time to introduce UndefinedBehaviorSanitizer > > (UBSan) Gcc feature to DPDK to detect undefined behavior checks to > > detect such case > I am not sure if it helps here. > > > > > > > > > > > > > > > > http://patches.dpdk.org/patch/54656/ > > > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 compiler 2019-06-06 14:50 [dpdk-dev] [PATCH] acl: fix build issue with some arm64 compiler jerinj 2019-06-06 15:55 ` Michael Santana Francisco 2019-06-07 5:35 ` Honnappa Nagarahalli @ 2019-06-10 12:10 ` Aaron Conole 2019-06-11 14:15 ` [dpdk-dev] [PATCH v2] " jerinj 3 siblings, 0 replies; 17+ messages in thread From: Aaron Conole @ 2019-06-10 12:10 UTC (permalink / raw) To: jerinj; +Cc: dev, thomas, gavin.hu, honnappa.nagarahalli, msantana, stable <jerinj@marvell.com> writes: > From: Jerin Jacob <jerinj@marvell.com> > > Some compilers reporting the following error, though the existing > code doesn't have any uninitialized variable case. > Just to make compiler happy, initialize the int32x4_t variable > one shot in C language. > > ../lib/librte_acl/acl_run_neon.h: In function 'search_neon_4' > ../lib/librte_acl/acl_run_neon.h:230:12: error: 'input' may be > used uninitialized in this function [-Werror=maybe-uninitialized] > int32x4_t input; > > Fixes: 34fa6c27c156 ("acl: add NEON optimization for ARMv8") > Cc: stable@dpdk.org > > Signed-off-by: Jerin Jacob <jerinj@marvell.com> > --- This pattern is easy to understand, congruent with other usages in the code base, has good patch statistics, and solves the issue. Acked-by: Aaron Conole <aconole@redhat.com> I prefer this solution to the others posted. Thanks for looking into it, Jerin! ^ permalink raw reply [flat|nested] 17+ messages in thread
* [dpdk-dev] [PATCH v2] acl: fix build issue with some arm64 compiler 2019-06-06 14:50 [dpdk-dev] [PATCH] acl: fix build issue with some arm64 compiler jerinj ` (2 preceding siblings ...) 2019-06-10 12:10 ` Aaron Conole @ 2019-06-11 14:15 ` jerinj 2019-06-11 14:53 ` Aaron Conole 3 siblings, 1 reply; 17+ messages in thread From: jerinj @ 2019-06-11 14:15 UTC (permalink / raw) To: Jerin Jacob, Gavin Hu, Konstantin Ananyev Cc: dev, thomas, msantana, aconole, stable, Honnappa Nagarahalli From: Jerin Jacob <jerinj@marvell.com> Some compilers reporting the following error, though the existing code doesn't have any uninitialized variable case. Just to make compiler happy, initialize the int32x4_t variable one shot using vdupq_n_s32. ../lib/librte_acl/acl_run_neon.h: In function 'search_neon_4' ../lib/librte_acl/acl_run_neon.h:230:12: error: 'input' may be used uninitialized in this function [-Werror=maybe-uninitialized] int32x4_t input; Fixes: 34fa6c27c156 ("acl: add NEON optimization for ARMv8") Cc: stable@dpdk.org Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> Signed-off-by: Jerin Jacob <jerinj@marvell.com> --- v2: - Changed C based initializion to vdupq_n_s32 for better comparability with ACLE(Honnappa) --- lib/librte_acl/acl_run_neon.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/lib/librte_acl/acl_run_neon.h b/lib/librte_acl/acl_run_neon.h index 01b9766d8..b3196cd12 100644 --- a/lib/librte_acl/acl_run_neon.h +++ b/lib/librte_acl/acl_run_neon.h @@ -181,8 +181,8 @@ search_neon_8(const struct rte_acl_ctx *ctx, const uint8_t **data, while (flows.started > 0) { /* Gather 4 bytes of input data for each stream. */ - input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input0, 0); - input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 4), input1, 0); + input0 = vdupq_n_s32(GET_NEXT_4BYTES(parms, 0)); + input1 = vdupq_n_s32(GET_NEXT_4BYTES(parms, 4)); input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input0, 1); input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 5), input1, 1); @@ -242,7 +242,7 @@ search_neon_4(const struct rte_acl_ctx *ctx, const uint8_t **data, while (flows.started > 0) { /* Gather 4 bytes of input data for each stream. */ - input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input, 0); + input = vdupq_n_s32(GET_NEXT_4BYTES(parms, 0)); input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input, 1); input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input, 2); input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input, 3); -- 2.21.0 ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [dpdk-dev] [PATCH v2] acl: fix build issue with some arm64 compiler 2019-06-11 14:15 ` [dpdk-dev] [PATCH v2] " jerinj @ 2019-06-11 14:53 ` Aaron Conole 2019-06-11 15:07 ` Thomas Monjalon 0 siblings, 1 reply; 17+ messages in thread From: Aaron Conole @ 2019-06-11 14:53 UTC (permalink / raw) To: jerinj Cc: Gavin Hu, Konstantin Ananyev, dev, thomas, msantana, stable, Honnappa Nagarahalli <jerinj@marvell.com> writes: > From: Jerin Jacob <jerinj@marvell.com> > > Some compilers reporting the following error, though the existing > code doesn't have any uninitialized variable case. > Just to make compiler happy, initialize the int32x4_t variable > one shot using vdupq_n_s32. > > ../lib/librte_acl/acl_run_neon.h: In function 'search_neon_4' > ../lib/librte_acl/acl_run_neon.h:230:12: error: 'input' may be > used uninitialized in this function [-Werror=maybe-uninitialized] > int32x4_t input; > > Fixes: 34fa6c27c156 ("acl: add NEON optimization for ARMv8") > Cc: stable@dpdk.org > > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> > Signed-off-by: Jerin Jacob <jerinj@marvell.com> > --- LGTM Acked-by: Aaron Conole <aconole@redhat.com> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [dpdk-dev] [PATCH v2] acl: fix build issue with some arm64 compiler 2019-06-11 14:53 ` Aaron Conole @ 2019-06-11 15:07 ` Thomas Monjalon 0 siblings, 0 replies; 17+ messages in thread From: Thomas Monjalon @ 2019-06-11 15:07 UTC (permalink / raw) To: jerinj Cc: dev, Aaron Conole, Gavin Hu, Konstantin Ananyev, msantana, stable, Honnappa Nagarahalli 11/06/2019 23:53, Aaron Conole: > <jerinj@marvell.com> writes: > > > From: Jerin Jacob <jerinj@marvell.com> > > > > Some compilers reporting the following error, though the existing > > code doesn't have any uninitialized variable case. > > Just to make compiler happy, initialize the int32x4_t variable > > one shot using vdupq_n_s32. > > > > ../lib/librte_acl/acl_run_neon.h: In function 'search_neon_4' > > ../lib/librte_acl/acl_run_neon.h:230:12: error: 'input' may be > > used uninitialized in this function [-Werror=maybe-uninitialized] > > int32x4_t input; > > > > Fixes: 34fa6c27c156 ("acl: add NEON optimization for ARMv8") > > Cc: stable@dpdk.org > > > > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> > > Signed-off-by: Jerin Jacob <jerinj@marvell.com> > > --- > > LGTM > > Acked-by: Aaron Conole <aconole@redhat.com> Applied, thanks ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2019-06-17 6:52 UTC | newest] Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-06-06 14:50 [dpdk-dev] [PATCH] acl: fix build issue with some arm64 compiler jerinj 2019-06-06 15:55 ` Michael Santana Francisco 2019-06-07 5:42 ` Honnappa Nagarahalli 2019-06-07 5:35 ` Honnappa Nagarahalli 2019-06-07 6:21 ` Jerin Jacob Kollanukkaran 2019-06-10 5:29 ` Honnappa Nagarahalli 2019-06-10 9:39 ` Jerin Jacob Kollanukkaran 2019-06-11 1:27 ` Honnappa Nagarahalli 2019-06-11 14:24 ` Jerin Jacob Kollanukkaran 2019-06-11 19:48 ` Honnappa Nagarahalli 2019-06-12 2:41 ` Jerin Jacob Kollanukkaran 2019-06-17 0:48 ` Honnappa Nagarahalli 2019-06-17 6:52 ` Jerin Jacob Kollanukkaran 2019-06-10 12:10 ` Aaron Conole 2019-06-11 14:15 ` [dpdk-dev] [PATCH v2] " jerinj 2019-06-11 14:53 ` Aaron Conole 2019-06-11 15:07 ` Thomas Monjalon
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).