From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
To: "jerinj@marvell.com" <jerinj@marvell.com>, "dev@dpdk.org" <dev@dpdk.org>
Cc: "thomas@monjalon.net" <thomas@monjalon.net>,
"Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>,
Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>,
nd <nd@arm.com>, nd <nd@arm.com>
Subject: Re: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 compiler
Date: Mon, 17 Jun 2019 00:48:41 +0000 [thread overview]
Message-ID: <VE1PR08MB51496238B1C7148AAE42549A98EB0@VE1PR08MB5149.eurprd08.prod.outlook.com> (raw)
In-Reply-To: <BYAPR18MB2424D7AE8B39DADEF8398E33C8EC0@BYAPR18MB2424.namprd18.prod.outlook.com>
> >
> > Reduced the CC list (changing the topic slightly)
> >
> > > >
> > > > My understanding is that the generated code for both your patch
> > > > and my changes above is the same. Above suggested changes will
> > > > conform to ACLE recommendation.
> > >
> > > Though instructions are different. Effective cycles are same even
> > > though First dup updates the four positions.
> > Can you elaborate on how the instructions are different?
> > I wrote the following code with both the methods:
> >
> > uint32x4_t u32x4_gather_gcc (uint32_t *p0, uint32_t *p1, uint32_t *p2,
> > uint32_t *p3) {
> > uint32x4_t r = {*p0, *p1, *p2, *p3};
> >
> > return r;
> > }
> >
> > uint32x4_t u32x4_gather_acle (uint32_t *p0, uint32_t *p1, uint32_t
> > *p2, uint32_t *p3) {
> > uint32x4_t r;
> >
> > r = vdupq_n_u32 (* p0);
> > r = vsetq_lane_u32 (*p1, r, 1);
> > r = vsetq_lane_u32 (*p2, r, 2);
> > r = vsetq_lane_u32 (*p3, r, 3);
> >
> > return r;
> > }
> >
> > The generated code has the same instructions for both (omitted the
> > unwanted
> > parts):
> >
> > u32x4_gather_gcc:
> > ld1r {v0.4s}, [x0]
> > ld1 {v0.s}[1], [x1]
> > ld1 {v0.s}[2], [x2]
> > ld1 {v0.s}[3], [x3]
> > ret
> >
> > u32x4_gather_acle:
> > ld1r {v0.4s}, [x0]
> > ld1 {v0.s}[1], [x1]
> > ld1 {v0.s}[2], [x2]
> > ld1 {v0.s}[3], [x3]
> > ret
> >
> > The first 'ld1r' updates all the lanes in both the cases.
>
>
> Please check actual generated code for ACL case. We can see difference
I think there is something wrong with the way you are looking at the generated code. Please see comments below.
> 0x00000000005cc1dc <+1884>: 80 6a 65 bc ldr s0, [x20, x5]
> vs
> 0x00000000005cc1dc <+1884>: 9e 6a 65 b8 ldr w30, [x20, x5]
The register W30 is a scalar register.
>
> With patch:
>
> 244 /* Gather 4 bytes of input data for each stream. */
> 245 input = vdupq_n_s32(GET_NEXT_4BYTES(parms, 0));
> 0x00000000005cc1c8 <+1864>: b4 4f 46 a9 ldp x20, x19, [x29, #96]
> 0x00000000005cc1d8 <+1880>: 65 02 40 b9 ldr w5, [x19]
> 0x00000000005cc1dc <+1884>: 80 6a 65 bc ldr s0, [x20, x5]
> 0x00000000005cc26c <+2028>: 73 12 00 91 add x19, x19, #0x4
> 0x00000000005cc2ac <+2092>: b3 37 00 f9 str x19, [x29, #104]
>
> 246 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input,
This one and below ones are not containing any vector instructions.
> 1);
> 0x00000000005cc1d0 <+1872>: a6 9f 47 a9 ldp x6, x7, [x29, #120]
> 0x00000000005cc1ec <+1900>: e5 00 40 b9 ldr w5, [x7]
> 0x00000000005cc1f0 <+1904>: d6 68 65 b8 ldr w22, [x6, x5]
> 0x00000000005cc21c <+1948>: e7 10 00 91 add x7, x7, #0x4
> 0x00000000005cc260 <+2016>: a7 43 00 f9 str x7, [x29, #128]
>
> 247 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input,
> 2);
> 0x00000000005cc1d4 <+1876>: b5 4b 40 f9 ldr x21, [x29, #144]
> 0x00000000005cc1f4 <+1908>: a6 4f 40 f9 ldr x6, [x29, #152]
> 0x00000000005cc1f8 <+1912>: d4 00 40 b9 ldr w20, [x6]
> 0x00000000005cc1fc <+1916>: b5 6a 74 b8 ldr w21, [x21, x20]
> 0x00000000005cc224 <+1956>: c6 10 00 91 add x6, x6, #0x4
> 0x00000000005cc264 <+2020>: a6 4f 00 f9 str x6, [x29, #152]
>
> 248 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input,
> 3);
> 0x00000000005cc200 <+1920>: a5 5b 40 f9 ldr x5, [x29, #176]
> 0x00000000005cc204 <+1924>: b4 00 40 b9 ldr w20, [x5]
> 0x00000000005cc208 <+1928>: a5 10 00 91 add x5, x5, #0x4
> 0x00000000005cc218 <+1944>: b7 57 40 f9 ldr x23, [x29, #168]
> 0x00000000005cc220 <+1952>: f4 6a 74 b8 ldr w20, [x23, x20]
> 0x00000000005cc228 <+1960>: a5 5b 00 f9 str x5, [x29, #176]
>
> With out patch:
This generated code does not contain any vector instructions. Can you please check?
I changed the code to be similar to ACL code, please look at [1], the generated code is the same.
[1] https://gcc.godbolt.org/z/p1sQNA
>
> 245 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), input,
> 0);
> 0x00000000005cc1c8 <+1864>: b4 4f 46 a9 ldp x20, x19, [x29, #96]
> 0x00000000005cc1d8 <+1880>: 65 02 40 b9 ldr w5, [x19]
> 0x00000000005cc1dc <+1884>: 9e 6a 65 b8 ldr w30, [x20, x5]
> 0x00000000005cc248 <+1992>: 73 12 00 91 add x19, x19, #0x4
> 0x00000000005cc24c <+1996>: b3 37 00 f9 str x19, [x29, #104]
>
> 246 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), input,
> 1);
> 0x00000000005cc1d0 <+1872>: a6 9f 47 a9 ldp x6, x7, [x29, #120]
> 0x00000000005cc1ec <+1900>: e5 00 40 b9 ldr w5, [x7]
> 0x00000000005cc1f0 <+1904>: d6 68 65 b8 ldr w22, [x6, x5]
> 0x00000000005cc228 <+1960>: e7 10 00 91 add x7, x7, #0x4
> 0x00000000005cc240 <+1984>: a7 43 00 f9 str x7, [x29, #128]
>
> 247 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), input,
> 2);
> 0x00000000005cc1d4 <+1876>: b5 4b 40 f9 ldr x21, [x29, #144]
> 0x00000000005cc1f4 <+1908>: a6 4f 40 f9 ldr x6, [x29, #152]
> 0x00000000005cc1f8 <+1912>: d4 00 40 b9 ldr w20, [x6]
> 0x00000000005cc1fc <+1916>: b5 6a 74 b8 ldr w21, [x21, x20]
> 0x00000000005cc22c <+1964>: c6 10 00 91 add x6, x6, #0x4
> 0x00000000005cc244 <+1988>: a6 4f 00 f9 str x6, [x29, #152]
>
> 248 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), input,
> 3);
> 0x00000000005cc200 <+1920>: a5 5b 40 f9 ldr x5, [x29, #176]
> 0x00000000005cc204 <+1924>: b4 00 40 b9 ldr w20, [x5]
> 0x00000000005cc208 <+1928>: a5 10 00 91 add x5, x5, #0x4
> 0x00000000005cc21c <+1948>: b7 57 40 f9 ldr x23, [x29, #168]
> 0x00000000005cc224 <+1956>: f4 6a 74 b8 ldr w20, [x23, x20]
> 0x00000000005cc230 <+1968>: a5 5b 00 f9 str x5, [x29, #176]
>
>
> >
> > > To make forward progress send the v2 based on the updated logic
> > > just to make ACLE Spec happy, I don’t see any real reason to do it
> > > though
> > > 😊
> > Thanks for the patch, it was important to make forward progress.
> > But, I think we should carry forward the discussion as I plan to
> > change other parts of DPDK on similar lines. I want to understand why
> > you think there is no real reason. The ACLE recommendation mentions the
> reasoning.
>
> # I see following in the ACLE spec. What is the actual reasoning?
> "
> ACLE does not define static construction of vector types. E.g.
> int32x4_t x = { 1, 2, 3, 4 };
> Is not portable. Use the vcreate or vdup intrinsics to construct values from
> scalars.
> "
Here is the complete text from ACLE 2.1
12.2.6 Compatibility with other vector programming models
Programmers should take particular care when combining the Neon Intrinsics API with alternative vector programming models; ACLE does not specify how the NEON Intrinsics API interoperates with them.
For instance, the GCC vector extension permits
include “arm_neon.h”
...
uint32x2_t x = {0, 1}; // GCC extension.
uint32_t y = vget_lane_s32 (x, 0); // ACLE NEON Intrinsic.
But with this code the value stored in ‘y’ will depend on both the target architecture (AArch32 or AArch64) and whether the program is running in big- or little-endian mode.
It is recommended that NEON Intrinsics be used consistently:
include “arm_neon.h”
...
const int temp[2] = {0, 1};
uint32x2_t x = vld1_s32 (temp);
uint32_t y = vget_lane_s32 (x, 0);
>
> # Why does compiler(gcc) allows if it not indented to use?
I do not have an answer. This is a recommendation and all that I am trying to say is, following the recommendation does not cost us anything in performance.
>
> # I think, it may be time to introduce UndefinedBehaviorSanitizer (UBSan)
> Gcc feature to DPDK to detect undefined behavior checks to detect such case
I am not sure if it helps here.
>
> >
>
> > >
> > > http://patches.dpdk.org/patch/54656/
> > >
next prev parent reply other threads:[~2019-06-17 0:48 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-06-06 14:50 jerinj
2019-06-06 15:55 ` Michael Santana Francisco
2019-06-07 5:42 ` Honnappa Nagarahalli
2019-06-07 5:35 ` Honnappa Nagarahalli
2019-06-07 6:21 ` Jerin Jacob Kollanukkaran
2019-06-10 5:29 ` Honnappa Nagarahalli
2019-06-10 9:39 ` Jerin Jacob Kollanukkaran
2019-06-11 1:27 ` Honnappa Nagarahalli
2019-06-11 14:24 ` Jerin Jacob Kollanukkaran
2019-06-11 19:48 ` Honnappa Nagarahalli
2019-06-12 2:41 ` Jerin Jacob Kollanukkaran
2019-06-17 0:48 ` Honnappa Nagarahalli [this message]
2019-06-17 6:52 ` Jerin Jacob Kollanukkaran
2019-06-10 12:10 ` Aaron Conole
2019-06-11 14:15 ` [dpdk-dev] [PATCH v2] " jerinj
2019-06-11 14:53 ` Aaron Conole
2019-06-11 15:07 ` Thomas Monjalon
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=VE1PR08MB51496238B1C7148AAE42549A98EB0@VE1PR08MB5149.eurprd08.prod.outlook.com \
--to=honnappa.nagarahalli@arm.com \
--cc=Gavin.Hu@arm.com \
--cc=dev@dpdk.org \
--cc=jerinj@marvell.com \
--cc=nd@arm.com \
--cc=thomas@monjalon.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).