* [dpdk-dev] [PATCH 0/2] i40e neon vPMD optiomization for aarch64
@ 2019-08-13 10:43 Gavin Hu
2019-08-13 10:43 ` [dpdk-dev] [PATCH 1/2] net/i40e: desc loading is unnecessarily ordered " Gavin Hu
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Gavin Hu @ 2019-08-13 10:43 UTC (permalink / raw)
To: dev
Cc: nd, thomas, jerinj, pbhagavatula, Honnappa.Nagarahalli,
qi.z.zhang, bruce.richardson
Aarch64 neon vPMD survives across discontinuous DD bits, which makes
the ordering for descriptors loading unnecessary.
Similarly, the compiler barrier to order the extraction of packet
length is not needed any more when the extraction was simplified
by anothe patch.
Gavin Hu (2):
net/i40e: desc loading is unnecessarily ordered for aarch64
net/i40e: remove compiler barrier for aarch64
drivers/net/i40e/i40e_rxtx_vec_neon.c | 5 -----
1 file changed, 5 deletions(-)
--
2.7.4
^ permalink raw reply [flat|nested] 8+ messages in thread
* [dpdk-dev] [PATCH 1/2] net/i40e: desc loading is unnecessarily ordered for aarch64
2019-08-13 10:43 [dpdk-dev] [PATCH 0/2] i40e neon vPMD optiomization for aarch64 Gavin Hu
@ 2019-08-13 10:43 ` Gavin Hu
2019-08-28 22:09 ` Honnappa Nagarahalli
2019-08-13 10:43 ` [dpdk-dev] [PATCH 2/2] net/i40e: remove compiler barrier " Gavin Hu
2019-09-04 7:49 ` [dpdk-dev] [PATCH 0/2] i40e neon vPMD optiomization " Ferruh Yigit
2 siblings, 1 reply; 8+ messages in thread
From: Gavin Hu @ 2019-08-13 10:43 UTC (permalink / raw)
To: dev
Cc: nd, thomas, jerinj, pbhagavatula, Honnappa.Nagarahalli,
qi.z.zhang, bruce.richardson, stable
For x86, the descriptors needs to be loaded in order, so in between two
descriptors loading, there is a compiler barrier in place.[1]
For aarch64, a patch [2] is in place to survive with discontinuous DD bits,
the barriers can be removed to take full advantage of out-of-order
execution.
50% performance gain in the RFC2544 NDR test was measured on ThunderX2.
12.50% performan gain in the RFC2544 NDR test was measured on Ampere
eMAG80 platform.
[1] http://inbox.dpdk.org/users/039ED4275CED7440929022BC67E7061153D71548@
SHSMSX105.ccr.corp.intel.com/
[2] https://mails.dpdk.org/archives/stable/2017-October/003324.html
Fixes: ae0eb310f253 ("net/i40e: implement vector PMD for ARM")
Cc: stable@dpdk.org
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
---
drivers/net/i40e/i40e_rxtx_vec_neon.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c b/drivers/net/i40e/i40e_rxtx_vec_neon.c
index 83572ef..5555e9b 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
@@ -285,7 +285,6 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
/* Read desc statuses backwards to avoid race condition */
/* A.1 load 4 pkts desc */
descs[3] = vld1q_u64((uint64_t *)(rxdp + 3));
- rte_rmb();
/* B.2 copy 2 mbuf point into rx_pkts */
vst1q_u64((uint64_t *)&rx_pkts[pos], mbp1);
--
2.7.4
^ permalink raw reply [flat|nested] 8+ messages in thread
* [dpdk-dev] [PATCH 2/2] net/i40e: remove compiler barrier for aarch64
2019-08-13 10:43 [dpdk-dev] [PATCH 0/2] i40e neon vPMD optiomization for aarch64 Gavin Hu
2019-08-13 10:43 ` [dpdk-dev] [PATCH 1/2] net/i40e: desc loading is unnecessarily ordered " Gavin Hu
@ 2019-08-13 10:43 ` Gavin Hu
2019-08-28 22:48 ` Honnappa Nagarahalli
2019-09-04 7:49 ` [dpdk-dev] [PATCH 0/2] i40e neon vPMD optiomization " Ferruh Yigit
2 siblings, 1 reply; 8+ messages in thread
From: Gavin Hu @ 2019-08-13 10:43 UTC (permalink / raw)
To: dev
Cc: nd, thomas, jerinj, pbhagavatula, Honnappa.Nagarahalli,
qi.z.zhang, bruce.richardson, stable
As packet length extraction code was simplified,the ordering
was not necessary any more.[1]
2% performance gain was measured on Marvell ThunderX2.
4.3% performance gain was measure on Ampere eMAG80
[1] http://mails.dpdk.org/archives/dev/2016-April/037529.html
Fixes: ae0eb310f253 ("net/i40e: implement vector PMD for ARM")
Cc: stable@dpdk.org
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
---
drivers/net/i40e/i40e_rxtx_vec_neon.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c b/drivers/net/i40e/i40e_rxtx_vec_neon.c
index 5555e9b..864eb9a 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
@@ -307,9 +307,6 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
rte_mbuf_prefetch_part2(rx_pkts[pos + 3]);
}
- /* avoid compiler reorder optimization */
- rte_compiler_barrier();
-
/* pkt 3,4 shift the pktlen field to be 16-bit aligned*/
uint32x4_t len3 = vshlq_u32(vreinterpretq_u32_u64(descs[3]),
len_shl);
--
2.7.4
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [dpdk-dev] [PATCH 1/2] net/i40e: desc loading is unnecessarily ordered for aarch64
2019-08-13 10:43 ` [dpdk-dev] [PATCH 1/2] net/i40e: desc loading is unnecessarily ordered " Gavin Hu
@ 2019-08-28 22:09 ` Honnappa Nagarahalli
2019-08-30 8:33 ` Gavin Hu (Arm Technology China)
0 siblings, 1 reply; 8+ messages in thread
From: Honnappa Nagarahalli @ 2019-08-28 22:09 UTC (permalink / raw)
To: Gavin Hu (Arm Technology China), dev
Cc: nd, thomas, jerinj, pbhagavatula, qi.z.zhang, bruce.richardson,
stable, Honnappa Nagarahalli, nd
Thanks Gavin, few comments are inline
> -----Original Message-----
> From: Gavin Hu <gavin.hu@arm.com>
> Sent: Tuesday, August 13, 2019 5:44 AM
> To: dev@dpdk.org
> Cc: nd <nd@arm.com>; thomas@monjalon.net; jerinj@marvell.com;
> pbhagavatula@marvell.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; qi.z.zhang@intel.com;
> bruce.richardson@intel.com; stable@dpdk.org
> Subject: [PATCH 1/2] net/i40e: desc loading is unnecessarily ordered for
> aarch64
>
> For x86, the descriptors needs to be loaded in order, so in between two
> descriptors loading, there is a compiler barrier in place.
IMO, we can skip the above as this change applies to Arm platforms. Instead, capture this in the code in comments to explain why the ordering of the loads is not required. This will help others reading the code.
[1] For aarch64, a
> patch [2] is in place to survive with discontinuous DD bits, the barriers can be
> removed to take full advantage of out-of-order execution.
>
> 50% performance gain in the RFC2544 NDR test was measured on ThunderX2.
> 12.50% performan gain in the RFC2544 NDR test was measured on Ampere
> eMAG80 platform.
>
> [1]
> http://inbox.dpdk.org/users/039ED4275CED7440929022BC67E7061153D71
> 548@
> SHSMSX105.ccr.corp.intel.com/
> [2] https://mails.dpdk.org/archives/stable/2017-October/003324.html
>
> Fixes: ae0eb310f253 ("net/i40e: implement vector PMD for ARM")
> Cc: stable@dpdk.org
>
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> ---
> drivers/net/i40e/i40e_rxtx_vec_neon.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c
> b/drivers/net/i40e/i40e_rxtx_vec_neon.c
> index 83572ef..5555e9b 100644
> --- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
> +++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
> @@ -285,7 +285,6 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *rxq,
> struct rte_mbuf **rx_pkts,
> /* Read desc statuses backwards to avoid race condition */
> /* A.1 load 4 pkts desc */
> descs[3] = vld1q_u64((uint64_t *)(rxdp + 3));
> - rte_rmb();
>
> /* B.2 copy 2 mbuf point into rx_pkts */
> vst1q_u64((uint64_t *)&rx_pkts[pos], mbp1);
> --
> 2.7.4
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [dpdk-dev] [PATCH 2/2] net/i40e: remove compiler barrier for aarch64
2019-08-13 10:43 ` [dpdk-dev] [PATCH 2/2] net/i40e: remove compiler barrier " Gavin Hu
@ 2019-08-28 22:48 ` Honnappa Nagarahalli
2019-08-30 8:51 ` Gavin Hu (Arm Technology China)
0 siblings, 1 reply; 8+ messages in thread
From: Honnappa Nagarahalli @ 2019-08-28 22:48 UTC (permalink / raw)
To: Gavin Hu (Arm Technology China), dev
Cc: nd, thomas, jerinj, pbhagavatula, qi.z.zhang, bruce.richardson,
stable, Honnappa Nagarahalli, nd
>
> As packet length extraction code was simplified,the ordering was not
> necessary any more.[1]
IMO, there is no relationship between the compiler barrier and [1] at least on Arm platforms. I suggest we just say 'there is no reason for the compiler barrier'.
I think this compiler barrier is not required for x86/PPC as well.
>
> 2% performance gain was measured on Marvell ThunderX2.
> 4.3% performance gain was measure on Ampere eMAG80
>
> [1] http://mails.dpdk.org/archives/dev/2016-April/037529.html
>
> Fixes: ae0eb310f253 ("net/i40e: implement vector PMD for ARM")
> Cc: stable@dpdk.org
>
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> ---
> drivers/net/i40e/i40e_rxtx_vec_neon.c | 3 ---
> 1 file changed, 3 deletions(-)
>
> diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c
> b/drivers/net/i40e/i40e_rxtx_vec_neon.c
> index 5555e9b..864eb9a 100644
> --- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
> +++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
> @@ -307,9 +307,6 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *rxq,
> struct rte_mbuf **rx_pkts,
> rte_mbuf_prefetch_part2(rx_pkts[pos + 3]);
> }
>
> - /* avoid compiler reorder optimization */
> - rte_compiler_barrier();
> -
> /* pkt 3,4 shift the pktlen field to be 16-bit aligned*/
> uint32x4_t len3 =
> vshlq_u32(vreinterpretq_u32_u64(descs[3]),
> len_shl);
> --
> 2.7.4
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [dpdk-dev] [PATCH 1/2] net/i40e: desc loading is unnecessarily ordered for aarch64
2019-08-28 22:09 ` Honnappa Nagarahalli
@ 2019-08-30 8:33 ` Gavin Hu (Arm Technology China)
0 siblings, 0 replies; 8+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-08-30 8:33 UTC (permalink / raw)
To: Honnappa Nagarahalli, dev
Cc: nd, thomas, jerinj, pbhagavatula, qi.z.zhang, bruce.richardson,
stable, nd
Hi Honnappa,
> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Thursday, August 29, 2019 6:10 AM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>;
> dev@dpdk.org
> Cc: nd <nd@arm.com>; thomas@monjalon.net; jerinj@marvell.com;
> pbhagavatula@marvell.com; qi.z.zhang@intel.com;
> bruce.richardson@intel.com; stable@dpdk.org; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH 1/2] net/i40e: desc loading is unnecessarily ordered for
> aarch64
>
> Thanks Gavin, few comments are inline
>
> > -----Original Message-----
> > From: Gavin Hu <gavin.hu@arm.com>
> > Sent: Tuesday, August 13, 2019 5:44 AM
> > To: dev@dpdk.org
> > Cc: nd <nd@arm.com>; thomas@monjalon.net; jerinj@marvell.com;
> > pbhagavatula@marvell.com; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; qi.z.zhang@intel.com;
> > bruce.richardson@intel.com; stable@dpdk.org
> > Subject: [PATCH 1/2] net/i40e: desc loading is unnecessarily ordered for
> > aarch64
> >
> > For x86, the descriptors needs to be loaded in order, so in between two
> > descriptors loading, there is a compiler barrier in place.
> IMO, we can skip the above as this change applies to Arm platforms. Instead,
> capture this in the code in comments to explain why the ordering of the
> loads is not required. This will help others reading the code.
As the line of code was removed, there is no suitable place to add a comment.
Instead adding it in the commit log makes the story complete and easy to understand.
> [1] For aarch64, a
> > patch [2] is in place to survive with discontinuous DD bits, the barriers can
> be
> > removed to take full advantage of out-of-order execution.
> >
> > 50% performance gain in the RFC2544 NDR test was measured on
> ThunderX2.
> > 12.50% performan gain in the RFC2544 NDR test was measured on
> Ampere
> > eMAG80 platform.
> >
> > [1]
> >
> http://inbox.dpdk.org/users/039ED4275CED7440929022BC67E7061153D71
> > 548@
> > SHSMSX105.ccr.corp.intel.com/
> > [2] https://mails.dpdk.org/archives/stable/2017-October/003324.html
> >
> > Fixes: ae0eb310f253 ("net/i40e: implement vector PMD for ARM")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > ---
> > drivers/net/i40e/i40e_rxtx_vec_neon.c | 1 -
> > 1 file changed, 1 deletion(-)
> >
> > diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c
> > b/drivers/net/i40e/i40e_rxtx_vec_neon.c
> > index 83572ef..5555e9b 100644
> > --- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
> > +++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
> > @@ -285,7 +285,6 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *rxq,
> > struct rte_mbuf **rx_pkts,
> > /* Read desc statuses backwards to avoid race condition */
> > /* A.1 load 4 pkts desc */
> > descs[3] = vld1q_u64((uint64_t *)(rxdp + 3));
> > - rte_rmb();
> >
> > /* B.2 copy 2 mbuf point into rx_pkts */
> > vst1q_u64((uint64_t *)&rx_pkts[pos], mbp1);
> > --
> > 2.7.4
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [dpdk-dev] [PATCH 2/2] net/i40e: remove compiler barrier for aarch64
2019-08-28 22:48 ` Honnappa Nagarahalli
@ 2019-08-30 8:51 ` Gavin Hu (Arm Technology China)
0 siblings, 0 replies; 8+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-08-30 8:51 UTC (permalink / raw)
To: Honnappa Nagarahalli, dev
Cc: nd, thomas, jerinj, pbhagavatula, qi.z.zhang, bruce.richardson,
stable, nd
Hi Honnappa,
> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Thursday, August 29, 2019 6:49 AM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>;
> dev@dpdk.org
> Cc: nd <nd@arm.com>; thomas@monjalon.net; jerinj@marvell.com;
> pbhagavatula@marvell.com; qi.z.zhang@intel.com;
> bruce.richardson@intel.com; stable@dpdk.org; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH 2/2] net/i40e: remove compiler barrier for aarch64
>
> >
> > As packet length extraction code was simplified,the ordering was not
> > necessary any more.[1]
> IMO, there is no relationship between the compiler barrier and [1] at least
> on Arm platforms. I suggest we just say 'there is no reason for the compiler
> barrier'.
> I think this compiler barrier is not required for x86/PPC as well.
The compiler barrier was ever really required for x86, as the two accesses to the desc[] entry must be ordered.
After [1] was applied, the first access was removed, then there is no reason for the compiler barrier.
For aarch64, it borrows the barrier and does not change according to the new code, so the barrier can be removed also.
Hopefully I got the whole story across clearly and completely.
>
> >
> > 2% performance gain was measured on Marvell ThunderX2.
> > 4.3% performance gain was measure on Ampere eMAG80
> >
> > [1] http://mails.dpdk.org/archives/dev/2016-April/037529.html
> >
> > Fixes: ae0eb310f253 ("net/i40e: implement vector PMD for ARM")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > ---
> > drivers/net/i40e/i40e_rxtx_vec_neon.c | 3 ---
> > 1 file changed, 3 deletions(-)
> >
> > diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c
> > b/drivers/net/i40e/i40e_rxtx_vec_neon.c
> > index 5555e9b..864eb9a 100644
> > --- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
> > +++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
> > @@ -307,9 +307,6 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *rxq,
> > struct rte_mbuf **rx_pkts,
> > rte_mbuf_prefetch_part2(rx_pkts[pos + 3]);
> > }
> >
> > - /* avoid compiler reorder optimization */
> > - rte_compiler_barrier();
> > -
> > /* pkt 3,4 shift the pktlen field to be 16-bit aligned*/
> > uint32x4_t len3 =
> > vshlq_u32(vreinterpretq_u32_u64(descs[3]),
> > len_shl);
> > --
> > 2.7.4
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] i40e neon vPMD optiomization for aarch64
2019-08-13 10:43 [dpdk-dev] [PATCH 0/2] i40e neon vPMD optiomization for aarch64 Gavin Hu
2019-08-13 10:43 ` [dpdk-dev] [PATCH 1/2] net/i40e: desc loading is unnecessarily ordered " Gavin Hu
2019-08-13 10:43 ` [dpdk-dev] [PATCH 2/2] net/i40e: remove compiler barrier " Gavin Hu
@ 2019-09-04 7:49 ` Ferruh Yigit
2 siblings, 0 replies; 8+ messages in thread
From: Ferruh Yigit @ 2019-09-04 7:49 UTC (permalink / raw)
To: Gavin Hu, dev
Cc: nd, thomas, jerinj, pbhagavatula, Honnappa.Nagarahalli,
qi.z.zhang, bruce.richardson
On 8/13/2019 11:43 AM, Gavin Hu wrote:
> Aarch64 neon vPMD survives across discontinuous DD bits, which makes
> the ordering for descriptors loading unnecessary.
> Similarly, the compiler barrier to order the extraction of packet
> length is not needed any more when the extraction was simplified
> by anothe patch.
>
> Gavin Hu (2):
> net/i40e: desc loading is unnecessarily ordered for aarch64
> net/i40e: remove compiler barrier for aarch64
>
Series applied to dpdk-next-net/master, thanks.
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2019-09-04 7:49 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-13 10:43 [dpdk-dev] [PATCH 0/2] i40e neon vPMD optiomization for aarch64 Gavin Hu
2019-08-13 10:43 ` [dpdk-dev] [PATCH 1/2] net/i40e: desc loading is unnecessarily ordered " Gavin Hu
2019-08-28 22:09 ` Honnappa Nagarahalli
2019-08-30 8:33 ` Gavin Hu (Arm Technology China)
2019-08-13 10:43 ` [dpdk-dev] [PATCH 2/2] net/i40e: remove compiler barrier " Gavin Hu
2019-08-28 22:48 ` Honnappa Nagarahalli
2019-08-30 8:51 ` Gavin Hu (Arm Technology China)
2019-09-04 7:49 ` [dpdk-dev] [PATCH 0/2] i40e neon vPMD optiomization " Ferruh Yigit
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).