* [dpdk-dev] [PATCH] net/mlx5: fix vectorized mini-CQE prefetching
From: Alexander Kozyrev @ 2020-07-22 20:32 UTC
To: dev; +Cc: stable, rasland, viacheslavo
An earlier optimization prefetched all the CQEs before their
invalidation. It sped up the mini-CQE decompression process by
preheating the cache in the vectorized Rx routine.

Prefetching of the next mini-CQE, on the other hand, showed no
performance difference on the x86 platform, so it was removed.
Unfortunately, this caused a performance drop on ARM.

Prefetch the mini-CQE as well as all the soon-to-be invalidated
CQEs to get both the CQE and the mini-CQE on the hot path.

Fixes: 28a4b9632 ("net/mlx5: prefetch CQEs for a faster decompression")
Cc: stable@dpdk.org
Signed-off-by: Alexander Kozyrev <akozyrev@mellanox.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
---
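Reader's note, not part of the commit: the access pattern described above can be sketched outside the driver as follows. This is a minimal illustration under stated assumptions, not the mlx5 code. The names `demo_cqe`, `demo_prefetch0`, `demo_walk` and `DEMO_DESCS_PER_LOOP` are hypothetical; the GCC/Clang builtin `__builtin_prefetch` stands in for `rte_prefetch0`; and the value 4 for the per-iteration step is an assumption standing in for `MLX5_VPMD_DESCS_PER_LOOP`. The function only counts the look-ahead prefetches so the boundary logic can be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed step per loop iteration (stand-in for MLX5_VPMD_DESCS_PER_LOOP). */
#define DEMO_DESCS_PER_LOOP 4

/* Hypothetical stand-in for a completion queue entry: one 64-byte slot. */
struct demo_cqe {
	uint8_t bytes[64];
};

/* Stand-in for rte_prefetch0(): read prefetch with high temporal locality. */
static inline void
demo_prefetch0(const volatile void *p)
{
	__builtin_prefetch((const void *)p, 0, 3);
}

/*
 * Walk n entries in steps of DEMO_DESCS_PER_LOOP. Every iteration
 * prefetches the entries it is about to touch. At each 8-entry boundary
 * it additionally prefetches the entry 8 positions ahead, but only if
 * that entry exists. Returns the number of look-ahead prefetches issued.
 */
static unsigned int
demo_walk(const struct demo_cqe *cq, unsigned int n)
{
	unsigned int pos = 0;
	unsigned int issued = 0;
	unsigned int i;

	assert(cq != NULL || n == 0);
	while (pos < n) {
		/* Warm the cache for the entries handled in this iteration. */
		for (i = 0; i < DEMO_DESCS_PER_LOOP; ++i)
			if (pos + i < n)
				demo_prefetch0(cq + pos + i);
		/* ... entries pos .. pos + 3 would be processed here ... */
		pos += DEMO_DESCS_PER_LOOP;
		/* On an 8-entry boundary, look one full block ahead. */
		if (!(pos & 0x7) && pos < n) {
			if (pos + 8 < n) {
				demo_prefetch0(cq + pos + 8);
				++issued;
			}
		}
	}
	return issued;
}
```

With a 32-entry array the look-ahead fires at positions 8 and 16 only: at position 24 the target index 32 is past the end, so the guard skips it. The guard keeps the prefetch address inside the array; whether the extra prefetch pays off depends on the platform, as the commit message reports for x86 versus ARM.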
drivers/net/mlx5/mlx5_rxtx_vec_altivec.h | 3 ++-
drivers/net/mlx5/mlx5_rxtx_vec_neon.h | 3 +++
drivers/net/mlx5/mlx5_rxtx_vec_sse.h | 3 ++-
3 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h b/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h
index f5414eebad..cb4ce1a099 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h
@@ -158,7 +158,6 @@ rxq_cq_decompress_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
 		for (i = 0; i < MLX5_VPMD_DESCS_PER_LOOP; ++i)
 			if (likely(pos + i < mcqe_n))
 				rte_prefetch0((void *)(cq + pos + i));
-
 		/* A.1 load mCQEs into a 128bit register. */
 		mcqe1 = (vector unsigned char)vec_vsx_ld(0,
 			(signed int const *)&mcq[pos % 8]);
@@ -287,6 +286,8 @@ rxq_cq_decompress_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
 		pos += MLX5_VPMD_DESCS_PER_LOOP;
 		/* Move to next CQE and invalidate consumed CQEs. */
 		if (!(pos & 0x7) && pos < mcqe_n) {
+			if (pos + 8 < mcqe_n)
+				rte_prefetch0((void *)(cq + pos + 8));
 			mcq = (void *)&(cq + pos)->pkt_info;
 			for (i = 0; i < 8; ++i)
 				cq[inv++].op_own = MLX5_CQE_INVALIDATE;
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
index 555c342626..6c3149523e 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
@@ -145,6 +145,7 @@ rxq_cq_decompress_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
 				-1UL << ((mcqe_n - pos) *
 					sizeof(uint16_t) * 8) : 0);
 #endif
+
 		for (i = 0; i < MLX5_VPMD_DESCS_PER_LOOP; ++i)
 			if (likely(pos + i < mcqe_n))
 				rte_prefetch0((void *)(cq + pos + i));
@@ -227,6 +228,8 @@ rxq_cq_decompress_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
 		pos += MLX5_VPMD_DESCS_PER_LOOP;
 		/* Move to next CQE and invalidate consumed CQEs. */
 		if (!(pos & 0x7) && pos < mcqe_n) {
+			if (pos + 8 < mcqe_n)
+				rte_prefetch0((void *)(cq + pos + 8));
 			mcq = (void *)&(cq + pos)->pkt_info;
 			for (i = 0; i < 8; ++i)
 				cq[inv++].op_own = MLX5_CQE_INVALIDATE;
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_sse.h b/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
index 34e3397115..554924d7fc 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
@@ -135,7 +135,6 @@ rxq_cq_decompress_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
 		for (i = 0; i < MLX5_VPMD_DESCS_PER_LOOP; ++i)
 			if (likely(pos + i < mcqe_n))
 				rte_prefetch0((void *)(cq + pos + i));
-
 		/* A.1 load mCQEs into a 128bit register. */
 		mcqe1 = _mm_loadu_si128((__m128i *)&mcq[pos % 8]);
 		mcqe2 = _mm_loadu_si128((__m128i *)&mcq[pos % 8 + 2]);
@@ -214,6 +213,8 @@ rxq_cq_decompress_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
 		pos += MLX5_VPMD_DESCS_PER_LOOP;
 		/* Move to next CQE and invalidate consumed CQEs. */
 		if (!(pos & 0x7) && pos < mcqe_n) {
+			if (pos + 8 < mcqe_n)
+				rte_prefetch0((void *)(cq + pos + 8));
 			mcq = (void *)(cq + pos);
 			for (i = 0; i < 8; ++i)
 				cq[inv++].op_own = MLX5_CQE_INVALIDATE;
--
2.24.1
* Re: [dpdk-dev] [PATCH] net/mlx5: fix vectorized mini-CQE prefetching
From: Raslan Darawsheh @ 2020-07-23 8:57 UTC
To: Alexander Kozyrev, dev; +Cc: stable, Slava Ovsiienko
Hi,
Patch applied to next-net-mlx.
Kindest regards,
Raslan Darawsheh