* [dpdk-dev] [PATCH 0/2] MLX5 PMD tuning
@ 2021-06-01 8:30 Ruifeng Wang
2021-06-01 8:30 ` [dpdk-dev] [PATCH 1/2] net/mlx5: remove redundant operations Ruifeng Wang
` (3 more replies)
0 siblings, 4 replies; 16+ messages in thread
From: Ruifeng Wang @ 2021-06-01 8:30 UTC (permalink / raw)
To: rasland, matan, shahafs, viacheslavo
Cc: dev, jerinj, nd, honnappa.nagarahalli, Ruifeng Wang
This series include optimizations for MLX5 PMD.
In tests on Arm N1SDP with MLX5 40G NIC, changes
showed performance gain.
Ruifeng Wang (2):
net/mlx5: remove redundant operations
net/mlx5: reduce unnecessary memory access
drivers/net/mlx5/mlx5_rxtx_vec.c | 6 ++++--
drivers/net/mlx5/mlx5_rxtx_vec_neon.h | 9 +--------
2 files changed, 5 insertions(+), 10 deletions(-)
--
2.25.1
^ permalink raw reply [flat|nested] 16+ messages in thread
* [dpdk-dev] [PATCH 1/2] net/mlx5: remove redundant operations
2021-06-01 8:30 [dpdk-dev] [PATCH 0/2] MLX5 PMD tuning Ruifeng Wang
@ 2021-06-01 8:30 ` Ruifeng Wang
2021-07-02 8:12 ` Slava Ovsiienko
2021-06-01 8:30 ` [dpdk-dev] [PATCH 2/2] net/mlx5: reduce unnecessary memory access Ruifeng Wang
` (2 subsequent siblings)
3 siblings, 1 reply; 16+ messages in thread
From: Ruifeng Wang @ 2021-06-01 8:30 UTC (permalink / raw)
To: rasland, matan, shahafs, viacheslavo
Cc: dev, jerinj, nd, honnappa.nagarahalli, Ruifeng Wang, stable
Some operations on mask are redundant and can be removed.
The change yielded 1.6% performance gain on N1SDP.
On ThunderX2, slight performance uplift was also observed.
Fixes: 570acdb1da8a ("net/mlx5: add vectorized Rx/Tx burst for ARM")
Cc: stable@dpdk.org
Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
drivers/net/mlx5/mlx5_rxtx_vec_neon.h | 9 +--------
1 file changed, 1 insertion(+), 8 deletions(-)
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
index 2234fbe6b2..98a75b09c6 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
@@ -768,18 +768,11 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
comp_mask), 0)) /
(sizeof(uint16_t) * 8);
/* D.6 mask out entries after the compressed CQE. */
- mask = vcreate_u16(comp_idx < MLX5_VPMD_DESCS_PER_LOOP ?
- -1UL >> (comp_idx * sizeof(uint16_t) * 8) :
- 0);
- invalid_mask = vorr_u16(invalid_mask, mask);
+ invalid_mask = vorr_u16(invalid_mask, comp_mask);
/* D.7 count non-compressed valid CQEs. */
n = __builtin_clzl(vget_lane_u64(vreinterpret_u64_u16(
invalid_mask), 0)) / (sizeof(uint16_t) * 8);
nocmp_n += n;
- /* D.2 get the final invalid mask. */
- mask = vcreate_u16(n < MLX5_VPMD_DESCS_PER_LOOP ?
- -1UL >> (n * sizeof(uint16_t) * 8) : 0);
- invalid_mask = vorr_u16(invalid_mask, mask);
/* D.3 check error in opcode. */
opcode = vceq_u16(resp_err_check, opcode);
opcode = vbic_u16(opcode, invalid_mask);
--
2.25.1
^ permalink raw reply [flat|nested] 16+ messages in thread
* [dpdk-dev] [PATCH 2/2] net/mlx5: reduce unnecessary memory access
2021-06-01 8:30 [dpdk-dev] [PATCH 0/2] MLX5 PMD tuning Ruifeng Wang
2021-06-01 8:30 ` [dpdk-dev] [PATCH 1/2] net/mlx5: remove redundant operations Ruifeng Wang
@ 2021-06-01 8:30 ` Ruifeng Wang
2021-07-02 7:05 ` Slava Ovsiienko
2021-06-30 7:22 ` [dpdk-dev] [PATCH 0/2] MLX5 PMD tuning Ruifeng Wang
2021-07-07 9:03 ` [dpdk-dev] [PATCH v2 " Ruifeng Wang
3 siblings, 1 reply; 16+ messages in thread
From: Ruifeng Wang @ 2021-06-01 8:30 UTC (permalink / raw)
To: rasland, matan, shahafs, viacheslavo
Cc: dev, jerinj, nd, honnappa.nagarahalli, Ruifeng Wang
MR btree len is a constant during Rx replenish.
Moved retrieve of the value out of loop to reduce data loads.
Slight performance uplift was measured on N1SDP.
Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
drivers/net/mlx5/mlx5_rxtx_vec.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.c b/drivers/net/mlx5/mlx5_rxtx_vec.c
index d5af2d91ff..fc7e2a7f41 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec.c
+++ b/drivers/net/mlx5/mlx5_rxtx_vec.c
@@ -95,6 +95,7 @@ mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data *rxq)
volatile struct mlx5_wqe_data_seg *wq =
&((volatile struct mlx5_wqe_data_seg *)rxq->wqes)[elts_idx];
unsigned int i;
+ uint16_t btree_len;
if (n >= rxq->rq_repl_thresh) {
MLX5_ASSERT(n >= MLX5_VPMD_RXQ_RPLNSH_THRESH(q_n));
@@ -106,6 +107,8 @@ mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data *rxq)
rxq->stats.rx_nombuf += n;
return;
}
+
+ btree_len = mlx5_mr_btree_len(&rxq->mr_ctrl.cache_bh);
for (i = 0; i < n; ++i) {
void *buf_addr;
@@ -119,8 +122,7 @@ mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data *rxq)
wq[i].addr = rte_cpu_to_be_64((uintptr_t)buf_addr +
RTE_PKTMBUF_HEADROOM);
/* If there's a single MR, no need to replace LKey. */
- if (unlikely(mlx5_mr_btree_len(&rxq->mr_ctrl.cache_bh)
- > 1))
+ if (unlikely(btree_len > 1))
wq[i].lkey = mlx5_rx_mb2mr(rxq, elts[i]);
}
rxq->rq_ci += n;
--
2.25.1
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [dpdk-dev] [PATCH 0/2] MLX5 PMD tuning
2021-06-01 8:30 [dpdk-dev] [PATCH 0/2] MLX5 PMD tuning Ruifeng Wang
2021-06-01 8:30 ` [dpdk-dev] [PATCH 1/2] net/mlx5: remove redundant operations Ruifeng Wang
2021-06-01 8:30 ` [dpdk-dev] [PATCH 2/2] net/mlx5: reduce unnecessary memory access Ruifeng Wang
@ 2021-06-30 7:22 ` Ruifeng Wang
2021-07-07 9:03 ` [dpdk-dev] [PATCH v2 " Ruifeng Wang
3 siblings, 0 replies; 16+ messages in thread
From: Ruifeng Wang @ 2021-06-30 7:22 UTC (permalink / raw)
To: rasland, matan, shahafs, viacheslavo
Cc: dev, jerinj, nd, Honnappa Nagarahalli, Ruifeng Wang, nd
> -----Original Message-----
> From: Ruifeng Wang <ruifeng.wang@arm.com>
> Sent: Tuesday, June 1, 2021 4:31 PM
> To: rasland@nvidia.com; matan@nvidia.com; shahafs@nvidia.com;
> viacheslavo@nvidia.com
> Cc: dev@dpdk.org; jerinj@marvell.com; nd <nd@arm.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>
> Subject: [PATCH 0/2] MLX5 PMD tuning
>
> This series include optimizations for MLX5 PMD.
> In tests on Arm N1SDP with MLX5 40G NIC, changes showed performance
> gain.
>
> Ruifeng Wang (2):
> net/mlx5: remove redundant operations
> net/mlx5: reduce unnecessary memory access
>
> drivers/net/mlx5/mlx5_rxtx_vec.c | 6 ++++--
> drivers/net/mlx5/mlx5_rxtx_vec_neon.h | 9 +--------
> 2 files changed, 5 insertions(+), 10 deletions(-)
>
> --
> 2.25.1
Ping.
Appreciate your review of these patches.
Thanks.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [dpdk-dev] [PATCH 2/2] net/mlx5: reduce unnecessary memory access
2021-06-01 8:30 ` [dpdk-dev] [PATCH 2/2] net/mlx5: reduce unnecessary memory access Ruifeng Wang
@ 2021-07-02 7:05 ` Slava Ovsiienko
2021-07-02 7:28 ` Ruifeng Wang
0 siblings, 1 reply; 16+ messages in thread
From: Slava Ovsiienko @ 2021-07-02 7:05 UTC (permalink / raw)
To: Ruifeng Wang, Raslan Darawsheh, Matan Azrad, Shahaf Shuler
Cc: dev, jerinj, nd, honnappa.nagarahalli
Hi, Ruifeng
Could we go further and implement loop inside the conditional?
Like this:
if (mlx5_mr_btree_len(&rxq->mr_ctrl.cache_bh) > 1) {
for (i = 0; i < n; ++i) {
void *buf_addr = elts[i]->buf_addr;
wq[i].addr = rte_cpu_to_be_64((uintptr_t)buf_addr +
RTE_PKTMBUF_HEADROOM);
wq[i].lkey = mlx5_rx_mb2mr(rxq, elts[i]);
}
} else {
for (i = 0; i < n; ++i) {
void *buf_addr = elts[i]->buf_addr;
wq[i].addr = rte_cpu_to_be_64((uintptr_t)buf_addr +
RTE_PKTMBUF_HEADROOM);
}
}
What do you think?
Also, we should check the performance on other archs is not affected.
With best regards,
Slava
> -----Original Message-----
> From: Ruifeng Wang <ruifeng.wang@arm.com>
> Sent: Tuesday, June 1, 2021 11:31
> To: Raslan Darawsheh <rasland@nvidia.com>; Matan Azrad
> <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>; Slava Ovsiienko
> <viacheslavo@nvidia.com>
> Cc: dev@dpdk.org; jerinj@marvell.com; nd@arm.com;
> honnappa.nagarahalli@arm.com; Ruifeng Wang <ruifeng.wang@arm.com>
> Subject: [PATCH 2/2] net/mlx5: reduce unnecessary memory access
>
> MR btree len is a constant during Rx replenish.
> Moved retrieve of the value out of loop to reduce data loads.
> Slight performance uplift was measured on N1SDP.
>
> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> ---
> drivers/net/mlx5/mlx5_rxtx_vec.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.c
> b/drivers/net/mlx5/mlx5_rxtx_vec.c
> index d5af2d91ff..fc7e2a7f41 100644
> --- a/drivers/net/mlx5/mlx5_rxtx_vec.c
> +++ b/drivers/net/mlx5/mlx5_rxtx_vec.c
> @@ -95,6 +95,7 @@ mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data
> *rxq)
> volatile struct mlx5_wqe_data_seg *wq =
> &((volatile struct mlx5_wqe_data_seg *)rxq->wqes)[elts_idx];
> unsigned int i;
> + uint16_t btree_len;
>
> if (n >= rxq->rq_repl_thresh) {
> MLX5_ASSERT(n >=
> MLX5_VPMD_RXQ_RPLNSH_THRESH(q_n));
> @@ -106,6 +107,8 @@ mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data
> *rxq)
> rxq->stats.rx_nombuf += n;
> return;
> }
> +
> + btree_len = mlx5_mr_btree_len(&rxq->mr_ctrl.cache_bh);
> for (i = 0; i < n; ++i) {
> void *buf_addr;
>
> @@ -119,8 +122,7 @@ mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data
> *rxq)
> wq[i].addr = rte_cpu_to_be_64((uintptr_t)buf_addr +
>
> RTE_PKTMBUF_HEADROOM);
> /* If there's a single MR, no need to replace LKey. */
> - if (unlikely(mlx5_mr_btree_len(&rxq-
> >mr_ctrl.cache_bh)
> - > 1))
> + if (unlikely(btree_len > 1))
> wq[i].lkey = mlx5_rx_mb2mr(rxq, elts[i]);
> }
> rxq->rq_ci += n;
> --
> 2.25.1
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [dpdk-dev] [PATCH 2/2] net/mlx5: reduce unnecessary memory access
2021-07-02 7:05 ` Slava Ovsiienko
@ 2021-07-02 7:28 ` Ruifeng Wang
0 siblings, 0 replies; 16+ messages in thread
From: Ruifeng Wang @ 2021-07-02 7:28 UTC (permalink / raw)
To: Slava Ovsiienko, Raslan Darawsheh, Matan Azrad, Shahaf Shuler
Cc: dev, jerinj, nd, Honnappa Nagarahalli, nd
> -----Original Message-----
> From: Slava Ovsiienko <viacheslavo@nvidia.com>
> Sent: Friday, July 2, 2021 3:06 PM
> To: Ruifeng Wang <Ruifeng.Wang@arm.com>; Raslan Darawsheh
> <rasland@nvidia.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler
> <shahafs@nvidia.com>
> Cc: dev@dpdk.org; jerinj@marvell.com; nd <nd@arm.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Subject: RE: [PATCH 2/2] net/mlx5: reduce unnecessary memory access
>
> Hi, Ruifeng
>
> Could we go further and implement loop inside the conditional?
> Like this:
> if (mlx5_mr_btree_len(&rxq->mr_ctrl.cache_bh) > 1) {
> for (i = 0; i < n; ++i) {
> void *buf_addr = elts[i]->buf_addr;
>
> wq[i].addr = rte_cpu_to_be_64((uintptr_t)buf_addr +
> RTE_PKTMBUF_HEADROOM);
> wq[i].lkey = mlx5_rx_mb2mr(rxq, elts[i]);
> }
> } else {
> for (i = 0; i < n; ++i) {
> void *buf_addr = elts[i]->buf_addr;
>
> wq[i].addr = rte_cpu_to_be_64((uintptr_t)buf_addr +
> RTE_PKTMBUF_HEADROOM);
> }
> }
> What do you think?
Agree. Loop inside the conditional should be more efficient.
> Also, we should check the performance on other archs is not affected.
I will also test on x86 platform that I have.
>
> With best regards,
> Slava
>
> > -----Original Message-----
> > From: Ruifeng Wang <ruifeng.wang@arm.com>
> > Sent: Tuesday, June 1, 2021 11:31
> > To: Raslan Darawsheh <rasland@nvidia.com>; Matan Azrad
> > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>; Slava
> > Ovsiienko <viacheslavo@nvidia.com>
> > Cc: dev@dpdk.org; jerinj@marvell.com; nd@arm.com;
> > honnappa.nagarahalli@arm.com; Ruifeng Wang <ruifeng.wang@arm.com>
> > Subject: [PATCH 2/2] net/mlx5: reduce unnecessary memory access
> >
> > MR btree len is a constant during Rx replenish.
> > Moved retrieve of the value out of loop to reduce data loads.
> > Slight performance uplift was measured on N1SDP.
> >
> > Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > ---
> > drivers/net/mlx5/mlx5_rxtx_vec.c | 6 ++++--
> > 1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.c
> > b/drivers/net/mlx5/mlx5_rxtx_vec.c
> > index d5af2d91ff..fc7e2a7f41 100644
> > --- a/drivers/net/mlx5/mlx5_rxtx_vec.c
> > +++ b/drivers/net/mlx5/mlx5_rxtx_vec.c
> > @@ -95,6 +95,7 @@ mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data
> > *rxq)
> > volatile struct mlx5_wqe_data_seg *wq =
> > &((volatile struct mlx5_wqe_data_seg *)rxq-
> >wqes)[elts_idx];
> > unsigned int i;
> > + uint16_t btree_len;
> >
> > if (n >= rxq->rq_repl_thresh) {
> > MLX5_ASSERT(n >=
> > MLX5_VPMD_RXQ_RPLNSH_THRESH(q_n));
> > @@ -106,6 +107,8 @@ mlx5_rx_replenish_bulk_mbuf(struct
> mlx5_rxq_data
> > *rxq)
> > rxq->stats.rx_nombuf += n;
> > return;
> > }
> > +
> > + btree_len = mlx5_mr_btree_len(&rxq->mr_ctrl.cache_bh);
> > for (i = 0; i < n; ++i) {
> > void *buf_addr;
> >
> > @@ -119,8 +122,7 @@ mlx5_rx_replenish_bulk_mbuf(struct
> mlx5_rxq_data
> > *rxq)
> > wq[i].addr = rte_cpu_to_be_64((uintptr_t)buf_addr
> +
> >
> > RTE_PKTMBUF_HEADROOM);
> > /* If there's a single MR, no need to replace LKey. */
> > - if (unlikely(mlx5_mr_btree_len(&rxq-
> > >mr_ctrl.cache_bh)
> > - > 1))
> > + if (unlikely(btree_len > 1))
> > wq[i].lkey = mlx5_rx_mb2mr(rxq, elts[i]);
> > }
> > rxq->rq_ci += n;
> > --
> > 2.25.1
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [dpdk-dev] [PATCH 1/2] net/mlx5: remove redundant operations
2021-06-01 8:30 ` [dpdk-dev] [PATCH 1/2] net/mlx5: remove redundant operations Ruifeng Wang
@ 2021-07-02 8:12 ` Slava Ovsiienko
2021-07-02 10:30 ` Ruifeng Wang
0 siblings, 1 reply; 16+ messages in thread
From: Slava Ovsiienko @ 2021-07-02 8:12 UTC (permalink / raw)
To: Ruifeng Wang, Raslan Darawsheh, Matan Azrad, Shahaf Shuler
Cc: dev, jerinj, nd, honnappa.nagarahalli, stable
Hi, Ruifeng
> -----Original Message-----
> From: Ruifeng Wang <ruifeng.wang@arm.com>
> Sent: Tuesday, June 1, 2021 11:31
> To: Raslan Darawsheh <rasland@nvidia.com>; Matan Azrad
> <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>; Slava Ovsiienko
> <viacheslavo@nvidia.com>
> Cc: dev@dpdk.org; jerinj@marvell.com; nd@arm.com;
> honnappa.nagarahalli@arm.com; Ruifeng Wang <ruifeng.wang@arm.com>;
> stable@dpdk.org
> Subject: [PATCH 1/2] net/mlx5: remove redundant operations
>
> Some operations on mask are redundant and can be removed.
> The change yielded 1.6% performance gain on N1SDP.
> On ThunderX2, slight performance uplift was also observed.
>
> Fixes: 570acdb1da8a ("net/mlx5: add vectorized Rx/Tx burst for ARM")
> Cc: stable@dpdk.org
>
> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> ---
> drivers/net/mlx5/mlx5_rxtx_vec_neon.h | 9 +--------
> 1 file changed, 1 insertion(+), 8 deletions(-)
>
> diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> index 2234fbe6b2..98a75b09c6 100644
> --- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> +++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> @@ -768,18 +768,11 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq,
> volatile struct mlx5_cqe *cq,
> comp_mask), 0)) /
> (sizeof(uint16_t) * 8);
> /* D.6 mask out entries after the compressed CQE. */
> - mask = vcreate_u16(comp_idx <
> MLX5_VPMD_DESCS_PER_LOOP ?
> - -1UL >> (comp_idx * sizeof(uint16_t) * 8) :
> - 0);
> - invalid_mask = vorr_u16(invalid_mask, mask);
> + invalid_mask = vorr_u16(invalid_mask, comp_mask);
Mmmm... I'm not sure we can drop the masking compressed (and following) CQE skip.
Let's consider the completion scenario (the series of 4 CQEs, each element is 64B long)
0: normal uncompressed CQE, ownership OK, format uncompressed, opcode OK, no error
1: compressed CQE, ownership OK, format compressed, opcode OK, no error
2: miniCQE array, format can be any!!, may be discovered as ownership OK, format uncompressed, opcode OK, no error
3: miniCQE array, format can be any!!, may be discovered as ownership OK, format uncompressed, opcode OK, no error
Obviously, we should unconditionally mask out 2 and 3, regardless of recognized their formats/opcode/error/etc.
I think we can get the diff above and skip diff below:
> /* D.7 count non-compressed valid CQEs. */
> n = __builtin_clzl(vget_lane_u64(vreinterpret_u64_u16(
> invalid_mask), 0)) / (sizeof(uint16_t) * 8);
> nocmp_n += n;
> - /* D.2 get the final invalid mask. */
> - mask = vcreate_u16(n < MLX5_VPMD_DESCS_PER_LOOP ?
> - -1UL >> (n * sizeof(uint16_t) * 8) : 0);
> - invalid_mask = vorr_u16(invalid_mask, mask);
and get the correct final invalid_mask - all compressed and invalid CQEs and following ones will be masked out.
With best regards,
Slava
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [dpdk-dev] [PATCH 1/2] net/mlx5: remove redundant operations
2021-07-02 8:12 ` Slava Ovsiienko
@ 2021-07-02 10:30 ` Ruifeng Wang
2021-07-05 10:01 ` Slava Ovsiienko
0 siblings, 1 reply; 16+ messages in thread
From: Ruifeng Wang @ 2021-07-02 10:30 UTC (permalink / raw)
To: Slava Ovsiienko, Raslan Darawsheh, Matan Azrad, Shahaf Shuler
Cc: dev, jerinj, nd, Honnappa Nagarahalli, stable, nd
> -----Original Message-----
> From: Slava Ovsiienko <viacheslavo@nvidia.com>
> Sent: Friday, July 2, 2021 4:13 PM
> To: Ruifeng Wang <Ruifeng.Wang@arm.com>; Raslan Darawsheh
> <rasland@nvidia.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler
> <shahafs@nvidia.com>
> Cc: dev@dpdk.org; jerinj@marvell.com; nd <nd@arm.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org
> Subject: RE: [PATCH 1/2] net/mlx5: remove redundant operations
>
> Hi, Ruifeng
Hi, Slava
>
> > -----Original Message-----
> > From: Ruifeng Wang <ruifeng.wang@arm.com>
> > Sent: Tuesday, June 1, 2021 11:31
> > To: Raslan Darawsheh <rasland@nvidia.com>; Matan Azrad
> > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>; Slava
> > Ovsiienko <viacheslavo@nvidia.com>
> > Cc: dev@dpdk.org; jerinj@marvell.com; nd@arm.com;
> > honnappa.nagarahalli@arm.com; Ruifeng Wang <ruifeng.wang@arm.com>;
> > stable@dpdk.org
> > Subject: [PATCH 1/2] net/mlx5: remove redundant operations
> >
> > Some operations on mask are redundant and can be removed.
> > The change yielded 1.6% performance gain on N1SDP.
> > On ThunderX2, slight performance uplift was also observed.
> >
> > Fixes: 570acdb1da8a ("net/mlx5: add vectorized Rx/Tx burst for ARM")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > ---
> > drivers/net/mlx5/mlx5_rxtx_vec_neon.h | 9 +--------
> > 1 file changed, 1 insertion(+), 8 deletions(-)
> >
> > diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > index 2234fbe6b2..98a75b09c6 100644
> > --- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > +++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > @@ -768,18 +768,11 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq,
> > volatile struct mlx5_cqe *cq,
> > comp_mask), 0)) /
> > (sizeof(uint16_t) * 8);
> > /* D.6 mask out entries after the compressed CQE. */
> > - mask = vcreate_u16(comp_idx <
> > MLX5_VPMD_DESCS_PER_LOOP ?
> > - -1UL >> (comp_idx * sizeof(uint16_t) * 8) :
> > - 0);
> > - invalid_mask = vorr_u16(invalid_mask, mask);
> > + invalid_mask = vorr_u16(invalid_mask, comp_mask);
>
> Mmmm... I'm not sure we can drop the masking compressed (and following)
> CQE skip.
> Let's consider the completion scenario (the series of 4 CQEs, each element is
> 64B long)
>
> 0: normal uncompressed CQE, ownership OK, format uncompressed, opcode
> OK, no error
> 1: compressed CQE, ownership OK, format compressed, opcode OK, no error
> 2: miniCQE array, format can be any!!, may be discovered as ownership OK,
> format uncompressed, opcode OK, no error
> 3: miniCQE array, format can be any!!, may be discovered as ownership OK,
> format uncompressed, opcode OK, no error
Thanks for your review and explanation about CQE processing details.
I did the change based on the fact that some calculations doesn't change the data.
So some intermediate calculations were removed.
In the above diff section, result of 'mask' always equals to the nearest 'comp_mask' that above it.
So I just remoed 'mask' and use 'comp_mask' instead.
>
> Obviously, we should unconditionally mask out 2 and 3, regardless of
> recognized their formats/opcode/error/etc.
> I think we can get the diff above and skip diff below:
>
> > /* D.7 count non-compressed valid CQEs. */
> > n = __builtin_clzl(vget_lane_u64(vreinterpret_u64_u16(
> > invalid_mask), 0)) / (sizeof(uint16_t) * 8);
> > nocmp_n += n;
> > - /* D.2 get the final invalid mask. */
> > - mask = vcreate_u16(n < MLX5_VPMD_DESCS_PER_LOOP ?
> > - -1UL >> (n * sizeof(uint16_t) * 8) : 0);
> > - invalid_mask = vorr_u16(invalid_mask, mask);
>
> and get the correct final invalid_mask - all compressed and invalid CQEs and
> following ones will be masked out.
This diff section is similar to the previous one.
'mask' always equals to the nearest 'invalid_mask' that above it.
So entire line "invalid_mask = vorr_u16(invalid_mask, mask);" can be removed.
Code logic is not changed. But I'm not sure the code change impacts readability
or maintainability that you may concern.
Thanks.
>
> With best regards,
> Slava
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [dpdk-dev] [PATCH 1/2] net/mlx5: remove redundant operations
2021-07-02 10:30 ` Ruifeng Wang
@ 2021-07-05 10:01 ` Slava Ovsiienko
2021-07-07 8:00 ` Ruifeng Wang
0 siblings, 1 reply; 16+ messages in thread
From: Slava Ovsiienko @ 2021-07-05 10:01 UTC (permalink / raw)
To: Ruifeng Wang, Raslan Darawsheh, Matan Azrad, Shahaf Shuler
Cc: dev, jerinj, nd, Honnappa Nagarahalli, stable, nd
Hi, Ruifeng
The invalid_mask is used to set error flags and calculate the statistics.
So, all the CQEs the first one with error or invalid status should be masked out
(and the CQEs after that).
IMO, what we could improve (apply just the part of the patch below):
>>>>
index 2234fbe6b2..98a75b09c6 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
@@ -768,18 +768,11 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
comp_mask), 0)) /
(sizeof(uint16_t) * 8);
/* D.6 mask out entries after the compressed CQE. */
- mask = vcreate_u16(comp_idx < MLX5_VPMD_DESCS_PER_LOOP ?
- -1UL >> (comp_idx * sizeof(uint16_t) * 8) :
- 0);
- invalid_mask = vorr_u16(invalid_mask, mask);
+ invalid_mask = vorr_u16(invalid_mask, comp_mask);
/* D.7 count non-compressed valid CQEs. */
n = __builtin_clzl(vget_lane_u64(vreinterpret_u64_u16(
invalid_mask), 0)) / (sizeof(uint16_t) * 8);
nocmp_n += n;
<<<<
And that's it. The rest of the patch:
>>>>
- /* D.2 get the final invalid mask. */
- mask = vcreate_u16(n < MLX5_VPMD_DESCS_PER_LOOP ?
- -1UL >> (n * sizeof(uint16_t) * 8) : 0);
- invalid_mask = vorr_u16(invalid_mask, mask);
<<<<
Should not be applied, otherwise the following might be affected:
opcode = vbic_u16(opcode, invalid_mask);
...
opcode = vbic_u16(opcode, invalid_mask);
With best regards,
Slava
> -----Original Message-----
> From: Ruifeng Wang <Ruifeng.Wang@arm.com>
> Sent: Friday, July 2, 2021 13:30
> To: Slava Ovsiienko <viacheslavo@nvidia.com>; Raslan Darawsheh
> <rasland@nvidia.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler
> <shahafs@nvidia.com>
> Cc: dev@dpdk.org; jerinj@marvell.com; nd <nd@arm.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org; nd
> <nd@arm.com>
> Subject: RE: [PATCH 1/2] net/mlx5: remove redundant operations
>
> > -----Original Message-----
> > From: Slava Ovsiienko <viacheslavo@nvidia.com>
> > Sent: Friday, July 2, 2021 4:13 PM
> > To: Ruifeng Wang <Ruifeng.Wang@arm.com>; Raslan Darawsheh
> > <rasland@nvidia.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler
> > <shahafs@nvidia.com>
> > Cc: dev@dpdk.org; jerinj@marvell.com; nd <nd@arm.com>; Honnappa
> > Nagarahalli <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org
> > Subject: RE: [PATCH 1/2] net/mlx5: remove redundant operations
> >
> > Hi, Ruifeng
> Hi, Slava
>
> >
> > > -----Original Message-----
> > > From: Ruifeng Wang <ruifeng.wang@arm.com>
> > > Sent: Tuesday, June 1, 2021 11:31
> > > To: Raslan Darawsheh <rasland@nvidia.com>; Matan Azrad
> > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>; Slava
> > > Ovsiienko <viacheslavo@nvidia.com>
> > > Cc: dev@dpdk.org; jerinj@marvell.com; nd@arm.com;
> > > honnappa.nagarahalli@arm.com; Ruifeng Wang
> <ruifeng.wang@arm.com>;
> > > stable@dpdk.org
> > > Subject: [PATCH 1/2] net/mlx5: remove redundant operations
> > >
> > > Some operations on mask are redundant and can be removed.
> > > The change yielded 1.6% performance gain on N1SDP.
> > > On ThunderX2, slight performance uplift was also observed.
> > >
> > > Fixes: 570acdb1da8a ("net/mlx5: add vectorized Rx/Tx burst for ARM")
> > > Cc: stable@dpdk.org
> > >
> > > Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > ---
> > > drivers/net/mlx5/mlx5_rxtx_vec_neon.h | 9 +--------
> > > 1 file changed, 1 insertion(+), 8 deletions(-)
> > >
> > > diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > > b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > > index 2234fbe6b2..98a75b09c6 100644
> > > --- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > > +++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > > @@ -768,18 +768,11 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq,
> > > volatile struct mlx5_cqe *cq,
> > > comp_mask), 0)) /
> > > (sizeof(uint16_t) * 8);
> > > /* D.6 mask out entries after the compressed CQE. */
> > > - mask = vcreate_u16(comp_idx <
> > > MLX5_VPMD_DESCS_PER_LOOP ?
> > > - -1UL >> (comp_idx * sizeof(uint16_t) * 8) :
> > > - 0);
> > > - invalid_mask = vorr_u16(invalid_mask, mask);
> > > + invalid_mask = vorr_u16(invalid_mask, comp_mask);
> >
> > Mmmm... I'm not sure we can drop the masking compressed (and
> > following) CQE skip.
> > Let's consider the completion scenario (the series of 4 CQEs, each
> > element is 64B long)
> >
> > 0: normal uncompressed CQE, ownership OK, format uncompressed, opcode
> > OK, no error
> > 1: compressed CQE, ownership OK, format compressed, opcode OK, no
> > error
> > 2: miniCQE array, format can be any!!, may be discovered as ownership
> > OK, format uncompressed, opcode OK, no error
> > 3: miniCQE array, format can be any!!, may be discovered as ownership
> > OK, format uncompressed, opcode OK, no error
>
> Thanks for your review and explanation about CQE processing details.
> I did the change based on the fact that some calculations doesn't change the
> data.
> So some intermediate calculations were removed.
>
> In the above diff section, result of 'mask' always equals to the nearest
> 'comp_mask' that above it.
> So I just remoed 'mask' and use 'comp_mask' instead.
> >
> > Obviously, we should unconditionally mask out 2 and 3, regardless of
> > recognized their formats/opcode/error/etc.
> > I think we can get the diff above and skip diff below:
> >
> > > /* D.7 count non-compressed valid CQEs. */
> > > n = __builtin_clzl(vget_lane_u64(vreinterpret_u64_u16(
> > > invalid_mask), 0)) / (sizeof(uint16_t) * 8);
> > > nocmp_n += n;
> > > - /* D.2 get the final invalid mask. */
> > > - mask = vcreate_u16(n < MLX5_VPMD_DESCS_PER_LOOP ?
> > > - -1UL >> (n * sizeof(uint16_t) * 8) : 0);
> > > - invalid_mask = vorr_u16(invalid_mask, mask);
> >
> > and get the correct final invalid_mask - all compressed and invalid
> > CQEs and following ones will be masked out.
>
> This diff section is similar to the previous one.
> 'mask' always equals to the nearest 'invalid_mask' that above it.
> So entire line "invalid_mask = vorr_u16(invalid_mask, mask);" can be removed.
>
> Code logic is not changed. But I'm not sure the code change impacts readability
> or maintainability that you may concern.
>
> Thanks.
> >
> > With best regards,
> > Slava
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [dpdk-dev] [PATCH 1/2] net/mlx5: remove redundant operations
2021-07-05 10:01 ` Slava Ovsiienko
@ 2021-07-07 8:00 ` Ruifeng Wang
0 siblings, 0 replies; 16+ messages in thread
From: Ruifeng Wang @ 2021-07-07 8:00 UTC (permalink / raw)
To: Slava Ovsiienko, Raslan Darawsheh, Matan Azrad, Shahaf Shuler
Cc: dev, jerinj, nd, Honnappa Nagarahalli, stable, nd, nd
> -----Original Message-----
> From: Slava Ovsiienko <viacheslavo@nvidia.com>
> Sent: Monday, July 5, 2021 6:02 PM
> To: Ruifeng Wang <Ruifeng.Wang@arm.com>; Raslan Darawsheh
> <rasland@nvidia.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler
> <shahafs@nvidia.com>
> Cc: dev@dpdk.org; jerinj@marvell.com; nd <nd@arm.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org; nd
> <nd@arm.com>
> Subject: RE: [PATCH 1/2] net/mlx5: remove redundant operations
>
> Hi, Ruifeng
>
> The invalid_mask is used to set error flags and calculate the statistics.
> So, all the CQEs the first one with error or invalid status should be masked
> out (and the CQEs after that).
Now I understand it. What I was missing is inconsecutive mask bits.
Thanks for your patience.
I'll update in next version.
>
> IMO, what we could improve (apply just the part of the patch below):
> >>>>
> index 2234fbe6b2..98a75b09c6 100644
> --- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> +++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> @@ -768,18 +768,11 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq,
> volatile struct mlx5_cqe *cq,
> comp_mask), 0)) /
> (sizeof(uint16_t) * 8);
> /* D.6 mask out entries after the compressed CQE. */
> - mask = vcreate_u16(comp_idx <
> MLX5_VPMD_DESCS_PER_LOOP ?
> - -1UL >> (comp_idx * sizeof(uint16_t) * 8) :
> - 0);
> - invalid_mask = vorr_u16(invalid_mask, mask);
> + invalid_mask = vorr_u16(invalid_mask, comp_mask);
> /* D.7 count non-compressed valid CQEs. */
> n = __builtin_clzl(vget_lane_u64(vreinterpret_u64_u16(
> invalid_mask), 0)) / (sizeof(uint16_t) * 8);
> nocmp_n += n;
> <<<<
>
> And that's it. The rest of the patch:
> >>>>
> - /* D.2 get the final invalid mask. */
> - mask = vcreate_u16(n < MLX5_VPMD_DESCS_PER_LOOP ?
> - -1UL >> (n * sizeof(uint16_t) * 8) : 0);
> - invalid_mask = vorr_u16(invalid_mask, mask);
> <<<<
> Should not be applied, otherwise the following might be affected:
>
> opcode = vbic_u16(opcode, invalid_mask); ...
> opcode = vbic_u16(opcode, invalid_mask);
>
> With best regards,
> Slava
>
> > -----Original Message-----
> > From: Ruifeng Wang <Ruifeng.Wang@arm.com>
> > Sent: Friday, July 2, 2021 13:30
> > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Raslan Darawsheh
> > <rasland@nvidia.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler
> > <shahafs@nvidia.com>
> > Cc: dev@dpdk.org; jerinj@marvell.com; nd <nd@arm.com>; Honnappa
> > Nagarahalli <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org; nd
> > <nd@arm.com>
> > Subject: RE: [PATCH 1/2] net/mlx5: remove redundant operations
> >
> > > -----Original Message-----
> > > From: Slava Ovsiienko <viacheslavo@nvidia.com>
> > > Sent: Friday, July 2, 2021 4:13 PM
> > > To: Ruifeng Wang <Ruifeng.Wang@arm.com>; Raslan Darawsheh
> > > <rasland@nvidia.com>; Matan Azrad <matan@nvidia.com>; Shahaf
> Shuler
> > > <shahafs@nvidia.com>
> > > Cc: dev@dpdk.org; jerinj@marvell.com; nd <nd@arm.com>; Honnappa
> > > Nagarahalli <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org
> > > Subject: RE: [PATCH 1/2] net/mlx5: remove redundant operations
> > >
> > > Hi, Ruifeng
> > Hi, Slava
> >
> > >
> > > > -----Original Message-----
> > > > From: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > Sent: Tuesday, June 1, 2021 11:31
> > > > To: Raslan Darawsheh <rasland@nvidia.com>; Matan Azrad
> > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>; Slava
> > > > Ovsiienko <viacheslavo@nvidia.com>
> > > > Cc: dev@dpdk.org; jerinj@marvell.com; nd@arm.com;
> > > > honnappa.nagarahalli@arm.com; Ruifeng Wang
> > <ruifeng.wang@arm.com>;
> > > > stable@dpdk.org
> > > > Subject: [PATCH 1/2] net/mlx5: remove redundant operations
> > > >
> > > > Some operations on mask are redundant and can be removed.
> > > > The change yielded 1.6% performance gain on N1SDP.
> > > > On ThunderX2, slight performance uplift was also observed.
> > > >
> > > > Fixes: 570acdb1da8a ("net/mlx5: add vectorized Rx/Tx burst for
> > > > ARM")
> > > > Cc: stable@dpdk.org
> > > >
> > > > Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > ---
> > > > drivers/net/mlx5/mlx5_rxtx_vec_neon.h | 9 +--------
> > > > 1 file changed, 1 insertion(+), 8 deletions(-)
> > > >
> > > > diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > > > b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > > > index 2234fbe6b2..98a75b09c6 100644
> > > > --- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > > > +++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > > > @@ -768,18 +768,11 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq,
> > > > volatile struct mlx5_cqe *cq,
> > > > comp_mask), 0)) /
> > > > (sizeof(uint16_t) * 8);
> > > > /* D.6 mask out entries after the compressed CQE. */
> > > > - mask = vcreate_u16(comp_idx <
> > > > MLX5_VPMD_DESCS_PER_LOOP ?
> > > > - -1UL >> (comp_idx * sizeof(uint16_t) * 8) :
> > > > - 0);
> > > > - invalid_mask = vorr_u16(invalid_mask, mask);
> > > > + invalid_mask = vorr_u16(invalid_mask, comp_mask);
> > >
> > > Mmmm... I'm not sure we can drop the masking compressed (and
> > > following) CQE skip.
> > > Let's consider the completion scenario (the series of 4 CQEs, each
> > > element is 64B long)
> > >
> > > 0: normal uncompressed CQE, ownership OK, format uncompressed,
> > > opcode OK, no error
> > > 1: compressed CQE, ownership OK, format compressed, opcode OK, no
> > > error
> > > 2: miniCQE array, format can be any!!, may be discovered as
> > > ownership OK, format uncompressed, opcode OK, no error
> > > 3: miniCQE array, format can be any!!, may be discovered as
> > > ownership OK, format uncompressed, opcode OK, no error
> >
> > Thanks for your review and explanation about CQE processing details.
> > I did the change based on the fact that some calculations doesn't
> > change the data.
> > So some intermediate calculations were removed.
> >
> > In the above diff section, result of 'mask' always equals to the
> > nearest 'comp_mask' that above it.
> > So I just remoed 'mask' and use 'comp_mask' instead.
> > >
> > > Obviously, we should unconditionally mask out 2 and 3, regardless of
> > > recognized their formats/opcode/error/etc.
> > > I think we can get the diff above and skip diff below:
> > >
> > > > /* D.7 count non-compressed valid CQEs. */
> > > > n = __builtin_clzl(vget_lane_u64(vreinterpret_u64_u16(
> > > > invalid_mask), 0)) / (sizeof(uint16_t) * 8);
> > > > nocmp_n += n;
> > > > - /* D.2 get the final invalid mask. */
> > > > - mask = vcreate_u16(n < MLX5_VPMD_DESCS_PER_LOOP ?
> > > > - -1UL >> (n * sizeof(uint16_t) * 8) : 0);
> > > > - invalid_mask = vorr_u16(invalid_mask, mask);
> > >
> > > and get the correct final invalid_mask - all compressed and invalid
> > > CQEs and following ones will be masked out.
> >
> > This diff section is similar to the previous one.
> > 'mask' always equals to the nearest 'invalid_mask' that above it.
> > So entire line "invalid_mask = vorr_u16(invalid_mask, mask);" can be
> removed.
> >
> > Code logic is not changed. But I'm not sure the code change impacts
> > readability or maintainability that you may concern.
> >
> > Thanks.
> > >
> > > With best regards,
> > > Slava
^ permalink raw reply [flat|nested] 16+ messages in thread
* [dpdk-dev] [PATCH v2 0/2] MLX5 PMD tuning
2021-06-01 8:30 [dpdk-dev] [PATCH 0/2] MLX5 PMD tuning Ruifeng Wang
` (2 preceding siblings ...)
2021-06-30 7:22 ` [dpdk-dev] [PATCH 0/2] MLX5 PMD tuning Ruifeng Wang
@ 2021-07-07 9:03 ` Ruifeng Wang
2021-07-07 9:03 ` [dpdk-dev] [PATCH v2 1/2] net/mlx5: remove redundant operations Ruifeng Wang
` (2 more replies)
3 siblings, 3 replies; 16+ messages in thread
From: Ruifeng Wang @ 2021-07-07 9:03 UTC (permalink / raw)
To: rasland, matan, shahafs, viacheslavo
Cc: dev, jerinj, nd, honnappa.nagarahalli, Ruifeng Wang
This series include optimizations for MLX5 PMD.
In tests on Arm N1SDP with MLX5 40G NIC, changes
showed performance gain.
Ruifeng Wang (2):
net/mlx5: remove redundant operations
net/mlx5: reduce unnecessary memory access
drivers/net/mlx5/mlx5_rxtx_vec.c | 35 +++++++++++++++------------
drivers/net/mlx5/mlx5_rxtx_vec_neon.h | 11 ++++-----
2 files changed, 25 insertions(+), 21 deletions(-)
--
2.25.1
^ permalink raw reply [flat|nested] 16+ messages in thread
* [dpdk-dev] [PATCH v2 1/2] net/mlx5: remove redundant operations
2021-07-07 9:03 ` [dpdk-dev] [PATCH v2 " Ruifeng Wang
@ 2021-07-07 9:03 ` Ruifeng Wang
2021-07-12 15:31 ` Slava Ovsiienko
2021-07-07 9:03 ` [dpdk-dev] [PATCH v2 2/2] net/mlx5: reduce unnecessary memory access Ruifeng Wang
2021-07-13 9:32 ` [dpdk-dev] [PATCH v2 0/2] MLX5 PMD tuning Raslan Darawsheh
2 siblings, 1 reply; 16+ messages in thread
From: Ruifeng Wang @ 2021-07-07 9:03 UTC (permalink / raw)
To: rasland, matan, shahafs, viacheslavo
Cc: dev, jerinj, nd, honnappa.nagarahalli, Ruifeng Wang, stable
Mask of entries after the compressed CQE is covered by invalid mask of
non-compressed valid CQEs. Hence remove redundant calculation on mask.
The change showed slight performance uplift on N1SDP.
Fixes: 570acdb1da8a ("net/mlx5: add vectorized Rx/Tx burst for ARM")
Cc: stable@dpdk.org
Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
drivers/net/mlx5/mlx5_rxtx_vec_neon.h | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
index 2234fbe6b2..ce50a3ccc4 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
@@ -767,16 +767,15 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
comp_idx = __builtin_clzl(vget_lane_u64(vreinterpret_u64_u16(
comp_mask), 0)) /
(sizeof(uint16_t) * 8);
- /* D.6 mask out entries after the compressed CQE. */
- mask = vcreate_u16(comp_idx < MLX5_VPMD_DESCS_PER_LOOP ?
- -1UL >> (comp_idx * sizeof(uint16_t) * 8) :
- 0);
- invalid_mask = vorr_u16(invalid_mask, mask);
+ invalid_mask = vorr_u16(invalid_mask, comp_mask);
/* D.7 count non-compressed valid CQEs. */
n = __builtin_clzl(vget_lane_u64(vreinterpret_u64_u16(
invalid_mask), 0)) / (sizeof(uint16_t) * 8);
nocmp_n += n;
- /* D.2 get the final invalid mask. */
+ /*
+ * D.2 mask out entries after the compressed CQE.
+ * get the final invalid mask.
+ */
mask = vcreate_u16(n < MLX5_VPMD_DESCS_PER_LOOP ?
-1UL >> (n * sizeof(uint16_t) * 8) : 0);
invalid_mask = vorr_u16(invalid_mask, mask);
--
2.25.1
^ permalink raw reply [flat|nested] 16+ messages in thread
* [dpdk-dev] [PATCH v2 2/2] net/mlx5: reduce unnecessary memory access
2021-07-07 9:03 ` [dpdk-dev] [PATCH v2 " Ruifeng Wang
2021-07-07 9:03 ` [dpdk-dev] [PATCH v2 1/2] net/mlx5: remove redundant operations Ruifeng Wang
@ 2021-07-07 9:03 ` Ruifeng Wang
2021-07-12 15:33 ` Slava Ovsiienko
2021-07-13 9:32 ` [dpdk-dev] [PATCH v2 0/2] MLX5 PMD tuning Raslan Darawsheh
2 siblings, 1 reply; 16+ messages in thread
From: Ruifeng Wang @ 2021-07-07 9:03 UTC (permalink / raw)
To: rasland, matan, shahafs, viacheslavo
Cc: dev, jerinj, nd, honnappa.nagarahalli, Ruifeng Wang
MR btree len is a constant during Rx replenish.
Moved retrieve of the value out of loop to reduce data loads.
Slight performance uplift was measured on both N1SDP and x86.
Suggested-by: Slava Ovsiienko <viacheslavo@nvidia.com>
Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
drivers/net/mlx5/mlx5_rxtx_vec.c | 35 ++++++++++++++++++--------------
1 file changed, 20 insertions(+), 15 deletions(-)
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.c b/drivers/net/mlx5/mlx5_rxtx_vec.c
index d5af2d91ff..e64ef70181 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec.c
+++ b/drivers/net/mlx5/mlx5_rxtx_vec.c
@@ -106,22 +106,27 @@ mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data *rxq)
rxq->stats.rx_nombuf += n;
return;
}
- for (i = 0; i < n; ++i) {
- void *buf_addr;
-
- /*
- * In order to support the mbufs with external attached
- * data buffer we should use the buf_addr pointer
- * instead of rte_mbuf_buf_addr(). It touches the mbuf
- * itself and may impact the performance.
- */
- buf_addr = elts[i]->buf_addr;
- wq[i].addr = rte_cpu_to_be_64((uintptr_t)buf_addr +
- RTE_PKTMBUF_HEADROOM);
- /* If there's a single MR, no need to replace LKey. */
- if (unlikely(mlx5_mr_btree_len(&rxq->mr_ctrl.cache_bh)
- > 1))
+ if (unlikely(mlx5_mr_btree_len(&rxq->mr_ctrl.cache_bh) > 1)) {
+ for (i = 0; i < n; ++i) {
+ /*
+ * In order to support the mbufs with external attached
+ * data buffer we should use the buf_addr pointer
+ * instead of rte_mbuf_buf_addr(). It touches the mbuf
+ * itself and may impact the performance.
+ */
+ void *buf_addr = elts[i]->buf_addr;
+
+ wq[i].addr = rte_cpu_to_be_64((uintptr_t)buf_addr +
+ RTE_PKTMBUF_HEADROOM);
wq[i].lkey = mlx5_rx_mb2mr(rxq, elts[i]);
+ }
+ } else {
+ for (i = 0; i < n; ++i) {
+ void *buf_addr = elts[i]->buf_addr;
+
+ wq[i].addr = rte_cpu_to_be_64((uintptr_t)buf_addr +
+ RTE_PKTMBUF_HEADROOM);
+ }
}
rxq->rq_ci += n;
/* Prevent overflowing into consumed mbufs. */
--
2.25.1
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [dpdk-dev] [PATCH v2 1/2] net/mlx5: remove redundant operations
2021-07-07 9:03 ` [dpdk-dev] [PATCH v2 1/2] net/mlx5: remove redundant operations Ruifeng Wang
@ 2021-07-12 15:31 ` Slava Ovsiienko
0 siblings, 0 replies; 16+ messages in thread
From: Slava Ovsiienko @ 2021-07-12 15:31 UTC (permalink / raw)
To: Ruifeng Wang, Raslan Darawsheh, Matan Azrad, Shahaf Shuler
Cc: dev, jerinj, nd, honnappa.nagarahalli, stable
> -----Original Message-----
> From: Ruifeng Wang <ruifeng.wang@arm.com>
> Sent: Wednesday, July 7, 2021 12:03
> To: Raslan Darawsheh <rasland@nvidia.com>; Matan Azrad
> <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>; Slava
> Ovsiienko <viacheslavo@nvidia.com>
> Cc: dev@dpdk.org; jerinj@marvell.com; nd@arm.com;
> honnappa.nagarahalli@arm.com; Ruifeng Wang <ruifeng.wang@arm.com>;
> stable@dpdk.org
> Subject: [PATCH v2 1/2] net/mlx5: remove redundant operations
>
> Mask of entries after the compressed CQE is covered by invalid mask of non-
> compressed valid CQEs. Hence remove redundant calculation on mask.
> The change showed slight performance uplift on N1SDP.
>
> Fixes: 570acdb1da8a ("net/mlx5: add vectorized Rx/Tx burst for ARM")
> Cc: stable@dpdk.org
>
> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
Thank you for the patch update,
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [dpdk-dev] [PATCH v2 2/2] net/mlx5: reduce unnecessary memory access
2021-07-07 9:03 ` [dpdk-dev] [PATCH v2 2/2] net/mlx5: reduce unnecessary memory access Ruifeng Wang
@ 2021-07-12 15:33 ` Slava Ovsiienko
0 siblings, 0 replies; 16+ messages in thread
From: Slava Ovsiienko @ 2021-07-12 15:33 UTC (permalink / raw)
To: Ruifeng Wang, Raslan Darawsheh, Matan Azrad, Shahaf Shuler
Cc: dev, jerinj, nd, honnappa.nagarahalli
> -----Original Message-----
> From: Ruifeng Wang <ruifeng.wang@arm.com>
> Sent: Wednesday, July 7, 2021 12:03
> To: Raslan Darawsheh <rasland@nvidia.com>; Matan Azrad
> <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>; Slava
> Ovsiienko <viacheslavo@nvidia.com>
> Cc: dev@dpdk.org; jerinj@marvell.com; nd@arm.com;
> honnappa.nagarahalli@arm.com; Ruifeng Wang <ruifeng.wang@arm.com>
> Subject: [PATCH v2 2/2] net/mlx5: reduce unnecessary memory access
>
> MR btree len is a constant during Rx replenish.
> Moved retrieve of the value out of loop to reduce data loads.
> Slight performance uplift was measured on both N1SDP and x86.
>
> Suggested-by: Slava Ovsiienko <viacheslavo@nvidia.com>
> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
Thank you for the update,
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/2] MLX5 PMD tuning
2021-07-07 9:03 ` [dpdk-dev] [PATCH v2 " Ruifeng Wang
2021-07-07 9:03 ` [dpdk-dev] [PATCH v2 1/2] net/mlx5: remove redundant operations Ruifeng Wang
2021-07-07 9:03 ` [dpdk-dev] [PATCH v2 2/2] net/mlx5: reduce unnecessary memory access Ruifeng Wang
@ 2021-07-13 9:32 ` Raslan Darawsheh
2 siblings, 0 replies; 16+ messages in thread
From: Raslan Darawsheh @ 2021-07-13 9:32 UTC (permalink / raw)
To: Ruifeng Wang, Matan Azrad, Shahaf Shuler, Slava Ovsiienko
Cc: dev, jerinj, nd, honnappa.nagarahalli
Hi,
> -----Original Message-----
> From: Ruifeng Wang <ruifeng.wang@arm.com>
> Sent: Wednesday, July 7, 2021 12:03 PM
> To: Raslan Darawsheh <rasland@nvidia.com>; Matan Azrad
> <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>; Slava
> Ovsiienko <viacheslavo@nvidia.com>
> Cc: dev@dpdk.org; jerinj@marvell.com; nd@arm.com;
> honnappa.nagarahalli@arm.com; Ruifeng Wang <ruifeng.wang@arm.com>
> Subject: [PATCH v2 0/2] MLX5 PMD tuning
>
> This series include optimizations for MLX5 PMD.
> In tests on Arm N1SDP with MLX5 40G NIC, changes showed performance
> gain.
>
> Ruifeng Wang (2):
> net/mlx5: remove redundant operations
> net/mlx5: reduce unnecessary memory access
>
> drivers/net/mlx5/mlx5_rxtx_vec.c | 35 +++++++++++++++------------
> drivers/net/mlx5/mlx5_rxtx_vec_neon.h | 11 ++++-----
> 2 files changed, 25 insertions(+), 21 deletions(-)
>
> --
> 2.25.1
Series applied to next-net-mlx,
Kindest regards,
Raslan Darawsheh
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2021-07-13 9:32 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-01 8:30 [dpdk-dev] [PATCH 0/2] MLX5 PMD tuning Ruifeng Wang
2021-06-01 8:30 ` [dpdk-dev] [PATCH 1/2] net/mlx5: remove redundant operations Ruifeng Wang
2021-07-02 8:12 ` Slava Ovsiienko
2021-07-02 10:30 ` Ruifeng Wang
2021-07-05 10:01 ` Slava Ovsiienko
2021-07-07 8:00 ` Ruifeng Wang
2021-06-01 8:30 ` [dpdk-dev] [PATCH 2/2] net/mlx5: reduce unnecessary memory access Ruifeng Wang
2021-07-02 7:05 ` Slava Ovsiienko
2021-07-02 7:28 ` Ruifeng Wang
2021-06-30 7:22 ` [dpdk-dev] [PATCH 0/2] MLX5 PMD tuning Ruifeng Wang
2021-07-07 9:03 ` [dpdk-dev] [PATCH v2 " Ruifeng Wang
2021-07-07 9:03 ` [dpdk-dev] [PATCH v2 1/2] net/mlx5: remove redundant operations Ruifeng Wang
2021-07-12 15:31 ` Slava Ovsiienko
2021-07-07 9:03 ` [dpdk-dev] [PATCH v2 2/2] net/mlx5: reduce unnecessary memory access Ruifeng Wang
2021-07-12 15:33 ` Slava Ovsiienko
2021-07-13 9:32 ` [dpdk-dev] [PATCH v2 0/2] MLX5 PMD tuning Raslan Darawsheh
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).