DPDK patches and discussions
* [dpdk-dev] [PATCH] net/mlx5: improve vMPRQ descriptors allocation locality
@ 2020-11-06  0:04 Alexander Kozyrev
  2020-11-08  4:23 ` [dpdk-dev] [PATCH v2] " Alexander Kozyrev
  0 siblings, 1 reply; 4+ messages in thread
From: Alexander Kozyrev @ 2020-11-06  0:04 UTC (permalink / raw)
  To: dev; +Cc: rasland, viacheslavo, matan

There is a performance penalty for the replenish scheme
used in vectorized Rx bursts for both MPRQ and SPRQ.
Mbuf elements are filled at the end of the mbufs array
while replenishment happens at the beginning. That leads
to an increase in cache misses and a performance drop.
The more Rx descriptors are used, the worse the situation.

Change the allocation scheme for the vectorized MPRQ Rx
burst: allocate new mbufs only when the consumed mbufs are
almost depleted (always keep a one-burst gap between the
allocated and consumed indices). Keeping a small number of
mbufs allocated improves cache locality and significantly
improves performance.

Unfortunately, this approach cannot be applied to the SPRQ
Rx burst routine. In the MPRQ Rx burst we simply copy
packets from external MPRQ buffers or attach these buffers
to mbufs. In the SPRQ Rx burst we let the NIC fill the
mbufs for us. Hence, keeping only a small number of
allocated mbufs would limit the NIC's ability to fill as
many buffers as possible, which offsets the advantage of
better cache locality.

Fixes: 0f20acbf5e ("net/mlx5: implement vectorized MPRQ burst")

Signed-off-by: Alexander Kozyrev <akozyrev@nvidia.com>
---
 drivers/net/mlx5/mlx5_rxtx_vec.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.c b/drivers/net/mlx5/mlx5_rxtx_vec.c
index 469ea8401d..8001ab6eb3 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec.c
+++ b/drivers/net/mlx5/mlx5_rxtx_vec.c
@@ -145,16 +145,16 @@ mlx5_rx_mprq_replenish_bulk_mbuf(struct mlx5_rxq_data *rxq)
 	const uint32_t strd_n = 1 << rxq->strd_num_n;
 	const uint32_t elts_n = wqe_n * strd_n;
 	const uint32_t wqe_mask = elts_n - 1;
-	uint32_t n = elts_n - (rxq->elts_ci - rxq->rq_pi);
+	uint32_t n = rxq->elts_ci - rxq->rq_pi;
 	uint32_t elts_idx = rxq->elts_ci & wqe_mask;
 	struct rte_mbuf **elts = &(*rxq->elts)[elts_idx];
 
-	/* Not to cross queue end. */
-	if (n >= rxq->rq_repl_thresh) {
+	if (n <= rxq->rq_repl_thresh) {
 		MLX5_ASSERT(n >= MLX5_VPMD_RXQ_RPLNSH_THRESH(elts_n));
 		MLX5_ASSERT(MLX5_VPMD_RXQ_RPLNSH_THRESH(elts_n) >
 			     MLX5_VPMD_DESCS_PER_LOOP);
-		n = RTE_MIN(n, elts_n - elts_idx);
+		/* Not to cross queue end. */
+		n = RTE_MIN(n + MLX5_VPMD_RX_MAX_BURST, elts_n - elts_idx);
 		if (rte_mempool_get_bulk(rxq->mp, (void *)elts, n) < 0) {
 			rxq->stats.rx_nombuf += n;
 			return;
-- 
2.24.1
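The index arithmetic behind this patch can be modeled outside the driver. The sketch below is illustrative, not driver code: the helper names are hypothetical, and only the unsigned wrap-around subtraction, the threshold test, and the ring-end clamp mirror the diff above.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the replenish decision in mlx5_rx_mprq_replenish_bulk_mbuf().
 * elts_ci: allocated-elements index, free-running uint32_t.
 * rq_pi:   consumed-packets index, free-running uint32_t.
 * Unsigned wrap-around makes elts_ci - rq_pi the count of allocated
 * but not-yet-consumed mbufs, valid even across a 2^32 wrap. */
static uint32_t allocated_count(uint32_t elts_ci, uint32_t rq_pi)
{
	return elts_ci - rq_pi;
}

/* New scheme: replenish only when the allocated count has dropped to
 * the threshold, keeping freshly allocated mbufs close (in the array)
 * to the ones just consumed, for better cache locality. */
static int should_replenish(uint32_t elts_ci, uint32_t rq_pi,
			    uint32_t repl_thresh)
{
	return allocated_count(elts_ci, rq_pi) <= repl_thresh;
}

/* Number of mbufs to allocate in one bulk: extend the current count by
 * one burst, clamped so the bulk does not cross the end of the elements
 * ring (elts_n must be a power of two). */
static uint32_t replenish_count(uint32_t elts_ci, uint32_t rq_pi,
				uint32_t max_burst, uint32_t elts_n)
{
	uint32_t n = allocated_count(elts_ci, rq_pi) + max_burst;
	uint32_t elts_idx = elts_ci & (elts_n - 1);
	uint32_t room = elts_n - elts_idx;

	return n < room ? n : room;
}
```

Because the indices are free-running `uint32_t` values, their difference stays correct across wrap-around, and clamping to `elts_n - elts_idx` keeps a single `rte_mempool_get_bulk()` call from running past the end of the elements array.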



* [dpdk-dev] [PATCH v2] net/mlx5: improve vMPRQ descriptors allocation locality
  2020-11-06  0:04 [dpdk-dev] [PATCH] net/mlx5: improve vMPRQ descriptors allocation locality Alexander Kozyrev
@ 2020-11-08  4:23 ` Alexander Kozyrev
  2020-11-10 16:30   ` Slava Ovsiienko
  0 siblings, 1 reply; 4+ messages in thread
From: Alexander Kozyrev @ 2020-11-08  4:23 UTC (permalink / raw)
  To: dev; +Cc: rasland, matan, viacheslavo

There is a performance penalty for the replenish scheme
used in vectorized Rx bursts for both MPRQ and SPRQ.
Mbuf elements are filled at the end of the mbufs array
while replenishment happens at the beginning. That leads
to an increase in cache misses and a performance drop.
The more Rx descriptors are used, the worse the situation.

Change the allocation scheme for the vectorized MPRQ Rx
burst: allocate new mbufs only when the consumed mbufs are
almost depleted (always keep a one-burst gap between the
allocated and consumed indices). Keeping a small number of
mbufs allocated improves cache locality and significantly
improves performance.

Unfortunately, this approach cannot be applied to the SPRQ
Rx burst routine. In the MPRQ Rx burst we simply copy
packets from external MPRQ buffers or attach these buffers
to mbufs. In the SPRQ Rx burst we let the NIC fill the
mbufs for us. Hence, keeping only a small number of
allocated mbufs would limit the NIC's ability to fill as
many buffers as possible, which offsets the advantage of
better cache locality.

Fixes: 0f20acbf5e ("net/mlx5: implement vectorized MPRQ burst")

Signed-off-by: Alexander Kozyrev <akozyrev@nvidia.com>
---
v1: https://patchwork.dpdk.org/patch/83779/
v2: fixed assertion for the number of mbufs to replenish

 drivers/net/mlx5/mlx5_rxtx_vec.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.c b/drivers/net/mlx5/mlx5_rxtx_vec.c
index 469ea8401d..68c51dce31 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec.c
+++ b/drivers/net/mlx5/mlx5_rxtx_vec.c
@@ -145,16 +145,17 @@ mlx5_rx_mprq_replenish_bulk_mbuf(struct mlx5_rxq_data *rxq)
 	const uint32_t strd_n = 1 << rxq->strd_num_n;
 	const uint32_t elts_n = wqe_n * strd_n;
 	const uint32_t wqe_mask = elts_n - 1;
-	uint32_t n = elts_n - (rxq->elts_ci - rxq->rq_pi);
+	uint32_t n = rxq->elts_ci - rxq->rq_pi;
 	uint32_t elts_idx = rxq->elts_ci & wqe_mask;
 	struct rte_mbuf **elts = &(*rxq->elts)[elts_idx];
 
-	/* Not to cross queue end. */
-	if (n >= rxq->rq_repl_thresh) {
-		MLX5_ASSERT(n >= MLX5_VPMD_RXQ_RPLNSH_THRESH(elts_n));
+	if (n <= rxq->rq_repl_thresh) {
+		MLX5_ASSERT(n + MLX5_VPMD_RX_MAX_BURST >=
+			    MLX5_VPMD_RXQ_RPLNSH_THRESH(elts_n));
 		MLX5_ASSERT(MLX5_VPMD_RXQ_RPLNSH_THRESH(elts_n) >
 			     MLX5_VPMD_DESCS_PER_LOOP);
-		n = RTE_MIN(n, elts_n - elts_idx);
+		/* Not to cross queue end. */
+		n = RTE_MIN(n + MLX5_VPMD_RX_MAX_BURST, elts_n - elts_idx);
 		if (rte_mempool_get_bulk(rxq->mp, (void *)elts, n) < 0) {
 			rxq->stats.rx_nombuf += n;
 			return;
-- 
2.24.1
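The only change from v1 is the assertion fix noted in the changelog: under the new entry condition `n <= rq_repl_thresh`, the old `MLX5_ASSERT(n >= MLX5_VPMD_RXQ_RPLNSH_THRESH(elts_n))` could fire for small `n`, so v2 asserts on the count after the one-burst extension instead. A minimal model of the two assertion forms (function names are illustrative; the thresholds are passed as plain parameters rather than the driver's macros):

```c
#include <assert.h>
#include <stdint.h>

/* v1 asserted on the raw count of allocated-but-unconsumed mbufs,
 * which the new entry condition allows to be arbitrarily small. */
static int v1_check(uint32_t n, uint32_t rplnsh_thresh)
{
	return n >= rplnsh_thresh;
}

/* v2 asserts on the count after adding the one-burst extension that
 * the replenish routine is about to allocate. */
static int v2_check(uint32_t n, uint32_t max_burst, uint32_t rplnsh_thresh)
{
	return n + max_burst >= rplnsh_thresh;
}
```

Assuming the replenish threshold does not exceed the burst size, the v2 form holds for any `n`, including `n == 0`, while the v1 form fails as soon as fewer mbufs than the threshold remain allocated.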



* Re: [dpdk-dev] [PATCH v2] net/mlx5: improve vMPRQ descriptors allocation locality
  2020-11-08  4:23 ` [dpdk-dev] [PATCH v2] " Alexander Kozyrev
@ 2020-11-10 16:30   ` Slava Ovsiienko
  2020-11-13 17:58     ` Thomas Monjalon
  0 siblings, 1 reply; 4+ messages in thread
From: Slava Ovsiienko @ 2020-11-10 16:30 UTC (permalink / raw)
  To: Alexander Kozyrev, dev; +Cc: Raslan Darawsheh, Matan Azrad

> -----Original Message-----
> From: Alexander Kozyrev <akozyrev@nvidia.com>
> Sent: Sunday, November 8, 2020 6:24
> To: dev@dpdk.org
> Cc: Raslan Darawsheh <rasland@nvidia.com>; Matan Azrad
> <matan@nvidia.com>; Slava Ovsiienko <viacheslavo@nvidia.com>
> Subject: [PATCH v2] net/mlx5: improve vMPRQ descriptors allocation locality
> 
> There is a performance penalty for the replenish scheme used in vectorized
> Rx bursts for both MPRQ and SPRQ.
> Mbuf elements are filled at the end of the mbufs array while replenishment
> happens at the beginning. That leads to an increase in cache misses and a
> performance drop.
> The more Rx descriptors are used, the worse the situation.
> 
> Change the allocation scheme for the vectorized MPRQ Rx burst:
> allocate new mbufs only when the consumed mbufs are almost depleted (always
> keep a one-burst gap between the allocated and consumed indices). Keeping a
> small number of mbufs allocated improves cache locality and significantly
> improves performance.
> 
> Unfortunately, this approach cannot be applied to the SPRQ Rx burst routine.
> In the MPRQ Rx burst we simply copy packets from external MPRQ buffers or
> attach these buffers to mbufs.
> In the SPRQ Rx burst we let the NIC fill the mbufs for us.
> Hence, keeping only a small number of allocated mbufs would limit the NIC's
> ability to fill as many buffers as possible, which offsets the advantage of
> better cache locality.
> 
> Fixes: 0f20acbf5e ("net/mlx5: implement vectorized MPRQ burst")
> 
> Signed-off-by: Alexander Kozyrev <akozyrev@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>



* Re: [dpdk-dev] [PATCH v2] net/mlx5: improve vMPRQ descriptors allocation locality
  2020-11-10 16:30   ` Slava Ovsiienko
@ 2020-11-13 17:58     ` Thomas Monjalon
  0 siblings, 0 replies; 4+ messages in thread
From: Thomas Monjalon @ 2020-11-13 17:58 UTC (permalink / raw)
  To: Alexander Kozyrev
  Cc: dev, Raslan Darawsheh, Matan Azrad, Slava Ovsiienko, asafp

> > There is a performance penalty for the replenish scheme used in vectorized
> > Rx bursts for both MPRQ and SPRQ.
> > Mbuf elements are filled at the end of the mbufs array while replenishment
> > happens at the beginning. That leads to an increase in cache misses and a
> > performance drop.
> > The more Rx descriptors are used, the worse the situation.
> > 
> > Change the allocation scheme for the vectorized MPRQ Rx burst:
> > allocate new mbufs only when the consumed mbufs are almost depleted (always
> > keep a one-burst gap between the allocated and consumed indices). Keeping a
> > small number of mbufs allocated improves cache locality and significantly
> > improves performance.
> > 
> > Unfortunately, this approach cannot be applied to the SPRQ Rx burst routine.
> > In the MPRQ Rx burst we simply copy packets from external MPRQ buffers or
> > attach these buffers to mbufs.
> > In the SPRQ Rx burst we let the NIC fill the mbufs for us.
> > Hence, keeping only a small number of allocated mbufs would limit the NIC's
> > ability to fill as many buffers as possible, which offsets the advantage of
> > better cache locality.
> > 
> > Fixes: 0f20acbf5e ("net/mlx5: implement vectorized MPRQ burst")
> > 
> > Signed-off-by: Alexander Kozyrev <akozyrev@nvidia.com>
> Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>

Applied in next-net-mlx, thanks.




