DPDK patches and discussions
From: Bruce Richardson <bruce.richardson@intel.com>
To: Shaiq Wani <shaiq.wani@intel.com>
Cc: <dev@dpdk.org>, <aman.deep.singh@intel.com>
Subject: Re: [PATCH v6 2/2] net/idpf: enable AVX2 for split queue Tx
Date: Thu, 9 Oct 2025 15:25:46 +0100
Message-ID: <aOfF6lYDICH2GsYE@bricha3-mobl1.ger.corp.intel.com>
In-Reply-To: <20251003094950.2818019-3-shaiq.wani@intel.com>

On Fri, Oct 03, 2025 at 03:19:50PM +0530, Shaiq Wani wrote:
> Some CPUs don't support AVX512; enable AVX2 on those to get better
> per-core performance.
> 
> In the single queue model, the same descriptor queue is used by SW to
> post descriptors to the device and by the device to report completed
> descriptors back to SW, whereas the split queue model separates these
> into different queues for parallel processing and improved performance.
> 
> Signed-off-by: Shaiq Wani <shaiq.wani@intel.com>

Hi,

Review comments inline below. [Note: I reviewed from the bottom up, since
that tends to be the way this code flows, so earlier comments may only make
sense in light of later comments further down!]

/Bruce

> ---
>  drivers/net/intel/idpf/idpf_common_rxtx.h     |   3 +
>  .../net/intel/idpf/idpf_common_rxtx_avx2.c    | 197 ++++++++++++++++++
>  drivers/net/intel/idpf/idpf_rxtx.c            |   9 +
>  3 files changed, 209 insertions(+)
> 
> diff --git a/drivers/net/intel/idpf/idpf_common_rxtx.h b/drivers/net/intel/idpf/idpf_common_rxtx.h
> index 87f6895c4c..3636d55272 100644
> --- a/drivers/net/intel/idpf/idpf_common_rxtx.h
> +++ b/drivers/net/intel/idpf/idpf_common_rxtx.h
> @@ -264,6 +264,9 @@ uint16_t idpf_dp_singleq_recv_pkts_avx2(void *rx_queue,
>  					struct rte_mbuf **rx_pkts,
>  					uint16_t nb_pkts);
>  __rte_internal
> +uint16_t idpf_dp_splitq_xmit_pkts_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
> +				       uint16_t nb_pkts);
> +__rte_internal
>  uint16_t idpf_dp_singleq_xmit_pkts_avx2(void *tx_queue,
>  					struct rte_mbuf **tx_pkts,
>  					uint16_t nb_pkts);
> diff --git a/drivers/net/intel/idpf/idpf_common_rxtx_avx2.c b/drivers/net/intel/idpf/idpf_common_rxtx_avx2.c
> index ae10ca981f..1d8f7dd0e3 100644
> --- a/drivers/net/intel/idpf/idpf_common_rxtx_avx2.c
> +++ b/drivers/net/intel/idpf/idpf_common_rxtx_avx2.c
> @@ -800,3 +800,200 @@ idpf_dp_singleq_xmit_pkts_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
>  
>  	return nb_tx;
>  }
> +
> +static __rte_always_inline void
> +idpf_splitq_scan_cq_ring(struct ci_tx_queue *cq)
> +{
> +	struct idpf_splitq_tx_compl_desc *compl_ring;
> +	struct ci_tx_queue *txq;
> +	uint16_t genid, txq_qid, cq_qid, i;
> +	uint8_t ctype;
> +
> +	cq_qid = cq->tx_tail;
> +
> +	for (i = 0; i < IDPD_TXQ_SCAN_CQ_THRESH; i++) {
> +		if (cq_qid == cq->nb_tx_desc) {
> +			cq_qid = 0;
> +			cq->expected_gen_id ^= 1;  /* toggle generation bit */
> +		}
> +
> +		compl_ring = &cq->compl_ring[cq_qid];
> +
> +		genid = (rte_le_to_cpu_16(compl_ring->qid_comptype_gen) &
> +			 IDPF_TXD_COMPLQ_GEN_M) >> IDPF_TXD_COMPLQ_GEN_S;
> +
> +		if (genid != cq->expected_gen_id)
> +			break;
> +
> +		ctype = (rte_le_to_cpu_16(compl_ring->qid_comptype_gen) &
> +			 IDPF_TXD_COMPLQ_COMPL_TYPE_M) >> IDPF_TXD_COMPLQ_COMPL_TYPE_S;
> +
> +		txq_qid = (rte_le_to_cpu_16(compl_ring->qid_comptype_gen) &
> +			   IDPF_TXD_COMPLQ_QID_M) >> IDPF_TXD_COMPLQ_QID_S;
> +
> +		txq = cq->txqs[txq_qid - cq->tx_start_qid];

Given that we have multiple Tx queues working off the same completion
queue, do we need to handle the scenario where multiple threads are sending
on multiple Tx queues at the same time, but sharing the same CQ?

> +		if (ctype == IDPF_TXD_COMPLT_RS)
> +			txq->rs_compl_count++;

From what I see here, we increment the completion count packet by packet,
correct? That matches the descriptor-writing function, where I don't see a
separate flag being set to request completion reporting. In that case, why
are we tracking the next_rs setting for the Tx ring, when there is no
specific RS bit to track?

> +
> +		cq_qid++;
> +	}
> +
> +	cq->tx_tail = cq_qid;
> +}
> +
> +static __rte_always_inline void
> +idpf_splitq_vtx1_avx2(struct idpf_flex_tx_sched_desc *txdp,
> +				struct rte_mbuf *pkt, uint64_t flags)
> +{
> +	uint64_t high_qw =
> +		IDPF_TX_DESC_DTYPE_FLEX_FLOW_SCHE |

Is this a typo in the original enum definition? Should it be ..._FLOW_SCHED?

> +		((uint64_t)flags) |
> +		((uint64_t)pkt->data_len << IDPF_TXD_QW1_TX_BUF_SZ_S);
> +
> +	__m128i descriptor = _mm_set_epi64x(high_qw,
> +						pkt->buf_iova + pkt->data_off);
> +	_mm_storeu_si128((__m128i *)txdp, descriptor);
> +}
> +
> +static inline void
> +idpf_splitq_vtx_avx2(struct idpf_flex_tx_sched_desc *txdp,
> +				struct rte_mbuf **pkt, uint16_t nb_pkts, uint64_t flags)
> +{
> +	const uint64_t hi_qw_tmpl = IDPF_TX_DESC_DTYPE_FLEX_FLOW_SCHE |
> +		((uint64_t)flags);

Line doesn't need wrapping.
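
i.e. it can simply be:

	const uint64_t hi_qw_tmpl = IDPF_TX_DESC_DTYPE_FLEX_FLOW_SCHE | ((uint64_t)flags);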

> +
> +	/* align if needed */
> +	if (((uintptr_t)txdp & 0x1F) != 0 && nb_pkts != 0) {
> +		idpf_splitq_vtx1_avx2(txdp, *pkt, flags);
> +		txdp++, pkt++, nb_pkts--;
> +	}
> +
> +	for (; nb_pkts >= IDPF_VPMD_DESCS_PER_LOOP;
> +					txdp += IDPF_VPMD_DESCS_PER_LOOP,
> +					pkt += IDPF_VPMD_DESCS_PER_LOOP,
> +					nb_pkts -= IDPF_VPMD_DESCS_PER_LOOP) {

Over-indented, I think. Two tabs should be enough.

> +		uint64_t hi_qw3 = hi_qw_tmpl |
> +			((uint64_t)pkt[3]->data_len << IDPF_TXD_QW1_TX_BUF_SZ_S);
> +		uint64_t hi_qw2 = hi_qw_tmpl |
> +			((uint64_t)pkt[2]->data_len << IDPF_TXD_QW1_TX_BUF_SZ_S);
> +		uint64_t hi_qw1 = hi_qw_tmpl |
> +			((uint64_t)pkt[1]->data_len << IDPF_TXD_QW1_TX_BUF_SZ_S);
> +		uint64_t hi_qw0 = hi_qw_tmpl |
> +			((uint64_t)pkt[0]->data_len << IDPF_TXD_QW1_TX_BUF_SZ_S);
> +
> +		__m256i desc2_3 = _mm256_set_epi64x(hi_qw3,
> +			pkt[3]->buf_iova + pkt[3]->data_off,
> +			hi_qw2,
> +			pkt[2]->buf_iova + pkt[2]->data_off);
> +
> +		__m256i desc0_1 = _mm256_set_epi64x(hi_qw1,
> +			pkt[1]->buf_iova + pkt[1]->data_off,
> +			hi_qw0,
> +			pkt[0]->buf_iova + pkt[0]->data_off);
> +
> +		_mm256_storeu_si256((__m256i *)(txdp + 2), desc2_3);
> +		_mm256_storeu_si256((__m256i *)txdp, desc0_1);

For Tx, there is no race condition to be aware of with the NIC, so there is
no need to build up and write the descriptors in reverse order. It's not
wrong to do so, but unnecessary.
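
For example, just storing them in ascending order (untested) would be:

		_mm256_storeu_si256((__m256i *)txdp, desc0_1);
		_mm256_storeu_si256((__m256i *)(txdp + 2), desc2_3);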

> +	}
> +
> +	while (nb_pkts--) {
> +		idpf_splitq_vtx1_avx2(txdp, *pkt, flags);
> +		txdp++;
> +		pkt++;
> +	}
> +}
> +
> +static inline uint16_t
> +idpf_splitq_xmit_fixed_burst_vec_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
> +						uint16_t nb_pkts)

More indentation than necessary here. Double-tab should be sufficient.
However, if we don't wrap the line it only reaches column 99, so no need to
wrap at all.

> +{
> +	struct ci_tx_queue *txq = (struct ci_tx_queue *)tx_queue;
> +	struct idpf_flex_tx_sched_desc *txdp;
> +	struct ci_tx_entry_vec *txep;
> +	uint16_t n, nb_commit, tx_id;
> +	uint64_t cmd_dtype = IDPF_TXD_FLEX_FLOW_CMD_EOP;
> +
> +	tx_id = txq->tx_tail;
> +

Why not just assign the value when you define the variable?
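
i.e. something like:

	uint16_t tx_id = txq->tx_tail;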

> +	/* restrict to max burst size */
> +	nb_pkts = RTE_MIN(nb_pkts, txq->tx_rs_thresh);
> +

This was already done in the calling wrapper function, so there is no need
to do it again.

> +	/* make sure we have enough free space */
> +	if (txq->nb_tx_free < txq->tx_free_thresh)
> +		ci_tx_free_bufs_vec(txq, idpf_tx_desc_done, false);
> +

This looks wrong to me. From what I see, idpf_tx_desc_done always returns
true when using the splitq model, which means that you need to check the
completion queue counts before freeing buffers. That is done in the wrapper
function below, but here you only compare the free count against the
threshold, meaning that you will free buffers without actually checking
whether they have completed, right?
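
As a rough, untested sketch of what I mean, gating the free on the
completion count here as well - assuming rs_compl_count is kept up to
date by the caller's CQ scan - would look something like:

	/* only free when completions have actually been reported */
	if (txq->nb_tx_free < txq->tx_free_thresh &&
			txq->rs_compl_count > txq->tx_free_thresh) {
		ci_tx_free_bufs_vec(txq, idpf_tx_desc_done, false);
		txq->rs_compl_count -= txq->tx_rs_thresh;
	}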

> +	nb_commit = (uint16_t)RTE_MIN(txq->nb_tx_free, nb_pkts);
> +	nb_pkts = nb_commit;
> +	if (unlikely(nb_pkts == 0))
> +		return 0;
> +
> +	txdp = (struct idpf_flex_tx_sched_desc *)&txq->desc_ring[tx_id];
> +	txep = (void *)txq->sw_ring;
> +	txep += tx_id;

Why the cast, and why use sw_ring rather than the sw_ring_vec pointer?
Also, the separate addition shouldn't be needed - you can use the same
style as when assigning txdp. What about: "txep = &txq->sw_ring_vec[tx_id];"?

> +
> +	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
> +
> +	n = (uint16_t)(txq->nb_tx_desc - tx_id);
> +	if (nb_commit >= n) {
> +		ci_tx_backlog_entry_vec(txep, tx_pkts, n);
> +
> +		idpf_splitq_vtx_avx2(txdp, tx_pkts, n - 1, cmd_dtype);
> +		tx_pkts += (n - 1);
> +		txdp += (n - 1);
> +
> +		idpf_splitq_vtx1_avx2(txdp, *tx_pkts++, cmd_dtype);
> +

Is there a reason for writing n-1 entries in bulk and then writing the last
one individually? From what I see, the flags are the same for all of them.
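
If they are all the same, a single call (untested) should cover the whole
wrap-around chunk:

	idpf_splitq_vtx_avx2(txdp, tx_pkts, n, cmd_dtype);
	tx_pkts += n;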

> +		nb_commit = (uint16_t)(nb_commit - n);
> +
> +		tx_id = 0;
> +		txq->tx_next_rs = (uint16_t)(txq->tx_rs_thresh - 1);
> +
> +		txdp = &txq->desc_ring[tx_id];
> +		txep = (void *)txq->sw_ring;
> +		txep += tx_id;

Same comment as above, except that here we know that tx_id == 0, so a
separate addition is definitely not necessary! :)

> +	}
> +
> +	ci_tx_backlog_entry_vec(txep, tx_pkts, nb_commit);
> +
> +	idpf_splitq_vtx_avx2(txdp, tx_pkts, nb_commit, cmd_dtype);
> +
> +	tx_id = (uint16_t)(tx_id + nb_commit);
> +	if (tx_id > txq->tx_next_rs)
> +		txq->tx_next_rs =
> +		(uint16_t)(txq->tx_next_rs + txq->tx_rs_thresh);
> +

Watch the indentation here - the continuation line is indented to the same
level as the line above, so it reads like two separate code lines.
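
i.e. I'd expect something like:

	if (tx_id > txq->tx_next_rs)
		txq->tx_next_rs = (uint16_t)(txq->tx_next_rs + txq->tx_rs_thresh);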

> +	txq->tx_tail = tx_id;
> +
> +	IDPF_PCI_REG_WRITE(txq->qtx_tail, txq->tx_tail);
> +
> +	return nb_pkts;
> +}
> +
> +uint16_t
> +idpf_dp_splitq_xmit_pkts_avx2(void *tx_queue, struct rte_mbuf **tx_pkts,
> +					uint16_t nb_pkts)
> +{
> +	struct ci_tx_queue *txq = (struct ci_tx_queue *)tx_queue;
> +	uint16_t nb_tx = 0;
> +
> +	while (nb_pkts) {
> +		uint16_t ret, num;
> +		idpf_splitq_scan_cq_ring(txq->complq);
> +
> +		if (txq->rs_compl_count > txq->tx_free_thresh) {
> +			ci_tx_free_bufs_vec(txq, idpf_tx_desc_done, false);
> +			txq->rs_compl_count -= txq->tx_rs_thresh;
> +		}
> +
> +		num = (uint16_t)RTE_MIN(nb_pkts, txq->tx_rs_thresh);
> +		ret = idpf_splitq_xmit_fixed_burst_vec_avx2(tx_queue,
> +								&tx_pkts[nb_tx], num);

The line doesn't need that much indentation. It also doesn't need to be
wrapped, since unwrapped it's only 93 chars wide.
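
e.g.:

		ret = idpf_splitq_xmit_fixed_burst_vec_avx2(tx_queue, &tx_pkts[nb_tx], num);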

> +		nb_tx += ret;
> +		nb_pkts -= ret;
> +		if (ret < num)
> +			break;
> +	}
> +
> +	return nb_tx;
> +}
> +
> +RTE_EXPORT_INTERNAL_SYMBOL(idpf_dp_splitq_xmit_pkts_avx2)
> diff --git a/drivers/net/intel/idpf/idpf_rxtx.c b/drivers/net/intel/idpf/idpf_rxtx.c
> index 1c725065df..6950fabb49 100644
> --- a/drivers/net/intel/idpf/idpf_rxtx.c
> +++ b/drivers/net/intel/idpf/idpf_rxtx.c
> @@ -850,6 +850,15 @@ idpf_set_tx_function(struct rte_eth_dev *dev)
>  				return;
>  			}
>  #endif /* CC_AVX512_SUPPORT */
> +			if (tx_simd_width == RTE_VECT_SIMD_256) {
> +				PMD_DRV_LOG(NOTICE,
> +					    "Using Split AVX2 Vector Tx (port %d).",
> +					    dev->data->port_id);
> +				dev->tx_pkt_burst = idpf_dp_splitq_xmit_pkts_avx2;
> +				dev->tx_pkt_prepare = idpf_dp_prep_pkts;
> +				return;
> +			}
> +
>  		}
>  		PMD_DRV_LOG(NOTICE,
>  			    "Using Split Scalar Tx (port %d).",
> -- 
> 2.34.1
> 
