DPDK patches and discussions
From: Adrien Mazarguil <adrien.mazarguil@6wind.com>
To: Ophir Munk <ophirmu@mellanox.com>
Cc: dev@dpdk.org, Thomas Monjalon <thomas@monjalon.net>,
	Olga Shern <olgas@mellanox.com>, Matan Azrad <matan@mellanox.com>
Subject: Re: [dpdk-dev] [PATCH v2 6/7] net/mlx4: improve performance of one Tx segment
Date: Wed, 25 Oct 2017 18:50:21 +0200
Message-ID: <20171025165021.GJ26782@6wind.com>
In-Reply-To: <1508768520-4810-7-git-send-email-ophirmu@mellanox.com>

On Mon, Oct 23, 2017 at 02:21:59PM +0000, Ophir Munk wrote:
> From: Matan Azrad <matan@mellanox.com>
> 
> Since one segment shouldn't use additional memory to save segments
> byte_count for writing them in different order we can prevent
> additional memory unnecessary usage in this case.
> By the way, prevent loop management.
> 
> All for performance improvement.

...of the single-segment scenario? In my opinion the TX burst function doesn't
know the likeliest use case of the application unless it first checks some
user-provided configuration, e.g. some flag telling it TX gather is a rare
occurrence.

Multiple segment TX is actually quite common even for small packet
sizes. Applications may find it easier to prepend a cloned mbuf segment to
all packets in order to perform some encapsulation than to memcpy() its
contents inside the headroom of each packet to send. It's much more
efficient CPU-wise and a better use of HW capabilities.
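
To make the encapsulation case concrete, here is a minimal sketch of
prepending a shared header as an extra segment. The sketch_* types and
names are hypothetical stand-ins for rte_mbuf, not the DPDK API; with real
mbufs one would typically combine rte_pktmbuf_clone() and
rte_pktmbuf_chain() instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one mbuf segment. */
struct sketch_seg {
	struct sketch_seg *next; /* next segment of the packet, or NULL */
	const void *data;        /* segment payload (may be shared) */
	uint16_t data_len;       /* bytes in this segment */
};

/* Hypothetical, simplified stand-in for per-packet metadata. */
struct sketch_pkt {
	struct sketch_seg *head; /* first segment */
	uint32_t pkt_len;        /* total bytes across all segments */
	uint16_t nb_segs;        /* number of segments */
};

/*
 * Prepend a header segment without copying any bytes. Only the small
 * descriptor is per packet; hdr->data can point at one shared header.
 */
void
sketch_prepend(struct sketch_pkt *pkt, struct sketch_seg *hdr)
{
	hdr->next = pkt->head;
	pkt->head = hdr;
	pkt->pkt_len += hdr->data_len;
	pkt->nb_segs++;
}
```

Even a 60-byte packet becomes a two-segment packet this way, which is why
nb_segs == 1 cannot be assumed to be the common case.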

likely() and unlikely() must be very carefully used in order to not wreck
the performance of the non-ideal (real-world, non-benchmarking, however you
want to call it) scenario, so when in doubt, keep them for exceptions and
error checks.
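
As a rough illustration of that guideline (not code from this patch), the
sketch below hints only the error check and leaves the segment-count branch
unhinted. The sketch_* names are hypothetical, and the macros assume a
GCC/clang-style __builtin_expect(); they only mirror what likely() and
unlikely() usually expand to.

```c
#include <stddef.h>
#include <stdint.h>

/* Local equivalents so the sketch builds alone (assumes GCC or clang). */
#define sketch_likely(x)   __builtin_expect(!!(x), 1)
#define sketch_unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical helper: sum the segment lengths of one packet. */
int
sketch_pkt_len(const uint16_t *seg_len, size_t nb_segs, uint32_t *total)
{
	size_t i;

	/* Genuine error path: a hint is reasonable here. */
	if (sketch_unlikely(seg_len == NULL || total == NULL || nb_segs == 0))
		return -1;
	/*
	 * No hint on purpose: whether packets have one or several
	 * segments depends on the application, not on the driver.
	 */
	if (nb_segs == 1) {
		*total = seg_len[0];
		return 0;
	}
	*total = 0;
	for (i = 0; i != nb_segs; ++i)
		*total += seg_len[i];
	return 0;
}
```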

I can't accept this patch without performance results for single and
multiple-segment use cases which show they both benefit from it.

A few more comments below.

> Signed-off-by: Matan Azrad <matan@mellanox.com>
> 
> ---
>  drivers/net/mlx4/mlx4_rxtx.c | 125 +++++++++++++++++++++++++++++--------------
>  1 file changed, 85 insertions(+), 40 deletions(-)
> 
> diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
> index e8d9a35..3236552 100644
> --- a/drivers/net/mlx4/mlx4_rxtx.c
> +++ b/drivers/net/mlx4/mlx4_rxtx.c
> @@ -310,7 +310,6 @@ mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
>  		uint32_t owner_opcode = MLX4_OPCODE_SEND;
>  		struct mlx4_wqe_ctrl_seg *ctrl;
>  		struct mlx4_wqe_data_seg *dseg;
> -		struct rte_mbuf *sbuf;
>  		union {
>  			uint32_t flags;
>  			uint16_t flags16[2];
> @@ -363,12 +362,12 @@ mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
>  		dseg = (struct mlx4_wqe_data_seg *)((uintptr_t)ctrl +
>  				sizeof(struct mlx4_wqe_ctrl_seg));
>  		/* Fill the data segments with buffer information. */
> -		for (sbuf = buf; sbuf != NULL; sbuf = sbuf->next, dseg++) {
> -			addr = rte_pktmbuf_mtod(sbuf, uintptr_t);
> +		if (likely(buf->nb_segs == 1)) {
> +			addr = rte_pktmbuf_mtod(buf, uintptr_t);
>  			rte_prefetch0((volatile void *)addr);
>  			/* Handle WQE wraparound. */
> -			if (unlikely(dseg >=
> -			    (struct mlx4_wqe_data_seg *)sq->eob))
> +			if (unlikely(dseg >= (struct mlx4_wqe_data_seg *)
> +					sq->eob))

Besides the fact this coding style change is unrelated to this commit, this
is one example of unlikely() that should not be unlikely(). While it only
occurs every time the index wraps at the end of the ring, it's still
extremely likely and expected given the number of packets processed per
second.
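
To put a number on it, the sketch below walks a ring with the same
"pointer reached end of buffer" test and counts the wraps. The ring size
and all sketch_* names are assumptions for illustration only: with 256
entries, a queue posting 10 million descriptors per second would take this
branch roughly 39,000 times per second.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_RING_ENTRIES 256 /* assumed ring size, for illustration */

struct sketch_ring {
	uint32_t buf[SKETCH_RING_ENTRIES];
	uint32_t *eob; /* one past the last entry */
	uint32_t *cur; /* next entry to use */
	size_t wraps;  /* how many times the wraparound branch was taken */
};

/* Return the next entry, wrapping to the start when the end is reached. */
static uint32_t *
sketch_ring_next(struct sketch_ring *r)
{
	/* Plain test, deliberately without unlikely(). */
	if (r->cur >= r->eob) {
		r->cur = r->buf;
		r->wraps++;
	}
	return r->cur++;
}

/* Use n entries from an empty ring and report how often it wrapped. */
size_t
sketch_count_wraps(size_t n)
{
	static struct sketch_ring r;
	size_t i;

	r.eob = r.buf + SKETCH_RING_ENTRIES;
	r.cur = r.buf;
	r.wraps = 0;
	for (i = 0; i != n; ++i)
		*sketch_ring_next(&r) = (uint32_t)i;
	return r.wraps;
}
```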

>  				dseg = (struct mlx4_wqe_data_seg *)sq->buf;
>  			dseg->addr = rte_cpu_to_be_64(addr);
>  			/* Memory region key (big endian). */
> @@ -392,44 +391,90 @@ mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
>  				break;
>  			}
>  	#endif /* NDEBUG */
> -			if (likely(sbuf->data_len)) {
> -				byte_count = rte_cpu_to_be_32(sbuf->data_len);
> -			} else {
> -				/*
> -				 * Zero length segment is treated as inline
> -				 * segment with zero data.
> -				 */
> -				byte_count = RTE_BE32(0x80000000);
> -			}
> -			/*
> -			 * If the data segment is not at the beginning
> -			 * of a Tx basic block (TXBB) then write the
> -			 * byte count, else postpone the writing to
> -			 * just before updating the control segment.
> -			 */
> -			if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
> -				/*
> -				 * Need a barrier here before writing the
> -				 * byte_count fields to make sure that all the
> -				 * data is visible before the byte_count field
> -				 * is set. otherwise, if the segment begins a
> -				 * new cacheline, the HCA prefetcher could grab
> -				 * the 64-byte chunk and get a valid
> -				 * (!= 0xffffffff) byte count but stale data,
> -				 * and end up sending the wrong data.
> -				 */
> -				rte_io_wmb();
> -				dseg->byte_count = byte_count;
> -			} else {
> +			/* Need a barrier here before writing the byte_count. */
> +			rte_io_wmb();
> +			dseg->byte_count = rte_cpu_to_be_32(buf->data_len);
> +		} else {
> +			/* Fill the data segments with buffer information. */
> +			struct rte_mbuf *sbuf;
> +
> +			for (sbuf = buf;
> +				 sbuf != NULL;
> +				 sbuf = sbuf->next, dseg++) {
> +				addr = rte_pktmbuf_mtod(sbuf, uintptr_t);
> +				rte_prefetch0((volatile void *)addr);
> +				/* Handle WQE wraparound. */
> +				if (unlikely(dseg >=
> +					(struct mlx4_wqe_data_seg *)sq->eob))
> +					dseg = (struct mlx4_wqe_data_seg *)
> +							sq->buf;
> +				dseg->addr = rte_cpu_to_be_64(addr);
> +				/* Memory region key (big endian). */
> +				dseg->lkey = mlx4_txq_mp2mr(txq,
> +						mlx4_txq_mb2mp(sbuf));
> +		#ifndef NDEBUG

I didn't catch this in the original review, but coding rules prohibit
indented preprocessor directives. You must remove the extra indent if you're
modifying them.
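
For reference, a small sketch of the expected layout, with the directives
at column zero while the guarded code keeps its normal indentation. The
function is hypothetical and unrelated to the driver:

```c
#include <stdint.h>

/* Hypothetical helper: reject an invalid key in debug builds only. */
int
sketch_check_lkey(uint32_t lkey)
{
	int ret = 0;

	if (lkey != 0) {
#ifndef NDEBUG
		/* Directive at column zero, guarded code indented as usual. */
		if (lkey == UINT32_MAX)
			ret = -1;
#endif /* NDEBUG */
	}
	return ret;
}
```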

> +				if (unlikely(dseg->lkey ==
> +					rte_cpu_to_be_32((uint32_t)-1))) {
> +					/* MR does not exist. */
> +					DEBUG("%p: unable to get MP <-> MR association",
> +						  (void *)txq);
> +					/*
> +					 * Restamp entry in case of failure.
> +					 * Make sure that size is written
> +					 * correctly, note that we give
> +					 * ownership to the SW, not the HW.
> +					 */
> +					ctrl->fence_size =
> +						(wqe_real_size >> 4) & 0x3f;
> +					mlx4_txq_stamp_freed_wqe(sq, head_idx,
> +					    (sq->head & sq->txbb_cnt) ? 0 : 1);
> +					elt->buf = NULL;
> +					break;
> +				}
> +		#endif /* NDEBUG */
> +				if (likely(sbuf->data_len)) {
> +					byte_count =
> +					  rte_cpu_to_be_32(sbuf->data_len);
> +				} else {
> +					/*
> +					 * Zero length segment is treated as
> +					 * inline segment with zero data.
> +					 */
> +					byte_count = RTE_BE32(0x80000000);
> +				}
>  				/*
> -				 * This data segment starts at the beginning of
> -				 * a new TXBB, so we need to postpone its
> -				 * byte_count writing for later.
> +				 * If the data segment is not at the beginning
> +				 * of a Tx basic block (TXBB) then write the
> +				 * byte count, else postpone the writing to
> +				 * just before updating the control segment.
>  				 */
> -				pv[pv_counter].dseg = dseg;
> -				pv[pv_counter++].val = byte_count;
> +				if ((uintptr_t)dseg &
> +					(uintptr_t)(MLX4_TXBB_SIZE - 1)) {
> +					/*
> +					 * Need a barrier here before writing
> +					 * the byte_count fields to make sure
> +					 * that all the data is visible before
> +					 * the byte_count field is set.
> +					 * Otherwise, if the segment begins a
> +					 * new cacheline, the HCA prefetcher
> +					 * could grab the 64-byte chunk and get
> +					 * a valid (!= 0xffffffff) byte count
> +					 * but stale data, and end up sending
> +					 * the wrong data.
> +					 */
> +					rte_io_wmb();
> +					dseg->byte_count = byte_count;
> +				} else {
> +					/*
> +					 * This data segment starts at the
> +					 * beginning of a new TXBB, so we
> +					 * need to postpone its byte_count
> +					 * writing for later.
> +					 */
> +					pv[pv_counter].dseg = dseg;
> +					pv[pv_counter++].val = byte_count;
> +				}
>  			}
> -		}

Where did that block go? Isn't there an unnecessary indentation level here?

>  		/* Write the first DWORD of each TXBB save earlier. */
>  		if (pv_counter) {
>  			/* Need a barrier before writing the byte_count. */
> -- 
> 2.7.4
> 

-- 
Adrien Mazarguil
6WIND
