Date: Wed, 25 Oct 2017 18:50:21 +0200
From: Adrien Mazarguil
To: Ophir Munk
Cc: dev@dpdk.org, Thomas Monjalon, Olga Shern, Matan Azrad
Message-ID: <20171025165021.GJ26782@6wind.com>
References: <1508752838-30408-1-git-send-email-ophirmu@mellanox.com>
 <1508768520-4810-1-git-send-email-ophirmu@mellanox.com>
 <1508768520-4810-7-git-send-email-ophirmu@mellanox.com>
In-Reply-To: <1508768520-4810-7-git-send-email-ophirmu@mellanox.com>
Subject: Re: [dpdk-dev] [PATCH v2 6/7] net/mlx4: improve performance of one Tx segment

On Mon, Oct 23, 2017 at 02:21:59PM +0000, Ophir Munk wrote:
> From: Matan Azrad
> 
> Since one segment shouldn't use additional memory to save segments
> byte_count for writing them in different order we can prevent
> additional memory unnecessary usage in this case.
> By the way, prevent loop management.
> 
> All for performance improvement.

...of the single-segment scenario? In my opinion the TX burst function
doesn't know the likeliest use case of the application unless it first
checks some user-provided configuration, e.g. some flag telling it TX
gather is a rare occurrence.

Multiple segment TX is actually quite common even for small packet
sizes. Applications may find it easier to prepend a cloned mbuf segment
to all packets in order to perform some encapsulation than to memcpy()
its contents inside the headroom of each packet to send. It's much more
efficient CPU-wise and a better use of HW capabilities.
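For instance, something along these lines (a rough sketch only;
encap_prepend(), encap_tmpl and hdr_pool are hypothetical names, not
part of this patch):

 #include <rte_mbuf.h>

 /*
  * Prepend a shared encapsulation header as an extra segment instead
  * of memcpy()ing it into the headroom of each packet. encap_tmpl is
  * assumed to be a single-segment template mbuf holding the header.
  */
 static struct rte_mbuf *
 encap_prepend(struct rte_mbuf *pkt, struct rte_mbuf *encap_tmpl,
               struct rte_mempool *hdr_pool)
 {
         /* Indirect clone: references encap_tmpl's data, no copy. */
         struct rte_mbuf *hdr = rte_pktmbuf_clone(encap_tmpl, hdr_pool);

         if (hdr == NULL)
                 return NULL;
         /* Chain the original packet behind the cloned header. */
         hdr->next = pkt;
         hdr->nb_segs = pkt->nb_segs + 1;
         hdr->pkt_len += pkt->pkt_len;
         return hdr; /* Multi-segment packet, ready for tx_burst(). */
 }

Every packet then reaches the TX function with at least two segments,
which is why the multi-segment path can't be treated as a rare event.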
likely() and unlikely() must be very carefully used in order to not
wreck the performance of the non-ideal (real-world, non-benchmarking,
however you want to call it) scenario, so when in doubt, keep them for
exceptions and error checks.

I can't accept this patch without performance results for single and
multiple-segment use cases which show they both benefit from it.

A few more comments below.

> Signed-off-by: Matan Azrad
> 
> ---
>  drivers/net/mlx4/mlx4_rxtx.c | 125 +++++++++++++++++++++++++++++--------------
>  1 file changed, 85 insertions(+), 40 deletions(-)
> 
> diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
> index e8d9a35..3236552 100644
> --- a/drivers/net/mlx4/mlx4_rxtx.c
> +++ b/drivers/net/mlx4/mlx4_rxtx.c
> @@ -310,7 +310,6 @@ mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
>          uint32_t owner_opcode = MLX4_OPCODE_SEND;
>          struct mlx4_wqe_ctrl_seg *ctrl;
>          struct mlx4_wqe_data_seg *dseg;
> -        struct rte_mbuf *sbuf;
>          union {
>                  uint32_t flags;
>                  uint16_t flags16[2];
> @@ -363,12 +362,12 @@ mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
>                  dseg = (struct mlx4_wqe_data_seg *)((uintptr_t)ctrl +
>                         sizeof(struct mlx4_wqe_ctrl_seg));
>                  /* Fill the data segments with buffer information. */
> -                for (sbuf = buf; sbuf != NULL; sbuf = sbuf->next, dseg++) {
> -                        addr = rte_pktmbuf_mtod(sbuf, uintptr_t);
> +                if (likely(buf->nb_segs == 1)) {
> +                        addr = rte_pktmbuf_mtod(buf, uintptr_t);
>                          rte_prefetch0((volatile void *)addr);
>                          /* Handle WQE wraparound. */
> -                        if (unlikely(dseg >=
> -                            (struct mlx4_wqe_data_seg *)sq->eob))
> +                        if (unlikely(dseg >= (struct mlx4_wqe_data_seg *)
> +                            sq->eob))

Besides the fact this coding style change is unrelated to this commit,
this is one example of unlikely() that should not be unlikely(). While
it only occurs every time the index wraps at the end of the ring, it's
still extremely likely and expected given the number of packets
processed per second.
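In other words (an illustrative sketch of the guideline, reusing names
from the patch for clarity, not actual driver code):

         /* Error check: a genuine exception, unlikely() is fine. */
         if (unlikely(dseg->lkey == rte_cpu_to_be_32((uint32_t)-1)))
                 break;
         /* Ring wraparound: taken on every pass, leave it unhinted. */
         if (dseg >= (struct mlx4_wqe_data_seg *)sq->eob)
                 dseg = (struct mlx4_wqe_data_seg *)sq->buf;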
>                                  dseg = (struct mlx4_wqe_data_seg *)sq->buf;
>                          dseg->addr = rte_cpu_to_be_64(addr);
>                          /* Memory region key (big endian). */
> @@ -392,44 +391,90 @@ mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
>                                  break;
>                          }
>  #endif /* NDEBUG */
> -                        if (likely(sbuf->data_len)) {
> -                                byte_count = rte_cpu_to_be_32(sbuf->data_len);
> -                        } else {
> -                                /*
> -                                 * Zero length segment is treated as inline
> -                                 * segment with zero data.
> -                                 */
> -                                byte_count = RTE_BE32(0x80000000);
> -                        }
> -                        /*
> -                         * If the data segment is not at the beginning
> -                         * of a Tx basic block (TXBB) then write the
> -                         * byte count, else postpone the writing to
> -                         * just before updating the control segment.
> -                         */
> -                        if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
> -                                /*
> -                                 * Need a barrier here before writing the
> -                                 * byte_count fields to make sure that all the
> -                                 * data is visible before the byte_count field
> -                                 * is set. otherwise, if the segment begins a
> -                                 * new cacheline, the HCA prefetcher could grab
> -                                 * the 64-byte chunk and get a valid
> -                                 * (!= 0xffffffff) byte count but stale data,
> -                                 * and end up sending the wrong data.
> -                                 */
> -                                rte_io_wmb();
> -                                dseg->byte_count = byte_count;
> -                        } else {
> +                        /* Need a barrier here before writing the byte_count. */
> +                        rte_io_wmb();
> +                        dseg->byte_count = rte_cpu_to_be_32(buf->data_len);
> +                } else {
> +                        /* Fill the data segments with buffer information. */
> +                        struct rte_mbuf *sbuf;
> +
> +                        for (sbuf = buf;
> +                             sbuf != NULL;
> +                             sbuf = sbuf->next, dseg++) {
> +                                addr = rte_pktmbuf_mtod(sbuf, uintptr_t);
> +                                rte_prefetch0((volatile void *)addr);
> +                                /* Handle WQE wraparound. */
> +                                if (unlikely(dseg >=
> +                                    (struct mlx4_wqe_data_seg *)sq->eob))
> +                                        dseg = (struct mlx4_wqe_data_seg *)
> +                                                sq->buf;
> +                                dseg->addr = rte_cpu_to_be_64(addr);
> +                                /* Memory region key (big endian). */
> +                                dseg->lkey = mlx4_txq_mp2mr(txq,
> +                                        mlx4_txq_mb2mp(sbuf));
> +                                #ifndef NDEBUG

I didn't catch this in the original review, coding rules prohibit
indented preprocessor directives. You must remove the extra indent if
you're modifying them.

> +                                if (unlikely(dseg->lkey ==
> +                                    rte_cpu_to_be_32((uint32_t)-1))) {
> +                                        /* MR does not exist. */
> +                                        DEBUG("%p: unable to get MP <-> MR association",
> +                                              (void *)txq);
> +                                        /*
> +                                         * Restamp entry in case of failure.
> +                                         * Make sure that size is written
> +                                         * correctly, note that we give
> +                                         * ownership to the SW, not the HW.
> +                                         */
> +                                        ctrl->fence_size =
> +                                                (wqe_real_size >> 4) & 0x3f;
> +                                        mlx4_txq_stamp_freed_wqe(sq, head_idx,
> +                                            (sq->head & sq->txbb_cnt) ? 0 : 1);
> +                                        elt->buf = NULL;
> +                                        break;
> +                                }
> +                                #endif /* NDEBUG */
> +                                if (likely(sbuf->data_len)) {
> +                                        byte_count =
> +                                                rte_cpu_to_be_32(sbuf->data_len);
> +                                } else {
> +                                        /*
> +                                         * Zero length segment is treated as
> +                                         * inline segment with zero data.
> +                                         */
> +                                        byte_count = RTE_BE32(0x80000000);
> +                                }
>                                  /*
> -                                 * This data segment starts at the beginning of
> -                                 * a new TXBB, so we need to postpone its
> -                                 * byte_count writing for later.
> +                                 * If the data segment is not at the beginning
> +                                 * of a Tx basic block (TXBB) then write the
> +                                 * byte count, else postpone the writing to
> +                                 * just before updating the control segment.
>                                   */
> -                                pv[pv_counter].dseg = dseg;
> -                                pv[pv_counter++].val = byte_count;
> +                                if ((uintptr_t)dseg &
> +                                    (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
> +                                        /*
> +                                         * Need a barrier here before writing
> +                                         * the byte_count fields to make sure
> +                                         * that all the data is visible before
> +                                         * the byte_count field is set.
> +                                         * Otherwise, if the segment begins a
> +                                         * new cacheline, the HCA prefetcher
> +                                         * could grab the 64-byte chunk and get
> +                                         * a valid (!= 0xffffffff) byte count
> +                                         * but stale data, and end up sending
> +                                         * the wrong data.
> +                                         */
> +                                        rte_io_wmb();
> +                                        dseg->byte_count = byte_count;
> +                                } else {
> +                                        /*
> +                                         * This data segment starts at the
> +                                         * beginning of a new TXBB, so we
> +                                         * need to postpone its byte_count
> +                                         * writing for later.
> +                                         */
> +                                        pv[pv_counter].dseg = dseg;
> +                                        pv[pv_counter++].val = byte_count;
> +                                }
> +                        }
>                  }
> -        }

Where did that block go? Isn't there an unnecessary indentation level
here?

>                  /* Write the first DWORD of each TXBB save earlier. */
>                  if (pv_counter) {
>                          /* Need a barrier before writing the byte_count. */
> -- 
> 2.7.4
> 

-- 
Adrien Mazarguil
6WIND