From: Matan Azrad <matan@mellanox.com>
To: Adrien Mazarguil <adrien.mazarguil@6wind.com>
CC: dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH 4/8] net/mlx4: optimize Tx multi-segment case
Date: Wed, 6 Dec 2017 11:29:38 +0000
Message-ID: <HE1PR0502MB3659@eurprd05.prod.outlook.com>
References: <1511871570-16826-1-git-send-email-matan@mellanox.com>
 <1511871570-16826-5-git-send-email-matan@mellanox.com>
 <20171206105831.GV4062@6wind.com>
In-Reply-To: <20171206105831.GV4062@6wind.com>
Hi Adrien

> -----Original Message-----
> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> Sent: Wednesday, December 6, 2017 12:59 PM
> To: Matan Azrad
> Cc: dev@dpdk.org
> Subject: Re: [PATCH 4/8] net/mlx4: optimize Tx multi-segment case
>
> On Tue, Nov 28, 2017 at 12:19:26PM +0000, Matan Azrad wrote:
> > mlx4 Tx block can handle up to 4 data segments or control segment + up
> > to 3 data segments. The first data segment in each not first Tx block
> > must validate Tx queue wraparound and must use IO memory barrier
> > before writing the byte count.
> >
> > The previous multi-segment code used "for" loop to iterate over all
> > packet segments and separated first Tx block data case by "if"
> > statments.
>
> statments => statements
>
> >
> > Use switch case and unconditional branches instead of "for" loop can
> > optimize the case and prevents the unnecessary checks for each data
> > segment; This hints to compiler to create opitimized jump table.
>
> opitimized => optimized
>
> >
> > Optimize this case by switch case and unconditional branches usage.
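Just to give some context for the numbers in the commit message: a Tx
basic block (TXBB) is 64 bytes, while control and data segments are 16
bytes each, which is where "4 data segments or control segment + up to
3 data segments" per Tx block comes from. A rough sketch of the size
arithmetic (illustrative names, not the driver's exact macros):

	#define TXBB_SIZE	64	/* bytes per Tx basic block */
	#define SEG_SIZE	16	/* control and data segments alike */

	/*
	 * Round a WQE size in bytes up to whole TXBBs, similar to what
	 * MLX4_SIZE_TO_TXBBS() does in the driver.
	 */
	static inline int
	size_to_txbbs(int wqe_size)
	{
		return (wqe_size + TXBB_SIZE - 1) / TXBB_SIZE;
	}

E.g. a 9-segment packet needs SEG_SIZE * (1 + 9) = 160 bytes, which
rounds up to 3 TXBBs: control segment + 3 data segments, then 4, then 2.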
> >
> > Signed-off-by: Matan Azrad
> > ---
> >  drivers/net/mlx4/mlx4_rxtx.c | 165 +++++++++++++++++++++++++------------------
> >  drivers/net/mlx4/mlx4_rxtx.h |  33 +++++++++
> >  2 files changed, 128 insertions(+), 70 deletions(-)
> >
> > diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
> > index 1d8240a..b9cb2fc 100644
> > --- a/drivers/net/mlx4/mlx4_rxtx.c
> > +++ b/drivers/net/mlx4/mlx4_rxtx.c
> > @@ -432,15 +432,14 @@ struct pv {
> >  	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
> >  	volatile struct mlx4_wqe_ctrl_seg *ctrl;
> >  	volatile struct mlx4_wqe_data_seg *dseg;
> > -	struct rte_mbuf *sbuf;
> > +	struct rte_mbuf *sbuf = buf;
> >  	uint32_t lkey;
> > -	uintptr_t addr;
> > -	uint32_t byte_count;
> >  	int pv_counter = 0;
> > +	int nb_segs = buf->nb_segs;
> >
> >  	/* Calculate the needed work queue entry size for this packet. */
> >  	wqe_real_size = sizeof(volatile struct mlx4_wqe_ctrl_seg) +
> > -		buf->nb_segs * sizeof(volatile struct mlx4_wqe_data_seg);
> > +		nb_segs * sizeof(volatile struct mlx4_wqe_data_seg);
> >  	nr_txbbs = MLX4_SIZE_TO_TXBBS(wqe_real_size);
> >  	/*
> >  	 * Check that there is room for this WQE in the send queue and that
> > @@ -457,67 +456,99 @@ struct pv {
> >  	dseg = (volatile struct mlx4_wqe_data_seg *)
> >  		((uintptr_t)ctrl + sizeof(struct mlx4_wqe_ctrl_seg));
> >  	*pctrl = ctrl;
> > -	/* Fill the data segments with buffer information. */
> > -	for (sbuf = buf; sbuf != NULL; sbuf = sbuf->next, dseg++) {
> > -		addr = rte_pktmbuf_mtod(sbuf, uintptr_t);
> > -		rte_prefetch0((volatile void *)addr);
> > -		/* Memory region key (big endian) for this memory pool. */
> > +	/*
> > +	 * Fill the data segments with buffer information.
> > +	 * First WQE TXBB head segment is always control segment,
> > +	 * so jump to tail TXBB data segments code for the first
> > +	 * WQE data segments filling.
> > +	 */
> > +	goto txbb_tail_segs;
> > +txbb_head_seg:
>
> I'm not fundamentally opposed to "goto" unlike a lot of people out there,
> but this doesn't look good. It's OK to use goto for error cases and to
> extricate yourself when trapped in an inner loop, also in some
> optimization scenarios where it sometimes makes sense, but not when the
> same can be achieved through standard loop constructs and keywords.
>
> In this case I'm under the impression you could have managed with a
> do { ... } while (...) construct. You need to try harder to reorganize
> these changes or prove it can't be done without negatively impacting
> performance.
>
> Doing so should make this patch shorter as well.
>

I noticed this could be done with a loop and without unconditional
branches, but I checked both: this version showed a nice performance
improvement, while the loop version degraded performance.
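For reference, the loop-based variant I benchmarked looked roughly like
the sketch below (simplified, with hypothetical fill_head_seg() and
fill_tail_seg() helpers standing in for the MR lookup and data segment
writes; not the actual driver code):

	#include <stdbool.h>
	#include <rte_mbuf.h>

	/* Hypothetical helpers: MR lookup + data segment write. */
	static void fill_head_seg(struct rte_mbuf *sbuf);
	static void fill_tail_seg(struct rte_mbuf *sbuf);

	static void
	fill_wqe_data_segs(struct rte_mbuf *sbuf, int nb_segs)
	{
		bool first_txbb = true;

		do {
			if (!first_txbb) {
				/*
				 * Head slot of a new TXBB: the byte_count
				 * write must be postponed.
				 */
				fill_head_seg(sbuf);
				sbuf = sbuf->next;
				nb_segs--;
			}
			first_txbb = false;
			/* Up to three data segments fit in the TXBB tail. */
			switch (nb_segs) {
			default:
				fill_tail_seg(sbuf);
				sbuf = sbuf->next;
				nb_segs--;
				/* fallthrough */
			case 2:
				fill_tail_seg(sbuf);
				sbuf = sbuf->next;
				nb_segs--;
				/* fallthrough */
			case 1:
				fill_tail_seg(sbuf);
				sbuf = sbuf->next;
				nb_segs--;
				/* fallthrough */
			case 0:
				break;
			}
		} while (nb_segs > 0);
	}

The extra first_txbb branch and the loop-back check on every iteration
are what the goto version avoids, and presumably where the difference
in my measurements came from.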
> > +	/* Memory region key (big endian) for this memory pool. */
> > +	lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
> > +	if (unlikely(lkey == (uint32_t)-1)) {
> > +		DEBUG("%p: unable to get MP <-> MR association",
> > +		      (void *)txq);
> > +		return -1;
> > +	}
> > +	/* Handle WQE wraparound. */
> > +	if (dseg >=
> > +	    (volatile struct mlx4_wqe_data_seg *)sq->eob)
> > +		dseg = (volatile struct mlx4_wqe_data_seg *)
> > +			sq->buf;
> > +	dseg->addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(sbuf, uintptr_t));
> > +	dseg->lkey = rte_cpu_to_be_32(lkey);
> > +	/*
> > +	 * This data segment starts at the beginning of a new
> > +	 * TXBB, so we need to postpone its byte_count writing
> > +	 * for later.
> > +	 */
> > +	pv[pv_counter].dseg = dseg;
> > +	/*
> > +	 * Zero length segment is treated as inline segment
> > +	 * with zero data.
> > +	 */
> > +	pv[pv_counter++].val = rte_cpu_to_be_32(sbuf->data_len ?
> > +			sbuf->data_len : 0x80000000);
> > +	sbuf = sbuf->next;
> > +	dseg++;
> > +	nb_segs--;
> > +txbb_tail_segs:
> > +	/* Jump to default if there are more than two segments remaining. */
> > +	switch (nb_segs) {
> > +	default:
> >  		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
> > -		dseg->lkey = rte_cpu_to_be_32(lkey);
> > -		/* Calculate the needed work queue entry size for this packet */
> > -		if (unlikely(lkey == rte_cpu_to_be_32((uint32_t)-1))) {
> > -			/* MR does not exist. */
> > +		if (unlikely(lkey == (uint32_t)-1)) {
> >  			DEBUG("%p: unable to get MP <-> MR association",
> >  			      (void *)txq);
> >  			return -1;
> >  		}
> > -		if (likely(sbuf->data_len)) {
> > -			byte_count = rte_cpu_to_be_32(sbuf->data_len);
> > -		} else {
> > -			/*
> > -			 * Zero length segment is treated as inline segment
> > -			 * with zero data.
> > -			 */
> > -			byte_count = RTE_BE32(0x80000000);
> > +		mlx4_fill_tx_data_seg(dseg, lkey,
> > +				      rte_pktmbuf_mtod(sbuf, uintptr_t),
> > +				      rte_cpu_to_be_32(sbuf->data_len ?
> > +						       sbuf->data_len :
> > +						       0x80000000));
> > +		sbuf = sbuf->next;
> > +		dseg++;
> > +		nb_segs--;
> > +		/* fallthrough */
> > +	case 2:
> > +		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
> > +		if (unlikely(lkey == (uint32_t)-1)) {
> > +			DEBUG("%p: unable to get MP <-> MR association",
> > +			      (void *)txq);
> > +			return -1;
> >  		}
> > -		/*
> > -		 * If the data segment is not at the beginning of a
> > -		 * Tx basic block (TXBB) then write the byte count,
> > -		 * else postpone the writing to just before updating the
> > -		 * control segment.
> > -		 */
> > -		if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
> > -			dseg->addr = rte_cpu_to_be_64(addr);
> > -			dseg->lkey = rte_cpu_to_be_32(lkey);
> > -#if RTE_CACHE_LINE_SIZE < 64
> > -			/*
> > -			 * Need a barrier here before writing the byte_count
> > -			 * fields to make sure that all the data is visible
> > -			 * before the byte_count field is set.
> > -			 * Otherwise, if the segment begins a new cacheline,
> > -			 * the HCA prefetcher could grab the 64-byte chunk and
> > -			 * get a valid (!= 0xffffffff) byte count but stale
> > -			 * data, and end up sending the wrong data.
> > -			 */
> > -			rte_io_wmb();
> > -#endif /* RTE_CACHE_LINE_SIZE */
> > -			dseg->byte_count = byte_count;
> > -		} else {
> > -			/*
> > -			 * This data segment starts at the beginning of a new
> > -			 * TXBB, so we need to postpone its byte_count writing
> > -			 * for later.
> > -			 */
> > -			/* Handle WQE wraparound. */
> > -			if (dseg >=
> > -			    (volatile struct mlx4_wqe_data_seg *)sq->eob)
> > -				dseg = (volatile struct mlx4_wqe_data_seg *)
> > -					sq->buf;
> > -			dseg->addr = rte_cpu_to_be_64(addr);
> > -			dseg->lkey = rte_cpu_to_be_32(lkey);
> > -			pv[pv_counter].dseg = dseg;
> > -			pv[pv_counter++].val = byte_count;
> > +		mlx4_fill_tx_data_seg(dseg, lkey,
> > +				      rte_pktmbuf_mtod(sbuf, uintptr_t),
> > +				      rte_cpu_to_be_32(sbuf->data_len ?
> > +						       sbuf->data_len :
> > +						       0x80000000));
> > +		sbuf = sbuf->next;
> > +		dseg++;
> > +		nb_segs--;
> > +		/* fallthrough */
> > +	case 1:
> > +		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
> > +		if (unlikely(lkey == (uint32_t)-1)) {
> > +			DEBUG("%p: unable to get MP <-> MR association",
> > +			      (void *)txq);
> > +			return -1;
> > +		}
> > +		mlx4_fill_tx_data_seg(dseg, lkey,
> > +				      rte_pktmbuf_mtod(sbuf, uintptr_t),
> > +				      rte_cpu_to_be_32(sbuf->data_len ?
> > +						       sbuf->data_len :
> > +						       0x80000000));
> > +		nb_segs--;
> > +		if (nb_segs) {
> > +			sbuf = sbuf->next;
> > +			dseg++;
> > +			goto txbb_head_seg;
> >  		}
> > +		/* fallthrough */
> > +	case 0:
> > +		break;
> >  	}
>
> I think this "switch (nb_segs)" idea is an interesting approach, but
> should occur inside a loop construct as previously described.
>
Same comment as above.

> >  	/* Write the first DWORD of each TXBB save earlier. */
> >  	if (pv_counter) {
> > @@ -583,7 +614,6 @@ struct pv {
> >  	} srcrb;
> >  	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
> >  	uint32_t lkey;
> > -	uintptr_t addr;
> >
> >  	/* Clean up old buffer. */
> >  	if (likely(elt->buf != NULL)) {
> > @@ -618,24 +648,19 @@ struct pv {
> >  			dseg = (volatile struct mlx4_wqe_data_seg *)
> >  				((uintptr_t)ctrl +
> >  				sizeof(struct mlx4_wqe_ctrl_seg));
> > -			addr = rte_pktmbuf_mtod(buf, uintptr_t);
> > -			rte_prefetch0((volatile void *)addr);
> > -			dseg->addr = rte_cpu_to_be_64(addr);
> > -			/* Memory region key (big endian). */
> > +
> > +			ctrl->fence_size = (WQE_ONE_DATA_SEG_SIZE >> 4) & 0x3f;
> >  			lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
> > -			dseg->lkey = rte_cpu_to_be_32(lkey);
> > -			if (unlikely(dseg->lkey ==
> > -				rte_cpu_to_be_32((uint32_t)-1))) {
> > +			if (unlikely(lkey == (uint32_t)-1)) {
> >  				/* MR does not exist. */
> >  				DEBUG("%p: unable to get MP <-> MR association",
> >  				      (void *)txq);
> >  				elt->buf = NULL;
> >  				break;
> >  			}
> > -			/* Never be TXBB aligned, no need compiler barrier. */
> > -			dseg->byte_count = rte_cpu_to_be_32(buf->data_len);
> > -			/* Fill the control parameters for this packet. */
> > -			ctrl->fence_size = (WQE_ONE_DATA_SEG_SIZE >> 4) & 0x3f;
> > +			mlx4_fill_tx_data_seg(dseg, lkey,
> > +					      rte_pktmbuf_mtod(buf, uintptr_t),
> > +					      rte_cpu_to_be_32(buf->data_len));
> >  			nr_txbbs = 1;
> >  		} else {
> >  			nr_txbbs = mlx4_tx_burst_segs(buf, txq, &ctrl);
> > diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
> > index 463df2b..8207232 100644
> > --- a/drivers/net/mlx4/mlx4_rxtx.h
> > +++ b/drivers/net/mlx4/mlx4_rxtx.h
> > @@ -212,4 +212,37 @@ int mlx4_tx_queue_setup(struct rte_eth_dev *dev, uint16_t idx,
> >  	return mlx4_txq_add_mr(txq, mp, i);
> >  }
> >
> > +/**
> > + * Write Tx data segment to the SQ.
> > + *
> > + * @param dseg
> > + *   Pointer to data segment in SQ.
> > + * @param lkey
> > + *   Memory region lkey.
> > + * @param addr
> > + *   Data address.
> > + * @param byte_count
> > + *   Big Endian bytes count of the data to send.
>
> Big Endian => Big endian
>
> How about using the dedicated type to properly document it?
> See rte_be32_t from rte_byteorder.h.
>
Learned something new, thanks, I will check it.
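For example, the prototype could become something like this sketch
(rte_be32_t from rte_byteorder.h is only a documentation typedef of
uint32_t, so the change is about readability, not behavior):

	#include <rte_byteorder.h>

	/*
	 * Declaring byte_count as rte_be32_t documents at the type level
	 * that callers must pass a big-endian value, e.g.
	 * rte_cpu_to_be_32(buf->data_len).
	 */
	static inline void
	mlx4_fill_tx_data_seg(volatile struct mlx4_wqe_data_seg *dseg,
			      uint32_t lkey, uintptr_t addr,
			      rte_be32_t byte_count);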
> > + */
> > +static inline void
> > +mlx4_fill_tx_data_seg(volatile struct mlx4_wqe_data_seg *dseg,
> > +		      uint32_t lkey, uintptr_t addr, uint32_t byte_count)
> > +{
> > +	dseg->addr = rte_cpu_to_be_64(addr);
> > +	dseg->lkey = rte_cpu_to_be_32(lkey);
> > +#if RTE_CACHE_LINE_SIZE < 64
> > +	/*
> > +	 * Need a barrier here before writing the byte_count
> > +	 * fields to make sure that all the data is visible
> > +	 * before the byte_count field is set.
> > +	 * Otherwise, if the segment begins a new cacheline,
> > +	 * the HCA prefetcher could grab the 64-byte chunk and
> > +	 * get a valid (!= 0xffffffff) byte count but stale
> > +	 * data, and end up sending the wrong data.
> > +	 */
> > +	rte_io_wmb();
> > +#endif /* RTE_CACHE_LINE_SIZE */
> > +	dseg->byte_count = byte_count;
> > +}
> > +
>
> No need to expose this function in a header file. Note that rte_cpu_*()
> and rte_io*() require the inclusion of rte_byteorder.h and rte_atomic.h
> respectively.
>
Shouldn't inline functions be in header files?

> >  #endif /* MLX4_RXTX_H_ */
> > --
> > 1.8.3.1
> >
>
> --
> Adrien Mazarguil
> 6WIND