From: Matan Azrad <matan@mellanox.com>
To: Adrien Mazarguil <adrien.mazarguil@6wind.com>
CC: dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH 4/8] net/mlx4: optimize Tx multi-segment case
Date: Wed, 6 Dec 2017 11:29:38 +0000
Message-ID: <HE1PR0502MB3659@eurprd05.prod.outlook.com>
References: <1511871570-16826-1-git-send-email-matan@mellanox.com>
 <1511871570-16826-5-git-send-email-matan@mellanox.com>
 <20171206105831.GV4062@6wind.com>
In-Reply-To: <20171206105831.GV4062@6wind.com>
Hi Adrien

> -----Original Message-----
> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> Sent: Wednesday, December 6, 2017 12:59 PM
> To: Matan Azrad
> Cc: dev@dpdk.org
> Subject: Re: [PATCH 4/8] net/mlx4: optimize Tx multi-segment case
>
> On Tue, Nov 28, 2017 at 12:19:26PM +0000, Matan Azrad wrote:
> > mlx4 Tx block can handle up to 4 data segments or control segment + up
> > to 3 data segments. The first data segment in each not first Tx block
> > must validate Tx queue wraparound and must use IO memory barrier
> > before writing the byte count.
> >
> > The previous multi-segment code used "for" loop to iterate over all
> > packet segments and separated first Tx block data case by "if"
> > statments.
>
> statments => statements
>
> >
> > Use switch case and unconditional branches instead of "for" loop can
> > optimize the case and prevents the unnecessary checks for each data
> > segment; This hints to compiler to create opitimized jump table.
>
> opitimized => optimized
>
> >
> > Optimize this case by switch case and unconditional branches usage.
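Just to give some context for the numbers in the commit message: a Tx
basic block (TXBB) is 64 bytes, while control and data segments are 16
bytes each, which is where "4 data segments or control segment + up to
3 data segments" per Tx block comes from. A rough sketch of the size
arithmetic (illustrative names, not the driver's exact macros):

	#define TXBB_SIZE	64	/* bytes per Tx basic block */
	#define SEG_SIZE	16	/* control and data segments alike */

	/*
	 * Round a WQE size in bytes up to whole TXBBs, similar to what
	 * MLX4_SIZE_TO_TXBBS() does in the driver.
	 */
	static inline int
	size_to_txbbs(int wqe_size)
	{
		return (wqe_size + TXBB_SIZE - 1) / TXBB_SIZE;
	}

E.g. a 9-segment packet needs SEG_SIZE * (1 + 9) = 160 bytes, which
rounds up to 3 TXBBs: control segment + 3 data segments, then 4, then 2.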
> >
> > Signed-off-by: Matan Azrad
> > ---
> >  drivers/net/mlx4/mlx4_rxtx.c | 165 +++++++++++++++++++++++++------------------
> >  drivers/net/mlx4/mlx4_rxtx.h |  33 +++++++++
> >  2 files changed, 128 insertions(+), 70 deletions(-)
> >
> > diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
> > index 1d8240a..b9cb2fc 100644
> > --- a/drivers/net/mlx4/mlx4_rxtx.c
> > +++ b/drivers/net/mlx4/mlx4_rxtx.c
> > @@ -432,15 +432,14 @@ struct pv {
> >  	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
> >  	volatile struct mlx4_wqe_ctrl_seg *ctrl;
> >  	volatile struct mlx4_wqe_data_seg *dseg;
> > -	struct rte_mbuf *sbuf;
> > +	struct rte_mbuf *sbuf = buf;
> >  	uint32_t lkey;
> > -	uintptr_t addr;
> > -	uint32_t byte_count;
> >  	int pv_counter = 0;
> > +	int nb_segs = buf->nb_segs;
> >
> >  	/* Calculate the needed work queue entry size for this packet. */
> >  	wqe_real_size = sizeof(volatile struct mlx4_wqe_ctrl_seg) +
> > -		buf->nb_segs * sizeof(volatile struct mlx4_wqe_data_seg);
> > +		nb_segs * sizeof(volatile struct mlx4_wqe_data_seg);
> >  	nr_txbbs = MLX4_SIZE_TO_TXBBS(wqe_real_size);
> >  	/*
> >  	 * Check that there is room for this WQE in the send queue and that
> > @@ -457,67 +456,99 @@ struct pv {
> >  	dseg = (volatile struct mlx4_wqe_data_seg *)
> >  		((uintptr_t)ctrl + sizeof(struct mlx4_wqe_ctrl_seg));
> >  	*pctrl = ctrl;
> > -	/* Fill the data segments with buffer information. */
> > -	for (sbuf = buf; sbuf != NULL; sbuf = sbuf->next, dseg++) {
> > -		addr = rte_pktmbuf_mtod(sbuf, uintptr_t);
> > -		rte_prefetch0((volatile void *)addr);
> > -		/* Memory region key (big endian) for this memory pool. */
> > +	/*
> > +	 * Fill the data segments with buffer information.
> > +	 * First WQE TXBB head segment is always control segment,
> > +	 * so jump to tail TXBB data segments code for the first
> > +	 * WQE data segments filling.
> > +	 */
> > +	goto txbb_tail_segs;
> > +txbb_head_seg:
>
> I'm not fundamentally opposed to "goto" unlike a lot of people out there,
> but this doesn't look good. It's OK to use goto for error cases and to
> extricate yourself when trapped in an inner loop, also in some
> optimization scenarios where it sometimes makes sense, but not when the
> same can be achieved through standard loop constructs and keywords.
>
> In this case I'm under the impression you could have managed with a
> do { ... } while (...) construct. You need to try harder to reorganize
> these changes or prove it can't be done without negatively impacting
> performance.
>
> Doing so should make this patch shorter as well.
>

I noticed this could be done with a loop and without unconditional
branches, but I checked both: this version showed a nice performance
improvement, while the loop version degraded performance.
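For reference, the loop-based variant I benchmarked looked roughly like
the sketch below (simplified, with hypothetical fill_head_seg() and
fill_tail_seg() helpers standing in for the MR lookup and data segment
writes; not the actual driver code):

	#include <stdbool.h>
	#include <rte_mbuf.h>

	/* Hypothetical helpers: MR lookup + data segment write. */
	static void fill_head_seg(struct rte_mbuf *sbuf);
	static void fill_tail_seg(struct rte_mbuf *sbuf);

	static void
	fill_wqe_data_segs(struct rte_mbuf *sbuf, int nb_segs)
	{
		bool first_txbb = true;

		do {
			if (!first_txbb) {
				/*
				 * Head slot of a new TXBB: the byte_count
				 * write must be postponed.
				 */
				fill_head_seg(sbuf);
				sbuf = sbuf->next;
				nb_segs--;
			}
			first_txbb = false;
			/* Up to three data segments fit in the TXBB tail. */
			switch (nb_segs) {
			default:
				fill_tail_seg(sbuf);
				sbuf = sbuf->next;
				nb_segs--;
				/* fallthrough */
			case 2:
				fill_tail_seg(sbuf);
				sbuf = sbuf->next;
				nb_segs--;
				/* fallthrough */
			case 1:
				fill_tail_seg(sbuf);
				sbuf = sbuf->next;
				nb_segs--;
				/* fallthrough */
			case 0:
				break;
			}
		} while (nb_segs > 0);
	}

The extra first_txbb branch and the loop-back check on every iteration
are what the goto version avoids, and presumably where the difference
in my measurements came from.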
> > +	/* Memory region key (big endian) for this memory pool. */
> > +	lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
> > +	if (unlikely(lkey == (uint32_t)-1)) {
> > +		DEBUG("%p: unable to get MP <-> MR association",
> > +		      (void *)txq);
> > +		return -1;
> > +	}
> > +	/* Handle WQE wraparound. */
> > +	if (dseg >=
> > +	    (volatile struct mlx4_wqe_data_seg *)sq->eob)
> > +		dseg = (volatile struct mlx4_wqe_data_seg *)
> > +			sq->buf;
> > +	dseg->addr = rte_cpu_to_be_64(rte_pktmbuf_mtod(sbuf, uintptr_t));
> > +	dseg->lkey = rte_cpu_to_be_32(lkey);
> > +	/*
> > +	 * This data segment starts at the beginning of a new
> > +	 * TXBB, so we need to postpone its byte_count writing
> > +	 * for later.
> > +	 */
> > +	pv[pv_counter].dseg = dseg;
> > +	/*
> > +	 * Zero length segment is treated as inline segment
> > +	 * with zero data.
> > +	 */
> > +	pv[pv_counter++].val = rte_cpu_to_be_32(sbuf->data_len ?
> > +			sbuf->data_len : 0x80000000);
> > +	sbuf = sbuf->next;
> > +	dseg++;
> > +	nb_segs--;
> > +txbb_tail_segs:
> > +	/* Jump to default if there are more than two segments remaining. */
> > +	switch (nb_segs) {
> > +	default:
> >  		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
> > -		dseg->lkey = rte_cpu_to_be_32(lkey);
> > -		/* Calculate the needed work queue entry size for this packet */
> > -		if (unlikely(lkey == rte_cpu_to_be_32((uint32_t)-1))) {
> > -			/* MR does not exist. */
> > +		if (unlikely(lkey == (uint32_t)-1)) {
> >  			DEBUG("%p: unable to get MP <-> MR association",
> >  			      (void *)txq);
> >  			return -1;
> >  		}
> > -		if (likely(sbuf->data_len)) {
> > -			byte_count = rte_cpu_to_be_32(sbuf->data_len);
> > -		} else {
> > -			/*
> > -			 * Zero length segment is treated as inline segment
> > -			 * with zero data.
> > -			 */
> > -			byte_count = RTE_BE32(0x80000000);
> > +		mlx4_fill_tx_data_seg(dseg, lkey,
> > +				      rte_pktmbuf_mtod(sbuf, uintptr_t),
> > +				      rte_cpu_to_be_32(sbuf->data_len ?
> > +						       sbuf->data_len :
> > +						       0x80000000));
> > +		sbuf = sbuf->next;
> > +		dseg++;
> > +		nb_segs--;
> > +		/* fallthrough */
> > +	case 2:
> > +		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
> > +		if (unlikely(lkey == (uint32_t)-1)) {
> > +			DEBUG("%p: unable to get MP <-> MR association",
> > +			      (void *)txq);
> > +			return -1;
> >  		}
> > -		/*
> > -		 * If the data segment is not at the beginning of a
> > -		 * Tx basic block (TXBB) then write the byte count,
> > -		 * else postpone the writing to just before updating the
> > -		 * control segment.
> > -		 */
> > -		if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
> > -			dseg->addr = rte_cpu_to_be_64(addr);
> > -			dseg->lkey = rte_cpu_to_be_32(lkey);
> > -#if RTE_CACHE_LINE_SIZE < 64
> > -			/*
> > -			 * Need a barrier here before writing the byte_count
> > -			 * fields to make sure that all the data is visible
> > -			 * before the byte_count field is set.
> > -			 * Otherwise, if the segment begins a new cacheline,
> > -			 * the HCA prefetcher could grab the 64-byte chunk and
> > -			 * get a valid (!= 0xffffffff) byte count but stale
> > -			 * data, and end up sending the wrong data.
> > -			 */
> > -			rte_io_wmb();
> > -#endif /* RTE_CACHE_LINE_SIZE */
> > -			dseg->byte_count = byte_count;
> > -		} else {
> > -			/*
> > -			 * This data segment starts at the beginning of a new
> > -			 * TXBB, so we need to postpone its byte_count writing
> > -			 * for later.
> > -			 */
> > -			/* Handle WQE wraparound. */
> > -			if (dseg >=
> > -			    (volatile struct mlx4_wqe_data_seg *)sq->eob)
> > -				dseg = (volatile struct mlx4_wqe_data_seg *)
> > -					sq->buf;
> > -			dseg->addr = rte_cpu_to_be_64(addr);
> > -			dseg->lkey = rte_cpu_to_be_32(lkey);
> > -			pv[pv_counter].dseg = dseg;
> > -			pv[pv_counter++].val = byte_count;
> > +		mlx4_fill_tx_data_seg(dseg, lkey,
> > +				      rte_pktmbuf_mtod(sbuf, uintptr_t),
> > +				      rte_cpu_to_be_32(sbuf->data_len ?
> > +						       sbuf->data_len :
> > +						       0x80000000));
> > +		sbuf = sbuf->next;
> > +		dseg++;
> > +		nb_segs--;
> > +		/* fallthrough */
> > +	case 1:
> > +		lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
> > +		if (unlikely(lkey == (uint32_t)-1)) {
> > +			DEBUG("%p: unable to get MP <-> MR association",
> > +			      (void *)txq);
> > +			return -1;
> > +		}
> > +		mlx4_fill_tx_data_seg(dseg, lkey,
> > +				      rte_pktmbuf_mtod(sbuf, uintptr_t),
> > +				      rte_cpu_to_be_32(sbuf->data_len ?
> > +						       sbuf->data_len :
> > +						       0x80000000));
> > +		nb_segs--;
> > +		if (nb_segs) {
> > +			sbuf = sbuf->next;
> > +			dseg++;
> > +			goto txbb_head_seg;
> >  		}
> > +		/* fallthrough */
> > +	case 0:
> > +		break;
> >  	}
>
> I think this "switch (nb_segs)" idea is an interesting approach, but
> should occur inside a loop construct as previously described.
>
Same comment as above.

> >  	/* Write the first DWORD of each TXBB save earlier. */
> >  	if (pv_counter) {
> > @@ -583,7 +614,6 @@ struct pv {
> >  	} srcrb;
> >  	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
> >  	uint32_t lkey;
> > -	uintptr_t addr;
> >
> >  	/* Clean up old buffer. */
> >  	if (likely(elt->buf != NULL)) {
> > @@ -618,24 +648,19 @@ struct pv {
> >  			dseg = (volatile struct mlx4_wqe_data_seg *)
> >  				((uintptr_t)ctrl +
> >  				sizeof(struct mlx4_wqe_ctrl_seg));
> > -			addr = rte_pktmbuf_mtod(buf, uintptr_t);
> > -			rte_prefetch0((volatile void *)addr);
> > -			dseg->addr = rte_cpu_to_be_64(addr);
> > -			/* Memory region key (big endian). */
> > +
> > +			ctrl->fence_size = (WQE_ONE_DATA_SEG_SIZE >> 4) & 0x3f;
> >  			lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
> > -			dseg->lkey = rte_cpu_to_be_32(lkey);
> > -			if (unlikely(dseg->lkey ==
> > -				rte_cpu_to_be_32((uint32_t)-1))) {
> > +			if (unlikely(lkey == (uint32_t)-1)) {
> >  				/* MR does not exist. */
> >  				DEBUG("%p: unable to get MP <-> MR association",
> >  				      (void *)txq);
> >  				elt->buf = NULL;
> >  				break;
> >  			}
> > -			/* Never be TXBB aligned, no need compiler barrier. */
> > -			dseg->byte_count = rte_cpu_to_be_32(buf->data_len);
> > -			/* Fill the control parameters for this packet. */
> > -			ctrl->fence_size = (WQE_ONE_DATA_SEG_SIZE >> 4) & 0x3f;
> > +			mlx4_fill_tx_data_seg(dseg, lkey,
> > +					      rte_pktmbuf_mtod(buf, uintptr_t),
> > +					      rte_cpu_to_be_32(buf->data_len));
> >  			nr_txbbs = 1;
> >  		} else {
> >  			nr_txbbs = mlx4_tx_burst_segs(buf, txq, &ctrl);
> > diff --git a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h
> > index 463df2b..8207232 100644
> > --- a/drivers/net/mlx4/mlx4_rxtx.h
> > +++ b/drivers/net/mlx4/mlx4_rxtx.h
> > @@ -212,4 +212,37 @@ int mlx4_tx_queue_setup(struct rte_eth_dev *dev, uint16_t idx,
> >  	return mlx4_txq_add_mr(txq, mp, i);
> >  }
> >
> > +/**
> > + * Write Tx data segment to the SQ.
> > + *
> > + * @param dseg
> > + *   Pointer to data segment in SQ.
> > + * @param lkey
> > + *   Memory region lkey.
> > + * @param addr
> > + *   Data address.
> > + * @param byte_count
> > + *   Big Endian bytes count of the data to send.
>
> Big Endian => Big endian
>
> How about using the dedicated type to properly document it?
> See rte_be32_t from rte_byteorder.h.
>
Learned something new, thanks, I will check it.
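For example, the prototype could become something like this sketch
(rte_be32_t from rte_byteorder.h is only a documentation typedef of
uint32_t, so the change is about readability, not behavior):

	#include <rte_byteorder.h>

	/*
	 * Declaring byte_count as rte_be32_t documents at the type level
	 * that callers must pass a big-endian value, e.g.
	 * rte_cpu_to_be_32(buf->data_len).
	 */
	static inline void
	mlx4_fill_tx_data_seg(volatile struct mlx4_wqe_data_seg *dseg,
			      uint32_t lkey, uintptr_t addr,
			      rte_be32_t byte_count);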
> > + */
> > +static inline void
> > +mlx4_fill_tx_data_seg(volatile struct mlx4_wqe_data_seg *dseg,
> > +		      uint32_t lkey, uintptr_t addr, uint32_t byte_count)
> > +{
> > +	dseg->addr = rte_cpu_to_be_64(addr);
> > +	dseg->lkey = rte_cpu_to_be_32(lkey);
> > +#if RTE_CACHE_LINE_SIZE < 64
> > +	/*
> > +	 * Need a barrier here before writing the byte_count
> > +	 * fields to make sure that all the data is visible
> > +	 * before the byte_count field is set.
> > +	 * Otherwise, if the segment begins a new cacheline,
> > +	 * the HCA prefetcher could grab the 64-byte chunk and
> > +	 * get a valid (!= 0xffffffff) byte count but stale
> > +	 * data, and end up sending the wrong data.
> > +	 */
> > +	rte_io_wmb();
> > +#endif /* RTE_CACHE_LINE_SIZE */
> > +	dseg->byte_count = byte_count;
> > +}
> > +
>
> No need to expose this function in a header file. Note that rte_cpu_*()
> and rte_io*() require the inclusion of rte_byteorder.h and rte_atomic.h
> respectively.
>
Shouldn't inline functions be in header files?

> >  #endif /* MLX4_RXTX_H_ */
> > --
> > 1.8.3.1
> >
>
> --
> Adrien Mazarguil
> 6WIND