From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f195.google.com (mail-wr0-f195.google.com [209.85.128.195]) by dpdk.org (Postfix) with ESMTP id 72AE0271 for ; Wed, 6 Dec 2017 12:55:44 +0100 (CET) Received: by mail-wr0-f195.google.com with SMTP id g53so3618025wra.2 for ; Wed, 06 Dec 2017 03:55:44 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=6wind-com.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=wDHUu5PzTMGo51y+DuRzvQym7DjPHtx/xltlawNONdo=; b=QBR6PqjNEnmeIp1zxTvQ14hYpCCympdoae5rljOVwtjq3Lj3G6OLEnwg0QPqr0vf/R hbVI5YmDx2rLJXgRl8+vUDzMbQ49nGkhCleAeW0JKH1M6NRly6OFzxojsZUnzeNe3B+5 SxEtrJCUKMQvCAKzyfSu0jGBeYR1HFJHO5aeUCo/g9nWpG4aVtspm0prQwwsZKMHZaJe CaT+E6LQ6ZCZ+HZFbDgDfY/g932fLTWVg/rOAbvpuKkqjFMx/WiW1rZs5lcuf6KFEhrW yDwJmDL3eXXvXCVKSDxak65SFiaHg0u+ToIFUS3IjiI2hMhcYb17+ucexCI/IERhjQc0 ErIQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=wDHUu5PzTMGo51y+DuRzvQym7DjPHtx/xltlawNONdo=; b=Fdp5gDQ3wqHfdmZcrjHUWUmKvZNDLKjLH/4CR18qKcA+RoV7c07Xzv1jh2ghKClwSp 4g20MNvP2VzkFsfNZ10NTNGQXtPYmf5VnIytGaMjAXaP16kU1q74UDCNxzAD6Iq1Eb5C 9d5zd/4q/Ia32dtNcosw/WVES5nYDcJCetuNmHgOUI1cYo2irPeBjz6OgLvA+kWhgj56 jSelrnBhh/iFPF7ngMNMMCRHxc2FpAjK293u9pokRw6xxgpw3yu2eRpbeg3slU67YiiP I7wnWFTC4aKCJgAdHyAGLpfutARe10+/MoBtGW2VnumRUw7PWyDKBKLhvZCuy2lmCra5 n7SQ== X-Gm-Message-State: AJaThX7bn2QMLp05Qt5bMK8fXGzJwXq6CcxOdPlbGCdeqfmEErQKN7To kJqDGRO+PKK7D0jNKFlrZ6O9LJWV X-Google-Smtp-Source: AGs4zMbqI6tNbb75UdPjiQRApf9dshJ1fnbpqMV+wqjXonZEuxgAQPviB5vFtLPVcKAOYawIx3NbiQ== X-Received: by 10.223.182.19 with SMTP id f19mr18562593wre.81.1512561344182; Wed, 06 Dec 2017 03:55:44 -0800 (PST) Received: from 6wind.com (host.78.145.23.62.rev.coltfrance.com. [62.23.145.78]) by smtp.gmail.com with ESMTPSA id b33sm2553497wrg.48.2017.12.06.03.55.43 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 06 Dec 2017 03:55:43 -0800 (PST) Date: Wed, 6 Dec 2017 12:55:31 +0100 From: Adrien Mazarguil To: Matan Azrad Cc: "dev@dpdk.org" Message-ID: <20171206115531.GA4062@6wind.com> References: <1511871570-16826-1-git-send-email-matan@mellanox.com> <1511871570-16826-5-git-send-email-matan@mellanox.com> <20171206105831.GV4062@6wind.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Subject: Re: [dpdk-dev] [PATCH 4/8] net/mlx4: optimize Tx multi-segment case X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 06 Dec 2017 11:55:44 -0000 On Wed, Dec 06, 2017 at 11:29:38AM +0000, Matan Azrad wrote: > Hi Adrien > > > -----Original Message----- > > From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com] > > Sent: Wednesday, December 6, 2017 12:59 PM > > To: Matan Azrad > > Cc: dev@dpdk.org > > Subject: Re: [PATCH 4/8] net/mlx4: optimize Tx multi-segment case > > > > On Tue, Nov 28, 2017 at 12:19:26PM +0000, Matan Azrad wrote: > > > mlx4 Tx block can handle up to 4 data segments or control segment + up > > > to 3 data segments. The first data segment in each not first Tx block > > > must validate Tx queue wraparound and must use IO memory barrier > > > before writing the byte count. > > > > > > The previous multi-segment code used "for" loop to iterate over all > > > packet segments and separated first Tx block data case by "if" > > > statments. > > > > statments => statements > > > > > > > > Use switch case and unconditional branches instead of "for" loop can > > > optimize the case and prevents the unnecessary checks for each data > > > segment; This hints to compiler to create opitimized jump table. > > > > opitimized => optimized > > > > > > > > Optimize this case by switch case and unconditional branches usage. > > > > > > Signed-off-by: Matan Azrad > > > --- > > > drivers/net/mlx4/mlx4_rxtx.c | 165 > > > +++++++++++++++++++++++++------------------ > > > drivers/net/mlx4/mlx4_rxtx.h | 33 +++++++++ > > > 2 files changed, 128 insertions(+), 70 deletions(-) > > > > > > diff --git a/drivers/net/mlx4/mlx4_rxtx.c > > > + /* > > > + * Fill the data segments with buffer information. > > > + * First WQE TXBB head segment is always control segment, > > > + * so jump to tail TXBB data segments code for the first > > > + * WQE data segments filling. > > > + */ > > > + goto txbb_tail_segs; > > > +txbb_head_seg: > > > > I'm not fundamentally opposed to "goto" unlike a lot of people out there, > > but this doesn't look good. It's OK to use goto for error cases and to extricate > > yourself when trapped in an inner loop, also in some optimization scenarios > > where it sometimes make sense, but not when the same can be achieved > > through standard loop constructs and keywords. > > > > In this case I'm under the impression you could have managed with a do { ... } > > while (...) construct. You need to try harder to reorganize these changes or > > prove it can't be done without negatively impacting performance. > > > > Doing so should make this patch shorter as well. > > > > I noticed this could be done with loop and without unconditional branches, but I checked it and found nice performance improvement using that way. > When I used the loop I degraded performance. OK, you can leave it as is in the meantime then, we'll keep that in mind for the next refactoring iteration. > > > a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h index > > > 463df2b..8207232 100644 > > > --- a/drivers/net/mlx4/mlx4_rxtx.h > > > +++ b/drivers/net/mlx4/mlx4_rxtx.h > > > @@ -212,4 +212,37 @@ int mlx4_tx_queue_setup(struct rte_eth_dev > > *dev, uint16_t idx, > > > return mlx4_txq_add_mr(txq, mp, i); > > > } > > > > > > +/** > > > + * Write Tx data segment to the SQ. > > > + * > > > + * @param dseg > > > + * Pointer to data segment in SQ. > > > + * @param lkey > > > + * Memory region lkey. > > > + * @param addr > > > + * Data address. > > > + * @param byte_count > > > + * Big Endian bytes count of the data to send. > > > > Big Endian => Big endian > > > > How about using the dedicated type to properly document it? > > See rte_be32_t from rte_byteorder.h. > > > Learned new something, thanks, will check it. > > > > + */ > > > +static inline void > > > +mlx4_fill_tx_data_seg(volatile struct mlx4_wqe_data_seg *dseg, > > > + uint32_t lkey, uintptr_t addr, uint32_t byte_count) { > > > + dseg->addr = rte_cpu_to_be_64(addr); > > > + dseg->lkey = rte_cpu_to_be_32(lkey); #if RTE_CACHE_LINE_SIZE < > > 64 > > > + /* > > > + * Need a barrier here before writing the byte_count > > > + * fields to make sure that all the data is visible > > > + * before the byte_count field is set. > > > + * Otherwise, if the segment begins a new cacheline, > > > + * the HCA prefetcher could grab the 64-byte chunk and > > > + * get a valid (!= 0xffffffff) byte count but stale > > > + * data, and end up sending the wrong data. > > > + */ > > > + rte_io_wmb(); > > > +#endif /* RTE_CACHE_LINE_SIZE */ > > > + dseg->byte_count = byte_count; > > > +} > > > + > > > > No need to expose this function in a header file. Note that rte_cpu_*() and > > rte_io*() require the inclusion of rte_byteorder.h and rte_atomic.h > > respectively. > > > > Shouldn't inline functions be in header files? Not necessarily, like other function types actually. They have to be declared where needed, in a header file only if several files depend on them, otherwise they bring namespace pollution for no apparent reason. -- Adrien Mazarguil 6WIND