From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f68.google.com (mail-wm0-f68.google.com [74.125.82.68]) by dpdk.org (Postfix) with ESMTP id 568621B2D0 for ; Thu, 26 Oct 2017 09:49:06 +0200 (CEST) Received: by mail-wm0-f68.google.com with SMTP id r68so6307995wmr.3 for ; Thu, 26 Oct 2017 00:49:06 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=6wind-com.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=gG81MfuDOLD2l7iYglzsumDbtA5K7iX/QZG4u5qmeUI=; b=IUj70HdVQ4kROOxznkfDlddXNzZlCP5Z9kZIOrryqgrJonAjr1mJ4AxW7Cjm1u2OsO mIus3fQBJkfN8a7VbkfJLlWF5bMhjqjmhYrEIXpyKdCHU4W6JZbprwe8yi/Ou9KCYSt2 zcAMGsvkQfQOCQrPPNuJEeE+HzkptzokJD+X14jc1tNadqicqWeTl0y8HlIBYaD0L2x4 pkKTRBt1sCteCChUe02JrMsdMDfFtMqOv/UOifj+kfYxBrFkcopODDxMUXhU2/KrmcYs DKyuBPiv4r+b91tMB9vEZhURdeN4gBoZMNoYz9Rw9nzM1v35rGKTOlQSparqnEWa5v09 kHng== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=gG81MfuDOLD2l7iYglzsumDbtA5K7iX/QZG4u5qmeUI=; b=hVd9EinY0dlbVMojJWKfr8O/h9BzDfjCmIRUq3u/IrZQI2+rKfSteZaU72o/peVKS3 aVPfstji4LJq1U9HzmLWxEXWTcNBg9mSnpHMa6lH1MMSg2aINe4RhE/qqk5KgBYpieHL Bv5Hx+4Wg/baYVrtXea/ifpuErb7RGGJ+eXSe0JGKjIfxaufcLdZq94PTSj8TiYT4siC pVCE3q8W4ZzDcoJlk8XlPe8DvXx6MKCJ4nACYC9lJ2LDXCwXE5ECxFoFm8Bx+vBKduuP 6BA2Zk+jlpiaz+Vh2BupxIsHNJsLsBWcF7t+rx3XCh4JdwYMxBXrw/a/S6iWgeTmkoTM DLEw== X-Gm-Message-State: AMCzsaU8treEtLWTgw6l3K8ZWFtQxBNorLVea2+yIj3Uu1ZZGW1QKaTm wZ5LaCsqOMoQxyEU8p/J9jc0ZQ== X-Google-Smtp-Source: ABhQp+Q8kH7WddsOEaGPf7KWiB+E0AV8YgmY2rlYfH/5aFk+4flSVZXi08bUf2R7aE3HsHbL3fchlA== X-Received: by 10.80.201.12 with SMTP id o12mr26226025edh.98.1509004145693; Thu, 26 Oct 2017 00:49:05 -0700 (PDT) Received: from 6wind.com (host.78.145.23.62.rev.coltfrance.com. [62.23.145.78]) by smtp.gmail.com with ESMTPSA id f39sm3528293edf.83.2017.10.26.00.49.04 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 26 Oct 2017 00:49:04 -0700 (PDT) Date: Thu, 26 Oct 2017 09:48:53 +0200 From: Adrien Mazarguil To: Ophir Munk Cc: "dev@dpdk.org" , Thomas Monjalon , Olga Shern , Matan Azrad Message-ID: <20171026074853.GL26782@6wind.com> References: <1508752838-30408-1-git-send-email-ophirmu@mellanox.com> <1508768520-4810-1-git-send-email-ophirmu@mellanox.com> <1508768520-4810-3-git-send-email-ophirmu@mellanox.com> <20171025164938.GH26782@6wind.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Subject: Re: [dpdk-dev] [PATCH v2 2/7] net/mlx4: inline more Tx functions X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 26 Oct 2017 07:49:06 -0000 Hi Ophir, Please see below. On Wed, Oct 25, 2017 at 09:42:46PM +0000, Ophir Munk wrote: > Hi Adrien, > > On Wednesday, October 25, 2017 7:50 PM, Adrien Mazarguil wrote: > > > > Hi Ophir, > > > > On Mon, Oct 23, 2017 at 02:21:55PM +0000, Ophir Munk wrote: > > > Change functions to inline on Tx fast path to improve performance > > > > > > Inside the inline function call other functions to handle "unlikely" > > > cases such that the inline function code footprint is small. > > > > > > Signed-off-by: Ophir Munk > > > > Reading this, it's like adding __rte_always_inline improves performance at > > all, which I doubt unless you can show proof through performance results. > > > > When in doubt, leave it to the compiler, the static keyword is usually enough > > of a hint. Too much forced inlining may actually be harmful. > > > > What this patch really does is splitting the heavy lookup/registration function > > in two halves with one small static inline function for the lookup part that > > calls the separate registration part in the unlikely event MR is not already > > registered. > > > > Thankfully the compiler doesn't inline the large registration function back, > > which results in the perceived performance improvement for the time being, > > however there is no guarantee it won't happen in the future (you didn't use > > the noinline keyword on the registration function for that). > > > > Therefore I have a bunch of comments and suggestions, see below. > > > > > --- > > > drivers/net/mlx4/mlx4_rxtx.c | 43 > > > ++++++------------------------------ > > > drivers/net/mlx4/mlx4_rxtx.h | 52 > > > +++++++++++++++++++++++++++++++++++++++++++- > > > 2 files changed, 58 insertions(+), 37 deletions(-) > > > > > > diff --git a/drivers/net/mlx4/mlx4_rxtx.c > > > b/drivers/net/mlx4/mlx4_rxtx.c index 011ea79..ae37f9b 100644 > > > --- a/drivers/net/mlx4/mlx4_rxtx.c > > > +++ b/drivers/net/mlx4/mlx4_rxtx.c > > > @@ -220,54 +220,25 @@ mlx4_txq_complete(struct txq *txq) > > > return 0; > > > } > > > > > > -/** > > > - * Get memory pool (MP) from mbuf. If mbuf is indirect, the pool from > > > which > > > - * the cloned mbuf is allocated is returned instead. > > > - * > > > - * @param buf > > > - * Pointer to mbuf. > > > - * > > > - * @return > > > - * Memory pool where data is located for given mbuf. > > > - */ > > > -static struct rte_mempool * > > > -mlx4_txq_mb2mp(struct rte_mbuf *buf) > > > -{ > > > - if (unlikely(RTE_MBUF_INDIRECT(buf))) > > > - return rte_mbuf_from_indirect(buf)->pool; > > > - return buf->pool; > > > -} > > > > > > /** > > > - * Get memory region (MR) <-> memory pool (MP) association from txq- > > >mp2mr[]. > > > - * Add MP to txq->mp2mr[] if it's not registered yet. If mp2mr[] is > > > full, > > > - * remove an entry first. > > > + * Add memory region (MR) <-> memory pool (MP) association to txq- > > >mp2mr[]. > > > + * If mp2mr[] is full, remove an entry first. > > > * > > > * @param txq > > > * Pointer to Tx queue structure. > > > * @param[in] mp > > > - * Memory pool for which a memory region lkey must be returned. > > > + * Memory pool for which a memory region lkey must be added > > > + * @param[in] i > > > + * Index in memory pool (MP) where to add memory region (MR) > > > * > > > * @return > > > - * mr->lkey on success, (uint32_t)-1 on failure. > > > + * Added mr->lkey on success, (uint32_t)-1 on failure. > > > */ > > > -uint32_t > > > -mlx4_txq_mp2mr(struct txq *txq, struct rte_mempool *mp) > > > +uint32_t mlx4_txq_add_mr(struct txq *txq, struct rte_mempool *mp, > > > +uint32_t i) > > > { > > > - unsigned int i; > > > struct ibv_mr *mr; > > > > > > - for (i = 0; (i != RTE_DIM(txq->mp2mr)); ++i) { > > > - if (unlikely(txq->mp2mr[i].mp == NULL)) { > > > - /* Unknown MP, add a new MR for it. */ > > > - break; > > > - } > > > - if (txq->mp2mr[i].mp == mp) { > > > - assert(txq->mp2mr[i].lkey != (uint32_t)-1); > > > - assert(txq->mp2mr[i].mr->lkey == txq- > > >mp2mr[i].lkey); > > > - return txq->mp2mr[i].lkey; > > > - } > > > - } > > > /* Add a new entry, register MR first. */ > > > DEBUG("%p: discovered new memory pool \"%s\" (%p)", > > > (void *)txq, mp->name, (void *)mp); diff --git > > > a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h index > > > e10bbca..719ef45 100644 > > > --- a/drivers/net/mlx4/mlx4_rxtx.h > > > +++ b/drivers/net/mlx4/mlx4_rxtx.h > > > @@ -53,6 +53,7 @@ > > > > > > #include "mlx4.h" > > > #include "mlx4_prm.h" > > > +#include "mlx4_utils.h" > > > > Why? > > > > > > > > /** Rx queue counters. */ > > > struct mlx4_rxq_stats { > > > @@ -160,7 +161,6 @@ void mlx4_rx_queue_release(void *dpdk_rxq); > > > > > > /* mlx4_rxtx.c */ > > > > > > -uint32_t mlx4_txq_mp2mr(struct txq *txq, struct rte_mempool *mp); > > > uint16_t mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, > > > uint16_t pkts_n); > > > uint16_t mlx4_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, @@ > > > -169,6 +169,8 @@ uint16_t mlx4_tx_burst_removed(void *dpdk_txq, > > struct rte_mbuf **pkts, > > > uint16_t pkts_n); > > > uint16_t mlx4_rx_burst_removed(void *dpdk_rxq, struct rte_mbuf **pkts, > > > uint16_t pkts_n); > > > +uint32_t mlx4_txq_add_mr(struct txq *txq, struct rte_mempool *mp, > > > + unsigned int i); > > > > > > /* mlx4_txq.c */ > > > > > > @@ -177,4 +179,52 @@ int mlx4_tx_queue_setup(struct rte_eth_dev > > *dev, uint16_t idx, > > > const struct rte_eth_txconf *conf); void > > > mlx4_tx_queue_release(void *dpdk_txq); > > > > > > +/** > > > + * Get memory pool (MP) from mbuf. If mbuf is indirect, the pool from > > > +which > > > + * the cloned mbuf is allocated is returned instead. > > > + * > > > + * @param buf > > > + * Pointer to mbuf. > > > + * > > > + * @return > > > + * Memory pool where data is located for given mbuf. > > > + */ > > > +static __rte_always_inline struct rte_mempool * mlx4_txq_mb2mp(struct > > > +rte_mbuf *buf) { > > > + if (unlikely(RTE_MBUF_INDIRECT(buf))) > > > + return rte_mbuf_from_indirect(buf)->pool; > > > + return buf->pool; > > > +} > > > + > > > +/** > > > + * Get memory region (MR) <-> memory pool (MP) association from txq- > > >mp2mr[]. > > > + * Call mlx4_txq_add_mr() if MP is not registered yet. > > > + * > > > + * @param txq > > > + * Pointer to Tx queue structure. > > > + * @param[in] mp > > > + * Memory pool for which a memory region lkey must be returned. > > > + * > > > + * @return > > > + * mr->lkey on success, (uint32_t)-1 on failure. > > > + */ > > > +static __rte_always_inline uint32_t > > > > Note __rte_always_inline is defined in rte_common.h and should be > > explicitly included (however don't do that, see below). > > > > > +mlx4_txq_mp2mr(struct txq *txq, struct rte_mempool *mp) { > > > + unsigned int i; > > > + > > > + for (i = 0; (i != RTE_DIM(txq->mp2mr)); ++i) { > > > + if (unlikely(txq->mp2mr[i].mp == NULL)) { > > > + /* Unknown MP, add a new MR for it. */ > > > + break; > > > + } > > > + if (txq->mp2mr[i].mp == mp) { > > > + assert(txq->mp2mr[i].lkey != (uint32_t)-1); > > > + assert(txq->mp2mr[i].mr->lkey == txq- > > >mp2mr[i].lkey); > > > > assert() requires assert.h (but don't include it, see subsequent suggestion). > > > > > + return txq->mp2mr[i].lkey; > > > + } > > > + } > > > + return mlx4_txq_add_mr(txq, mp, i); > > > +} > > > #endif /* MLX4_RXTX_H_ */ > > > > So as described above, these functions do not need the __rte_always_inline, > > please remove it. They also do not need to be located in a header file; the > > reason it's the case for their mlx5 counterparts is that they have to be shared > > between vectorized/non-vectorized code. No such requirement here, you > > should move them back to their original spot. > > > > Static function mlx4_txq_mp2mr() must be in a header file because it is shared by 2 files: mlx4_txq.c and mlx4_rxtx.c. > It is not related to vectorized/non-vectorized code in mlx5. > Having said that -__rte_always_inline is required as well otherwise compilation fails with > drivers/net/mlx4/mlx4_rxtx.h:200:1: error: 'mlx4_txq_mp2mr' defined but not used [-Werror=unused-function] > for files which include mlx4_rxtx.h All right, then what you were looking or was static inline, not *force* inline. The former is a hint, the latter doesn't leave much of a choice to the compiler, it means you're sure this way brings the most performance, however for this patch I really think inlining plays a really minor part (even changes anything at all) compared to dividing this function, which is the real performance improvement. > > My suggestion for this performance improvement is to move > > mlx4_txq_add_mr() to a different file, mlx4_mr.c looks like a good > > candidate. This fact will ensure it's never inlined and far away from the data > > path. > > > > Function mlx4_txq_add_mr() is relatively small. > What do you say about preceding it with __attribute((noinline)) instead of creating a new file? What I mean is you should declare mlx4_txq_add_mr() which does the heavy lifting inside mlx4_mr.c and provide its definition in mlx4.h instead of mlx4_rxtx.h. Then, mlx4_txq_mp2mr() can remain defined in mlx4_rxtx.c in its original spot as a non-static function with its public declaration remaining in mlx4_rxtx.h for users outside of this file. The fact mlx4_txq_mp2mr() remains defined in that file *before* mlx4_post_send()'s definition where it's needed allows the compiler to optimize it away as if it was static inline thanks to -O3, that is, unless it thinks doing so would hurt performance, but as a (now) small function this shouldn't be an issue. Other reasons includes that doing so would make a smaller diff that focuses on the performance improvement itself. The extra performance brought by a statically inlined version of mlx4_txq_mp2mr() is not needed in mlx4_txq.c, whose only purpose is to set up queues. -- Adrien Mazarguil 6WIND