From: Adrien Mazarguil <adrien.mazarguil@6wind.com>
To: Ophir Munk <ophirmu@mellanox.com>
Cc: "dev@dpdk.org" <dev@dpdk.org>,
	Thomas Monjalon <thomas@monjalon.net>,
	Olga Shern <olgas@mellanox.com>, Matan Azrad <matan@mellanox.com>
Subject: Re: [dpdk-dev] [PATCH v2 2/7] net/mlx4: inline more Tx functions
Date: Thu, 26 Oct 2017 09:48:53 +0200
Message-ID: <20171026074853.GL26782@6wind.com>
In-Reply-To: <DB5PR05MB1254024B3918471086AC914DD1440@DB5PR05MB1254.eurprd05.prod.outlook.com>

Hi Ophir,

Please see below.

On Wed, Oct 25, 2017 at 09:42:46PM +0000, Ophir Munk wrote:
> Hi Adrien,
> 
> On Wednesday, October 25, 2017 7:50 PM, Adrien Mazarguil wrote:
> > 
> > Hi Ophir,
> > 
> > On Mon, Oct 23, 2017 at 02:21:55PM +0000, Ophir Munk wrote:
> > > Change functions to inline on Tx fast path to improve performance
> > >
> > > Inside the inline function call other functions to handle "unlikely"
> > > cases such that the inline function code footprint is small.
> > >
> > > Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
> > 
> > Reading this, it's like adding __rte_always_inline improves performance at
> > all, which I doubt unless you can show proof through performance results.
> > 
> > When in doubt, leave it to the compiler, the static keyword is usually enough
> > of a hint. Too much forced inlining may actually be harmful.
> > 
> > What this patch really does is splitting the heavy lookup/registration function
> > in two halves with one small static inline function for the lookup part that
> > calls the separate registration part in the unlikely event MR is not already
> > registered.
> > 
> > Thankfully the compiler doesn't inline the large registration function back,
> > which results in the perceived performance improvement for the time being,
> > however there is no guarantee it won't happen in the future (you didn't use
> > the noinline keyword on the registration function for that).
> > 
> > Therefore I have a bunch of comments and suggestions, see below.
> > 
> > > ---
> > >  drivers/net/mlx4/mlx4_rxtx.c | 43
> > > ++++++------------------------------
> > >  drivers/net/mlx4/mlx4_rxtx.h | 52
> > > +++++++++++++++++++++++++++++++++++++++++++-
> > >  2 files changed, 58 insertions(+), 37 deletions(-)
> > >
> > > diff --git a/drivers/net/mlx4/mlx4_rxtx.c
> > > b/drivers/net/mlx4/mlx4_rxtx.c index 011ea79..ae37f9b 100644
> > > --- a/drivers/net/mlx4/mlx4_rxtx.c
> > > +++ b/drivers/net/mlx4/mlx4_rxtx.c
> > > @@ -220,54 +220,25 @@ mlx4_txq_complete(struct txq *txq)
> > >  	return 0;
> > >  }
> > >
> > > -/**
> > > - * Get memory pool (MP) from mbuf. If mbuf is indirect, the pool from
> > > which
> > > - * the cloned mbuf is allocated is returned instead.
> > > - *
> > > - * @param buf
> > > - *   Pointer to mbuf.
> > > - *
> > > - * @return
> > > - *   Memory pool where data is located for given mbuf.
> > > - */
> > > -static struct rte_mempool *
> > > -mlx4_txq_mb2mp(struct rte_mbuf *buf)
> > > -{
> > > -	if (unlikely(RTE_MBUF_INDIRECT(buf)))
> > > -		return rte_mbuf_from_indirect(buf)->pool;
> > > -	return buf->pool;
> > > -}
> > >
> > >  /**
> > > - * Get memory region (MR) <-> memory pool (MP) association from txq-
> > >mp2mr[].
> > > - * Add MP to txq->mp2mr[] if it's not registered yet. If mp2mr[] is
> > > full,
> > > - * remove an entry first.
> > > + * Add memory region (MR) <-> memory pool (MP) association to txq-
> > >mp2mr[].
> > > + * If mp2mr[] is full, remove an entry first.
> > >   *
> > >   * @param txq
> > >   *   Pointer to Tx queue structure.
> > >   * @param[in] mp
> > > - *   Memory pool for which a memory region lkey must be returned.
> > > + *   Memory pool for which a memory region lkey must be added
> > > + * @param[in] i
> > > + *   Index in memory pool (MP) where to add memory region (MR)
> > >   *
> > >   * @return
> > > - *   mr->lkey on success, (uint32_t)-1 on failure.
> > > + *   Added mr->lkey on success, (uint32_t)-1 on failure.
> > >   */
> > > -uint32_t
> > > -mlx4_txq_mp2mr(struct txq *txq, struct rte_mempool *mp)
> > > +uint32_t mlx4_txq_add_mr(struct txq *txq, struct rte_mempool *mp,
> > > +uint32_t i)
> > >  {
> > > -	unsigned int i;
> > >  	struct ibv_mr *mr;
> > >
> > > -	for (i = 0; (i != RTE_DIM(txq->mp2mr)); ++i) {
> > > -		if (unlikely(txq->mp2mr[i].mp == NULL)) {
> > > -			/* Unknown MP, add a new MR for it. */
> > > -			break;
> > > -		}
> > > -		if (txq->mp2mr[i].mp == mp) {
> > > -			assert(txq->mp2mr[i].lkey != (uint32_t)-1);
> > > -			assert(txq->mp2mr[i].mr->lkey == txq-
> > >mp2mr[i].lkey);
> > > -			return txq->mp2mr[i].lkey;
> > > -		}
> > > -	}
> > >  	/* Add a new entry, register MR first. */
> > >  	DEBUG("%p: discovered new memory pool \"%s\" (%p)",
> > >  	      (void *)txq, mp->name, (void *)mp); diff --git
> > > a/drivers/net/mlx4/mlx4_rxtx.h b/drivers/net/mlx4/mlx4_rxtx.h index
> > > e10bbca..719ef45 100644
> > > --- a/drivers/net/mlx4/mlx4_rxtx.h
> > > +++ b/drivers/net/mlx4/mlx4_rxtx.h
> > > @@ -53,6 +53,7 @@
> > >
> > >  #include "mlx4.h"
> > >  #include "mlx4_prm.h"
> > > +#include "mlx4_utils.h"
> > 
> > Why?
> > 
> > >
> > >  /** Rx queue counters. */
> > >  struct mlx4_rxq_stats {
> > > @@ -160,7 +161,6 @@ void mlx4_rx_queue_release(void *dpdk_rxq);
> > >
> > >  /* mlx4_rxtx.c */
> > >
> > > -uint32_t mlx4_txq_mp2mr(struct txq *txq, struct rte_mempool *mp);
> > > uint16_t mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts,
> > >  		       uint16_t pkts_n);
> > >  uint16_t mlx4_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, @@
> > > -169,6 +169,8 @@ uint16_t mlx4_tx_burst_removed(void *dpdk_txq,
> > struct rte_mbuf **pkts,
> > >  			       uint16_t pkts_n);
> > >  uint16_t mlx4_rx_burst_removed(void *dpdk_rxq, struct rte_mbuf **pkts,
> > >  			       uint16_t pkts_n);
> > > +uint32_t mlx4_txq_add_mr(struct txq *txq, struct rte_mempool *mp,
> > > +				unsigned int i);
> > >
> > >  /* mlx4_txq.c */
> > >
> > > @@ -177,4 +179,52 @@ int mlx4_tx_queue_setup(struct rte_eth_dev
> > *dev, uint16_t idx,
> > >  			const struct rte_eth_txconf *conf);  void
> > > mlx4_tx_queue_release(void *dpdk_txq);
> > >
> > > +/**
> > > + * Get memory pool (MP) from mbuf. If mbuf is indirect, the pool from
> > > +which
> > > + * the cloned mbuf is allocated is returned instead.
> > > + *
> > > + * @param buf
> > > + *   Pointer to mbuf.
> > > + *
> > > + * @return
> > > + *   Memory pool where data is located for given mbuf.
> > > + */
> > > +static __rte_always_inline struct rte_mempool * mlx4_txq_mb2mp(struct
> > > +rte_mbuf *buf) {
> > > +	if (unlikely(RTE_MBUF_INDIRECT(buf)))
> > > +		return rte_mbuf_from_indirect(buf)->pool;
> > > +	return buf->pool;
> > > +}
> > > +
> > > +/**
> > > + * Get memory region (MR) <-> memory pool (MP) association from txq-
> > >mp2mr[].
> > > + * Call mlx4_txq_add_mr() if MP is not registered yet.
> > > + *
> > > + * @param txq
> > > + *   Pointer to Tx queue structure.
> > > + * @param[in] mp
> > > + *   Memory pool for which a memory region lkey must be returned.
> > > + *
> > > + * @return
> > > + *   mr->lkey on success, (uint32_t)-1 on failure.
> > > + */
> > > +static __rte_always_inline uint32_t
> > 
> > Note __rte_always_inline is defined in rte_common.h and should be
> > explicitly included (however don't do that, see below).
> > 
> > > +mlx4_txq_mp2mr(struct txq *txq, struct rte_mempool *mp) {
> > > +	unsigned int i;
> > > +
> > > +	for (i = 0; (i != RTE_DIM(txq->mp2mr)); ++i) {
> > > +		if (unlikely(txq->mp2mr[i].mp == NULL)) {
> > > +			/* Unknown MP, add a new MR for it. */
> > > +			break;
> > > +		}
> > > +		if (txq->mp2mr[i].mp == mp) {
> > > +			assert(txq->mp2mr[i].lkey != (uint32_t)-1);
> > > +			assert(txq->mp2mr[i].mr->lkey == txq-
> > >mp2mr[i].lkey);
> > 
> > assert() requires assert.h (but don't include it, see subsequent suggestion).
> > 
> > > +			return txq->mp2mr[i].lkey;
> > > +		}
> > > +	}
> > > +	return mlx4_txq_add_mr(txq, mp, i);
> > > +}
> > >  #endif /* MLX4_RXTX_H_ */
> > 
> > So as described above, these functions do not need the __rte_always_inline,
> > please remove it. They also do not need to be located in a header file; the
> > reason it's the case for their mlx5 counterparts is that they have to be shared
> > between vectorized/non-vectorized code. No such requirement here, you
> > should move them back to their original spot.
> > 
> 
> Static function mlx4_txq_mp2mr() must be in a header file because it is shared by 2 files: mlx4_txq.c and mlx4_rxtx.c.
> It is not related to vectorized/non-vectorized code in mlx5.
> Having said that -__rte_always_inline is required as well otherwise compilation fails with 
> drivers/net/mlx4/mlx4_rxtx.h:200:1: error: 'mlx4_txq_mp2mr' defined but not used [-Werror=unused-function]
> for files which include mlx4_rxtx.h

All right, then what you were looking for was static inline, not *forced*
inline. The former is a hint; the latter doesn't leave much of a choice to
the compiler and implies you're sure it yields the best performance.
However, for this patch I really think inlining plays a minor part (if it
changes anything at all) compared to splitting this function, which is
the real performance improvement.
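
To illustrate the difference, here is a minimal sketch outside of the
driver. All demo_* names are made up for this example, and
demo_always_inline is only a stand-in for __rte_always_inline, assuming a
GCC-compatible compiler. A plain "static inline" function in a header is
not reported by -Wunused-function when a file includes it without calling
it, so forcing the inlining is not needed to get rid of that error:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for __rte_always_inline (GCC/clang attribute). */
#define demo_always_inline inline __attribute__((always_inline))

/*
 * Hint only: the compiler decides whether to inline. GCC does not report
 * an unused "static inline" function under -Wunused-function.
 */
static inline uint32_t
demo_find_hint(const uint32_t *keys, unsigned int n, uint32_t key)
{
	unsigned int i;

	assert(keys != NULL);
	for (i = 0; i != n; ++i)
		if (keys[i] == key)
			return i;
	return (uint32_t)-1;
}

/* Forced: the compiler is asked to inline this at every call site. */
static demo_always_inline uint32_t
demo_find_forced(const uint32_t *keys, unsigned int n, uint32_t key)
{
	unsigned int i;

	assert(keys != NULL);
	for (i = 0; i != n; ++i)
		if (keys[i] == key)
			return i;
	return (uint32_t)-1;
}
```

Both variants return the same results; only the amount of freedom left to
the compiler differs.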

> > My suggestion for this performance improvement is to move
> > mlx4_txq_add_mr() to a different file, mlx4_mr.c looks like a good
> > candidate. This fact will ensure it's never inlined and far away from the data
> > path.
> > 
> 
> Function mlx4_txq_add_mr() is relatively small. 
> What do you say about preceding it with __attribute((noinline)) instead of creating a new file?

What I mean is you should define mlx4_txq_add_mr(), which does the heavy
lifting, inside mlx4_mr.c and provide its declaration in mlx4.h instead of
mlx4_rxtx.h.

Then, mlx4_txq_mp2mr() can remain defined in mlx4_rxtx.c in its original
spot as a non-static function with its public declaration remaining in
mlx4_rxtx.h for users outside of this file.

The fact mlx4_txq_mp2mr() remains defined in that file *before*
mlx4_post_send()'s definition where it's needed allows the compiler to
inline it as if it were static inline thanks to -O3, that is, unless it
thinks doing so would hurt performance, but as a (now) small function
this shouldn't be an issue.
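
The layout I have in mind can be sketched as follows. This is only an
illustration, not the actual driver code: every demo_* name, the cache
size, the fake lkey values and the eviction policy (drop the oldest entry)
are assumptions made for the example. Since it is a single file, the
noinline attribute stands in for moving the slow path to another file:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_SIZE 4
#define demo_unlikely(x) __builtin_expect(!!(x), 0)

struct demo_pool { const char *name; }; /* stands in for rte_mempool */

struct demo_txq {
	struct {
		const struct demo_pool *mp;
		uint32_t lkey;
	} mp2mr[DEMO_CACHE_SIZE];
	unsigned int registrations; /* slow path call counter */
};

/*
 * Slow path. In the suggested layout this would be defined in another
 * file; noinline emulates that within a single translation unit.
 */
__attribute__((noinline)) uint32_t
demo_txq_add_mr(struct demo_txq *txq, const struct demo_pool *mp,
		unsigned int i)
{
	if (i == DEMO_CACHE_SIZE) {
		/* Cache full: drop the oldest entry, shift the rest down. */
		for (i = 0; i != DEMO_CACHE_SIZE - 1; ++i)
			txq->mp2mr[i] = txq->mp2mr[i + 1];
	}
	txq->mp2mr[i].mp = mp;
	txq->mp2mr[i].lkey = 0x1000u + txq->registrations++; /* fake lkey */
	return txq->mp2mr[i].lkey;
}

/* Fast path lookup: small, non-static, defined before its caller. */
uint32_t
demo_txq_mp2mr(struct demo_txq *txq, const struct demo_pool *mp)
{
	unsigned int i;

	for (i = 0; i != DEMO_CACHE_SIZE; ++i) {
		if (demo_unlikely(txq->mp2mr[i].mp == NULL))
			break; /* Unknown pool, register it below. */
		if (txq->mp2mr[i].mp == mp) {
			assert(txq->mp2mr[i].lkey != (uint32_t)-1);
			return txq->mp2mr[i].lkey;
		}
	}
	return demo_txq_add_mr(txq, mp, i);
}

/* Data path caller: the compiler is free to inline the lookup here. */
uint32_t
demo_post_send(struct demo_txq *txq, const struct demo_pool *mp)
{
	return demo_txq_mp2mr(txq, mp);
}
```

Sending twice from the same pool goes through the slow path only once; the
second call is served by the lookup loop alone.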

Another reason is that doing so would make a smaller diff that focuses
on the performance improvement itself. The extra performance brought by a
statically inlined version of mlx4_txq_mp2mr() is not needed in mlx4_txq.c,
whose only purpose is to set up queues.

-- 
Adrien Mazarguil
6WIND

