DPDK patches and discussions
From: Neil Horman <nhorman@tuxdriver.com>
To: Bruce Richardson <bruce.richardson@intel.com>
Cc: "dev@dpdk.org" <dev@dpdk.org>
Subject: Re: [dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path tx perf
Date: Thu, 18 Sep 2014 13:56:41 -0400
Message-ID: <20140918175641.GL20389@hmsreliant.think-freely.org> (raw)
In-Reply-To: <20140918154235.GB12120@BRICHA3-MOBL>

On Thu, Sep 18, 2014 at 04:42:36PM +0100, Bruce Richardson wrote:
> On Thu, Sep 18, 2014 at 11:29:30AM -0400, Neil Horman wrote:
> > On Thu, Sep 18, 2014 at 02:36:13PM +0100, Bruce Richardson wrote:
> > > On Wed, Sep 17, 2014 at 01:59:36PM -0400, Neil Horman wrote:
> > > > On Wed, Sep 17, 2014 at 03:35:19PM +0000, Richardson, Bruce wrote:
> > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > > > Sent: Wednesday, September 17, 2014 4:21 PM
> > > > > > To: Richardson, Bruce
> > > > > > Cc: dev@dpdk.org
> > > > > > Subject: Re: [dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path tx
> > > > > > perf
> > > > > > 
> > > > > > On Wed, Sep 17, 2014 at 11:01:39AM +0100, Bruce Richardson wrote:
> > > > > > > Make a small improvement to slow path TX performance by adding in a
> > > > > > > prefetch for the second mbuf cache line.
> > > > > > > Also move assignment of l2/l3 length values only when needed.
> > > > > > >
> > > > > > > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > > > > > > ---
> > > > > > >  lib/librte_pmd_ixgbe/ixgbe_rxtx.c | 12 +++++++-----
> > > > > > >  1 file changed, 7 insertions(+), 5 deletions(-)
> > > > > > >
> > > > > > > diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > > > b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > > > > index 6f702b3..c0bb49f 100644
> > > > > > > --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > > > > +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > > > > @@ -565,25 +565,26 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf
> > > > > > **tx_pkts,
> > > > > > >  		ixgbe_xmit_cleanup(txq);
> > > > > > >  	}
> > > > > > >
> > > > > > > +	rte_prefetch0(&txe->mbuf->pool);
> > > > > > > +
> > > > > > 
> > > > > > Can you explain what all of these prefetches are doing?  It looks to me like
> > > > > > they're just fetching the first cacheline of the mempool structure, which it
> > > > > > appears amounts to the pool's name.  I don't see that having any use here.
> > > > > > 
> > > > > This does make a decent enough performance difference in my tests (the amount varies depending on the RX path being used by testpmd). 
> > > > > 
> > > > > What I've done with the prefetches is two-fold:
> > > > > 1) changed it from prefetching the mbuf (first cache line) to prefetching the mbuf pool pointer (second cache line) so that when we go to access the pool pointer to free transmitted mbufs we don't get a cache miss. When clearing the ring and freeing mbufs, the pool pointer is the only mbuf field used, so we don't need that first cache line.
> > > > OK, this makes some sense, but you're not guaranteed either that the
> > > > prefetch will be needed, nor that the data will still be in cache by the time
> > > > you get to the free call.  It seems like it might be preferable to prefetch the
> > > > data pointed to by tx_pkt, as you're sure to use that every loop iteration.
> > > 
> > > The vast majority of the times the prefetch is necessary, and it does help 
> > > performance doing things this way. If the prefetch is not necessary, it's 
> > > just one extra instruction, while, if it is needed, having the prefetch 
> > > occur 20 cycles before access (picking an arbitrary value) means that we 
> > > have cut down the time it takes to pull the data from cache when it is 
> > > needed by 20 cycles.
> > I understand how prefetch works. What I'm concerned about is its overuse, and
> > its tendency to frequently need re-calibration (though I admit I missed the &
> > operator in the patch, and thought you were prefetching the contents of the
> > struct, not the pointer value itself).  As you say, if the pool pointer is
> > almost certain to be used, then it may well make sense to prefetch the data, but
> > in doing so, you potentially evict something that you were about to use, so
> > you're not doing yourself any favors.  I understand that you've validated this
> > experimentally, and so it works, right now.  I just like to be very careful
> > about how prefetch happens, as it can easily (and silently) start hurting far
> > more than it helps.
> > 
> > > As for the value pointed to by tx_pkt, since this is a 
> > > packet the app has just been working on, it's almost certainly already in 
> > > l1/l2 cache.  
> > > 
> > Not sure I follow you here.  tx_pkts is an array of mbufs passed to the pmd from
> > rte_eth_tx_burst, which in turn is called by the application.  I don't see any
> > reasonable guarantee that any of those packets have been touched in sufficiently
> > recent history that they are likely to be in cache.  It seems like, if you do
> > want to do prefetching, interrogating nb_tx and doing a prefetch of an
> > appropriate stride to fill multiple cachelines with successive mbuf headers might
> > provide superior performance.
> > Neil
> >
> Prefetching the mbuf is probably best left to the application. For all our 
> sample applications used for benchmarking, and almost certainly the vast 
> majority of all our example applications, the packet being transmitted is 
> already in cache on the core itself. If we added a prefetch to the tx function, I 
> would expect to see a performance decrease in both testpmd and l3fwd apps.  
> It would be useful for apps where the packets are passed from one core to 
> another core which does no processing of them before transmitting them - but 
> in that case, it's better to have the TX thread of the app do the prefetch 
> rather than forcing it in the driver and reducing the performance of those 
> apps that have the packets already in cache.
> 

Regarding the performance decrease, I think you're trying to have it both ways
here.  Above you indicate that if the prefetch of the pool pointer isn't needed
it's just an extra instruction, which I think is true.  But now you are saying
that if the tx buffers are in cache, the extra instructions will have an impact.
Granted, it's potentially nb_tx prefetches, not one, but none of them stall the
cpu pipeline as far as I'm aware, so I can't imagine 1 prefetch vs. several will
have a significant impact on performance.

Regarding where to do prefetch: leaving prefetch in the hands of the application is a
bad idea, because the application has no visibility into the code path once you
enter the DPDK. It doesn't know if the buffers are going to be accessed in 20
cycles or 20,000 cycles, which will be all the difference between a useful and
harmful prefetch.  Sure you can calibrate your application to correspond to a
given version of the dpdk and optimize such a prefetch, but that will be
completely obsoleted the first time the dpdk transmit path changes.
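To make the calibration point concrete, here's a minimal stand-alone sketch (not DPDK code; the record layout, the prefetch0 macro, and the PF_DIST value are all invented for illustration) of how a prefetch distance gets baked into a loop.  The helper is roughly what rte_prefetch0() does on GCC/clang targets:

```c
#include <stddef.h>

/* Hypothetical stand-in for rte_prefetch0(); on GCC/clang this is
 * roughly equivalent: prefetch for read, maximum temporal locality. */
#define prefetch0(p) __builtin_prefetch((p), 0, 3)

/* The calibration knob discussed above: too small and the load still
 * misses, too large and the line may be evicted again before use.
 * The "right" value depends on how much work each iteration does --
 * which is exactly what the application cannot see inside the PMD. */
#define PF_DIST 4

struct record {
	char pad[64];        /* first cache line (unused on this path) */
	unsigned long value; /* lives on the second cache line */
};

static unsigned long
sum_values(const struct record *r, size_t n)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < n; i++) {
		/* Prefetch the second cache line of the record we will
		 * touch PF_DIST iterations from now. */
		if (i + PF_DIST < n)
			prefetch0(&r[i + PF_DIST].value);
		sum += r[i].value;
	}
	return sum;
}
```

If the per-iteration work later shrinks or grows (say, the transmit path changes), PF_DIST needs re-tuning or the prefetch quietly becomes dead weight.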

As for the use of prefetching tx buffers at all, I think there are several cases
where you might find that those buffers are victimized in cache.  Consider the
situation where a receive interrupt triggers on a cpu right before rte_eth_tx_burst
is called.  For a heavily loaded system, the receive buffers may frequently push
the soon-to-be-transmitted buffers out of cache.

> The prefetch added by the patch under discussion doesn't suffer from this 
> issue as the data being prefetched is for the mbuf that was previously 
> transmitted some time previously, and the tx function has fully looped back 
> around the TX ring to get to it again.
> 

I get what you're saying here, that after the first prefetch the data stays hot
in cache because it is continually re-accessed.  That's fine.  But that would
happen after the first fetch anyway, without the prefetch.

You know what would put this argument to rest?  If you could run whatever
benchmark you were running under the perf utility so we could see the L1 cache
misses from the baseline dpdk, the variant where you prefetch the pool pointer,
and a variant in which you prefetch the next tx buf at the top of the loop.
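For reference, the kind of comparison I mean would look something like the following.  This is only a sketch: the three variant binaries, their arguments, and the event names are placeholders (counter names vary by CPU; see `perf list` on the target box):

```shell
# Hypothetical sketch: build the benchmark three ways (no prefetch,
# pool-pointer prefetch, tx-buf prefetch), then count L1 data-cache
# loads and misses for each run.
for variant in baseline pool-ptr-prefetch txbuf-prefetch; do
    perf stat -e L1-dcache-loads,L1-dcache-load-misses \
        -o perf-$variant.txt -- ./testpmd-$variant $TESTPMD_ARGS
done

# Compare miss counts side by side.
grep L1-dcache-load-misses perf-*.txt
```

The miss-rate delta between the three runs would show directly whether the prefetch is paying for itself or just moving misses around.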

Neil



> /Bruce
> 
> 


Thread overview: 39+ messages
2014-09-17 10:01 [dpdk-dev] [PATCH 0/5] Mbuf Structure Rework, part 3 Bruce Richardson
2014-09-17 10:01 ` [dpdk-dev] [PATCH 1/5] mbuf: ensure next pointer is set to null on free Bruce Richardson
2014-09-17 10:01 ` [dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path tx perf Bruce Richardson
2014-09-17 15:21   ` Neil Horman
2014-09-17 15:35     ` Richardson, Bruce
2014-09-17 17:59       ` Neil Horman
2014-09-18 13:36         ` Bruce Richardson
2014-09-18 15:29           ` Neil Horman
2014-09-18 15:42             ` Bruce Richardson
2014-09-18 17:56               ` Neil Horman [this message]
2014-09-17 10:01 ` [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32 Bruce Richardson
2014-09-17 15:29   ` Neil Horman
2014-09-18 15:53     ` Richardson, Bruce
2014-09-18 17:13       ` Thomas Monjalon
2014-09-18 18:08         ` Neil Horman
2014-09-19  9:18           ` Richardson, Bruce
2014-09-19 10:24             ` Neil Horman
2014-09-19 10:28               ` Richardson, Bruce
2014-09-19 15:18                 ` Neil Horman
2014-09-18 18:03       ` Neil Horman
2014-09-17 10:01 ` [dpdk-dev] [PATCH 4/5] mbuf: add userdata pointer field Bruce Richardson
2014-09-17 15:35   ` Neil Horman
2014-09-17 16:02     ` Richardson, Bruce
2014-09-17 18:29       ` Neil Horman
2014-09-17 10:01 ` [dpdk-dev] [PATCH 5/5] mbuf: Add in second vlan tag field to mbuf Bruce Richardson
2014-09-17 20:46   ` Stephen Hemminger
2014-09-23 11:08 ` [dpdk-dev] [PATCH v2 0/5] Mbuf Structure Rework, part 3 Bruce Richardson
2014-09-23 11:08   ` [dpdk-dev] [PATCH v2 1/5] mbuf: ensure next pointer is set to null on free Bruce Richardson
2014-09-23 11:08   ` [dpdk-dev] [PATCH v2 2/5] ixgbe: add prefetch to improve slow-path tx perf Bruce Richardson
2014-09-23 11:08   ` [dpdk-dev] [PATCH v2 3/5] testpmd: Change rxfreet default to 32 Bruce Richardson
2014-09-23 17:02     ` Neil Horman
2014-09-24  9:03       ` Richardson, Bruce
2014-09-24 10:05         ` Neil Horman
2014-11-07 12:30         ` Thomas Monjalon
2014-11-07 13:49           ` Bruce Richardson
2014-09-23 11:08   ` [dpdk-dev] [PATCH v2 4/5] mbuf: add userdata pointer field Bruce Richardson
2014-09-23 11:08   ` [dpdk-dev] [PATCH v2 5/5] mbuf: switch vlan_tci and reserved2 fields Bruce Richardson
2014-09-29 15:58   ` [dpdk-dev] [PATCH v2 0/5] Mbuf Structure Rework, part 3 De Lara Guarch, Pablo
2014-10-08 12:31     ` Thomas Monjalon
