From: Bruce Richardson <bruce.richardson@intel.com>
To: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>
Cc: Vladyslav Buslov <Vladyslav.Buslov@harmonicinc.com>,
"Wu, Jingjing" <jingjing.wu@intel.com>,
"Yigit, Ferruh" <ferruh.yigit@intel.com>,
"Zhang, Helin" <helin.zhang@intel.com>,
"dev@dpdk.org" <dev@dpdk.org>
Subject: Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx
Date: Thu, 13 Oct 2016 11:18:49 +0100 [thread overview]
Message-ID: <20161013101849.GA132256@bricha3-MOBL3> (raw)
In-Reply-To: <2601191342CEEE43887BDE71AB9772583F0C09AD@irsmsx105.ger.corp.intel.com>
On Wed, Oct 12, 2016 at 12:04:39AM +0000, Ananyev, Konstantin wrote:
> Hi Vladyslav,
>
> > > > > >
> > > > > > On 7/14/2016 6:27 PM, Vladyslav Buslov wrote:
> > > > > > > Added prefetch of first packet payload cacheline in i40e_rx_scan_hw_ring.
> > > > > > > Added prefetch of second mbuf cacheline in i40e_rx_alloc_bufs.
> > > > > > >
> > > > > > > Signed-off-by: Vladyslav Buslov
> > > > > > > <vladyslav.buslov@harmonicinc.com>
> > > > > > > ---
> > > > > > > drivers/net/i40e/i40e_rxtx.c | 7 +++++--
> > > > > > > 1 file changed, 5 insertions(+), 2 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
> > > > > > > index d3cfb98..e493fb4 100644
> > > > > > > --- a/drivers/net/i40e/i40e_rxtx.c
> > > > > > > +++ b/drivers/net/i40e/i40e_rxtx.c
> > > > > > > @@ -1003,6 +1003,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
> > > > > > > /* Translate descriptor info to mbuf parameters */
> > > > > > > for (j = 0; j < nb_dd; j++) {
> > > > > > > mb = rxep[j].mbuf;
> > > > > > > + rte_prefetch0(RTE_PTR_ADD(mb->buf_addr, RTE_PKTMBUF_HEADROOM));
> > > > >
> > > > > Why prefetch here? I think if the application needs to deal with the
> > > > > packet data, it is more suitable to do the prefetch in the application.
> > > > >
> > > > > > > qword1 = rte_le_to_cpu_64(\
> > > > > > > rxdp[j].wb.qword1.status_error_len);
> > > > > > > pkt_len = ((qword1 & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> > > > > > > @@ -1086,9 +1087,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue *rxq)
> > > > > > >
> > > > > > > rxdp = &rxq->rx_ring[alloc_idx];
> > > > > > > for (i = 0; i < rxq->rx_free_thresh; i++) {
> > > > > > > - if (likely(i < (rxq->rx_free_thresh - 1)))
> > > > > > > + if (likely(i < (rxq->rx_free_thresh - 1))) {
> > > > > > > /* Prefetch next mbuf */
> > > > > > > - rte_prefetch0(rxep[i + 1].mbuf);
> > > > > > > + rte_prefetch0(&rxep[i + 1].mbuf->cacheline0);
> > > > > > > + rte_prefetch0(&rxep[i + 1].mbuf->cacheline1);
> > >
> > > I think there are rte_mbuf_prefetch_part1/part2 helpers defined in
> > > rte_mbuf.h, specifically for that case.
> >
> > Thanks for pointing that out.
> > I'll submit new patch if you decide to move forward with this development.
> >
> > >
> > > > > > > + }
> > > > > Agree with this change. But when I tested it with testpmd in iofwd
> > > > > mode, no performance increase was observed, only a minor decrease.
> > > > > Can you share with us how it benefits performance in your
> > > > > scenario?
> > > > >
> > > > >
> > > > > Thanks
> > > > > Jingjing
> > > >
> > > > Hello Jingjing,
> > > >
> > > > Thanks for code review.
> > > >
> > > > My use case: We have simple distributor thread that receives packets
> > > > from port and distributes them among worker threads according to VLAN
> > > and MAC address hash.
> > > >
> > > > While working on performance optimization we determined that most of
> > > > the CPU usage of this thread is inside DPDK.
> > > > As an optimization we decided to switch to the rx burst alloc function,
> > > > however that caused additional performance degradation compared to
> > > > scatter rx mode.
> > > > In the profiler the two major culprits were:
> > > > 1. Access to the packet's Eth header in application code (cache miss).
> > > > 2. Setting the next field to NULL in the DPDK i40e_rx_alloc_bufs code
> > > > (this field is in the second mbuf cache line, which was not
> > > > prefetched).
> > >
> > > I wonder what would happen if we removed all prefetches here?
> > > Would it make things better or worse (and by how much)?
> >
> > In our case it causes a few percent PPS degradation on the next=NULL assignment, but it seems that Jingjing's test doesn't confirm it.
> >
> > >
> > > > After applying my fixes performance improved compared to scatter rx
> > > mode.
> > > >
> > > > I assumed that prefetching the first cache line of packet data belongs
> > > > in DPDK because it is done in scatter rx mode (in
> > > > i40e_recv_scattered_pkts).
> > > > It can be moved to the application side, but IMO it is better to be
> > > > consistent across all rx modes.
> > >
> > > I would agree with Jingjing here; probably the PMD should avoid
> > > prefetching the packet's data.
> >
> > Actually I can see some valid use cases where it is beneficial to have this prefetch in the driver.
> > In our sw distributor case it is trivial to just prefetch the next packet on each iteration because packets are processed one by one.
> > However, when we move this functionality to hw by means of RSS/vfunction/FlowDirector (our long term goal), worker threads will receive
> > packets directly from the rx queues of the NIC.
> > The first operation of a worker thread is a bulk lookup in a hash table by destination MAC. This will cause a cache miss on accessing each
> > eth header and can't be easily mitigated in application code.
> > I assume it is a ubiquitous use case for DPDK.
>
> Yes, it is quite a common use-case.
> Though in many cases it is possible to reorder user code to hide (or minimize) that data-access latency.
> On the other hand, there are scenarios where this prefetch is excessive and can cause some drop in performance.
> Again, as far as I know, none of the PMDs for Intel devices prefetches the packet's data in simple (single segment) RX mode.
> Another thing that some people may then argue - why is only one cache line prefetched,
> when some use-cases might need to look at the 2-nd one as well.
>
There is a build-time config setting for this behaviour for exactly the reasons
called out here - in some apps you get a benefit, in others you see a perf
hit. The default is "on", which makes sense for most cases, I think.
From common_base:
CONFIG_RTE_PMD_PACKET_PREFETCH=y
/Bruce
Thread overview: 12+ messages
2016-07-14 17:27 [dpdk-dev] [PATCH] Add missing prefetches to i40e bulk rx path Vladyslav Buslov
2016-07-14 17:27 ` [dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx Vladyslav Buslov
2016-09-14 13:24 ` Ferruh Yigit
2016-10-10 13:25 ` Wu, Jingjing
2016-10-10 17:05 ` Vladyslav Buslov
2016-10-11 8:51 ` Ananyev, Konstantin
2016-10-11 9:24 ` Vladyslav Buslov
2016-10-12 0:04 ` Ananyev, Konstantin
2016-10-13 10:18 ` Bruce Richardson [this message]
2016-10-13 10:30 ` Ananyev, Konstantin
2016-11-15 12:19 ` Ferruh Yigit
2016-11-15 13:27 ` Vladyslav Buslov