From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 13 Oct 2016 11:18:49 +0100
From: Bruce Richardson
To: "Ananyev, Konstantin"
Cc: Vladyslav Buslov, "Wu, Jingjing", "Yigit, Ferruh", "Zhang, Helin", dev@dpdk.org
Message-ID: <20161013101849.GA132256@bricha3-MOBL3>
References: <20160714172719.17502-1-vladyslav.buslov@harmonicinc.com>
 <20160714172719.17502-2-vladyslav.buslov@harmonicinc.com>
 <18156776-3658-a97d-3fbc-19c1a820a04d@intel.com>
 <9BB6961774997848B5B42BEC655768F80E277DFC@SHSMSX103.ccr.corp.intel.com>
 <2601191342CEEE43887BDE71AB9772583F0C0408@irsmsx105.ger.corp.intel.com>
 <2601191342CEEE43887BDE71AB9772583F0C09AD@irsmsx105.ger.corp.intel.com>
In-Reply-To: <2601191342CEEE43887BDE71AB9772583F0C09AD@irsmsx105.ger.corp.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Organization: Intel Research and Development Ireland Ltd.
User-Agent: Mutt/1.5.23 (2014-03-12)
Subject: Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx
List-Id: patches and discussions about DPDK

On Wed, Oct 12, 2016 at 12:04:39AM +0000, Ananyev, Konstantin wrote:
> Hi Vladislav,
>
> > > > > > > On 7/14/2016 6:27 PM, Vladyslav Buslov wrote:
> > > > > > > Added prefetch of the first packet payload cacheline in
> > > > > > > i40e_rx_scan_hw_ring.
> > > > > > > Added prefetch of the second mbuf cacheline in
> > > > > > > i40e_rx_alloc_bufs.
> > > > > > >
> > > > > > > Signed-off-by: Vladyslav Buslov
> > > > > > >
> > > > > > > ---
> > > > > > >  drivers/net/i40e/i40e_rxtx.c | 7 +++++--
> > > > > > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
> > > > > > > index d3cfb98..e493fb4 100644
> > > > > > > --- a/drivers/net/i40e/i40e_rxtx.c
> > > > > > > +++ b/drivers/net/i40e/i40e_rxtx.c
> > > > > > > @@ -1003,6 +1003,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
> > > > > > >  		/* Translate descriptor info to mbuf parameters */
> > > > > > >  		for (j = 0; j < nb_dd; j++) {
> > > > > > >  			mb = rxep[j].mbuf;
> > > > > > > +			rte_prefetch0(RTE_PTR_ADD(mb->buf_addr,
> > > > > > > +					RTE_PKTMBUF_HEADROOM));
> > > > >
> > > > > Why prefetch here? I think if the application needs to deal with the
> > > > > packet data, it is more suitable to put the prefetch in the application.
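
As an illustration of the application-side alternative Jingjing suggests (a
sketch only, not from the original thread; the function name, BURST_SIZE and
the port/queue handling are assumptions), the receiving core could prefetch
each packet's first data cache line right after rte_eth_rx_burst() and before
the processing loop reads the Ethernet header:

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>
    #include <rte_prefetch.h>

    #define BURST_SIZE 32

    static void
    rx_burst_with_prefetch(uint8_t port, uint16_t queue)
    {
        struct rte_mbuf *pkts[BURST_SIZE];
        uint16_t nb_rx, i;

        nb_rx = rte_eth_rx_burst(port, queue, pkts, BURST_SIZE);

        /* Prefetch the first data cache line of every received packet
         * before the per-packet loop touches the Ethernet header. */
        for (i = 0; i < nb_rx; i++)
            rte_prefetch0(rte_pktmbuf_mtod(pkts[i], void *));

        for (i = 0; i < nb_rx; i++) {
            /* ... parse headers, hash, distribute to workers ... */
        }
    }

This keeps the prefetch decision in the application, at the cost of every
rx-burst caller having to remember to issue it.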
> > > > >
> > > > > > > 			qword1 = rte_le_to_cpu_64(\
> > > > > > > 					rxdp[j].wb.qword1.status_error_len);
> > > > > > > 			pkt_len = ((qword1 & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> > > > > > > @@ -1086,9 +1087,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue *rxq)
> > > > > > >
> > > > > > >  	rxdp = &rxq->rx_ring[alloc_idx];
> > > > > > >  	for (i = 0; i < rxq->rx_free_thresh; i++) {
> > > > > > > -		if (likely(i < (rxq->rx_free_thresh - 1)))
> > > > > > > +		if (likely(i < (rxq->rx_free_thresh - 1))) {
> > > > > > >  			/* Prefetch next mbuf */
> > > > > > > -			rte_prefetch0(rxep[i + 1].mbuf);
> > > > > > > +			rte_prefetch0(&rxep[i + 1].mbuf->cacheline0);
> > > > > > > +			rte_prefetch0(&rxep[i + 1].mbuf->cacheline1);
> > >
> > > I think there are rte_mbuf_prefetch_part1/part2 defined in rte_mbuf.h,
> > > specially for that case.
> >
> > Thanks for pointing that out.
> > I'll submit a new patch if you decide to move forward with this development.
> >
> > > > > > > +		}
> > > > >
> > > > > Agree with this change. However, when I tested it with testpmd in iofwd
> > > > > mode, no performance increase was observed, only a minor decrease.
> > > > > Can you share with us when it benefits performance in your scenario?
> > > > >
> > > > > Thanks
> > > > > Jingjing
> > > >
> > > > Hello Jingjing,
> > > >
> > > > Thanks for the code review.
> > > >
> > > > My use case: we have a simple distributor thread that receives packets
> > > > from a port and distributes them among worker threads according to a
> > > > VLAN and MAC address hash.
> > > >
> > > > While working on performance optimization we determined that most of
> > > > this thread's CPU usage is in DPDK.
> > > > As an optimization we decided to switch to the rx burst alloc function,
> > > > however that caused additional performance degradation compared to
> > > > scatter rx mode.
> > > > In the profiler the two major culprits were:
> > > > 1. Access to the packet's Ethernet header in application code (cache miss).
> > > > 2. Setting the next packet descriptor field to NULL in the DPDK
> > > >    i40e_rx_alloc_bufs code (this field is in the second mbuf cache line,
> > > >    which was not prefetched).
> > >
> > > I wonder what would happen if we removed any prefetches here?
> > > Would it make things better or worse (and by how much)?
> >
> > In our case it causes a few per cent PPS degradation on the next=NULL
> > assignment, but it seems that Jingjing's test doesn't confirm it.
> >
> > > > After applying my fixes, performance improved compared to scatter rx
> > > > mode.
> > > >
> > > > I assumed that the prefetch of the first cache line of packet data
> > > > belongs in DPDK because it is done in scatter rx mode (in
> > > > i40e_recv_scattered_pkts).
> > > > It can be moved to the application side, but IMO it is better to be
> > > > consistent across all rx modes.
> > >
> > > I would agree with Jingjing here; probably the PMD should avoid
> > > prefetching the packet's data.
> >
> > Actually, I can see some valid use cases where it is beneficial to have
> > this prefetch in the driver.
> > In our sw distributor case it is trivial to just prefetch the next packet
> > on each iteration because packets are processed one by one.
> > However, when we move this functionality to hw by means of
> > RSS/vfunction/FlowDirector (our long-term goal), worker threads will
> > receive packets directly from the rx queues of the NIC.
> > The first operation of a worker thread is a bulk lookup in a hash table
> > keyed by destination MAC. This causes a cache miss when accessing each
> > Ethernet header and can't easily be mitigated in application code.
> > I assume this is a ubiquitous use case for DPDK.
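
As a sketch of the worker-thread pattern Vladyslav describes (illustrative
only, not from the thread; the helper name, BURST_SIZE and the 2016-era
struct ether_hdr/d_addr field names are assumptions), the first thing the
worker does with a burst is read each packet's destination MAC to feed a bulk
hash lookup - exactly the access that takes a cache miss when nothing has
prefetched the packet data:

    #include <rte_ether.h>
    #include <rte_hash.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    /* Classify a burst (nb_pkts <= BURST_SIZE) by destination MAC. */
    static void
    worker_classify_burst(const struct rte_hash *mac_table,
                          struct rte_mbuf **pkts, uint16_t nb_pkts)
    {
        const void *keys[BURST_SIZE];
        int32_t positions[BURST_SIZE];
        uint16_t i;

        for (i = 0; i < nb_pkts; i++) {
            struct ether_hdr *eth =
                rte_pktmbuf_mtod(pkts[i], struct ether_hdr *);
            /* First touch of the packet data: a cache miss per packet
             * unless the rx path (or the app) prefetched it earlier. */
            keys[i] = &eth->d_addr;
        }

        /* One bulk lookup for the whole burst. */
        rte_hash_lookup_bulk(mac_table, keys, nb_pkts, positions);

        /* ... dispatch packets to workers according to positions[] ... */
    }

Because the miss hits the very first instruction that touches the packet,
there is no earlier point in the worker where an application-side prefetch
could usefully be issued - which is the argument for doing it in the rx path.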
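
The follow-up patch mentioned above could use the rte_mbuf.h helpers
Konstantin points at instead of naming cacheline0/cacheline1 directly;
roughly like this (a sketch, not the actual v2 patch):

    #include <rte_mbuf.h>

    /* Prefetch both cache lines of the next mbuf to be handed out. */
    static inline void
    prefetch_next_mbuf(struct rte_mbuf *m)
    {
        rte_mbuf_prefetch_part1(m); /* first mbuf cache line */
        rte_mbuf_prefetch_part2(m); /* second cache line (next, pool, ...) */
    }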
>
> Yes, it is quite a common use case.
> Though in many cases it is possible to reorder the user code to hide (or
> minimize) that data-access latency.
> On the other hand, there are scenarios where this prefetch is excessive and
> can cause some drop in performance.
> Again, as far as I know, none of the PMDs for Intel devices prefetches the
> packet's data in simple (single-segment) RX mode.
> Another thing some people may then argue: why is only one cache line
> prefetched, when some use cases might need to look at the second one too.
>
There is a build-time config setting for this behaviour, for exactly the
reasons called out here - in some apps you get a benefit, in others you see a
perf hit. The default is "on", which makes sense for most cases, I think.

From common_base:

CONFIG_RTE_PMD_PACKET_PREFETCH=y

/Bruce
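
For reference, the rx paths of the Intel PMDs typically consume that
build-time option through a small wrapper macro; the sketch below shows the
general pattern (the exact macro name and prefetch level vary per driver, so
treat this as illustrative rather than a quote from i40e_rxtx.c):

    #include <rte_mbuf.h>
    #include <rte_prefetch.h>

    /* Gate packet-data prefetch on the build-time option. */
    #ifdef RTE_PMD_PACKET_PREFETCH
    #define rte_packet_prefetch(p)  rte_prefetch0(p)
    #else
    #define rte_packet_prefetch(p)  do { } while (0)
    #endif

    static inline void
    prefetch_pkt_data(struct rte_mbuf *mb)
    {
        /* Compiles to nothing when CONFIG_RTE_PMD_PACKET_PREFETCH=n. */
        rte_packet_prefetch(rte_pktmbuf_mtod(mb, void *));
    }

With the option disabled the macro expands to nothing, so applications that
prefer to manage prefetching themselves pay no extra cost in the driver.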