From: "Ananyev, Konstantin"
To: "dev@dpdk.org"
Date: Wed, 17 Jun 2015 18:50:54 +0000
Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline

Hi Dave,

> From: Dave Barach (dbarach) [mailto:dbarach@cisco.com]
> Sent: Wednesday, June 17, 2015 4:46 PM
> To: Venkatesan, Venky; Richardson, Bruce; olivier.matz@6wind.com; Ananyev, Konstantin
> Cc: Damjan Marion (damarion)
> Subject: RE: [dpdk-dev] rte_mbuf.next in 2nd cacheline
>
> Dear Venky,
>
> The first thing I noticed - on a specific piece of hardware, yadda yadda yadda - was that the i40e driver speed path spent an ungodly
> amount of time stalled in i40e_rx_alloc_bufs() in rte_mbuf_refcnt_set. Mumble, missing prefetch. I added a stride-of-1 prefetch,
> which made the first stall go away. See below.
>
> Next thing I noticed: a stall setting mb->next = 0 in the non-scattered RX case. So I added the (commented-out) rte_prefetch0
> (&pfmb->next). At that point, I decided to move the buffer metadata around to eliminate the second prefetch.
>
> Taken together, these changes made a 10% PPS difference, again in a specific use case.

That seems like a valid point, but why is moving next into the first cache line considered the only possible option?
As Bruce suggested before, we can try to get rid of touching next in the non-scattered RX routines.
That, I think, should provide a similar performance improvement for the i40e/ixgbe fast-path scalar RX code.
Or are you talking about the hit in your application-layer code, when you start chaining mbufs together?
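Just to illustrate what I mean, here is a minimal sketch of a refill loop that never writes next. It relies on the assumption that freed mbufs always come back to the pool with next == NULL; the function name is made up and this is not the actual i40e code:

  #include <rte_mbuf.h>

  /* Sketch only: refill the SW ring without touching mb->next.
   * Assumes the pool invariant that mbufs are always freed with
   * next == NULL, so the fast path never has to store into the
   * second cache line. */
  static inline void
  refill_no_next_write(struct rte_mbuf **mbufs, uint16_t n, uint8_t port_id)
  {
          uint16_t i;

          for (i = 0; i < n; i++) {
                  struct rte_mbuf *mb = mbufs[i];

                  rte_mbuf_refcnt_set(mb, 1);
                  /* mb->next assumed to be NULL already (pool invariant) */
                  mb->data_off = RTE_PKTMBUF_HEADROOM;
                  mb->nb_segs = 1;
                  mb->port = port_id;
          }
  }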
>
> Damjan was going to produce a cleaned-up version of the rte_mbuf.h diffs, of the form:
>
> struct rte_mbuf {
> #ifdef CONFIG_OFFLOAD_IN_FIRST_CACHE_LINE
>   Offload-in-first-cache-line;
>   Next-in-second-cache-line;
> #else
>   Offload-in-second-cache-line;
>   Next-in-first-cache-line;
> #endif
> };
>
> ...along with a parallel change in the kernel module version.

First, with the changes proposed below, the ixgbe vector RX routine would be broken.
Of course, it could be fixed by putting even more conditional compilation around it -
enable vector RX only when OFFLOAD_IN_FIRST_CACHE_LINE is enabled, etc.

Second, how long would it take before someone else wants to introduce another mbuf field swap?
All for perfectly good reasons, of course.
Let's say, swap 'hash' and 'seqn' (to keep vector RX working),
or 'next' and 'userdata', or put tx_offload into the first line (traffic generators).

I think if we go that way (allow mbuf fields to be swapped at build time), we'll end up with totally unmaintainable code.
Not to mention problems with ABI compatibility (shared libraries, multiple processes, KNI).

So, I think we'd better stick with one mbuf format.
If it's absolutely necessary to move next into the first cache line, I think it has to be done for all configs.
Though, from what you described with i40e_rx_alloc_bufs() -
that looks like a flaw in a particular implementation, which might be fixed without changing the mbuf format.

>
> I'm out of LSU bandwidth / prefetch slots in a number of apps - mostly L2 apps - which will never use h/w offload results. I can see
> your point about not turning dpdk buffer metadata into a garbage can. On the other hand, the problems we face often involve MTU
> => scattered RX, with a mix of small and large packets. As in: enough PPS to care about extra cache-line fills.
>
> Hope this explains it a bit. We can obviously make these changes in our own repos, but I can't imagine that we're the only folks who
> might not use h/w offload results.
>
> Thanks. Dave
>
>
> diff --git a/lib/librte_pmd_i40e/i40e_rxtx.c b/lib/librte_pmd_i40e/i40e_rxtx.c
> index 9c7be6f..1200361 100644
> --- a/lib/librte_pmd_i40e/i40e_rxtx.c
> +++ b/lib/librte_pmd_i40e/i40e_rxtx.c
> @@ -779,9 +779,15 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue *rxq)
>
>       rxdp = &rxq->rx_ring[alloc_idx];
>       for (i = 0; i < rxq->rx_free_thresh; i++) {
> +               if (i < (rxq->rx_free_thresh - 1)) {
> +                       struct rte_mbuf *pfmb;
> +                       pfmb = rxep[i+1].mbuf;
> +                       rte_prefetch0 (pfmb);
> +                       // rte_prefetch0 (&pfmb->next);

I wonder, does your compiler unroll that loop?
If not, would manually unrolling it (by 4 or so) help?
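Something along these lines - just a sketch of the loop from your patch, assuming rx_free_thresh is a multiple of 4, completely untested:

  /* Sketch only: the refill loop unrolled by 4, prefetching the next
   * group of mbuf headers one iteration ahead. Assumes
   * rxq->rx_free_thresh is a multiple of 4; not the actual i40e code. */
  for (i = 0; i < rxq->rx_free_thresh; i += 4) {
          int j;

          /* prefetch the headers of the next group of 4 mbufs */
          if (i + 4 < rxq->rx_free_thresh)
                  for (j = 0; j < 4; j++)
                          rte_prefetch0(rxep[i + 4 + j].mbuf);

          for (j = 0; j < 4; j++) {
                  mb = rxep[i + j].mbuf;
                  rte_mbuf_refcnt_set(mb, 1);
                  mb->next = NULL;
                  mb->data_off = RTE_PKTMBUF_HEADROOM;
                  mb->nb_segs = 1;
                  mb->port = rxq->port_id;
          }
  }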
Konstantin

> +               }
>             mb = rxep[i].mbuf;
>             rte_mbuf_refcnt_set(mb, 1);
> -           mb->next = NULL;
> +           mb->next = NULL; /* $$$ in second cache line */
>             mb->data_off = RTE_PKTMBUF_HEADROOM;
>             mb->nb_segs = 1;
>             mb->port = rxq->port_id;
>
>
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index 55749bc..efd7f4e 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -288,6 +288,12 @@ struct rte_mbuf {
>       uint32_t pkt_len;         /**< Total pkt len: sum of all segments. */
>       uint16_t vlan_tci;        /**< VLAN Tag Control Identifier (CPU order) */
>       uint16_t reserved;
> +     uint32_t seqn; /**< Sequence number. See also rte_reorder_insert() */
> +
> +     struct rte_mbuf *next;    /**< Next segment of scattered packet. */
> +
> +     /* second cache line - fields only used in slow path or on TX */
> +     MARKER cacheline1 __rte_cache_aligned;
>       union {
>             uint32_t rss;       /**< RSS hash result if RSS enabled */
>             struct {
> @@ -307,18 +313,12 @@ struct rte_mbuf {
>             uint32_t usr;       /**< User defined tags. See rte_distributor_process() */
>       } hash;                   /**< hash information */
>
> -     uint32_t seqn; /**< Sequence number. See also rte_reorder_insert() */
> -
> -     /* second cache line - fields only used in slow path or on TX */
> -     MARKER cacheline1 __rte_cache_aligned;
> -
>       union {
>             void *userdata;   /**< Can be used for external metadata */
>             uint64_t udata64; /**< Allow 8-byte userdata on 32-bit */
>       };
>
>       struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
> -     struct rte_mbuf *next;    /**< Next segment of scattered packet. */
>
>       /* fields to support TX offloads */
>       union {
>
>
> Thanks. Dave
>
> From: Venkatesan, Venky [mailto:venky.venkatesan@intel.com]
> Sent: Wednesday, June 17, 2015 11:03 AM
> To: Richardson, Bruce; Dave Barach (dbarach); olivier.matz@6wind.com; Ananyev, Konstantin
> Cc: Damjan Marion (damarion)
> Subject: RE: [dpdk-dev] rte_mbuf.next in 2nd cacheline
>
> Dave,
>
> Is there a patch that I can look at? From your description, what you seem to indicate is that the mbuf structure becomes variant
> between applications - that to me is likely a non-starter. If rte_mbuf is variant between implementations, then we really have no
> commonality that all VNFs can rely on.
>
> I may be getting a bit ahead of things here, but letting different apps implement different mbuf layouts means that, to achieve any level of
> commonality for upper-level apps, we need to go to accessor functions (not inline macros) to get at the various components - pretty much
> like ODP does; all that does is add function-call overhead per access, and that is completely non-performant. We toyed with that about
> 5 years ago and tossed it out.
>
> Bruce,
>
> To some extent I need to get an understanding of why this performance drop is actually happening. Since SNB we have had a paired
> cache line prefetcher that brings in the cache line pair. I think the reason that we have the perf issue is that half the mbufs actually
> begin on an odd cache line boundary - i.e. those are the ones that will suffer a cache miss on rte_mbuf.next. Could you verify that this
> is the case, and see what happens if we address all mbufs beginning on an even cache line? That runs into another potential issue with
> 4K aliasing, but I may have a way to avoid that.
>
> Regards,
> -Venky
>
>
> From: Richardson, Bruce
> Sent: Wednesday, June 17, 2015 7:50 AM
> To: Dave Barach (dbarach); olivier.matz@6wind.com; Ananyev, Konstantin
> Cc: Damjan Marion (damarion); Venkatesan, Venky
> Subject: RE: [dpdk-dev] rte_mbuf.next in 2nd cacheline
>
> Hi,
>
> it's something we can look at - especially once we see the proposed patch.
>
> However, my question is: if an app is cycle-constrained such that the extra cache-line reference for jumbo frames causes a problem, is
> that app not likely to really suffer performance issues when run with smaller, e.g. 256-byte, packet sizes? And in the inverse case, if an
> app can deal with small packets, is it not likely able to take the extra hit for the jumbo frames? This was our thinking when doing the
> cache-line split, but perhaps the logic doesn't hold up in the cases you refer to.
>
> Regards,
> /Bruce
>
> From: Dave Barach (dbarach) [mailto:dbarach@cisco.com]
> Sent: Wednesday, June 17, 2015 3:37 PM
> To: Richardson, Bruce; olivier.matz@6wind.com; Ananyev, Konstantin
> Cc: Damjan Marion (damarion)
> Subject: RE: [dpdk-dev] rte_mbuf.next in 2nd cacheline
>
> Dear Bruce,
>
> We've implemented a number of use cases which can't - or at least don't currently - use hardware offload data. Examples: L2 bridging,
> integrated routing and bridging, various flavors of tunneling, ipv6 MAP, ipv6 segment routing, and so on.
>
> You're not considering the cost of the additional pressure on the load-store unit and memory system caused when mb->next must be
> prefetched, set, and cleared. It doesn't matter that the packets are large, the cost still exists. As we transition to 40 and 100G NICs,
> large-packet PPS-per-core becomes more of an issue.
>
> Damjan has offered to make the layout of the buffer metadata configurable, so that folks can "have it their way." Given that it's a ~10
> line change - with minimal comedic potential - it seems like a reasonable way to go.
>
> Thoughts?
>
> Thanks.
> Dave Barach
> Cisco Fellow
>
> From: Damjan Marion (damarion)
> Sent: Wednesday, June 17, 2015 10:17 AM
> To: Dave Barach (dbarach)
> Subject: Fwd: [dpdk-dev] rte_mbuf.next in 2nd cacheline
>
>
> Begin forwarded message:
>
> From: Bruce Richardson
> Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> Date: 17 Jun 2015 16:06:48 CEST
> To: "Damjan Marion (damarion)"
> Cc: Olivier MATZ, "Ananyev, Konstantin", "dev@dpdk.org"
>
> On Wed, Jun 17, 2015 at 01:55:57PM +0000, Damjan Marion (damarion) wrote:
>
> On 15 Jun 2015, at 16:12, Bruce Richardson wrote:
>
> The next pointers always start out as NULL when the mbuf pool is created. The
> only time it is set to non-NULL is when we have chained mbufs. If we never have
> any chained mbufs, we never need to touch the next field, or even read it - since
> we have the num-segments count in the first cache line. If we do have a multi-segment
> mbuf, it's likely to be a big packet, so we have more processing time available
> and we can then take the hit of setting the next pointer.
>
> There are applications which are not using rx offload, but they do deal with chained mbufs.
> Why are they less important than the ones using rx offload? This is something people
> should be able to configure at build time.
>
> It's not that they are less important, it's that the packet-processing cycle count
> budget is going to be greater. A packet which is 64 bytes or 128 bytes in size
> can make use of a number of RX offloads to reduce its processing time. However,
> a 64/128-byte packet is not going to be split across multiple buffers [unless we
> are dealing with a very unusual setup!].
>
> To handle 64-byte packets at 40G line rate, one has 50 cycles per core per packet
> when running at 3GHz [3000000000 cycles / 59.5 mpps].
> If we assume that we are dealing with fairly small buffers
> here, and that anything greater than 1k is chained, we still have 626
> cycles per 3GHz core per packet to work with for that 1k packet. Given that
> "normal" DPDK buffers are 2k in size, we have over a thousand cycles per packet
> for any packet that is split.
>
> In summary, packets spread across multiple buffers are large packets, and so have
> larger packet cycle count budgets and can much better absorb the cost of
> touching a second cache line in the mbuf than a 64-byte packet can. Therefore,
> we optimize for the 64B packet case.
>
> Hope this clarifies things a bit.
>
> Regards,
> /Bruce
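For reference, the cycle budgets quoted above can be reproduced with a quick back-of-envelope calculation (a sketch; it assumes the usual 20 bytes of Ethernet preamble plus inter-frame gap per frame):

  #include <stdio.h>

  /* Back-of-envelope check of the 50 / 626 cycles-per-packet figures. */
  int main(void)
  {
          double line_rate = 40e9;                 /* 40 Gb/s               */
          double cpu_hz = 3e9;                     /* 3 GHz core            */
          double wire_bits_64 = (64 + 20) * 8;     /* 64B frame + overhead  */
          double wire_bits_1k = (1024 + 20) * 8;   /* 1kB frame + overhead  */

          double pps_64 = line_rate / wire_bits_64;   /* ~59.5 Mpps */
          double pps_1k = line_rate / wire_bits_1k;   /* ~4.8 Mpps  */

          printf("64B: %.1f Mpps -> %.0f cycles/packet\n",
                 pps_64 / 1e6, cpu_hz / pps_64);      /* ~50 cycles  */
          printf("1kB: %.1f Mpps -> %.0f cycles/packet\n",
                 pps_1k / 1e6, cpu_hz / pps_1k);      /* ~626 cycles */
          return 0;
  }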