From: "Richardson, Bruce"
To: "Ananyev, Konstantin", "dev@dpdk.org", "Olivier MATZ (olivier.matz@6wind.com)"
Date: Tue, 15 Jul 2014 18:07:06 +0000
Subject: Re: [dpdk-dev] Making space in mbuf data-structure

> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Tuesday, July 15, 2014 2:31 AM
> To: Richardson, Bruce; dev@dpdk.org
> Subject: RE: Making space in mbuf data-structure
>
> Hi Bruce,
>
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Richardson, Bruce
> > Sent: Friday, July 04, 2014 12:39 AM
> > To: dev@dpdk.org
> > Subject: [dpdk-dev] Making space in mbuf data-structure
> >
> > Hi all,
> >
> > At this stage it's well recognised that we need to make more space in
> > the mbuf data structure for new fields. We in Intel have had some
> > discussion on this, and this email outlines our current thinking and
> > approach on the issue, and looks for additional suggestions and
> > feedback.
> >
> > Firstly, we believe there is no possible way we can ever fit all the
> > fields we need into a 64-byte mbuf, so we need to start looking at a
> > 128-byte mbuf instead. Examples of new fields that need to fit in
> > include: 32-64 bits for additional offload flags, 32 more bits of
> > filter information to support the new filters in the i40e driver, an
> > additional 2-4 bytes for storing info on a second vlan tag, 4 bytes
> > for storing a sequence number to enable packet reordering in future,
> > as well as potentially a number of other fields, or splitting out
> > fields that are superimposed over each other right now, e.g. for the
> > qos scheduler.
> > We also want to allow space for use by other non-Intel NIC drivers
> > that may be open-sourced to dpdk.org in the future too, where they
> > support fields and offloads that our hardware doesn't.
> >
> > If we accept the fact of a 2-cache-line mbuf, then the issue becomes
> > how to rework things so that we spread our fields over the two cache
> > lines while causing the lowest slow-down possible. The general
> > approach we are looking to take is to focus the first cache line on
> > fields that are updated on RX, so that receive only deals with one
> > cache line. The second cache line can be used for application data
> > and information that will only be used on the TX leg. This would
> > allow us to work on the first cache line in RX as now, and have the
> > second cache line prefetched in the background so that it is
> > available when necessary. Hardware prefetchers should help us out
> > here. We may also move rarely used or slow-path RX fields, such as
> > those for chained mbufs with jumbo frames, to the second cache line,
> > depending upon the performance impact and byte savings achieved.
> >
> > With this approach, we still need to make space in the first cache
> > line for information for the new or expanded receive offloads. First
> > off the blocks is to look at moving the mempool pointer into the
> > second cache line. This will free up 8 bytes in cache line 0, as this
> > is a field that is only used when cleaning up after TX. A prototype
> > patch for this change is given below, for your review and comment.
> > Initial quick tests with testpmd (testpmd -c 600 -n 4 --
> > --mbcache=250 --txqflags=0xf01 --burst=32 --txrst=32 --txfreet=32
> > --rxfreet=32 --total-num-mbufs=65536) and l3fwd (l3fwd -c 400 -n 4 --
> > -p F -P --config="(0,0,10),(1,0,10),(2,0,10),(3,0,10)") showed only a
> > slight performance decrease with testpmd, and equal or slightly
> > better performance with l3fwd. These would both have been using the
> > vector PMD - I haven't looked to change the fallback TX
> > implementation yet for this change, since it's not as
> > performance-optimized and hence cycle-count sensitive.
>
> Regarding the code changes themselves:
> If I understand your code changes correctly,
> ixgbe_tx_free_bufs() will use rte_mempool_put_bulk() *only* if all
> mbufs in the bulk belong to the same mempool.
> If we have, let's say, 2 groups of mbufs from 2 different mempools, it
> wouldn't be able to use put_bulk.
> Whereas, as I remember, the current implementation is able to use
> put_bulk() in such a case.
> I think it is possible to change your code to do a better grouping.

Given two sets of mbufs from two separate pools, this proposed change
will use put_bulk for the first group encountered, but not the second.
We can look at enhancing it to work with different groups, but this
version works well where all or most mbufs in a TX queue come from a
common mempool. In the case of every second or every third packet
coming from a different mempool (for the two- or three-mempool case),
this version should also perform better, as it doesn't rely on having
long runs of packets from the same mempool.

However, this is very much WIP, so we'll see if we can come up with a
better solution if we think one is necessary. (Most apps seem to use
just one mempool per numa socket at most, though, so I don't think we
should over-optimise for multiple mempools.)
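To make the grouping discussion concrete, here is a rough sketch of the
per-run batching idea - illustrative only, not the prototype patch; the
helper name, the fixed-size staging array and the assumption that the
mbufs are already safe to return to their pools are all just for the
example:

    #include <rte_mbuf.h>
    #include <rte_mempool.h>

    #define FREE_GROUP_SZ 64    /* arbitrary staging-array size for the example */

    /*
     * Illustrative only: free a burst of mbufs, batching each run of
     * consecutive mbufs that share a mempool into a single
     * rte_mempool_put_bulk() call. Assumes refcounts/segments have
     * already been dealt with by the caller.
     */
    static inline void
    free_bufs_grouped(struct rte_mbuf **mbufs, unsigned n)
    {
        void *group[FREE_GROUP_SZ];
        struct rte_mempool *pool;
        unsigned i, cnt;

        if (n == 0)
            return;

        pool = mbufs[0]->pool;
        group[0] = mbufs[0];
        cnt = 1;

        for (i = 1; i < n; i++) {
            /* pool changed (or staging array full): flush the current run */
            if (mbufs[i]->pool != pool || cnt == FREE_GROUP_SZ) {
                rte_mempool_put_bulk(pool, group, cnt);
                pool = mbufs[i]->pool;
                cnt = 0;
            }
            group[cnt++] = mbufs[i];
        }
        rte_mempool_put_bulk(pool, group, cnt);
    }

Note that this sketch handles an arbitrary number of pool changes,
i.e. it's closer to the "better grouping" Konstantin describes than to
the first-group-only behaviour of the current prototype; whether the
extra branches are worth it in the fast path is exactly what the WIP
testing needs to show.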
>
> >
> > Beyond this change, I'm also investigating potentially moving the
> > "next" pointer to the second cache line, but it's looking harder to
> > move without serious impact, so we'll have to see there what can be
> > done. Other fields I'll examine in due course too, and send on any
> > findings I may have.
>
> Could you explain it a bit?
> I always had the impression that moving the next pointer to the 2nd
> cache line would have less impact than moving the mempool pointer.
> My thinking was: next is used only in the scatter versions of RX/TX,
> which are slower than the optimised RX/TX anyway.
> So a few extra cycles wouldn't be that noticeable.
> Though I admit I never did any measurements for that case.

I was busy with a number of things last week, so work on moving the
next pointer slowed somewhat, but I think I have something reasonable
now. The trouble with moving the next pointer was that it is explicitly
set to NULL in all our RX code paths, while the TX code path would read
but not clear it. Once I had examined all our TX and RX code paths - of
which there are more than necessary, I feel - I was able to change
things so that the NULL pointer is set on free rather than on
allocation. The final issue was then to ensure that the
rx_scattered_pkts function didn't suffer undue slowdown due to the
second cache line, and after some experimentation an extra prefetch
seems to have helped here.

At this point, I think I have a prototype with both the pool and next
pointers on a second cache line, where both the vector rx/tx and scalar
(full-featured) code paths are performing well, with a sub-5% slowdown.
However, more testing is needed with a wider range of apps and packet
types to confirm this.

>
> About space savings for the first cache line:
> - I still think that Olivier's suggestion of replacing the data
> pointer (8 bytes) with a data offset (2 bytes) makes a lot of sense.

Yes, I would tend to agree. I'm going to take this, run additional
testing on the change, and see how it fits in with everything else.
Some concerns were expressed about the potential performance impact of
such a change, so I'm going to try and follow those up so that we can
get a data-driven answer. If no performance impact can be seen (and
none has been observed so far, to the best of my knowledge), I'm all in
favour of the change.

Also, Olivier's change to merge pktmbuf into the main mbuf itself (and
eliminate ctrlmbuf) is something that I think we want to keep. It
allows us much greater freedom to move fields about to fit things in
well, with good alignment.

Regards,
/Bruce
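P.S. For anyone following along, a rough sketch of what the data-offset
idea amounts to - the struct and helper names here are made up for
illustration and are not the actual proposal or patch; the point is
simply that an 8-byte pointer is replaced by a 16-bit offset relative
to buf_addr and the data pointer is recomputed on access:

    #include <stdint.h>

    /* Illustrative fragment only: the fields relevant to the
     * data-offset scheme. */
    struct example_mbuf {
        void     *buf_addr;   /* start of the data buffer              */
        uint16_t  data_off;   /* offset of packet data within buffer   */
        uint16_t  data_len;   /* amount of data in this segment        */
        /* ... remaining mbuf fields ... */
    };

    /* Packet data address is recomputed on access instead of being
     * stored as a full pointer, saving 6 bytes in cache line 0. */
    static inline void *
    example_mbuf_mtod(const struct example_mbuf *m)
    {
        return (char *)m->buf_addr + m->data_off;
    }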