Date: Mon, 15 Jun 2015 16:23:46 +0100
From: Bruce Richardson
To: Olivier MATZ
Cc: "dev@dpdk.org", "Damjan Marion (damarion)"
Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
Message-ID: <20150615152346.GC580@bricha3-MOBL3>
In-Reply-To: <557EECFF.3090402@6wind.com>

On Mon, Jun 15, 2015 at 05:19:27PM +0200, Olivier MATZ wrote:
> 
> 
> On 06/15/2015 04:52 PM, Ananyev, Konstantin wrote:
> > Hi Olivier,
> >
> >> -----Original Message-----
> >> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> >> Sent: Monday, June 15, 2015 3:31 PM
> >> To: Richardson, Bruce
> >> Cc: Ananyev, Konstantin; dev@dpdk.org; Damjan Marion (damarion)
> >> Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> >>
> >>
> >>
> >> On 06/15/2015 04:12 PM, Bruce Richardson wrote:
> >>> On Mon, Jun 15, 2015 at 04:05:05PM +0200, Olivier MATZ wrote:
> >>>> Hi,
> >>>>
> >>>> On 06/15/2015 03:54 PM, Ananyev, Konstantin wrote:
> >>>>>
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> >>>>>> Sent: Monday, June 15, 2015 2:44 PM
> >>>>>> To: Olivier MATZ
> >>>>>> Cc: dev@dpdk.org; Damjan Marion (damarion)
> >>>>>> Subject: Re: [dpdk-dev] rte_mbuf.next in 2nd cacheline
> >>>>>>
> >>>>>> On Mon, Jun 15, 2015 at 03:20:22PM +0200, Olivier MATZ wrote:
> >>>>>>> Hi Damjan,
> >>>>>>>
> >>>>>>> On 06/10/2015 11:47 PM, Damjan Marion (damarion) wrote:
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> We noticed a 7% performance improvement simply by moving the rte_mbuf.next
> >>>>>>>> field to the 1st cache line.
> >>>>>>>>
> >>>>>>>> Currently, it falls under /* second cache line - fields only used in slow path
> >>>>>>>> or on TX */, but it is actually used in several places in the rx fast path
> >>>>>>>> (e.g. i40e_rx_alloc_bufs() sets that field to NULL).
> >>>>>>>>
> >>>>>>>> Is there anything we can do here (stop using the next field, or move it to the
> >>>>>>>> 1st cache line)?
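To make the problem concrete, here is a simplified stand-in for the layout being
discussed (field names abbreviated, not the exact upstream rte_mbuf definition) -
just enough to show that 'next' sits on the second 64-byte cache line, so writing
it in the RX path dirties a line the fast path could otherwise leave alone:

/* Simplified stand-in for the mbuf layout under discussion. Not the real
 * rte_mbuf definition; only enough fields to show which cache line 'next'
 * lands in on a machine with 64-byte cache lines. */
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

struct fake_mbuf {
	/* --- first cache line: RX fast-path fields --- */
	void     *buf_addr;
	uint64_t  buf_physaddr;
	uint16_t  buf_len;
	uint16_t  data_off;
	uint16_t  refcnt;
	uint8_t   nb_segs;
	uint8_t   port;
	uint64_t  ol_flags;
	uint32_t  packet_type;
	uint32_t  pkt_len;
	uint16_t  data_len;
	uint16_t  vlan_tci;
	uint32_t  rss;
	/* --- second cache line: "slow path / TX only" fields --- */
	struct fake_mbuf *next __attribute__((aligned(64)));
	void     *pool;
	uint64_t  tx_offload;
} __attribute__((aligned(64)));

int main(void)
{
	/* Writing 'next' during RX pulls in (and dirties) cache line 1. */
	printf("next lives in cache line %zu\n",
	       offsetof(struct fake_mbuf, next) / 64);
	return 0;
}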
> >>>>>>>
> >>>>>>> Agree, this is also something I noticed, see:
> >>>>>>> http://dpdk.org/ml/archives/dev/2015-February/014400.html
> >>>>>>>
> >>>>>>> I did not have the time to do performance testing, but it's something
> >>>>>>> I'd like to do as soon as I can. I don't see any obvious reason not to
> >>>>>>> do it.
> >>>>>>>
> >>>>>>> It seems we currently have just enough room to do it (8 bytes are
> >>>>>>> remaining in the first cache line when compiled for 64 bits).
> >>>>>>
> >>>>>> This, to me, is the obvious reason not to do it! It prevents us from taking
> >>>>>> any other RX fast-path offload fields into the mbuf.
> >>>>>>
> >>>>>> That being said, I can see why we might want to look at moving it - but from the
> >>>>>> work done in the ixgbe driver, I'd be hopeful we can get as much performance with
> >>>>>> it on the second cache line for most cases, through judicious use of prefetching,
> >>>>>> or otherwise.
> >>>>>>
> >>>>>> It took a lot of work and investigation to get free space in the mbuf - especially
> >>>>>> in cache line 0 - and I'd like to avoid just filling the cache line up again for as
> >>>>>> long as we possibly can!
> >>>>>
> >>>>> Yep, agree with Bruce here.
> >>>>> Plus, with packet_type going to be 4B and vlan_tci_outer,
> >>>>> we just don't have 8 free bytes in the first cache line any more.
> >>>>
> >>>> I don't understand why m->next would not be a better candidate than
> >>>> the rx offload fields for the first cache line. For instance, m->next
> >>>> is mandatory and must be initialized when allocating an mbuf (compare
> >>>> with m->seqn, for instance, which is also in the first cache
> >>>> line). So if we want to make some room in the first cache line, I
> >>>> think we can.
> >>>>
> >>>> To me, the only reason for not doing it now is that we don't
> >>>> have a full performance test report (several use-cases, drivers, ...)
> >>>> showing it's better.
> >>>>
> >>> Because the "next" field does not have to be set on initialization. It can
> >>> instead be set only when needed, and cleared on free if it is used.
> >>>
> >>> The next pointers always start out as NULL when the mbuf pool is created. The
> >>> only time one is set to non-NULL is when we have chained mbufs. If we never have
> >>> any chained mbufs, we never need to touch the next field, or even read it - since
> >>> we have the num-segments count in the first cache line. If we do have a multi-segment
> >>> mbuf, it's likely to be a big packet, so we have more processing time available
> >>> and we can then take the hit of setting the next pointer. Whenever we go to
> >>> free the mbufs of that packet, the code doing the freeing obviously needs to
> >>> read the next pointer so as to free all the buffers in the chain, and so it can
> >>> also reset the next pointer to NULL when doing so.
> >>>
> >>> In this way, we can ensure that the next pointer on cache line 1 is not a problem
> >>> in our fast path.
> >>
> >> This is a good idea, but looking at the drivers, it seems that today
> >> they all set m->next to NULL in the rx function. What you are suggesting
> >> is to remove all of those assignments, and document somewhere that all mbufs in a
> >> pool are supposed to have their m->next set to NULL, correct?
> >>
> >> I think what you are describing could also apply to the reference counter
> >> (set to 1 by default), right?
> >
> > We could probably reset next to NULL in __rte_pktmbuf_prefree_seg(),
> > at the same time as we reset refcnt to 0.
> > Is that what you are suggesting?
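Roughly, the scheme described above - mbufs at rest in the pool keep next == NULL,
the RX path never writes it for single-segment packets, and only the chaining code
touches it - might look like the sketch below (illustrative names, not the actual
DPDK API):

/* Illustrative sketch of the proposed invariant (made-up names, not the
 * real DPDK API): buffers sitting in the pool already have next == NULL,
 * so single-segment RX never has to touch the second cache line. */
#include <stddef.h>

struct mbuf {
	unsigned int nb_segs;     /* first cache line in the real layout */
	struct mbuf *next;        /* second cache line in the real layout */
};

/* RX refill relies on the pool invariant instead of the per-packet
 * "m->next = NULL" several drivers do today. */
void rx_fill(struct mbuf *m)
{
	m->nb_segs = 1;           /* cache line 0 only; 'next' untouched */
}

/* Chaining a segment is the only fast-path writer of 'next', and it only
 * happens for multi-segment (typically large) packets. */
void chain_seg(struct mbuf *head, struct mbuf *tail)
{
	struct mbuf *m = head;

	while (m->next != NULL)
		m = m->next;
	m->next = tail;
	head->nb_segs++;
}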
> 
> Yes, I can give it a try.
> 

Why would we need to change that function? The main free_seg function (which is
called from the rte_pktmbuf_free() function) already sets the next pointer to
NULL? Is there some edge case in the code now where we are missing setting the
next pointer to NULL on free?

Also, any code that is not using the regular free functions, i.e. mbuf_free or
free_seg - anything not starting with "_" :-) - is responsible itself for
ensuring that it frees things correctly. If it doesn't set next to NULL when it
needs to, it needs to be fixed there, rather than changing the prefree_seg
function.

/Bruce

> >
> > Konstantin
> >
> >>
> >>
> >> Olivier
> >>
> >>
> >>>
> >>> /Bruce
> >>>
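For illustration, a simplified sketch of the free-side behaviour referred to above
(not the actual rte_mbuf.h source): the regular free functions already restore
next == NULL before a buffer goes back to the pool, so the invariant holds for
anything that frees through them.

/* Simplified shape of the free path (illustrative only). */
#include <stddef.h>

struct mbuf {
	unsigned int refcnt;
	struct mbuf *next;
};

void raw_free(struct mbuf *m);            /* stand-in for the mempool put */

/* equivalent of freeing one segment through the library */
void free_seg(struct mbuf *m)
{
	if (--m->refcnt == 0) {
		m->next = NULL;           /* cleared here, on the free side */
		raw_free(m);
	}
}

/* equivalent of freeing a whole packet: walks and unlinks the chain */
void free_pkt(struct mbuf *m)
{
	while (m != NULL) {
		struct mbuf *next = m->next;

		free_seg(m);
		m = next;
	}
}

/* Code that bypasses these and hands buffers straight back to the pool is
 * responsible for clearing 'next' itself - hence "fix the caller, not
 * prefree_seg". */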