From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by dpdk.org (Postfix) with ESMTP id 7E0482949 for ; Thu, 9 Jun 2016 15:21:22 +0200 (CEST) Received: from fmsmga004.fm.intel.com ([10.253.24.48]) by orsmga101.jf.intel.com with ESMTP; 09 Jun 2016 06:21:21 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.26,444,1459839600"; d="scan'208";a="118941693" Received: from irsmsx154.ger.corp.intel.com ([163.33.192.96]) by fmsmga004.fm.intel.com with ESMTP; 09 Jun 2016 06:21:20 -0700 Received: from irsmsx105.ger.corp.intel.com ([169.254.7.51]) by IRSMSX154.ger.corp.intel.com ([169.254.12.28]) with mapi id 14.03.0248.002; Thu, 9 Jun 2016 14:21:19 +0100 From: "Ananyev, Konstantin" To: Olivier Matz , "dev@dpdk.org" , Adrien Mazarguil CC: "Zhang, Helin" Thread-Topic: [dpdk-dev] [PATCH] mbuf: remove inconsistent assert statements Thread-Index: AQHRwWBHlYh+pJQko0aMG2e1V0BGOZ/fXbUQgAAQ/oCAABYKAIAAAzyAgAAD14CAAB8nYIABB64AgABWcxA= Date: Thu, 9 Jun 2016 13:21:18 +0000 Message-ID: <2601191342CEEE43887BDE71AB97725836B6D90C@irsmsx105.ger.corp.intel.com> References: <1465374688-11729-1-git-send-email-adrien.mazarguil@6wind.com> <2601191342CEEE43887BDE71AB97725836B6CE18@irsmsx105.ger.corp.intel.com> <20160608122723.GK7621@6wind.com> <2601191342CEEE43887BDE71AB97725836B6CFDD@irsmsx105.ger.corp.intel.com> <20160608135751.GM7621@6wind.com> <57582797.8050300@6wind.com> <2601191342CEEE43887BDE71AB97725836B6D17A@irsmsx105.ger.corp.intel.com> <57591EEA.9040807@6wind.com> In-Reply-To: <57591EEA.9040807@6wind.com> Accept-Language: en-IE, en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [163.33.239.180] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Subject: Re: [dpdk-dev] [PATCH] mbuf: remove inconsistent assert statements X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: patches and discussions about DPDK List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 09 Jun 2016 13:21:23 -0000 Hi Olivier, > -----Original Message----- > From: Olivier Matz [mailto:olivier.matz@6wind.com] > Sent: Thursday, June 09, 2016 8:47 AM > To: Ananyev, Konstantin; dev@dpdk.org; Adrien Mazarguil > Subject: Re: [dpdk-dev] [PATCH] mbuf: remove inconsistent assert statemen= ts >=20 > Hi Konstantin, >=20 > >>>>>> Yes, it refcnt supposed to be set to 0 by __rte_pktmbuf_prefree_se= g(). > >>>>>> Wright now, it is a user responsibility to make sure refcnt=3D=3D0= before pushing > >>>>>> mbuf back to the pool. > >>>>>> Not sure why do you consider that wrong? > >>>>> > >>>>> I do not consider this wrong and I'm all for using assert() to catc= h > >>>>> programming errors, however in this specific case, I think they are > >>>>> inconsistent and misleading. > >>>> > >>>> Honestly, I don't understand why. > >>>> Right now the rule of thumb is - when mbuf is in the pool, it's refc= nt should be equal zero. > >> > >> What is the purpose of this? Is there some code relying on this? > > > > The whole current implementation of mbuf_free code path relies on that. > > Straight here: > > if (likely(NULL !=3D (m =3D __rte_pktmbuf_prefree_seg(m)))) { > > m->next =3D NULL; > > __rte_mbuf_raw_free(m); > > } > > > > If we'll exclude indirect mbuf logic, all it does: > > if (rte_mbuf_refcnt_update(m, -1) =3D=3D 0) { > > m->next =3D NULL; > > __rte_mbuf_raw_free(m); > > } > > > > I.E.: > > decrement mbuf->refcnt. > > If new value of refcnt is zero, then put it back into the pool. > > > > So having ASERT(mbuf->refcnt=3D=3D0) inside > > __rte_mbuf_raw_free()/__rte_mbuf_raw_alloc() > > looks absolutely valid to me. > > I *has* to be zero at that point with current implementation, > > And if it is not then we probably have (or will have) a silent memory c= orruption. >=20 > This explains how the refcount is used, and why it is set > to zero before returning the mbuf to the pool with the mbuf > free functions. >>From my point, that shows that rte_pktmbuf_free() relies on the value of th= e refcnt to make a decision is it ok to put mbuf back to the pool or not. Right now it puts mbuf to the pool *only* if it's refcnt=3D=3D0. As discussed below, we probably can change it to be refcnt=3D=3D1 (if there really would be noticeable performance gain). But I think it still should be just one predefined value of refcnt (0 or 1)= . In theory it is possible to allow both (0 and 1), but that will make it hard to debug any alloc/free issues, plus would neglect any possible performance gain - as in that case raw_alloc (or it's analog) should still do mbuf->refcnt=3D1;=20 >=20 > It does not explain which code relies on the refcnt beeing 0 > while the mbuf is in the pool. >=20 >=20 > >> But since [1], it is allowed to call rte_mbuf_raw_alloc() in PMDs (all > >> PMDs were calling an internal function before). > > > > Yes, raw_alloc is public, NP with that. > > > >> We could argue that > >> rte_mbuf_raw_free() should also be made public for PMDs. > > > > We could, but right now it is not. > > Again, as I said, user could use it on his own but he obviously has to > > obey the rules and do manually what __rte_pktmbuf_prefree_seg() does. > > At least: > > > > rte_mbuf_refcnt_set(m, 0); > > __rte_mbuf_raw_free(m); > > > >> > >> As you said below, no-one promised that the free() reverts the malloc(= ), > >> but given the function names, one can legitimately expect that the > >> following code is valid: > >> > >> m =3D __rte_mbuf_raw_alloc(); > >> /* do nothing here */ > >> __rte_mbuf_raw_free(m); > > > > Surely people could (and would) expect various things... > > But the reality right now is: __rte_mbuf_raw_free() is an internal > > function and not counterpart of __rte_mbuf_raw_alloc(). > > If the people don't bother to read API docs or/and the actual code, > > I can't see how we can help them :) >=20 > Yes, of course, people should read the doc. > This does not prevent to have a nice API that behaves in a > natural way :) >=20 > By the way, the fact that today the mbuf refcnt should be 0 while > in a pool is not in the api doc, but in the code. Ok, I admit there is a bug in the docs, let's add this line to the PG and f= ix it :) >=20 > >> If no code relies on having the refcnt set to 0 when a mbuf is in > >> the pool, I suggest to relax this constraint as Adrien proposed. > > > > Why not just rename it to __rte_mbuf_raw_free_dont_use_after_raw_alloc(= )? > > To avoid any further confusion :) > > Seriously speaking I would prefer to leave it as it is. > > If you feel we have to introduce a counterpart of rte_mbuf_raw_alloc()= , > > we can make a new public one: > > > > rte_mbuf_raw_free(stuct rte_mbuf *m) > > { > > if (rte_mbuf_refcnt_update(m, -1) =3D=3D 0) > > __rte_mbuf_raw_free(m); > > } >=20 > This is an option, but I think it's not efficient to touch > the mbuf structure when allocating/freeing. See below. I don't think it is totally avoidable for generic case anyway. >=20 > >> Then, my opinion is that the refcount should be set to 1 in > >> rte_pktmbuf_reset(). > > > > I don't think we need to update rte_pktmbuf_reset(), > > it doesn't touch refcnt at all and probably better to keep it that way. >=20 > Why would it be better? First because pktmbuf_reset() is not very efficient one, and many PMDs avoid to use it. Second I think managing refcnt is more part of alloc/free then init procedu= re. > All mbuf struct initializations are done in that function, why would > it be different for the refcnt? >=20 > > To achieve what you suggesting, we probably need to: > > 1) update _rte_pktmbuf_prefree_seg and rte_pktmbuf_detach() to > > set refcnt back to 1, i.e: > > > > static inline struct rte_mbuf* __attribute__((always_inline)) > > __rte_pktmbuf_prefree_seg(struct rte_mbuf *m) > > { > > __rte_mbuf_sanity_check(m, 0); > > > > if (likely(rte_mbuf_refcnt_update(m, -1) =3D=3D 0)) { > > /* if this is an indirect mbuf, it is detached. */ > > if (RTE_MBUF_INDIRECT(m)) > > rte_pktmbuf_detach(m); > > + rte_mbuf_refcnt_set(m, 1); > > return m; > > } > > return NULL; > > } > > > > 2) either: > > a) update mbuf constructor function, so it sets refcnt=3D1. > > I suppose that is easy for rte_pktmbuf_init(), but it means tha= t all custom > > constructors should do the same. > > Which means possible changes in existing user code and all ABI = change related hassle. > > b) keep rte_mbuf_raw_alloc() setting mbuf->refcnt=3D1. > > But then I don't see how will get any performance gain here. > > > > So not sure is it really worth it. > > > >> And rte_pktmbuf_free() should not be allowed on > >> an uninitialized mbuf (yes, this would require some changes in PMDs). > > > > Not sure I understand you here... > > free() wouldn not be allowed on mbuf whose recnf=3D=3D0? > > But it is not allowed right now anyway. >=20 > Today: >=20 > /* allowed */ > m =3D rte_pktmbuf_alloc(); > rte_pktmbuf_free(m); >=20 > /* not allowed */ > m =3D rte_mbuf_raw_alloc(); > __rte_mbuf_raw_free(m); >=20 > /* we should do instead (strange): */ > m =3D rte_mbuf_raw_alloc(); > rte_pktmbuf_free(m); >=20 > What I suggest to have: >=20 > /* allowed, no change */ > m =3D rte_pktmbuf_alloc(); > rte_pktmbuf_free(m); >=20 > /* allowed, these functions would be symetrical */ > m =3D rte_mbuf_raw_alloc(); > rte_mbuf_raw_free(m); >=20 > /* not allowed, m->refcnt is uninitialized */ > m =3D rte_mbuf_raw_alloc(); > rte_pktmbuf_free(m); Hmm, and what it will buy us (except of symmetry)? >=20 >=20 >=20 > > > >> This would open the door for bulk allocation/free in the mbuf api. > > > > Hmm and what stops us having one right now? > > As I know we do have rte_pktmbuf_alloc_bulk(), I don't see > > why we can't have rte_pktmbuf_free_bulk() right now. >=20 > I think, not touching the refcnt on raw allocation allows a PMD to > do the following: >=20 > rte_mbuf_allow_raw_bulk(pool, &mbufs, n) > for (i =3D 0; i < n; i++) { > prefetch(m[i+1]); > rte_pktmbuf_reset(); /* including refcnt =3D 1 */ > do_stuff(m); > } You can do that part right now. That what many PMDs do, just function names are a bit different: rte_mempool_get_bulk(pool, &mbufs, n); for (i =3D 0; i < n; i++) { custom_rte_pktmbuf_fill(); /* including refcnt =3D 1 */ do_stuff(m); } > /* if all mbufs are not used, free the remaining */ > rte_mbuf_free_raw_bulk(&mbufs[i], n-i); Ok and what be a performance critical use-case for that? As I know, the only possible cases inside PMD when it allocates & free same mbuf internally are: - error handling, queue stop (which usually not that performance critical). - situation when you have cut crc bytes and they are in separate segment (again a corner case). What is the use case you have in your mind for that? >=20 > In that case, we can prefetch mbufs before using them, You can do that now, look at rte_pktmbuf_alloc_bulk(). Nothing stops you to add prefetch() here, if it will be really faster - extra parameter for rte_pktmbuf_alloc_bulk() or new function rte_pktmbuf_al= loc_bulk_prefetch(). > and > we can free the unused mbufs without touching the structure. In many case this is not possible anyway: you can't guarantee that all mbufs you get from upper layer are from the sa= me pool, and they all have refcnt=3D=3D1. There are probably some custom cases when you can safely assume that, but I am not sure we should restructure all mbuf API for that particular ca= se. In such situations some custom function should do. > It looks to be a good advantage. >=20 > Yes, we can do that with mempool functions. But I feel having a mbuf > API with type checking is better. >=20 > >> This could be done in several steps: > >> > >> 1/ remove these assert(), add introduce a public rte_mbuf_raw_free() > >> 2/ announce that rte_pktmbuf_free() won't work on uninitialized mbufs > >> in a future version, and ensure that all PMD are inline with this > >> requirement > >> 3/ later, move refcount initialization in rte_pktmbuf_reset() > >> > >> What do you think? > > > > I still think that assert() is on a right place :) > > *If* we'll change mbuf free code in a way that mbufs inside the pool > > should have refcnt=3D=3D1, then I think we'll change it to: > > RTE_ASSERT(rte_mbuf_refcnt_read(m) =3D=3D 1); > > But as I stated above, that change might cause some backward compatibil= ity hassle: 2.a) > > Or might not give us any performance gain: 2.b). >=20 > I suggest instead to have no constraint on the value of the refcnt > in the pool. >=20 > >> Another option is to remove the rte_mbuf_raw_alloc()/rte_mbuf_raw_free= () > >> and use mempool calls. > > > > Don't see why we have to remove them... > > Basically we have a bug in PMD, but instead of fixing it, > > you guys suggest to change mbuf code. > > Sounds a bit strange to me. > > Why not just make for these PMDs to call rte_mempool_put() directly, if= you are sure it is safe here? >=20 > Yes, there are some bugs in the PMDs regarding the refcnt value when > compiled with MBUF_DEBUG=3Dy. Ok, then I suggest we fix them first, then start to discuss possible mbuf o= ptimisations. It is really strange when invalid usage of generic API in some submodules d= rives the generic API change. > By the way, looking at the i40e code, we > have: >=20 > i40e_alloc_rx_queue_mbufs() > { > for (...) { > struct rte_mbuf *mbuf =3D rte_mbuf_raw_alloc(rxq->mp); > ... > rte_mbuf_refcnt_set(mbuf, 1); /* not needed */ > ... > } Yes, I agree it is unnecessary here, and could be safely removed, but how it relates to our discussion? >=20 > i40e_tx_free_bufs() > { > ... > if (txq->txq_flags & (uint32_t)ETH_TXQ_FLAGS_NOREFCOUNT) { > for (...) { > /* returning a mbuf with refcnt=3D1 */ > rte_mempool_put(txep->mbuf->pool, txep->mbuf); > ... Hmm, from what I see that code is not correct anyway - as we still can't guarantee that all mbufs we TX-ed are from the same pool. So I think that code has to be removed. Probably i40e maintainers have a better explanation why it is here. Again, personally I think the ETH_TXQ_FLAGS_NOREFCOUNT flag should also be removed (or deprecated) - I think it was left here from old days. >=20 > The fact that we can find many bugs related to that in PMDs is a > sign that the API is not understandable enough. For me it is just a sign that we need put a better effort in writing/review= ing the code. >=20 > The point is not to fix the bugs in PMDs. I think the point is > to enhance the mbuf API. As I said - I think bugs in PMDs shouldn't drive API changes :) They need to be fixed first. BTW, I looked who is using __rte_mbuf_raw_free() in the current DPDK code. I found only two instances: 1) drivers/net/mpipe/mpipe_tilegx.c: Inside mpipe_recv_flush_stack(struct mpipe_dev_priv *priv) As I can see that function does set mbuf->refcnt=3D1 before calling __raw_f= ree. So we should be fine here, I think. =20 2) drivers/net/ena/ena_ethdev.c Inside ena_rx_queue_release_bufs(). Again that one seems ok, as I can see from the code: ena PMD calls rte_mempool_get_bulk() and keep mbuf->refcnt=3D=3D0 till either: - that mbuf is freed by ena_rx_queue_release_bufs()=20 - or filled by eth_ena_recv_pkts() - at that stage mbuf->recnt is set to 1. So another question is what PMD is failing? >=20 > Hope I have convinced you ;) >=20 > > > >> But having a mbuf wrapper does not seem a bad > >> thing to me. > > > > We can add some extra wrapper then, something like: > > #define __RTE_MBUF_PUT_FREED(m) (rte_mempool_put((m)->pool, m)) > > ? >=20 > I think a wrapper should do the type checking (struct mbuf). Sure, we can have a wrapper inline function, this is just an example. >=20 > Thanks for this exchange. > Regards, > Olivier