DPDK patches and discussions
 help / color / mirror / Atom feed
From: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>
To: Olivier Matz <olivier.matz@6wind.com>,
	"dev@dpdk.org" <dev@dpdk.org>,
	Adrien Mazarguil <adrien.mazarguil@6wind.com>
Cc: "Zhang, Helin" <helin.zhang@intel.com>
Subject: Re: [dpdk-dev] [PATCH] mbuf: remove inconsistent assert statements
Date: Thu, 9 Jun 2016 13:21:18 +0000	[thread overview]
Message-ID: <2601191342CEEE43887BDE71AB97725836B6D90C@irsmsx105.ger.corp.intel.com> (raw)
In-Reply-To: <57591EEA.9040807@6wind.com>

Hi Olivier,

> -----Original Message-----
> From: Olivier Matz [mailto:olivier.matz@6wind.com]
> Sent: Thursday, June 09, 2016 8:47 AM
> To: Ananyev, Konstantin; dev@dpdk.org; Adrien Mazarguil
> Subject: Re: [dpdk-dev] [PATCH] mbuf: remove inconsistent assert statements
> 
> Hi Konstantin,
> 
> >>>>>> Yes, it refcnt supposed to be set to 0 by __rte_pktmbuf_prefree_seg().
> >>>>>> Wright now, it is a user responsibility to make sure refcnt==0 before pushing
> >>>>>> mbuf back to the pool.
> >>>>>> Not sure why do you consider that wrong?
> >>>>>
> >>>>> I do not consider this wrong and I'm all for using assert() to catch
> >>>>> programming errors, however in this specific case, I think they are
> >>>>> inconsistent and misleading.
> >>>>
> >>>> Honestly, I don't understand why.
> >>>> Right now the rule of thumb is - when mbuf is in the pool, it's refcnt should be equal zero.
> >>
> >> What is the purpose of this? Is there some code relying on this?
> >
> > The whole current implementation of mbuf_free code path relies on that.
> > Straight here:
> > if (likely(NULL != (m = __rte_pktmbuf_prefree_seg(m)))) {
> >                 m->next = NULL;
> >                 __rte_mbuf_raw_free(m);
> >         }
> >
> > If we'll exclude indirect mbuf logic, all it does:
> > if (rte_mbuf_refcnt_update(m, -1) == 0) {
> > 	m->next = NULL;
> > 	__rte_mbuf_raw_free(m);
> > }
> >
> > I.E.:
> > decrement mbuf->refcnt.
> > If new value of refcnt is zero, then put it back into the pool.
> >
> > So having ASERT(mbuf->refcnt==0) inside
> > __rte_mbuf_raw_free()/__rte_mbuf_raw_alloc()
> > looks absolutely valid to me.
> > I *has* to be zero at that point with current implementation,
> > And if it is not then we probably have (or will have) a silent memory corruption.
> 
> This explains how the refcount is used, and why it is set
> to zero before returning the mbuf to the pool with the mbuf
> free functions.

>From my point, that shows that rte_pktmbuf_free() relies on the value of the refcnt
to make a decision is it ok to put mbuf back to the pool or not.
Right now it puts mbuf to the pool *only* if it's refcnt==0.
As discussed below, we probably can change it to be refcnt==1
(if there really would be noticeable performance gain).
But I think it still should be just one predefined value of refcnt (0 or 1).
In theory it is possible to allow both (0 and 1),
but that will make it hard to debug any alloc/free issues,
plus would neglect any possible performance gain -
as in that case raw_alloc (or it's analog) should still do
mbuf->refcnt=1; 

> 
> It does not explain which code relies on the refcnt beeing 0
> while the mbuf is in the pool.
> 
> 
> >> But since [1], it is allowed to call rte_mbuf_raw_alloc() in PMDs (all
> >> PMDs were calling an internal function before).
> >
> > Yes, raw_alloc is public, NP with that.
> >
> >> We could argue that
> >> rte_mbuf_raw_free() should also be made public for PMDs.
> >
> > We could, but right now it is not.
> > Again, as I said, user could use it on his own but he obviously has to
> > obey the rules and do manually what __rte_pktmbuf_prefree_seg() does.
> > At least:
> >
> > rte_mbuf_refcnt_set(m, 0);
> > __rte_mbuf_raw_free(m);
> >
> >>
> >> As you said below, no-one promised that the free() reverts the malloc(),
> >> but given the function names, one can legitimately expect that the
> >> following code is valid:
> >>
> >> m = __rte_mbuf_raw_alloc();
> >> /* do nothing here */
> >> __rte_mbuf_raw_free(m);
> >
> > Surely people could (and would) expect various things...
> > But the reality right now is: __rte_mbuf_raw_free() is an internal
> > function and not counterpart of __rte_mbuf_raw_alloc().
> > If the people don't bother to read API docs or/and the actual code,
> > I can't see how we can help them :)
> 
> Yes, of course, people should read the doc.
> This does not prevent to have a nice API that behaves in a
> natural way :)
> 
> By the way, the fact that today the mbuf refcnt should be 0 while
> in a pool is not in the api doc, but in the code.

Ok, I admit there is a bug in the docs, let's add this line to the PG and fix it :)

> 
> >> If no code relies on having the refcnt set to 0 when a mbuf is in
> >> the pool, I suggest to relax this constraint as Adrien proposed.
> >
> > Why not just rename it to __rte_mbuf_raw_free_dont_use_after_raw_alloc()?
> > To avoid any further confusion :)
> > Seriously speaking I would prefer to leave it as it is.
> > If you feel we have to introduce a counterpart of  rte_mbuf_raw_alloc(),
> > we can make a new public one:
> >
> > rte_mbuf_raw_free(stuct rte_mbuf *m)
> > {
> >       if (rte_mbuf_refcnt_update(m, -1) == 0)
> >                 __rte_mbuf_raw_free(m);
> > }
> 
> This is an option, but I think it's not efficient to touch
> the mbuf structure when allocating/freeing. See below.

I don't think it is totally avoidable for generic case anyway.

> 
> >> Then, my opinion is that the refcount should be set to 1 in
> >> rte_pktmbuf_reset().
> >
> > I don't think we need to update rte_pktmbuf_reset(),
> > it doesn't touch refcnt at all and probably better to keep it that way.
> 
> Why would it be better?

First because pktmbuf_reset() is not very efficient one, and many PMDs
avoid to use it.
Second I think managing refcnt is more part of alloc/free then init procedure.

> All mbuf struct initializations are done in that function, why would
> it be different for the refcnt?
> 
> > To achieve what you suggesting, we probably need to:
> > 1) update  _rte_pktmbuf_prefree_seg and rte_pktmbuf_detach() to
> > set refcnt back to 1, i.e:
> >
> > static inline struct rte_mbuf* __attribute__((always_inline))
> > __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> > {
> >         __rte_mbuf_sanity_check(m, 0);
> >
> >         if (likely(rte_mbuf_refcnt_update(m, -1) == 0)) {
> >                 /* if this is an indirect mbuf, it is detached. */
> >                 if (RTE_MBUF_INDIRECT(m))
> >                         rte_pktmbuf_detach(m);
> > +	rte_mbuf_refcnt_set(m, 1);
> >                 return m;
> >         }
> >         return NULL;
> > }
> >
> > 2) either:
> >    a) update mbuf constructor function, so it sets refcnt=1.
> >         I suppose that is easy for rte_pktmbuf_init(), but it means that all custom
> >         constructors should do the same.
> >         Which means possible changes in existing user code and all ABI change related hassle.
> >   b) keep rte_mbuf_raw_alloc() setting mbuf->refcnt=1.
> >       But then I don't see how will get any performance gain here.
> >
> > So not sure is it really worth it.
> >
> >> And rte_pktmbuf_free() should not be allowed on
> >> an uninitialized mbuf (yes, this would require some changes in PMDs).
> >
> > Not sure I understand you here...
> > free() wouldn not be allowed on mbuf whose recnf==0?
> > But it is not allowed right now anyway.
> 
> Today:
> 
>   /* allowed */
>   m = rte_pktmbuf_alloc();
>   rte_pktmbuf_free(m);
> 
>   /* not allowed */
>   m = rte_mbuf_raw_alloc();
>   __rte_mbuf_raw_free(m);
> 
>   /* we should do instead (strange): */
>   m = rte_mbuf_raw_alloc();
>   rte_pktmbuf_free(m);
> 
> What I suggest to have:
> 
>   /* allowed, no change */
>   m = rte_pktmbuf_alloc();
>   rte_pktmbuf_free(m);
> 
>   /* allowed, these functions would be symetrical */
>   m = rte_mbuf_raw_alloc();
>   rte_mbuf_raw_free(m);
> 
>   /* not allowed, m->refcnt is uninitialized */
>   m = rte_mbuf_raw_alloc();
>   rte_pktmbuf_free(m);

Hmm, and what it will buy us (except of symmetry)?

> 
> 
> 
> >
> >> This would open the door for bulk allocation/free in the mbuf api.
> >
> > Hmm and what stops us having one right now?
> > As I know we do have rte_pktmbuf_alloc_bulk(), I don't see
> > why we can't have rte_pktmbuf_free_bulk() right now.
> 
> I think, not touching the refcnt on raw allocation allows a PMD to
> do the following:
> 
>   rte_mbuf_allow_raw_bulk(pool, &mbufs, n)
>   for (i = 0; i < n; i++) {
>     prefetch(m[i+1]);
>     rte_pktmbuf_reset(); /* including refcnt = 1 */
>     do_stuff(m);
>   }

You can do that part right now.
That what many PMDs do, just function names are a bit different:

rte_mempool_get_bulk(pool, &mbufs, n);
for (i = 0; i < n; i++) {
      custom_rte_pktmbuf_fill(); /* including refcnt = 1 */
      do_stuff(m);
}

>   /* if all mbufs are not used, free the remaining */
>   rte_mbuf_free_raw_bulk(&mbufs[i], n-i);

Ok and what be a performance critical use-case for that?
As I know, the only possible cases inside PMD when it allocates & free same
mbuf internally are:
- error handling, queue stop (which usually not that performance critical).
- situation when you have cut crc bytes and they are in separate segment
(again a corner case).
What is the use case you have in your mind for that?

> 
> In that case, we can prefetch mbufs before using them,

You can do that now, look at rte_pktmbuf_alloc_bulk().
Nothing stops you to add prefetch() here, if it will be really faster -
extra parameter for rte_pktmbuf_alloc_bulk() or new function rte_pktmbuf_alloc_bulk_prefetch().

> and
> we can free the unused mbufs without touching the structure.

In many case this is not possible anyway:
you can't guarantee that all mbufs you get from upper layer are from the same pool,
and they all have refcnt==1.
There are probably some custom cases when you can safely assume that,
but I am not sure we should restructure all mbuf API for that particular case.
In such situations some custom function should do.

> It looks to be a good advantage.
> 
> Yes, we can do that with mempool functions. But I feel having a mbuf
> API with type checking is better.
> 
> >> This could be done in several steps:
> >>
> >> 1/ remove these assert(), add introduce a public rte_mbuf_raw_free()
> >> 2/ announce that rte_pktmbuf_free() won't work on uninitialized mbufs
> >> in a future version, and ensure that all PMD are inline with this
> >> requirement
> >> 3/ later, move refcount initialization in rte_pktmbuf_reset()
> >>
> >> What do you think?
> >
> > I still think that assert() is on a right place :)
> > *If* we'll change mbuf free code in a way that mbufs inside the pool
> > should have refcnt==1, then I think we'll change it to:
> > RTE_ASSERT(rte_mbuf_refcnt_read(m) == 1);
> > But as I stated above, that change might cause some backward compatibility hassle: 2.a)
> > Or might not give us any performance gain: 2.b).
> 
> I suggest instead to have no constraint on the value of the refcnt
> in the pool.
> 
> >> Another option is to remove the rte_mbuf_raw_alloc()/rte_mbuf_raw_free()
> >> and use mempool calls.
> >
> > Don't see why we have to remove them...
> > Basically we have a bug in PMD, but instead of fixing it,
> > you guys suggest to change mbuf code.
> > Sounds a bit strange to me.
> > Why not just make for these PMDs to call rte_mempool_put() directly, if you are sure it is safe here?
> 
> Yes, there are some bugs in the PMDs regarding the refcnt value when
> compiled with MBUF_DEBUG=y.

Ok, then I suggest we fix them first, then start to discuss possible mbuf optimisations.
It is really strange when invalid usage of generic API in some submodules drives the generic API change.

> By the way, looking at the i40e code, we
> have:
> 
>   i40e_alloc_rx_queue_mbufs()
>   {
>     for (...) {
>        struct rte_mbuf *mbuf = rte_mbuf_raw_alloc(rxq->mp);
>        ...
>        rte_mbuf_refcnt_set(mbuf, 1);  /* not needed */
>      ...
>   }

Yes, I agree it is unnecessary here, and could be safely removed,
but how it relates to our discussion?

> 
>   i40e_tx_free_bufs()
>   {
>     ...
>     if (txq->txq_flags & (uint32_t)ETH_TXQ_FLAGS_NOREFCOUNT) {
>       for (...) {
>          /* returning a mbuf with refcnt=1 */
>          rte_mempool_put(txep->mbuf->pool, txep->mbuf);
>    ...

Hmm, from what I see that code is not correct anyway -
as we still can't guarantee that all mbufs we TX-ed are from the same pool.
So I think that code has to be removed.
Probably i40e maintainers have a better explanation why it is here.
Again, personally I think the ETH_TXQ_FLAGS_NOREFCOUNT flag should
also be removed (or deprecated) - I think it was left here from old days.

> 
> The fact that we can find many bugs related to that in PMDs is a
> sign that the API is not understandable enough.

For me it is just a sign that we need put a better effort in writing/reviewing the code.

> 
> The point is not to fix the bugs in PMDs. I think the point is
> to enhance the mbuf API.

As I said - I think bugs in PMDs shouldn't drive API changes :)
They need to be fixed first.

BTW, I looked who is using __rte_mbuf_raw_free() in the current DPDK code.
I found only two instances:

1) drivers/net/mpipe/mpipe_tilegx.c:
Inside mpipe_recv_flush_stack(struct mpipe_dev_priv *priv)
As I can see that function does set mbuf->refcnt=1 before calling __raw_free.
So we should be fine here, I think.
  
2) drivers/net/ena/ena_ethdev.c
Inside ena_rx_queue_release_bufs().
Again that one seems ok, as I can see from the code:
ena PMD calls  rte_mempool_get_bulk() and keep mbuf->refcnt==0
till either:
-  that mbuf is freed by ena_rx_queue_release_bufs() 
- or filled by eth_ena_recv_pkts() - at that stage mbuf->recnt is set to 1.

So another question is what PMD is failing?

> 
> Hope I have convinced you ;)
> 
> >
> >> But having a mbuf wrapper does not seem a bad
> >> thing to me.
> >
> > We can add some extra wrapper then, something like:
> > #define __RTE_MBUF_PUT_FREED(m)	(rte_mempool_put((m)->pool, m))
> > ?
> 
> I think a wrapper should do the type checking (struct mbuf).

Sure, we can have a wrapper inline function, this is just an example.

> 
> Thanks for this exchange.
> Regards,
> Olivier

  reply	other threads:[~2016-06-09 13:21 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-06-08  8:31 Adrien Mazarguil
2016-06-08 10:34 ` Ananyev, Konstantin
2016-06-08 12:27   ` Adrien Mazarguil
2016-06-08 13:09     ` Ananyev, Konstantin
2016-06-08 13:57       ` Adrien Mazarguil
2016-06-08 14:11         ` Olivier Matz
2016-06-08 16:07           ` Ananyev, Konstantin
2016-06-09  7:46             ` Olivier Matz
2016-06-09 13:21               ` Ananyev, Konstantin [this message]
2016-06-09 14:19                 ` Bruce Richardson
2016-06-09 15:27                 ` Thomas Monjalon
2016-06-09 15:45                   ` Ananyev, Konstantin
2016-06-09 18:42                     ` Don Provan
2016-06-20 13:49   ` Adrien Mazarguil

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2601191342CEEE43887BDE71AB97725836B6D90C@irsmsx105.ger.corp.intel.com \
    --to=konstantin.ananyev@intel.com \
    --cc=adrien.mazarguil@6wind.com \
    --cc=dev@dpdk.org \
    --cc=helin.zhang@intel.com \
    --cc=olivier.matz@6wind.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).