DPDK patches and discussions
From: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>
To: "Richardson, Bruce" <bruce.richardson@intel.com>,
	Olivier MATZ <olivier.matz@6wind.com>
Cc: "dev@dpdk.org" <dev@dpdk.org>
Subject: Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free
Date: Fri, 27 Mar 2015 13:22:18 +0000	[thread overview]
Message-ID: <2601191342CEEE43887BDE71AB977258214080FD@irsmsx105.ger.corp.intel.com> (raw)
In-Reply-To: <20150327131649.GA9972@bricha3-MOBL3>



> -----Original Message-----
> From: Richardson, Bruce
> Sent: Friday, March 27, 2015 1:17 PM
> To: Olivier MATZ
> Cc: Neil Horman; Ananyev, Konstantin; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free
> 
> On Fri, Mar 27, 2015 at 02:10:33PM +0100, Olivier MATZ wrote:
> > Hi Neil,
> >
> > On 03/27/2015 01:44 PM, Neil Horman wrote:
> > > On Fri, Mar 27, 2015 at 10:48:20AM +0000, Ananyev, Konstantin wrote:
> > >>
> > >>
> > >>> -----Original Message-----
> > >>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Neil Horman
> > >>> Sent: Friday, March 27, 2015 10:26 AM
> > >>> To: Wiles, Keith
> > >>> Cc: dev@dpdk.org
> > >>> Subject: Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free
> > >>>
> > >>> On Thu, Mar 26, 2015 at 09:00:33PM +0000, Wiles, Keith wrote:
> > >>>>
> > >>>>
> > >>>> On 3/26/15, 1:10 PM, "Zoltan Kiss" <zoltan.kiss@linaro.org> wrote:
> > >>>>
> > >>>>> The current way is not the most efficient: if m->refcnt is 1, the second
> > >>>>> condition is never evaluated, and we set the refcnt to 0. If refcnt > 1, the
> > >>>>> second condition fails as well, although the code suggests otherwise to branch
> > >>>>> prediction. Instead we should keep only the second condition and remove the
> > >>>>> duplicate set to zero.
> > >>>>>
> > >>>>> Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org>
> > >>>>> ---
> > >>>>> lib/librte_mbuf/rte_mbuf.h | 5 +----
> > >>>>> 1 file changed, 1 insertion(+), 4 deletions(-)
> > >>>>>
> > >>>>> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> > >>>>> index 17ba791..3ec4024 100644
> > >>>>> --- a/lib/librte_mbuf/rte_mbuf.h
> > >>>>> +++ b/lib/librte_mbuf/rte_mbuf.h
> > >>>>> @@ -764,10 +764,7 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> > >>>>> {
> > >>>>> 	__rte_mbuf_sanity_check(m, 0);
> > >>>>>
> > >>>>> -	if (likely (rte_mbuf_refcnt_read(m) == 1) ||
> > >>>>> -			likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> > >>>>> -
> > >>>>> -		rte_mbuf_refcnt_set(m, 0);
> > >>>>> +	if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> > >>>>>
> > >>>>> 		/* if this is an indirect mbuf, then
> > >>>>> 		 *  - detach mbuf
> > >>>>
> > >>>> I fell for this one too, but read Bruce's email
> > >>>> http://dpdk.org/ml/archives/dev/2015-March/014481.html
> > >>>
> > >>> This is still the right thing to do, though; Bruce's reasoning is erroneous.
> > >>
> > >> No, it is not. I believe Bruce's comment is absolutely correct here.
> > >>
> > > You and Bruce are wrong; I proved that below.
> > >
> > >>> Just because rte_mbuf_refcnt_read() returns 1 doesn't mean you
> > >>
> > >> It does.
> > >>
> > > assertions are meaningless without evidence.
> > >
> > >>> are the last user of the mbuf,
> > >>> you are only guaranteed that if the update
> > >>> operation returns zero.
> > >>>
> > >>> In other words:
> > >>> rte_mbuf_refcnt_update(m, -1)
> > >>>
> > >>> is an atomic operation
> > >>>
> > >>> if (likely (rte_mbuf_refcnt_read(m) == 1) ||
> > >>>                     likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> > >>>
> > >>>
> > >>> is not.
> > >>>
> > >>> To illustrate, on two cpus, this might occur:
> > >>>
> > >>> CPU0					CPU1
> > >>> rte_mbuf_refcnt_read			...
> > >>>    returns 1				rte_mbuf_refcnt_read
> > >>> ...					   returns 1
> > >>> execute if clause			execute if clause
> > >>
> > >>
> > >> If you have an mbuf with refcnt==N and try to call free() for it N+1 times -
> > >> it is a bug in your code.
> > > At what point in time did I indicate this was about multiple frees?  Please
> > > re-read my post.
> > >
> > >> Such code wouldn't work properly no matter whether we use:
> > >>
> > >>  if (likely (rte_mbuf_refcnt_read(m) == 1) || likely (rte_mbuf_refcnt_update(m, -1) == 0))
> > >>
> > >> or just:
> > >> if (likely (rte_mbuf_refcnt_update(m, -1) == 0))
> > >>
> > >> To illustrate it with your example:
> > >> Suppose m.refcnt==1
> > >>
> > >> CPU0 executes:
> > >>
> > >> rte_pktmbuf_free(m1)
> > >>         /* rte_mbuf_refcnt_update(m1, -1) returns 0, so we reset its refcnt and next and put the mbuf back into the pool. */
> > >>
> > >> m2 = rte_pktmbuf_alloc(pool);
> > >>       /* as m1 is 'free', alloc could return the same mbuf here, i.e. m2 == m1 */
> > >>
> > >> /* m2 refcnt ==1 start using m2 */
> > >>
> > > Really missing the point here.
> > >
> > >> CPU1 executes:
> > >> rte_pktmbuf_free(m1)
> > >>         /* rte_mbuf_refcnt_update(m1, -1) returns 0, so we reset its refcnt and next and put the mbuf back into the pool. */
> > >>
> > >> We just returned to the pool an mbuf that is still in use, causing silent memory corruption of the mbuf's contents.
> > >>
> > > Still missing the point. Please see below
> > >
> > >>>
> > >>> In the above scenario both cpus fell into the if clause because they both held a
> > >>> pointer to the same buffer and both got a return value of one, so they skipped
> > >>> the update portion of the if clause and both executed the internal block of the
> > >>> conditional expression.  You might be tempted to think that's OK, since that
> > >>> block just sets the refcnt to zero, and doing so twice isn't harmful, but the
> > >>> entire purpose of that if conditional above was to ensure that only one
> > >>> execution context ever executed the conditional for a given buffer.  Look at
> > >>> what else happens in that conditional:
> > >>>
> > >>> static inline struct rte_mbuf* __attribute__((always_inline))
> > >>> __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> > >>> {
> > >>>         __rte_mbuf_sanity_check(m, 0);
> > >>>
> > >>>         if (likely (rte_mbuf_refcnt_read(m) == 1) ||
> > >>>                         likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> > >>>
> > >>>                 rte_mbuf_refcnt_set(m, 0);
> > >>>
> > >>>                 /* if this is an indirect mbuf, then
> > >>>                  *  - detach mbuf
> > >>>                  *  - free attached mbuf segment
> > >>>                  */
> > >>>                 if (RTE_MBUF_INDIRECT(m)) {
> > >>>                         struct rte_mbuf *md = RTE_MBUF_FROM_BADDR(m->buf_addr);
> > >>>                         rte_pktmbuf_detach(m);
> > >>>                         if (rte_mbuf_refcnt_update(md, -1) == 0)
> > >>>                                 __rte_mbuf_raw_free(md);
> > >>>                 }
> > >>>                 return(m);
> > >>>         }
> > >>>         return (NULL);
> > >>> }
> > >>>
> > >>> If the buffer is indirect, another refcnt update occurs to the buf_addr mbuf,
> > >>> and in the scenario I outlined above, that refcnt will underflow, likely causing
> > >>> a buffer leak.  Additionally, the return code of this function is designed to
> > >>> indicate to the caller if they were the last user of the buffer.  In the above
> > >>> scenario, two execution contexts will be told that they were, which is wrong.
> > >>>
> > >>> Zoltan's patch is a good fix
> > >>
> > >> I don't think so.
> > >>
> > >>
> > >>>
> > >>> Acked-by: Neil Horman <nhorman@tuxdriver.com>
> > >>
> > >>
> > >> NACKed-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > >>
> > >
> > > Again, this has nothing to do with how many times you free an object and
> > > everything to do with why you use atomics here in the first place.  The purpose
> > > of the if conditional in the above code is to ensure that the contents of the
> > > conditional block only get executed a single time, correct?  Ostensibly you
> > > don't want two execution contexts getting in there at the same time, right?
> > >
> > > If you have a single buffer with refcnt=1, and two cpus are executing code that
> > > points to that buffer, and they both call __rte_pktmbuf_prefree_seg at around
> > > the same time, they can race and both wind up in that conditional block, leading
> > > to underflow of the md pointer refcnt, which is bad.
> >
> >
> > You cannot have an mbuf with refcnt=1 referenced by 2 cores; this does
> > not make sense, even with the fix you have acked.
> >
> > CPU0                                   CPU1
> >
> > m = a_common_mbuf;                     m = a_common_mbuf;
> > rte_pktmbuf_free(m) // fully atomic
> > m2 = rte_pktmbuf_alloc()
> > // m2 returned the same addr than m
> > // as it was in the pool
> >                                        // should not access m here
> >                                        // whatever the operation
> >
> >
> > Your example below just shows that the current code is wrong if
> > several cores access a mbuf with refcnt=1 at the same time. That's
> > true, but that's not allowed.
> >
> > - If you want to give an mbuf to another core, you put it in a ring
> >   and stop referencing it on core 0; there is no need for a refcnt here
> >
> > - If you want to share a mbuf with another core, you increase the
> >   reference counter before sending it to core 1. Then, both cores
> >   will have to call rte_pktmbuf_free().
> >
> >
> >
> > Regards,
> > Olivier
> >
> >
> 
> +1
> 
> Also note that the function this comment is being added to is a free-type
> function. If you are in that function, you are freeing the mbuf, so calling
> it from multiple cores simultaneously is a double-free error.
> 
> /Bruce

Another +1 
Was going to write a big reply, but realised Olivier and Bruce already wrote what I planned to :)
Just to repeat:
Neil, it is an invalid scenario to call free() twice for an mbuf whose refcnt==1.
It doesn't matter whether that happens concurrently from different threads or sequentially from the same thread.
You can't make it work properly with or without Zoltan's patch.
If that happens, it is a bug in your code and, as Bruce pointed out in another mail,
it just voids the concept of reference counting.
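
To make the invalid scenario concrete, here is a minimal sketch (the lcore functions and the 'shared' pointer are illustrative, not from the patch): an mbuf whose refcnt is 1 is freed by two lcores, which is a double free no matter which form of the test __rte_pktmbuf_prefree_seg uses.

#include <rte_mbuf.h>

/* 'shared' points to an mbuf with refcnt == 1 that both lcores can see. */
static struct rte_mbuf *shared;

static int lcore_a(void *arg)
{
        (void)arg;
        rte_pktmbuf_free(shared);  /* refcnt 1 -> 0, mbuf goes back to the pool */
        return 0;
}

static int lcore_b(void *arg)
{
        (void)arg;
        rte_pktmbuf_free(shared);  /* second free of the same single reference:
                                    * a double-free bug, whether the prefree test
                                    * is read||update or update only */
        return 0;
}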
As a rule of thumb: if you are going to pass an object to an entity which will free it,
and you still plan to use that object, then it is your responsibility to increment its reference count first.
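
Applied to mbufs, that rule looks roughly like the sketch below. This is a minimal sketch, assuming a ring is used for the hand-off and that rte_ring_enqueue() either enqueues the object or returns a negative value without enqueuing it; the function names share_mbuf/consume_mbuf are illustrative, not existing API.

#include <rte_mbuf.h>
#include <rte_ring.h>

/* Core 0: hand the mbuf to core 1 but keep using it afterwards. */
static int
share_mbuf(struct rte_ring *to_core1, struct rte_mbuf *m)
{
        /* Take the extra reference *before* the mbuf leaves our hands. */
        rte_mbuf_refcnt_update(m, 1);           /* refcnt: 1 -> 2 */

        if (rte_ring_enqueue(to_core1, m) < 0) {
                /* Hand-off failed, nothing was enqueued: drop the extra ref. */
                rte_pktmbuf_free(m);
                return -1;
        }

        /* ... keep using m on core 0 ... */
        rte_pktmbuf_free(m);                    /* drop core 0's reference */
        return 0;
}

/* Core 1: consume the mbuf and release its own reference. */
static int
consume_mbuf(struct rte_ring *from_core0)
{
        struct rte_mbuf *m;

        if (rte_ring_dequeue(from_core0, (void **)&m) < 0)
                return -1;

        /* ... use m ... */
        rte_pktmbuf_free(m);    /* whichever free runs last returns m to the pool */
        return 0;
}

Whichever core calls rte_pktmbuf_free() last drops the refcnt to zero and returns the mbuf to its pool; neither core ever touches an mbuf it does not hold a reference to.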

Konstantin


> 
> >
> >
> > >
> > > Let's look at another, more practical example.  Let's imagine that the mbuf X
> > > is linked into a set that multiple cpus can query. A reference to X is held by CPU0,
> > > and X is about to be freed using the above refcnt test model (a read followed by
> > > an update that gets squashed, and a refcnt set in the free block).  Basically, this
> > > pseudo code:
> > >
> > > if (refcnt_read(X) == 1 || refcnt_update(X, -1) == 0) {
> > > 	refcnt_set(X,0)
> > > 	mbuf_free(X)
> > > }
> > >
> > > At the same time, CPU1 is performing a lookup of our needed mbuf from the
> > > aforementioned set, finds it, and takes a refcnt on it.
> > >
> > >
> > > CPU0						CPU1
> > > if(refcnt_read(X))				search for mbuf X
> > >      returns 1					get pointer to X
> > > ...						refcnt_update(X,1)
> > > refcnt_set(X, 0)				...
> > > mbuf_free(X)
> > >
> > >
> > > After the above sequence, X is freed but CPU1 is left thinking that it has a
> > > valid reference to the mbuf.  This is broken.
> > >
> > > As an alternate thought experiment, why use atomics here at all?  x86 is cache
> > > coherent, right?  (Ignore the newer processor support, as this code predates it.)
> > > If all cpus are able to see a consistent state of a variable, and if every
> > > context that has a pointer to a given mbuf also has a reference to that mbuf, then
> > > it should be safe to simply use an integer here rather than an atomic, right?
> > > If you know that you have a reference to a pointer, just decrement the refcnt
> > > and check for 0 instead of 1; that will tell you that you are the last user of
> > > a buffer, right?  The answer is you can't, because there are conditions in which
> > > you either need to make a set of operations atomic (finding a pointer and
> > > increasing said refcnt under the protection of a lock), or you need some method
> > > to predicate the execution of some initialization or finalization event (like in
> > > __rte_pktmbuf_prefree_seg), so that you don't have multiple contexts doing that
> > > same init/finalization and so that you don't create small windows of
> > > inconsistency in your atomics, which is what you have above.
> > >
> > > I wrote a demonstration program to illustrate (forgive me, it's pretty quick and
> > > dirty), but I think it illustrates the point:
> > >
> > > #define _GNU_SOURCE
> > > #include <stdlib.h>
> > > #include <stdio.h>
> > > #include <stdint.h>
> > > #include <pthread.h>
> > > #include <stdatomic.h>
> > >
> > > atomic_uint_fast64_t refcnt;
> > >
> > > unsigned int threads = 0;
> > >
> > > static void * thread_exec(void *arg)
> > > {
> > >         int i;
> > >         int cpu = (int)(uintptr_t)arg;
> > >         cpu_set_t cpuset;
> > >         pthread_t thread;
> > >
> > >         thread = pthread_self();
> > >         CPU_ZERO(&cpuset);
> > >         CPU_SET(cpu, &cpuset);
> > >         pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
> > >
> > >         for (i=0; i < 1000; i++) {
> > >                 if (((atomic_fetch_sub(&refcnt, 0) == 1) ||
> > >                         atomic_fetch_sub(&refcnt, 1) == 0)) {
> > >                         // There should only ever be one thread in here at a time
> > >                         atomic_store(&refcnt, 0);
> > >                         threads |= cpu;
> > >                         printf("threads = %u\n", threads);
> > >                         threads &= ~cpu;
> > >
> > > 			// Need to reset the refcnt for future iterations
> > > 			// but that should be fine since no other thread
> > > 			// should be in here but us
> > >                         atomic_store(&refcnt, 1);
> > >                 }
> > >         }
> > >
> > >         pthread_exit(NULL);
> > > }
> > >
> > > int main(int argc, char **argv)
> > > {
> > >         pthread_attr_t attr;
> > >         pthread_t thread_id1, thread_id2;
> > >         void *status;
> > >
> > >         atomic_init(&refcnt, 1);
> > >
> > >         pthread_attr_init(&attr);
> > >
> > >         pthread_create(&thread_id1, &attr, thread_exec, (void *)1);
> > >         pthread_create(&thread_id2, &attr, thread_exec, (void *)2);
> > >
> > >         pthread_attr_destroy(&attr);
> > >
> > >         pthread_join(thread_id1, &status);
> > >         pthread_join(thread_id2, &status);
> > >
> > >
> > >         exit(0);
> > >
> > > }
> > >
> > >
> > > If you run this on an SMP system, you'll clearly see that occasionally the
> > > value of threads is 3.  That indicates that there are points where you have
> > > multiple contexts executing in that conditional block, which has clearly been
> > > coded to expect only one.  You can't make the assumption that every pointer has
> > > a held refcount here; you need to incur the update penalty.
> > >
> > > Neil
> > >
> >
> >
> >


Thread overview: 10+ messages
2015-03-26 18:10 Zoltan Kiss
2015-03-26 21:00 ` Wiles, Keith
2015-03-26 21:07   ` Bruce Richardson
2015-03-27 10:25   ` Neil Horman
2015-03-27 10:48     ` Ananyev, Konstantin
2015-03-27 12:44       ` Neil Horman
2015-03-27 13:10         ` Olivier MATZ
2015-03-27 13:16           ` Bruce Richardson
2015-03-27 13:22             ` Ananyev, Konstantin [this message]
2015-03-27 10:50     ` Olivier MATZ
