From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp.tuxdriver.com (charlotte.tuxdriver.com [70.61.120.58]) by dpdk.org (Postfix) with ESMTP id 70B625A73 for ; Fri, 27 Mar 2015 13:44:57 +0100 (CET) Received: from hmsreliant.think-freely.org ([2001:470:8:a08:7aac:c0ff:fec2:933b] helo=localhost) by smtp.tuxdriver.com with esmtpsa (TLSv1:AES128-SHA:128) (Exim 4.63) (envelope-from ) id 1YbTd6-0005Zm-Cn; Fri, 27 Mar 2015 08:44:56 -0400 Date: Fri, 27 Mar 2015 08:44:51 -0400 From: Neil Horman To: "Ananyev, Konstantin" Message-ID: <20150327124451.GE5375@hmsreliant.think-freely.org> References: <1427393457-7080-1-git-send-email-zoltan.kiss@linaro.org> <20150327102533.GA5375@hmsreliant.think-freely.org> <2601191342CEEE43887BDE71AB97725821407F18@irsmsx105.ger.corp.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <2601191342CEEE43887BDE71AB97725821407F18@irsmsx105.ger.corp.intel.com> User-Agent: Mutt/1.5.23 (2014-03-12) X-Spam-Score: -2.9 (--) X-Spam-Status: No Cc: "dev@dpdk.org" Subject: Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: patches and discussions about DPDK List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 27 Mar 2015 12:44:57 -0000 On Fri, Mar 27, 2015 at 10:48:20AM +0000, Ananyev, Konstantin wrote: > > > > -----Original Message----- > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Neil Horman > > Sent: Friday, March 27, 2015 10:26 AM > > To: Wiles, Keith > > Cc: dev@dpdk.org > > Subject: Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free > > > > On Thu, Mar 26, 2015 at 09:00:33PM +0000, Wiles, Keith wrote: > > > > > > > > > On 3/26/15, 1:10 PM, "Zoltan Kiss" wrote: > > > > > > >The current way is not the most efficient: if m->refcnt is 1, the second > > > >condition never evaluates, and we set it to 0. If refcnt > 1, the 2nd > > > >condition fails again, although the code suggest otherwise to branch > > > >prediction. Instead we should keep the second condition only, and remove > > > >the > > > >duplicate set to zero. > > > > > > > >Signed-off-by: Zoltan Kiss > > > >--- > > > > lib/librte_mbuf/rte_mbuf.h | 5 +---- > > > > 1 file changed, 1 insertion(+), 4 deletions(-) > > > > > > > >diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h > > > >index 17ba791..3ec4024 100644 > > > >--- a/lib/librte_mbuf/rte_mbuf.h > > > >+++ b/lib/librte_mbuf/rte_mbuf.h > > > >@@ -764,10 +764,7 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m) > > > > { > > > > __rte_mbuf_sanity_check(m, 0); > > > > > > > >- if (likely (rte_mbuf_refcnt_read(m) == 1) || > > > >- likely (rte_mbuf_refcnt_update(m, -1) == 0)) { > > > >- > > > >- rte_mbuf_refcnt_set(m, 0); > > > >+ if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) { > > > > > > > > /* if this is an indirect mbuf, then > > > > * - detach mbuf > > > > > > I fell for this one too, but read Brucešs email > > > http://dpdk.org/ml/archives/dev/2015-March/014481.html > > > > This is still the right thing to do though, Bruce's reasoning is erroneous. > > No, it is not. I believe Bruce comments is absolutely correct here. > You and bruce are wrong, I proved that below. > > Just because the return from rte_mbuf_refcnt_read returns 1, doesn't mean you > > It does. > assertions are meaningless without evidence. > > are the last user of the mbuf, > > you are only guaranteed that if the update > > operation returns zero. > > > > In other words: > > rte_mbuf_refcnt_update(m, -1) > > > > is an atomic operation > > > > if (likely (rte_mbuf_refcnt_read(m) == 1) || > > likely (rte_mbuf_refcnt_update(m, -1) == 0)) { > > > > > > is not. > > > > To illustrate, on two cpus, this might occur: > > > > CPU0 CPU1 > > rte_mbuf_refcnt_read ... > > returns 1 rte_mbuf_refcnt_read > > ... returns 1 > > execute if clause execute if clause > > > If you have an mbuf with refcnt==N and try to call free() for it N+1 times - > it is a bug in your code. At what point in time did I indicate this was about multiple frees? Please re-read my post. > Such code wouldn't work properly doesn't matter do we use: > > if (likely (rte_mbuf_refcnt_read(m) == 1) || likely (rte_mbuf_refcnt_update(m, -1) == 0)) > > or just: > if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) > > To illustrate it with your example: > Suppose m.refcnt==1 > > CPU0 executes: > > rte_pktmbuf_free(m1) > /*rte_mbuf_refcnt_update(m1, -1) returns 0, so we reset I'ts refcnt and next and put mbuf back to the pool.*/ > > m2 = rte_pktmbuf_alloc(pool); > /*as m1 is 'free' alloc could return same mbuf here, i.e: m2 == m1. */ > > /* m2 refcnt ==1 start using m2 */ > Really missing the point here. > CPU1 executes: > rte_pktmbuf_free(m1) > /*rte_mbuf_refcnt_update(m1, -1) returns 0, so we reset I'ts refcnt and next and put mbuf back to the pool.*/ > > We just returnend to the pool mbuf that is in use and caused silent memory corruption of the mbuf's content. > Still missing the point. Please see below > > > > In the above scenario both cpus fell into the if clause because they both held a > > pointer to the same buffer and both got a return value of one, so they skipped > > the update portion of the if clause and both executed the internal block of the > > conditional expression. you might be tempted to think thats ok, since that > > block just sets the refcnt to zero, and doing so twice isn't harmful, but the > > entire purpose of that if conditional above was to ensure that only one > > execution context ever executed the conditional for a given buffer. Look at > > what else happens in that conditional: > > > > static inline struct rte_mbuf* __attribute__((always_inline)) > > __rte_pktmbuf_prefree_seg(struct rte_mbuf *m) > > { > > __rte_mbuf_sanity_check(m, 0); > > > > if (likely (rte_mbuf_refcnt_read(m) == 1) || > > likely (rte_mbuf_refcnt_update(m, -1) == 0)) { > > > > rte_mbuf_refcnt_set(m, 0); > > > > /* if this is an indirect mbuf, then > > * - detach mbuf > > * - free attached mbuf segment > > */ > > if (RTE_MBUF_INDIRECT(m)) { > > struct rte_mbuf *md = RTE_MBUF_FROM_BADDR(m->buf_addr); > > rte_pktmbuf_detach(m); > > if (rte_mbuf_refcnt_update(md, -1) == 0) > > __rte_mbuf_raw_free(md); > > } > > return(m); > > } > > return (NULL); > > } > > > > If the buffer is indirect, another refcnt update occurs to the buf_addr mbuf, > > and in the scenario I outlined above, that refcnt will underflow, likely causing > > a buffer leak. Additionally, the return code of this function is designed to > > indicate to the caller if they were the last user of the buffer. In the above > > scenario, two execution contexts will be told that they were, which is wrong. > > > > Zoltans patch is a good fix > > I don't think so. > > > > > > Acked-by: Neil Horman > > > NACKed-by: Konstantin Ananyev > Again, this has nothing to do with how many times you free an object and everything to do with why you use atomics here in the first place. The purpose of the if conditional in the above code is to ensure that the contents of the conditional block only get executed a single time, correct? Ostensibly you don't want to execution contexts getting in there at the same time right? If you have a single buffer with refcnt=1, and two cpus are executing code that points to that buffer, and they both call __rte_pktmbuf_prefree_seg at around the same time, they can race and both wind up in that conditional block, leading to underflow of the md pointer refcnt, which is bad. Lets look at another more practical example. lets imagine that that the mbuf X is linked into a set that multiple cpus can query. X->refcnt is held by CPU0, and is about to be freed using the above refcnt test model (a read followed by an update that gets squashed, anda refcnt set in the free block. Basically this pseudo code: if (refcnt_read(X) == 1 || refcnt_update(X) == ) { refcnt_set(X,0) mbuf_free(X) } at the same time CPU1 is preforming a lookup of our needed mbuf from the aforementioned set, finds it and takes a refcnt on it. CPU0 CPU1 if(refcnt_read(X)) search for mbuf X returns 1 get pointer to X ... refcnt_update(X,1) refcnt_set(X, 0) ... mbuf_free(X) After the following sequence X is freed but CPU1 is left thinking that it has a valid reference to the mbuf. This is broken. As an alternate thought experiment, why use atomics here at all? X86 is cache coherent right? (ignore that new processor support, as this code predates it). If all cpus are able to see a consistent state of a variable, and if every context that has a pointer to a given mbuf also has a reference to an mbuf, then it should be safe to simply use an integer here rather than an atomic, right? If you know that you have a reference to a pointer, just decrement the refcnt and check for 0 instead of one, that will tell you that you are the last user of a buffer, right? The answer is you can't because there are conditions in which you either need to make a set of conditions atomic (finding a pointer and increasing said refcnt under the protection of a lock), or you need some method to predicate the execution of some initial or finilazation event (like in __rte_pktmbuf_prefree_seg so that you don't have multiple contexts doing that same init/finalization and so that you don't provide small windows of inconsistency in your atomics, which is what you have above. I wrote a demonstration program to illustrate (forgive me, its pretty quick and dirty), but I think it illustrates the point: #define _GNU_SOURCE #include #include #include #include atomic_uint_fast64_t refcnt; uint threads = 0; static void * thread_exec(void *arg) { int i; int cpu = (int)(arg); cpu_set_t cpuset; pthread_t thread; thread = pthread_self(); CPU_ZERO(&cpuset); CPU_SET(cpu, &cpuset); pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset); for (i=0; i < 1000; i++) { if (((atomic_fetch_sub(&refcnt, 0) == 1) || atomic_fetch_sub(&refcnt, 1) == 0)) { // There should only ever be one thread in here at a atomic_init(&refcnt, 0); threads |= cpu; printf("threads = %d\n", threads); threads &= ~cpu; // Need to reset the refcnt for future iterations // but that should be fine since no other thread // should be in here but us atomic_init(&refcnt, 1); } } pthread_exit(NULL); } int main(int argc, char **argv) { pthread_attr_t attr; pthread_t thread_id1, thread_id2; void *status; atomic_init(&refcnt, 1); pthread_attr_init(&attr); pthread_create(&thread_id1, &attr, thread_exec, (void *)1); pthread_create(&thread_id2, &attr, thread_exec, (void *)2); pthread_attr_destroy(&attr); pthread_join(thread_id1, &status); pthread_join(thread_id2, &status); exit(0); } If you run this on an smp system, you'll clearly see that, occasionally the value of threads is 3. That indicates that you have points where you have multiple contexts executing in that conditional block that has clearly been coded to only expect one. You can't make the assumption that every pointer has a held refcount here, you need to incur the update penalty. Neil