* [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free @ 2015-03-26 18:10 Zoltan Kiss 2015-03-26 21:00 ` Wiles, Keith 0 siblings, 1 reply; 10+ messages in thread From: Zoltan Kiss @ 2015-03-26 18:10 UTC (permalink / raw) To: dev; +Cc: Zoltan Kiss The current way is not the most efficient: if m->refcnt is 1, the second condition never evaluates, and we set it to 0. If refcnt > 1, the 2nd condition fails again, although the code suggest otherwise to branch prediction. Instead we should keep the second condition only, and remove the duplicate set to zero. Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org> --- lib/librte_mbuf/rte_mbuf.h | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h index 17ba791..3ec4024 100644 --- a/lib/librte_mbuf/rte_mbuf.h +++ b/lib/librte_mbuf/rte_mbuf.h @@ -764,10 +764,7 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m) { __rte_mbuf_sanity_check(m, 0); - if (likely (rte_mbuf_refcnt_read(m) == 1) || - likely (rte_mbuf_refcnt_update(m, -1) == 0)) { - - rte_mbuf_refcnt_set(m, 0); + if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) { /* if this is an indirect mbuf, then * - detach mbuf -- 1.9.1 ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free 2015-03-26 18:10 [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free Zoltan Kiss @ 2015-03-26 21:00 ` Wiles, Keith 2015-03-26 21:07 ` Bruce Richardson 2015-03-27 10:25 ` Neil Horman 0 siblings, 2 replies; 10+ messages in thread From: Wiles, Keith @ 2015-03-26 21:00 UTC (permalink / raw) To: Zoltan Kiss, dev On 3/26/15, 1:10 PM, "Zoltan Kiss" <zoltan.kiss@linaro.org> wrote: >The current way is not the most efficient: if m->refcnt is 1, the second >condition never evaluates, and we set it to 0. If refcnt > 1, the 2nd >condition fails again, although the code suggest otherwise to branch >prediction. Instead we should keep the second condition only, and remove >the >duplicate set to zero. > >Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org> >--- > lib/librte_mbuf/rte_mbuf.h | 5 +---- > 1 file changed, 1 insertion(+), 4 deletions(-) > >diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h >index 17ba791..3ec4024 100644 >--- a/lib/librte_mbuf/rte_mbuf.h >+++ b/lib/librte_mbuf/rte_mbuf.h >@@ -764,10 +764,7 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m) > { > __rte_mbuf_sanity_check(m, 0); > >- if (likely (rte_mbuf_refcnt_read(m) == 1) || >- likely (rte_mbuf_refcnt_update(m, -1) == 0)) { >- >- rte_mbuf_refcnt_set(m, 0); >+ if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) { > > /* if this is an indirect mbuf, then > * - detach mbuf I fell for this one too, but read Bruce¹s email http://dpdk.org/ml/archives/dev/2015-March/014481.html >-- >1.9.1 > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free 2015-03-26 21:00 ` Wiles, Keith @ 2015-03-26 21:07 ` Bruce Richardson 2015-03-27 10:25 ` Neil Horman 1 sibling, 0 replies; 10+ messages in thread From: Bruce Richardson @ 2015-03-26 21:07 UTC (permalink / raw) To: Wiles, Keith; +Cc: dev, Zoltan Kiss On Thu, Mar 26, 2015 at 09:00:33PM +0000, Wiles, Keith wrote: > > > On 3/26/15, 1:10 PM, "Zoltan Kiss" <zoltan.kiss@linaro.org> wrote: > > >The current way is not the most efficient: if m->refcnt is 1, the second > >condition never evaluates, and we set it to 0. If refcnt > 1, the 2nd > >condition fails again, although the code suggest otherwise to branch > >prediction. Instead we should keep the second condition only, and remove > >the > >duplicate set to zero. > > > >Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org> > >--- > > lib/librte_mbuf/rte_mbuf.h | 5 +---- > > 1 file changed, 1 insertion(+), 4 deletions(-) > > > >diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h > >index 17ba791..3ec4024 100644 > >--- a/lib/librte_mbuf/rte_mbuf.h > >+++ b/lib/librte_mbuf/rte_mbuf.h > >@@ -764,10 +764,7 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m) > > { > > __rte_mbuf_sanity_check(m, 0); > > > >- if (likely (rte_mbuf_refcnt_read(m) == 1) || > >- likely (rte_mbuf_refcnt_update(m, -1) == 0)) { > >- > >- rte_mbuf_refcnt_set(m, 0); > >+ if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) { > > > > /* if this is an indirect mbuf, then > > * - detach mbuf > > I fell for this one too, but read Bruce¹s email > http://dpdk.org/ml/archives/dev/2015-March/014481.html Looks like a code comment that really, really needs to be added to the code itself! /Bruce > >-- > >1.9.1 > > > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free 2015-03-26 21:00 ` Wiles, Keith 2015-03-26 21:07 ` Bruce Richardson @ 2015-03-27 10:25 ` Neil Horman 2015-03-27 10:48 ` Ananyev, Konstantin 2015-03-27 10:50 ` Olivier MATZ 1 sibling, 2 replies; 10+ messages in thread From: Neil Horman @ 2015-03-27 10:25 UTC (permalink / raw) To: Wiles, Keith; +Cc: dev On Thu, Mar 26, 2015 at 09:00:33PM +0000, Wiles, Keith wrote: > > > On 3/26/15, 1:10 PM, "Zoltan Kiss" <zoltan.kiss@linaro.org> wrote: > > >The current way is not the most efficient: if m->refcnt is 1, the second > >condition never evaluates, and we set it to 0. If refcnt > 1, the 2nd > >condition fails again, although the code suggest otherwise to branch > >prediction. Instead we should keep the second condition only, and remove > >the > >duplicate set to zero. > > > >Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org> > >--- > > lib/librte_mbuf/rte_mbuf.h | 5 +---- > > 1 file changed, 1 insertion(+), 4 deletions(-) > > > >diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h > >index 17ba791..3ec4024 100644 > >--- a/lib/librte_mbuf/rte_mbuf.h > >+++ b/lib/librte_mbuf/rte_mbuf.h > >@@ -764,10 +764,7 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m) > > { > > __rte_mbuf_sanity_check(m, 0); > > > >- if (likely (rte_mbuf_refcnt_read(m) == 1) || > >- likely (rte_mbuf_refcnt_update(m, -1) == 0)) { > >- > >- rte_mbuf_refcnt_set(m, 0); > >+ if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) { > > > > /* if this is an indirect mbuf, then > > * - detach mbuf > > I fell for this one too, but read Bruce¹s email > http://dpdk.org/ml/archives/dev/2015-March/014481.html This is still the right thing to do though, Bruce's reasoning is erroneous. Just because the return from rte_mbuf_refcnt_read returns 1, doesn't mean you are the last user of the mbuf, you are only guaranteed that if the update operation returns zero. In other words: rte_mbuf_refcnt_update(m, -1) is an atomic operation if (likely (rte_mbuf_refcnt_read(m) == 1) || likely (rte_mbuf_refcnt_update(m, -1) == 0)) { is not. To illustrate, on two cpus, this might occur: CPU0 CPU1 rte_mbuf_refcnt_read ... returns 1 rte_mbuf_refcnt_read ... returns 1 execute if clause execute if clause In the above scenario both cpus fell into the if clause because they both held a pointer to the same buffer and both got a return value of one, so they skipped the update portion of the if clause and both executed the internal block of the conditional expression. you might be tempted to think thats ok, since that block just sets the refcnt to zero, and doing so twice isn't harmful, but the entire purpose of that if conditional above was to ensure that only one execution context ever executed the conditional for a given buffer. Look at what else happens in that conditional: static inline struct rte_mbuf* __attribute__((always_inline)) __rte_pktmbuf_prefree_seg(struct rte_mbuf *m) { __rte_mbuf_sanity_check(m, 0); if (likely (rte_mbuf_refcnt_read(m) == 1) || likely (rte_mbuf_refcnt_update(m, -1) == 0)) { rte_mbuf_refcnt_set(m, 0); /* if this is an indirect mbuf, then * - detach mbuf * - free attached mbuf segment */ if (RTE_MBUF_INDIRECT(m)) { struct rte_mbuf *md = RTE_MBUF_FROM_BADDR(m->buf_addr); rte_pktmbuf_detach(m); if (rte_mbuf_refcnt_update(md, -1) == 0) __rte_mbuf_raw_free(md); } return(m); } return (NULL); } If the buffer is indirect, another refcnt update occurs to the buf_addr mbuf, and in the scenario I outlined above, that refcnt will underflow, likely causing a buffer leak. Additionally, the return code of this function is designed to indicate to the caller if they were the last user of the buffer. In the above scenario, two execution contexts will be told that they were, which is wrong. Zoltans patch is a good fix Acked-by: Neil Horman <nhorman@tuxdriver.com> ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free 2015-03-27 10:25 ` Neil Horman @ 2015-03-27 10:48 ` Ananyev, Konstantin 2015-03-27 12:44 ` Neil Horman 2015-03-27 10:50 ` Olivier MATZ 1 sibling, 1 reply; 10+ messages in thread From: Ananyev, Konstantin @ 2015-03-27 10:48 UTC (permalink / raw) To: Neil Horman, Wiles, Keith; +Cc: dev > -----Original Message----- > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Neil Horman > Sent: Friday, March 27, 2015 10:26 AM > To: Wiles, Keith > Cc: dev@dpdk.org > Subject: Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free > > On Thu, Mar 26, 2015 at 09:00:33PM +0000, Wiles, Keith wrote: > > > > > > On 3/26/15, 1:10 PM, "Zoltan Kiss" <zoltan.kiss@linaro.org> wrote: > > > > >The current way is not the most efficient: if m->refcnt is 1, the second > > >condition never evaluates, and we set it to 0. If refcnt > 1, the 2nd > > >condition fails again, although the code suggest otherwise to branch > > >prediction. Instead we should keep the second condition only, and remove > > >the > > >duplicate set to zero. > > > > > >Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org> > > >--- > > > lib/librte_mbuf/rte_mbuf.h | 5 +---- > > > 1 file changed, 1 insertion(+), 4 deletions(-) > > > > > >diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h > > >index 17ba791..3ec4024 100644 > > >--- a/lib/librte_mbuf/rte_mbuf.h > > >+++ b/lib/librte_mbuf/rte_mbuf.h > > >@@ -764,10 +764,7 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m) > > > { > > > __rte_mbuf_sanity_check(m, 0); > > > > > >- if (likely (rte_mbuf_refcnt_read(m) == 1) || > > >- likely (rte_mbuf_refcnt_update(m, -1) == 0)) { > > >- > > >- rte_mbuf_refcnt_set(m, 0); > > >+ if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) { > > > > > > /* if this is an indirect mbuf, then > > > * - detach mbuf > > > > I fell for this one too, but read Bruce¹s email > > http://dpdk.org/ml/archives/dev/2015-March/014481.html > > This is still the right thing to do though, Bruce's reasoning is erroneous. No, it is not. I believe Bruce comments is absolutely correct here. > Just because the return from rte_mbuf_refcnt_read returns 1, doesn't mean you It does. > are the last user of the mbuf, > you are only guaranteed that if the update > operation returns zero. > > In other words: > rte_mbuf_refcnt_update(m, -1) > > is an atomic operation > > if (likely (rte_mbuf_refcnt_read(m) == 1) || > likely (rte_mbuf_refcnt_update(m, -1) == 0)) { > > > is not. > > To illustrate, on two cpus, this might occur: > > CPU0 CPU1 > rte_mbuf_refcnt_read ... > returns 1 rte_mbuf_refcnt_read > ... returns 1 > execute if clause execute if clause If you have an mbuf with refcnt==N and try to call free() for it N+1 times - it is a bug in your code. Such code wouldn't work properly doesn't matter do we use: if (likely (rte_mbuf_refcnt_read(m) == 1) || likely (rte_mbuf_refcnt_update(m, -1) == 0)) or just: if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) To illustrate it with your example: Suppose m.refcnt==1 CPU0 executes: rte_pktmbuf_free(m1) /*rte_mbuf_refcnt_update(m1, -1) returns 0, so we reset I'ts refcnt and next and put mbuf back to the pool.*/ m2 = rte_pktmbuf_alloc(pool); /*as m1 is 'free' alloc could return same mbuf here, i.e: m2 == m1. */ /* m2 refcnt ==1 start using m2 */ CPU1 executes: rte_pktmbuf_free(m1) /*rte_mbuf_refcnt_update(m1, -1) returns 0, so we reset I'ts refcnt and next and put mbuf back to the pool.*/ We just returnend to the pool mbuf that is in use and caused silent memory corruption of the mbuf's content. > > In the above scenario both cpus fell into the if clause because they both held a > pointer to the same buffer and both got a return value of one, so they skipped > the update portion of the if clause and both executed the internal block of the > conditional expression. you might be tempted to think thats ok, since that > block just sets the refcnt to zero, and doing so twice isn't harmful, but the > entire purpose of that if conditional above was to ensure that only one > execution context ever executed the conditional for a given buffer. Look at > what else happens in that conditional: > > static inline struct rte_mbuf* __attribute__((always_inline)) > __rte_pktmbuf_prefree_seg(struct rte_mbuf *m) > { > __rte_mbuf_sanity_check(m, 0); > > if (likely (rte_mbuf_refcnt_read(m) == 1) || > likely (rte_mbuf_refcnt_update(m, -1) == 0)) { > > rte_mbuf_refcnt_set(m, 0); > > /* if this is an indirect mbuf, then > * - detach mbuf > * - free attached mbuf segment > */ > if (RTE_MBUF_INDIRECT(m)) { > struct rte_mbuf *md = RTE_MBUF_FROM_BADDR(m->buf_addr); > rte_pktmbuf_detach(m); > if (rte_mbuf_refcnt_update(md, -1) == 0) > __rte_mbuf_raw_free(md); > } > return(m); > } > return (NULL); > } > > If the buffer is indirect, another refcnt update occurs to the buf_addr mbuf, > and in the scenario I outlined above, that refcnt will underflow, likely causing > a buffer leak. Additionally, the return code of this function is designed to > indicate to the caller if they were the last user of the buffer. In the above > scenario, two execution contexts will be told that they were, which is wrong. > > Zoltans patch is a good fix I don't think so. > > Acked-by: Neil Horman <nhorman@tuxdriver.com> NACKed-by: Konstantin Ananyev <konstantin.ananyev@intel.com> ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free 2015-03-27 10:48 ` Ananyev, Konstantin @ 2015-03-27 12:44 ` Neil Horman 2015-03-27 13:10 ` Olivier MATZ 0 siblings, 1 reply; 10+ messages in thread From: Neil Horman @ 2015-03-27 12:44 UTC (permalink / raw) To: Ananyev, Konstantin; +Cc: dev On Fri, Mar 27, 2015 at 10:48:20AM +0000, Ananyev, Konstantin wrote: > > > > -----Original Message----- > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Neil Horman > > Sent: Friday, March 27, 2015 10:26 AM > > To: Wiles, Keith > > Cc: dev@dpdk.org > > Subject: Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free > > > > On Thu, Mar 26, 2015 at 09:00:33PM +0000, Wiles, Keith wrote: > > > > > > > > > On 3/26/15, 1:10 PM, "Zoltan Kiss" <zoltan.kiss@linaro.org> wrote: > > > > > > >The current way is not the most efficient: if m->refcnt is 1, the second > > > >condition never evaluates, and we set it to 0. If refcnt > 1, the 2nd > > > >condition fails again, although the code suggest otherwise to branch > > > >prediction. Instead we should keep the second condition only, and remove > > > >the > > > >duplicate set to zero. > > > > > > > >Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org> > > > >--- > > > > lib/librte_mbuf/rte_mbuf.h | 5 +---- > > > > 1 file changed, 1 insertion(+), 4 deletions(-) > > > > > > > >diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h > > > >index 17ba791..3ec4024 100644 > > > >--- a/lib/librte_mbuf/rte_mbuf.h > > > >+++ b/lib/librte_mbuf/rte_mbuf.h > > > >@@ -764,10 +764,7 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m) > > > > { > > > > __rte_mbuf_sanity_check(m, 0); > > > > > > > >- if (likely (rte_mbuf_refcnt_read(m) == 1) || > > > >- likely (rte_mbuf_refcnt_update(m, -1) == 0)) { > > > >- > > > >- rte_mbuf_refcnt_set(m, 0); > > > >+ if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) { > > > > > > > > /* if this is an indirect mbuf, then > > > > * - detach mbuf > > > > > > I fell for this one too, but read Bruce¹s email > > > http://dpdk.org/ml/archives/dev/2015-March/014481.html > > > > This is still the right thing to do though, Bruce's reasoning is erroneous. > > No, it is not. I believe Bruce comments is absolutely correct here. > You and bruce are wrong, I proved that below. > > Just because the return from rte_mbuf_refcnt_read returns 1, doesn't mean you > > It does. > assertions are meaningless without evidence. > > are the last user of the mbuf, > > you are only guaranteed that if the update > > operation returns zero. > > > > In other words: > > rte_mbuf_refcnt_update(m, -1) > > > > is an atomic operation > > > > if (likely (rte_mbuf_refcnt_read(m) == 1) || > > likely (rte_mbuf_refcnt_update(m, -1) == 0)) { > > > > > > is not. > > > > To illustrate, on two cpus, this might occur: > > > > CPU0 CPU1 > > rte_mbuf_refcnt_read ... > > returns 1 rte_mbuf_refcnt_read > > ... returns 1 > > execute if clause execute if clause > > > If you have an mbuf with refcnt==N and try to call free() for it N+1 times - > it is a bug in your code. At what point in time did I indicate this was about multiple frees? Please re-read my post. > Such code wouldn't work properly doesn't matter do we use: > > if (likely (rte_mbuf_refcnt_read(m) == 1) || likely (rte_mbuf_refcnt_update(m, -1) == 0)) > > or just: > if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) > > To illustrate it with your example: > Suppose m.refcnt==1 > > CPU0 executes: > > rte_pktmbuf_free(m1) > /*rte_mbuf_refcnt_update(m1, -1) returns 0, so we reset I'ts refcnt and next and put mbuf back to the pool.*/ > > m2 = rte_pktmbuf_alloc(pool); > /*as m1 is 'free' alloc could return same mbuf here, i.e: m2 == m1. */ > > /* m2 refcnt ==1 start using m2 */ > Really missing the point here. > CPU1 executes: > rte_pktmbuf_free(m1) > /*rte_mbuf_refcnt_update(m1, -1) returns 0, so we reset I'ts refcnt and next and put mbuf back to the pool.*/ > > We just returnend to the pool mbuf that is in use and caused silent memory corruption of the mbuf's content. > Still missing the point. Please see below > > > > In the above scenario both cpus fell into the if clause because they both held a > > pointer to the same buffer and both got a return value of one, so they skipped > > the update portion of the if clause and both executed the internal block of the > > conditional expression. you might be tempted to think thats ok, since that > > block just sets the refcnt to zero, and doing so twice isn't harmful, but the > > entire purpose of that if conditional above was to ensure that only one > > execution context ever executed the conditional for a given buffer. Look at > > what else happens in that conditional: > > > > static inline struct rte_mbuf* __attribute__((always_inline)) > > __rte_pktmbuf_prefree_seg(struct rte_mbuf *m) > > { > > __rte_mbuf_sanity_check(m, 0); > > > > if (likely (rte_mbuf_refcnt_read(m) == 1) || > > likely (rte_mbuf_refcnt_update(m, -1) == 0)) { > > > > rte_mbuf_refcnt_set(m, 0); > > > > /* if this is an indirect mbuf, then > > * - detach mbuf > > * - free attached mbuf segment > > */ > > if (RTE_MBUF_INDIRECT(m)) { > > struct rte_mbuf *md = RTE_MBUF_FROM_BADDR(m->buf_addr); > > rte_pktmbuf_detach(m); > > if (rte_mbuf_refcnt_update(md, -1) == 0) > > __rte_mbuf_raw_free(md); > > } > > return(m); > > } > > return (NULL); > > } > > > > If the buffer is indirect, another refcnt update occurs to the buf_addr mbuf, > > and in the scenario I outlined above, that refcnt will underflow, likely causing > > a buffer leak. Additionally, the return code of this function is designed to > > indicate to the caller if they were the last user of the buffer. In the above > > scenario, two execution contexts will be told that they were, which is wrong. > > > > Zoltans patch is a good fix > > I don't think so. > > > > > > Acked-by: Neil Horman <nhorman@tuxdriver.com> > > > NACKed-by: Konstantin Ananyev <konstantin.ananyev@intel.com> > Again, this has nothing to do with how many times you free an object and everything to do with why you use atomics here in the first place. The purpose of the if conditional in the above code is to ensure that the contents of the conditional block only get executed a single time, correct? Ostensibly you don't want to execution contexts getting in there at the same time right? If you have a single buffer with refcnt=1, and two cpus are executing code that points to that buffer, and they both call __rte_pktmbuf_prefree_seg at around the same time, they can race and both wind up in that conditional block, leading to underflow of the md pointer refcnt, which is bad. Lets look at another more practical example. lets imagine that that the mbuf X is linked into a set that multiple cpus can query. X->refcnt is held by CPU0, and is about to be freed using the above refcnt test model (a read followed by an update that gets squashed, anda refcnt set in the free block. Basically this pseudo code: if (refcnt_read(X) == 1 || refcnt_update(X) == ) { refcnt_set(X,0) mbuf_free(X) } at the same time CPU1 is preforming a lookup of our needed mbuf from the aforementioned set, finds it and takes a refcnt on it. CPU0 CPU1 if(refcnt_read(X)) search for mbuf X returns 1 get pointer to X ... refcnt_update(X,1) refcnt_set(X, 0) ... mbuf_free(X) After the following sequence X is freed but CPU1 is left thinking that it has a valid reference to the mbuf. This is broken. As an alternate thought experiment, why use atomics here at all? X86 is cache coherent right? (ignore that new processor support, as this code predates it). If all cpus are able to see a consistent state of a variable, and if every context that has a pointer to a given mbuf also has a reference to an mbuf, then it should be safe to simply use an integer here rather than an atomic, right? If you know that you have a reference to a pointer, just decrement the refcnt and check for 0 instead of one, that will tell you that you are the last user of a buffer, right? The answer is you can't because there are conditions in which you either need to make a set of conditions atomic (finding a pointer and increasing said refcnt under the protection of a lock), or you need some method to predicate the execution of some initial or finilazation event (like in __rte_pktmbuf_prefree_seg so that you don't have multiple contexts doing that same init/finalization and so that you don't provide small windows of inconsistency in your atomics, which is what you have above. I wrote a demonstration program to illustrate (forgive me, its pretty quick and dirty), but I think it illustrates the point: #define _GNU_SOURCE #include <stdlib.h> #include <stdio.h> #include <pthread.h> #include <stdatomic.h> atomic_uint_fast64_t refcnt; uint threads = 0; static void * thread_exec(void *arg) { int i; int cpu = (int)(arg); cpu_set_t cpuset; pthread_t thread; thread = pthread_self(); CPU_ZERO(&cpuset); CPU_SET(cpu, &cpuset); pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset); for (i=0; i < 1000; i++) { if (((atomic_fetch_sub(&refcnt, 0) == 1) || atomic_fetch_sub(&refcnt, 1) == 0)) { // There should only ever be one thread in here at a atomic_init(&refcnt, 0); threads |= cpu; printf("threads = %d\n", threads); threads &= ~cpu; // Need to reset the refcnt for future iterations // but that should be fine since no other thread // should be in here but us atomic_init(&refcnt, 1); } } pthread_exit(NULL); } int main(int argc, char **argv) { pthread_attr_t attr; pthread_t thread_id1, thread_id2; void *status; atomic_init(&refcnt, 1); pthread_attr_init(&attr); pthread_create(&thread_id1, &attr, thread_exec, (void *)1); pthread_create(&thread_id2, &attr, thread_exec, (void *)2); pthread_attr_destroy(&attr); pthread_join(thread_id1, &status); pthread_join(thread_id2, &status); exit(0); } If you run this on an smp system, you'll clearly see that, occasionally the value of threads is 3. That indicates that you have points where you have multiple contexts executing in that conditional block that has clearly been coded to only expect one. You can't make the assumption that every pointer has a held refcount here, you need to incur the update penalty. Neil ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free 2015-03-27 12:44 ` Neil Horman @ 2015-03-27 13:10 ` Olivier MATZ 2015-03-27 13:16 ` Bruce Richardson 0 siblings, 1 reply; 10+ messages in thread From: Olivier MATZ @ 2015-03-27 13:10 UTC (permalink / raw) To: Neil Horman, Ananyev, Konstantin; +Cc: dev Hi Neil, On 03/27/2015 01:44 PM, Neil Horman wrote: > On Fri, Mar 27, 2015 at 10:48:20AM +0000, Ananyev, Konstantin wrote: >> >> >>> -----Original Message----- >>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Neil Horman >>> Sent: Friday, March 27, 2015 10:26 AM >>> To: Wiles, Keith >>> Cc: dev@dpdk.org >>> Subject: Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free >>> >>> On Thu, Mar 26, 2015 at 09:00:33PM +0000, Wiles, Keith wrote: >>>> >>>> >>>> On 3/26/15, 1:10 PM, "Zoltan Kiss" <zoltan.kiss@linaro.org> wrote: >>>> >>>>> The current way is not the most efficient: if m->refcnt is 1, the second >>>>> condition never evaluates, and we set it to 0. If refcnt > 1, the 2nd >>>>> condition fails again, although the code suggest otherwise to branch >>>>> prediction. Instead we should keep the second condition only, and remove >>>>> the >>>>> duplicate set to zero. >>>>> >>>>> Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org> >>>>> --- >>>>> lib/librte_mbuf/rte_mbuf.h | 5 +---- >>>>> 1 file changed, 1 insertion(+), 4 deletions(-) >>>>> >>>>> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h >>>>> index 17ba791..3ec4024 100644 >>>>> --- a/lib/librte_mbuf/rte_mbuf.h >>>>> +++ b/lib/librte_mbuf/rte_mbuf.h >>>>> @@ -764,10 +764,7 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m) >>>>> { >>>>> __rte_mbuf_sanity_check(m, 0); >>>>> >>>>> - if (likely (rte_mbuf_refcnt_read(m) == 1) || >>>>> - likely (rte_mbuf_refcnt_update(m, -1) == 0)) { >>>>> - >>>>> - rte_mbuf_refcnt_set(m, 0); >>>>> + if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) { >>>>> >>>>> /* if this is an indirect mbuf, then >>>>> * - detach mbuf >>>> >>>> I fell for this one too, but read Bruce¹s email >>>> http://dpdk.org/ml/archives/dev/2015-March/014481.html >>> >>> This is still the right thing to do though, Bruce's reasoning is erroneous. >> >> No, it is not. I believe Bruce comments is absolutely correct here. >> > You and bruce are wrong, I proved that below. > >>> Just because the return from rte_mbuf_refcnt_read returns 1, doesn't mean you >> >> It does. >> > assertions are meaningless without evidence. > >>> are the last user of the mbuf, >>> you are only guaranteed that if the update >>> operation returns zero. >>> >>> In other words: >>> rte_mbuf_refcnt_update(m, -1) >>> >>> is an atomic operation >>> >>> if (likely (rte_mbuf_refcnt_read(m) == 1) || >>> likely (rte_mbuf_refcnt_update(m, -1) == 0)) { >>> >>> >>> is not. >>> >>> To illustrate, on two cpus, this might occur: >>> >>> CPU0 CPU1 >>> rte_mbuf_refcnt_read ... >>> returns 1 rte_mbuf_refcnt_read >>> ... returns 1 >>> execute if clause execute if clause >> >> >> If you have an mbuf with refcnt==N and try to call free() for it N+1 times - >> it is a bug in your code. > At what point in time did I indicate this was about multiple frees? Please > re-read my post. > >> Such code wouldn't work properly doesn't matter do we use: >> >> if (likely (rte_mbuf_refcnt_read(m) == 1) || likely (rte_mbuf_refcnt_update(m, -1) == 0)) >> >> or just: >> if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) >> >> To illustrate it with your example: >> Suppose m.refcnt==1 >> >> CPU0 executes: >> >> rte_pktmbuf_free(m1) >> /*rte_mbuf_refcnt_update(m1, -1) returns 0, so we reset I'ts refcnt and next and put mbuf back to the pool.*/ >> >> m2 = rte_pktmbuf_alloc(pool); >> /*as m1 is 'free' alloc could return same mbuf here, i.e: m2 == m1. */ >> >> /* m2 refcnt ==1 start using m2 */ >> > Really missing the point here. > >> CPU1 executes: >> rte_pktmbuf_free(m1) >> /*rte_mbuf_refcnt_update(m1, -1) returns 0, so we reset I'ts refcnt and next and put mbuf back to the pool.*/ >> >> We just returnend to the pool mbuf that is in use and caused silent memory corruption of the mbuf's content. >> > Still missing the point. Please see below > >>> >>> In the above scenario both cpus fell into the if clause because they both held a >>> pointer to the same buffer and both got a return value of one, so they skipped >>> the update portion of the if clause and both executed the internal block of the >>> conditional expression. you might be tempted to think thats ok, since that >>> block just sets the refcnt to zero, and doing so twice isn't harmful, but the >>> entire purpose of that if conditional above was to ensure that only one >>> execution context ever executed the conditional for a given buffer. Look at >>> what else happens in that conditional: >>> >>> static inline struct rte_mbuf* __attribute__((always_inline)) >>> __rte_pktmbuf_prefree_seg(struct rte_mbuf *m) >>> { >>> __rte_mbuf_sanity_check(m, 0); >>> >>> if (likely (rte_mbuf_refcnt_read(m) == 1) || >>> likely (rte_mbuf_refcnt_update(m, -1) == 0)) { >>> >>> rte_mbuf_refcnt_set(m, 0); >>> >>> /* if this is an indirect mbuf, then >>> * - detach mbuf >>> * - free attached mbuf segment >>> */ >>> if (RTE_MBUF_INDIRECT(m)) { >>> struct rte_mbuf *md = RTE_MBUF_FROM_BADDR(m->buf_addr); >>> rte_pktmbuf_detach(m); >>> if (rte_mbuf_refcnt_update(md, -1) == 0) >>> __rte_mbuf_raw_free(md); >>> } >>> return(m); >>> } >>> return (NULL); >>> } >>> >>> If the buffer is indirect, another refcnt update occurs to the buf_addr mbuf, >>> and in the scenario I outlined above, that refcnt will underflow, likely causing >>> a buffer leak. Additionally, the return code of this function is designed to >>> indicate to the caller if they were the last user of the buffer. In the above >>> scenario, two execution contexts will be told that they were, which is wrong. >>> >>> Zoltans patch is a good fix >> >> I don't think so. >> >> >>> >>> Acked-by: Neil Horman <nhorman@tuxdriver.com> >> >> >> NACKed-by: Konstantin Ananyev <konstantin.ananyev@intel.com> >> > > Again, this has nothing to do with how many times you free an object and > everything to do with why you use atomics here in the first place. The purpose > of the if conditional in the above code is to ensure that the contents of the > conditional block only get executed a single time, correct? Ostensibly you > don't want to execution contexts getting in there at the same time right? > > If you have a single buffer with refcnt=1, and two cpus are executing code that > points to that buffer, and they both call __rte_pktmbuf_prefree_seg at around > the same time, they can race and both wind up in that conditional block, leading > to underflow of the md pointer refcnt, which is bad. You cannot have a mbuf with refcnt=1 referenced by 2 cores, this does not make sense. Even with the fix you have acked. CPU0 CPU1 m = a_common_mbuf; m = a_common_mbuf; rte_pktmbuf_free(m) // fully atomic m2 = rte_pktmbuf_alloc() // m2 returned the same addr than m // as it was in the pool // should not access m here // whatever the operation Your example below just shows that the current code is wrong if several cores access a mbuf with refcnt=1 at the same time. That's true, but that's not allowed. - If you want to give a mbuf to another core, you put it in a ring and stop to reference it on core 0, here not need to have refcnt - If you want to share a mbuf with another core, you increase the reference counter before sending it to core 1. Then, both cores will have to call rte_pktmbuf_free(). Regards, Olivier > > Lets look at another more practical example. lets imagine that that the mbuf X > is linked into a set that multiple cpus can query. X->refcnt is held by CPU0, > and is about to be freed using the above refcnt test model (a read followed by > an update that gets squashed, anda refcnt set in the free block. Basically this > pseudo code: > > if (refcnt_read(X) == 1 || refcnt_update(X) == ) { > refcnt_set(X,0) > mbuf_free(X) > } > > at the same time CPU1 is preforming a lookup of our needed mbuf from the > aforementioned set, finds it and takes a refcnt on it. > > > CPU0 CPU1 > if(refcnt_read(X)) search for mbuf X > returns 1 get pointer to X > ... refcnt_update(X,1) > refcnt_set(X, 0) ... > mbuf_free(X) > > > After the following sequence X is freed but CPU1 is left thinking that it has a > valid reference to the mbuf. This is broken. > > As an alternate thought experiment, why use atomics here at all? X86 is cache > coherent right? (ignore that new processor support, as this code predates it). > If all cpus are able to see a consistent state of a variable, and if every > context that has a pointer to a given mbuf also has a reference to an mbuf, then > it should be safe to simply use an integer here rather than an atomic, right? > If you know that you have a reference to a pointer, just decrement the refcnt > and check for 0 instead of one, that will tell you that you are the last user of > a buffer, right? The answer is you can't because there are conditions in which > you either need to make a set of conditions atomic (finding a pointer and > increasing said refcnt under the protection of a lock), or you need some method > to predicate the execution of some initial or finilazation event (like in > __rte_pktmbuf_prefree_seg so that you don't have multiple contexts doing that > same init/finalization and so that you don't provide small windows of > inconsistency in your atomics, which is what you have above. > > I wrote a demonstration program to illustrate (forgive me, its pretty quick and > dirty), but I think it illustrates the point: > > #define _GNU_SOURCE > #include <stdlib.h> > #include <stdio.h> > #include <pthread.h> > #include <stdatomic.h> > > atomic_uint_fast64_t refcnt; > > uint threads = 0; > > static void * thread_exec(void *arg) > { > int i; > int cpu = (int)(arg); > cpu_set_t cpuset; > pthread_t thread; > > thread = pthread_self(); > CPU_ZERO(&cpuset); > CPU_SET(cpu, &cpuset); > pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset); > > for (i=0; i < 1000; i++) { > if (((atomic_fetch_sub(&refcnt, 0) == 1) || > atomic_fetch_sub(&refcnt, 1) == 0)) { > // There should only ever be one thread in here at a > atomic_init(&refcnt, 0); > threads |= cpu; > printf("threads = %d\n", threads); > threads &= ~cpu; > > // Need to reset the refcnt for future iterations > // but that should be fine since no other thread > // should be in here but us > atomic_init(&refcnt, 1); > } > } > > pthread_exit(NULL); > } > > int main(int argc, char **argv) > { > pthread_attr_t attr; > pthread_t thread_id1, thread_id2; > void *status; > > atomic_init(&refcnt, 1); > > pthread_attr_init(&attr); > > pthread_create(&thread_id1, &attr, thread_exec, (void *)1); > pthread_create(&thread_id2, &attr, thread_exec, (void *)2); > > pthread_attr_destroy(&attr); > > pthread_join(thread_id1, &status); > pthread_join(thread_id2, &status); > > > exit(0); > > } > > > If you run this on an smp system, you'll clearly see that, occasionally the > value of threads is 3. That indicates that you have points where you have > multiple contexts executing in that conditional block that has clearly been > coded to only expect one. You can't make the assumption that every pointer has > a held refcount here, you need to incur the update penalty. > > Neil > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free 2015-03-27 13:10 ` Olivier MATZ @ 2015-03-27 13:16 ` Bruce Richardson 2015-03-27 13:22 ` Ananyev, Konstantin 0 siblings, 1 reply; 10+ messages in thread From: Bruce Richardson @ 2015-03-27 13:16 UTC (permalink / raw) To: Olivier MATZ; +Cc: dev On Fri, Mar 27, 2015 at 02:10:33PM +0100, Olivier MATZ wrote: > Hi Neil, > > On 03/27/2015 01:44 PM, Neil Horman wrote: > > On Fri, Mar 27, 2015 at 10:48:20AM +0000, Ananyev, Konstantin wrote: > >> > >> > >>> -----Original Message----- > >>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Neil Horman > >>> Sent: Friday, March 27, 2015 10:26 AM > >>> To: Wiles, Keith > >>> Cc: dev@dpdk.org > >>> Subject: Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free > >>> > >>> On Thu, Mar 26, 2015 at 09:00:33PM +0000, Wiles, Keith wrote: > >>>> > >>>> > >>>> On 3/26/15, 1:10 PM, "Zoltan Kiss" <zoltan.kiss@linaro.org> wrote: > >>>> > >>>>> The current way is not the most efficient: if m->refcnt is 1, the second > >>>>> condition never evaluates, and we set it to 0. If refcnt > 1, the 2nd > >>>>> condition fails again, although the code suggest otherwise to branch > >>>>> prediction. Instead we should keep the second condition only, and remove > >>>>> the > >>>>> duplicate set to zero. > >>>>> > >>>>> Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org> > >>>>> --- > >>>>> lib/librte_mbuf/rte_mbuf.h | 5 +---- > >>>>> 1 file changed, 1 insertion(+), 4 deletions(-) > >>>>> > >>>>> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h > >>>>> index 17ba791..3ec4024 100644 > >>>>> --- a/lib/librte_mbuf/rte_mbuf.h > >>>>> +++ b/lib/librte_mbuf/rte_mbuf.h > >>>>> @@ -764,10 +764,7 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m) > >>>>> { > >>>>> __rte_mbuf_sanity_check(m, 0); > >>>>> > >>>>> - if (likely (rte_mbuf_refcnt_read(m) == 1) || > >>>>> - likely (rte_mbuf_refcnt_update(m, -1) == 0)) { > >>>>> - > >>>>> - rte_mbuf_refcnt_set(m, 0); > >>>>> + if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) { > >>>>> > >>>>> /* if this is an indirect mbuf, then > >>>>> * - detach mbuf > >>>> > >>>> I fell for this one too, but read Bruce¹s email > >>>> http://dpdk.org/ml/archives/dev/2015-March/014481.html > >>> > >>> This is still the right thing to do though, Bruce's reasoning is erroneous. > >> > >> No, it is not. I believe Bruce comments is absolutely correct here. > >> > > You and bruce are wrong, I proved that below. > > > >>> Just because the return from rte_mbuf_refcnt_read returns 1, doesn't mean you > >> > >> It does. > >> > > assertions are meaningless without evidence. > > > >>> are the last user of the mbuf, > >>> you are only guaranteed that if the update > >>> operation returns zero. > >>> > >>> In other words: > >>> rte_mbuf_refcnt_update(m, -1) > >>> > >>> is an atomic operation > >>> > >>> if (likely (rte_mbuf_refcnt_read(m) == 1) || > >>> likely (rte_mbuf_refcnt_update(m, -1) == 0)) { > >>> > >>> > >>> is not. > >>> > >>> To illustrate, on two cpus, this might occur: > >>> > >>> CPU0 CPU1 > >>> rte_mbuf_refcnt_read ... > >>> returns 1 rte_mbuf_refcnt_read > >>> ... returns 1 > >>> execute if clause execute if clause > >> > >> > >> If you have an mbuf with refcnt==N and try to call free() for it N+1 times - > >> it is a bug in your code. > > At what point in time did I indicate this was about multiple frees? Please > > re-read my post. > > > >> Such code wouldn't work properly doesn't matter do we use: > >> > >> if (likely (rte_mbuf_refcnt_read(m) == 1) || likely (rte_mbuf_refcnt_update(m, -1) == 0)) > >> > >> or just: > >> if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) > >> > >> To illustrate it with your example: > >> Suppose m.refcnt==1 > >> > >> CPU0 executes: > >> > >> rte_pktmbuf_free(m1) > >> /*rte_mbuf_refcnt_update(m1, -1) returns 0, so we reset I'ts refcnt and next and put mbuf back to the pool.*/ > >> > >> m2 = rte_pktmbuf_alloc(pool); > >> /*as m1 is 'free' alloc could return same mbuf here, i.e: m2 == m1. */ > >> > >> /* m2 refcnt ==1 start using m2 */ > >> > > Really missing the point here. > > > >> CPU1 executes: > >> rte_pktmbuf_free(m1) > >> /*rte_mbuf_refcnt_update(m1, -1) returns 0, so we reset I'ts refcnt and next and put mbuf back to the pool.*/ > >> > >> We just returnend to the pool mbuf that is in use and caused silent memory corruption of the mbuf's content. > >> > > Still missing the point. Please see below > > > >>> > >>> In the above scenario both cpus fell into the if clause because they both held a > >>> pointer to the same buffer and both got a return value of one, so they skipped > >>> the update portion of the if clause and both executed the internal block of the > >>> conditional expression. you might be tempted to think thats ok, since that > >>> block just sets the refcnt to zero, and doing so twice isn't harmful, but the > >>> entire purpose of that if conditional above was to ensure that only one > >>> execution context ever executed the conditional for a given buffer. Look at > >>> what else happens in that conditional: > >>> > >>> static inline struct rte_mbuf* __attribute__((always_inline)) > >>> __rte_pktmbuf_prefree_seg(struct rte_mbuf *m) > >>> { > >>> __rte_mbuf_sanity_check(m, 0); > >>> > >>> if (likely (rte_mbuf_refcnt_read(m) == 1) || > >>> likely (rte_mbuf_refcnt_update(m, -1) == 0)) { > >>> > >>> rte_mbuf_refcnt_set(m, 0); > >>> > >>> /* if this is an indirect mbuf, then > >>> * - detach mbuf > >>> * - free attached mbuf segment > >>> */ > >>> if (RTE_MBUF_INDIRECT(m)) { > >>> struct rte_mbuf *md = RTE_MBUF_FROM_BADDR(m->buf_addr); > >>> rte_pktmbuf_detach(m); > >>> if (rte_mbuf_refcnt_update(md, -1) == 0) > >>> __rte_mbuf_raw_free(md); > >>> } > >>> return(m); > >>> } > >>> return (NULL); > >>> } > >>> > >>> If the buffer is indirect, another refcnt update occurs to the buf_addr mbuf, > >>> and in the scenario I outlined above, that refcnt will underflow, likely causing > >>> a buffer leak. Additionally, the return code of this function is designed to > >>> indicate to the caller if they were the last user of the buffer. In the above > >>> scenario, two execution contexts will be told that they were, which is wrong. > >>> > >>> Zoltans patch is a good fix > >> > >> I don't think so. > >> > >> > >>> > >>> Acked-by: Neil Horman <nhorman@tuxdriver.com> > >> > >> > >> NACKed-by: Konstantin Ananyev <konstantin.ananyev@intel.com> > >> > > > > Again, this has nothing to do with how many times you free an object and > > everything to do with why you use atomics here in the first place. The purpose > > of the if conditional in the above code is to ensure that the contents of the > > conditional block only get executed a single time, correct? Ostensibly you > > don't want to execution contexts getting in there at the same time right? > > > > If you have a single buffer with refcnt=1, and two cpus are executing code that > > points to that buffer, and they both call __rte_pktmbuf_prefree_seg at around > > the same time, they can race and both wind up in that conditional block, leading > > to underflow of the md pointer refcnt, which is bad. > > > You cannot have a mbuf with refcnt=1 referenced by 2 cores, this does > not make sense. Even with the fix you have acked. > > CPU0 CPU1 > > m = a_common_mbuf; m = a_common_mbuf; > rte_pktmbuf_free(m) // fully atomic > m2 = rte_pktmbuf_alloc() > // m2 returned the same addr than m > // as it was in the pool > // should not access m here > // whatever the operation > > > Your example below just shows that the current code is wrong if > several cores access a mbuf with refcnt=1 at the same time. That's > true, but that's not allowed. > > - If you want to give a mbuf to another core, you put it in a ring > and stop to reference it on core 0, here not need to have refcnt > > - If you want to share a mbuf with another core, you increase the > reference counter before sending it to core 1. Then, both cores > will have to call rte_pktmbuf_free(). > > > > Regards, > Olivier > > +1 Also to note that the function this comment is being added to is a free-type function. If you are in that function, you are freeing the mbuf, so calls from multiple cores simultaneously is a double-free error. /Bruce > > > > > > Lets look at another more practical example. lets imagine that that the mbuf X > > is linked into a set that multiple cpus can query. X->refcnt is held by CPU0, > > and is about to be freed using the above refcnt test model (a read followed by > > an update that gets squashed, anda refcnt set in the free block. Basically this > > pseudo code: > > > > if (refcnt_read(X) == 1 || refcnt_update(X) == ) { > > refcnt_set(X,0) > > mbuf_free(X) > > } > > > > at the same time CPU1 is preforming a lookup of our needed mbuf from the > > aforementioned set, finds it and takes a refcnt on it. > > > > > > CPU0 CPU1 > > if(refcnt_read(X)) search for mbuf X > > returns 1 get pointer to X > > ... refcnt_update(X,1) > > refcnt_set(X, 0) ... > > mbuf_free(X) > > > > > > After the following sequence X is freed but CPU1 is left thinking that it has a > > valid reference to the mbuf. This is broken. > > > > As an alternate thought experiment, why use atomics here at all? X86 is cache > > coherent right? (ignore that new processor support, as this code predates it). > > If all cpus are able to see a consistent state of a variable, and if every > > context that has a pointer to a given mbuf also has a reference to an mbuf, then > > it should be safe to simply use an integer here rather than an atomic, right? > > If you know that you have a reference to a pointer, just decrement the refcnt > > and check for 0 instead of one, that will tell you that you are the last user of > > a buffer, right? The answer is you can't because there are conditions in which > > you either need to make a set of conditions atomic (finding a pointer and > > increasing said refcnt under the protection of a lock), or you need some method > > to predicate the execution of some initial or finilazation event (like in > > __rte_pktmbuf_prefree_seg so that you don't have multiple contexts doing that > > same init/finalization and so that you don't provide small windows of > > inconsistency in your atomics, which is what you have above. > > > > I wrote a demonstration program to illustrate (forgive me, its pretty quick and > > dirty), but I think it illustrates the point: > > > > #define _GNU_SOURCE > > #include <stdlib.h> > > #include <stdio.h> > > #include <pthread.h> > > #include <stdatomic.h> > > > > atomic_uint_fast64_t refcnt; > > > > uint threads = 0; > > > > static void * thread_exec(void *arg) > > { > > int i; > > int cpu = (int)(arg); > > cpu_set_t cpuset; > > pthread_t thread; > > > > thread = pthread_self(); > > CPU_ZERO(&cpuset); > > CPU_SET(cpu, &cpuset); > > pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset); > > > > for (i=0; i < 1000; i++) { > > if (((atomic_fetch_sub(&refcnt, 0) == 1) || > > atomic_fetch_sub(&refcnt, 1) == 0)) { > > // There should only ever be one thread in here at a > > atomic_init(&refcnt, 0); > > threads |= cpu; > > printf("threads = %d\n", threads); > > threads &= ~cpu; > > > > // Need to reset the refcnt for future iterations > > // but that should be fine since no other thread > > // should be in here but us > > atomic_init(&refcnt, 1); > > } > > } > > > > pthread_exit(NULL); > > } > > > > int main(int argc, char **argv) > > { > > pthread_attr_t attr; > > pthread_t thread_id1, thread_id2; > > void *status; > > > > atomic_init(&refcnt, 1); > > > > pthread_attr_init(&attr); > > > > pthread_create(&thread_id1, &attr, thread_exec, (void *)1); > > pthread_create(&thread_id2, &attr, thread_exec, (void *)2); > > > > pthread_attr_destroy(&attr); > > > > pthread_join(thread_id1, &status); > > pthread_join(thread_id2, &status); > > > > > > exit(0); > > > > } > > > > > > If you run this on an smp system, you'll clearly see that, occasionally the > > value of threads is 3. That indicates that you have points where you have > > multiple contexts executing in that conditional block that has clearly been > > coded to only expect one. You can't make the assumption that every pointer has > > a held refcount here, you need to incur the update penalty. > > > > Neil > > > > > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free 2015-03-27 13:16 ` Bruce Richardson @ 2015-03-27 13:22 ` Ananyev, Konstantin 0 siblings, 0 replies; 10+ messages in thread From: Ananyev, Konstantin @ 2015-03-27 13:22 UTC (permalink / raw) To: Richardson, Bruce, Olivier MATZ; +Cc: dev > -----Original Message----- > From: Richardson, Bruce > Sent: Friday, March 27, 2015 1:17 PM > To: Olivier MATZ > Cc: Neil Horman; Ananyev, Konstantin; dev@dpdk.org > Subject: Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free > > On Fri, Mar 27, 2015 at 02:10:33PM +0100, Olivier MATZ wrote: > > Hi Neil, > > > > On 03/27/2015 01:44 PM, Neil Horman wrote: > > > On Fri, Mar 27, 2015 at 10:48:20AM +0000, Ananyev, Konstantin wrote: > > >> > > >> > > >>> -----Original Message----- > > >>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Neil Horman > > >>> Sent: Friday, March 27, 2015 10:26 AM > > >>> To: Wiles, Keith > > >>> Cc: dev@dpdk.org > > >>> Subject: Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free > > >>> > > >>> On Thu, Mar 26, 2015 at 09:00:33PM +0000, Wiles, Keith wrote: > > >>>> > > >>>> > > >>>> On 3/26/15, 1:10 PM, "Zoltan Kiss" <zoltan.kiss@linaro.org> wrote: > > >>>> > > >>>>> The current way is not the most efficient: if m->refcnt is 1, the second > > >>>>> condition never evaluates, and we set it to 0. If refcnt > 1, the 2nd > > >>>>> condition fails again, although the code suggest otherwise to branch > > >>>>> prediction. Instead we should keep the second condition only, and remove > > >>>>> the > > >>>>> duplicate set to zero. > > >>>>> > > >>>>> Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org> > > >>>>> --- > > >>>>> lib/librte_mbuf/rte_mbuf.h | 5 +---- > > >>>>> 1 file changed, 1 insertion(+), 4 deletions(-) > > >>>>> > > >>>>> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h > > >>>>> index 17ba791..3ec4024 100644 > > >>>>> --- a/lib/librte_mbuf/rte_mbuf.h > > >>>>> +++ b/lib/librte_mbuf/rte_mbuf.h > > >>>>> @@ -764,10 +764,7 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m) > > >>>>> { > > >>>>> __rte_mbuf_sanity_check(m, 0); > > >>>>> > > >>>>> - if (likely (rte_mbuf_refcnt_read(m) == 1) || > > >>>>> - likely (rte_mbuf_refcnt_update(m, -1) == 0)) { > > >>>>> - > > >>>>> - rte_mbuf_refcnt_set(m, 0); > > >>>>> + if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) { > > >>>>> > > >>>>> /* if this is an indirect mbuf, then > > >>>>> * - detach mbuf > > >>>> > > >>>> I fell for this one too, but read Bruce¹s email > > >>>> http://dpdk.org/ml/archives/dev/2015-March/014481.html > > >>> > > >>> This is still the right thing to do though, Bruce's reasoning is erroneous. > > >> > > >> No, it is not. I believe Bruce comments is absolutely correct here. > > >> > > > You and bruce are wrong, I proved that below. > > > > > >>> Just because the return from rte_mbuf_refcnt_read returns 1, doesn't mean you > > >> > > >> It does. > > >> > > > assertions are meaningless without evidence. > > > > > >>> are the last user of the mbuf, > > >>> you are only guaranteed that if the update > > >>> operation returns zero. > > >>> > > >>> In other words: > > >>> rte_mbuf_refcnt_update(m, -1) > > >>> > > >>> is an atomic operation > > >>> > > >>> if (likely (rte_mbuf_refcnt_read(m) == 1) || > > >>> likely (rte_mbuf_refcnt_update(m, -1) == 0)) { > > >>> > > >>> > > >>> is not. > > >>> > > >>> To illustrate, on two cpus, this might occur: > > >>> > > >>> CPU0 CPU1 > > >>> rte_mbuf_refcnt_read ... > > >>> returns 1 rte_mbuf_refcnt_read > > >>> ... returns 1 > > >>> execute if clause execute if clause > > >> > > >> > > >> If you have an mbuf with refcnt==N and try to call free() for it N+1 times - > > >> it is a bug in your code. > > > At what point in time did I indicate this was about multiple frees? Please > > > re-read my post. > > > > > >> Such code wouldn't work properly doesn't matter do we use: > > >> > > >> if (likely (rte_mbuf_refcnt_read(m) == 1) || likely (rte_mbuf_refcnt_update(m, -1) == 0)) > > >> > > >> or just: > > >> if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) > > >> > > >> To illustrate it with your example: > > >> Suppose m.refcnt==1 > > >> > > >> CPU0 executes: > > >> > > >> rte_pktmbuf_free(m1) > > >> /*rte_mbuf_refcnt_update(m1, -1) returns 0, so we reset I'ts refcnt and next and put mbuf back to the pool.*/ > > >> > > >> m2 = rte_pktmbuf_alloc(pool); > > >> /*as m1 is 'free' alloc could return same mbuf here, i.e: m2 == m1. */ > > >> > > >> /* m2 refcnt ==1 start using m2 */ > > >> > > > Really missing the point here. > > > > > >> CPU1 executes: > > >> rte_pktmbuf_free(m1) > > >> /*rte_mbuf_refcnt_update(m1, -1) returns 0, so we reset I'ts refcnt and next and put mbuf back to the pool.*/ > > >> > > >> We just returnend to the pool mbuf that is in use and caused silent memory corruption of the mbuf's content. > > >> > > > Still missing the point. Please see below > > > > > >>> > > >>> In the above scenario both cpus fell into the if clause because they both held a > > >>> pointer to the same buffer and both got a return value of one, so they skipped > > >>> the update portion of the if clause and both executed the internal block of the > > >>> conditional expression. you might be tempted to think thats ok, since that > > >>> block just sets the refcnt to zero, and doing so twice isn't harmful, but the > > >>> entire purpose of that if conditional above was to ensure that only one > > >>> execution context ever executed the conditional for a given buffer. Look at > > >>> what else happens in that conditional: > > >>> > > >>> static inline struct rte_mbuf* __attribute__((always_inline)) > > >>> __rte_pktmbuf_prefree_seg(struct rte_mbuf *m) > > >>> { > > >>> __rte_mbuf_sanity_check(m, 0); > > >>> > > >>> if (likely (rte_mbuf_refcnt_read(m) == 1) || > > >>> likely (rte_mbuf_refcnt_update(m, -1) == 0)) { > > >>> > > >>> rte_mbuf_refcnt_set(m, 0); > > >>> > > >>> /* if this is an indirect mbuf, then > > >>> * - detach mbuf > > >>> * - free attached mbuf segment > > >>> */ > > >>> if (RTE_MBUF_INDIRECT(m)) { > > >>> struct rte_mbuf *md = RTE_MBUF_FROM_BADDR(m->buf_addr); > > >>> rte_pktmbuf_detach(m); > > >>> if (rte_mbuf_refcnt_update(md, -1) == 0) > > >>> __rte_mbuf_raw_free(md); > > >>> } > > >>> return(m); > > >>> } > > >>> return (NULL); > > >>> } > > >>> > > >>> If the buffer is indirect, another refcnt update occurs to the buf_addr mbuf, > > >>> and in the scenario I outlined above, that refcnt will underflow, likely causing > > >>> a buffer leak. Additionally, the return code of this function is designed to > > >>> indicate to the caller if they were the last user of the buffer. In the above > > >>> scenario, two execution contexts will be told that they were, which is wrong. > > >>> > > >>> Zoltans patch is a good fix > > >> > > >> I don't think so. > > >> > > >> > > >>> > > >>> Acked-by: Neil Horman <nhorman@tuxdriver.com> > > >> > > >> > > >> NACKed-by: Konstantin Ananyev <konstantin.ananyev@intel.com> > > >> > > > > > > Again, this has nothing to do with how many times you free an object and > > > everything to do with why you use atomics here in the first place. The purpose > > > of the if conditional in the above code is to ensure that the contents of the > > > conditional block only get executed a single time, correct? Ostensibly you > > > don't want to execution contexts getting in there at the same time right? > > > > > > If you have a single buffer with refcnt=1, and two cpus are executing code that > > > points to that buffer, and they both call __rte_pktmbuf_prefree_seg at around > > > the same time, they can race and both wind up in that conditional block, leading > > > to underflow of the md pointer refcnt, which is bad. > > > > > > You cannot have a mbuf with refcnt=1 referenced by 2 cores, this does > > not make sense. Even with the fix you have acked. > > > > CPU0 CPU1 > > > > m = a_common_mbuf; m = a_common_mbuf; > > rte_pktmbuf_free(m) // fully atomic > > m2 = rte_pktmbuf_alloc() > > // m2 returned the same addr than m > > // as it was in the pool > > // should not access m here > > // whatever the operation > > > > > > Your example below just shows that the current code is wrong if > > several cores access a mbuf with refcnt=1 at the same time. That's > > true, but that's not allowed. > > > > - If you want to give a mbuf to another core, you put it in a ring > > and stop to reference it on core 0, here not need to have refcnt > > > > - If you want to share a mbuf with another core, you increase the > > reference counter before sending it to core 1. Then, both cores > > will have to call rte_pktmbuf_free(). > > > > > > > > Regards, > > Olivier > > > > > > +1 > > Also to note that the function this comment is being added to is a free-type > function. If you are in that function, you are freeing the mbuf, so calls > from multiple cores simultaneously is a double-free error. > > /Bruce Another +1 Was going to write a big reply, but realised Olivier and Bruce already wrote what I planned too :) Just to repeat: Neil, it is an invalid scenario to call free() twice for the mbuf whose refcnt==1. Doesn't matter is that happens concurrently from different threads, or sequentially from the same thread. You can't make it work properly with or without Zoltan patch. If that happens - it is a bug in your code, and as was pointed by Bruce in another mail, just voids the concept of reference counting. As a rule of thumb: if you are going to pass an object to an entity which would free it, and you still plan to use that object, then it is your responsibility to increment it's reference count first. Konstantin > > > > > > > > > > > Lets look at another more practical example. lets imagine that that the mbuf X > > > is linked into a set that multiple cpus can query. X->refcnt is held by CPU0, > > > and is about to be freed using the above refcnt test model (a read followed by > > > an update that gets squashed, anda refcnt set in the free block. Basically this > > > pseudo code: > > > > > > if (refcnt_read(X) == 1 || refcnt_update(X) == ) { > > > refcnt_set(X,0) > > > mbuf_free(X) > > > } > > > > > > at the same time CPU1 is preforming a lookup of our needed mbuf from the > > > aforementioned set, finds it and takes a refcnt on it. > > > > > > > > > CPU0 CPU1 > > > if(refcnt_read(X)) search for mbuf X > > > returns 1 get pointer to X > > > ... refcnt_update(X,1) > > > refcnt_set(X, 0) ... > > > mbuf_free(X) > > > > > > > > > After the following sequence X is freed but CPU1 is left thinking that it has a > > > valid reference to the mbuf. This is broken. > > > > > > As an alternate thought experiment, why use atomics here at all? X86 is cache > > > coherent right? (ignore that new processor support, as this code predates it). > > > If all cpus are able to see a consistent state of a variable, and if every > > > context that has a pointer to a given mbuf also has a reference to an mbuf, then > > > it should be safe to simply use an integer here rather than an atomic, right? > > > If you know that you have a reference to a pointer, just decrement the refcnt > > > and check for 0 instead of one, that will tell you that you are the last user of > > > a buffer, right? The answer is you can't because there are conditions in which > > > you either need to make a set of conditions atomic (finding a pointer and > > > increasing said refcnt under the protection of a lock), or you need some method > > > to predicate the execution of some initial or finilazation event (like in > > > __rte_pktmbuf_prefree_seg so that you don't have multiple contexts doing that > > > same init/finalization and so that you don't provide small windows of > > > inconsistency in your atomics, which is what you have above. > > > > > > I wrote a demonstration program to illustrate (forgive me, its pretty quick and > > > dirty), but I think it illustrates the point: > > > > > > #define _GNU_SOURCE > > > #include <stdlib.h> > > > #include <stdio.h> > > > #include <pthread.h> > > > #include <stdatomic.h> > > > > > > atomic_uint_fast64_t refcnt; > > > > > > uint threads = 0; > > > > > > static void * thread_exec(void *arg) > > > { > > > int i; > > > int cpu = (int)(arg); > > > cpu_set_t cpuset; > > > pthread_t thread; > > > > > > thread = pthread_self(); > > > CPU_ZERO(&cpuset); > > > CPU_SET(cpu, &cpuset); > > > pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset); > > > > > > for (i=0; i < 1000; i++) { > > > if (((atomic_fetch_sub(&refcnt, 0) == 1) || > > > atomic_fetch_sub(&refcnt, 1) == 0)) { > > > // There should only ever be one thread in here at a > > > atomic_init(&refcnt, 0); > > > threads |= cpu; > > > printf("threads = %d\n", threads); > > > threads &= ~cpu; > > > > > > // Need to reset the refcnt for future iterations > > > // but that should be fine since no other thread > > > // should be in here but us > > > atomic_init(&refcnt, 1); > > > } > > > } > > > > > > pthread_exit(NULL); > > > } > > > > > > int main(int argc, char **argv) > > > { > > > pthread_attr_t attr; > > > pthread_t thread_id1, thread_id2; > > > void *status; > > > > > > atomic_init(&refcnt, 1); > > > > > > pthread_attr_init(&attr); > > > > > > pthread_create(&thread_id1, &attr, thread_exec, (void *)1); > > > pthread_create(&thread_id2, &attr, thread_exec, (void *)2); > > > > > > pthread_attr_destroy(&attr); > > > > > > pthread_join(thread_id1, &status); > > > pthread_join(thread_id2, &status); > > > > > > > > > exit(0); > > > > > > } > > > > > > > > > If you run this on an smp system, you'll clearly see that, occasionally the > > > value of threads is 3. That indicates that you have points where you have > > > multiple contexts executing in that conditional block that has clearly been > > > coded to only expect one. You can't make the assumption that every pointer has > > > a held refcount here, you need to incur the update penalty. > > > > > > Neil > > > > > > > > > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free 2015-03-27 10:25 ` Neil Horman 2015-03-27 10:48 ` Ananyev, Konstantin @ 2015-03-27 10:50 ` Olivier MATZ 1 sibling, 0 replies; 10+ messages in thread From: Olivier MATZ @ 2015-03-27 10:50 UTC (permalink / raw) To: Neil Horman, Wiles, Keith; +Cc: dev Hi Neil, On 03/27/2015 11:25 AM, Neil Horman wrote: > On Thu, Mar 26, 2015 at 09:00:33PM +0000, Wiles, Keith wrote: >> >> >> On 3/26/15, 1:10 PM, "Zoltan Kiss" <zoltan.kiss@linaro.org> wrote: >> >>> The current way is not the most efficient: if m->refcnt is 1, the second >>> condition never evaluates, and we set it to 0. If refcnt > 1, the 2nd >>> condition fails again, although the code suggest otherwise to branch >>> prediction. Instead we should keep the second condition only, and remove >>> the >>> duplicate set to zero. >>> >>> Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org> >>> --- >>> lib/librte_mbuf/rte_mbuf.h | 5 +---- >>> 1 file changed, 1 insertion(+), 4 deletions(-) >>> >>> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h >>> index 17ba791..3ec4024 100644 >>> --- a/lib/librte_mbuf/rte_mbuf.h >>> +++ b/lib/librte_mbuf/rte_mbuf.h >>> @@ -764,10 +764,7 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m) >>> { >>> __rte_mbuf_sanity_check(m, 0); >>> >>> - if (likely (rte_mbuf_refcnt_read(m) == 1) || >>> - likely (rte_mbuf_refcnt_update(m, -1) == 0)) { >>> - >>> - rte_mbuf_refcnt_set(m, 0); >>> + if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) { >>> >>> /* if this is an indirect mbuf, then >>> * - detach mbuf >> >> I fell for this one too, but read Bruce¹s email >> http://dpdk.org/ml/archives/dev/2015-March/014481.html > > This is still the right thing to do though, Bruce's reasoning is erroneous. > Just because the return from rte_mbuf_refcnt_read returns 1, doesn't mean you > are the last user of the mbuf, you are only guaranteed that if the update > operation returns zero. > > In other words: > rte_mbuf_refcnt_update(m, -1) > > is an atomic operation > > if (likely (rte_mbuf_refcnt_read(m) == 1) || > likely (rte_mbuf_refcnt_update(m, -1) == 0)) { > > > is not. > > To illustrate, on two cpus, this might occur: > > CPU0 CPU1 > rte_mbuf_refcnt_read ... > returns 1 rte_mbuf_refcnt_read > ... returns 1 > execute if clause execute if clause > > In the above scenario both cpus fell into the if clause because they both held a > pointer to the same buffer and both got a return value of one, so they skipped > the update portion of the if clause and both executed the internal block of the > conditional expression. you might be tempted to think thats ok, since that > block just sets the refcnt to zero, and doing so twice isn't harmful, but the > entire purpose of that if conditional above was to ensure that only one > execution context ever executed the conditional for a given buffer. Look at > what else happens in that conditional: I disagree, I also spent some time to think about this code, and I think Bruce is right here. If you read rte_mbuf_refcnt and it returns 1, it means you are the last user, so no other core references the mbuf anymore. Your scenario is not possible, because 2 CPUs do not have the right to access to a mbuf pointer at the same time. It's like writing data in the mbuf while reading it on another core. If you think your scenario can happen, could you give an example of code that would led to such case? If you want to use a mbuf on 2 CPUs at the same time, you have to clone it first, and in this case the reference counter would be at least 2, preventing your case to happen Olivier > > static inline struct rte_mbuf* __attribute__((always_inline)) > __rte_pktmbuf_prefree_seg(struct rte_mbuf *m) > { > __rte_mbuf_sanity_check(m, 0); > > if (likely (rte_mbuf_refcnt_read(m) == 1) || > likely (rte_mbuf_refcnt_update(m, -1) == 0)) { > > rte_mbuf_refcnt_set(m, 0); > > /* if this is an indirect mbuf, then > * - detach mbuf > * - free attached mbuf segment > */ > if (RTE_MBUF_INDIRECT(m)) { > struct rte_mbuf *md = RTE_MBUF_FROM_BADDR(m->buf_addr); > rte_pktmbuf_detach(m); > if (rte_mbuf_refcnt_update(md, -1) == 0) > __rte_mbuf_raw_free(md); > } > return(m); > } > return (NULL); > } > > If the buffer is indirect, another refcnt update occurs to the buf_addr mbuf, > and in the scenario I outlined above, that refcnt will underflow, likely causing > a buffer leak. Additionally, the return code of this function is designed to > indicate to the caller if they were the last user of the buffer. In the above > scenario, two execution contexts will be told that they were, which is wrong. > > Zoltans patch is a good fix > > Acked-by: Neil Horman <nhorman@tuxdriver.com> > ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2015-03-27 13:22 UTC | newest] Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2015-03-26 18:10 [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free Zoltan Kiss 2015-03-26 21:00 ` Wiles, Keith 2015-03-26 21:07 ` Bruce Richardson 2015-03-27 10:25 ` Neil Horman 2015-03-27 10:48 ` Ananyev, Konstantin 2015-03-27 12:44 ` Neil Horman 2015-03-27 13:10 ` Olivier MATZ 2015-03-27 13:16 ` Bruce Richardson 2015-03-27 13:22 ` Ananyev, Konstantin 2015-03-27 10:50 ` Olivier MATZ
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).