DPDK patches and discussions
 help / color / mirror / Atom feed
* [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free
@ 2015-03-26 18:10 Zoltan Kiss
  2015-03-26 21:00 ` Wiles, Keith
  0 siblings, 1 reply; 10+ messages in thread
From: Zoltan Kiss @ 2015-03-26 18:10 UTC (permalink / raw)
  To: dev; +Cc: Zoltan Kiss

The current way is not the most efficient: if m->refcnt is 1, the second
condition never evaluates, and we set it to 0. If refcnt > 1, the 2nd
condition fails again, although the code suggest otherwise to branch
prediction. Instead we should keep the second condition only, and remove the
duplicate set to zero.

Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org>
---
 lib/librte_mbuf/rte_mbuf.h | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 17ba791..3ec4024 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -764,10 +764,7 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 {
 	__rte_mbuf_sanity_check(m, 0);
 
-	if (likely (rte_mbuf_refcnt_read(m) == 1) ||
-			likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
-
-		rte_mbuf_refcnt_set(m, 0);
+	if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
 
 		/* if this is an indirect mbuf, then
 		 *  - detach mbuf
-- 
1.9.1

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free
  2015-03-26 18:10 [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free Zoltan Kiss
@ 2015-03-26 21:00 ` Wiles, Keith
  2015-03-26 21:07   ` Bruce Richardson
  2015-03-27 10:25   ` Neil Horman
  0 siblings, 2 replies; 10+ messages in thread
From: Wiles, Keith @ 2015-03-26 21:00 UTC (permalink / raw)
  To: Zoltan Kiss, dev



On 3/26/15, 1:10 PM, "Zoltan Kiss" <zoltan.kiss@linaro.org> wrote:

>The current way is not the most efficient: if m->refcnt is 1, the second
>condition never evaluates, and we set it to 0. If refcnt > 1, the 2nd
>condition fails again, although the code suggest otherwise to branch
>prediction. Instead we should keep the second condition only, and remove
>the
>duplicate set to zero.
>
>Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org>
>---
> lib/librte_mbuf/rte_mbuf.h | 5 +----
> 1 file changed, 1 insertion(+), 4 deletions(-)
>
>diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
>index 17ba791..3ec4024 100644
>--- a/lib/librte_mbuf/rte_mbuf.h
>+++ b/lib/librte_mbuf/rte_mbuf.h
>@@ -764,10 +764,7 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> {
> 	__rte_mbuf_sanity_check(m, 0);
> 
>-	if (likely (rte_mbuf_refcnt_read(m) == 1) ||
>-			likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
>-
>-		rte_mbuf_refcnt_set(m, 0);
>+	if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> 
> 		/* if this is an indirect mbuf, then
> 		 *  - detach mbuf

I fell for this one too, but read Bruce¹s email
http://dpdk.org/ml/archives/dev/2015-March/014481.html
>-- 
>1.9.1
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free
  2015-03-26 21:00 ` Wiles, Keith
@ 2015-03-26 21:07   ` Bruce Richardson
  2015-03-27 10:25   ` Neil Horman
  1 sibling, 0 replies; 10+ messages in thread
From: Bruce Richardson @ 2015-03-26 21:07 UTC (permalink / raw)
  To: Wiles, Keith; +Cc: dev, Zoltan Kiss

On Thu, Mar 26, 2015 at 09:00:33PM +0000, Wiles, Keith wrote:
> 
> 
> On 3/26/15, 1:10 PM, "Zoltan Kiss" <zoltan.kiss@linaro.org> wrote:
> 
> >The current way is not the most efficient: if m->refcnt is 1, the second
> >condition never evaluates, and we set it to 0. If refcnt > 1, the 2nd
> >condition fails again, although the code suggest otherwise to branch
> >prediction. Instead we should keep the second condition only, and remove
> >the
> >duplicate set to zero.
> >
> >Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org>
> >---
> > lib/librte_mbuf/rte_mbuf.h | 5 +----
> > 1 file changed, 1 insertion(+), 4 deletions(-)
> >
> >diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> >index 17ba791..3ec4024 100644
> >--- a/lib/librte_mbuf/rte_mbuf.h
> >+++ b/lib/librte_mbuf/rte_mbuf.h
> >@@ -764,10 +764,7 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> > {
> > 	__rte_mbuf_sanity_check(m, 0);
> > 
> >-	if (likely (rte_mbuf_refcnt_read(m) == 1) ||
> >-			likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> >-
> >-		rte_mbuf_refcnt_set(m, 0);
> >+	if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> > 
> > 		/* if this is an indirect mbuf, then
> > 		 *  - detach mbuf
> 
> I fell for this one too, but read Bruce¹s email
> http://dpdk.org/ml/archives/dev/2015-March/014481.html

Looks like a code comment that really, really needs to be added to the code itself!

/Bruce

> >-- 
> >1.9.1
> >
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free
  2015-03-26 21:00 ` Wiles, Keith
  2015-03-26 21:07   ` Bruce Richardson
@ 2015-03-27 10:25   ` Neil Horman
  2015-03-27 10:48     ` Ananyev, Konstantin
  2015-03-27 10:50     ` Olivier MATZ
  1 sibling, 2 replies; 10+ messages in thread
From: Neil Horman @ 2015-03-27 10:25 UTC (permalink / raw)
  To: Wiles, Keith; +Cc: dev

On Thu, Mar 26, 2015 at 09:00:33PM +0000, Wiles, Keith wrote:
> 
> 
> On 3/26/15, 1:10 PM, "Zoltan Kiss" <zoltan.kiss@linaro.org> wrote:
> 
> >The current way is not the most efficient: if m->refcnt is 1, the second
> >condition never evaluates, and we set it to 0. If refcnt > 1, the 2nd
> >condition fails again, although the code suggest otherwise to branch
> >prediction. Instead we should keep the second condition only, and remove
> >the
> >duplicate set to zero.
> >
> >Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org>
> >---
> > lib/librte_mbuf/rte_mbuf.h | 5 +----
> > 1 file changed, 1 insertion(+), 4 deletions(-)
> >
> >diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> >index 17ba791..3ec4024 100644
> >--- a/lib/librte_mbuf/rte_mbuf.h
> >+++ b/lib/librte_mbuf/rte_mbuf.h
> >@@ -764,10 +764,7 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> > {
> > 	__rte_mbuf_sanity_check(m, 0);
> > 
> >-	if (likely (rte_mbuf_refcnt_read(m) == 1) ||
> >-			likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> >-
> >-		rte_mbuf_refcnt_set(m, 0);
> >+	if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> > 
> > 		/* if this is an indirect mbuf, then
> > 		 *  - detach mbuf
> 
> I fell for this one too, but read Bruce¹s email
> http://dpdk.org/ml/archives/dev/2015-March/014481.html

This is still the right thing to do though, Bruce's reasoning is erroneous.
Just because the return from rte_mbuf_refcnt_read returns 1, doesn't mean you
are the last user of the mbuf, you are only guaranteed that if the update
operation returns zero.

In other words:
rte_mbuf_refcnt_update(m, -1)

is an atomic operation

if (likely (rte_mbuf_refcnt_read(m) == 1) ||
                    likely (rte_mbuf_refcnt_update(m, -1) == 0)) {


is not.

To illustrate, on two cpus, this might occur:

CPU0					CPU1
rte_mbuf_refcnt_read			...
   returns 1				rte_mbuf_refcnt_read
...					   returns 1
execute if clause			execute if clause

In the above scenario both cpus fell into the if clause because they both held a
pointer to the same buffer and both got a return value of one, so they skipped
the update portion of the if clause and both executed the internal block of the
conditional expression.  you might be tempted to think thats ok, since that
block just sets the refcnt to zero, and doing so twice isn't harmful, but the
entire purpose of that if conditional above was to ensure that only one
execution context ever executed the conditional for a given buffer.  Look at
what else happens in that conditional:

static inline struct rte_mbuf* __attribute__((always_inline))
__rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
{
        __rte_mbuf_sanity_check(m, 0);

        if (likely (rte_mbuf_refcnt_read(m) == 1) ||
                        likely (rte_mbuf_refcnt_update(m, -1) == 0)) {

                rte_mbuf_refcnt_set(m, 0);

                /* if this is an indirect mbuf, then
                 *  - detach mbuf
                 *  - free attached mbuf segment
                 */
                if (RTE_MBUF_INDIRECT(m)) {
                        struct rte_mbuf *md = RTE_MBUF_FROM_BADDR(m->buf_addr);
                        rte_pktmbuf_detach(m);
                        if (rte_mbuf_refcnt_update(md, -1) == 0)
                                __rte_mbuf_raw_free(md);
                }
                return(m);
        }
        return (NULL);
}

If the buffer is indirect, another refcnt update occurs to the buf_addr mbuf,
and in the scenario I outlined above, that refcnt will underflow, likely causing
a buffer leak.  Additionally, the return code of this function is designed to
indicate to the caller if they were the last user of the buffer.  In the above
scenario, two execution contexts will be told that they were, which is wrong.

Zoltans patch is a good fix

Acked-by: Neil Horman <nhorman@tuxdriver.com>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free
  2015-03-27 10:25   ` Neil Horman
@ 2015-03-27 10:48     ` Ananyev, Konstantin
  2015-03-27 12:44       ` Neil Horman
  2015-03-27 10:50     ` Olivier MATZ
  1 sibling, 1 reply; 10+ messages in thread
From: Ananyev, Konstantin @ 2015-03-27 10:48 UTC (permalink / raw)
  To: Neil Horman, Wiles, Keith; +Cc: dev



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Neil Horman
> Sent: Friday, March 27, 2015 10:26 AM
> To: Wiles, Keith
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free
> 
> On Thu, Mar 26, 2015 at 09:00:33PM +0000, Wiles, Keith wrote:
> >
> >
> > On 3/26/15, 1:10 PM, "Zoltan Kiss" <zoltan.kiss@linaro.org> wrote:
> >
> > >The current way is not the most efficient: if m->refcnt is 1, the second
> > >condition never evaluates, and we set it to 0. If refcnt > 1, the 2nd
> > >condition fails again, although the code suggest otherwise to branch
> > >prediction. Instead we should keep the second condition only, and remove
> > >the
> > >duplicate set to zero.
> > >
> > >Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org>
> > >---
> > > lib/librte_mbuf/rte_mbuf.h | 5 +----
> > > 1 file changed, 1 insertion(+), 4 deletions(-)
> > >
> > >diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> > >index 17ba791..3ec4024 100644
> > >--- a/lib/librte_mbuf/rte_mbuf.h
> > >+++ b/lib/librte_mbuf/rte_mbuf.h
> > >@@ -764,10 +764,7 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> > > {
> > > 	__rte_mbuf_sanity_check(m, 0);
> > >
> > >-	if (likely (rte_mbuf_refcnt_read(m) == 1) ||
> > >-			likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> > >-
> > >-		rte_mbuf_refcnt_set(m, 0);
> > >+	if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> > >
> > > 		/* if this is an indirect mbuf, then
> > > 		 *  - detach mbuf
> >
> > I fell for this one too, but read Bruce¹s email
> > http://dpdk.org/ml/archives/dev/2015-March/014481.html
> 
> This is still the right thing to do though, Bruce's reasoning is erroneous.

No, it is not. I believe Bruce comments is absolutely correct here.

> Just because the return from rte_mbuf_refcnt_read returns 1, doesn't mean you

It does.

> are the last user of the mbuf,
> you are only guaranteed that if the update
> operation returns zero.
> 
> In other words:
> rte_mbuf_refcnt_update(m, -1)
> 
> is an atomic operation
> 
> if (likely (rte_mbuf_refcnt_read(m) == 1) ||
>                     likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> 
> 
> is not.
> 
> To illustrate, on two cpus, this might occur:
> 
> CPU0					CPU1
> rte_mbuf_refcnt_read			...
>    returns 1				rte_mbuf_refcnt_read
> ...					   returns 1
> execute if clause			execute if clause


If you have an mbuf with refcnt==N and try to call free() for it N+1 times -
it is a bug in your code.
Such code wouldn't work properly doesn't matter do we use:

 if (likely (rte_mbuf_refcnt_read(m) == 1) || likely (rte_mbuf_refcnt_update(m, -1) == 0))

or just: 
if (likely (rte_mbuf_refcnt_update(m, -1) == 0))

To illustrate it with your example:
Suppose m.refcnt==1

CPU0 executes: 

rte_pktmbuf_free(m1)
        /*rte_mbuf_refcnt_update(m1, -1) returns 0, so we reset I'ts refcnt and next and put mbuf back to the pool.*/

m2 = rte_pktmbuf_alloc(pool);
      /*as m1 is 'free' alloc could return same mbuf here, i.e: m2 == m1. */

/* m2 refcnt ==1 start using m2 */

CPU1 executes:
rte_pktmbuf_free(m1)
        /*rte_mbuf_refcnt_update(m1, -1) returns 0, so we reset I'ts refcnt and next and put mbuf back to the pool.*/

We just returnend to the pool mbuf that is in use and caused silent memory corruption of the mbuf's content.

> 
> In the above scenario both cpus fell into the if clause because they both held a
> pointer to the same buffer and both got a return value of one, so they skipped
> the update portion of the if clause and both executed the internal block of the
> conditional expression.  you might be tempted to think thats ok, since that
> block just sets the refcnt to zero, and doing so twice isn't harmful, but the
> entire purpose of that if conditional above was to ensure that only one
> execution context ever executed the conditional for a given buffer.  Look at
> what else happens in that conditional:
> 
> static inline struct rte_mbuf* __attribute__((always_inline))
> __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> {
>         __rte_mbuf_sanity_check(m, 0);
> 
>         if (likely (rte_mbuf_refcnt_read(m) == 1) ||
>                         likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> 
>                 rte_mbuf_refcnt_set(m, 0);
> 
>                 /* if this is an indirect mbuf, then
>                  *  - detach mbuf
>                  *  - free attached mbuf segment
>                  */
>                 if (RTE_MBUF_INDIRECT(m)) {
>                         struct rte_mbuf *md = RTE_MBUF_FROM_BADDR(m->buf_addr);
>                         rte_pktmbuf_detach(m);
>                         if (rte_mbuf_refcnt_update(md, -1) == 0)
>                                 __rte_mbuf_raw_free(md);
>                 }
>                 return(m);
>         }
>         return (NULL);
> }
> 
> If the buffer is indirect, another refcnt update occurs to the buf_addr mbuf,
> and in the scenario I outlined above, that refcnt will underflow, likely causing
> a buffer leak.  Additionally, the return code of this function is designed to
> indicate to the caller if they were the last user of the buffer.  In the above
> scenario, two execution contexts will be told that they were, which is wrong.
> 
> Zoltans patch is a good fix

I don't think so.


> 
> Acked-by: Neil Horman <nhorman@tuxdriver.com>


NACKed-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free
  2015-03-27 10:25   ` Neil Horman
  2015-03-27 10:48     ` Ananyev, Konstantin
@ 2015-03-27 10:50     ` Olivier MATZ
  1 sibling, 0 replies; 10+ messages in thread
From: Olivier MATZ @ 2015-03-27 10:50 UTC (permalink / raw)
  To: Neil Horman, Wiles, Keith; +Cc: dev

Hi Neil,

On 03/27/2015 11:25 AM, Neil Horman wrote:
> On Thu, Mar 26, 2015 at 09:00:33PM +0000, Wiles, Keith wrote:
>>
>>
>> On 3/26/15, 1:10 PM, "Zoltan Kiss" <zoltan.kiss@linaro.org> wrote:
>>
>>> The current way is not the most efficient: if m->refcnt is 1, the second
>>> condition never evaluates, and we set it to 0. If refcnt > 1, the 2nd
>>> condition fails again, although the code suggest otherwise to branch
>>> prediction. Instead we should keep the second condition only, and remove
>>> the
>>> duplicate set to zero.
>>>
>>> Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org>
>>> ---
>>> lib/librte_mbuf/rte_mbuf.h | 5 +----
>>> 1 file changed, 1 insertion(+), 4 deletions(-)
>>>
>>> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
>>> index 17ba791..3ec4024 100644
>>> --- a/lib/librte_mbuf/rte_mbuf.h
>>> +++ b/lib/librte_mbuf/rte_mbuf.h
>>> @@ -764,10 +764,7 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
>>> {
>>> 	__rte_mbuf_sanity_check(m, 0);
>>>
>>> -	if (likely (rte_mbuf_refcnt_read(m) == 1) ||
>>> -			likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
>>> -
>>> -		rte_mbuf_refcnt_set(m, 0);
>>> +	if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
>>>
>>> 		/* if this is an indirect mbuf, then
>>> 		 *  - detach mbuf
>>
>> I fell for this one too, but read Bruce¹s email
>> http://dpdk.org/ml/archives/dev/2015-March/014481.html
> 
> This is still the right thing to do though, Bruce's reasoning is erroneous.
> Just because the return from rte_mbuf_refcnt_read returns 1, doesn't mean you
> are the last user of the mbuf, you are only guaranteed that if the update
> operation returns zero.
> 
> In other words:
> rte_mbuf_refcnt_update(m, -1)
> 
> is an atomic operation
> 
> if (likely (rte_mbuf_refcnt_read(m) == 1) ||
>                     likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> 
> 
> is not.
> 
> To illustrate, on two cpus, this might occur:
> 
> CPU0					CPU1
> rte_mbuf_refcnt_read			...
>    returns 1				rte_mbuf_refcnt_read
> ...					   returns 1
> execute if clause			execute if clause
> 
> In the above scenario both cpus fell into the if clause because they both held a
> pointer to the same buffer and both got a return value of one, so they skipped
> the update portion of the if clause and both executed the internal block of the
> conditional expression.  you might be tempted to think thats ok, since that
> block just sets the refcnt to zero, and doing so twice isn't harmful, but the
> entire purpose of that if conditional above was to ensure that only one
> execution context ever executed the conditional for a given buffer.  Look at
> what else happens in that conditional:

I disagree, I also spent some time to think about this code, and I think
Bruce is right here. If you read rte_mbuf_refcnt and it returns 1, it
means you are the last user, so no other core references the mbuf
anymore.

Your scenario is not possible, because 2 CPUs do not have the right to
access to a mbuf pointer at the same time. It's like writing data in the
mbuf while reading it on another core. If you think your scenario can
happen, could you give an example of code that would led to such case?

If you want to use a mbuf on 2 CPUs at the same time, you have to clone
it first, and in this case the reference counter would be at least 2,
preventing your case to happen


Olivier



> 
> static inline struct rte_mbuf* __attribute__((always_inline))
> __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> {
>         __rte_mbuf_sanity_check(m, 0);
> 
>         if (likely (rte_mbuf_refcnt_read(m) == 1) ||
>                         likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> 
>                 rte_mbuf_refcnt_set(m, 0);
> 
>                 /* if this is an indirect mbuf, then
>                  *  - detach mbuf
>                  *  - free attached mbuf segment
>                  */
>                 if (RTE_MBUF_INDIRECT(m)) {
>                         struct rte_mbuf *md = RTE_MBUF_FROM_BADDR(m->buf_addr);
>                         rte_pktmbuf_detach(m);
>                         if (rte_mbuf_refcnt_update(md, -1) == 0)
>                                 __rte_mbuf_raw_free(md);
>                 }
>                 return(m);
>         }
>         return (NULL);
> }
> 
> If the buffer is indirect, another refcnt update occurs to the buf_addr mbuf,
> and in the scenario I outlined above, that refcnt will underflow, likely causing
> a buffer leak.  Additionally, the return code of this function is designed to
> indicate to the caller if they were the last user of the buffer.  In the above
> scenario, two execution contexts will be told that they were, which is wrong.
> 
> Zoltans patch is a good fix
> 
> Acked-by: Neil Horman <nhorman@tuxdriver.com>
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free
  2015-03-27 10:48     ` Ananyev, Konstantin
@ 2015-03-27 12:44       ` Neil Horman
  2015-03-27 13:10         ` Olivier MATZ
  0 siblings, 1 reply; 10+ messages in thread
From: Neil Horman @ 2015-03-27 12:44 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev

On Fri, Mar 27, 2015 at 10:48:20AM +0000, Ananyev, Konstantin wrote:
> 
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Neil Horman
> > Sent: Friday, March 27, 2015 10:26 AM
> > To: Wiles, Keith
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free
> > 
> > On Thu, Mar 26, 2015 at 09:00:33PM +0000, Wiles, Keith wrote:
> > >
> > >
> > > On 3/26/15, 1:10 PM, "Zoltan Kiss" <zoltan.kiss@linaro.org> wrote:
> > >
> > > >The current way is not the most efficient: if m->refcnt is 1, the second
> > > >condition never evaluates, and we set it to 0. If refcnt > 1, the 2nd
> > > >condition fails again, although the code suggest otherwise to branch
> > > >prediction. Instead we should keep the second condition only, and remove
> > > >the
> > > >duplicate set to zero.
> > > >
> > > >Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org>
> > > >---
> > > > lib/librte_mbuf/rte_mbuf.h | 5 +----
> > > > 1 file changed, 1 insertion(+), 4 deletions(-)
> > > >
> > > >diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> > > >index 17ba791..3ec4024 100644
> > > >--- a/lib/librte_mbuf/rte_mbuf.h
> > > >+++ b/lib/librte_mbuf/rte_mbuf.h
> > > >@@ -764,10 +764,7 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> > > > {
> > > > 	__rte_mbuf_sanity_check(m, 0);
> > > >
> > > >-	if (likely (rte_mbuf_refcnt_read(m) == 1) ||
> > > >-			likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> > > >-
> > > >-		rte_mbuf_refcnt_set(m, 0);
> > > >+	if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> > > >
> > > > 		/* if this is an indirect mbuf, then
> > > > 		 *  - detach mbuf
> > >
> > > I fell for this one too, but read Bruce¹s email
> > > http://dpdk.org/ml/archives/dev/2015-March/014481.html
> > 
> > This is still the right thing to do though, Bruce's reasoning is erroneous.
> 
> No, it is not. I believe Bruce comments is absolutely correct here.
> 
You and bruce are wrong, I proved that below.

> > Just because the return from rte_mbuf_refcnt_read returns 1, doesn't mean you
> 
> It does.
> 
assertions are meaningless without evidence.

> > are the last user of the mbuf,
> > you are only guaranteed that if the update
> > operation returns zero.
> > 
> > In other words:
> > rte_mbuf_refcnt_update(m, -1)
> > 
> > is an atomic operation
> > 
> > if (likely (rte_mbuf_refcnt_read(m) == 1) ||
> >                     likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> > 
> > 
> > is not.
> > 
> > To illustrate, on two cpus, this might occur:
> > 
> > CPU0					CPU1
> > rte_mbuf_refcnt_read			...
> >    returns 1				rte_mbuf_refcnt_read
> > ...					   returns 1
> > execute if clause			execute if clause
> 
> 
> If you have an mbuf with refcnt==N and try to call free() for it N+1 times -
> it is a bug in your code.
At what point in time did I indicate this was about multiple frees?  Please
re-read my post.

> Such code wouldn't work properly doesn't matter do we use:
> 
>  if (likely (rte_mbuf_refcnt_read(m) == 1) || likely (rte_mbuf_refcnt_update(m, -1) == 0))
> 
> or just: 
> if (likely (rte_mbuf_refcnt_update(m, -1) == 0))
> 
> To illustrate it with your example:
> Suppose m.refcnt==1
> 
> CPU0 executes: 
> 
> rte_pktmbuf_free(m1)
>         /*rte_mbuf_refcnt_update(m1, -1) returns 0, so we reset I'ts refcnt and next and put mbuf back to the pool.*/
> 
> m2 = rte_pktmbuf_alloc(pool);
>       /*as m1 is 'free' alloc could return same mbuf here, i.e: m2 == m1. */
> 
> /* m2 refcnt ==1 start using m2 */
> 
Really missing the point here.

> CPU1 executes:
> rte_pktmbuf_free(m1)
>         /*rte_mbuf_refcnt_update(m1, -1) returns 0, so we reset I'ts refcnt and next and put mbuf back to the pool.*/
> 
> We just returnend to the pool mbuf that is in use and caused silent memory corruption of the mbuf's content.
> 
Still missing the point. Please see below

> > 
> > In the above scenario both cpus fell into the if clause because they both held a
> > pointer to the same buffer and both got a return value of one, so they skipped
> > the update portion of the if clause and both executed the internal block of the
> > conditional expression.  you might be tempted to think thats ok, since that
> > block just sets the refcnt to zero, and doing so twice isn't harmful, but the
> > entire purpose of that if conditional above was to ensure that only one
> > execution context ever executed the conditional for a given buffer.  Look at
> > what else happens in that conditional:
> > 
> > static inline struct rte_mbuf* __attribute__((always_inline))
> > __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> > {
> >         __rte_mbuf_sanity_check(m, 0);
> > 
> >         if (likely (rte_mbuf_refcnt_read(m) == 1) ||
> >                         likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> > 
> >                 rte_mbuf_refcnt_set(m, 0);
> > 
> >                 /* if this is an indirect mbuf, then
> >                  *  - detach mbuf
> >                  *  - free attached mbuf segment
> >                  */
> >                 if (RTE_MBUF_INDIRECT(m)) {
> >                         struct rte_mbuf *md = RTE_MBUF_FROM_BADDR(m->buf_addr);
> >                         rte_pktmbuf_detach(m);
> >                         if (rte_mbuf_refcnt_update(md, -1) == 0)
> >                                 __rte_mbuf_raw_free(md);
> >                 }
> >                 return(m);
> >         }
> >         return (NULL);
> > }
> > 
> > If the buffer is indirect, another refcnt update occurs to the buf_addr mbuf,
> > and in the scenario I outlined above, that refcnt will underflow, likely causing
> > a buffer leak.  Additionally, the return code of this function is designed to
> > indicate to the caller if they were the last user of the buffer.  In the above
> > scenario, two execution contexts will be told that they were, which is wrong.
> > 
> > Zoltans patch is a good fix
> 
> I don't think so.
> 
> 
> > 
> > Acked-by: Neil Horman <nhorman@tuxdriver.com>
> 
> 
> NACKed-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> 

Again, this has nothing to do with how many times you free an object and
everything to do with why you use atomics here in the first place.  The purpose
of the if conditional in the above code is to ensure that the contents of the
conditional block only get executed a single time, correct?  Ostensibly you
don't want to execution contexts getting in there at the same time right?

If you have a single buffer with refcnt=1, and two cpus are executing code that
points to that buffer, and they both call __rte_pktmbuf_prefree_seg at around
the same time, they can race and both wind up in that conditional block, leading
to underflow of the md pointer refcnt, which is bad.

Lets look at another more practical example.  lets imagine that that the mbuf X
is linked into a set that multiple cpus can query. X->refcnt is held by CPU0,
and is about to be freed using the above refcnt test model (a read followed by
an update that gets squashed, anda refcnt set in the free block.  Basically this
pseudo code:

if (refcnt_read(X) == 1 || refcnt_update(X) == ) {
	refcnt_set(X,0)
	mbuf_free(X)
}

at the same time CPU1 is preforming a lookup of our needed mbuf from the
aforementioned set, finds it and takes a refcnt on it.


CPU0						CPU1
if(refcnt_read(X))				search for mbuf X
     returns 1					get pointer to X
...						refcnt_update(X,1)
refcnt_set(X, 0)				...
mbuf_free(X)


After the following sequence X is freed but CPU1 is left thinking that it has a
valid reference to the mbuf.  This is broken.

As an alternate thought experiment, why use atomics here at all?  X86 is cache
coherent right?  (ignore that new processor support, as this code predates it).
If all cpus are able to see a consistent state of a variable, and if every
context that has a pointer to a given mbuf also has a reference to an mbuf, then
it should be safe to simply use an integer here rather than an atomic, right?
If you know that you have a reference to a pointer, just decrement the refcnt
and check for 0 instead of one, that will tell you that you are the last user of
a buffer, right?  The answer is you can't because there are conditions in which
you either need to make a set of conditions atomic (finding a pointer and
increasing said refcnt under the protection of a lock), or you need some method
to predicate the execution of some initial or finilazation event (like in
__rte_pktmbuf_prefree_seg so that you don't have multiple contexts doing that
same init/finalization and so that you don't provide small windows of
inconsistency in your atomics, which is what you have above.

I wrote a demonstration program to illustrate (forgive me, its pretty quick and
dirty), but I think it illustrates the point:

#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>
#include <stdatomic.h>

atomic_uint_fast64_t refcnt;

uint threads = 0;

static void * thread_exec(void *arg)
{
        int i;
        int cpu = (int)(arg);
        cpu_set_t cpuset;
        pthread_t thread;

        thread = pthread_self();
        CPU_ZERO(&cpuset);
        CPU_SET(cpu, &cpuset);
        pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);

        for (i=0; i < 1000; i++) {
                if (((atomic_fetch_sub(&refcnt, 0) == 1) ||
                        atomic_fetch_sub(&refcnt, 1) == 0)) {
                        // There should only ever be one thread in here at a
                        atomic_init(&refcnt, 0);
                        threads |= cpu;
                        printf("threads = %d\n", threads);
                        threads &= ~cpu;

			// Need to reset the refcnt for future iterations
			// but that should be fine since no other thread
			// should be in here but us
                        atomic_init(&refcnt, 1);
                }
        }

        pthread_exit(NULL);
}

int main(int argc, char **argv)
{
        pthread_attr_t attr;
        pthread_t thread_id1, thread_id2;
        void *status;

        atomic_init(&refcnt, 1);

        pthread_attr_init(&attr);

        pthread_create(&thread_id1, &attr, thread_exec, (void *)1);
        pthread_create(&thread_id2, &attr, thread_exec, (void *)2);

        pthread_attr_destroy(&attr);

        pthread_join(thread_id1, &status);
        pthread_join(thread_id2, &status);


        exit(0);

}


If you run this on an smp system, you'll clearly see that, occasionally the
value of threads is 3.  That indicates that you have points where you have
multiple contexts executing in that conditional block that has clearly been
coded to only expect one.  You can't make the assumption that every pointer has
a held refcount here, you need to incur the update penalty.

Neil

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free
  2015-03-27 12:44       ` Neil Horman
@ 2015-03-27 13:10         ` Olivier MATZ
  2015-03-27 13:16           ` Bruce Richardson
  0 siblings, 1 reply; 10+ messages in thread
From: Olivier MATZ @ 2015-03-27 13:10 UTC (permalink / raw)
  To: Neil Horman, Ananyev, Konstantin; +Cc: dev

Hi Neil,

On 03/27/2015 01:44 PM, Neil Horman wrote:
> On Fri, Mar 27, 2015 at 10:48:20AM +0000, Ananyev, Konstantin wrote:
>>
>>
>>> -----Original Message-----
>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Neil Horman
>>> Sent: Friday, March 27, 2015 10:26 AM
>>> To: Wiles, Keith
>>> Cc: dev@dpdk.org
>>> Subject: Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free
>>>
>>> On Thu, Mar 26, 2015 at 09:00:33PM +0000, Wiles, Keith wrote:
>>>>
>>>>
>>>> On 3/26/15, 1:10 PM, "Zoltan Kiss" <zoltan.kiss@linaro.org> wrote:
>>>>
>>>>> The current way is not the most efficient: if m->refcnt is 1, the second
>>>>> condition never evaluates, and we set it to 0. If refcnt > 1, the 2nd
>>>>> condition fails again, although the code suggest otherwise to branch
>>>>> prediction. Instead we should keep the second condition only, and remove
>>>>> the
>>>>> duplicate set to zero.
>>>>>
>>>>> Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org>
>>>>> ---
>>>>> lib/librte_mbuf/rte_mbuf.h | 5 +----
>>>>> 1 file changed, 1 insertion(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
>>>>> index 17ba791..3ec4024 100644
>>>>> --- a/lib/librte_mbuf/rte_mbuf.h
>>>>> +++ b/lib/librte_mbuf/rte_mbuf.h
>>>>> @@ -764,10 +764,7 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
>>>>> {
>>>>> 	__rte_mbuf_sanity_check(m, 0);
>>>>>
>>>>> -	if (likely (rte_mbuf_refcnt_read(m) == 1) ||
>>>>> -			likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
>>>>> -
>>>>> -		rte_mbuf_refcnt_set(m, 0);
>>>>> +	if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
>>>>>
>>>>> 		/* if this is an indirect mbuf, then
>>>>> 		 *  - detach mbuf
>>>>
>>>> I fell for this one too, but read Bruce¹s email
>>>> http://dpdk.org/ml/archives/dev/2015-March/014481.html
>>>
>>> This is still the right thing to do though, Bruce's reasoning is erroneous.
>>
>> No, it is not. I believe Bruce comments is absolutely correct here.
>>
> You and bruce are wrong, I proved that below.
> 
>>> Just because the return from rte_mbuf_refcnt_read returns 1, doesn't mean you
>>
>> It does.
>>
> assertions are meaningless without evidence.
> 
>>> are the last user of the mbuf,
>>> you are only guaranteed that if the update
>>> operation returns zero.
>>>
>>> In other words:
>>> rte_mbuf_refcnt_update(m, -1)
>>>
>>> is an atomic operation
>>>
>>> if (likely (rte_mbuf_refcnt_read(m) == 1) ||
>>>                     likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
>>>
>>>
>>> is not.
>>>
>>> To illustrate, on two cpus, this might occur:
>>>
>>> CPU0					CPU1
>>> rte_mbuf_refcnt_read			...
>>>    returns 1				rte_mbuf_refcnt_read
>>> ...					   returns 1
>>> execute if clause			execute if clause
>>
>>
>> If you have an mbuf with refcnt==N and try to call free() for it N+1 times -
>> it is a bug in your code.
> At what point in time did I indicate this was about multiple frees?  Please
> re-read my post.
> 
>> Such code wouldn't work properly doesn't matter do we use:
>>
>>  if (likely (rte_mbuf_refcnt_read(m) == 1) || likely (rte_mbuf_refcnt_update(m, -1) == 0))
>>
>> or just: 
>> if (likely (rte_mbuf_refcnt_update(m, -1) == 0))
>>
>> To illustrate it with your example:
>> Suppose m.refcnt==1
>>
>> CPU0 executes: 
>>
>> rte_pktmbuf_free(m1)
>>         /*rte_mbuf_refcnt_update(m1, -1) returns 0, so we reset I'ts refcnt and next and put mbuf back to the pool.*/
>>
>> m2 = rte_pktmbuf_alloc(pool);
>>       /*as m1 is 'free' alloc could return same mbuf here, i.e: m2 == m1. */
>>
>> /* m2 refcnt ==1 start using m2 */
>>
> Really missing the point here.
> 
>> CPU1 executes:
>> rte_pktmbuf_free(m1)
>>         /*rte_mbuf_refcnt_update(m1, -1) returns 0, so we reset I'ts refcnt and next and put mbuf back to the pool.*/
>>
>> We just returnend to the pool mbuf that is in use and caused silent memory corruption of the mbuf's content.
>>
> Still missing the point. Please see below
> 
>>>
>>> In the above scenario both cpus fell into the if clause because they both held a
>>> pointer to the same buffer and both got a return value of one, so they skipped
>>> the update portion of the if clause and both executed the internal block of the
>>> conditional expression.  you might be tempted to think thats ok, since that
>>> block just sets the refcnt to zero, and doing so twice isn't harmful, but the
>>> entire purpose of that if conditional above was to ensure that only one
>>> execution context ever executed the conditional for a given buffer.  Look at
>>> what else happens in that conditional:
>>>
>>> static inline struct rte_mbuf* __attribute__((always_inline))
>>> __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
>>> {
>>>         __rte_mbuf_sanity_check(m, 0);
>>>
>>>         if (likely (rte_mbuf_refcnt_read(m) == 1) ||
>>>                         likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
>>>
>>>                 rte_mbuf_refcnt_set(m, 0);
>>>
>>>                 /* if this is an indirect mbuf, then
>>>                  *  - detach mbuf
>>>                  *  - free attached mbuf segment
>>>                  */
>>>                 if (RTE_MBUF_INDIRECT(m)) {
>>>                         struct rte_mbuf *md = RTE_MBUF_FROM_BADDR(m->buf_addr);
>>>                         rte_pktmbuf_detach(m);
>>>                         if (rte_mbuf_refcnt_update(md, -1) == 0)
>>>                                 __rte_mbuf_raw_free(md);
>>>                 }
>>>                 return(m);
>>>         }
>>>         return (NULL);
>>> }
>>>
>>> If the buffer is indirect, another refcnt update occurs to the buf_addr mbuf,
>>> and in the scenario I outlined above, that refcnt will underflow, likely causing
>>> a buffer leak.  Additionally, the return code of this function is designed to
>>> indicate to the caller if they were the last user of the buffer.  In the above
>>> scenario, two execution contexts will be told that they were, which is wrong.
>>>
>>> Zoltans patch is a good fix
>>
>> I don't think so.
>>
>>
>>>
>>> Acked-by: Neil Horman <nhorman@tuxdriver.com>
>>
>>
>> NACKed-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
>>
> 
> Again, this has nothing to do with how many times you free an object and
> everything to do with why you use atomics here in the first place.  The purpose
> of the if conditional in the above code is to ensure that the contents of the
> conditional block only get executed a single time, correct?  Ostensibly you
> don't want to execution contexts getting in there at the same time right?
> 
> If you have a single buffer with refcnt=1, and two cpus are executing code that
> points to that buffer, and they both call __rte_pktmbuf_prefree_seg at around
> the same time, they can race and both wind up in that conditional block, leading
> to underflow of the md pointer refcnt, which is bad.


You cannot have a mbuf with refcnt=1 referenced by 2 cores, this does
not make sense. Even with the fix you have acked.

CPU0                                   CPU1

m = a_common_mbuf;                     m = a_common_mbuf;
rte_pktmbuf_free(m) // fully atomic
m2 = rte_pktmbuf_alloc()
// m2 returned the same addr than m
// as it was in the pool
                                       // should not access m here
                                       // whatever the operation


Your example below just shows that the current code is wrong if
several cores access a mbuf with refcnt=1 at the same time. That's
true, but that's not allowed.

- If you want to give a mbuf to another core, you put it in a ring
  and stop to reference it on core 0, here not need to have refcnt

- If you want to share a mbuf with another core, you increase the
  reference counter before sending it to core 1. Then, both cores
  will have to call rte_pktmbuf_free().



Regards,
Olivier




> 
> Lets look at another more practical example.  lets imagine that that the mbuf X
> is linked into a set that multiple cpus can query. X->refcnt is held by CPU0,
> and is about to be freed using the above refcnt test model (a read followed by
> an update that gets squashed, anda refcnt set in the free block.  Basically this
> pseudo code:
> 
> if (refcnt_read(X) == 1 || refcnt_update(X) == ) {
> 	refcnt_set(X,0)
> 	mbuf_free(X)
> }
> 
> at the same time CPU1 is preforming a lookup of our needed mbuf from the
> aforementioned set, finds it and takes a refcnt on it.
> 
> 
> CPU0						CPU1
> if(refcnt_read(X))				search for mbuf X
>      returns 1					get pointer to X
> ...						refcnt_update(X,1)
> refcnt_set(X, 0)				...
> mbuf_free(X)
> 
> 
> After the following sequence X is freed but CPU1 is left thinking that it has a
> valid reference to the mbuf.  This is broken.
> 
> As an alternate thought experiment, why use atomics here at all?  X86 is cache
> coherent right?  (ignore that new processor support, as this code predates it).
> If all cpus are able to see a consistent state of a variable, and if every
> context that has a pointer to a given mbuf also has a reference to an mbuf, then
> it should be safe to simply use an integer here rather than an atomic, right?
> If you know that you have a reference to a pointer, just decrement the refcnt
> and check for 0 instead of one, that will tell you that you are the last user of
> a buffer, right?  The answer is you can't because there are conditions in which
> you either need to make a set of conditions atomic (finding a pointer and
> increasing said refcnt under the protection of a lock), or you need some method
> to predicate the execution of some initial or finilazation event (like in
> __rte_pktmbuf_prefree_seg so that you don't have multiple contexts doing that
> same init/finalization and so that you don't provide small windows of
> inconsistency in your atomics, which is what you have above.
> 
> I wrote a demonstration program to illustrate (forgive me, its pretty quick and
> dirty), but I think it illustrates the point:
> 
> #define _GNU_SOURCE
> #include <stdlib.h>
> #include <stdio.h>
> #include <pthread.h>
> #include <stdatomic.h>
> 
> atomic_uint_fast64_t refcnt;
> 
> uint threads = 0;
> 
> static void * thread_exec(void *arg)
> {
>         int i;
>         int cpu = (int)(arg);
>         cpu_set_t cpuset;
>         pthread_t thread;
> 
>         thread = pthread_self();
>         CPU_ZERO(&cpuset);
>         CPU_SET(cpu, &cpuset);
>         pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
> 
>         for (i=0; i < 1000; i++) {
>                 if (((atomic_fetch_sub(&refcnt, 0) == 1) ||
>                         atomic_fetch_sub(&refcnt, 1) == 0)) {
>                         // There should only ever be one thread in here at a
>                         atomic_init(&refcnt, 0);
>                         threads |= cpu;
>                         printf("threads = %d\n", threads);
>                         threads &= ~cpu;
> 
> 			// Need to reset the refcnt for future iterations
> 			// but that should be fine since no other thread
> 			// should be in here but us
>                         atomic_init(&refcnt, 1);
>                 }
>         }
> 
>         pthread_exit(NULL);
> }
> 
> int main(int argc, char **argv)
> {
>         pthread_attr_t attr;
>         pthread_t thread_id1, thread_id2;
>         void *status;
> 
>         atomic_init(&refcnt, 1);
> 
>         pthread_attr_init(&attr);
> 
>         pthread_create(&thread_id1, &attr, thread_exec, (void *)1);
>         pthread_create(&thread_id2, &attr, thread_exec, (void *)2);
> 
>         pthread_attr_destroy(&attr);
> 
>         pthread_join(thread_id1, &status);
>         pthread_join(thread_id2, &status);
> 
> 
>         exit(0);
> 
> }
> 
> 
> If you run this on an smp system, you'll clearly see that, occasionally the
> value of threads is 3.  That indicates that you have points where you have
> multiple contexts executing in that conditional block that has clearly been
> coded to only expect one.  You can't make the assumption that every pointer has
> a held refcount here, you need to incur the update penalty.
> 
> Neil
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free
  2015-03-27 13:10         ` Olivier MATZ
@ 2015-03-27 13:16           ` Bruce Richardson
  2015-03-27 13:22             ` Ananyev, Konstantin
  0 siblings, 1 reply; 10+ messages in thread
From: Bruce Richardson @ 2015-03-27 13:16 UTC (permalink / raw)
  To: Olivier MATZ; +Cc: dev

On Fri, Mar 27, 2015 at 02:10:33PM +0100, Olivier MATZ wrote:
> Hi Neil,
> 
> On 03/27/2015 01:44 PM, Neil Horman wrote:
> > On Fri, Mar 27, 2015 at 10:48:20AM +0000, Ananyev, Konstantin wrote:
> >>
> >>
> >>> -----Original Message-----
> >>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Neil Horman
> >>> Sent: Friday, March 27, 2015 10:26 AM
> >>> To: Wiles, Keith
> >>> Cc: dev@dpdk.org
> >>> Subject: Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free
> >>>
> >>> On Thu, Mar 26, 2015 at 09:00:33PM +0000, Wiles, Keith wrote:
> >>>>
> >>>>
> >>>> On 3/26/15, 1:10 PM, "Zoltan Kiss" <zoltan.kiss@linaro.org> wrote:
> >>>>
> >>>>> The current way is not the most efficient: if m->refcnt is 1, the second
> >>>>> condition never evaluates, and we set it to 0. If refcnt > 1, the 2nd
> >>>>> condition fails again, although the code suggest otherwise to branch
> >>>>> prediction. Instead we should keep the second condition only, and remove
> >>>>> the
> >>>>> duplicate set to zero.
> >>>>>
> >>>>> Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org>
> >>>>> ---
> >>>>> lib/librte_mbuf/rte_mbuf.h | 5 +----
> >>>>> 1 file changed, 1 insertion(+), 4 deletions(-)
> >>>>>
> >>>>> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> >>>>> index 17ba791..3ec4024 100644
> >>>>> --- a/lib/librte_mbuf/rte_mbuf.h
> >>>>> +++ b/lib/librte_mbuf/rte_mbuf.h
> >>>>> @@ -764,10 +764,7 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> >>>>> {
> >>>>> 	__rte_mbuf_sanity_check(m, 0);
> >>>>>
> >>>>> -	if (likely (rte_mbuf_refcnt_read(m) == 1) ||
> >>>>> -			likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> >>>>> -
> >>>>> -		rte_mbuf_refcnt_set(m, 0);
> >>>>> +	if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> >>>>>
> >>>>> 		/* if this is an indirect mbuf, then
> >>>>> 		 *  - detach mbuf
> >>>>
> >>>> I fell for this one too, but read Bruce¹s email
> >>>> http://dpdk.org/ml/archives/dev/2015-March/014481.html
> >>>
> >>> This is still the right thing to do though, Bruce's reasoning is erroneous.
> >>
> >> No, it is not. I believe Bruce comments is absolutely correct here.
> >>
> > You and bruce are wrong, I proved that below.
> > 
> >>> Just because the return from rte_mbuf_refcnt_read returns 1, doesn't mean you
> >>
> >> It does.
> >>
> > assertions are meaningless without evidence.
> > 
> >>> are the last user of the mbuf,
> >>> you are only guaranteed that if the update
> >>> operation returns zero.
> >>>
> >>> In other words:
> >>> rte_mbuf_refcnt_update(m, -1)
> >>>
> >>> is an atomic operation
> >>>
> >>> if (likely (rte_mbuf_refcnt_read(m) == 1) ||
> >>>                     likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> >>>
> >>>
> >>> is not.
> >>>
> >>> To illustrate, on two cpus, this might occur:
> >>>
> >>> CPU0					CPU1
> >>> rte_mbuf_refcnt_read			...
> >>>    returns 1				rte_mbuf_refcnt_read
> >>> ...					   returns 1
> >>> execute if clause			execute if clause
> >>
> >>
> >> If you have an mbuf with refcnt==N and try to call free() for it N+1 times -
> >> it is a bug in your code.
> > At what point in time did I indicate this was about multiple frees?  Please
> > re-read my post.
> > 
> >> Such code wouldn't work properly doesn't matter do we use:
> >>
> >>  if (likely (rte_mbuf_refcnt_read(m) == 1) || likely (rte_mbuf_refcnt_update(m, -1) == 0))
> >>
> >> or just: 
> >> if (likely (rte_mbuf_refcnt_update(m, -1) == 0))
> >>
> >> To illustrate it with your example:
> >> Suppose m.refcnt==1
> >>
> >> CPU0 executes: 
> >>
> >> rte_pktmbuf_free(m1)
> >>         /*rte_mbuf_refcnt_update(m1, -1) returns 0, so we reset I'ts refcnt and next and put mbuf back to the pool.*/
> >>
> >> m2 = rte_pktmbuf_alloc(pool);
> >>       /*as m1 is 'free' alloc could return same mbuf here, i.e: m2 == m1. */
> >>
> >> /* m2 refcnt ==1 start using m2 */
> >>
> > Really missing the point here.
> > 
> >> CPU1 executes:
> >> rte_pktmbuf_free(m1)
> >>         /*rte_mbuf_refcnt_update(m1, -1) returns 0, so we reset I'ts refcnt and next and put mbuf back to the pool.*/
> >>
> >> We just returnend to the pool mbuf that is in use and caused silent memory corruption of the mbuf's content.
> >>
> > Still missing the point. Please see below
> > 
> >>>
> >>> In the above scenario both cpus fell into the if clause because they both held a
> >>> pointer to the same buffer and both got a return value of one, so they skipped
> >>> the update portion of the if clause and both executed the internal block of the
> >>> conditional expression.  you might be tempted to think thats ok, since that
> >>> block just sets the refcnt to zero, and doing so twice isn't harmful, but the
> >>> entire purpose of that if conditional above was to ensure that only one
> >>> execution context ever executed the conditional for a given buffer.  Look at
> >>> what else happens in that conditional:
> >>>
> >>> static inline struct rte_mbuf* __attribute__((always_inline))
> >>> __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> >>> {
> >>>         __rte_mbuf_sanity_check(m, 0);
> >>>
> >>>         if (likely (rte_mbuf_refcnt_read(m) == 1) ||
> >>>                         likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> >>>
> >>>                 rte_mbuf_refcnt_set(m, 0);
> >>>
> >>>                 /* if this is an indirect mbuf, then
> >>>                  *  - detach mbuf
> >>>                  *  - free attached mbuf segment
> >>>                  */
> >>>                 if (RTE_MBUF_INDIRECT(m)) {
> >>>                         struct rte_mbuf *md = RTE_MBUF_FROM_BADDR(m->buf_addr);
> >>>                         rte_pktmbuf_detach(m);
> >>>                         if (rte_mbuf_refcnt_update(md, -1) == 0)
> >>>                                 __rte_mbuf_raw_free(md);
> >>>                 }
> >>>                 return(m);
> >>>         }
> >>>         return (NULL);
> >>> }
> >>>
> >>> If the buffer is indirect, another refcnt update occurs to the buf_addr mbuf,
> >>> and in the scenario I outlined above, that refcnt will underflow, likely causing
> >>> a buffer leak.  Additionally, the return code of this function is designed to
> >>> indicate to the caller if they were the last user of the buffer.  In the above
> >>> scenario, two execution contexts will be told that they were, which is wrong.
> >>>
> >>> Zoltans patch is a good fix
> >>
> >> I don't think so.
> >>
> >>
> >>>
> >>> Acked-by: Neil Horman <nhorman@tuxdriver.com>
> >>
> >>
> >> NACKed-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> >>
> > 
> > Again, this has nothing to do with how many times you free an object and
> > everything to do with why you use atomics here in the first place.  The purpose
> > of the if conditional in the above code is to ensure that the contents of the
> > conditional block only get executed a single time, correct?  Ostensibly you
> > don't want to execution contexts getting in there at the same time right?
> > 
> > If you have a single buffer with refcnt=1, and two cpus are executing code that
> > points to that buffer, and they both call __rte_pktmbuf_prefree_seg at around
> > the same time, they can race and both wind up in that conditional block, leading
> > to underflow of the md pointer refcnt, which is bad.
> 
> 
> You cannot have a mbuf with refcnt=1 referenced by 2 cores, this does
> not make sense. Even with the fix you have acked.
> 
> CPU0                                   CPU1
> 
> m = a_common_mbuf;                     m = a_common_mbuf;
> rte_pktmbuf_free(m) // fully atomic
> m2 = rte_pktmbuf_alloc()
> // m2 returned the same addr than m
> // as it was in the pool
>                                        // should not access m here
>                                        // whatever the operation
> 
> 
> Your example below just shows that the current code is wrong if
> several cores access a mbuf with refcnt=1 at the same time. That's
> true, but that's not allowed.
> 
> - If you want to give a mbuf to another core, you put it in a ring
>   and stop to reference it on core 0, here not need to have refcnt
> 
> - If you want to share a mbuf with another core, you increase the
>   reference counter before sending it to core 1. Then, both cores
>   will have to call rte_pktmbuf_free().
> 
> 
> 
> Regards,
> Olivier
> 
> 

+1 

Also to note that the function this comment is being added to is a free-type
function. If you are in that function, you are freeing the mbuf, so calls
from multiple cores simultaneously is a double-free error.

/Bruce

> 
> 
> > 
> > Lets look at another more practical example.  lets imagine that that the mbuf X
> > is linked into a set that multiple cpus can query. X->refcnt is held by CPU0,
> > and is about to be freed using the above refcnt test model (a read followed by
> > an update that gets squashed, anda refcnt set in the free block.  Basically this
> > pseudo code:
> > 
> > if (refcnt_read(X) == 1 || refcnt_update(X) == ) {
> > 	refcnt_set(X,0)
> > 	mbuf_free(X)
> > }
> > 
> > at the same time CPU1 is preforming a lookup of our needed mbuf from the
> > aforementioned set, finds it and takes a refcnt on it.
> > 
> > 
> > CPU0						CPU1
> > if(refcnt_read(X))				search for mbuf X
> >      returns 1					get pointer to X
> > ...						refcnt_update(X,1)
> > refcnt_set(X, 0)				...
> > mbuf_free(X)
> > 
> > 
> > After the following sequence X is freed but CPU1 is left thinking that it has a
> > valid reference to the mbuf.  This is broken.
> > 
> > As an alternate thought experiment, why use atomics here at all?  X86 is cache
> > coherent right?  (ignore that new processor support, as this code predates it).
> > If all cpus are able to see a consistent state of a variable, and if every
> > context that has a pointer to a given mbuf also has a reference to an mbuf, then
> > it should be safe to simply use an integer here rather than an atomic, right?
> > If you know that you have a reference to a pointer, just decrement the refcnt
> > and check for 0 instead of one, that will tell you that you are the last user of
> > a buffer, right?  The answer is you can't because there are conditions in which
> > you either need to make a set of conditions atomic (finding a pointer and
> > increasing said refcnt under the protection of a lock), or you need some method
> > to predicate the execution of some initial or finilazation event (like in
> > __rte_pktmbuf_prefree_seg so that you don't have multiple contexts doing that
> > same init/finalization and so that you don't provide small windows of
> > inconsistency in your atomics, which is what you have above.
> > 
> > I wrote a demonstration program to illustrate (forgive me, its pretty quick and
> > dirty), but I think it illustrates the point:
> > 
> > #define _GNU_SOURCE
> > #include <stdlib.h>
> > #include <stdio.h>
> > #include <pthread.h>
> > #include <stdatomic.h>
> > 
> > atomic_uint_fast64_t refcnt;
> > 
> > uint threads = 0;
> > 
> > static void * thread_exec(void *arg)
> > {
> >         int i;
> >         int cpu = (int)(arg);
> >         cpu_set_t cpuset;
> >         pthread_t thread;
> > 
> >         thread = pthread_self();
> >         CPU_ZERO(&cpuset);
> >         CPU_SET(cpu, &cpuset);
> >         pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
> > 
> >         for (i=0; i < 1000; i++) {
> >                 if (((atomic_fetch_sub(&refcnt, 0) == 1) ||
> >                         atomic_fetch_sub(&refcnt, 1) == 0)) {
> >                         // There should only ever be one thread in here at a
> >                         atomic_init(&refcnt, 0);
> >                         threads |= cpu;
> >                         printf("threads = %d\n", threads);
> >                         threads &= ~cpu;
> > 
> > 			// Need to reset the refcnt for future iterations
> > 			// but that should be fine since no other thread
> > 			// should be in here but us
> >                         atomic_init(&refcnt, 1);
> >                 }
> >         }
> > 
> >         pthread_exit(NULL);
> > }
> > 
> > int main(int argc, char **argv)
> > {
> >         pthread_attr_t attr;
> >         pthread_t thread_id1, thread_id2;
> >         void *status;
> > 
> >         atomic_init(&refcnt, 1);
> > 
> >         pthread_attr_init(&attr);
> > 
> >         pthread_create(&thread_id1, &attr, thread_exec, (void *)1);
> >         pthread_create(&thread_id2, &attr, thread_exec, (void *)2);
> > 
> >         pthread_attr_destroy(&attr);
> > 
> >         pthread_join(thread_id1, &status);
> >         pthread_join(thread_id2, &status);
> > 
> > 
> >         exit(0);
> > 
> > }
> > 
> > 
> > If you run this on an smp system, you'll clearly see that, occasionally the
> > value of threads is 3.  That indicates that you have points where you have
> > multiple contexts executing in that conditional block that has clearly been
> > coded to only expect one.  You can't make the assumption that every pointer has
> > a held refcount here, you need to incur the update penalty.
> > 
> > Neil
> > 
> 
> 
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free
  2015-03-27 13:16           ` Bruce Richardson
@ 2015-03-27 13:22             ` Ananyev, Konstantin
  0 siblings, 0 replies; 10+ messages in thread
From: Ananyev, Konstantin @ 2015-03-27 13:22 UTC (permalink / raw)
  To: Richardson, Bruce, Olivier MATZ; +Cc: dev



> -----Original Message-----
> From: Richardson, Bruce
> Sent: Friday, March 27, 2015 1:17 PM
> To: Olivier MATZ
> Cc: Neil Horman; Ananyev, Konstantin; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free
> 
> On Fri, Mar 27, 2015 at 02:10:33PM +0100, Olivier MATZ wrote:
> > Hi Neil,
> >
> > On 03/27/2015 01:44 PM, Neil Horman wrote:
> > > On Fri, Mar 27, 2015 at 10:48:20AM +0000, Ananyev, Konstantin wrote:
> > >>
> > >>
> > >>> -----Original Message-----
> > >>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Neil Horman
> > >>> Sent: Friday, March 27, 2015 10:26 AM
> > >>> To: Wiles, Keith
> > >>> Cc: dev@dpdk.org
> > >>> Subject: Re: [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free
> > >>>
> > >>> On Thu, Mar 26, 2015 at 09:00:33PM +0000, Wiles, Keith wrote:
> > >>>>
> > >>>>
> > >>>> On 3/26/15, 1:10 PM, "Zoltan Kiss" <zoltan.kiss@linaro.org> wrote:
> > >>>>
> > >>>>> The current way is not the most efficient: if m->refcnt is 1, the second
> > >>>>> condition never evaluates, and we set it to 0. If refcnt > 1, the 2nd
> > >>>>> condition fails again, although the code suggest otherwise to branch
> > >>>>> prediction. Instead we should keep the second condition only, and remove
> > >>>>> the
> > >>>>> duplicate set to zero.
> > >>>>>
> > >>>>> Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org>
> > >>>>> ---
> > >>>>> lib/librte_mbuf/rte_mbuf.h | 5 +----
> > >>>>> 1 file changed, 1 insertion(+), 4 deletions(-)
> > >>>>>
> > >>>>> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> > >>>>> index 17ba791..3ec4024 100644
> > >>>>> --- a/lib/librte_mbuf/rte_mbuf.h
> > >>>>> +++ b/lib/librte_mbuf/rte_mbuf.h
> > >>>>> @@ -764,10 +764,7 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> > >>>>> {
> > >>>>> 	__rte_mbuf_sanity_check(m, 0);
> > >>>>>
> > >>>>> -	if (likely (rte_mbuf_refcnt_read(m) == 1) ||
> > >>>>> -			likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> > >>>>> -
> > >>>>> -		rte_mbuf_refcnt_set(m, 0);
> > >>>>> +	if (likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> > >>>>>
> > >>>>> 		/* if this is an indirect mbuf, then
> > >>>>> 		 *  - detach mbuf
> > >>>>
> > >>>> I fell for this one too, but read Bruce¹s email
> > >>>> http://dpdk.org/ml/archives/dev/2015-March/014481.html
> > >>>
> > >>> This is still the right thing to do though, Bruce's reasoning is erroneous.
> > >>
> > >> No, it is not. I believe Bruce comments is absolutely correct here.
> > >>
> > > You and bruce are wrong, I proved that below.
> > >
> > >>> Just because the return from rte_mbuf_refcnt_read returns 1, doesn't mean you
> > >>
> > >> It does.
> > >>
> > > assertions are meaningless without evidence.
> > >
> > >>> are the last user of the mbuf,
> > >>> you are only guaranteed that if the update
> > >>> operation returns zero.
> > >>>
> > >>> In other words:
> > >>> rte_mbuf_refcnt_update(m, -1)
> > >>>
> > >>> is an atomic operation
> > >>>
> > >>> if (likely (rte_mbuf_refcnt_read(m) == 1) ||
> > >>>                     likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> > >>>
> > >>>
> > >>> is not.
> > >>>
> > >>> To illustrate, on two cpus, this might occur:
> > >>>
> > >>> CPU0					CPU1
> > >>> rte_mbuf_refcnt_read			...
> > >>>    returns 1				rte_mbuf_refcnt_read
> > >>> ...					   returns 1
> > >>> execute if clause			execute if clause
> > >>
> > >>
> > >> If you have an mbuf with refcnt==N and try to call free() for it N+1 times -
> > >> it is a bug in your code.
> > > At what point in time did I indicate this was about multiple frees?  Please
> > > re-read my post.
> > >
> > >> Such code wouldn't work properly doesn't matter do we use:
> > >>
> > >>  if (likely (rte_mbuf_refcnt_read(m) == 1) || likely (rte_mbuf_refcnt_update(m, -1) == 0))
> > >>
> > >> or just:
> > >> if (likely (rte_mbuf_refcnt_update(m, -1) == 0))
> > >>
> > >> To illustrate it with your example:
> > >> Suppose m.refcnt==1
> > >>
> > >> CPU0 executes:
> > >>
> > >> rte_pktmbuf_free(m1)
> > >>         /*rte_mbuf_refcnt_update(m1, -1) returns 0, so we reset I'ts refcnt and next and put mbuf back to the pool.*/
> > >>
> > >> m2 = rte_pktmbuf_alloc(pool);
> > >>       /*as m1 is 'free' alloc could return same mbuf here, i.e: m2 == m1. */
> > >>
> > >> /* m2 refcnt ==1 start using m2 */
> > >>
> > > Really missing the point here.
> > >
> > >> CPU1 executes:
> > >> rte_pktmbuf_free(m1)
> > >>         /*rte_mbuf_refcnt_update(m1, -1) returns 0, so we reset I'ts refcnt and next and put mbuf back to the pool.*/
> > >>
> > >> We just returnend to the pool mbuf that is in use and caused silent memory corruption of the mbuf's content.
> > >>
> > > Still missing the point. Please see below
> > >
> > >>>
> > >>> In the above scenario both cpus fell into the if clause because they both held a
> > >>> pointer to the same buffer and both got a return value of one, so they skipped
> > >>> the update portion of the if clause and both executed the internal block of the
> > >>> conditional expression.  you might be tempted to think thats ok, since that
> > >>> block just sets the refcnt to zero, and doing so twice isn't harmful, but the
> > >>> entire purpose of that if conditional above was to ensure that only one
> > >>> execution context ever executed the conditional for a given buffer.  Look at
> > >>> what else happens in that conditional:
> > >>>
> > >>> static inline struct rte_mbuf* __attribute__((always_inline))
> > >>> __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> > >>> {
> > >>>         __rte_mbuf_sanity_check(m, 0);
> > >>>
> > >>>         if (likely (rte_mbuf_refcnt_read(m) == 1) ||
> > >>>                         likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
> > >>>
> > >>>                 rte_mbuf_refcnt_set(m, 0);
> > >>>
> > >>>                 /* if this is an indirect mbuf, then
> > >>>                  *  - detach mbuf
> > >>>                  *  - free attached mbuf segment
> > >>>                  */
> > >>>                 if (RTE_MBUF_INDIRECT(m)) {
> > >>>                         struct rte_mbuf *md = RTE_MBUF_FROM_BADDR(m->buf_addr);
> > >>>                         rte_pktmbuf_detach(m);
> > >>>                         if (rte_mbuf_refcnt_update(md, -1) == 0)
> > >>>                                 __rte_mbuf_raw_free(md);
> > >>>                 }
> > >>>                 return(m);
> > >>>         }
> > >>>         return (NULL);
> > >>> }
> > >>>
> > >>> If the buffer is indirect, another refcnt update occurs to the buf_addr mbuf,
> > >>> and in the scenario I outlined above, that refcnt will underflow, likely causing
> > >>> a buffer leak.  Additionally, the return code of this function is designed to
> > >>> indicate to the caller if they were the last user of the buffer.  In the above
> > >>> scenario, two execution contexts will be told that they were, which is wrong.
> > >>>
> > >>> Zoltans patch is a good fix
> > >>
> > >> I don't think so.
> > >>
> > >>
> > >>>
> > >>> Acked-by: Neil Horman <nhorman@tuxdriver.com>
> > >>
> > >>
> > >> NACKed-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > >>
> > >
> > > Again, this has nothing to do with how many times you free an object and
> > > everything to do with why you use atomics here in the first place.  The purpose
> > > of the if conditional in the above code is to ensure that the contents of the
> > > conditional block only get executed a single time, correct?  Ostensibly you
> > > don't want to execution contexts getting in there at the same time right?
> > >
> > > If you have a single buffer with refcnt=1, and two cpus are executing code that
> > > points to that buffer, and they both call __rte_pktmbuf_prefree_seg at around
> > > the same time, they can race and both wind up in that conditional block, leading
> > > to underflow of the md pointer refcnt, which is bad.
> >
> >
> > You cannot have a mbuf with refcnt=1 referenced by 2 cores, this does
> > not make sense. Even with the fix you have acked.
> >
> > CPU0                                   CPU1
> >
> > m = a_common_mbuf;                     m = a_common_mbuf;
> > rte_pktmbuf_free(m) // fully atomic
> > m2 = rte_pktmbuf_alloc()
> > // m2 returned the same addr than m
> > // as it was in the pool
> >                                        // should not access m here
> >                                        // whatever the operation
> >
> >
> > Your example below just shows that the current code is wrong if
> > several cores access a mbuf with refcnt=1 at the same time. That's
> > true, but that's not allowed.
> >
> > - If you want to give a mbuf to another core, you put it in a ring
> >   and stop to reference it on core 0, here not need to have refcnt
> >
> > - If you want to share a mbuf with another core, you increase the
> >   reference counter before sending it to core 1. Then, both cores
> >   will have to call rte_pktmbuf_free().
> >
> >
> >
> > Regards,
> > Olivier
> >
> >
> 
> +1
> 
> Also to note that the function this comment is being added to is a free-type
> function. If you are in that function, you are freeing the mbuf, so calls
> from multiple cores simultaneously is a double-free error.
> 
> /Bruce

Another +1 
Was going to write a big reply, but realised Olivier and Bruce already wrote what I planned too :) 
Just to repeat:
Neil, it is an invalid scenario to call free() twice for the mbuf whose refcnt==1.
Doesn't matter is that happens concurrently from different threads, or sequentially from the same thread.
You can't make it work properly with or without Zoltan patch.
If that happens - it is a bug in your code, and as was pointed by Bruce in another mail,
just voids the concept of reference counting.
As a rule of thumb: if you are going to pass an object to an entity which would free it,
and you still plan to use that object, then it is your responsibility to increment it's reference count first.

Konstantin


> 
> >
> >
> > >
> > > Lets look at another more practical example.  lets imagine that that the mbuf X
> > > is linked into a set that multiple cpus can query. X->refcnt is held by CPU0,
> > > and is about to be freed using the above refcnt test model (a read followed by
> > > an update that gets squashed, anda refcnt set in the free block.  Basically this
> > > pseudo code:
> > >
> > > if (refcnt_read(X) == 1 || refcnt_update(X) == ) {
> > > 	refcnt_set(X,0)
> > > 	mbuf_free(X)
> > > }
> > >
> > > at the same time CPU1 is preforming a lookup of our needed mbuf from the
> > > aforementioned set, finds it and takes a refcnt on it.
> > >
> > >
> > > CPU0						CPU1
> > > if(refcnt_read(X))				search for mbuf X
> > >      returns 1					get pointer to X
> > > ...						refcnt_update(X,1)
> > > refcnt_set(X, 0)				...
> > > mbuf_free(X)
> > >
> > >
> > > After the following sequence X is freed but CPU1 is left thinking that it has a
> > > valid reference to the mbuf.  This is broken.
> > >
> > > As an alternate thought experiment, why use atomics here at all?  X86 is cache
> > > coherent right?  (ignore that new processor support, as this code predates it).
> > > If all cpus are able to see a consistent state of a variable, and if every
> > > context that has a pointer to a given mbuf also has a reference to an mbuf, then
> > > it should be safe to simply use an integer here rather than an atomic, right?
> > > If you know that you have a reference to a pointer, just decrement the refcnt
> > > and check for 0 instead of one, that will tell you that you are the last user of
> > > a buffer, right?  The answer is you can't because there are conditions in which
> > > you either need to make a set of conditions atomic (finding a pointer and
> > > increasing said refcnt under the protection of a lock), or you need some method
> > > to predicate the execution of some initial or finilazation event (like in
> > > __rte_pktmbuf_prefree_seg so that you don't have multiple contexts doing that
> > > same init/finalization and so that you don't provide small windows of
> > > inconsistency in your atomics, which is what you have above.
> > >
> > > I wrote a demonstration program to illustrate (forgive me, its pretty quick and
> > > dirty), but I think it illustrates the point:
> > >
> > > #define _GNU_SOURCE
> > > #include <stdlib.h>
> > > #include <stdio.h>
> > > #include <pthread.h>
> > > #include <stdatomic.h>
> > >
> > > atomic_uint_fast64_t refcnt;
> > >
> > > uint threads = 0;
> > >
> > > static void * thread_exec(void *arg)
> > > {
> > >         int i;
> > >         int cpu = (int)(arg);
> > >         cpu_set_t cpuset;
> > >         pthread_t thread;
> > >
> > >         thread = pthread_self();
> > >         CPU_ZERO(&cpuset);
> > >         CPU_SET(cpu, &cpuset);
> > >         pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
> > >
> > >         for (i=0; i < 1000; i++) {
> > >                 if (((atomic_fetch_sub(&refcnt, 0) == 1) ||
> > >                         atomic_fetch_sub(&refcnt, 1) == 0)) {
> > >                         // There should only ever be one thread in here at a
> > >                         atomic_init(&refcnt, 0);
> > >                         threads |= cpu;
> > >                         printf("threads = %d\n", threads);
> > >                         threads &= ~cpu;
> > >
> > > 			// Need to reset the refcnt for future iterations
> > > 			// but that should be fine since no other thread
> > > 			// should be in here but us
> > >                         atomic_init(&refcnt, 1);
> > >                 }
> > >         }
> > >
> > >         pthread_exit(NULL);
> > > }
> > >
> > > int main(int argc, char **argv)
> > > {
> > >         pthread_attr_t attr;
> > >         pthread_t thread_id1, thread_id2;
> > >         void *status;
> > >
> > >         atomic_init(&refcnt, 1);
> > >
> > >         pthread_attr_init(&attr);
> > >
> > >         pthread_create(&thread_id1, &attr, thread_exec, (void *)1);
> > >         pthread_create(&thread_id2, &attr, thread_exec, (void *)2);
> > >
> > >         pthread_attr_destroy(&attr);
> > >
> > >         pthread_join(thread_id1, &status);
> > >         pthread_join(thread_id2, &status);
> > >
> > >
> > >         exit(0);
> > >
> > > }
> > >
> > >
> > > If you run this on an smp system, you'll clearly see that, occasionally the
> > > value of threads is 3.  That indicates that you have points where you have
> > > multiple contexts executing in that conditional block that has clearly been
> > > coded to only expect one.  You can't make the assumption that every pointer has
> > > a held refcount here, you need to incur the update penalty.
> > >
> > > Neil
> > >
> >
> >
> >

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2015-03-27 13:22 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-26 18:10 [dpdk-dev] [PATCH] mbuf: optimize refcnt handling during free Zoltan Kiss
2015-03-26 21:00 ` Wiles, Keith
2015-03-26 21:07   ` Bruce Richardson
2015-03-27 10:25   ` Neil Horman
2015-03-27 10:48     ` Ananyev, Konstantin
2015-03-27 12:44       ` Neil Horman
2015-03-27 13:10         ` Olivier MATZ
2015-03-27 13:16           ` Bruce Richardson
2015-03-27 13:22             ` Ananyev, Konstantin
2015-03-27 10:50     ` Olivier MATZ

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).