From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <55A62052.9080908@6wind.com>
Date: Wed, 15 Jul 2015 10:56:50 +0200
From: Olivier MATZ
To: Zoltan Kiss, "Ananyev, Konstantin", dev@dpdk.org
References: <1435258110-17140-1-git-send-email-zoltan.kiss@linaro.org>
 <1435741430-2088-1-git-send-email-zoltan.kiss@linaro.org>
 <2601191342CEEE43887BDE71AB97725836A21C32@irsmsx105.ger.corp.intel.com>
 <559C0991.601@linaro.org>
In-Reply-To: <559C0991.601@linaro.org>
Subject: Re: [dpdk-dev] [PATCH v2] mempool: improve cache search
List-Id: patches and discussions about DPDK

Hi,

On 07/07/2015 07:17 PM, Zoltan Kiss wrote:
>
> On 02/07/15 18:07, Ananyev, Konstantin wrote:
>>
>>> -----Original Message-----
>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zoltan Kiss
>>> Sent: Wednesday, July 01, 2015 10:04 AM
>>> To: dev@dpdk.org
>>> Subject: [dpdk-dev] [PATCH v2] mempool: improve cache search
>>>
>>> The current code has a few problems:
>>>
>>> - if cache->len < n, we copy our elements into the cache first and then
>>>   into obj_table, which is unnecessary
>>> - if n >= cache_size (or the backfill fails) and we can't fulfil the
>>>   request from the ring alone, we don't try to combine it with the cache
>>> - if the refill fails, we don't return anything, even if the ring has
>>>   enough for our request
>>>
>>> This patch rewrites that logic substantially:
>>> - in the first part of the function we only use the cache if it already
>>>   holds at least n elements
>>> - otherwise we take our elements straight from the ring
>>> - if that fails but we have something in the cache, we try to combine
>>>   the two
>>> - the refill happens at the end, and its failure doesn't modify our
>>>   return value
>>>
>>> Signed-off-by: Zoltan Kiss
>>> ---
>>> v2:
>>> - fix subject
>>> - add unlikely for the branch where the request is fulfilled both from
>>>   cache and ring
>>>
>>>  lib/librte_mempool/rte_mempool.h | 63 +++++++++++++++++++++++++---------------
>>>  1 file changed, 39 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
>>> index 6d4ce9a..1e96f03 100644
>>> --- a/lib/librte_mempool/rte_mempool.h
>>> +++ b/lib/librte_mempool/rte_mempool.h
>>> @@ -947,34 +947,14 @@ __mempool_get_bulk(struct rte_mempool *mp, void **obj_table,
>>>  	unsigned lcore_id = rte_lcore_id();
>>>  	uint32_t cache_size = mp->cache_size;
>>>
>>> -	/* cache is not enabled or single consumer */
>>> +	cache = &mp->local_cache[lcore_id];
>>> +	/* cache is not enabled or single consumer or not enough */
>>>  	if (unlikely(cache_size == 0 || is_mc == 0 ||
>>> -		     n >= cache_size || lcore_id >= RTE_MAX_LCORE))
>>> +		     cache->len < n || lcore_id >= RTE_MAX_LCORE))
>>>  		goto ring_dequeue;
>>>
>>> -	cache = &mp->local_cache[lcore_id];
>>>  	cache_objs = cache->objs;
>>>
>>> -	/* Can this be satisfied from the cache? */
>>> -	if (cache->len < n) {
>>> -		/* No. Backfill the cache first, and then fill from it */
>>> -		uint32_t req = n + (cache_size - cache->len);
>>> -
>>> -		/* How many do we require i.e. number to fill the cache + the request */
>>> -		ret = rte_ring_mc_dequeue_bulk(mp->ring,
>>> -			&cache->objs[cache->len], req);
>>> -		if (unlikely(ret < 0)) {
>>> -			/*
>>> -			 * In the offchance that we are buffer constrained,
>>> -			 * where we are not able to allocate cache + n, go to
>>> -			 * the ring directly. If that fails, we are truly out of
>>> -			 * buffers.
>>> -			 */
>>> -			goto ring_dequeue;
>>> -		}
>>> -
>>> -		cache->len += req;
>>> -	}
>>> -
>>>  	/* Now fill in the response ... */
>>>  	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
>>>  		*obj_table = cache_objs[len];
>>> @@ -983,7 +963,8 @@ __mempool_get_bulk(struct rte_mempool *mp, void **obj_table,
>>>
>>>  	__MEMPOOL_STAT_ADD(mp, get_success, n);
>>>
>>> -	return 0;
>>> +	ret = 0;
>>> +	goto cache_refill;
>>>
>>>  ring_dequeue:
>>>  #endif /* RTE_MEMPOOL_CACHE_MAX_SIZE > 0 */
>>> @@ -994,11 +975,45 @@ ring_dequeue:
>>>  	else
>>>  		ret = rte_ring_sc_dequeue_bulk(mp->ring, obj_table, n);
>>>
>>> +#if RTE_MEMPOOL_CACHE_MAX_SIZE > 0
>>> +	if (unlikely(ret < 0 && is_mc == 1 && cache->len > 0)) {
>>> +		uint32_t req = n - cache->len;
>>> +
>>> +		ret = rte_ring_mc_dequeue_bulk(mp->ring, obj_table, req);
>>> +		if (ret == 0) {
>>> +			cache_objs = cache->objs;
>>> +			obj_table += req;
>>> +			for (index = 0; index < cache->len;
>>> +			     ++index, ++obj_table)
>>> +				*obj_table = cache_objs[index];
>>> +			cache->len = 0;
>>> +		}
>>> +	}
>>> +#endif /* RTE_MEMPOOL_CACHE_MAX_SIZE > 0 */
>>> +
>>>  	if (ret < 0)
>>>  		__MEMPOOL_STAT_ADD(mp, get_fail, n);
>>>  	else
>>>  		__MEMPOOL_STAT_ADD(mp, get_success, n);
>>>
>>> +#if RTE_MEMPOOL_CACHE_MAX_SIZE > 0
>>> +cache_refill:
>>
>> OK, so if I get things right: if the lcore runs out of entries in the cache,
>> then on the next __mempool_get_bulk() it has to do ring_dequeue() twice:
>> 1. to satisfy the user request
>> 2. to refill the cache.
>> Right?
> Yes.
>
>> If that's so, then I think the current approach
>> (ring_dequeue() once to refill the cache, then copy entries from the
>> cache to the user)
>> is a cheaper (faster) one for many cases.
> But then you can't return anything if the refill fails, even if there
> would be enough in the ring (or ring+cache combined). Unless you retry
> with just n.
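
For illustration, here is a minimal sketch of the backfill-first flow with
that "retry with just n" fallback. The helper name is made up and this is
not code from the patch; it assumes RTE_MEMPOOL_CACHE_MAX_SIZE > 0, that
n < cache_size (as in the pre-patch path), and the ring API of this
period, where rte_ring_mc_dequeue_bulk() returns 0 on success and a
negative value on failure:

#include <rte_mempool.h>
#include <rte_ring.h>
#include <rte_branch_prediction.h>

/*
 * Illustrative sketch only, not a DPDK API. Backfill-first strategy:
 * one dequeue sized to refill the per-lcore cache and cover the
 * request; if that larger dequeue fails, retry with just n so the
 * request can still succeed whenever the ring holds at least n objects.
 */
static inline int
mempool_get_backfill_first(struct rte_mempool *mp,
		struct rte_mempool_cache *cache,
		void **obj_table, unsigned n)
{
	void **cache_objs = cache->objs;
	uint32_t index, len;

	if (cache->len < n) {
		/* Refill the cache and cover n in a single dequeue. */
		uint32_t req = n + (mp->cache_size - cache->len);

		if (unlikely(rte_ring_mc_dequeue_bulk(mp->ring,
				&cache->objs[cache->len], req) < 0))
			/* The "retry with just n" fallback: serve the
			 * request straight from the ring, bypassing
			 * the cache. */
			return rte_ring_mc_dequeue_bulk(mp->ring,
					obj_table, n);

		cache->len += req;
	}

	/* Serve the request from the (now large enough) cache. */
	for (index = 0, len = cache->len - 1; index < n;
			++index, len--, obj_table++)
		*obj_table = cache_objs[len];
	cache->len -= n;

	return 0;
}

In this sketch the second ring access only happens on the failure path,
which is the trade-off being discussed: the common case stays a single
ring dequeue.
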
> __rte_ring_mc_do_dequeue is inlined; as far as I can see, the overhead
> of calling it twice is:
> - checking the number of entries in the ring, and the atomic cmpset of
>   cons.head again. This can loop if another dequeue preceded us while we
>   were doing that subtraction, but as that's a very short interval, I
>   think it's not very likely
> - an extra rte_compiler_barrier()
> - waiting for preceding dequeues to finish, and setting cons.tail to the
>   new value. I think this can happen often when 'n' varies a lot, since
>   the previous dequeue can easily be much bigger
> - the statistics update
>
> I guess if there is no contention on the ring, the extra memcpy easily
> outweighs these. And my gut feeling says that contention around the two
> while loops should not be high, but I don't have hard facts.
> Another argument for doing two dequeues is that we can do a burst
> dequeue for the cache refill, which is better than only accepting the
> full amount.
>
> How about the following?
> If the cache can't satisfy the request, we do a dequeue from the ring
> into the cache for n + cache_size, but with rte_ring_mc_dequeue_burst.
> So it takes as many as it can, but doesn't fail if it can't take the
> whole amount. Then we copy from the cache to obj_table, if there is
> enough. That makes sure we utilize as much as possible, with one ring
> dequeue.

Will it be possible to dequeue "n + cache_size"? I think it would
require allocating some space to store the object pointers, right?
I don't feel it's a good idea to use a dynamic local table (or
alloca()) that depends on n. (An illustrative sketch of this
burst-refill idea is at the end of this mail.)

>
>> Especially when the same pool is shared between multiple threads.
>> For example, when a thread is doing RX only (no TX).
>>
>>> +	/* If previous dequeue was OK and we have less than n, start refill */
>>> +	if (ret == 0 && cache_size > 0 && cache->len < n) {
>>> +		uint32_t req = cache_size - cache->len;
>>
>> It could be that n > cache_size.
>> In that case, there is probably no point in refilling the cache, as you
>> took the entries from the ring and the cache was intact.
>
> Yes, it makes sense to add that.
>>
>> Konstantin
>>
>>> +
>>> +		cache_objs = cache->objs;
>>> +		ret = rte_ring_mc_dequeue_bulk(mp->ring,
>>> +				&cache->objs[cache->len],
>>> +				req);
>>> +		if (likely(ret == 0))
>>> +			cache->len += req;
>>> +		else
>>> +			/* Don't spoil the return value */
>>> +			ret = 0;
>>> +	}
>>> +#endif /* RTE_MEMPOOL_CACHE_MAX_SIZE > 0 */
>>> +
>>>  	return ret;
>>>  }
>>>
>>> --
>>> 1.9.1
>>
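
For reference, here is the sketch of the burst-refill idea mentioned
above: one rte_ring_mc_dequeue_burst() into the per-lcore cache, then the
request is served from the cache. The helper name is made up and this is
not code from the patch; capping the dequeue at the room left in
cache->objs is a guard added here (not proposed in the thread) for the
sizing question above, and the burst call is assumed to return the number
of objects actually dequeued, as it did at the time:

#include <errno.h>
#include <rte_mempool.h>
#include <rte_ring.h>

/*
 * Illustrative sketch only, not a DPDK API. One burst dequeue pulls up
 * to "n + cache_size" objects (bounded by the space left in the cache
 * array) into the per-lcore cache; the request is then served from the
 * cache if enough objects arrived. Whatever was dequeued stays in the
 * cache even when the request itself cannot be satisfied.
 */
static inline int
mempool_get_burst_refill(struct rte_mempool *mp,
		struct rte_mempool_cache *cache,
		void **obj_table, unsigned n)
{
	void **cache_objs = cache->objs;
	uint32_t index, len;

	if (cache->len < n) {
		unsigned room = sizeof(cache->objs) / sizeof(cache->objs[0])
				- cache->len;
		unsigned req = n + mp->cache_size;

		if (req > room)
			req = room;

		/* Take as many objects as available, up to req. */
		cache->len += rte_ring_mc_dequeue_burst(mp->ring,
				&cache->objs[cache->len], req);
	}

	if (cache->len < n)
		return -ENOENT; /* still not enough for this request */

	/* Serve the request from the cache. */
	for (index = 0, len = cache->len - 1; index < n;
			++index, len--, obj_table++)
		*obj_table = cache_objs[len];
	cache->len -= n;

	return 0;
}

Even with the cap, a request with n larger than the cache array simply
fails here, which is essentially the limitation pointed out above: the
fixed-size cache cannot be used as a staging area for arbitrarily large
requests.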