From: Olivier MATZ <olivier.matz@6wind.com>
To: Zoltan Kiss <zoltan.kiss@linaro.org>,
	 "Ananyev, Konstantin" <konstantin.ananyev@intel.com>,
	"dev@dpdk.org" <dev@dpdk.org>
Subject: Re: [dpdk-dev] [PATCH v2] mempool: improve cache search
Date: Wed, 15 Jul 2015 10:56:50 +0200	[thread overview]
Message-ID: <55A62052.9080908@6wind.com> (raw)
In-Reply-To: <559C0991.601@linaro.org>

Hi,

On 07/07/2015 07:17 PM, Zoltan Kiss wrote:
>
>
> On 02/07/15 18:07, Ananyev, Konstantin wrote:
>>
>>
>>> -----Original Message-----
>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zoltan Kiss
>>> Sent: Wednesday, July 01, 2015 10:04 AM
>>> To: dev@dpdk.org
>>> Subject: [dpdk-dev] [PATCH v2] mempool: improve cache search
>>>
>>> The current way has a few problems:
>>>
>>> - if cache->len < n, we copy our elements into the cache first, then
>>>    into obj_table, that's unnecessary
>>> - if n >= cache_size (or the backfill fails), and we can't fulfil the
>>>    request from the ring alone, we don't try to combine with the cache
>>> - if refill fails, we don't return anything, even if the ring has enough
>>>    for our request
>>>
>>> This patch rewrites it severely:
>>> - at the first part of the function we only try the cache if
>>> cache->len < n
>>> - otherwise take our elements straight from the ring
>>> - if that fails but we have something in the cache, try to combine them
>>> - the refill happens at the end, and its failure doesn't modify our
>>>   return value
>>>
>>> Signed-off-by: Zoltan Kiss <zoltan.kiss@linaro.org>
>>> ---
>>> v2:
>>> - fix subject
>>> - add unlikely for branch where request is fulfilled both from cache
>>> and ring
>>>
>>>   lib/librte_mempool/rte_mempool.h | 63 +++++++++++++++++++++++++---------------
>>>   1 file changed, 39 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
>>> index 6d4ce9a..1e96f03 100644
>>> --- a/lib/librte_mempool/rte_mempool.h
>>> +++ b/lib/librte_mempool/rte_mempool.h
>>> @@ -947,34 +947,14 @@ __mempool_get_bulk(struct rte_mempool *mp, void **obj_table,
>>>       unsigned lcore_id = rte_lcore_id();
>>>       uint32_t cache_size = mp->cache_size;
>>>
>>> -    /* cache is not enabled or single consumer */
>>> +    cache = &mp->local_cache[lcore_id];
>>> +    /* cache is not enabled or single consumer or not enough */
>>>       if (unlikely(cache_size == 0 || is_mc == 0 ||
>>> -             n >= cache_size || lcore_id >= RTE_MAX_LCORE))
>>> +             cache->len < n || lcore_id >= RTE_MAX_LCORE))
>>>           goto ring_dequeue;
>>>
>>> -    cache = &mp->local_cache[lcore_id];
>>>       cache_objs = cache->objs;
>>>
>>> -    /* Can this be satisfied from the cache? */
>>> -    if (cache->len < n) {
>>> -        /* No. Backfill the cache first, and then fill from it */
>>> -        uint32_t req = n + (cache_size - cache->len);
>>> -
>>> -        /* How many do we require i.e. number to fill the cache + the request */
>>> -        ret = rte_ring_mc_dequeue_bulk(mp->ring, &cache->objs[cache->len], req);
>>> -        if (unlikely(ret < 0)) {
>>> -            /*
>>> -             * In the offchance that we are buffer constrained,
>>> -             * where we are not able to allocate cache + n, go to
>>> -             * the ring directly. If that fails, we are truly out of
>>> -             * buffers.
>>> -             */
>>> -            goto ring_dequeue;
>>> -        }
>>> -
>>> -        cache->len += req;
>>> -    }
>>> -
>>>       /* Now fill in the response ... */
>>>       for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
>>>           *obj_table = cache_objs[len];
>>> @@ -983,7 +963,8 @@ __mempool_get_bulk(struct rte_mempool *mp, void **obj_table,
>>>
>>>       __MEMPOOL_STAT_ADD(mp, get_success, n);
>>>
>>> -    return 0;
>>> +    ret = 0;
>>> +    goto cache_refill;
>>>
>>>   ring_dequeue:
>>>   #endif /* RTE_MEMPOOL_CACHE_MAX_SIZE > 0 */
>>> @@ -994,11 +975,45 @@ ring_dequeue:
>>>       else
>>>           ret = rte_ring_sc_dequeue_bulk(mp->ring, obj_table, n);
>>>
>>> +#if RTE_MEMPOOL_CACHE_MAX_SIZE > 0
>>> +    if (unlikely(ret < 0 && is_mc == 1 && cache->len > 0)) {
>>> +        uint32_t req = n - cache->len;
>>> +
>>> +        ret = rte_ring_mc_dequeue_bulk(mp->ring, obj_table, req);
>>> +        if (ret == 0) {
>>> +            cache_objs = cache->objs;
>>> +            obj_table += req;
>>> +            for (index = 0; index < cache->len;
>>> +                 ++index, ++obj_table)
>>> +                *obj_table = cache_objs[index];
>>> +            cache->len = 0;
>>> +        }
>>> +    }
>>> +#endif /* RTE_MEMPOOL_CACHE_MAX_SIZE > 0 */
>>> +
>>>       if (ret < 0)
>>>           __MEMPOOL_STAT_ADD(mp, get_fail, n);
>>>       else
>>>           __MEMPOOL_STAT_ADD(mp, get_success, n);
>>>
>>> +#if RTE_MEMPOOL_CACHE_MAX_SIZE > 0
>>> +cache_refill:
>>
>> Ok, so if I get things right: if the lcore runs out of entries in the cache,
>> then on the next __mempool_get_bulk() it has to do ring_dequeue() twice:
>> 1. to satisfy the user request
>> 2. to refill the cache.
>> Right?
> Yes.
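
Condensed from the diff above (error handling and the single-consumer branch
left out), the patched path when the cache cannot cover the request really is
two ring dequeues, roughly:

    /* 1st dequeue: satisfy the user request straight from the ring */
    ret = rte_ring_mc_dequeue_bulk(mp->ring, obj_table, n);

    /* 2nd dequeue: on success, top up the per-lcore cache */
    if (ret == 0 && cache->len < n) {
        uint32_t req = cache_size - cache->len;

        if (rte_ring_mc_dequeue_bulk(mp->ring,
                                     &cache->objs[cache->len], req) == 0)
            cache->len += req;
    }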
>
>> If that's so, then I think the current approach (ring_dequeue() once to
>> refill the cache, then copy entries from the cache to the user) is a
>> cheaper (faster) one for many cases.
> But then you can't return anything if the refill fails, even if there
> would be enough in the ring (or ring+cache combined). Unless you retry
> with just n.
> __rte_ring_mc_do_dequeue is inlined; as far as I can see, the overhead of
> calling it twice is:
> - check the number of entries in the ring, and atomic cmpset of
> cons.head again. This can loop if another dequeue preceded us while
> doing that subtraction, but as that's a very short interval, I think
> it's not very likely
> - an extra rte_compiler_barrier()
> - wait for preceding dequeues to finish, and set cons.tail to the new
> value. I think this can happen often when 'n' has a big variation, so
> the previous dequeue can easily be much bigger
> - statistics update
>
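
Those two synchronization points (the cmpset loop on cons.head and the
in-order update of cons.tail) are the cost being weighed here. As a toy
illustration of the pattern only, explicitly not the rte_ring code:

    #include <stdatomic.h>
    #include <stdint.h>
    #include <sched.h>

    struct toy_ring {
        _Atomic uint32_t prod_tail;  /* written by producers            */
        _Atomic uint32_t cons_head;  /* reservation point (cmpset loop) */
        _Atomic uint32_t cons_tail;  /* publication point (in order)    */
        void *slots[1024];
    };

    static int toy_mc_dequeue(struct toy_ring *r, void **objs, uint32_t n)
    {
        uint32_t head, next;

        /* reserve a slice; loops if another consumer got in first */
        do {
            head = atomic_load(&r->cons_head);
            if (n > atomic_load(&r->prod_tail) - head)
                return -1;                  /* not enough entries */
            next = head + n;
        } while (!atomic_compare_exchange_weak(&r->cons_head, &head, next));

        for (uint32_t i = 0; i < n; i++)    /* copy the object pointers */
            objs[i] = r->slots[(head + i) & 1023];

        /* publish: wait for earlier (possibly larger) dequeues to finish */
        while (atomic_load(&r->cons_tail) != head)
            sched_yield();
        atomic_store(&r->cons_tail, next);
        return 0;
    }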
> I guess if there is no contention on the ring, the extra memcpy outweighs
> these easily. And my gut feeling says that contention around the two
> while loops should not be high, but I don't have hard facts.
> Another argument for doing two dequeues is that we can do a burst
> dequeue for the cache refill, which is better than only accepting the
> full amount.
>
> How about the following?
> If the cache can't satisfy the request, we do a dequeue from the ring to
> the cache for n + cache_size, but with rte_ring_mc_dequeue_burst. So it
> takes as many as it can, but doesn't fail if it can't get the whole amount.
> Then we copy from cache to obj_table, if there is enough.
> It makes sure we utilize as much as possible, with one ring dequeue.

Will it be possible to dequeue "n + cache_size"?
I think it would require allocating some space to store the object
pointers, right? I don't feel it's a good idea to use a dynamic local
table (or alloca()) whose size depends on n.
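
Just to spell out where the space question comes from, a hypothetical
sketch of the single-burst variant (illustrative only, not working code):

    uint32_t want = n + cache_size;
    void *tmp[want];   /* <-- this is the dynamic local table / alloca() */

    unsigned got = rte_ring_mc_dequeue_burst(mp->ring, tmp, want);
    if (got >= n) {
        /* n objects go to the caller, whatever is left refills the cache
         * (assuming cache->objs has room for it) */
        memcpy(obj_table, tmp, n * sizeof(void *));
        memcpy(&cache->objs[cache->len], &tmp[n],
               (got - n) * sizeof(void *));
        cache->len += got - n;
    }

Whatever buffer receives the burst has to hold up to n + cache_size
pointers, and n is only known at run time.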



>
>
>
>
>> Especially when the same pool is shared between multiple threads.
>> For example, when a thread is doing RX only (no TX).
>>
>>
>>> +    /* If previous dequeue was OK and we have less than n, start refill */
>>> +    if (ret == 0 && cache_size > 0 && cache->len < n) {
>>> +        uint32_t req = cache_size - cache->len;
>>
>>
>> It could be that n > cache_size.
>> In that case, there is probably no point in refilling the cache, as you
>> took the entries from the ring and the cache was left intact.
>
> Yes, it makes sense to add.
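
For v3 that would presumably just add the bound on n to the refill
condition quoted below, something along these lines (only a sketch of the
extra check, not a tested change):

    /* skip the refill when the request bypassed the cache (n > cache_size),
     * since the cache was left untouched in that case */
    if (ret == 0 && cache_size > 0 && n <= cache_size && cache->len < n) {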
>>
>> Konstantin
>>
>>> +
>>> +        cache_objs = cache->objs;
>>> +        ret = rte_ring_mc_dequeue_bulk(mp->ring,
>>> +                           &cache->objs[cache->len],
>>> +                           req);
>>> +        if (likely(ret == 0))
>>> +            cache->len += req;
>>> +        else
>>> +            /* Don't spoil the return value */
>>> +            ret = 0;
>>> +    }
>>> +#endif /* RTE_MEMPOOL_CACHE_MAX_SIZE > 0 */
>>> +
>>>       return ret;
>>>   }
>>>
>>> --
>>> 1.9.1
>>

Thread overview: 10+ messages
2015-06-25 18:48 [dpdk-dev] [PATCH] mempool: improbe " Zoltan Kiss
2015-06-30 11:58 ` Olivier MATZ
2015-06-30 13:59   ` Zoltan Kiss
2015-07-01  9:03 ` [dpdk-dev] [PATCH v2] mempool: improve " Zoltan Kiss
2015-07-02 17:07   ` Ananyev, Konstantin
2015-07-07 17:17     ` Zoltan Kiss
2015-07-08  9:27       ` Bruce Richardson
2015-07-15  8:56       ` Olivier MATZ [this message]
2015-07-03 13:32   ` Olivier MATZ
2015-07-03 13:44     ` Olivier MATZ
