From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <55A62052.9080908@6wind.com>
Date: Wed, 15 Jul 2015 10:56:50 +0200
From: Olivier MATZ
To: Zoltan Kiss, "Ananyev, Konstantin", dev@dpdk.org
References: <1435258110-17140-1-git-send-email-zoltan.kiss@linaro.org>
 <1435741430-2088-1-git-send-email-zoltan.kiss@linaro.org>
 <2601191342CEEE43887BDE71AB97725836A21C32@irsmsx105.ger.corp.intel.com>
 <559C0991.601@linaro.org>
In-Reply-To: <559C0991.601@linaro.org>
Subject: Re: [dpdk-dev] [PATCH v2] mempool: improve cache search
List-Id: patches and discussions about DPDK

Hi,

On 07/07/2015 07:17 PM, Zoltan Kiss wrote:
>
> On 02/07/15 18:07, Ananyev, Konstantin wrote:
>>
>>> -----Original Message-----
>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zoltan Kiss
>>> Sent: Wednesday, July 01, 2015 10:04 AM
>>> To: dev@dpdk.org
>>> Subject: [dpdk-dev] [PATCH v2] mempool: improve cache search
>>>
>>> The current code has a few problems:
>>>
>>> - if cache->len < n, we copy our elements into the cache first and then
>>>   into obj_table, which is unnecessary
>>> - if n >= cache_size (or the backfill fails) and we can't fulfil the
>>>   request from the ring alone, we don't try to combine it with the cache
>>> - if the refill fails, we don't return anything, even if the ring has
>>>   enough for our request
>>>
>>> This patch rewrites that logic substantially:
>>> - in the first part of the function we only use the cache if it already
>>>   holds at least n elements
>>> - otherwise we take our elements straight from the ring
>>> - if that fails but we have something in the cache, we try to combine
>>>   the two
>>> - the refill happens at the end, and its failure doesn't modify our
>>>   return value
>>>
>>> Signed-off-by: Zoltan Kiss
>>> ---
>>> v2:
>>> - fix subject
>>> - add unlikely for the branch where the request is fulfilled both from
>>>   cache and ring
>>>
>>>  lib/librte_mempool/rte_mempool.h | 63 +++++++++++++++++++++++++---------------
>>>  1 file changed, 39 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
>>> index 6d4ce9a..1e96f03 100644
>>> --- a/lib/librte_mempool/rte_mempool.h
>>> +++ b/lib/librte_mempool/rte_mempool.h
>>> @@ -947,34 +947,14 @@ __mempool_get_bulk(struct rte_mempool *mp, void **obj_table,
>>>  	unsigned lcore_id = rte_lcore_id();
>>>  	uint32_t cache_size = mp->cache_size;
>>>
>>> -	/* cache is not enabled or single consumer */
>>> +	cache = &mp->local_cache[lcore_id];
>>> +	/* cache is not enabled or single consumer or not enough */
>>>  	if (unlikely(cache_size == 0 || is_mc == 0 ||
>>> -		     n >= cache_size || lcore_id >= RTE_MAX_LCORE))
>>> +		     cache->len < n || lcore_id >= RTE_MAX_LCORE))
>>>  		goto ring_dequeue;
>>>
>>> -	cache = &mp->local_cache[lcore_id];
>>>  	cache_objs = cache->objs;
>>>
>>> -	/* Can this be satisfied from the cache? */
>>> -	if (cache->len < n) {
>>> -		/* No. Backfill the cache first, and then fill from it */
>>> -		uint32_t req = n + (cache_size - cache->len);
>>> -
>>> -		/* How many do we require i.e. number to fill the cache + the request */
>>> -		ret = rte_ring_mc_dequeue_bulk(mp->ring,
>>> -			&cache->objs[cache->len], req);
>>> -		if (unlikely(ret < 0)) {
>>> -			/*
>>> -			 * In the offchance that we are buffer constrained,
>>> -			 * where we are not able to allocate cache + n, go to
>>> -			 * the ring directly. If that fails, we are truly out of
>>> -			 * buffers.
>>> -			 */
>>> -			goto ring_dequeue;
>>> -		}
>>> -
>>> -		cache->len += req;
>>> -	}
>>> -
>>>  	/* Now fill in the response ... */
>>>  	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
>>>  		*obj_table = cache_objs[len];
>>> @@ -983,7 +963,8 @@ __mempool_get_bulk(struct rte_mempool *mp, void **obj_table,
>>>
>>>  	__MEMPOOL_STAT_ADD(mp, get_success, n);
>>>
>>> -	return 0;
>>> +	ret = 0;
>>> +	goto cache_refill;
>>>
>>>  ring_dequeue:
>>>  #endif /* RTE_MEMPOOL_CACHE_MAX_SIZE > 0 */
>>> @@ -994,11 +975,45 @@ ring_dequeue:
>>>  	else
>>>  		ret = rte_ring_sc_dequeue_bulk(mp->ring, obj_table, n);
>>>
>>> +#if RTE_MEMPOOL_CACHE_MAX_SIZE > 0
>>> +	if (unlikely(ret < 0 && is_mc == 1 && cache->len > 0)) {
>>> +		uint32_t req = n - cache->len;
>>> +
>>> +		ret = rte_ring_mc_dequeue_bulk(mp->ring, obj_table, req);
>>> +		if (ret == 0) {
>>> +			cache_objs = cache->objs;
>>> +			obj_table += req;
>>> +			for (index = 0; index < cache->len;
>>> +			     ++index, ++obj_table)
>>> +				*obj_table = cache_objs[index];
>>> +			cache->len = 0;
>>> +		}
>>> +	}
>>> +#endif /* RTE_MEMPOOL_CACHE_MAX_SIZE > 0 */
>>> +
>>>  	if (ret < 0)
>>>  		__MEMPOOL_STAT_ADD(mp, get_fail, n);
>>>  	else
>>>  		__MEMPOOL_STAT_ADD(mp, get_success, n);
>>>
>>> +#if RTE_MEMPOOL_CACHE_MAX_SIZE > 0
>>> +cache_refill:
>>
>> OK, so if I get things right: if the lcore runs out of entries in the cache,
>> then on the next __mempool_get_bulk() it has to do ring_dequeue() twice:
>> 1. to satisfy the user request
>> 2. to refill the cache.
>> Right?
> Yes.
>
>> If that's so, then I think the current approach
>> (ring_dequeue() once to refill the cache, then copy entries from the
>> cache to the user)
>> is a cheaper (faster) one for many cases.
> But then you can't return anything if the refill fails, even if there
> would be enough in the ring (or ring+cache combined). Unless you retry
> with just n.
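
For illustration, here is a minimal sketch of the backfill-first flow with
that "retry with just n" fallback. The helper name is made up and this is
not code from the patch; it assumes RTE_MEMPOOL_CACHE_MAX_SIZE > 0, that
n < cache_size (as in the pre-patch path), and the ring API of this
period, where rte_ring_mc_dequeue_bulk() returns 0 on success and a
negative value on failure:

#include <rte_mempool.h>
#include <rte_ring.h>
#include <rte_branch_prediction.h>

/*
 * Illustrative sketch only, not a DPDK API. Backfill-first strategy:
 * one dequeue sized to refill the per-lcore cache and cover the
 * request; if that larger dequeue fails, retry with just n so the
 * request can still succeed whenever the ring holds at least n objects.
 */
static inline int
mempool_get_backfill_first(struct rte_mempool *mp,
		struct rte_mempool_cache *cache,
		void **obj_table, unsigned n)
{
	void **cache_objs = cache->objs;
	uint32_t index, len;

	if (cache->len < n) {
		/* Refill the cache and cover n in a single dequeue. */
		uint32_t req = n + (mp->cache_size - cache->len);

		if (unlikely(rte_ring_mc_dequeue_bulk(mp->ring,
				&cache->objs[cache->len], req) < 0))
			/* The "retry with just n" fallback: serve the
			 * request straight from the ring, bypassing
			 * the cache. */
			return rte_ring_mc_dequeue_bulk(mp->ring,
					obj_table, n);

		cache->len += req;
	}

	/* Serve the request from the (now large enough) cache. */
	for (index = 0, len = cache->len - 1; index < n;
			++index, len--, obj_table++)
		*obj_table = cache_objs[len];
	cache->len -= n;

	return 0;
}

In this sketch the second ring access only happens on the failure path,
which is the trade-off being discussed: the common case stays a single
ring dequeue.
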
> __rte_ring_mc_do_dequeue is inlined; as far as I can see, the overhead
> of calling it twice is:
> - checking the number of entries in the ring, and the atomic cmpset of
>   cons.head again. This can loop if another dequeue preceded us while we
>   were doing that subtraction, but as that's a very short interval, I
>   think it's not very likely
> - an extra rte_compiler_barrier()
> - waiting for preceding dequeues to finish, and setting cons.tail to the
>   new value. I think this can happen often when 'n' varies a lot, since
>   the previous dequeue can easily be much bigger
> - the statistics update
>
> I guess if there is no contention on the ring, the extra memcpy easily
> outweighs these. And my gut feeling says that contention around the two
> while loops should not be high, but I don't have hard facts.
> Another argument for doing two dequeues is that we can do a burst
> dequeue for the cache refill, which is better than only accepting the
> full amount.
>
> How about the following?
> If the cache can't satisfy the request, we do a dequeue from the ring
> into the cache for n + cache_size, but with rte_ring_mc_dequeue_burst.
> So it takes as many as it can, but doesn't fail if it can't take the
> whole amount. Then we copy from the cache to obj_table, if there is
> enough. That makes sure we utilize as much as possible, with one ring
> dequeue.

Will it be possible to dequeue "n + cache_size"? I think it would
require allocating some space to store the object pointers, right?
I don't feel it's a good idea to use a dynamic local table (or
alloca()) that depends on n. (An illustrative sketch of this
burst-refill idea is at the end of this mail.)

>
>> Especially when the same pool is shared between multiple threads.
>> For example, when a thread is doing RX only (no TX).
>>
>>> +	/* If previous dequeue was OK and we have less than n, start refill */
>>> +	if (ret == 0 && cache_size > 0 && cache->len < n) {
>>> +		uint32_t req = cache_size - cache->len;
>>
>> It could be that n > cache_size.
>> In that case, there is probably no point in refilling the cache, as you
>> took the entries from the ring and the cache was intact.
>
> Yes, it makes sense to add that.
>>
>> Konstantin
>>
>>> +
>>> +		cache_objs = cache->objs;
>>> +		ret = rte_ring_mc_dequeue_bulk(mp->ring,
>>> +				&cache->objs[cache->len],
>>> +				req);
>>> +		if (likely(ret == 0))
>>> +			cache->len += req;
>>> +		else
>>> +			/* Don't spoil the return value */
>>> +			ret = 0;
>>> +	}
>>> +#endif /* RTE_MEMPOOL_CACHE_MAX_SIZE > 0 */
>>> +
>>>  	return ret;
>>>  }
>>>
>>> --
>>> 1.9.1
>>
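
For reference, here is the sketch of the burst-refill idea mentioned
above: one rte_ring_mc_dequeue_burst() into the per-lcore cache, then the
request is served from the cache. The helper name is made up and this is
not code from the patch; capping the dequeue at the room left in
cache->objs is a guard added here (not proposed in the thread) for the
sizing question above, and the burst call is assumed to return the number
of objects actually dequeued, as it did at the time:

#include <errno.h>
#include <rte_mempool.h>
#include <rte_ring.h>

/*
 * Illustrative sketch only, not a DPDK API. One burst dequeue pulls up
 * to "n + cache_size" objects (bounded by the space left in the cache
 * array) into the per-lcore cache; the request is then served from the
 * cache if enough objects arrived. Whatever was dequeued stays in the
 * cache even when the request itself cannot be satisfied.
 */
static inline int
mempool_get_burst_refill(struct rte_mempool *mp,
		struct rte_mempool_cache *cache,
		void **obj_table, unsigned n)
{
	void **cache_objs = cache->objs;
	uint32_t index, len;

	if (cache->len < n) {
		unsigned room = sizeof(cache->objs) / sizeof(cache->objs[0])
				- cache->len;
		unsigned req = n + mp->cache_size;

		if (req > room)
			req = room;

		/* Take as many objects as available, up to req. */
		cache->len += rte_ring_mc_dequeue_burst(mp->ring,
				&cache->objs[cache->len], req);
	}

	if (cache->len < n)
		return -ENOENT; /* still not enough for this request */

	/* Serve the request from the cache. */
	for (index = 0, len = cache->len - 1; index < n;
			++index, len--, obj_table++)
		*obj_table = cache_objs[len];
	cache->len -= n;

	return 0;
}

Even with the cap, a request with n larger than the cache array simply
fails here, which is essentially the limitation pointed out above: the
fixed-size cache cannot be used as a staging area for arbitrarily large
requests.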