Date: Tue, 07 Jul 2015 18:17:05 +0100
From: Zoltan Kiss
To: "Ananyev, Konstantin" , "dev@dpdk.org"
Subject: Re: [dpdk-dev] [PATCH v2] mempool: improve cache search
Message-ID: <559C0991.601@linaro.org>
In-Reply-To: <2601191342CEEE43887BDE71AB97725836A21C32@irsmsx105.ger.corp.intel.com>

On 02/07/15 18:07, Ananyev, Konstantin wrote:
>
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zoltan Kiss
>> Sent: Wednesday, July 01, 2015 10:04 AM
>> To: dev@dpdk.org
>> Subject: [dpdk-dev] [PATCH v2] mempool: improve cache search
>>
>> The current way has a few problems:
>>
>> - if cache->len < n, we copy our elements into the cache first, then
>>   into obj_table, which is unnecessary
>> - if n >= cache_size (or the backfill fails) and we can't fulfil the
>>   request from the ring alone, we don't try to combine it with the cache
>> - if the refill fails, we don't return anything, even if the ring has
>>   enough for our request
>>
>> This patch rewrites it severely:
>> - in the first part of the function we only use the cache if it can
>>   serve the whole request (cache->len >= n)
>> - otherwise we take our elements straight from the ring
>> - if that fails but we have something in the cache, we try to combine them
>> - the refill happens at the end, and its failure doesn't modify our
>>   return value
>>
>> Signed-off-by: Zoltan Kiss
>> ---
>> v2:
>> - fix subject
>> - add unlikely for the branch where the request is fulfilled both from
>>   the cache and the ring
>>
>>  lib/librte_mempool/rte_mempool.h | 63 +++++++++++++++++++++++++---------------
>>  1 file changed, 39 insertions(+), 24 deletions(-)
>>
>> diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
>> index 6d4ce9a..1e96f03 100644
>> --- a/lib/librte_mempool/rte_mempool.h
>> +++ b/lib/librte_mempool/rte_mempool.h
>> @@ -947,34 +947,14 @@ __mempool_get_bulk(struct rte_mempool *mp, void **obj_table,
>>          unsigned lcore_id = rte_lcore_id();
>>          uint32_t cache_size = mp->cache_size;
>>
>> -        /* cache is not enabled or single consumer */
>> +        cache = &mp->local_cache[lcore_id];
>> +        /* cache is not enabled or single consumer or not enough */
>>          if (unlikely(cache_size == 0 || is_mc == 0 ||
>> -                     n >= cache_size || lcore_id >= RTE_MAX_LCORE))
>> +                     cache->len < n || lcore_id >= RTE_MAX_LCORE))
>>                  goto ring_dequeue;
>>
>> -        cache = &mp->local_cache[lcore_id];
>>          cache_objs = cache->objs;
>>
>> -        /* Can this be satisfied from the cache? */
>> -        if (cache->len < n) {
>> -                /* No. Backfill the cache first, and then fill from it */
>> -                uint32_t req = n + (cache_size - cache->len);
>> -
>> -                /* How many do we require i.e. number to fill the cache + the request */
>> -                ret = rte_ring_mc_dequeue_bulk(mp->ring, &cache->objs[cache->len], req);
>> -                if (unlikely(ret < 0)) {
>> -                        /*
>> -                         * In the offchance that we are buffer constrained,
>> -                         * where we are not able to allocate cache + n, go to
>> -                         * the ring directly. If that fails, we are truly out of
>> -                         * buffers.
>> -                         */
>> -                        goto ring_dequeue;
>> -                }
>> -
>> -                cache->len += req;
>> -        }
>> -
>>          /* Now fill in the response ... */
>>          for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
>>                  *obj_table = cache_objs[len];
>> @@ -983,7 +963,8 @@ __mempool_get_bulk(struct rte_mempool *mp, void **obj_table,
>>
>>          __MEMPOOL_STAT_ADD(mp, get_success, n);
>>
>> -        return 0;
>> +        ret = 0;
>> +        goto cache_refill;
>>
>>  ring_dequeue:
>>  #endif /* RTE_MEMPOOL_CACHE_MAX_SIZE > 0 */
>> @@ -994,11 +975,45 @@ ring_dequeue:
>>          else
>>                  ret = rte_ring_sc_dequeue_bulk(mp->ring, obj_table, n);
>>
>> +#if RTE_MEMPOOL_CACHE_MAX_SIZE > 0
>> +        if (unlikely(ret < 0 && is_mc == 1 && cache->len > 0)) {
>> +                uint32_t req = n - cache->len;
>> +
>> +                ret = rte_ring_mc_dequeue_bulk(mp->ring, obj_table, req);
>> +                if (ret == 0) {
>> +                        cache_objs = cache->objs;
>> +                        obj_table += req;
>> +                        for (index = 0; index < cache->len;
>> +                             ++index, ++obj_table)
>> +                                *obj_table = cache_objs[index];
>> +                        cache->len = 0;
>> +                }
>> +        }
>> +#endif /* RTE_MEMPOOL_CACHE_MAX_SIZE > 0 */
>> +
>>          if (ret < 0)
>>                  __MEMPOOL_STAT_ADD(mp, get_fail, n);
>>          else
>>                  __MEMPOOL_STAT_ADD(mp, get_success, n);
>>
>> +#if RTE_MEMPOOL_CACHE_MAX_SIZE > 0
>> +cache_refill:
>
> Ok, so if I get things right: if the lcore runs out of entries in the cache,
> then on the next __mempool_get_bulk() it has to do ring_dequeue() twice:
> 1. to satisfy the user request
> 2. to refill the cache.
> Right?

Yes.

> If that's so, then I think the current approach:
> ring_dequeue() once to refill the cache, then copy entries from the cache to the user
> is a cheaper (faster) one for many cases.

But then you can't return anything if the refill fails, even if there
would be enough in the ring (or ring + cache combined). Unless you retry
with just n.

__rte_ring_mc_do_dequeue is inlined, and as far as I can see the overhead
of calling it twice is:

- checking the number of entries in the ring, and the atomic cmpset of
  cons.head, again.
  This can loop if another dequeue preceded us while doing that
  subtraction, but as that's a very short interval, I think it's not
  very likely.

- an extra rte_compiler_barrier()

- waiting for the preceding dequeues to finish, and setting cons.tail to
  the new value. I think this can happen often when 'n' has a big
  variation, so the previous dequeue can easily be much bigger.

- the statistics update

I guess if there is no contention on the ring, the extra memcpy
outweighs these easily. And my gut feeling says that contention around
the two while loops should not be high, but I don't have hard facts.

Another argument for doing two dequeues is that we can do a burst
dequeue for the cache refill, which is better than only accepting the
full amount.

How about the following? If the cache can't satisfy the request, we
dequeue from the ring into the cache enough for n plus the cache refill,
but with rte_ring_mc_dequeue_burst. So it takes as many as it can, but
doesn't fail if it can't take the whole amount. Then we copy from the
cache to obj_table, if there is enough. It makes sure we utilize as much
as possible, with one ring dequeue. (There is a rough sketch of this at
the end of this mail.)

> Especially when same pool is shared between multiple threads.
> For example when thread is doing RX only (no TX).
>
>
>> +        /* If previous dequeue was OK and we have less than n, start refill */
>> +        if (ret == 0 && cache_size > 0 && cache->len < n) {
>> +                uint32_t req = cache_size - cache->len;
>
>
> It could be that n > cache_size.
> For that case, there is probably no point to refill the cache, as you took entries from the ring
> and the cache was intact.

Yes, it makes sense to add that.

>
> Konstantin
>
>> +
>> +                cache_objs = cache->objs;
>> +                ret = rte_ring_mc_dequeue_bulk(mp->ring,
>> +                                               &cache->objs[cache->len],
>> +                                               req);
>> +                if (likely(ret == 0))
>> +                        cache->len += req;
>> +                else
>> +                        /* Don't spoil the return value */
>> +                        ret = 0;
>> +        }
>> +#endif /* RTE_MEMPOOL_CACHE_MAX_SIZE > 0 */
>> +
>>          return ret;
>>  }
>>
>> --
>> 1.9.1
>
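
To make the burst dequeue idea above a bit more concrete, here is a
rough, untested sketch of how the cache path of __mempool_get_bulk()
could look. It assumes we keep the existing "n >= cache_size -> go to
the ring" entry check from the current code (so the burst always fits
into cache->objs), and it reuses the local variables already present in
that function (mp, obj_table, n, cache, cache_objs, cache_size, index,
len). Treat it as an illustration of the idea, not a finished patch:

#if RTE_MEMPOOL_CACHE_MAX_SIZE > 0
        /* We only get here when the cache is enabled, is_mc != 0 and
         * n < cache_size, exactly like in the current code.
         */
        if (cache->len < n) {
                /* Ask for enough to serve the request and to top the
                 * cache back up, but accept whatever the ring can give
                 * us right now.
                 */
                uint32_t req = n + (cache_size - cache->len);
                unsigned got;

                got = rte_ring_mc_dequeue_burst(mp->ring,
                                &cache->objs[cache->len], req);
                cache->len += got;

                /* Even combined with what we already had it's not
                 * enough: leave what we got in the cache and fall back
                 * to a plain dequeue of n from the ring.
                 */
                if (cache->len < n)
                        goto ring_dequeue;
        }

        /* Serve the request from the (possibly just refilled) cache */
        cache_objs = cache->objs;
        for (index = 0, len = cache->len - 1; index < n;
             ++index, len--, obj_table++)
                *obj_table = cache_objs[len];

        cache->len -= n;
        __MEMPOOL_STAT_ADD(mp, get_success, n);
        return 0;
#endif /* RTE_MEMPOOL_CACHE_MAX_SIZE > 0 */

That way the refill and the request are served by a single ring
operation, and a partially filled ring still gets used instead of
failing the whole burst.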