From: Andrew Rybchenko
Date: Wed, 17 Jan 2018 18:06:20 +0300
To: Olivier MATZ
CC: "Artem V. Andreev"
Subject: Re: [dpdk-dev] [RFC PATCH 3/6] mempool/bucket: implement bucket mempool manager

On 12/14/2017 04:38 PM, Olivier MATZ wrote:
> On Fri, Nov 24, 2017 at 04:06:28PM +0000, Andrew Rybchenko wrote:
>> From: "Artem V. Andreev"
>>
>> The manager provides a way to allocate physically and virtually
>> contiguous set of objects.
>>
>> Note: due to the way objects are organized in the bucket manager,
>> the get_avail_count may return less objects than were enqueued.
>> That breaks the expectation of mempool and mempool_perf tests.
> To me, this can be problematic. The driver should respect the
> API, or it will trigger hard-to-debug issues in applications. Can't
> this be fixed in some way or another?

As I understand it, there are no requirements on how fast get_count has to
work. If so, it is doable and we'll fix it in RFCv2.

> [...]
>
>> --- a/config/common_base
>> +++ b/config/common_base
>> @@ -608,6 +608,8 @@ CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=n
>>  #
>>  # Compile Mempool drivers
>>  #
>> +CONFIG_RTE_DRIVER_MEMPOOL_BUCKET=y
>> +CONFIG_RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB=32
>>  CONFIG_RTE_DRIVER_MEMPOOL_RING=y
>>  CONFIG_RTE_DRIVER_MEMPOOL_STACK=y
>>
> Why 32KB?
> Why not more, or less?
> Can it be a runtime parameter?
> I guess it won't work with too large objects.

We have no good understanding of how driver-specific parameters should be
passed on mempool creation, so we've simply kept it as a compile-time option
for now, since the runtime mechanism looks like a separate task. If you have
ideas, please share them, we'll be thankful.

> [...]
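
To illustrate just one possible direction (purely hypothetical, not part of
this RFC): rte_mempool_set_ops_byname() already takes a pool_config pointer
which the driver can read back from mp->pool_config in its alloc callback,
so a per-pool bucket size could in principle be passed at runtime along
these lines (the config struct and the "bucket" ops name below are made up
for the example):

#include <rte_mempool.h>
#include <rte_lcore.h>

/* Illustrative only: neither this struct nor the "bucket" ops name is
 * defined by the RFC
 */
struct bucket_mempool_config {
	unsigned int bucket_mem_kb;	/* 0 = use the build-time default */
};

static struct rte_mempool *
create_bucket_pool(void)
{
	static struct bucket_mempool_config cfg = { .bucket_mem_kb = 64 };
	struct rte_mempool *mp;

	/* 8192 objects of 2048 bytes, per-lcore cache of 256 objects */
	mp = rte_mempool_create_empty("test_pool", 8192, 2048, 256, 0,
				      rte_socket_id(), 0);
	if (mp == NULL)
		return NULL;

	/* The driver's alloc callback would read mp->pool_config */
	if (rte_mempool_set_ops_byname(mp, "bucket", &cfg) != 0 ||
	    rte_mempool_populate_default(mp) < 0) {
		rte_mempool_free(mp);
		return NULL;
	}

	return mp;
}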
>
>> +struct bucket_data {
>> +	unsigned int header_size;
>> +	unsigned int chunk_size;
>> +	unsigned int bucket_size;
>> +	uintptr_t bucket_page_mask;
>> +	struct rte_ring *shared_bucket_ring;
>> +	struct bucket_stack *buckets[RTE_MAX_LCORE];
>> +	/*
>> +	 * Multi-producer single-consumer ring to hold objects that are
>> +	 * returned to the mempool at a different lcore than initially
>> +	 * dequeued
>> +	 */
>> +	struct rte_ring *adoption_buffer_rings[RTE_MAX_LCORE];
>> +	struct rte_ring *shared_orphan_ring;
>> +	struct rte_mempool *pool;
>> +
>> +};
> I'm seeing per-core structures. Will it work on non-dataplane cores?
> For instance, if a control thread wants to allocate a mbuf?

Maybe I don't understand something. Does the control thread have a valid
rte_lcore_id()? (If it does not, the kind of guard we'd add is sketched
further down, after the quoted code.)

> If possible, these fields should be more documented (or just renamed).
> For instance, I suggest chunk_size could be called obj_per_bucket, which
> better describes the content of the field.

Thanks, we'll do that.

> [...]
>
>> +static int
>> +bucket_enqueue_single(struct bucket_data *data, void *obj)
>> +{
>> +	int rc = 0;
>> +	uintptr_t addr = (uintptr_t)obj;
>> +	struct bucket_header *hdr;
>> +	unsigned int lcore_id = rte_lcore_id();
>> +
>> +	addr &= data->bucket_page_mask;
>> +	hdr = (struct bucket_header *)addr;
>> +
>> +	if (likely(hdr->lcore_id == lcore_id)) {
>> +		if (hdr->fill_cnt < data->bucket_size - 1) {
>> +			hdr->fill_cnt++;
>> +		} else {
>> +			hdr->fill_cnt = 0;
>> +			/* Stack is big enough to put all buckets */
>> +			bucket_stack_push(data->buckets[lcore_id], hdr);
>> +		}
>> +	} else if (hdr->lcore_id != LCORE_ID_ANY) {
>> +		struct rte_ring *adopt_ring =
>> +			data->adoption_buffer_rings[hdr->lcore_id];
>> +
>> +		rc = rte_ring_enqueue(adopt_ring, obj);
>> +		/* Ring is big enough to put all objects */
>> +		RTE_ASSERT(rc == 0);
>> +	} else if (hdr->fill_cnt < data->bucket_size - 1) {
>> +		hdr->fill_cnt++;
>> +	} else {
>> +		hdr->fill_cnt = 0;
>> +		rc = rte_ring_enqueue(data->shared_bucket_ring, hdr);
>> +		/* Ring is big enough to put all buckets */
>> +		RTE_ASSERT(rc == 0);
>> +	}
>> +
>> +	return rc;
>> +}
> [...]
>
>> +static int
>> +bucket_dequeue_buckets(struct bucket_data *data, void **obj_table,
>> +		       unsigned int n_buckets)
>> +{
>> +	struct bucket_stack *cur_stack = data->buckets[rte_lcore_id()];
>> +	unsigned int n_buckets_from_stack = RTE_MIN(n_buckets, cur_stack->top);
>> +	void **obj_table_base = obj_table;
>> +
>> +	n_buckets -= n_buckets_from_stack;
>> +	while (n_buckets_from_stack-- > 0) {
>> +		void *obj = bucket_stack_pop_unsafe(cur_stack);
>> +
>> +		obj_table = bucket_fill_obj_table(data, &obj, obj_table,
>> +						  data->bucket_size);
>> +	}
>> +	while (n_buckets-- > 0) {
>> +		struct bucket_header *hdr;
>> +
>> +		if (unlikely(rte_ring_dequeue(data->shared_bucket_ring,
>> +					      (void **)&hdr) != 0)) {
>> +			/* Return the already-dequeued buffers
>> +			 * back to the mempool
>> +			 */
>> +			bucket_enqueue(data->pool, obj_table_base,
>> +				       obj_table - obj_table_base);
>> +			rte_errno = ENOBUFS;
>> +			return -rte_errno;
>> +		}
>> +		hdr->lcore_id = rte_lcore_id();
>> +		obj_table = bucket_fill_obj_table(data, (void **)&hdr,
>> +						  obj_table,
>> +						  data->bucket_size);
>> +	}
>> +
>> +	return 0;
>> +}
> [...]
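
Coming back to the non-dataplane thread question above: if rte_lcore_id()
does return LCORE_ID_ANY for control threads, the kind of guard we'd add is
roughly the following (rough, untested sketch; the wrapper name is made up
and the final handling may differ):

static int
bucket_enqueue_one(struct bucket_data *data, void *obj)
{
	/* Non-EAL (control) threads have no per-core bucket stack */
	if (unlikely(rte_lcore_id() == LCORE_ID_ANY)) {
		/* Park the object in the shared orphan ring so that it
		 * stays visible to all lcores
		 */
		return rte_ring_enqueue(data->shared_orphan_ring, obj);
	}

	return bucket_enqueue_single(data, obj);
}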
>
>> +static int
>> +bucket_dequeue(struct rte_mempool *mp, void **obj_table, unsigned int n)
>> +{
>> +	struct bucket_data *data = mp->pool_data;
>> +	unsigned int n_buckets = n / data->bucket_size;
>> +	unsigned int n_orphans = n - n_buckets * data->bucket_size;
>> +	int rc = 0;
>> +
>> +	bucket_adopt_orphans(data);
>> +
>> +	if (unlikely(n_orphans > 0)) {
>> +		rc = bucket_dequeue_orphans(data, obj_table +
>> +					    (n_buckets * data->bucket_size),
>> +					    n_orphans);
>> +		if (rc != 0)
>> +			return rc;
>> +	}
>> +
>> +	if (likely(n_buckets > 0)) {
>> +		rc = bucket_dequeue_buckets(data, obj_table, n_buckets);
>> +		if (unlikely(rc != 0) && n_orphans > 0) {
>> +			rte_ring_enqueue_bulk(data->shared_orphan_ring,
>> +					      obj_table + (n_buckets *
>> +					      data->bucket_size),
>> +					      n_orphans, NULL);
>> +		}
>> +	}
>> +
>> +	return rc;
>> +}
> If my understanding is correct, at initialization, all full buckets will
> go to the data->shared_bucket_ring ring, with lcore_id == ANY (this is
> done in register_mem).
>
> (note: I feel 'data' is not an ideal name for bucket_data)

Yes, agreed. We'll rename it; it is really too generic.

> If the core 0 allocates all the mbufs, and then frees them all, they
> will be stored in the per-core stack, with hdr->lcore_id == 0. Is it
> right?

Right.

> If yes, can core 1 allocate a mbuf after that?

We'll add a threshold for the per-core stack. If it is exceeded, buckets
will be flushed into the shared ring.

>> +static unsigned int
>> +bucket_get_count(const struct rte_mempool *mp)
>> +{
>> +	const struct bucket_data *data = mp->pool_data;
>> +	const struct bucket_stack *local_bucket_stack =
>> +		data->buckets[rte_lcore_id()];
>> +
>> +	return data->bucket_size * local_bucket_stack->top +
>> +	       data->bucket_size * rte_ring_count(data->shared_bucket_ring) +
>> +	       rte_ring_count(data->shared_orphan_ring);
>> +}
> It looks like get_count only relies on the current core stack usage
> and ignores the other core stacks.

We'll fix it to provide a more accurate return value, which is required to
pass the self-test and to make it usable for debugging. (A rough sketch of
what we have in mind is at the end of this mail.)

> [...]
>
>> +static int
>> +bucket_register_memory_area(__rte_unused const struct rte_mempool *mp,
>> +			    char *vaddr, __rte_unused phys_addr_t paddr,
>> +			    size_t len)
>> +{
>> +	/* mp->pool_data may be still uninitialized at this point */
>> +	unsigned int chunk_size = mp->header_size + mp->elt_size +
>> +		mp->trailer_size;
>> +	unsigned int bucket_mem_size =
>> +		(BUCKET_MEM_SIZE / chunk_size) * chunk_size;
>> +	unsigned int bucket_page_sz = rte_align32pow2(bucket_mem_size);
>> +	uintptr_t align;
>> +	char *iter;
>> +
>> +	align = RTE_PTR_ALIGN_CEIL(vaddr, bucket_page_sz) - vaddr;
>> +
>> +	for (iter = vaddr + align; iter < vaddr + len; iter += bucket_page_sz) {
>> +		/* librte_mempool uses the header part for its own bookkeeping,
>> +		 * but the librte_mempool's object header is adjacent to the
>> +		 * data; it is small enough and the header is guaranteed to be
>> +		 * at least CACHE_LINE_SIZE (i.e. 64) bytes, so we do have
>> +		 * plenty of space at the start of the header. So the layout
>> +		 * looks like this:
>> +		 * [bucket_header] ... unused ... [rte_mempool_objhdr] [data...]
>> +		 */
> This is not always true.
> If a user creates a mempool with the NO_CACHE_ALIGN flag, the header will
> be small, without padding.

Thanks. I think it can be handled when the bucket mempool implements its
own callback to populate objects.
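
For reference, the direction we have in mind for a more accurate get_count
is roughly the following (sketch only, against the RFC's current field
names; it still ignores objects sitting in partially filled per-lcore
buckets, and it reads the remote stack tops without synchronization, which
should be good enough for debugging and the self-test):

static unsigned int
bucket_get_count(const struct rte_mempool *mp)
{
	const struct bucket_data *data = mp->pool_data;
	unsigned int count;
	unsigned int i;

	/* Full buckets and orphan objects shared between all lcores */
	count = data->bucket_size * rte_ring_count(data->shared_bucket_ring) +
		rte_ring_count(data->shared_orphan_ring);

	for (i = 0; i < RTE_MAX_LCORE; i++) {
		if (!rte_lcore_is_enabled(i))
			continue;
		/* Full buckets cached on the lcore plus objects queued
		 * for adoption by that lcore
		 */
		count += data->bucket_size * data->buckets[i]->top +
			 rte_ring_count(data->adoption_buffer_rings[i]);
	}

	return count;
}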