From: Jerin Jacob
Date: Fri, 1 Oct 2021 18:06:55 +0530
To: Dharmik Thakkar
Cc: Olivier Matz, Andrew Rybchenko, dpdk-dev, nd, Honnappa Nagarahalli,
 "Ruifeng Wang (Arm Technology China)"
Subject: Re: [dpdk-dev] [RFC] mempool: implement index-based per core cache

On Thu, Sep 30, 2021 at 10:57 PM Dharmik Thakkar wrote:
>
> The current mempool per-core cache implementation is pointer based.
> On most architectures, each pointer consumes 64 bits.
> Replace it with an index-based implementation, wherein each buffer
> is addressed by (pool address + index).
> This reduces the memory requirement.
>
> L3Fwd performance testing reveals minor improvements in cache
> performance and no change in throughput.
>
> Micro-benchmarking the patch using mempool_perf_test shows
> significant improvement in the majority of the test cases.
>
> The future plan involves replacing the global pool's pointer-based
> implementation with an index-based one.
>
> Signed-off-by: Dharmik Thakkar

Sane idea. Like VPP, we tried to do this for rte_graph but did not
observe much gain. Since the lcore cache is typically 512 entries,
maybe there is a gain on the mempool path. Also, since you are
enabling this only for the local cache, it is good that mempool
drivers can work as-is (i.e. HW drivers keep working with 64-bit
pointers). I think getting more performance numbers for various
cases may be the next step.
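Just to make sure I am reading the conversion correctly, here is a
minimal sketch of the idea as I understand it (plain C, not the patch
code; the idx_cache_* names and IDX_CACHE_SIZE are made up for
illustration): cache slots hold 32-bit byte offsets from a per-pool
base address, and the pointer is rebuilt on get.

	#include <stdint.h>

	#define IDX_CACHE_SIZE 512	/* typical lcore cache size */

	struct idx_cache {
		void *base;	/* pool base address, e.g. &r[1] in the patch */
		uint32_t len;	/* number of cached objects */
		uint32_t objs[IDX_CACHE_SIZE];	/* 32-bit offsets, not pointers */
	};

	/* put: store the object as a byte offset from base (must fit in 32 bits) */
	static inline void
	idx_cache_put(struct idx_cache *c, void *obj)
	{
		c->objs[c->len++] = (uint32_t)((uintptr_t)obj - (uintptr_t)c->base);
	}

	/* get: rebuild the pointer by adding the stored offset back to base */
	static inline void *
	idx_cache_get(struct idx_cache *c)
	{
		return (void *)((uintptr_t)c->base + c->objs[--c->len]);
	}

If that reading is right, the per-lcore objs[] array
(RTE_MEMPOOL_CACHE_MAX_SIZE * 3 entries) shrinks from roughly 12 KiB
to 6 KiB, with the trade-off that every object must live within 4 GiB
of the base address.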
> ---
>  drivers/mempool/ring/rte_mempool_ring.c |  2 +-
>  lib/mempool/rte_mempool.c               |  8 +++
>  lib/mempool/rte_mempool.h               | 74 ++++++++++++++++++++++---
>  3 files changed, 74 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/mempool/ring/rte_mempool_ring.c b/drivers/mempool/ring/rte_mempool_ring.c
> index b1f09ff28f4d..e55913e47f21 100644
> --- a/drivers/mempool/ring/rte_mempool_ring.c
> +++ b/drivers/mempool/ring/rte_mempool_ring.c
> @@ -101,7 +101,7 @@ ring_alloc(struct rte_mempool *mp, uint32_t rg_flags)
> 		return -rte_errno;
>
> 	mp->pool_data = r;
> -
> +	mp->local_cache_base_addr = &r[1];
> 	return 0;
> }
>
> diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
> index 59a588425bd6..424bdb19c323 100644
> --- a/lib/mempool/rte_mempool.c
> +++ b/lib/mempool/rte_mempool.c
> @@ -480,6 +480,7 @@ rte_mempool_populate_default(struct rte_mempool *mp)
> 	int ret;
> 	bool need_iova_contig_obj;
> 	size_t max_alloc_size = SIZE_MAX;
> +	unsigned lcore_id;
>
> 	ret = mempool_ops_alloc_once(mp);
> 	if (ret != 0)
> @@ -600,6 +601,13 @@ rte_mempool_populate_default(struct rte_mempool *mp)
> 		}
> 	}
>
> +	/* Init all default caches. */
> +	if (mp->cache_size != 0) {
> +		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++)
> +			mp->local_cache[lcore_id].local_cache_base_value =
> +				*(void **)mp->local_cache_base_addr;
> +	}
> +
> 	rte_mempool_trace_populate_default(mp);
> 	return mp->size;
>
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 4235d6f0bf2b..545405c0d3ce 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -51,6 +51,8 @@
>  #include
>  #include
>
> +#include
> +
>  #include "rte_mempool_trace_fp.h"
>
>  #ifdef __cplusplus
> @@ -91,11 +93,12 @@ struct rte_mempool_cache {
> 	uint32_t size; /**< Size of the cache */
> 	uint32_t flushthresh; /**< Threshold before we flush excess elements */
> 	uint32_t len; /**< Current cache count */
> +	void *local_cache_base_value; /**< Base value to calculate indices */
> 	/*
> 	 * Cache is allocated to this size to allow it to overflow in certain
> 	 * cases to avoid needless emptying of cache.
> 	 */
> -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
> +	uint32_t objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
> } __rte_cache_aligned;
>
> /**
> @@ -172,7 +175,6 @@ struct rte_mempool_objtlr {
>  * A list of memory where objects are stored
>  */
> STAILQ_HEAD(rte_mempool_memhdr_list, rte_mempool_memhdr);
> -
> /**
>  * Callback used to free a memory chunk
>  */
> @@ -244,6 +246,7 @@ struct rte_mempool {
> 	int32_t ops_index;
>
> 	struct rte_mempool_cache *local_cache; /**< Per-lcore local cache */
> +	void *local_cache_base_addr; /**< Reference to the base value */
>
> 	uint32_t populated_size;         /**< Number of populated objects. */
> 	struct rte_mempool_objhdr_list elt_list; /**< List of objects in pool */
> @@ -1269,7 +1272,15 @@ rte_mempool_cache_flush(struct rte_mempool_cache *cache,
> 	if (cache == NULL || cache->len == 0)
> 		return;
> 	rte_mempool_trace_cache_flush(cache, mp);
> -	rte_mempool_ops_enqueue_bulk(mp, cache->objs, cache->len);
> +
> +	unsigned int i;
> +	unsigned int cache_len = cache->len;
> +	void *obj_table[RTE_MEMPOOL_CACHE_MAX_SIZE * 3];
> +	void *base_value = cache->local_cache_base_value;
> +	uint32_t *cache_objs = cache->objs;
> +	for (i = 0; i < cache_len; i++)
> +		obj_table[i] = (void *) RTE_PTR_ADD(base_value, cache_objs[i]);
> +	rte_mempool_ops_enqueue_bulk(mp, obj_table, cache->len);
> 	cache->len = 0;
> }
>
> @@ -1289,7 +1300,9 @@ static __rte_always_inline void
> __mempool_generic_put(struct rte_mempool *mp, void * const *obj_table,
> 		      unsigned int n, struct rte_mempool_cache *cache)
> {
> -	void **cache_objs;
> +	uint32_t *cache_objs;
> +	void *base_value;
> +	uint32_t i;
>
> 	/* increment stat now, adding in mempool always success */
> 	__MEMPOOL_STAT_ADD(mp, put_bulk, 1);
> @@ -1301,6 +1314,12 @@ __mempool_generic_put(struct rte_mempool *mp, void * const *obj_table,
>
> 	cache_objs = &cache->objs[cache->len];
>
> +	base_value = cache->local_cache_base_value;
> +
> +	uint64x2_t v_obj_table;
> +	uint64x2_t v_base_value = vdupq_n_u64((uint64_t)base_value);
> +	uint32x2_t v_cache_objs;
> +
> 	/*
> 	 * The cache follows the following algorithm
> 	 *   1. Add the objects to the cache
> 	 *   2. Anything greater than the cache min value (if it crosses the
> 	 *   cache flush threshold) is flushed to the ring.
> 	 */
>
> 	/* Add elements back into the cache */
> -	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
> +
> +#if defined __ARM_NEON
> +	for (i = 0; i < (n & ~0x1); i+=2) {
> +		v_obj_table = vld1q_u64((const uint64_t *)&obj_table[i]);
> +		v_cache_objs = vqmovn_u64(vsubq_u64(v_obj_table, v_base_value));
> +		vst1_u32(cache_objs + i, v_cache_objs);
> +	}
> +	if (n & 0x1) {
> +		cache_objs[i] = (uint32_t) RTE_PTR_DIFF(obj_table[i], base_value);
> +	}
> +#else
> +	for (i = 0; i < n; i++) {
> +		cache_objs[i] = (uint32_t) RTE_PTR_DIFF(obj_table[i], base_value);
> +	}
> +#endif
>
> 	cache->len += n;
>
> 	if (cache->len >= cache->flushthresh) {
> -		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
> +		rte_mempool_ops_enqueue_bulk(mp, obj_table + cache->len - cache->size,
> 				cache->len - cache->size);
> 		cache->len = cache->size;
> 	}
> @@ -1415,23 +1448,26 @@ __mempool_generic_get(struct rte_mempool *mp, void **obj_table,
> 		      unsigned int n, struct rte_mempool_cache *cache)
> {
> 	int ret;
> +	uint32_t i;
> 	uint32_t index, len;
> -	void **cache_objs;
> +	uint32_t *cache_objs;
>
> 	/* No cache provided or cannot be satisfied from cache */
> 	if (unlikely(cache == NULL || n >= cache->size))
> 		goto ring_dequeue;
>
> +	void *base_value = cache->local_cache_base_value;
> 	cache_objs = cache->objs;
>
> 	/* Can this be satisfied from the cache? */
> 	if (cache->len < n) {
> 		/* No. Backfill the cache first, and then fill from it */
> 		uint32_t req = n + (cache->size - cache->len);
> +		void *temp_objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
>
> 		/* How many do we require i.e. number to fill the cache + the request */
> 		ret = rte_mempool_ops_dequeue_bulk(mp,
> -			&cache->objs[cache->len], req);
> +			temp_objs, req);
> 		if (unlikely(ret < 0)) {
> 			/*
> 			 * In the off chance that we are buffer constrained,
> 			 * where we are not able to allocate cache + n, go to
> 			 * the ring directly. If that fails, we are truly out of
> 			 * buffers.
> 			 */
> 			goto ring_dequeue;
> 		}
>
> +		len = cache->len;
> +		for (i = 0; i < req; ++i, ++len) {
> +			cache_objs[len] = (uint32_t) RTE_PTR_DIFF(temp_objs[i], base_value);
> +		}
> +
> 		cache->len += req;
> 	}
>
> +	uint64x2_t v_obj_table;
> +	uint64x2_t v_cache_objs;
> +	uint64x2_t v_base_value = vdupq_n_u64((uint64_t)base_value);
> +
> 	/* Now fill in the response ... */
> +#if defined __ARM_NEON
> +	for (index = 0, len = cache->len - 1; index < (n & ~0x1); index+=2,
> +						len-=2, obj_table+=2) {
> +		v_cache_objs = vmovl_u32(vld1_u32(cache_objs + len - 1));
> +		v_obj_table = vaddq_u64(v_cache_objs, v_base_value);
> +		vst1q_u64((uint64_t *)obj_table, v_obj_table);
> +	}
> +	if (n & 0x1)
> +		*obj_table = (void *) RTE_PTR_ADD(base_value, cache_objs[len]);
> +#else
> 	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
> -		*obj_table = cache_objs[len];
> +		*obj_table = (void *) RTE_PTR_ADD(base_value, cache_objs[len]);
> +#endif
>
> 	cache->len -= n;
>
> --
> 2.17.1
>