From: Jerin Jacob
Date: Fri, 1 Oct 2021 18:06:55 +0530
To: Dharmik Thakkar
Cc: Olivier Matz, Andrew Rybchenko, dpdk-dev, nd, Honnappa Nagarahalli,
 "Ruifeng Wang (Arm Technology China)"
Subject: Re: [dpdk-dev] [RFC] mempool: implement index-based per core cache

On Thu, Sep 30, 2021 at 10:57 PM Dharmik Thakkar wrote:
>
> The current mempool per-core cache implementation is pointer based.
> On most architectures, each pointer consumes 64 bits.
> Replace it with an index-based implementation, wherein each buffer
> is addressed by (pool address + index).
> This reduces the memory requirement.
>
> L3Fwd performance testing reveals minor improvements in cache
> performance and no change in throughput.
>
> Micro-benchmarking the patch using mempool_perf_test shows
> significant improvement in the majority of the test cases.
>
> The future plan involves replacing the global pool's pointer-based
> implementation with an index-based one.
>
> Signed-off-by: Dharmik Thakkar

Sane idea. Like VPP, we tried to do this for rte_graph but did not
observe much gain. Since the lcore cache is typically 512 entries,
maybe there is a gain on the mempool path. Also, since you are
enabling this only for the local cache, it is good that mempool
drivers can work as-is (i.e. HW drivers keep working with 64-bit
pointers). I think getting more performance numbers for various
cases may be the next step.
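Just to make sure I am reading the conversion correctly, here is a
minimal sketch of the idea as I understand it (plain C, not the patch
code; the idx_cache_* names and IDX_CACHE_SIZE are made up for
illustration): cache slots hold 32-bit byte offsets from a per-pool
base address, and the pointer is rebuilt on get.

	#include <stdint.h>

	#define IDX_CACHE_SIZE 512	/* typical lcore cache size */

	struct idx_cache {
		void *base;	/* pool base address, e.g. &r[1] in the patch */
		uint32_t len;	/* number of cached objects */
		uint32_t objs[IDX_CACHE_SIZE];	/* 32-bit offsets, not pointers */
	};

	/* put: store the object as a byte offset from base (must fit in 32 bits) */
	static inline void
	idx_cache_put(struct idx_cache *c, void *obj)
	{
		c->objs[c->len++] = (uint32_t)((uintptr_t)obj - (uintptr_t)c->base);
	}

	/* get: rebuild the pointer by adding the stored offset back to base */
	static inline void *
	idx_cache_get(struct idx_cache *c)
	{
		return (void *)((uintptr_t)c->base + c->objs[--c->len]);
	}

If that reading is right, the per-lcore objs[] array
(RTE_MEMPOOL_CACHE_MAX_SIZE * 3 entries) shrinks from roughly 12 KiB
to 6 KiB, with the trade-off that every object must live within 4 GiB
of the base address.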
> ---
>  drivers/mempool/ring/rte_mempool_ring.c |  2 +-
>  lib/mempool/rte_mempool.c               |  8 +++
>  lib/mempool/rte_mempool.h               | 74 ++++++++++++++++++++++---
>  3 files changed, 74 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/mempool/ring/rte_mempool_ring.c b/drivers/mempool/ring/rte_mempool_ring.c
> index b1f09ff28f4d..e55913e47f21 100644
> --- a/drivers/mempool/ring/rte_mempool_ring.c
> +++ b/drivers/mempool/ring/rte_mempool_ring.c
> @@ -101,7 +101,7 @@ ring_alloc(struct rte_mempool *mp, uint32_t rg_flags)
> 		return -rte_errno;
>
> 	mp->pool_data = r;
> -
> +	mp->local_cache_base_addr = &r[1];
> 	return 0;
> }
>
> diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
> index 59a588425bd6..424bdb19c323 100644
> --- a/lib/mempool/rte_mempool.c
> +++ b/lib/mempool/rte_mempool.c
> @@ -480,6 +480,7 @@ rte_mempool_populate_default(struct rte_mempool *mp)
> 	int ret;
> 	bool need_iova_contig_obj;
> 	size_t max_alloc_size = SIZE_MAX;
> +	unsigned lcore_id;
>
> 	ret = mempool_ops_alloc_once(mp);
> 	if (ret != 0)
> @@ -600,6 +601,13 @@ rte_mempool_populate_default(struct rte_mempool *mp)
> 		}
> 	}
>
> +	/* Init all default caches. */
> +	if (mp->cache_size != 0) {
> +		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++)
> +			mp->local_cache[lcore_id].local_cache_base_value =
> +				*(void **)mp->local_cache_base_addr;
> +	}
> +
> 	rte_mempool_trace_populate_default(mp);
> 	return mp->size;
>
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 4235d6f0bf2b..545405c0d3ce 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -51,6 +51,8 @@
>  #include
>  #include
>
> +#include
> +
>  #include "rte_mempool_trace_fp.h"
>
>  #ifdef __cplusplus
> @@ -91,11 +93,12 @@ struct rte_mempool_cache {
> 	uint32_t size; /**< Size of the cache */
> 	uint32_t flushthresh; /**< Threshold before we flush excess elements */
> 	uint32_t len; /**< Current cache count */
> +	void *local_cache_base_value; /**< Base value to calculate indices */
> 	/*
> 	 * Cache is allocated to this size to allow it to overflow in certain
> 	 * cases to avoid needless emptying of cache.
> 	 */
> -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
> +	uint32_t objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
> } __rte_cache_aligned;
>
> /**
> @@ -172,7 +175,6 @@ struct rte_mempool_objtlr {
>  * A list of memory where objects are stored
>  */
> STAILQ_HEAD(rte_mempool_memhdr_list, rte_mempool_memhdr);
> -
> /**
>  * Callback used to free a memory chunk
>  */
> @@ -244,6 +246,7 @@ struct rte_mempool {
> 	int32_t ops_index;
>
> 	struct rte_mempool_cache *local_cache; /**< Per-lcore local cache */
> +	void *local_cache_base_addr; /**< Reference to the base value */
>
> 	uint32_t populated_size;         /**< Number of populated objects. */
> 	struct rte_mempool_objhdr_list elt_list; /**< List of objects in pool */
> @@ -1269,7 +1272,15 @@ rte_mempool_cache_flush(struct rte_mempool_cache *cache,
> 	if (cache == NULL || cache->len == 0)
> 		return;
> 	rte_mempool_trace_cache_flush(cache, mp);
> -	rte_mempool_ops_enqueue_bulk(mp, cache->objs, cache->len);
> +
> +	unsigned int i;
> +	unsigned int cache_len = cache->len;
> +	void *obj_table[RTE_MEMPOOL_CACHE_MAX_SIZE * 3];
> +	void *base_value = cache->local_cache_base_value;
> +	uint32_t *cache_objs = cache->objs;
> +	for (i = 0; i < cache_len; i++)
> +		obj_table[i] = (void *) RTE_PTR_ADD(base_value, cache_objs[i]);
> +	rte_mempool_ops_enqueue_bulk(mp, obj_table, cache->len);
> 	cache->len = 0;
> }
>
> @@ -1289,7 +1300,9 @@ static __rte_always_inline void
> __mempool_generic_put(struct rte_mempool *mp, void * const *obj_table,
> 		      unsigned int n, struct rte_mempool_cache *cache)
> {
> -	void **cache_objs;
> +	uint32_t *cache_objs;
> +	void *base_value;
> +	uint32_t i;
>
> 	/* increment stat now, adding in mempool always success */
> 	__MEMPOOL_STAT_ADD(mp, put_bulk, 1);
> @@ -1301,6 +1314,12 @@ __mempool_generic_put(struct rte_mempool *mp, void * const *obj_table,
>
> 	cache_objs = &cache->objs[cache->len];
>
> +	base_value = cache->local_cache_base_value;
> +
> +	uint64x2_t v_obj_table;
> +	uint64x2_t v_base_value = vdupq_n_u64((uint64_t)base_value);
> +	uint32x2_t v_cache_objs;
> +
> 	/*
> 	 * The cache follows the following algorithm
> 	 *   1. Add the objects to the cache
> 	 *   2. Anything greater than the cache min value (if it crosses the
> 	 *   cache flush threshold) is flushed to the ring.
> 	 */
>
> 	/* Add elements back into the cache */
> -	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
> +
> +#if defined __ARM_NEON
> +	for (i = 0; i < (n & ~0x1); i+=2) {
> +		v_obj_table = vld1q_u64((const uint64_t *)&obj_table[i]);
> +		v_cache_objs = vqmovn_u64(vsubq_u64(v_obj_table, v_base_value));
> +		vst1_u32(cache_objs + i, v_cache_objs);
> +	}
> +	if (n & 0x1) {
> +		cache_objs[i] = (uint32_t) RTE_PTR_DIFF(obj_table[i], base_value);
> +	}
> +#else
> +	for (i = 0; i < n; i++) {
> +		cache_objs[i] = (uint32_t) RTE_PTR_DIFF(obj_table[i], base_value);
> +	}
> +#endif
>
> 	cache->len += n;
>
> 	if (cache->len >= cache->flushthresh) {
> -		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
> +		rte_mempool_ops_enqueue_bulk(mp, obj_table + cache->len - cache->size,
> 				cache->len - cache->size);
> 		cache->len = cache->size;
> 	}
> @@ -1415,23 +1448,26 @@ __mempool_generic_get(struct rte_mempool *mp, void **obj_table,
> 		      unsigned int n, struct rte_mempool_cache *cache)
> {
> 	int ret;
> +	uint32_t i;
> 	uint32_t index, len;
> -	void **cache_objs;
> +	uint32_t *cache_objs;
>
> 	/* No cache provided or cannot be satisfied from cache */
> 	if (unlikely(cache == NULL || n >= cache->size))
> 		goto ring_dequeue;
>
> +	void *base_value = cache->local_cache_base_value;
> 	cache_objs = cache->objs;
>
> 	/* Can this be satisfied from the cache? */
> 	if (cache->len < n) {
> 		/* No. Backfill the cache first, and then fill from it */
> 		uint32_t req = n + (cache->size - cache->len);
> +		void *temp_objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
>
> 		/* How many do we require i.e. number to fill the cache + the request */
> 		ret = rte_mempool_ops_dequeue_bulk(mp,
> -			&cache->objs[cache->len], req);
> +			temp_objs, req);
> 		if (unlikely(ret < 0)) {
> 			/*
> 			 * In the off chance that we are buffer constrained,
> 			 * where we are not able to allocate cache + n, go to
> 			 * the ring directly. If that fails, we are truly out of
> 			 * buffers.
> 			 */
> 			goto ring_dequeue;
> 		}
>
> +		len = cache->len;
> +		for (i = 0; i < req; ++i, ++len) {
> +			cache_objs[len] = (uint32_t) RTE_PTR_DIFF(temp_objs[i], base_value);
> +		}
> +
> 		cache->len += req;
> 	}
>
> +	uint64x2_t v_obj_table;
> +	uint64x2_t v_cache_objs;
> +	uint64x2_t v_base_value = vdupq_n_u64((uint64_t)base_value);
> +
> 	/* Now fill in the response ... */
> +#if defined __ARM_NEON
> +	for (index = 0, len = cache->len - 1; index < (n & ~0x1); index+=2,
> +						len-=2, obj_table+=2) {
> +		v_cache_objs = vmovl_u32(vld1_u32(cache_objs + len - 1));
> +		v_obj_table = vaddq_u64(v_cache_objs, v_base_value);
> +		vst1q_u64((uint64_t *)obj_table, v_obj_table);
> +	}
> +	if (n & 0x1)
> +		*obj_table = (void *) RTE_PTR_ADD(base_value, cache_objs[len]);
> +#else
> 	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
> -		*obj_table = cache_objs[len];
> +		*obj_table = (void *) RTE_PTR_ADD(base_value, cache_objs[len]);
> +#endif
>
> 	cache->len -= n;
>
> --
> 2.17.1
>