Date: Mon, 24 Jan 2022 11:26:00 +0100
From: Olivier Matz
To: Morten Brørup
Cc: andrew.rybchenko@oktetlabs.ru, thomas@monjalon.net,
	bruce.richardson@intel.com, jerinjacobk@gmail.com, dev@dpdk.org
Subject: Re: [PATCH] mempool: test performance with constant n
References: <20220119113732.40167-1-mb@smartsharesystems.com>
In-Reply-To: <20220119113732.40167-1-mb@smartsharesystems.com>

Hi Morten,

Thank you for enhancing the mempool test. Please see some comments
below.

On Wed, Jan 19, 2022 at 12:37:32PM +0100, Morten Brørup wrote:
> "What gets measured gets done."
>
> This patch adds mempool performance tests where the number of objects to
> put and get is constant at compile time, which may significantly improve
> the performance of these functions. [*]
>
> Also, it is ensured that the array holding the objects used for testing
> is cache line aligned, for maximum performance.
>
> And finally, the following entries are added to the list of tests:
> - Number of kept objects: 512
> - Number of objects to get and to put: The number of pointers fitting
>   into a cache line, i.e. 8 or 16
>
> [*] Some example performance test (with cache) results:
>
> get_bulk=4 put_bulk=4 keep=128 constant_n=false rate_persec=280480972
> get_bulk=4 put_bulk=4 keep=128 constant_n=true rate_persec=622159462
>
> get_bulk=8 put_bulk=8 keep=128 constant_n=false rate_persec=477967155
> get_bulk=8 put_bulk=8 keep=128 constant_n=true rate_persec=917582643
>
> get_bulk=32 put_bulk=32 keep=32 constant_n=false rate_persec=871248691
> get_bulk=32 put_bulk=32 keep=32 constant_n=true rate_persec=1134021836
>
> Signed-off-by: Morten Brørup
> ---
>  app/test/test_mempool_perf.c | 120 +++++++++++++++++++++--------------
>  1 file changed, 74 insertions(+), 46 deletions(-)
>
> diff --git a/app/test/test_mempool_perf.c b/app/test/test_mempool_perf.c
> index 87ad251367..ffefe934d5 100644
> --- a/app/test/test_mempool_perf.c
> +++ b/app/test/test_mempool_perf.c
> @@ -1,5 +1,6 @@
>  /* SPDX-License-Identifier: BSD-3-Clause
>   * Copyright(c) 2010-2014 Intel Corporation
> + * Copyright(c) 2022 SmartShare Systems
>   */
>
>  #include
> @@ -55,19 +56,24 @@
>   *
>   *    - Bulk get from 1 to 32
>   *    - Bulk put from 1 to 32
> + *    - Bulk get and put from 1 to 32, compile time constant
>   *
>   *  - Number of kept objects (*n_keep*)
>   *
>   *    - 32
>   *    - 128
> + *    - 512
>   */
>
>  #define N 65536
>  #define TIME_S 5
>  #define MEMPOOL_ELT_SIZE 2048
> -#define MAX_KEEP 128
> +#define MAX_KEEP 512
>  #define MEMPOOL_SIZE ((rte_lcore_count()*(MAX_KEEP+RTE_MEMPOOL_CACHE_MAX_SIZE))-1)
>
> +/* Number of pointers fitting into one cache line. */
> +#define CACHE_LINE_BURST (RTE_CACHE_LINE_SIZE/sizeof(uintptr_t))
> +

nit: I think it's better to follow the coding rules and add a space
around the '/', even if I can see the line right above does not follow
this convention

>  #define LOG_ERR() printf("test failed at %s():%d\n", __func__, __LINE__)
>  #define RET_ERR() do { \
>  		LOG_ERR(); \
> @@ -80,16 +86,16 @@
>  	} while (0)
>
>  static int use_external_cache;
> -static unsigned external_cache_size = RTE_MEMPOOL_CACHE_MAX_SIZE;
> +static unsigned int external_cache_size = RTE_MEMPOOL_CACHE_MAX_SIZE;
>
>  static uint32_t synchro;
>
>  /* number of objects in one bulk operation (get or put) */
> -static unsigned n_get_bulk;
> -static unsigned n_put_bulk;
> +static int n_get_bulk;
> +static int n_put_bulk;
>
>  /* number of objects retrieved from mempool before putting them back */
> -static unsigned n_keep;
> +static int n_keep;
>
>  /* number of enqueues / dequeues */
>  struct mempool_test_stats {
> @@ -104,20 +110,43 @@ static struct mempool_test_stats stats[RTE_MAX_LCORE];
>   */
>  static void
>  my_obj_init(struct rte_mempool *mp, __rte_unused void *arg,
> -	    void *obj, unsigned i)
> +	    void *obj, unsigned int i)
>  {
>  	uint32_t *objnum = obj;
>  	memset(obj, 0, mp->elt_size);
>  	*objnum = i;
>  }
>
> +#define test_loop(x_keep, x_get_bulk, x_put_bulk) \
> +	for (i = 0; likely(i < (N/x_keep)); i++) {		\
> +		/* get x_keep objects by bulk of x_get_bulk */	\
> +		for (idx = 0; idx < x_keep; idx += x_get_bulk) {\
> +			ret = rte_mempool_generic_get(mp,	\
> +					&obj_table[idx],	\
> +					x_get_bulk,		\
> +					cache);			\
> +			if (unlikely(ret < 0)) {		\
> +				rte_mempool_dump(stdout, mp);	\
> +				GOTO_ERR(ret, out);		\
> +			}					\
> +		}						\
> +								\
> +		/* put the objects back by bulk of x_put_bulk */\
> +		for (idx = 0; idx < x_keep; idx += x_put_bulk) {\
> +			rte_mempool_generic_put(mp,		\
> +					&obj_table[idx],	\
> +					x_put_bulk,		\
> +					cache);			\
> +		}						\
> +	}
> +

I think a static __rte_always_inline function would do the job and is
clearer.
Something like this:

static __rte_always_inline int
test_loop(struct rte_mempool *mp, struct rte_mempool_cache *cache,
	  unsigned int x_keep, unsigned int x_get_bulk,
	  unsigned int x_put_bulk)
{
	void *obj_table[MAX_KEEP] __rte_cache_aligned;
	unsigned int idx;
	unsigned int i;
	int ret;

	for (i = 0; likely(i < (N / x_keep)); i++) {
		/* get x_keep objects by bulk of x_get_bulk */
		for (idx = 0; idx < x_keep; idx += x_get_bulk) {
			ret = rte_mempool_generic_get(mp,
						      &obj_table[idx],
						      x_get_bulk,
						      cache);
			if (unlikely(ret < 0)) {
				rte_mempool_dump(stdout, mp);
				return ret;
			}
		}

		/* put the objects back by bulk of x_put_bulk */
		for (idx = 0; idx < x_keep; idx += x_put_bulk) {
			rte_mempool_generic_put(mp,
						&obj_table[idx],
						x_put_bulk,
						cache);
		}
	}

	return 0;
}

>  static int
>  per_lcore_mempool_test(void *arg)
>  {
> -	void *obj_table[MAX_KEEP];
> -	unsigned i, idx;
> +	void *obj_table[MAX_KEEP] __rte_cache_aligned;
> +	int i, idx;
>  	struct rte_mempool *mp = arg;
> -	unsigned lcore_id = rte_lcore_id();
> +	unsigned int lcore_id = rte_lcore_id();
>  	int ret = 0;
>  	uint64_t start_cycles, end_cycles;
>  	uint64_t time_diff = 0, hz = rte_get_timer_hz();
> @@ -139,6 +168,9 @@ per_lcore_mempool_test(void *arg)
>  		GOTO_ERR(ret, out);
>  	if (((n_keep / n_put_bulk) * n_put_bulk) != n_keep)
>  		GOTO_ERR(ret, out);
> +	/* for constant n, n_get_bulk and n_put_bulk must be the same */
> +	if (n_get_bulk < 0 && n_put_bulk != n_get_bulk)
> +		GOTO_ERR(ret, out);
>
>  	stats[lcore_id].enq_count = 0;
>
> @@ -149,30 +181,18 @@
>  	start_cycles = rte_get_timer_cycles();
>
>  	while (time_diff/hz < TIME_S) {
> -		for (i = 0; likely(i < (N/n_keep)); i++) {
> -			/* get n_keep objects by bulk of n_bulk */
> -			idx = 0;
> -			while (idx < n_keep) {
> -				ret = rte_mempool_generic_get(mp,
> -							      &obj_table[idx],
> -							      n_get_bulk,
> -							      cache);
> -				if (unlikely(ret < 0)) {
> -					rte_mempool_dump(stdout, mp);
> -					/* in this case, objects are lost... */
> -					GOTO_ERR(ret, out);
> -				}
> -				idx += n_get_bulk;
> -			}
> -
> -			/* put the objects back */
> -			idx = 0;
> -			while (idx < n_keep) {
> -				rte_mempool_generic_put(mp, &obj_table[idx],
> -							n_put_bulk,
> -							cache);
> -				idx += n_put_bulk;
> -			}
> +		if (n_get_bulk > 0) {
> +			test_loop(n_keep, n_get_bulk, n_put_bulk);
> +		} else if (n_get_bulk == -1) {
> +			test_loop(-n_keep, 1, 1);
> +		} else if (n_get_bulk == -4) {
> +			test_loop(-n_keep, 4, 4);
> +		} else if (n_get_bulk == -(int)CACHE_LINE_BURST) {
> +			test_loop(-n_keep, CACHE_LINE_BURST, CACHE_LINE_BURST);
> +		} else if (n_get_bulk == -32) {
> +			test_loop(-n_keep, 32, 32);
> +		} else {
> +			GOTO_ERR(ret, out);
>  		}
>  		end_cycles = rte_get_timer_cycles();
>  		time_diff = end_cycles - start_cycles;

I'm not convinced that having negative values to mean "constant" is
clear. I'd prefer to have another global variable "use_constant_values",
which would give something like this:

	if (!use_constant_values)
		ret = test_loop(mp, cache, n_keep, n_get_bulk, n_put_bulk);
	else if (n_get_bulk == 1)
		ret = test_loop(mp, cache, n_keep, 1, 1);
	else if (n_get_bulk == 4)
		ret = test_loop(mp, cache, n_keep, 4, 4);
	else if (n_get_bulk == CACHE_LINE_BURST)
		ret = test_loop(mp, cache, n_keep, CACHE_LINE_BURST,
				CACHE_LINE_BURST);
	else if (n_get_bulk == 32)
		ret = test_loop(mp, cache, n_keep, 32, 32);
	else
		ret = -1;

	if (ret < 0)
		GOTO_ERR(ret, out);

> @@ -192,10 +212,10 @@ per_lcore_mempool_test(void *arg)
>  static int
>  launch_cores(struct rte_mempool *mp, unsigned int cores)
>  {
> -	unsigned lcore_id;
> +	unsigned int lcore_id;
>  	uint64_t rate;
>  	int ret;
> -	unsigned cores_save = cores;
> +	unsigned int cores_save = cores;
>
>  	__atomic_store_n(&synchro, 0, __ATOMIC_RELAXED);
>
> @@ -203,10 +223,14 @@ launch_cores(struct rte_mempool *mp, unsigned int cores)
>  	memset(stats, 0, sizeof(stats));
>
>  	printf("mempool_autotest cache=%u cores=%u n_get_bulk=%u "
> -	       "n_put_bulk=%u n_keep=%u ",
> +	       "n_put_bulk=%u n_keep=%u constant_n=%s ",
>  	       use_external_cache ?
> -	       external_cache_size : (unsigned) mp->cache_size,
> -	       cores, n_get_bulk, n_put_bulk, n_keep);
> +	       external_cache_size : (unsigned int) mp->cache_size,
> +	       cores,
> +	       n_get_bulk > 0 ? n_get_bulk : -n_get_bulk,
> +	       n_put_bulk > 0 ? n_put_bulk : -n_put_bulk,
> +	       n_keep > 0 ? n_keep : -n_keep,
> +	       n_get_bulk > 0 ? "false" : "true");

This would become much simpler with this new variable.

>  	if (rte_mempool_avail_count(mp) != MEMPOOL_SIZE) {
>  		printf("mempool is not full\n");
> @@ -253,12 +277,13 @@ launch_cores(struct rte_mempool *mp, unsigned int cores)
>  static int
>  do_one_mempool_test(struct rte_mempool *mp, unsigned int cores)
>  {
> -	unsigned bulk_tab_get[] = { 1, 4, 32, 0 };
> -	unsigned bulk_tab_put[] = { 1, 4, 32, 0 };
> -	unsigned keep_tab[] = { 32, 128, 0 };
> -	unsigned *get_bulk_ptr;
> -	unsigned *put_bulk_ptr;
> -	unsigned *keep_ptr;
> +	/* Negative n_get_bulk values represent constants in the test. */
> +	int bulk_tab_get[] = { 1, 4, CACHE_LINE_BURST, 32, -1, -4, -(int)CACHE_LINE_BURST, -32, 0 };
> +	int bulk_tab_put[] = { 1, 4, CACHE_LINE_BURST, 32, 0 };
> +	int keep_tab[] = { 32, 128, 512, 0 };
> +	int *get_bulk_ptr;
> +	int *put_bulk_ptr;
> +	int *keep_ptr;
>  	int ret;

Same here, changes would be minimal.

>  	for (get_bulk_ptr = bulk_tab_get; *get_bulk_ptr; get_bulk_ptr++) {
> @@ -266,13 +291,16 @@ do_one_mempool_test(struct rte_mempool *mp, unsigned int cores)
>  		for (keep_ptr = keep_tab; *keep_ptr; keep_ptr++) {
>
>  			n_get_bulk = *get_bulk_ptr;
> -			n_put_bulk = *put_bulk_ptr;
> -			n_keep = *keep_ptr;
> +			n_put_bulk = n_get_bulk > 0 ? *put_bulk_ptr : n_get_bulk;
> +			n_keep = n_get_bulk > 0 ? *keep_ptr : -*keep_ptr;
>  			ret = launch_cores(mp, cores);
>
>  			if (ret < 0)
>  				return -1;

No change would be required above (except use_constant_values = 0).
And below, we could simply add instead:

	/* replay test with constant values */
	if (n_get_bulk == n_put_bulk) {
		use_constant_values = 1;
		ret = launch_cores(mp, cores);
		if (ret < 0)
			return -1;
	}

If you are ok with the proposed changes, I can directly submit a v2,
since I already made them locally.

Thanks,
Olivier