From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 25 Jan 2022 13:56:35 +0100
From: Olivier Matz <olivier.matz@6wind.com>
To: Morten Brørup <mb@smartsharesystems.com>
Cc: dev@dpdk.org, andrew.rybchenko@oktetlabs.ru, bruce.richardson@intel.com,
 jerinjacobk@gmail.com, thomas@monjalon.net
Subject: Re: [PATCH v2] mempool: test performance with constant n
References: <20220119113732.40167-1-mb@smartsharesystems.com>
 <20220124145953.14281-1-olivier.matz@6wind.com>
 <98CBD80474FA8B44BF855DF32C47DC35D86E29@smartserver.smartshare.dk>
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D86E29@smartserver.smartshare.dk>

Hi Morten,

On Mon, Jan 24, 2022 at 06:20:49PM +0100, Morten Brørup wrote:
> > From: Olivier Matz [mailto:olivier.matz@6wind.com]
> > Sent: Monday, 24 January 2022 16.00
> >
> > From: Morten Brørup <mb@smartsharesystems.com>
> >
> > "What gets measured gets done."
> >
> > This patch adds mempool performance tests where the number of objects
> > to put and get is constant at compile time, which may significantly
> > improve the performance of these functions. [*]
> >
> > Also, it is ensured that the array holding the objects used for
> > testing is cache line aligned, for maximum performance.
> >
> > And finally, the following entries are added to the list of tests:
> > - Number of kept objects: 512
> > - Number of objects to get and to put: the number of pointers fitting
> >   into a cache line, i.e. 8 or 16
> >
> > [*] Some example performance test (with cache) results:
> >
> > get_bulk=4 put_bulk=4 keep=128 constant_n=false rate_persec=280480972
> > get_bulk=4 put_bulk=4 keep=128 constant_n=true rate_persec=622159462
> >
> > get_bulk=8 put_bulk=8 keep=128 constant_n=false rate_persec=477967155
> > get_bulk=8 put_bulk=8 keep=128 constant_n=true rate_persec=917582643
> >
> > get_bulk=32 put_bulk=32 keep=32 constant_n=false rate_persec=871248691
> > get_bulk=32 put_bulk=32 keep=32 constant_n=true rate_persec=1134021836
> >
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
> > ---
> >
> > Hi Morten,
> >
> > Here is the updated patch.
> >
> > I launched the mempool_perf test on my desktop machine, but I can't
> > reproduce the numbers: constant and non-constant n give almost the
> > same rate on my machine (it's even worse with constants). I tested
> > with your initial patch and with this one. Can you please try this
> > patch, and/or give some details about your test environment?
>
> Test environment:
> VMware virtual machine running Ubuntu 20.04.3 LTS.
> 4 CPUs and 8 GB RAM assigned.
> The physical CPU is a Xeon E5-2620 v4 with plenty of RAM.
> Although other VMs are running on the same server, it is not very
> oversubscribed.
>
> Hugepages established with:
> usertools/dpdk-hugepages.py -p 2M --setup 2G
>
> Build steps:
> meson -Dplatform=generic work
> cd work
> ninja
>
> > Here is what I get:
> >
> > with your patch:
> > mempool_autotest cache=512 cores=1 n_get_bulk=8 n_put_bulk=8 n_keep=128 constant_n=false rate_persec=152620236
> > mempool_autotest cache=512 cores=1 n_get_bulk=8 n_put_bulk=8 n_keep=128 constant_n=true rate_persec=144716595
> > mempool_autotest cache=512 cores=2 n_get_bulk=8 n_put_bulk=8 n_keep=128 constant_n=false rate_persec=306996838
> > mempool_autotest cache=512 cores=2 n_get_bulk=8 n_put_bulk=8 n_keep=128 constant_n=true rate_persec=287375359
> > mempool_autotest cache=512 cores=12 n_get_bulk=8 n_put_bulk=8 n_keep=128 constant_n=false rate_persec=977626723
> > mempool_autotest cache=512 cores=12 n_get_bulk=8 n_put_bulk=8 n_keep=128 constant_n=true rate_persec=963103944
>
> My test results were with an experimental, optimized version of the
> mempool library, which showed a larger difference. (This was the reason
> for updating the perf test - to measure the effects of optimizing the
> mempool library.)
>
> However, testing the patch (version 1) with a brand new git checkout
> still shows a huge difference, e.g.:
>
> mempool_autotest cache=512 cores=1 n_get_bulk=8 n_put_bulk=8 n_keep=128 constant_n=false rate_persec=501009612
> mempool_autotest cache=512 cores=1 n_get_bulk=8 n_put_bulk=8 n_keep=128 constant_n=true rate_persec=799014912
>
> You should also see a significant difference when testing.
>
> My rate_persec without constant n is 3x yours (501 M vs. 156 M ops/s),
> so the baseline seems wrong! I don't think our server rig is so much
> faster than your desktop machine. Perhaps mempool debug, telemetry or
> other background noise is polluting your test.
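A note on the mechanism being measured here: when the bulk count is a
compile-time constant, the compiler can specialize the get/put paths and
fully unroll their inner copy loops. Below is a minimal standalone sketch
of that runtime-to-constant dispatch idea; copy_objs() and copy_dispatch()
are illustrative names, not part of the patch or of DPDK:

/* Forced inlining means each call site below is compiled as its own
 * specialization of 'n'; with a constant 'n' the loop collapses into a
 * fixed sequence of moves. */
static inline __attribute__((always_inline)) void
copy_objs(void **dst, void * const *src, unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n; i++)
		dst[i] = src[i];
}

/* Dispatch a runtime 'n' to compile-time constant specializations,
 * mirroring the if/else ladder the patch adds around test_loop(). */
static void
copy_dispatch(void **dst, void * const *src, unsigned int n)
{
	if (n == 4)
		copy_objs(dst, src, 4);		/* unrolled by the compiler */
	else if (n == 8)
		copy_objs(dst, src, 8);		/* unrolled by the compiler */
	else
		copy_objs(dst, src, n);		/* generic loop, runtime bound */
}

This also explains why the patch only replays a test with constant values
when n_get_bulk == n_put_bulk: each branch of the dispatch ladder passes a
single literal for both bulk sizes.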

Sorry, I just realized that I was indeed using a "debugoptimized" build.
It's much better in release mode:

mempool_autotest cache=512 cores=1 n_get_bulk=8 n_put_bulk=8 n_keep=128 constant_n=0 rate_persec=1425473536
mempool_autotest cache=512 cores=1 n_get_bulk=8 n_put_bulk=8 n_keep=128 constant_n=1 rate_persec=2159660236
mempool_autotest cache=512 cores=2 n_get_bulk=8 n_put_bulk=8 n_keep=128 constant_n=0 rate_persec=2796342476
mempool_autotest cache=512 cores=2 n_get_bulk=8 n_put_bulk=8 n_keep=128 constant_n=1 rate_persec=4351577292
mempool_autotest cache=512 cores=12 n_get_bulk=8 n_put_bulk=8 n_keep=128 constant_n=0 rate_persec=8589777300
mempool_autotest cache=512 cores=12 n_get_bulk=8 n_put_bulk=8 n_keep=128 constant_n=1 rate_persec=13560971258

Thanks,
Olivier
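For context on how the rate_persec figures above are produced: the test
runs a fixed-duration measurement loop and divides the operation count by
the elapsed time. A rough sketch of the pattern, simplified from
per_lcore_mempool_test() in app/test/test_mempool_perf.c; the workload()
callback is an illustrative stand-in for the bulk get/put loops, and error
handling is omitted:

#include <stdint.h>
#include <rte_cycles.h>

#define N	65536	/* operations per chunk, as in the test */
#define TIME_S	5	/* measurement duration in seconds */

/* Run the workload in chunks of N operations until TIME_S seconds of
 * timer time have elapsed, then report operations per second. */
static uint64_t
measure_rate(void (*workload)(void))
{
	const uint64_t hz = rte_get_timer_hz();
	uint64_t start_cycles = rte_get_timer_cycles();
	uint64_t time_diff = 0;
	uint64_t ops = 0;

	while (time_diff / hz < TIME_S) {
		workload();	/* performs N get/put operations */
		ops += N;
		time_diff = rte_get_timer_cycles() - start_cycles;
	}
	return ops / TIME_S;	/* approximately rate_persec */
}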
> >
> > with this patch:
> > mempool_autotest cache=512 cores=1 n_get_bulk=8 n_put_bulk=8 n_keep=128 constant_n=0 rate_persec=156460646
> > mempool_autotest cache=512 cores=1 n_get_bulk=8 n_put_bulk=8 n_keep=128 constant_n=1 rate_persec=142173798
> > mempool_autotest cache=512 cores=2 n_get_bulk=8 n_put_bulk=8 n_keep=128 constant_n=0 rate_persec=312410111
> > mempool_autotest cache=512 cores=2 n_get_bulk=8 n_put_bulk=8 n_keep=128 constant_n=1 rate_persec=281699942
> > mempool_autotest cache=512 cores=12 n_get_bulk=8 n_put_bulk=8 n_keep=128 constant_n=0 rate_persec=983315247
> > mempool_autotest cache=512 cores=12 n_get_bulk=8 n_put_bulk=8 n_keep=128 constant_n=1 rate_persec=950350638
> >
> > v2:
> > - use a flag instead of a negative value to enable tests with
> >   compile-time constant
> > - use a static inline function instead of a macro
> > - remove some "noise" (do not change variable type when not required)
> >
> > Thanks,
> > Olivier
> >
> >  app/test/test_mempool_perf.c | 110 ++++++++++++++++++++++++-----------
> >  1 file changed, 77 insertions(+), 33 deletions(-)
> >
> > diff --git a/app/test/test_mempool_perf.c b/app/test/test_mempool_perf.c
> > index 87ad251367..ce7c6241ab 100644
> > --- a/app/test/test_mempool_perf.c
> > +++ b/app/test/test_mempool_perf.c
> > @@ -1,5 +1,6 @@
> >  /* SPDX-License-Identifier: BSD-3-Clause
> >   * Copyright(c) 2010-2014 Intel Corporation
> > + * Copyright(c) 2022 SmartShare Systems
> >   */
> >
> >  #include <string.h>
> > @@ -55,19 +56,24 @@
> >   *
> >   *    - Bulk get from 1 to 32
> >   *    - Bulk put from 1 to 32
> > + *    - Bulk get and put from 1 to 32, compile time constant
> >   *
> >   *  - Number of kept objects (*n_keep*)
> >   *
> >   *    - 32
> >   *    - 128
> > + *    - 512
> >   */
> >
> >  #define N 65536
> >  #define TIME_S 5
> >  #define MEMPOOL_ELT_SIZE 2048
> > -#define MAX_KEEP 128
> > +#define MAX_KEEP 512
> >  #define MEMPOOL_SIZE ((rte_lcore_count()*(MAX_KEEP+RTE_MEMPOOL_CACHE_MAX_SIZE))-1)
> >
> > +/* Number of pointers fitting into one cache line. */
> > +#define CACHE_LINE_BURST (RTE_CACHE_LINE_SIZE / sizeof(uintptr_t))
> > +
> >  #define LOG_ERR() printf("test failed at %s():%d\n", __func__, __LINE__)
> >  #define RET_ERR() do {						\
> >  		LOG_ERR();					\
> > @@ -91,6 +97,9 @@ static unsigned n_put_bulk;
> >  /* number of objects retrieved from mempool before putting them back */
> >  static unsigned n_keep;
> >
> > +/* true if we want to test with constant n_get_bulk and n_put_bulk */
> > +static int use_constant_values;
> > +
> >  /* number of enqueues / dequeues */
> >  struct mempool_test_stats {
> >  	uint64_t enq_count;
> > @@ -111,11 +120,43 @@ my_obj_init(struct rte_mempool *mp, __rte_unused void *arg,
> >  	*objnum = i;
> >  }
> >
> > +static __rte_always_inline int
> > +test_loop(struct rte_mempool *mp, struct rte_mempool_cache *cache,
> > +	  unsigned int x_keep, unsigned int x_get_bulk, unsigned int x_put_bulk)
> > +{
> > +	void *obj_table[MAX_KEEP] __rte_cache_aligned;
> > +	unsigned int idx;
> > +	unsigned int i;
> > +	int ret;
> > +
> > +	for (i = 0; likely(i < (N / x_keep)); i++) {
> > +		/* get x_keep objects by bulk of x_get_bulk */
> > +		for (idx = 0; idx < x_keep; idx += x_get_bulk) {
> > +			ret = rte_mempool_generic_get(mp,
> > +						      &obj_table[idx],
> > +						      x_get_bulk,
> > +						      cache);
> > +			if (unlikely(ret < 0)) {
> > +				rte_mempool_dump(stdout, mp);
> > +				return ret;
> > +			}
> > +		}
> > +
> > +		/* put the objects back by bulk of x_put_bulk */
> > +		for (idx = 0; idx < x_keep; idx += x_put_bulk) {
> > +			rte_mempool_generic_put(mp,
> > +						&obj_table[idx],
> > +						x_put_bulk,
> > +						cache);
> > +		}
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> >  static int
> >  per_lcore_mempool_test(void *arg)
> >  {
> > -	void *obj_table[MAX_KEEP];
> > -	unsigned i, idx;
> >  	struct rte_mempool *mp = arg;
> >  	unsigned lcore_id = rte_lcore_id();
> >  	int ret = 0;
> > @@ -139,6 +180,9 @@ per_lcore_mempool_test(void *arg)
> >  		GOTO_ERR(ret, out);
> >  	if (((n_keep / n_put_bulk) * n_put_bulk) != n_keep)
> >  		GOTO_ERR(ret, out);
> > +	/* for constant n, n_get_bulk and n_put_bulk must be the same */
> > +	if (use_constant_values && n_put_bulk != n_get_bulk)
> > +		GOTO_ERR(ret, out);
> >
> >  	stats[lcore_id].enq_count = 0;
> >
> > @@ -149,31 +193,23 @@ per_lcore_mempool_test(void *arg)
> >  	start_cycles = rte_get_timer_cycles();
> >
> >  	while (time_diff/hz < TIME_S) {
> > -		for (i = 0; likely(i < (N/n_keep)); i++) {
> > -			/* get n_keep objects by bulk of n_bulk */
> > -			idx = 0;
> > -			while (idx < n_keep) {
> > -				ret = rte_mempool_generic_get(mp,
> > -							      &obj_table[idx],
> > -							      n_get_bulk,
> > -							      cache);
> > -				if (unlikely(ret < 0)) {
> > -					rte_mempool_dump(stdout, mp);
> > -					/* in this case, objects are lost... */
> > -					GOTO_ERR(ret, out);
> > -				}
> > -				idx += n_get_bulk;
> > -			}
> > +		if (!use_constant_values)
> > +			ret = test_loop(mp, cache, n_keep, n_get_bulk, n_put_bulk);
> > +		else if (n_get_bulk == 1)
> > +			ret = test_loop(mp, cache, n_keep, 1, 1);
> > +		else if (n_get_bulk == 4)
> > +			ret = test_loop(mp, cache, n_keep, 4, 4);
> > +		else if (n_get_bulk == CACHE_LINE_BURST)
> > +			ret = test_loop(mp, cache, n_keep,
> > +					CACHE_LINE_BURST, CACHE_LINE_BURST);
> > +		else if (n_get_bulk == 32)
> > +			ret = test_loop(mp, cache, n_keep, 32, 32);
> > +		else
> > +			ret = -1;
> > +
> > +		if (ret < 0)
> > +			GOTO_ERR(ret, out);
> >
> > -			/* put the objects back */
> > -			idx = 0;
> > -			while (idx < n_keep) {
> > -				rte_mempool_generic_put(mp, &obj_table[idx],
> > -							n_put_bulk,
> > -							cache);
> > -				idx += n_put_bulk;
> > -			}
> > -		}
> >  		end_cycles = rte_get_timer_cycles();
> >  		time_diff = end_cycles - start_cycles;
> >  		stats[lcore_id].enq_count += N;
> > @@ -203,10 +239,10 @@ launch_cores(struct rte_mempool *mp, unsigned int cores)
> >  	memset(stats, 0, sizeof(stats));
> >
> >  	printf("mempool_autotest cache=%u cores=%u n_get_bulk=%u "
> > -	       "n_put_bulk=%u n_keep=%u ",
> > +	       "n_put_bulk=%u n_keep=%u constant_n=%u ",
> >  	       use_external_cache ?
> >  		   external_cache_size : (unsigned) mp->cache_size,
> > -	       cores, n_get_bulk, n_put_bulk, n_keep);
> > +	       cores, n_get_bulk, n_put_bulk, n_keep, use_constant_values);
> >
> >  	if (rte_mempool_avail_count(mp) != MEMPOOL_SIZE) {
> >  		printf("mempool is not full\n");
> > @@ -253,9 +289,9 @@ launch_cores(struct rte_mempool *mp, unsigned int cores)
> >  static int
> >  do_one_mempool_test(struct rte_mempool *mp, unsigned int cores)
> >  {
> > -	unsigned bulk_tab_get[] = { 1, 4, 32, 0 };
> > -	unsigned bulk_tab_put[] = { 1, 4, 32, 0 };
> > -	unsigned keep_tab[] = { 32, 128, 0 };
> > +	unsigned int bulk_tab_get[] = { 1, 4, CACHE_LINE_BURST, 32, 0 };
> > +	unsigned int bulk_tab_put[] = { 1, 4, CACHE_LINE_BURST, 32, 0 };
> > +	unsigned int keep_tab[] = { 32, 128, 512, 0 };
> >  	unsigned *get_bulk_ptr;
> >  	unsigned *put_bulk_ptr;
> >  	unsigned *keep_ptr;
> > @@ -265,13 +301,21 @@ do_one_mempool_test(struct rte_mempool *mp, unsigned int cores)
> >  		for (put_bulk_ptr = bulk_tab_put; *put_bulk_ptr; put_bulk_ptr++) {
> >  			for (keep_ptr = keep_tab; *keep_ptr; keep_ptr++) {
> >
> > +				use_constant_values = 0;
> >  				n_get_bulk = *get_bulk_ptr;
> >  				n_put_bulk = *put_bulk_ptr;
> >  				n_keep = *keep_ptr;
> >  				ret = launch_cores(mp, cores);
> > -
> >  				if (ret < 0)
> >  					return -1;
> > +
> > +				/* replay test with constant values */
> > +				if (n_get_bulk == n_put_bulk) {
> > +					use_constant_values = 1;
> > +					ret = launch_cores(mp, cores);
> > +					if (ret < 0)
> > +						return -1;
> > +				}
> >  			}
> >  		}
> >  	}
> > --
> > 2.30.2
>
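A closing note on the CACHE_LINE_BURST constant added by the patch: it is
the number of pointers fitting into one cache line, which the commit
message summarizes as "8 or 16". A small standalone sanity-check sketch,
not part of the patch:

#include <stdint.h>
#include <rte_common.h>	/* RTE_CACHE_LINE_SIZE */

/* Same definition as the patch adds to test_mempool_perf.c. */
#define CACHE_LINE_BURST (RTE_CACHE_LINE_SIZE / sizeof(uintptr_t))

/* 64-byte cache lines with 8-byte pointers give 8; platforms built with
 * 128-byte cache lines (or 4-byte pointers) give 16, matching the
 * "8 or 16" of the commit message. */
_Static_assert(CACHE_LINE_BURST == 8 || CACHE_LINE_BURST == 16,
	       "unexpected number of pointers per cache line");

The cache-line alignment of obj_table (__rte_cache_aligned in test_loop())
complements this: each burst of CACHE_LINE_BURST pointers then starts on a
cache line boundary.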