Subject: RE: [PATCH] mempool: optimize get objects with constant n
Date: Tue, 18 Apr 2023 13:29:49 +0200
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D8788D@smartserver.smartshare.dk>
References: <20230411064845.37713-1-mb@smartsharesystems.com>
From: Morten Brørup
To: Bruce Richardson
List-Id: DPDK patches and discussions (dev@dpdk.org)

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Tuesday, 18 April 2023 13.07
>
> On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> > When getting objects from the mempool, the number of objects to get is
> > often constant at build time.
> >
> > This patch adds another code path for this case, so the compiler can
> > optimize more, e.g. unroll the copy loop when the entire request is
> > satisfied from the cache.
> >
> > On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0,
> > mempool_perf_test with constant n shows an increase in rate_persec by an
> > average of 17 %, minimum 9.5 %, maximum 24 %.
> >
> > The code path where the number of objects to get is unknown at build time
> > remains essentially unchanged.
> >
> > Signed-off-by: Morten Brørup
>
> The change looks like a good idea. Some suggestions inline below, which you
> may want to take on board for any future version. I'd strongly suggest
> adding some extra clarifying code comments, as I suggest below.
> With those extra code comments:
>
> Acked-by: Bruce Richardson
>
> > ---
> >  lib/mempool/rte_mempool.h | 24 +++++++++++++++++++++---
> >  1 file changed, 21 insertions(+), 3 deletions(-)
> >
> > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > index 9f530db24b..ade0100ec7 100644
> > --- a/lib/mempool/rte_mempool.h
> > +++ b/lib/mempool/rte_mempool.h
> > @@ -1500,15 +1500,33 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
> >  	if (unlikely(cache == NULL))
> >  		goto driver_dequeue;
> >
> > -	/* Use the cache as much as we have to return hot objects first */
> > -	len = RTE_MIN(remaining, cache->len);
> >  	cache_objs = &cache->objs[cache->len];
> > +
> > +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
> > +		/*
> > +		 * The request size is known at build time, and
> > +		 * the entire request can be satisfied from the cache,
> > +		 * so let the compiler unroll the fixed length copy loop.
> > +		 */
> > +		cache->len -= n;
> > +		for (index = 0; index < n; index++)
> > +			*obj_table++ = *--cache_objs;
> > +
>
> This loop looks a little awkward to me. Would it be clearer (and perhaps
> easier for compilers to unroll efficiently) if it was rewritten as:
>
> 	cache->len -= n;
> 	cache_objs = &cache->objs[cache->len];
> 	for (index = 0; index < n; index++)
> 		obj_table[index] = cache_objs[index];

The mempool cache is a stack, so the copy loop needs to get the objects in
decrementing order, i.e. the source index decrements and the destination
index increments.

Regardless, your point here is still valid! I expected that any
unrolling-capable compiler can unroll *dst++ = *--src; but I can experiment
with different compilers on godbolt.org to see if dst[index] = src[-index]
is better.

A future version could be hand-coded with vector instructions here, like in
the Ethdev drivers.

>
> Alternatively, those last two lines can be replaced by a memcpy, which the
> compiler should nicely optimize itself for a constant-size copy:
>
> 	memcpy(obj_table, cache_objs, sizeof(obj_table[0]) * n);

Just for reference: It was previously discussed to ignore the stack ordering
and simply copy the objects, but the idea was rejected due to the potential
performance impact of not returning the hottest objects, i.e. the objects at
the top of the stack, as the first in the returned array.

>
> > +		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> > +		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
> > +
> > +		return 0;
> > +	}
> > +
> > +	/* Use the cache as much as we have to return hot objects first */
> > +	len = __extension__(__builtin_constant_p(n)) ? cache->len :
> > +			RTE_MIN(remaining, cache->len);
>
> This line confused me a lot, until I applied the patch and stared at it a
> lot :-). Why would the length value depend upon whether or not n is a
> compile-time constant?
> I think it would really help here to add in a comment stating that when n
> is a compile-time constant, at this point it must be "> cache->len", since
> the previous block was untaken, and therefore we don't need to call RTE_MIN.

I agree. This is almost like Perl... write-only source code.

I will add a few explanatory comments.

>
> >  	cache->len -= len;
> >  	remaining -= len;
> >  	for (index = 0; index < len; index++)
> >  		*obj_table++ = *--cache_objs;
> >
> > -	if (remaining == 0) {
> > +	if (!__extension__(__builtin_constant_p(n)) && remaining == 0) {
> >  		/* The entire request is satisfied from the cache. */
> >
> >  		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
>
> Once I'd worked out the logic for the above conditional check, this
> conditional adjustment was clear. I just wonder if a further comment might
> help here.
>
> I am also wondering if having larger blocks for the constant and
> non-constant cases may help. It would lead to some duplication but may
> clarify the code.

I get your point, but the code is still short enough to grasp (after
comments have been added), so I prefer to avoid code duplication.

>
> > --
> > 2.17.1
> >
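
For reference, the stack-ordered copy discussed above is easy to reproduce in
isolation. The toy program below is only a sketch, not the DPDK
implementation: struct toy_cache, toy_cache_get() and TOY_CACHE_SIZE are
invented names for illustration. It copies from the top of the stack
downwards, so the first element of the output array is the hottest object,
and it calls the helper with a constant n, which is the case where
__builtin_constant_p(n) in the patch selects the unrollable path.

/* Toy illustration of the copy pattern under discussion; NOT the DPDK code.
 * The cache is a stack: objs[len - 1] holds the hottest object. */
#include <stdio.h>

#define TOY_CACHE_SIZE 8u

struct toy_cache {
	unsigned int len;		/* number of cached objects */
	void *objs[TOY_CACHE_SIZE];	/* objs[len - 1] is the top of the stack */
};

/* Copy n objects from the top of the stack into obj_table, hottest first.
 * The source index decrements while the destination index increments.
 * When n is a compile-time constant, the compiler can fully unroll the loop. */
static inline void
toy_cache_get(struct toy_cache *c, void **obj_table, unsigned int n)
{
	void **cache_objs = &c->objs[c->len];
	unsigned int index;

	c->len -= n;
	for (index = 0; index < n; index++)
		*obj_table++ = *--cache_objs;
}

int main(void)
{
	static int storage[TOY_CACHE_SIZE];
	struct toy_cache c = { .len = TOY_CACHE_SIZE };
	void *out[4];
	unsigned int i;

	for (i = 0; i < TOY_CACHE_SIZE; i++)
		c.objs[i] = &storage[i];

	toy_cache_get(&c, out, 4);	/* constant n: candidate for full unrolling */

	for (i = 0; i < 4; i++)		/* prints indexes 7, 6, 5, 4: hottest first */
		printf("out[%u] -> storage[%d]\n", i, (int)((int *)out[i] - storage));

	return 0;
}

With a literal constant n at the call site, __builtin_constant_p(n) evaluates
to 1 at compile time, so an optimizing compiler can fully unroll the
four-iteration loop; with a runtime n it evaluates to 0 and the generic path
in the patch is taken instead.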