Subject: RE: [PATCH] mempool: optimize get objects with constant n
Date: Tue, 18 Apr 2023 13:29:49 +0200
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D8788D@smartserver.smartshare.dk>
References: <20230411064845.37713-1-mb@smartsharesystems.com>
From: Morten Brørup
To: Bruce Richardson
List-Id: DPDK patches and discussions (dev@dpdk.org)

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Tuesday, 18 April 2023 13.07
>
> On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> > When getting objects from the mempool, the number of objects to get is
> > often constant at build time.
> >
> > This patch adds another code path for this case, so the compiler can
> > optimize more, e.g. unroll the copy loop when the entire request is
> > satisfied from the cache.
> >
> > On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0,
> > mempool_perf_test with constant n shows an increase in rate_persec by an
> > average of 17 %, minimum 9.5 %, maximum 24 %.
> >
> > The code path where the number of objects to get is unknown at build time
> > remains essentially unchanged.
> >
> > Signed-off-by: Morten Brørup
>
> The change looks like a good idea. Some suggestions inline below, which you
> may want to take on board for any future version. I'd strongly suggest
> adding some extra clarifying code comments, as I suggest below.
> With those extra code comments:
>
> Acked-by: Bruce Richardson
>
> > ---
> >  lib/mempool/rte_mempool.h | 24 +++++++++++++++++++++---
> >  1 file changed, 21 insertions(+), 3 deletions(-)
> >
> > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > index 9f530db24b..ade0100ec7 100644
> > --- a/lib/mempool/rte_mempool.h
> > +++ b/lib/mempool/rte_mempool.h
> > @@ -1500,15 +1500,33 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
> >  	if (unlikely(cache == NULL))
> >  		goto driver_dequeue;
> >
> > -	/* Use the cache as much as we have to return hot objects first */
> > -	len = RTE_MIN(remaining, cache->len);
> >  	cache_objs = &cache->objs[cache->len];
> > +
> > +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
> > +		/*
> > +		 * The request size is known at build time, and
> > +		 * the entire request can be satisfied from the cache,
> > +		 * so let the compiler unroll the fixed length copy loop.
> > +		 */
> > +		cache->len -= n;
> > +		for (index = 0; index < n; index++)
> > +			*obj_table++ = *--cache_objs;
> > +
>
> This loop looks a little awkward to me. Would it be clearer (and perhaps
> easier for compilers to unroll efficiently) if it was rewritten as:
>
> 	cache->len -= n;
> 	cache_objs = &cache->objs[cache->len];
> 	for (index = 0; index < n; index++)
> 		obj_table[index] = cache_objs[index];

The mempool cache is a stack, so the copy loop needs to get the objects in
decrementing order, i.e. the source index decrements and the destination
index increments.

Regardless, your point here is still valid! I expected that any
unrolling-capable compiler can unroll *dst++ = *--src; but I can experiment
with different compilers on godbolt.org to see if dst[index] = src[-index]
is better.

A future version could be hand-coded with vector instructions here, like in
the Ethdev drivers.

>
> Alternatively, those last two lines can be replaced by a memcpy, which the
> compiler should nicely optimize itself for a constant-size copy:
>
> 	memcpy(obj_table, cache_objs, sizeof(obj_table[0]) * n);

Just for reference: It was previously discussed to ignore the stack ordering
and simply copy the objects, but the idea was rejected due to the potential
performance impact of not returning the hottest objects, i.e. the objects at
the top of the stack, as the first in the returned array.

>
> > +		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> > +		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
> > +
> > +		return 0;
> > +	}
> > +
> > +	/* Use the cache as much as we have to return hot objects first */
> > +	len = __extension__(__builtin_constant_p(n)) ? cache->len :
> > +			RTE_MIN(remaining, cache->len);
>
> This line confused me a lot, until I applied the patch and stared at it a
> lot :-). Why would the length value depend upon whether or not n is a
> compile-time constant?
> I think it would really help here to add in a comment stating that when n
> is a compile-time constant, at this point it must be "> cache->len", since
> the previous block was untaken, and therefore we don't need to call RTE_MIN.

I agree. This is almost like Perl... write-only source code.

I will add a few explanatory comments.

>
> >  	cache->len -= len;
> >  	remaining -= len;
> >  	for (index = 0; index < len; index++)
> >  		*obj_table++ = *--cache_objs;
> >
> > -	if (remaining == 0) {
> > +	if (!__extension__(__builtin_constant_p(n)) && remaining == 0) {
> >  		/* The entire request is satisfied from the cache. */
> >
> >  		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
>
> Once I'd worked out the logic for the above conditional check, this
> conditional adjustment was clear. I just wonder if a further comment might
> help here.
>
> I am also wondering if having larger blocks for the constant and
> non-constant cases may help. It would lead to some duplication but may
> clarify the code.

I get your point, but the code is still short enough to grasp (after
comments have been added), so I prefer to avoid code duplication.

>
> > --
> > 2.17.1
> >
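
For reference, the stack-ordered copy discussed above is easy to reproduce in
isolation. The toy program below is only a sketch, not the DPDK
implementation: struct toy_cache, toy_cache_get() and TOY_CACHE_SIZE are
invented names for illustration. It copies from the top of the stack
downwards, so the first element of the output array is the hottest object,
and it calls the helper with a constant n, which is the case where
__builtin_constant_p(n) in the patch selects the unrollable path.

/* Toy illustration of the copy pattern under discussion; NOT the DPDK code.
 * The cache is a stack: objs[len - 1] holds the hottest object. */
#include <stdio.h>

#define TOY_CACHE_SIZE 8u

struct toy_cache {
	unsigned int len;		/* number of cached objects */
	void *objs[TOY_CACHE_SIZE];	/* objs[len - 1] is the top of the stack */
};

/* Copy n objects from the top of the stack into obj_table, hottest first.
 * The source index decrements while the destination index increments.
 * When n is a compile-time constant, the compiler can fully unroll the loop. */
static inline void
toy_cache_get(struct toy_cache *c, void **obj_table, unsigned int n)
{
	void **cache_objs = &c->objs[c->len];
	unsigned int index;

	c->len -= n;
	for (index = 0; index < n; index++)
		*obj_table++ = *--cache_objs;
}

int main(void)
{
	static int storage[TOY_CACHE_SIZE];
	struct toy_cache c = { .len = TOY_CACHE_SIZE };
	void *out[4];
	unsigned int i;

	for (i = 0; i < TOY_CACHE_SIZE; i++)
		c.objs[i] = &storage[i];

	toy_cache_get(&c, out, 4);	/* constant n: candidate for full unrolling */

	for (i = 0; i < 4; i++)		/* prints indexes 7, 6, 5, 4: hottest first */
		printf("out[%u] -> storage[%d]\n", i, (int)((int *)out[i] - storage));

	return 0;
}

With a literal constant n at the call site, __builtin_constant_p(n) evaluates
to 1 at compile time, so an optimizing compiler can fully unroll the
four-iteration loop; with a runtime n it evaluates to 0 and the generic path
in the patch is taken instead.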