Date: Tue, 18 Apr 2023 08:15:09 -0700
From: Tyler Retzlaff
To: Bruce Richardson
Cc: Morten Brørup, olivier.matz@6wind.com, andrew.rybchenko@oktetlabs.ru, dev@dpdk.org
Subject: Re: [PATCH] mempool: optimize get objects with constant n
Message-ID: <20230418151509.GB32568@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>
References: <20230411064845.37713-1-mb@smartsharesystems.com>
List-Id: DPDK patches and discussions

On Tue, Apr 18, 2023 at 12:06:42PM +0100, Bruce Richardson wrote:
> On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> > When getting objects from the mempool, the number of objects to get is
> > often constant at build time.
> >
> > This patch adds another code path for this case, so the compiler can
> > optimize more, e.g. unroll the copy loop when the entire request is
> > satisfied from the cache.
> >
> > On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0,
> > mempool_perf_test with constant n shows an increase in rate_persec by an
> > average of 17 %, minimum 9.5 %, maximum 24 %.
> >
> > The code path where the number of objects to get is unknown at build time
> > remains essentially unchanged.
> >
> > Signed-off-by: Morten Brørup
>
> The change looks like a good idea. Some suggestions inline below, which you
> may want to take on board for any future version. I'd strongly suggest
> adding some extra clarifying code comments, as I suggest below.
>
> With those extra code comments:
>
> Acked-by: Bruce Richardson
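For anyone reading along, a minimal sketch of the kind of call site this
fast path targets, where n is known at build time (the helper function and
BURST_SIZE are made up for illustration; only the mempool API calls are
real):

#include <rte_mempool.h>

#define BURST_SIZE 32	/* hypothetical burst size, known at build time */

static int
refill_burst(struct rte_mempool *mp, void *objs[BURST_SIZE])
{
	/*
	 * n == BURST_SIZE is a compile-time constant here, so once
	 * rte_mempool_get_bulk() is inlined, __builtin_constant_p(n)
	 * evaluates to 1 inside rte_mempool_do_generic_get() and the
	 * fixed-length copy loop can be unrolled.
	 */
	return rte_mempool_get_bulk(mp, objs, BURST_SIZE);
}

A count only known at run time, e.g. read from configuration, would still
go through the existing variable-length path.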
> > ---
> >  lib/mempool/rte_mempool.h | 24 +++++++++++++++++++++---
> >  1 file changed, 21 insertions(+), 3 deletions(-)
> >
> > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > index 9f530db24b..ade0100ec7 100644
> > --- a/lib/mempool/rte_mempool.h
> > +++ b/lib/mempool/rte_mempool.h
> > @@ -1500,15 +1500,33 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
> >  	if (unlikely(cache == NULL))
> >  		goto driver_dequeue;
> >
> > -	/* Use the cache as much as we have to return hot objects first */
> > -	len = RTE_MIN(remaining, cache->len);
> >  	cache_objs = &cache->objs[cache->len];
> > +
> > +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {

Don't take a direct dependency on compiler builtins here; please define a
macro so we don't have to do shotgun surgery on this later (a rough sketch
of what I mean is at the end of this mail).

Also, what is the purpose of using __extension__ here? Are you annotating
the use of __builtin_constant_p(), or is there more to it? If that is the
only reason, I see no need for __extension__ when a compiler-specific
builtin is already being used; that it is not standard is implied, and
enforced by a compile break.

> > +		/*
> > +		 * The request size is known at build time, and
> > +		 * the entire request can be satisfied from the cache,
> > +		 * so let the compiler unroll the fixed length copy loop.
> > +		 */
> > +		cache->len -= n;
> > +		for (index = 0; index < n; index++)
> > +			*obj_table++ = *--cache_objs;
> > +
>
> This loop looks a little awkward to me. Would it be clearer (and perhaps
> easier for compilers to unroll efficiently) if it was rewritten as:
>
> 	cache->len -= n;
> 	cache_objs = &cache->objs[cache->len];
> 	for (index = 0; index < n; index++)
> 		obj_table[index] = cache_objs[index];
>
> Alternatively, those last two lines can be replaced by a memcpy, which the
> compiler should nicely optimize itself for a constant-size copy:
>
> 	memcpy(obj_table, cache_objs, sizeof(obj_table[0]) * n);
>
> > +		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> > +		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
> > +
> > +		return 0;
> > +	}
> > +
> > +	/* Use the cache as much as we have to return hot objects first */
> > +	len = __extension__(__builtin_constant_p(n)) ? cache->len :
> > +			RTE_MIN(remaining, cache->len);
>
> This line confused me a lot, until I applied the patch and stared at it a
> lot :-). Why would the length value depend upon whether or not n is a
> compile-time constant?
> I think it would really help here to add a comment stating that when n is
> a compile-time constant, at this point it must be "> cache->len", since
> the previous block was not taken, and therefore we don't need to call
> RTE_MIN.
>
> >  	cache->len -= len;
> >  	remaining -= len;
> >  	for (index = 0; index < len; index++)
> >  		*obj_table++ = *--cache_objs;
> >
> > -	if (remaining == 0) {
> > +	if (!__extension__(__builtin_constant_p(n)) && remaining == 0) {
> >  		/* The entire request is satisfied from the cache. */
> >
> >  		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
>
> Once I'd worked out the logic for the above conditional check, the
> conditional adjustment here was clear. I just wonder if a further comment
> might help.
>
> I am also wondering if having larger blocks for the constant and
> non-constant cases may help. It would lead to some duplication, but may
> clarify the code.
>
> > --
> > 2.17.1
> >
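To make the macro suggestion above a bit more concrete, a rough sketch of
the kind of wrapper I have in mind (the name is made up for this sketch and
is not an existing DPDK macro):

/*
 * Hypothetical wrapper so that the mempool code does not reference the
 * compiler builtin directly; only this one definition needs to change for
 * a compiler that lacks the builtin.
 */
#if defined(__GNUC__) || defined(__clang__)
#define RTE_MEMPOOL_CONSTANT_P(e) __builtin_constant_p(e)
#else
/* No builtin: never treat n as constant, so the generic path is taken. */
#define RTE_MEMPOOL_CONSTANT_P(e) 0
#endif

The patch would then test "if (RTE_MEMPOOL_CONSTANT_P(n) && n <= cache->len)"
and so on, with no __extension__ needed at the call sites, and the fallback
definition keeps the existing behaviour on compilers without the builtin.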