Subject: RE: [PATCH 0/1] mempool: implement index-based per core cache
Date: Sat, 8 Jan 2022 10:37:10 +0100
From: Morten Brørup
To: Bruce Richardson <bruce.richardson@intel.com>,
    Dharmik Thakkar <dharmik.thakkar@arm.com>, dev@dpdk.org
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D86DEF@smartserver.smartshare.dk>
References: <20210930172735.2675627-1-dharmik.thakkar@arm.com>
            <20211224225923.806498-1-dharmik.thakkar@arm.com>
            <98CBD80474FA8B44BF855DF32C47DC35D86DAD@smartserver.smartshare.dk>
            <98CBD80474FA8B44BF855DF32C47DC35D86DEA@smartserver.smartshare.dk>

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Friday, 7 January 2022 14.51
>
> On Fri, Jan 07, 2022 at 12:29:23PM +0100, Morten Brørup wrote:
> > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > Sent: Friday, 7 January 2022 12.16
> > >
> > > On Sat, Dec 25, 2021 at 01:16:03AM +0100, Morten Brørup wrote:
> > > > > From: Dharmik Thakkar [mailto:dharmik.thakkar@arm.com]
> > > > > Sent: Friday, 24 December 2021 23.59
> > > > >
> > > > > Current mempool per core cache implementation stores pointers
> > > > > to mbufs. On 64b architectures, each pointer consumes 8B. This
> > > > > patch replaces it with an index-based implementation, wherein
> > > > > each buffer is addressed by (pool base address + index). It
> > > > > reduces the amount of memory/cache required for the per core
> > > > > cache.
> > > > >
> > > > > L3Fwd performance testing reveals minor improvements in the
> > > > > cache performance (L1 and L2 misses reduced by 0.60%) with no
> > > > > change in throughput.
> > > > >
> > > > > Micro-benchmarking the patch using mempool_perf_test shows
> > > > > significant improvement with the majority of the test cases.
> > > > >
> > > >
> > > > I still think this is very interesting. And your performance
> > > > numbers are looking good.
> > > >
> > > > However, it limits the size of a mempool to 4 GB. As previously
> > > > discussed, the max mempool size can be increased by multiplying
> > > > the index with a constant.
> > > >
> > > > I would suggest using sizeof(uintptr_t) as the constant
> > > > multiplier, so the mempool can hold objects of any size divisible
> > > > by sizeof(uintptr_t). And it would be silly to use a mempool to
> > > > hold objects smaller than sizeof(uintptr_t).
> > > >
> > > > How does the performance look if you multiply the index by
> > > > sizeof(uintptr_t)?
> > > >
> > >
> > > Each mempool entry is cache aligned, so we can use that if we want
> > > a bigger multiplier.
> >
> > Thanks for chiming in, Bruce.
> >
> > Please also read this discussion about the multiplier:
> > http://inbox.dpdk.org/dev/CALBAE1PrQYyOG96f6ECeW1vPF3TOh1h7MZZULiY95z9xjbRuyA@mail.gmail.com/
> >
>
> I actually wondered after I had sent the email whether we had indeed an
> option to disable the cache alignment or not! Thanks for pointing out
> that we do. This brings a couple of additional thoughts:
>
> * Using indexes for the cache should probably be a runtime flag rather
>   than a build-time one.
> * It would seem reasonable to me to disallow use of the indexed-cache
>   flag and the non-cache-aligned flag simultaneously.
> * On the off chance that that restriction is unacceptable, then we can
>   make things a little more complicated by doing a runtime computation
>   of the "index-shiftwidth" to use.
>
> Overall, I think defaulting to cacheline shiftwidth and disallowing
> index-based addressing when using unaligned buffers is simplest and
> easiest, unless we can come up with a valid usecase for needing more
> than that.
>
> /Bruce

This feature is a performance optimization.

With that in mind, it should not introduce function pointers or similar
run-time checks in the fast path to determine what kind of cache to use
per mempool. And if an index multiplier is implemented, it should be a
compile time constant, probably something between sizeof(uintptr_t) and
RTE_MEMPOOL_ALIGN (=RTE_CACHE_LINE_SIZE).

The patch comes with a tradeoff between better performance and limited
mempool size, and possibly some limitations regarding very small objects
that are not cache line aligned in order to avoid wasting memory
(RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ).

With no multiplier, the only tradeoff is that the mempool size is
limited to 4 GB.

If the multiplier is small (e.g. 8 bytes), the only tradeoff is that the
mempool size is limited to 32 GB. (And a waste of memory for objects
smaller than 8 bytes; but I don't think anyone would use a mempool to
hold objects smaller than 8 bytes.)

If the multiplier is larger (e.g. the 64 byte cache line size), the
mempool size is instead limited to 256 GB, but
RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ has no effect.

Note: 32 bit platforms get no benefit from this patch: the pointer
already only uses 4 bytes, so replacing the pointer with a 4 byte index
makes no difference.

Since this feature is a performance optimization only, and doesn't
provide any new features, I don't mind it being a compile time option.

If this feature is a compile time option, and the mempool library is
compiled with the large multiplier, then RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ
could be made undefined in the public header file, so compilation of
applications using the flag will fail. And rte_mempool_create() could
RTE_ASSERT() that RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ is not set in its
flags parameter, or emit a warning about the flag being ignored.
Obviously, rte_mempool_create() should also RTE_ASSERT() that the
mempool is not larger than the library supports, possibly emitting a
message that the mempool library should be built without this feature
to support the larger mempool.
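To make the arithmetic concrete, here is a minimal sketch of the
index <-> pointer conversion with a compile time shift. The names
(MEMPOOL_OBJ_INDEX_SHIFT, obj_to_index(), index_to_obj(),
pool_base_addr) are made up for illustration only; they are not from
the patch or from the mempool API:

    /*
     * Illustration only - hypothetical names, not from the patch.
     * Compile time shift: 0 -> no multiplier (4 GB max),
     * 3 -> 8 B multiplier (sizeof(uintptr_t)),
     * 6 -> 64 B multiplier (RTE_CACHE_LINE_SIZE).
     */
    #include <stdint.h>

    #define MEMPOOL_OBJ_INDEX_SHIFT 3

    /* Store a 4 byte index in the per core cache instead of an 8 byte pointer. */
    static inline uint32_t
    obj_to_index(uintptr_t pool_base_addr, void *obj)
    {
            return (uint32_t)(((uintptr_t)obj - pool_base_addr) >>
                            MEMPOOL_OBJ_INDEX_SHIFT);
    }

    /* Convert the cached index back to an object pointer on dequeue. */
    static inline void *
    index_to_obj(uintptr_t pool_base_addr, uint32_t index)
    {
            return (void *)(pool_base_addr +
                            ((uintptr_t)index << MEMPOOL_OBJ_INDEX_SHIFT));
    }

With a 32 bit index, the addressable pool size is 2^32 bytes shifted
left by that constant: 4 GB with shift 0, 32 GB with shift 3, and
256 GB with shift 6, matching the limits above.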
Here is another thought: If only exotic applications use mempools larger
than 32 GB, this would be a generally acceptable limit, and DPDK should
use the index-based cache as the default, making the opposite (i.e. the
pointer-based cache) a compile time option instead. A similar decision
was recently made for limiting the RTE_MAX_LCORE default.

Although DPDK is moving away from compile time options in order to
better support Linux distros, there should be a general exception for
performance and memory optimizations. Otherwise, network appliance
vendors will inherit the increasing amount of DPDK bloat, and we
(network appliance vendors) will eventually be forced to fork DPDK to
get rid of the bloat and achieve the goals originally intended by DPDK.
If anyone disagrees with the principle of a general exception for
performance and memory optimizations, I would like to pass the decision
on to the Techboard!