Date: Fri, 4 Oct 2024 08:50:11 -0700
From: Stephen Hemminger
To: Morten Brørup
Cc: "Bruce Richardson", "Andrew Rybchenko", dev@dpdk.org
Subject: Re: [RFC] mempool: CPU cache aligning mempool driver accesses
Message-ID: <20241004085011.332fc105@hermes.local>
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35E9F003@smartserver.smartshare.dk>
List-Id: DPDK patches and discussions

On Thu, 9 Nov 2023 11:45:46 +0100
Morten Brørup wrote:

> +TO: Andrew, mempool maintainer
>
> > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > Sent: Monday, 6 November 2023 11.29
> >
> > > From: Bruce Richardson
> > > [mailto:bruce.richardson@intel.com]
> > > Sent: Monday, 6 November 2023 10.45
> > >
> > > On Sat, Nov 04, 2023 at 06:29:40PM +0100, Morten Brørup wrote:
> > > > I tried a little experiment, which gave a 25% improvement in mempool
> > > > perf tests for long bursts (n_get_bulk=32 n_put_bulk=32 n_keep=512
> > > > constant_n=0) on a Xeon E5-2620 v4 based system.
> > > >
> > > > This is the concept:
> > > >
> > > > If all accesses to the mempool driver go through the mempool cache,
> > > > we can ensure that these bulk loads/stores are always CPU cache
> > > > aligned, by using cache->size when loading/storing to the mempool
> > > > driver.
> > > >
> > > > Furthermore, it is rumored that most applications use the default
> > > > mempool cache size, so if the driver tests for that specific value,
> > > > it can use rte_memcpy(src,dst,N) with N known at build time, allowing
> > > > optimal performance for copying the array of objects.
> > > >
> > > > Unfortunately, I need to change the flush threshold from 1.5 to 2 to
> > > > be able to always use cache->size when loading/storing to the mempool
> > > > driver.
> > > >
> > > > What do you think?
>
> It's the concept of accessing the underlying mempool in entire cache lines that I am seeking feedback on.
>
> The provided code is just an example, mainly for testing the performance of the concept.
>
> > > >
> > > > PS: If we can't get rid of the mempool cache size threshold factor,
> > > > we really need to expose it through public APIs. A job for another
> > > > day.
>
> The concept that a mempool per-lcore cache can hold more objects than its size is extremely weird, and certainly unexpected by any normal developer. It is therefore likely to cause runtime errors for applications designing tightly sized mempools.
>
> So, if we move forward with this RFC, I propose eliminating the threshold factor, so the mempool per-lcore caches cannot hold more objects than their size.
> When doing this, we might also choose to double RTE_MEMPOOL_CACHE_MAX_SIZE, to prevent any performance degradation.
>
> > > >
> > > > Signed-off-by: Morten Brørup
> > > > ---
> > > Interesting, thanks.
> > >
> > > Out of interest, is there any difference in performance you observe if
> > > using regular libc memcpy vs rte_memcpy for the ring copies? Since the
> > > copy amount is constant, a regular memcpy call should be expanded by
> > > the compiler itself, and so should be pretty efficient.
> >
> > I ran some tests without patching rte_ring_elem_pvt.h, i.e. without
> > introducing the constant-size copy loop. I got the majority of the
> > performance gain at this point.
> >
> > At this point, both pointers are CPU cache aligned when refilling the
> > mempool cache, and the destination pointer is CPU cache aligned when
> > draining the mempool cache.
> >
> > In other words: When refilling the mempool cache, it is both loading
> > and storing entire CPU cache lines. And when draining, it is storing
> > entire CPU cache lines.
> >
> > Adding the fixed-size copy loop provided an additional performance
> > gain. I didn't test other constant-size copy methods than rte_memcpy.
> >
> > rte_memcpy should have optimal conditions in this patch, because N is
> > known to be 512 * 8 = 4 KiB at build time. Furthermore, both pointers
> > are CPU cache aligned when refilling the mempool cache, and the
> > destination pointer is CPU cache aligned when draining the mempool
> > cache. I don't recall if pointer alignment matters for rte_memcpy,
> > though.
> >
> > The memcpy in libc (or more correctly: the compiler intrinsic) will
> > do non-temporal copying for large sizes, and I don't know what that
> > threshold is, so I think rte_memcpy is the safe bet here.
> > Especially if someone builds DPDK with a larger mempool cache size
> > than 512 objects.
> >
> > On the other hand, non-temporal access to the objects in the ring might
> > be beneficial if the ring is so large that they go cold before the
> > application loads them from the ring again.
>

The patch makes sense. Would prefer use of memcpy(); rte_memcpy usage should decline, and it makes no sense here, since compiler inlining will be the same.

Also, the patch no longer applies to main, so it needs a rebase.