DPDK patches and discussions
From: Dharmik Thakkar <Dharmik.Thakkar@arm.com>
To: Jerin Jacob <jerinjacobk@gmail.com>
Cc: "Morten Brørup" <mb@smartsharesystems.com>,
	"Bruce Richardson" <bruce.richardson@intel.com>,
	"Honnappa Nagarahalli" <Honnappa.Nagarahalli@arm.com>,
	dpdk-dev <dev@dpdk.org>, nd <nd@arm.com>,
	"Ruifeng Wang" <Ruifeng.Wang@arm.com>
Subject: Re: [PATCH 0/1] mempool: implement index-based per core cache
Date: Thu, 13 Jan 2022 05:31:18 +0000
Message-ID: <DD1F9EC6-8316-4EFE-BAD6-4D285F0FB6BF@arm.com>
In-Reply-To: <CALBAE1NWcTNiAjxD7hSgGBzsYP0bx+0bsaUxiJ0LpGmCByLRtA@mail.gmail.com>

Hi,

Thank you for your valuable review comments and suggestions!

I will be sending out a v2 in which I have increased the maximum mempool size to 32 GB by dividing the offset by sizeof(uintptr_t).
However, with this change I am seeing a ~5% performance degradation with mempool_perf_autotest (for a bulk size of 32)
compared to the base performance.
Earlier, without this change, I was seeing a ~13% improvement over the base performance, so this is a significant degradation.
I would appreciate your review comments on v2.
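For clarity, the conversion boils down to something like the sketch
below. This is illustrative only: the helper names are simplified and
the v2 patch code is authoritative.

    /* Illustrative sketch, not the actual v2 code. A 4-byte index
     * scaled by sizeof(uintptr_t) can address 2^32 * 8 B = 32 GB
     * of pool memory. */
    #include <stdint.h>

    static inline uint32_t
    obj_to_index(const void *pool_base, const void *obj)
    {
        /* Offset from the pool base, divided by the multiplier. */
        return (uint32_t)(((uintptr_t)obj - (uintptr_t)pool_base) /
                sizeof(uintptr_t));
    }

    static inline void *
    index_to_obj(void *pool_base, uint32_t idx)
    {
        /* Scale the index back up into a pointer. */
        return (void *)((uintptr_t)pool_base +
                (uintptr_t)idx * sizeof(uintptr_t));
    }

The extra scaling on the fast path is the most likely source of the
degradation I am measuring.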

Thank you!

> On Jan 10, 2022, at 12:38 AM, Jerin Jacob <jerinjacobk@gmail.com> wrote:
> 
> On Sat, Jan 8, 2022 at 3:07 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>> 
>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>> Sent: Friday, 7 January 2022 14.51
>>> 
>>> On Fri, Jan 07, 2022 at 12:29:23PM +0100, Morten Brørup wrote:
>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>>> Sent: Friday, 7 January 2022 12.16
>>>>> 
>>>>> On Sat, Dec 25, 2021 at 01:16:03AM +0100, Morten Brørup wrote:
>>>>>>> From: Dharmik Thakkar [mailto:dharmik.thakkar@arm.com]
>>>>>>> Sent: Friday, 24 December 2021 23.59
>>>>>>>
>>>>>>> The current mempool per-core cache implementation stores pointers
>>>>>>> to mbufs. On 64-bit architectures, each pointer consumes 8 B. This
>>>>>>> patch replaces it with an index-based implementation, wherein each
>>>>>>> buffer is addressed by (pool base address + index). This reduces
>>>>>>> the amount of memory/cache required for the per-core cache.
>>>>>>>
>>>>>>> L3Fwd performance testing reveals minor improvements in cache
>>>>>>> performance (L1 and L2 misses reduced by 0.60%) with no change in
>>>>>>> throughput.
>>>>>>>
>>>>>>> Micro-benchmarking the patch using mempool_perf_test shows
>>>>>>> significant improvement with the majority of the test cases.
>>>>>>>
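[For readers skimming the thread: the memory saving described above
comes from shrinking each per-core cache slot. A hedged sketch with
illustrative names, not the actual rte_mempool definitions:

    #include <stdint.h>
    #define CACHE_SIZE 512    /* placeholder for the configured size */

    struct cache_with_pointers {    /* current: 8 B per slot on 64-bit */
        uint32_t len;
        void *objs[CACHE_SIZE];
    };

    struct cache_with_indices {     /* proposed: 4 B per slot */
        uint32_t len;
        uint32_t objs[CACHE_SIZE];
    };

Halving the slot size halves the cache-line footprint of a full
per-core cache, which is presumably where the L1/L2 miss reduction
comes from.]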
>>>>>> 
>>>>>> I still think this is very interesting. And your performance
>>>>>> numbers are looking good.
>>>>>>
>>>>>> However, it limits the size of a mempool to 4 GB. As previously
>>>>>> discussed, the max mempool size can be increased by multiplying
>>>>>> the index by a constant.
>>>>>>
>>>>>> I would suggest using sizeof(uintptr_t) as the constant multiplier,
>>>>>> so the mempool can hold objects of any size divisible by
>>>>>> sizeof(uintptr_t). And it would be silly to use a mempool to hold
>>>>>> objects smaller than sizeof(uintptr_t).
>>>>>>
>>>>>> How does the performance look if you multiply the index by
>>>>>> sizeof(uintptr_t)?
>>>>>>
>>>>>
>>>>> Each mempool entry is cache aligned, so we can use that if we want
>>>>> a bigger multiplier.
>>>> 
>>>> Thanks for chiming in, Bruce.
>>>> 
>>>> Please also read this discussion about the multiplier:
>>>> http://inbox.dpdk.org/dev/CALBAE1PrQYyOG96f6ECeW1vPF3TOh1h7MZZULiY95z9xjbRuyA@mail.gmail.com/
>>>> 
>>> 
>>> I actually wondered after I had sent the email whether we did indeed
>>> have an option to disable the cache alignment or not! Thanks for
>>> pointing out that we do. This brings a couple of additional thoughts:
>>>
>>> * Using indexes for the cache should probably be a runtime flag
>>>   rather than a build-time one.
>>> * It would seem reasonable to me to disallow use of the indexed-cache
>>>   flag and the non-cache-aligned flag simultaneously.
>>> * On the off chance that that restriction is unacceptable, we can
>>>   make things a little more complicated by doing a runtime computation
>>>   of the "index-shiftwidth" to use (see the sketch after this message).
>>>
>>> Overall, I think defaulting to the cache line shiftwidth and
>>> disallowing index-based addressing when using unaligned buffers is
>>> simplest and easiest, unless we can come up with a valid use case for
>>> needing more than that.
>>>
>>> /Bruce
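[To make Bruce's third bullet concrete: a hedged sketch of a runtime
shiftwidth computation, assuming object sizes are multiples of
sizeof(uintptr_t); the helper name is invented, not a DPDK API:

    #include <stddef.h>

    /* Pick the index shiftwidth at pool-creation time: the largest
     * power of two dividing the object size, capped at the 64 B
     * cache line (shift 6). */
    static inline unsigned int
    index_shiftwidth(size_t obj_size)
    {
        unsigned int shift = 0;

        /* Count trailing zero bits of obj_size, capped at 6. */
        while (shift < 6 && ((obj_size >> shift) & 1) == 0)
            shift++;
        return shift;
    }

The cost is a per-pool shift amount on the fast path, which is one
reason Morten argues below for a compile-time constant instead.]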
>> 
>> This feature is a performance optimization.
>> 
>> With that in mind, it should not introduce function pointers or similar run-time checks in the fast path to determine what kind of cache to use per mempool. And if an index multiplier is implemented, it should be a compile-time constant, probably something between sizeof(uintptr_t) and RTE_MEMPOOL_ALIGN (= RTE_CACHE_LINE_SIZE).
>> 
>> The patch comes with a tradeoff between better performance and limited mempool size, and possibly some limitations regarding very small objects that are not cache-line aligned in order to avoid wasting memory (RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ).
>> 
>> With no multiplier, the only tradeoff is that the mempool size is limited to 4 GB.
>> 
>> If the multiplier is small (e.g. 8 bytes), the only tradeoff is that the mempool size is limited to 32 GB. (And a waste of memory for objects smaller than 8 bytes; but I don't think anyone would use a mempool to hold objects smaller than 8 bytes.)
>> 
>> If the multiplier is larger (e.g. the 64-byte cache line size), the mempool size is instead limited to 256 GB, but RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ has no effect.
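[The size limits above, spelled out for a 32-bit index:

    multiplier                     max mempool size
    1 B  (raw byte offset)         2^32 *  1 B =   4 GB
    8 B  (sizeof(uintptr_t))       2^32 *  8 B =  32 GB
    64 B (RTE_CACHE_LINE_SIZE)     2^32 * 64 B = 256 GB]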
>> 
>> Note: 32-bit platforms get no benefit from this patch: the pointer already uses only 4 bytes, so replacing the pointer with a 4-byte index makes no difference.
>> 
>> 
>> Since this feature is a performance optimization only, and doesn't provide any new features, I don't mind it being a compile-time option.
>> 
>> If this feature is a compile-time option, and the mempool library is compiled with the large multiplier, then RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ could be made undefined in the public header file, so that compilation of applications using the flag will fail. rte_mempool_create() could also RTE_ASSERT() that RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ is not set in its flags parameter, or emit a warning about the flag being ignored. Obviously, rte_mempool_create() should also RTE_ASSERT() that the mempool is not larger than the library supports, possibly emitting a message that the mempool library should be built without this feature to support larger mempools.
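[A hedged sketch of the guards described above, assuming a build with
the cache-line multiplier. RTE_MEMPOOL_INDEX_CACHE is an invented
option name, and n/total_elt_sz stand in for the object count and
per-object footprint; only RTE_ASSERT, RTE_CACHE_LINE_SIZE and
RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ are real DPDK identifiers:

    #ifdef RTE_MEMPOOL_INDEX_CACHE
        /* The flag is meaningless with a cache-line multiplier. */
        RTE_ASSERT(!(flags & RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ));
        /* 2^32 indexes * 64 B cache lines caps the pool at 256 GB. */
        RTE_ASSERT((uint64_t)n * total_elt_sz <=
                ((uint64_t)1 << 32) * RTE_CACHE_LINE_SIZE);
    #endif

Either assert could instead be a warning, as suggested above.]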
>> 
>> Here is another thought: If only exotic applications use mempools larger than 32 GB, this would be a generally acceptable limit, and DPDK should use the index-based cache by default, making the opposite (i.e. the pointer-based cache) a compile-time option instead. A similar decision was recently made for limiting the RTE_MAX_LCORE default.
>> 
>> 
>> Although DPDK is moving away from compile-time options in order to better support Linux distros, there should be a general exception for performance and memory optimizations. Otherwise, network appliance vendors will inherit the increasing amount of DPDK bloat, and we (network appliance vendors) will eventually be forced to fork DPDK to get rid of the bloat and achieve the goals originally intended by DPDK.
> 
> Agree with Morten's view on this.
> 
>> If anyone disagrees with the principle of a general exception for performance and memory optimizations, I would like to pass the decision on to the Techboard!
>> 



Thread overview: 52+ messages
2021-09-30 17:27 [dpdk-dev] [RFC] " Dharmik Thakkar
2021-10-01 12:36 ` Jerin Jacob
2021-10-01 15:44   ` Honnappa Nagarahalli
2021-10-01 17:32     ` Jerin Jacob
2021-10-01 17:57       ` Honnappa Nagarahalli
2021-10-01 18:21       ` Jerin Jacob
2021-10-01 21:30 ` Ananyev, Konstantin
2021-10-02  0:07   ` Honnappa Nagarahalli
2021-10-02 18:51     ` Ananyev, Konstantin
2021-10-04 16:36       ` Honnappa Nagarahalli
2021-10-30 10:23         ` Morten Brørup
2021-10-31  8:14         ` Morten Brørup
2021-11-03 15:12           ` Dharmik Thakkar
2021-11-03 15:52             ` Morten Brørup
2021-11-04  4:42               ` Dharmik Thakkar
2021-11-04  8:04                 ` Morten Brørup
2021-11-08  4:32                   ` Honnappa Nagarahalli
2021-11-08  7:22                     ` Morten Brørup
2021-11-08 15:29                       ` Honnappa Nagarahalli
2021-11-08 15:39                         ` Morten Brørup
2021-11-08 15:46                           ` Honnappa Nagarahalli
2021-11-08 16:03                             ` Morten Brørup
2021-11-08 16:47                               ` Jerin Jacob
2021-12-24 22:59 ` [PATCH 0/1] " Dharmik Thakkar
2021-12-24 22:59   ` [PATCH 1/1] " Dharmik Thakkar
2022-01-11  2:26     ` Ananyev, Konstantin
2022-01-13  5:17       ` Dharmik Thakkar
2022-01-13 10:37         ` Ananyev, Konstantin
2022-01-19 15:32           ` Dharmik Thakkar
2022-01-21 11:25             ` Ananyev, Konstantin
2022-01-21 11:31               ` Ananyev, Konstantin
2022-03-24 19:51               ` Dharmik Thakkar
2021-12-25  0:16   ` [PATCH 0/1] " Morten Brørup
2022-01-07 11:15     ` Bruce Richardson
2022-01-07 11:29       ` Morten Brørup
2022-01-07 13:50         ` Bruce Richardson
2022-01-08  9:37           ` Morten Brørup
2022-01-10  6:38             ` Jerin Jacob
2022-01-13  5:31               ` Dharmik Thakkar [this message]
2023-07-06 17:43                 ` Stephen Hemminger
2023-07-31 12:23                   ` Thomas Monjalon
2023-07-31 12:33                     ` Morten Brørup
2023-07-31 14:57                       ` Dharmik Jayesh Thakkar
2022-01-13  5:36   ` [PATCH v2 " Dharmik Thakkar
2022-01-13  5:36     ` [PATCH v2 1/1] " Dharmik Thakkar
2022-01-13 10:18       ` Jerin Jacob
2022-01-20  8:21       ` Morten Brørup
2022-01-21  6:01         ` Honnappa Nagarahalli
2022-01-21  7:36           ` Morten Brørup
2022-01-24 13:05             ` Ray Kinsella
2022-01-21  9:12           ` Bruce Richardson
2022-01-23  7:13       ` Wang, Haiyue
