* Bug in rte_mempool_do_generic_get?
@ 2023-02-24  3:02 UTC
From: Harris, James R
To: dev, mb

Hi,

I've tracked down a regression in SPDK to DPDK commit a2833ecc5 ("mempool: fix get objects from mempool with cache").

Here's an example that demonstrates the problem:

Allocate a mempool with 2048 buffers and a cache size of 256.

Core 0 allocates 512 buffers. The mempool pulls 512 + 256 buffers from the backing ring, returns 512 of them to the caller, and puts the other 256 in core 0's cache. The backing ring now has 1280 buffers.

Core 1 allocates 512 buffers. The mempool pulls 512 + 256 buffers from the backing ring, returns 512 of them to the caller, and puts the other 256 in core 1's cache. The backing ring now has 512 buffers.

Core 2 allocates 512 buffers. The mempool pulls the remaining 512 buffers from the backing ring and returns all of them to the caller. The backing ring now has 0 buffers.

Core 3 tries to allocate 512 buffers, and it fails.

In the SPDK case, we don't really need or use the mempool cache here, so changing the cache size to 0 fixes the problem and is what we're going to move forward with.

But the behavior did cause a regression, so I thought I'd mention it here.

If you have a mempool with 2048 objects, shouldn't 4 cores each be able to do a 512-buffer bulk get, regardless of the configured cache size?

Regards,

Jim Harris

^ permalink raw reply	[flat|nested] 8+ messages in thread
* RE: Bug in rte_mempool_do_generic_get?
@ 2023-02-24 12:13 UTC
From: Morten Brørup
To: Harris, James R, dev

> From: Harris, James R [mailto:james.r.harris@intel.com]
> Sent: Friday, 24 February 2023 04.03
>
> Hi,
>
> I've tracked down a regression in SPDK to DPDK commit a2833ecc5 ("mempool: fix get objects from mempool with cache").

The problem probably goes all the way back to the introduction of the cache flush threshold, which effectively increased the cache size to 1.5 times the configured cache size, in this commit:

http://git.dpdk.org/dpdk/commit/lib/librte_mempool/rte_mempool.h?id=ea5dd2744b90b330f07fd10f327ab99ef55c7266

It might even go further back.

> Here's an example that demonstrates the problem:
>
> Allocate mempool with 2048 buffers and cache size 256.
> Core 0 allocates 512 buffers. Mempool pulls 512 + 256 buffers from backing ring, returns 512 of them to caller, puts the other 256 in core 0 cache. Backing ring now has 1280 buffers.
> Core 1 allocates 512 buffers. Mempool pulls 512 + 256 buffers from backing ring, returns 512 of them to caller, puts the other 256 in core 1 cache. Backing ring now has 512 buffers.
> Core 2 allocates 512 buffers. Mempool pulls remaining 512 buffers from backing ring and returns all of them to caller. Backing ring now has 0 buffers.
> Core 3 tries to allocate 512 buffers and it fails.
>
> In the SPDK case, we don't really need or use the mempool cache in this case, so changing the cache size to 0 fixes the problem and is what we're going to move forward with.

If you are not making get/put requests smaller than the cache size, then yes, having no cache is the best solution.

> But the behavior did cause a regression so I thought I'd mention it here.

Thank you.
> If you have a mempool with 2048 objects, shouldn't 4 cores each be able to do a 512 buffer bulk get, regardless of the configured cache size?

No, the scenario you described above is the expected behavior. I think it is documented somewhere that objects in the caches are unavailable for other cores, but now I cannot find where this is documented.

Furthermore, since the effective per-core cache size is 1.5 * configured cache size, a configured cache size of 256 may leave up to 384 objects in each per-core cache.

With 4 cores, you can expect up to 3 * 384 = 1152 objects sitting in the caches of other cores. If you want to be able to pull 512 objects with each core, the pool size should be 4 * 512 + 1152 objects.
* RE: Bug in rte_mempool_do_generic_get?
@ 2023-02-24 13:56 UTC
From: Honnappa Nagarahalli
To: Morten Brørup, Harris, James R, dev
Cc: nd

> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Friday, February 24, 2023 6:13 AM
> To: Harris, James R <james.r.harris@intel.com>; dev@dpdk.org
> Subject: RE: Bug in rte_mempool_do_generic_get?
>
> > From: Harris, James R [mailto:james.r.harris@intel.com]
> > Sent: Friday, 24 February 2023 04.03
> >
> > Hi,
> >
> > I've tracked down a regression in SPDK to DPDK commit a2833ecc5 ("mempool: fix get objects from mempool with cache").
>
> The problem probably goes all the way back to the introduction of the cache flush threshold, which effectively increased the cache size to 1.5 times the configured cache size, in this commit:
> http://git.dpdk.org/dpdk/commit/lib/librte_mempool/rte_mempool.h?id=ea5dd2744b90b330f07fd10f327ab99ef55c7266
>
> It might even go further back.
>
> > Here's an example that demonstrates the problem:
> >
> > Allocate mempool with 2048 buffers and cache size 256.
> > Core 0 allocates 512 buffers. Mempool pulls 512 + 256 buffers from backing ring, returns 512 of them to caller, puts the other 256 in core 0 cache. Backing ring now has 1280 buffers.
> > Core 1 allocates 512 buffers. Mempool pulls 512 + 256 buffers from backing ring, returns 512 of them to caller, puts the other 256 in core 1 cache. Backing ring now has 512 buffers.
> > Core 2 allocates 512 buffers. Mempool pulls remaining 512 buffers from backing ring and returns all of them to caller. Backing ring now has 0 buffers.
> > Core 3 tries to allocate 512 buffers and it fails.
> >
> > In the SPDK case, we don't really need or use the mempool cache in this case, so changing the cache size to 0 fixes the problem and is what we're going to move forward with.
>
> If you are not making get/put requests smaller than the cache size, then yes, having no cache is the best solution.
>
> > But the behavior did cause a regression so I thought I'd mention it here.
>
> Thank you.
>
> > If you have a mempool with 2048 objects, shouldn't 4 cores each be able to do a 512 buffer bulk get, regardless of the configured cache size?
>
> No, the scenario you described above is the expected behavior. I think it is documented somewhere that objects in the caches are unavailable for other cores, but now I cannot find where this is documented.
>
> Furthermore, since the effective per-core cache size is 1.5 * configured cache size, a configured cache size of 256 may leave up to 384 objects in each per-core cache.
>
> With 4 cores, you can expect up to 3 * 384 = 1152 objects sitting in the caches of other cores. If you want to be able to pull 512 objects with each core, the pool size should be 4 * 512 + 1152 objects.

Maybe we should document this in the mempool library?
* Re: Bug in rte_mempool_do_generic_get?
@ 2023-02-24 16:42 UTC
From: Harris, James R
To: Honnappa Nagarahalli
Cc: Morten Brørup, dev, nd

> On Feb 24, 2023, at 6:56 AM, Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:
>
>> -----Original Message-----
>> From: Morten Brørup <mb@smartsharesystems.com>
>> Sent: Friday, February 24, 2023 6:13 AM
>> To: Harris, James R <james.r.harris@intel.com>; dev@dpdk.org
>> Subject: RE: Bug in rte_mempool_do_generic_get?
>>
>>> If you have a mempool with 2048 objects, shouldn't 4 cores each be able to do a 512 buffer bulk get, regardless of the configured cache size?
>>
>> No, the scenario you described above is the expected behavior. I think it is documented somewhere that objects in the caches are unavailable for other cores, but now I cannot find where this is documented.

Thanks Morten.

Yeah, I think it is documented somewhere, but I also couldn't find it. I was aware of cores not being able to allocate from another core's cache. My surprise was that in a pristine new mempool, 4 cores could not each do one initial 512-buffer bulk get. But I also see that even before the a2833ecc5 patch, the cache would get populated on gets smaller than the cache size, in addition to the buffers requested by the user. So if the cache size is 256, and a bulk get is for 128 buffers, it pulls 384 buffers from the backing pool - 128 for the caller, another 256 to prefill the cache. Your patch makes this cache filling consistent between the less-than-cache-size and greater-than-or-equal-to-cache-size cases.

>> Furthermore, since the effective per-core cache size is 1.5 * configured cache size, a configured cache size of 256 may leave up to 384 objects in each per-core cache.
>>
>> With 4 cores, you can expect up to 3 * 384 = 1152 objects sitting in the caches of other cores. If you want to be able to pull 512 objects with each core, the pool size should be 4 * 512 + 1152 objects.
> Maybe we should document this in the mempool library?
>

Maybe. But the case I described here is a bit wonky - SPDK should never have been specifying a non-zero cache in this case. We only noticed this change in behavior because we were creating the mempool with a cache when we shouldn't have.

-Jim
* RE: Bug in rte_mempool_do_generic_get?
@ 2023-02-26 10:45 UTC
From: Morten Brørup
To: Harris, James R, Honnappa Nagarahalli
Cc: dev, nd, olivier.matz, andrew.rybchenko, bruce.richardson

> From: Harris, James R [mailto:james.r.harris@intel.com]
> Sent: Friday, 24 February 2023 17.43
>
> > On Feb 24, 2023, at 6:56 AM, Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:
> >
> >> From: Morten Brørup <mb@smartsharesystems.com>
> >> Sent: Friday, February 24, 2023 6:13 AM
> >> To: Harris, James R <james.r.harris@intel.com>; dev@dpdk.org
> >> Subject: RE: Bug in rte_mempool_do_generic_get?
> >>
> >>> If you have a mempool with 2048 objects, shouldn't 4 cores each be able to do a 512 buffer bulk get, regardless of the configured cache size?
> >>
> >> No, the scenario you described above is the expected behavior. I think it is documented somewhere that objects in the caches are unavailable for other cores, but now I cannot find where this is documented.
>
> Thanks Morten.
>
> Yeah, I think it is documented somewhere, but I also couldn't find it. I was aware of cores not being able to allocate from another core's cache. My surprise was that in a pristine new mempool, 4 cores could not each do one initial 512 buffer bulk get. But I also see that even before the a2833ecc5 patch, the cache would get populated on gets less than cache size, in addition to the buffers requested by the user. So if cache size is 256, and bulk get is for 128 buffers, it pulls 384 buffers from backing pool - 128 for the caller, another 256 to prefill the cache. Your patch makes this cache filling consistent between less-than-cache-size and greater-than-or-equal-to-cache-size cases.

Yeah...
I have tried hard to clean up some of the mess introduced by the <insert curse words here> patch [1] that increased the effective cache size by a factor of 1.5.

[1] http://git.dpdk.org/dpdk/commit/lib/librte_mempool/rte_mempool.h?id=ea5dd2744b90b330f07fd10f327ab99ef55c7266

E.g., somewhat related to your use case, I have argued against the design goal of trying to keep the mempool caches full at all times, but have not yet been able to convince the community against it.

Overall, this is one of the absolute core DPDK libraries, so the support for the changes I would like to see is near zero - even minor adjustments are considered very high risk, which I must admit has quite a lot of merit.

So instead of fixing the library's implementation to make it behave as reasonably expected according to its documentation, the library's implementation is considered the reference. And as a consequence, the library's internals, such as the cache size multiplier, are getting promoted to be part of the public API, e.g. for applications to implement mempool sizing calculations like the one below.

I recall running into problems once myself, when I had sized a mempool with a specific cache size for a specific number of cores, because 50 % additional objects got stuck in the caches. If you ask me, this bug is so fundamental that it should be fixed at the root, not by trying to document the weird behavior of allocating 50 % more than specified.

> >> Furthermore, since the effective per-core cache size is 1.5 * configured cache size, a configured cache size of 256 may leave up to 384 objects in each per-core cache.
> >>
> >> With 4 cores, you can expect up to 3 * 384 = 1152 objects sitting in the caches of other cores. If you want to be able to pull 512 objects with each core, the pool size should be 4 * 512 + 1152 objects.
> > Maybe we should document this in the mempool library?
>
> Maybe.
> But this case I described here is a bit wonky - SPDK should never have been specifying a non-zero cache in this case. We only noticed this change in behavior because we were creating the mempool with a cache when we shouldn't have.

So, looking at the positive side, this regression revealed a "bug" in SPDK. ;-)

PS: I assume that you are aware that mempools generally perform better with a cache, so I assume that you know what you are doing when you say that you don't need the cache.

PPS: Sorry about venting here. If nothing else, I hope it had some entertainment value. :-)
* Re: Bug in rte_mempool_do_generic_get?
@ 2023-02-27  9:09 UTC
From: Olivier Matz
To: Morten Brørup
Cc: Harris, James R, Honnappa Nagarahalli, dev, nd, andrew.rybchenko, bruce.richardson

Hi,

On Sun, Feb 26, 2023 at 11:45:48AM +0100, Morten Brørup wrote:
> > From: Harris, James R [mailto:james.r.harris@intel.com]
> > Sent: Friday, 24 February 2023 17.43
> >
> > > On Feb 24, 2023, at 6:56 AM, Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:
> > >
> > >> From: Morten Brørup <mb@smartsharesystems.com>
> > >> Sent: Friday, February 24, 2023 6:13 AM
> > >> To: Harris, James R <james.r.harris@intel.com>; dev@dpdk.org
> > >> Subject: RE: Bug in rte_mempool_do_generic_get?
> > >>
> > >>> If you have a mempool with 2048 objects, shouldn't 4 cores each be able to do a 512 buffer bulk get, regardless of the configured cache size?
> > >>
> > >> No, the scenario you described above is the expected behavior. I think it is documented somewhere that objects in the caches are unavailable for other cores, but now I cannot find where this is documented.
> >
> > Thanks Morten.
> >
> > Yeah, I think it is documented somewhere, but I also couldn't find it. I was aware of cores not being able to allocate from another core's cache. My surprise was that in a pristine new mempool, 4 cores could not each do one initial 512 buffer bulk get. But I also see that even before the a2833ecc5 patch, the cache would get populated on gets less than cache size, in addition to the buffers requested by the user. So if cache size is 256, and bulk get is for 128 buffers, it pulls 384 buffers from backing pool - 128 for the caller, another 256 to prefill the cache.
> > Your patch makes this cache filling consistent between less-than-cache-size and greater-than-or-equal-to-cache-size cases.
>
> Yeah... I have tried hard to clean up some of the mess introduced by the <insert curse words here> patch [1] that increased the effective cache size by a factor of 1.5.
>
> [1] http://git.dpdk.org/dpdk/commit/lib/librte_mempool/rte_mempool.h?id=ea5dd2744b90b330f07fd10f327ab99ef55c7266
>
> E.g., somewhat related to your use case, I have argued against the design goal of trying to keep the mempool caches full at all times, but have not yet been able to convince the community against it.
>
> Overall, this is one of the absolute core DPDK libraries, so the support for the changes I would like to see is near zero - even minor adjustments are considered very high risk, which I must admit has quite a lot of merit.
>
> So instead of fixing the library's implementation to make it behave as reasonably expected according to its documentation, the library's implementation is considered the reference. And as a consequence, the library's internals, such as the cache size multiplier, are getting promoted to be part of the public API, e.g. for applications to implement mempool sizing calculations like the one below.
>
> I recall running into problems once myself, when I had sized a mempool with a specific cache size for a specific number of cores, because 50 % additional objects got stuck in the caches. If you ask me, this bug is so fundamental that it should be fixed at the root, not by trying to document the weird behavior of allocating 50 % more than specified.

I think we should document it: even in the case the 1.5 factor is removed, it is helpful to document that a user needs to reserve more objects when using a cache, to ensure that all cores can allocate their objects.

Morten, what would you think about this change below?
--- a/doc/guides/prog_guide/mempool_lib.rst
+++ b/doc/guides/prog_guide/mempool_lib.rst
@@ -89,9 +89,13 @@ In this way, each core has full access to its own cache (with locks) of free obj
 only when the cache fills does the core need to shuffle some of the free objects back to the pools ring or
 obtain more objects when the cache is empty.
 
-While this may mean a number of buffers may sit idle on some core's cache,
+While this may mean a number of buffers may sit idle on some core's cache (up to ``1.5 * cache_length``),
 the speed at which a core can access its own cache for a specific memory pool without locks provides performance gains.
 
+However, to ensure that all cores can get objects when a cache is used, it is required to allocate
+more objects: if N objects are needed, allocate a mempool with ``N + (number_of_active_cores *
+cache_size * 1.5)`` objects.
+
 The cache is composed of a small, per-core table of pointers and its length (used as a stack).
 This internal cache can be enabled or disabled at creation of the pool.

> > >> Furthermore, since the effective per-core cache size is 1.5 * configured cache size, a configured cache size of 256 may leave up to 384 objects in each per-core cache.
> > >>
> > >> With 4 cores, you can expect up to 3 * 384 = 1152 objects sitting in the caches of other cores. If you want to be able to pull 512 objects with each core, the pool size should be 4 * 512 + 1152 objects.
> > > Maybe we should document this in the mempool library?
> >
> > Maybe. But this case I described here is a bit wonky - SPDK should never have been specifying a non-zero cache in this case. We only noticed this change in behavior because we were creating the mempool with a cache when we shouldn't have.
>
> So, looking at the positive side, this regression revealed a "bug" in SPDK.
> ;-)
>
> PS: I assume that you are aware that mempools generally perform better with cache, so I assume that you know what you are doing when you say that you don't need the cache.
>
> PPS: Sorry about venting here. If nothing else, I hope it had some entertainment value. :-)
* RE: Bug in rte_mempool_do_generic_get?
@ 2023-02-27  9:48 UTC
From: Morten Brørup
To: Olivier Matz
Cc: Harris, James R, Honnappa Nagarahalli, dev, nd, andrew.rybchenko, bruce.richardson

> From: Olivier Matz [mailto:olivier.matz@6wind.com]
> Sent: Monday, 27 February 2023 10.10
>
> Hi,
>
> On Sun, Feb 26, 2023 at 11:45:48AM +0100, Morten Brørup wrote:
> > > From: Harris, James R [mailto:james.r.harris@intel.com]
> > > Sent: Friday, 24 February 2023 17.43
> > >
> > > > On Feb 24, 2023, at 6:56 AM, Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:
> > > >
> > > >> From: Morten Brørup <mb@smartsharesystems.com>
> > > >> Sent: Friday, February 24, 2023 6:13 AM
> > > >> To: Harris, James R <james.r.harris@intel.com>; dev@dpdk.org
> > > >> Subject: RE: Bug in rte_mempool_do_generic_get?
> > > >>
> > > >>> If you have a mempool with 2048 objects, shouldn't 4 cores each be able to do a 512 buffer bulk get, regardless of the configured cache size?
> > > >>
> > > >> No, the scenario you described above is the expected behavior. I think it is documented somewhere that objects in the caches are unavailable for other cores, but now I cannot find where this is documented.
> > >
> > > Thanks Morten.
> > >
> > > Yeah, I think it is documented somewhere, but I also couldn't find it. I was aware of cores not being able to allocate from another core's cache. My surprise was that in a pristine new mempool, 4 cores could not each do one initial 512 buffer bulk get. But I also see that even before the a2833ecc5 patch, the cache would get populated on gets less than cache size, in addition to the buffers requested by the user.
> > > So if cache size is 256, and bulk get is for 128 buffers, it pulls 384 buffers from backing pool - 128 for the caller, another 256 to prefill the cache. Your patch makes this cache filling consistent between less-than-cache-size and greater-than-or-equal-to-cache-size cases.
> >
> > Yeah... I have tried hard to clean up some of the mess introduced by the <insert curse words here> patch [1] that increased the effective cache size by a factor of 1.5.
> >
> > [1] http://git.dpdk.org/dpdk/commit/lib/librte_mempool/rte_mempool.h?id=ea5dd2744b90b330f07fd10f327ab99ef55c7266
> >
> > E.g., somewhat related to your use case, I have argued against the design goal of trying to keep the mempool caches full at all times, but have not yet been able to convince the community against it.
> >
> > Overall, this is one of the absolute core DPDK libraries, so the support for the changes I would like to see is near zero - even minor adjustments are considered very high risk, which I must admit has quite a lot of merit.
> >
> > So instead of fixing the library's implementation to make it behave as reasonably expected according to its documentation, the library's implementation is considered the reference. And as a consequence, the library's internals, such as the cache size multiplier, are getting promoted to be part of the public API, e.g. for applications to implement mempool sizing calculations like the one below.
> >
> > I recall running into problems once myself, when I had sized a mempool with a specific cache size for a specific number of cores, because 50 % additional objects got stuck in the caches. If you ask me, this bug is so fundamental that it should be fixed at the root, not by trying to document the weird behavior of allocating 50 % more than specified.
> >
> > I think we should document it: even in the case the 1.5 factor is removed, it is helpful to document that a user needs to reserve more objects when using a cache, to ensure that all cores can allocate their objects.
> >
> > Morten, what would you think about this change below?

A big improvement! Only a few comments, inline below.

> > --- a/doc/guides/prog_guide/mempool_lib.rst
> > +++ b/doc/guides/prog_guide/mempool_lib.rst
> > @@ -89,9 +89,13 @@ In this way, each core has full access to its own cache (with locks) of free obj
> >  only when the cache fills does the core need to shuffle some of the free objects back to the pools ring or
> >  obtain more objects when the cache is empty.
> >
> > -While this may mean a number of buffers may sit idle on some core's cache,
> > +While this may mean a number of buffers may sit idle on some core's cache (up to ``1.5 * cache_length``),

Typo: cache_length -> cache_size

Looking at this, I noticed another detail: Chapter 10.4 uses the term "buffers", but it should use the term "objects", because a mempool may hold other objects than packet buffers. (Using the term "packet buffers" in the example in chapter 10.3 is correct.)

> >  the speed at which a core can access its own cache for a specific memory pool without locks provides performance gains.
> >
> > +However, to ensure that all cores can get objects when a cache is used, it is required to allocate
> > +more objects: if N objects are needed, allocate a mempool with ``N + (number_of_active_cores *
> > +cache_size * 1.5)`` objects.
> > +

Perhaps add here that the factor 1.5 stems from the cache being allowed to exceed its configured size by 50 %.

> >  The cache is composed of a small, per-core table of pointers and its length (used as a stack).
> >  This internal cache can be enabled or disabled at creation of the pool.
If something similar is added to the description of the "cache_size" parameter to the "rte_mempool_create()" function, the most obvious places to look for documentation are covered.

A lengthy description might be redundant here, so perhaps only mention that the actual cache size is 1.5 * cache_size, because the cache is allowed to exceed its configured size by 50 %.

> > > >> Furthermore, since the effective per-core cache size is 1.5 * configured cache size, a configured cache size of 256 may leave up to 384 objects in each per-core cache.
> > > >>
> > > >> With 4 cores, you can expect up to 3 * 384 = 1152 objects sitting in the caches of other cores. If you want to be able to pull 512 objects with each core, the pool size should be 4 * 512 + 1152 objects.
> > > > Maybe we should document this in the mempool library?
> > >
> > > Maybe. But this case I described here is a bit wonky - SPDK should never have been specifying a non-zero cache in this case. We only noticed this change in behavior because we were creating the mempool with a cache when we shouldn't have.
> >
> > So, looking at the positive side, this regression revealed a "bug" in SPDK. ;-)
> >
> > PS: I assume that you are aware that mempools generally perform better with cache, so I assume that you know what you are doing when you say that you don't need the cache.
> >
> > PPS: Sorry about venting here. If nothing else, I hope it had some entertainment value. :-)
* Re: Bug in rte_mempool_do_generic_get?
@ 2023-02-27 10:39 UTC
From: Olivier Matz
To: Morten Brørup
Cc: Harris, James R, Honnappa Nagarahalli, dev, nd, andrew.rybchenko, bruce.richardson

On Mon, Feb 27, 2023 at 10:48:09AM +0100, Morten Brørup wrote:
> > From: Olivier Matz [mailto:olivier.matz@6wind.com]
> > Sent: Monday, 27 February 2023 10.10
> >
> > Hi,
> >
> > On Sun, Feb 26, 2023 at 11:45:48AM +0100, Morten Brørup wrote:
> > > > From: Harris, James R [mailto:james.r.harris@intel.com]
> > > > Sent: Friday, 24 February 2023 17.43
> > > >
> > > > > On Feb 24, 2023, at 6:56 AM, Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:
> > > > >
> > > > >> From: Morten Brørup <mb@smartsharesystems.com>
> > > > >> Sent: Friday, February 24, 2023 6:13 AM
> > > > >> To: Harris, James R <james.r.harris@intel.com>; dev@dpdk.org
> > > > >> Subject: RE: Bug in rte_mempool_do_generic_get?
> > > > >>
> > > > >>> If you have a mempool with 2048 objects, shouldn't 4 cores each be able to do a 512 buffer bulk get, regardless of the configured cache size?
> > > > >>
> > > > >> No, the scenario you described above is the expected behavior. I think it is documented somewhere that objects in the caches are unavailable for other cores, but now I cannot find where this is documented.
> > > >
> > > > Thanks Morten.
> > > >
> > > > Yeah, I think it is documented somewhere, but I also couldn't find it. I was aware of cores not being able to allocate from another core's cache. My surprise was that in a pristine new mempool, 4 cores could not each do one initial 512 buffer bulk get. But I also see that even before the a2833ecc5 patch, the cache would get populated on gets less than cache size, in addition to the buffers requested by the user.
> > > > So if cache size is 256, and bulk get is for 128 buffers, it pulls 384 buffers from backing pool - 128 for the caller, another 256 to prefill the cache. Your patch makes this cache filling consistent between less-than-cache-size and greater-than-or-equal-to-cache-size cases.
> > >
> > > Yeah... I have tried hard to clean up some of the mess introduced by the <insert curse words here> patch [1] that increased the effective cache size by a factor of 1.5.
> > >
> > > [1] http://git.dpdk.org/dpdk/commit/lib/librte_mempool/rte_mempool.h?id=ea5dd2744b90b330f07fd10f327ab99ef55c7266
> > >
> > > E.g., somewhat related to your use case, I have argued against the design goal of trying to keep the mempool caches full at all times, but have not yet been able to convince the community against it.
> > >
> > > Overall, this is one of the absolute core DPDK libraries, so the support for the changes I would like to see is near zero - even minor adjustments are considered very high risk, which I must admit has quite a lot of merit.
> > >
> > > So instead of fixing the library's implementation to make it behave as reasonably expected according to its documentation, the library's implementation is considered the reference. And as a consequence, the library's internals, such as the cache size multiplier, are getting promoted to be part of the public API, e.g. for applications to implement mempool sizing calculations like the one below.
> > >
> > > I recall running into problems once myself, when I had sized a mempool with a specific cache size for a specific number of cores, because 50 % additional objects got stuck in the caches. If you ask me, this bug is so fundamental that it should be fixed at the root, not by trying to document the weird behavior of allocating 50 % more than specified.
> >
> > I think we should document it: even in the case the 1.5 factor is removed, it is helpful to document that a user needs to reserve more objects when using a cache, to ensure that all cores can allocate their objects.
> >
> > Morten, what would you think about this change below?
>
> A big improvement! Only a few comments, inline below.
>
> > --- a/doc/guides/prog_guide/mempool_lib.rst
> > +++ b/doc/guides/prog_guide/mempool_lib.rst
> > @@ -89,9 +89,13 @@ In this way, each core has full access to its own cache (with locks) of free obj
> >  only when the cache fills does the core need to shuffle some of the free objects back to the pools ring or
> >  obtain more objects when the cache is empty.
> >
> > -While this may mean a number of buffers may sit idle on some core's cache,
> > +While this may mean a number of buffers may sit idle on some core's cache (up to ``1.5 * cache_length``),
>
> Typo: cache_length -> cache_size
>
> Looking at this, I noticed another detail: Chapter 10.4 uses the term "buffers", but it should use the term "objects", because a mempool may hold other objects than packet buffers. (Using the term "packet buffers" in the example in chapter 10.3 is correct.)
>
> >  the speed at which a core can access its own cache for a specific memory pool without locks provides performance gains.
> >
> > +However, to ensure that all cores can get objects when a cache is used, it is required to allocate
> > +more objects: if N objects are needed, allocate a mempool with ``N + (number_of_active_cores *
> > +cache_size * 1.5)`` objects.
> > +
>
> Perhaps add here that the factor 1.5 stems from the cache being allowed to exceed its configured size by 50 %.
>
> >  The cache is composed of a small, per-core table of pointers and its length (used as a stack).
> >  This internal cache can be enabled or disabled at creation of the pool.
>
> If something similar is added to the description of the "cache_size" parameter to the "rte_mempool_create()" function, the most obvious places to look for documentation are covered.
>
> A lengthy description might be redundant here, so perhaps only mention that the actual cache size is 1.5 * cache_size, because the cache is allowed to exceed its configured size by 50 %.

Thanks for the quick feedback, I'll submit a patch for this.

> > > > >> Furthermore, since the effective per-core cache size is 1.5 * configured cache size, a configured cache size of 256 may leave up to 384 objects in each per-core cache.
> > > > >>
> > > > >> With 4 cores, you can expect up to 3 * 384 = 1152 objects sitting in the caches of other cores. If you want to be able to pull 512 objects with each core, the pool size should be 4 * 512 + 1152 objects.
> > > > > Maybe we should document this in the mempool library?
> > > >
> > > > Maybe. But this case I described here is a bit wonky - SPDK should never have been specifying a non-zero cache in this case. We only noticed this change in behavior because we were creating the mempool with a cache when we shouldn't have.
> > >
> > > So, looking at the positive side, this regression revealed a "bug" in SPDK. ;-)
> > >
> > > PS: I assume that you are aware that mempools generally perform better with cache, so I assume that you know what you are doing when you say that you don't need the cache.
> > >
> > > PPS: Sorry about venting here. If nothing else, I hope it had some entertainment value. :-)