* allocating a mempool w/ rte_pktmbuf_pool_create()
From: fwefew 4t4tg @ 2022-01-29 23:46 UTC
To: users

The API rte_pktmbuf_pool_create()

https://doc.dpdk.org/api/rte__mbuf_8h.html#a593921f13307803b94bbb4e0932db962

at first glance seems minimal and complete. It's not. It's really not. The
doc's link to `rte_mempool_create()` helps a little, but not much. It leaves
at least the following important points unanswered. I would appreciate
answers from whoever knows how DPDK thinks here:

1. Does cache_size include or exclude data_room_size?
2. Does cache_size include or exclude sizeof(struct rte_mbuf)?
3. Does cache_size include or exclude RTE_PKTMBUF_HEADROOM?
4. What lcore is the allocated memory pinned to? The lcore of the caller
   when this method is run? The answer here is important. If it's the lcore
   of the caller, this routine should be called in the lcore's entry point
   so the memory is on the lcore it is intended for. Calling it on the lcore
   that happens to be running main, for example, could have a bad side
   effect if it's different from where the memory will ultimately be used.
5. Which one of the formal arguments represents the tail room indicated in
   https://doc.dpdk.org/guides/prog_guide/mbuf_lib.html#figure-mbuf1

My answers, as best I can tell, follow:

1. Excludes
2. Excludes
3. Excludes
4. The caller does not enter into this situation; see below
5. Unknown. Perhaps if you want private data, which corresponds to tail room
   in the diagram above, one has to call rte_mempool_create() instead and
   focus on private_data_size.

Discussion:

Mempool creation is like malloc: you request the total number of absolute
bytes required. The API will not add or remove bytes to the number you
specify. Therefore the number you give must be inclusive of all needs,
including your payload, any DPDK overhead, headroom, tailroom, and so on.
DPDK is not adding to the number you give for its own purposes. Clearer?
Perhaps ... but what needs? Read on ...

Unlike malloc, rte_pktmbuf_pool_create() takes *n*, the number of objects
that memory will hold. Therefore cache_size mod n should be 0. Indeed, in
some DPDK code, such as https://github.com/erpc-io/eRPC, the author allocates
one mempool per RXQ or TXQ, where the memory requested and the number of
objects in that mempool are appropriate for exactly one RXQ or exactly one
TXQ. Clearly, then, the total amount of memory in a specific mempool divided
by the number of objects it's intended to hold should leave a remainder of
zero.

This raises the question: if DPDK can do this, what lcore is the memory
pinned to or cached for? The caller's lcore? If so, it depends on which lcore
allocates the memory, which can depend on where and when in the program's
history the call is made. Note the API does not take an lcore argument.

A *careful reading, however, suggests the eRPC code is misguided*. DPDK does
not support creating a mempool for usage on one lcore. The doc says the
formal argument *cache_size* gives the *size of the per-core object cache.
See rte_mempool_create() for details.* The doc also says *the optimum size
(in terms of memory usage) for a mempool is when n is a power of two minus
one: n = (2^q - 1)*, which also contradicts a mod 0 scenario.

Now I'm a little, but not totally, surprised here. Yes, in some applications
TXQs and RXQs are bouncing around uniform packets. So telling DPDK I need X
bytes and letting DPDK do the spade work of breaking up the memory for
efficient per-core access, ultimately per RXQ or per TXQ, is a real benefit.
DPDK will give nice per-lcore caching and lockless memory access. Gotta like
that. Right?

No. I might not. I might have half my TXQs and RXQs dealing with tiny
mbufs/packets, and the other half dealing with completely different traffic
of a completely different size and structure. So I might want memory pool
allocation to be done on a smaller scale, e.g. per RXQ/TXQ/lcore. DPDK
doesn't seem to permit this. DPDK seems to require me to allocate for the
largest possible application mbuf size for all cases and then, upon
allocating an mbuf from the pool, *use as many or as few bytes of the
allocation as needed for the particular purpose at hand.* If that's what DPDK
wants, fine. But I think it ought to be a hell of a lot easier to see that's
so.
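The sizing hint quoted from the API doc in this message (memory usage is optimal when n = 2^q - 1) can be turned into a small rounding helper. This is only a sketch: `pool_size_pow2_minus_1` is a hypothetical name, not a DPDK function, and it does nothing beyond the arithmetic.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of DPDK): return the smallest value of
 * the form 2^q - 1 that is greater than or equal to n. For n == 0 the
 * result is 1, the smallest value of that form produced here.
 */
uint32_t pool_size_pow2_minus_1(uint32_t n)
{
    uint32_t v = 1;            /* 2^1 - 1 */

    while (v < n)
        v = (v << 1) | 1;      /* walks 1, 3, 7, 15, ... up to 2^32 - 1 */
    return v;
}
```

For example, a request for 8000 mbufs would be rounded up to 8191, and a request for exactly 8192 would jump to 16383, which is worth knowing before picking "round" numbers for n.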
* Re: allocating a mempool w/ rte_pktmbuf_pool_create()
From: Dmitry Kozlyuk @ 2022-01-30 1:23 UTC
To: fwefew 4t4tg; +Cc: users

2022-01-29 18:46 (UTC-0500), fwefew 4t4tg:
[...]
> 1. Does cache_size include or exclude data_room_size?
> 2. Does cache_size include or exclude sizeof(struct rte_mbuf)?
> 3. Does cache_size include or exclude RTE_PKTMBUF_HEADROOM?

Cache size is measured in the number of elements, irrespective of their size.
It is not a memory size, so the questions above are not really meaningful.

> 4. What lcore is the allocated memory pinned to?

Memory is associated with a NUMA node (DPDK calls it "socket"), not an lcore.
Each lcore belongs to one NUMA node, see rte_lcore_to_socket_id().

> The lcore of the caller
> when this method is run? The answer here is important. If it's the lcore
> of the caller, this routine should be called in the lcore's entry point
> so the memory is on the lcore it is intended for. Calling it on the lcore
> that happens to be running main, for example, could have a bad side
> effect if it's different from where the memory will ultimately be used.

The NUMA node is controlled by the "socket_id" parameter.
Your considerations are correct: often you should create separate mempools
for each NUMA node to avoid this performance issue. (You should also
consider which NUMA node each device belongs to.)

> 5. Which one of the formal arguments represents the tail room indicated in
> https://doc.dpdk.org/guides/prog_guide/mbuf_lib.html#figure-mbuf1
[...]
> 5. Unknown. Perhaps if you want private data, which corresponds to tail room
> in the diagram above, one has to call rte_mempool_create() instead and
> focus on private_data_size.

Incorrect; tail room is simply an unused part at the end of the data room.
Private data is for the entire mempool, not for individual mbufs.

> Mempool creation is like malloc: you request the total number of absolute
> bytes required. The API will not add or remove bytes to the number you
> specify. Therefore the number you give must be inclusive of all needs,
> including your payload, any DPDK overhead, headroom, tailroom, and so on.
> DPDK is not adding to the number you give for its own purposes. Clearer?
> Perhaps ... but what needs? Read on ...

On the contrary: rte_pktmbuf_pool_create() takes the amount
of usable memory (dataroom) and adds space for rte_mbuf and the headroom.
Furthermore, the underlying rte_mempool_create() ensures element (mbuf)
alignment, may spread the elements between pages, etc.

[...]
> No. I might not. I might have half my TXQs and RXQs dealing with tiny
> mbufs/packets, and the other half dealing with completely different traffic
> of a completely different size and structure. So I might want memory pool
> allocation to be done on a smaller scale, e.g. per RXQ/TXQ/lcore. DPDK
> doesn't seem to permit this.

You can create different mempools for each purpose
and specify the proper mempool to rte_eth_rx_queue_setup().
When creating them, you can and should also take NUMA into account.
Take a look at the init_mem() function of examples/l3fwd.
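The size accounting described in this reply can be modelled with plain arithmetic. The sketch below does not call DPDK. The two constants are assumptions standing in for sizeof(struct rte_mbuf) and RTE_PKTMBUF_HEADROOM on a typical 64-bit build (both commonly 128), and the function names are hypothetical. One detail worth checking against the API reference: as I read it, data_room_size is documented as "including RTE_PKTMBUF_HEADROOM", so in this model the headroom is carved out of the data room, while the rte_mbuf header and the per-mbuf priv_size are what the library adds on top.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed values for a typical 64-bit build; check your DPDK headers. */
#define ASSUMED_MBUF_STRUCT_SIZE 128u /* stand-in for sizeof(struct rte_mbuf) */
#define ASSUMED_HEADROOM         128u /* stand-in for RTE_PKTMBUF_HEADROOM */

/* Bytes one pool element needs before any mempool alignment or padding. */
size_t elt_size(uint16_t priv_size, uint16_t data_room_size)
{
    return (size_t)ASSUMED_MBUF_STRUCT_SIZE + priv_size + data_room_size;
}

/* Bytes left for packet data once the default headroom is reserved. */
uint16_t usable_data(uint16_t data_room_size)
{
    return data_room_size > ASSUMED_HEADROOM
         ? (uint16_t)(data_room_size - ASSUMED_HEADROOM)
         : 0;
}

/* Tail room: the unused part at the end of the data room after the packet. */
uint16_t tailroom(uint16_t data_room_size, uint16_t data_len)
{
    uint16_t usable = usable_data(data_room_size);

    return data_len < usable ? (uint16_t)(usable - data_len) : 0;
}
```

With data_room_size = 2176 (2048 bytes of payload plus the assumed 128-byte headroom), the model gives a 2304-byte element, 2048 usable bytes, and 548 bytes of tail room behind a 1500-byte packet. None of these figures involve cache_size, which counts elements rather than bytes.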
* Re: allocating a mempool w/ rte_pktmbuf_pool_create()
From: fwefew 4t4tg @ 2022-01-30 2:29 UTC
To: Dmitry Kozlyuk, users

Dmitry,

"On the contrary: rte_pktmbuf_pool_create() takes the amount
of usable memory (dataroom) and adds space for rte_mbuf and the headroom.
Furthermore, the underlying rte_mempool_create() ensures element (mbuf)
alignment, may spread the elements between pages, etc."

Thanks. This is a crucial correction to my erroneous statement.

I'd like to press on, then, with one of my questions that, after some
additional thought, is answered, however implicitly. For the benefit of
other programmers who are new to this work, I'll explain. If wrong, please
hammer on it.

The other crucial insight is: so long as memory is allocated on the same NUMA
node as the RXQ/TXQ that ultimately uses it, there is only a marginal
performance advantage to having per-core caching of mbufs in a mempool as
provided by the private_data_size formal argument in rte_mempool_create()
here:

https://doc.dpdk.org/api/rte__mempool_8h.html#a503f2f889043a48ca9995878846db2fd

In fact the API doc should really point out the advantage; perhaps it
eliminates some cache sloshing to get the last few percent of performance. It
probably is not a major factor in latency or bandwidth with or without
private_data_size==0.

Memory access from an lcore x (aka H/W thread, vCPU) on NUMA node N is fairly
unchanged for any other distinct lcore y != x, provided y also runs on N *and
the memory was allocated for N*. Therefore, lcore affinity to a mempool is
pretty much a red herring.

Consider this code, which I originally took as indicative of good mempool
creation, but which upon further thinking got me confused:

https://github.com/erpc-io/eRPC/blob/master/src/transport_impl/dpdk/dpdk_init.cc#L76

    for (size_t i = 0; i < kMaxQueuesPerPort; i++) {
      const std::string pname = get_mempool_name(phy_port, i);
      rte_mempool *mempool =
          rte_pktmbuf_pool_create(pname.c_str(), kNumMbufs, 0 /* cache */,
                                  0 /* priv size */, kMbufSize, numa_node);

This has the appearance of creating one mempool for each RXQ and each TXQ.
And in fact this is what it does. The programmer here ensures the numa_node
passed in as the last argument is the same NUMA node on which the RXQ/TXQ
eventually runs. Since each lcore has its own mempool, and because
rte_pktmbuf_pool_create() never calls into rte_mempool_create() with a
non-zero private_data_size, per-lcore caching doesn't arise. (I briefly
checked mbuf/rte_mbuf.c to confirm.) Indeed, *lcore v. mempool affinity is
irrelevant* provided the RXQ for a given mempool runs on the same NUMA node
as specified in the last argument to rte_pktmbuf_pool_create().

Let's turn then to a larger issue: what happens if different RXQ/TXQs have
radically different needs?

As the code above illustrates, one merely allocates a size appropriate to an
individual RXQ/TXQ by changing the count and size of mbufs, which is as
simple as it can get. You have 10 queues, each with their own memory needs?
OK, then allocate one memory pool for each. None of the other 9 queues will
have that mempool pointer. Each queue will use only the mempool that was
specified for it. To beat a dead horse: just make sure the numa_node in the
allocation and the NUMA node which will ultimately run the RXQ/TXQ are the
same.
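The per-queue arrangement described in this message can be sanity-checked on paper before any pool is created. The sketch below is a plain-C model, not DPDK code: `struct queue_plan` and both functions are hypothetical names introduced for illustration. It counts queues whose pool would land on a different NUMA node than the lcore polling them, and totals the data-room bytes requested across all pools (ignoring the mbuf header and mempool padding).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one queue and the pool planned for it. */
struct queue_plan {
    int      queue_socket;   /* NUMA node of the lcore that polls the queue */
    int      pool_socket;    /* socket_id intended for the queue's mempool */
    uint32_t n_mbufs;        /* number of mbufs in the pool */
    uint16_t data_room_size; /* per-mbuf data room, in bytes */
};

/* Number of queues whose pool is planned on a different NUMA node. */
size_t count_numa_mismatches(const struct queue_plan *q, size_t nq)
{
    size_t bad = 0;

    for (size_t i = 0; i < nq; i++)
        if (q[i].queue_socket != q[i].pool_socket)
            bad++;
    return bad;
}

/* Sum of n_mbufs * data_room_size over all planned pools. */
uint64_t total_data_room_bytes(const struct queue_plan *q, size_t nq)
{
    uint64_t total = 0;

    for (size_t i = 0; i < nq; i++)
        total += (uint64_t)q[i].n_mbufs * q[i].data_room_size;
    return total;
}
```

A plan with small-packet and jumbo-frame queues side by side simply uses different n_mbufs and data_room_size values per entry; the only cross-check that matters for locality in this model is that the two socket fields agree.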
* Re: allocating a mempool w/ rte_pktmbuf_pool_create()
From: fwefew 4t4tg @ 2022-01-30 2:33 UTC
To: Dmitry Kozlyuk, users

Apologies, reader: I realize too late that my reference to private_data_size
in connection with rte_mempool_create() is a typo. I meant cache_size, for
which the doc reads:

    If cache_size is non-zero, the rte_mempool
    <https://doc.dpdk.org/api/structrte__mempool.html> library will try to
    limit the accesses to the common lockless pool, by maintaining a
    per-lcore object cache. This argument must be lower or equal to
    RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5. It is advised to choose
    cache_size to have "n modulo cache_size == 0": if this is not the case,
    some elements will always stay in the pool and will never be used. The
    access to the per-lcore table is of course faster than the
    multi-producer/consumer pool. The cache can be disabled if the
    cache_size argument is set to 0; it can be useful to avoid losing
    objects in cache.
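The constraints in the doc text quoted in this message are easy to check up front. The sketch below is plain C with hypothetical function names. The constant is an assumption standing in for RTE_MEMPOOL_CACHE_MAX_SIZE (512 in the default build configuration, as far as I know), so confirm it against your own headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed stand-in for RTE_MEMPOOL_CACHE_MAX_SIZE; verify in your build. */
#define ASSUMED_CACHE_MAX_SIZE 512u

/*
 * True if cache_size satisfies the documented hard limits:
 * cache_size <= RTE_MEMPOOL_CACHE_MAX_SIZE and cache_size <= n / 1.5.
 * A cache_size of 0 disables the per-lcore cache and is always accepted.
 */
bool cache_size_valid(uint32_t n, uint32_t cache_size)
{
    if (cache_size == 0)
        return true;
    if (cache_size > ASSUMED_CACHE_MAX_SIZE)
        return false;
    /* cache_size <= n / 1.5, rewritten without floating point */
    return (uint64_t)cache_size * 3 <= (uint64_t)n * 2;
}

/* True if the advisory "n modulo cache_size == 0" also holds. */
bool cache_size_divides(uint32_t n, uint32_t cache_size)
{
    return cache_size != 0 && n % cache_size == 0;
}
```

Note that the two documented hints pull in different directions: n = 8192 with cache_size = 256 passes both checks, whereas the "power of two minus one" value n = 8191 is prime, so no cache size from 2 up to the assumed maximum can divide it.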
* Re: allocating a mempool w/ rte_pktmbuf_pool_create()
From: Dmitry Kozlyuk @ 2022-01-30 11:32 UTC
To: fwefew 4t4tg; +Cc: users

Hi,

2022-01-29 21:33 (UTC-0500), fwefew 4t4tg:
[...]
> > The other crucial insight is: so long as memory is allocated on the same
> > NUMA node as the RXQ/TXQ runs that ultimately uses it, there is only marginal
> > performance advantage to having per-core caching of mbufs in a mempool
> > as provided by the private_data_size formal argument in rte_mempool_create() here:
> >
> > https://doc.dpdk.org/api/rte__mempool_8h.html#a503f2f889043a48ca9995878846db2fd
> >
> > In fact the API doc should really point out the advantage; perhaps it
> > eliminates some cache sloshing to get the last few percent of performance.

Note: "cache sloshing", aka "false sharing", is not the case here.
There is true, not false, concurrency on the mempool ring
when multiple lcores use one mempool (see below why you may want this).
The colloquial term is "contention"; per-lcore caching reduces it.

Later you are talking about the case when a mempool is created for each queue.
The potential issue with this approach is that one queue may quickly deplete
its mempool; say, if it does IPv4 reassembly and holds fragments for a long
time. To counter this, each queue mempool must be large, which wastes memory.
This is why often one mempool is created for a set of queues
(processed on lcores from a single NUMA node at least).
If one queue consumes more mbufs than the others, it is not a problem anymore
as long as the mempool as a whole is not depleted.
Per-lcore caching optimizes this case, when many lcores access one mempool.
It may be less relevant for your case.
You can run the "mempool_perf_autotest" command of the app/test/dpdk-test
binary to see how the cache influences performance.
See also:
https://doc.dpdk.org/guides/prog_guide/mempool_lib.html#mempool-local-cache

[...]
> > Let's turn then to a larger issue: what happens if different RXQ/TXQs have
> > radically different needs?
> >
> > As the code above illustrates, one merely allocates a size appropriate to
> > an individual RXQ/TXQ by changing the count and size of mbufs ----
> > which is as simple as it can get.

Correct.
As explained above, it can also be one mempool per queue group.
What do you think is missing here for your use case?
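The memory argument for sharing one mempool across a group of queues can be illustrated with a toy calculation. This is an illustrative model only, not a formula from DPDK or l3fwd: the function names and the notion of a fixed "baseline" and "burst" per queue are assumptions made for the example.

```c
#include <stdint.h>

/*
 * Per-queue pools: every pool must be able to absorb its own worst case,
 * so each of the nq pools holds baseline + burst mbufs.
 */
uint64_t mbufs_per_queue_pools(uint32_t nq, uint32_t baseline, uint32_t burst)
{
    return (uint64_t)nq * ((uint64_t)baseline + burst);
}

/*
 * One shared pool: every queue's baseline, plus burst capacity only for
 * the number of queues expected to peak at the same time ("bursting").
 */
uint64_t mbufs_shared_pool(uint32_t nq, uint32_t baseline, uint32_t burst,
                           uint32_t bursting)
{
    if (bursting > nq)
        bursting = nq;   /* cannot have more bursting queues than queues */
    return (uint64_t)nq * baseline + (uint64_t)bursting * burst;
}
```

With 8 queues, a baseline of 1024 mbufs and a burst of 4096 (think of a queue holding fragments during reassembly), separate pools need 40960 mbufs in this model, while a shared pool sized for one bursting queue at a time needs 12288. The two only converge when every queue is assumed to burst simultaneously.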