Apologies, reader: I realize too late that my reference to private_data_size in
connection with rte_mempool_create() is a typo. I meant cache_size, for which the
doc reads:

  If cache_size is non-zero, the rte_mempool library will try to limit the
  accesses to the common lockless pool, by maintaining a per-lcore object cache.
  This argument must be lower or equal to RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5.
  It is advised to choose cache_size to have "n modulo cache_size == 0": if this
  is not the case, some elements will always stay in the pool and will never be
  used. The access to the per-lcore table is of course faster than the
  multi-producer/consumer pool. The cache can be disabled if the cache_size
  argument is set to 0; it can be useful to avoid losing objects in cache.

On Sat, Jan 29, 2022 at 9:29 PM fwefew 4t4tg <7532yahoo@gmail.com> wrote:

> Dmitry,
>
> "On the contrary: rte_pktmbuf_pool_create() takes the amount
> of usable memory (dataroom) and adds space for rte_mbuf and the headroom.
> Furthermore, the underlying rte_mempool_create() ensures element (mbuf)
> alignment, may spread the elements between pages, etc."
>
> Thanks. This is a crucial correction to my erroneous statement.
>
> I'd like to press on, then, with one of my questions which, after some
> additional thought, is answered, albeit implicitly. For the benefit of other
> programmers who are new to this work, I'll explain. If wrong, please hammer
> on it.
>
> The other crucial insight is this: so long as memory is allocated on the same
> NUMA node as the RXQ/TXQ that ultimately uses it, there is only a marginal
> performance advantage to the per-core caching of mbufs in a mempool provided
> by the private_data_size formal argument of rte_mempool_create() here:
>
> https://doc.dpdk.org/api/rte__mempool_8h.html#a503f2f889043a48ca9995878846db2fd
>
> In fact the API doc should really point out the advantage; perhaps it
> eliminates some cache sloshing to get the last few percent of performance.
> It probably is not a major factor in latency or bandwidth either way.
>
> Memory access from an lcore x (aka H/W thread, vCPU) on NUMA node N is
> essentially the same as from any other distinct lcore y != x, provided y also
> runs on N *and the memory was allocated for N*. Therefore, lcore affinity to
> a mempool is pretty much a red herring.
>
> Consider this code, which I originally took as indicative of good mempool
> creation but which, on further thought, confused me:
>
> https://github.com/erpc-io/eRPC/blob/master/src/transport_impl/dpdk/dpdk_init.cc#L76
>
>   for (size_t i = 0; i < kMaxQueuesPerPort; i++) {
>     const std::string pname = get_mempool_name(phy_port, i);
>     rte_mempool *mempool =
>         rte_pktmbuf_pool_create(pname.c_str(), kNumMbufs, 0 /* cache */,
>                                 0 /* priv size */, kMbufSize, numa_node);
>
> This has the appearance of creating one mempool for each RXQ and each TXQ,
> and in fact that is what it does. The programmer here ensures the numa_node
> passed in as the last argument is the same numa_node on which the RXQ/TXQ
> eventually runs. Since each lcore has its own mempool, and because
> rte_pktmbuf_pool_create() never calls into rte_mempool_create() with a
> non-zero private_data_size, per-lcore caching doesn't arise. (I briefly
> checked mbuf/rte_mbuf.c to confirm.) Indeed, *lcore v. mempool affinity is
> irrelevant* provided the RXQ for a given mempool runs on the same numa_node
> as specified in the last argument to rte_pktmbuf_pool_create().
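To make this concrete, and to fold in the cache_size correction at the top,
here is a minimal sketch of how I would now write the above: one pool per
queue, allocated on the NIC's NUMA node, with a non-zero per-lcore cache
chosen so that "n modulo cache_size == 0". The constants and names
(PORT_ID, MBUFS_PER_QUEUE, CACHE_SIZE, make_per_queue_pools) are mine and
purely illustrative, not from eRPC or DPDK.

  #include <stdio.h>
  #include <rte_ethdev.h>
  #include <rte_mbuf.h>

  /* Illustrative constants, chosen so that MBUFS_PER_QUEUE % CACHE_SIZE == 0
   * and CACHE_SIZE <= RTE_MEMPOOL_CACHE_MAX_SIZE, per the doc quoted above. */
  #define NUM_QUEUES      4
  #define MBUFS_PER_QUEUE 8192
  #define CACHE_SIZE      256
  #define PORT_ID         0

  static struct rte_mempool *pools[NUM_QUEUES];

  static int make_per_queue_pools(void)
  {
      /* NUMA node of the NIC; the queues and their pools should live there too. */
      int socket_id = rte_eth_dev_socket_id(PORT_ID);

      for (unsigned q = 0; q < NUM_QUEUES; q++) {
          char name[RTE_MEMPOOL_NAMESIZE];
          snprintf(name, sizeof(name), "pq_pool_%u", q);

          /* One pool per queue, on the port's NUMA node, with a per-lcore
           * cache instead of the 0 passed in the eRPC snippet above. */
          pools[q] = rte_pktmbuf_pool_create(name, MBUFS_PER_QUEUE, CACHE_SIZE,
                                             0 /* priv size */,
                                             RTE_MBUF_DEFAULT_BUF_SIZE,
                                             socket_id);
          if (pools[q] == NULL)
              return -1;
      }
      return 0;
  }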
> Let's turn then to a larger issue: what happens if different RXQ/TXQs have
> radically different needs?
>
> As the code above illustrates, one merely allocates a size appropriate to an
> individual RXQ/TXQ by changing the count and size of the mbufs, which is as
> simple as it can get. You have 10 queues, each with its own memory needs? OK,
> then allocate one memory pool for each. None of the other 9 queues will have
> that mempool pointer; each queue will use only the mempool that was specified
> for it. To beat a dead horse: just make sure the numa_node used in the
> allocation and the NUMA node that will ultimately run the RXQ/TXQ are the
> same.
>
> On Sat, Jan 29, 2022 at 8:23 PM Dmitry Kozlyuk wrote:
>
>> 2022-01-29 18:46 (UTC-0500), fwefew 4t4tg:
>> [...]
>> > 1. Does cache_size include or exclude data_room_size?
>> > 2. Does cache_size include or exclude sizeof(struct rte_mbuf)?
>> > 3. Does cache_size include or exclude RTE_PKTMBUF_HEADROOM?
>>
>> Cache size is measured in the number of elements, irrespective of their
>> size. It is not a memory size, so the questions above are not really
>> meaningful.
>>
>> > 4. What lcore is the allocated memory pinned to?
>>
>> Memory is associated with a NUMA node (DPDK calls it "socket"), not an
>> lcore. Each lcore belongs to one NUMA node, see rte_lcore_to_socket_id().
>>
>> > The lcore of the caller when this method is run? The answer here is
>> > important. If it's the lcore of the caller when called, this routine
>> > should be called in the lcore's entry point so the memory lands on the
>> > lcore for which it is intended. Calling it on the lcore that happens to
>> > be running main, for example, could have a bad side effect if that is
>> > different from where the memory will ultimately be used.
>>
>> The NUMA node is controlled by the "socket_id" parameter.
>> Your considerations are correct: often you should create separate mempools
>> for each NUMA node to avoid this performance issue. (You should also
>> consider which NUMA node each device belongs to.)
>>
>> > 5. Which one of the formal arguments represents the tail room indicated in
>> > https://doc.dpdk.org/guides/prog_guide/mbuf_lib.html#figure-mbuf1
>> [...]
>> > 5. Unknown. Perhaps if you want private data, which corresponds to tail
>> > room in the diagram above, one has to call rte_mempool_create() instead
>> > and focus on private_data_size.
>>
>> Incorrect; tail room is simply an unused part at the end of the data room.
>> Private data is for the entire mempool, not for individual mbufs.
>>
>> > Mempool creation is like malloc: you request the total number of absolute
>> > bytes required. The API will not add bytes to, or remove bytes from, the
>> > number you specify. Therefore the number you give must be inclusive of all
>> > needs including your payload, any DPDK overhead, headroom, tailroom, and
>> > so on. DPDK is not adding to the number you give for its own purposes.
>> > Clearer? Perhaps ... but what needs? Read on ...
>>
>> On the contrary: rte_pktmbuf_pool_create() takes the amount
>> of usable memory (dataroom) and adds space for rte_mbuf and the headroom.
>> Furthermore, the underlying rte_mempool_create() ensures element (mbuf)
>> alignment, may spread the elements between pages, etc.
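To spell out for myself what that implies about sizing: as I understand it
after skimming mbuf/rte_mbuf.c, the caller supplies only the data room and the
library derives the rest of the element. A rough sketch of my reading follows;
the helper name is mine, and the arithmetic deliberately ignores any per-object
header/trailer and alignment padding that rte_mempool itself may add.

  #include <rte_mbuf.h>

  /* Rough per-element footprint as I understand rte_pktmbuf_pool_create():
   * the caller passes data_room_size (which, per the API doc, includes
   * RTE_PKTMBUF_HEADROOM), and the library adds the mbuf header plus any
   * per-mbuf private area on top of it. */
  static inline size_t
  approx_mbuf_elt_size(uint16_t priv_size, uint16_t data_room_size)
  {
      return sizeof(struct rte_mbuf) + priv_size + data_room_size;
  }

  /* For the common default,
   *   RTE_MBUF_DEFAULT_BUF_SIZE == RTE_MBUF_DEFAULT_DATAROOM (2048)
   *                                + RTE_PKTMBUF_HEADROOM (128 by default),
   * each element costs roughly sizeof(struct rte_mbuf) + 2176 bytes. */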
>> [...]
>> > No. I might not. I might have half my TXQs and RXQs dealing with tiny
>> > mbufs/packets, and the other half dealing with completely different
>> > traffic of a completely different size and structure. So I might want
>> > memory pool allocation to be done on a smaller scale, e.g. per
>> > RXQ/TXQ/lcore. DPDK doesn't seem to permit this.
>>
>> You can create different mempools for each purpose
>> and specify the proper mempool to rte_eth_rx_queue_setup().
>> When creating them, you can and should also take NUMA into account.
>> Take a look at the init_mem() function of examples/l3fwd.
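For other newcomers, here is a bare-bones sketch of what I understand this
suggestion to mean: two pools with very different element sizes, each handed to
its own RX queue at setup time. All names, counts, and sizes are mine and
purely illustrative; the init_mem() function in examples/l3fwd is the
authoritative reference.

  #include <rte_ethdev.h>
  #include <rte_mbuf.h>

  #define PORT_ID 0
  #define NB_RXD  1024

  /* Two pools with radically different element sizes (illustrative only). */
  static struct rte_mempool *small_pool, *big_pool;

  static int setup_two_rx_queues(void)
  {
      int socket = rte_eth_dev_socket_id(PORT_ID);

      /* Queue 0: many small buffers, e.g. for tiny request packets. */
      small_pool = rte_pktmbuf_pool_create("small_pool", 16384, 256,
                                           0 /* priv size */,
                                           512 + RTE_PKTMBUF_HEADROOM, socket);
      /* Queue 1: fewer, larger buffers, e.g. for jumbo-sized payloads. */
      big_pool = rte_pktmbuf_pool_create("big_pool", 4096, 256,
                                         0 /* priv size */,
                                         9216 + RTE_PKTMBUF_HEADROOM, socket);
      if (small_pool == NULL || big_pool == NULL)
          return -1;

      /* Each queue is handed only its own pool; the other queue never sees it. */
      if (rte_eth_rx_queue_setup(PORT_ID, 0, NB_RXD, socket, NULL, small_pool) < 0)
          return -1;
      if (rte_eth_rx_queue_setup(PORT_ID, 1, NB_RXD, socket, NULL, big_pool) < 0)
          return -1;
      return 0;
  }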