Apologies, reader: I realize too late that my reference to private_data_size in
connection with rte_mempool_create() is a typo. I meant cache_size, for which the
doc reads:

  If cache_size is non-zero, the rte_mempool library will try to limit the
  accesses to the common lockless pool, by maintaining a per-lcore object cache.
  This argument must be lower or equal to RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5.
  It is advised to choose cache_size to have "n modulo cache_size == 0": if this
  is not the case, some elements will always stay in the pool and will never be
  used. The access to the per-lcore table is of course faster than the
  multi-producer/consumer pool. The cache can be disabled if the cache_size
  argument is set to 0; it can be useful to avoid losing objects in cache.

On Sat, Jan 29, 2022 at 9:29 PM fwefew 4t4tg <7532yahoo@gmail.com> wrote:

> Dmitry,
>
> "On the contrary: rte_pktmbuf_pool_create() takes the amount
> of usable memory (dataroom) and adds space for rte_mbuf and the headroom.
> Furthermore, the underlying rte_mempool_create() ensures element (mbuf)
> alignment, may spread the elements between pages, etc."
>
> Thanks. This is a crucial correction to my erroneous statement.
>
> I'd like to press on, then, with one of my questions which, after some
> additional thought, is answered, albeit implicitly. For the benefit of other
> programmers who are new to this work, I'll explain. If wrong, please hammer
> on it.
>
> The other crucial insight is this: so long as memory is allocated on the same
> NUMA node as the RXQ/TXQ that ultimately uses it, there is only a marginal
> performance advantage to the per-core caching of mbufs in a mempool provided
> by the private_data_size formal argument of rte_mempool_create() here:
>
> https://doc.dpdk.org/api/rte__mempool_8h.html#a503f2f889043a48ca9995878846db2fd
>
> In fact the API doc should really point out the advantage; perhaps it
> eliminates some cache sloshing to get the last few percent of performance.
> It probably is not a major factor in latency or bandwidth either way.
>
> Memory access from an lcore x (aka H/W thread, vCPU) on NUMA node N is
> essentially the same as from any other distinct lcore y != x, provided y also
> runs on N *and the memory was allocated for N*. Therefore, lcore affinity to
> a mempool is pretty much a red herring.
>
> Consider this code, which I originally took as indicative of good mempool
> creation but which, on further thought, confused me:
>
> https://github.com/erpc-io/eRPC/blob/master/src/transport_impl/dpdk/dpdk_init.cc#L76
>
>   for (size_t i = 0; i < kMaxQueuesPerPort; i++) {
>     const std::string pname = get_mempool_name(phy_port, i);
>     rte_mempool *mempool =
>         rte_pktmbuf_pool_create(pname.c_str(), kNumMbufs, 0 /* cache */,
>                                 0 /* priv size */, kMbufSize, numa_node);
>
> This has the appearance of creating one mempool for each RXQ and each TXQ,
> and in fact that is what it does. The programmer here ensures the numa_node
> passed in as the last argument is the same numa_node on which the RXQ/TXQ
> eventually runs. Since each lcore has its own mempool, and because
> rte_pktmbuf_pool_create() never calls into rte_mempool_create() with a
> non-zero private_data_size, per-lcore caching doesn't arise. (I briefly
> checked mbuf/rte_mbuf.c to confirm.) Indeed, *lcore v. mempool affinity is
> irrelevant* provided the RXQ for a given mempool runs on the same numa_node
> as specified in the last argument to rte_pktmbuf_pool_create().
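To make this concrete, and to fold in the cache_size correction at the top,
here is a minimal sketch of how I would now write the above: one pool per
queue, allocated on the NIC's NUMA node, with a non-zero per-lcore cache
chosen so that "n modulo cache_size == 0". The constants and names
(PORT_ID, MBUFS_PER_QUEUE, CACHE_SIZE, make_per_queue_pools) are mine and
purely illustrative, not from eRPC or DPDK.

  #include <stdio.h>
  #include <rte_ethdev.h>
  #include <rte_mbuf.h>

  /* Illustrative constants, chosen so that MBUFS_PER_QUEUE % CACHE_SIZE == 0
   * and CACHE_SIZE <= RTE_MEMPOOL_CACHE_MAX_SIZE, per the doc quoted above. */
  #define NUM_QUEUES      4
  #define MBUFS_PER_QUEUE 8192
  #define CACHE_SIZE      256
  #define PORT_ID         0

  static struct rte_mempool *pools[NUM_QUEUES];

  static int make_per_queue_pools(void)
  {
      /* NUMA node of the NIC; the queues and their pools should live there too. */
      int socket_id = rte_eth_dev_socket_id(PORT_ID);

      for (unsigned q = 0; q < NUM_QUEUES; q++) {
          char name[RTE_MEMPOOL_NAMESIZE];
          snprintf(name, sizeof(name), "pq_pool_%u", q);

          /* One pool per queue, on the port's NUMA node, with a per-lcore
           * cache instead of the 0 passed in the eRPC snippet above. */
          pools[q] = rte_pktmbuf_pool_create(name, MBUFS_PER_QUEUE, CACHE_SIZE,
                                             0 /* priv size */,
                                             RTE_MBUF_DEFAULT_BUF_SIZE,
                                             socket_id);
          if (pools[q] == NULL)
              return -1;
      }
      return 0;
  }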
> Let's turn then to a larger issue: what happens if different RXQ/TXQs have
> radically different needs?
>
> As the code above illustrates, one merely allocates a size appropriate to an
> individual RXQ/TXQ by changing the count and size of the mbufs, which is as
> simple as it can get. You have 10 queues, each with its own memory needs? OK,
> then allocate one memory pool for each. None of the other 9 queues will have
> that mempool pointer; each queue will use only the mempool that was specified
> for it. To beat a dead horse: just make sure the numa_node used in the
> allocation and the NUMA node that will ultimately run the RXQ/TXQ are the
> same.
>
> On Sat, Jan 29, 2022 at 8:23 PM Dmitry Kozlyuk wrote:
>
>> 2022-01-29 18:46 (UTC-0500), fwefew 4t4tg:
>> [...]
>> > 1. Does cache_size include or exclude data_room_size?
>> > 2. Does cache_size include or exclude sizeof(struct rte_mbuf)?
>> > 3. Does cache_size include or exclude RTE_PKTMBUF_HEADROOM?
>>
>> Cache size is measured in the number of elements, irrespective of their
>> size. It is not a memory size, so the questions above are not really
>> meaningful.
>>
>> > 4. What lcore is the allocated memory pinned to?
>>
>> Memory is associated with a NUMA node (DPDK calls it "socket"), not an
>> lcore. Each lcore belongs to one NUMA node, see rte_lcore_to_socket_id().
>>
>> > The lcore of the caller when this method is run? The answer here is
>> > important. If it's the lcore of the caller when called, this routine
>> > should be called in the lcore's entry point so the memory lands on the
>> > lcore for which it is intended. Calling it on the lcore that happens to
>> > be running main, for example, could have a bad side effect if that is
>> > different from where the memory will ultimately be used.
>>
>> The NUMA node is controlled by the "socket_id" parameter.
>> Your considerations are correct: often you should create separate mempools
>> for each NUMA node to avoid this performance issue. (You should also
>> consider which NUMA node each device belongs to.)
>>
>> > 5. Which one of the formal arguments represents the tail room indicated in
>> > https://doc.dpdk.org/guides/prog_guide/mbuf_lib.html#figure-mbuf1
>> [...]
>> > 5. Unknown. Perhaps if you want private data, which corresponds to tail
>> > room in the diagram above, one has to call rte_mempool_create() instead
>> > and focus on private_data_size.
>>
>> Incorrect; tail room is simply an unused part at the end of the data room.
>> Private data is for the entire mempool, not for individual mbufs.
>>
>> > Mempool creation is like malloc: you request the total number of absolute
>> > bytes required. The API will not add bytes to, or remove bytes from, the
>> > number you specify. Therefore the number you give must be inclusive of all
>> > needs including your payload, any DPDK overhead, headroom, tailroom, and
>> > so on. DPDK is not adding to the number you give for its own purposes.
>> > Clearer? Perhaps ... but what needs? Read on ...
>>
>> On the contrary: rte_pktmbuf_pool_create() takes the amount
>> of usable memory (dataroom) and adds space for rte_mbuf and the headroom.
>> Furthermore, the underlying rte_mempool_create() ensures element (mbuf)
>> alignment, may spread the elements between pages, etc.
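To spell out for myself what that implies about sizing: as I understand it
after skimming mbuf/rte_mbuf.c, the caller supplies only the data room and the
library derives the rest of the element. A rough sketch of my reading follows;
the helper name is mine, and the arithmetic deliberately ignores any per-object
header/trailer and alignment padding that rte_mempool itself may add.

  #include <rte_mbuf.h>

  /* Rough per-element footprint as I understand rte_pktmbuf_pool_create():
   * the caller passes data_room_size (which, per the API doc, includes
   * RTE_PKTMBUF_HEADROOM), and the library adds the mbuf header plus any
   * per-mbuf private area on top of it. */
  static inline size_t
  approx_mbuf_elt_size(uint16_t priv_size, uint16_t data_room_size)
  {
      return sizeof(struct rte_mbuf) + priv_size + data_room_size;
  }

  /* For the common default,
   *   RTE_MBUF_DEFAULT_BUF_SIZE == RTE_MBUF_DEFAULT_DATAROOM (2048)
   *                                + RTE_PKTMBUF_HEADROOM (128 by default),
   * each element costs roughly sizeof(struct rte_mbuf) + 2176 bytes. */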
>> [...]
>> > No. I might not. I might have half my TXQs and RXQs dealing with tiny
>> > mbufs/packets, and the other half dealing with completely different
>> > traffic of a completely different size and structure. So I might want
>> > memory pool allocation to be done on a smaller scale, e.g. per
>> > RXQ/TXQ/lcore. DPDK doesn't seem to permit this.
>>
>> You can create different mempools for each purpose
>> and specify the proper mempool to rte_eth_rx_queue_setup().
>> When creating them, you can and should also take NUMA into account.
>> Take a look at the init_mem() function of examples/l3fwd.
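For other newcomers, here is a bare-bones sketch of what I understand this
suggestion to mean: two pools with very different element sizes, each handed to
its own RX queue at setup time. All names, counts, and sizes are mine and
purely illustrative; the init_mem() function in examples/l3fwd is the
authoritative reference.

  #include <rte_ethdev.h>
  #include <rte_mbuf.h>

  #define PORT_ID 0
  #define NB_RXD  1024

  /* Two pools with radically different element sizes (illustrative only). */
  static struct rte_mempool *small_pool, *big_pool;

  static int setup_two_rx_queues(void)
  {
      int socket = rte_eth_dev_socket_id(PORT_ID);

      /* Queue 0: many small buffers, e.g. for tiny request packets. */
      small_pool = rte_pktmbuf_pool_create("small_pool", 16384, 256,
                                           0 /* priv size */,
                                           512 + RTE_PKTMBUF_HEADROOM, socket);
      /* Queue 1: fewer, larger buffers, e.g. for jumbo-sized payloads. */
      big_pool = rte_pktmbuf_pool_create("big_pool", 4096, 256,
                                         0 /* priv size */,
                                         9216 + RTE_PKTMBUF_HEADROOM, socket);
      if (small_pool == NULL || big_pool == NULL)
          return -1;

      /* Each queue is handed only its own pool; the other queue never sees it. */
      if (rte_eth_rx_queue_setup(PORT_ID, 0, NB_RXD, socket, NULL, small_pool) < 0)
          return -1;
      if (rte_eth_rx_queue_setup(PORT_ID, 1, NB_RXD, socket, NULL, big_pool) < 0)
          return -1;
      return 0;
  }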