Dmitry,

"On the contrary: rte_pktmbuf_pool_create() takes the amount
of usable memory (dataroom) and adds space for rte_mbuf and the headroom.
Furthermore, the underlying rte_mempool_create() ensures element (mbuf)
alignment, may spread the elements between pages, etc."

Thanks. This is a crucial correction to my erroneous statement. 
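
To make the correction concrete for myself, here is a minimal sketch of the per-mbuf
arithmetic as I now understand it. One caveat, going by the rte_pktmbuf_pool_create()
documentation: data_room_size is described there as including RTE_PKTMBUF_HEADROOM, so
the headroom comes out of the data room you pass in. The two 128-byte constants and both
helper names are my assumptions, not DPDK definitions; check them against your own build.

```cpp
#include <cassert>
#include <cstddef>

// Assumed typical values for a 64-bit build with the default configuration.
constexpr std::size_t kAssumedMbufStructSize = 128;  // stands in for sizeof(struct rte_mbuf)
constexpr std::size_t kAssumedHeadroom = 128;        // stands in for RTE_PKTMBUF_HEADROOM

// Hypothetical helper: bytes of one mempool element requested on behalf of the
// caller, before the mempool layer adds its per-object header and padding.
inline std::size_t mbuf_elt_size(std::size_t priv_size, std::size_t data_room_size) {
  assert(priv_size % 8 == 0);  // per-mbuf private area is expected to be 8-byte aligned
  return kAssumedMbufStructSize + priv_size + data_room_size;
}

// Hypothetical helper: bytes left for packet data once the headroom is reserved.
inline std::size_t usable_payload(std::size_t data_room_size) {
  return data_room_size > kAssumedHeadroom ? data_room_size - kAssumedHeadroom : 0;
}
```

So a data room of 2048 + 128 bytes leaves 2048 bytes for the packet, and the element
itself is larger again by the rte_mbuf struct; the mempool layer's alignment and
per-object overhead come on top of that and are not modelled here.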

I'd like to press on, then, with one of my questions that, after some additional thought,
is answered, however implicitly. For the benefit of other programmers who are new
to this work, I'll explain. If I'm wrong, please hammer on it.

The other crucial insight is: so long as memory is allocated on the same NUMA
node as the RXQ/TXQ that ultimately uses it, there is only marginal performance
advantage to having per-core caching of mbufs in a mempool as provided by the
cache_size formal argument in rte_mempool_create() here:

https://doc.dpdk.org/api/rte__mempool_8h.html#a503f2f889043a48ca9995878846db2fd

In fact the API doc should really point out the advantage; perhaps it eliminates some
cache sloshing to get the last few percent of performance. My guess is that it is not a
major factor in latency or bandwidth whether or not cache_size==0.
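
For reference, per-lcore caching in rte_mempool_create() is governed by cache_size, which
counts elements, not bytes. A minimal sketch of the constraint as I read it in the API
doc (cache_size no larger than RTE_MEMPOOL_CACHE_MAX_SIZE and no larger than n / 1.5).
The constant 512 and the helper name are my assumptions:

```cpp
#include <cstddef>

// Assumed value of RTE_MEMPOOL_CACHE_MAX_SIZE; verify against your DPDK headers.
constexpr std::size_t kAssumedCacheMax = 512;

// Hypothetical helper: true if cache_size (in elements) looks acceptable for a
// pool of n elements under the documented rule cache_size <= n / 1.5.
inline bool cache_size_ok(std::size_t n, std::size_t cache_size) {
  if (cache_size == 0) return true;                 // per-lcore caching disabled
  if (cache_size > kAssumedCacheMax) return false;  // above the assumed hard cap
  return cache_size * 3 <= n * 2;                   // integer form of cache_size * 1.5 <= n
}
```

Note that the size of an mbuf never enters the check, which is why asking whether
cache_size includes the data room or the headroom has no answer.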

Memory access from an lcore x (aka H/W thread, vCPU) on NUMA node N should cost about the
same as from any other lcore y != x, provided y also runs on N and the memory was allocated
on N. Therefore, lcore affinity to a mempool is pretty much a red herring.
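
What does matter is the NUMA match, and that is easy to check up front. A minimal sketch,
with everything modelled in plain C++ so it stands alone: lcore_socket[i] stands in for
what rte_lcore_to_socket_id(i) would return, and QueuePlan and find_remote_queues are
hypothetical names of mine.

```cpp
#include <cstddef>
#include <vector>

struct QueuePlan {
  unsigned lcore;   // lcore that will poll this queue
  int pool_socket;  // socket_id the queue's mempool was created with
};

// Returns the indices of queues whose mempool lives on a different NUMA node
// than the lcore that will service them (or whose lcore is unknown).
inline std::vector<std::size_t> find_remote_queues(
    const std::vector<int>& lcore_socket, const std::vector<QueuePlan>& plan) {
  std::vector<std::size_t> remote;
  for (std::size_t q = 0; q < plan.size(); ++q) {
    const unsigned lc = plan[q].lcore;
    if (lc >= lcore_socket.size() || lcore_socket[lc] != plan[q].pool_socket) {
      remote.push_back(q);
    }
  }
  return remote;
}
```

Any queue this reports would be paying for cross-node memory access regardless of which
lcore on the right node ends up using the pool.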

Consider this code, which I originally took as indicative of good mempool creation
but which, on further thought, confused me:

https://github.com/erpc-io/eRPC/blob/master/src/transport_impl/dpdk/dpdk_init.cc#L76

  for (size_t i = 0; i < kMaxQueuesPerPort; i++) {
    const std::string pname = get_mempool_name(phy_port, i);
    rte_mempool *mempool =
        rte_pktmbuf_pool_create(pname.c_str(), kNumMbufs, 0 /* cache */,
                                0 /* priv size */, kMbufSize, numa_node);

This has the appearance of creating one mempool per RXQ/TXQ, and in fact that is what
it does. The programmer here ensures the numa_node passed in as the last argument is the
same NUMA node the RXQ/TXQ eventually runs on. Since each lcore has its own mempool, and
because this call passes 0 for cache_size, which rte_pktmbuf_pool_create() forwards to the
underlying mempool, per-lcore caching doesn't arise. (I briefly checked mbuf/rte_mbuf.c to
confirm.) Indeed, lcore vs. mempool affinity is irrelevant provided the RXQ for a given
mempool runs on the same NUMA node as specified in the last argument to
rte_pktmbuf_pool_create().

Let's turn, then, to a larger issue: what happens if different RXQ/TXQs have radically
different needs?

As the code above illustrates, one merely allocates a pool appropriate to an individual
RXQ/TXQ by changing the count and size of mbufs, which is as simple as it can get. You have
10 queues, each with its own memory needs? OK, then allocate one memory pool for each. None
of the other 9 queues will have that mempool pointer. Each queue will use only the mempool
that was specified for it. To beat a dead horse: just make sure the numa_node in the
allocation and the NUMA node that will ultimately run the RXQ/TXQ are the same.
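
A minimal sketch of that bookkeeping, kept free of DPDK headers so it stands alone.
QueueNeed, PoolSpec, plan_pools and the "pool_p<port>_q<queue>" naming scheme are all
hypothetical; the idea is that each PoolSpec would feed one rte_pktmbuf_pool_create()
call, and the resulting pool would be handed to rte_eth_rx_queue_setup() for that queue.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct QueueNeed {
  unsigned n_mbufs;    // how many mbufs this queue's pool should hold
  unsigned data_room;  // data room per mbuf, headroom included
  int socket_id;       // NUMA node of the lcore that will run the queue
};

struct PoolSpec {
  std::string name;  // mempool names must be unique
  unsigned n_mbufs;
  unsigned data_room;
  int socket_id;
};

// Builds one pool description per queue; no pool is shared between queues.
inline std::vector<PoolSpec> plan_pools(unsigned port,
                                        const std::vector<QueueNeed>& needs) {
  std::vector<PoolSpec> specs;
  for (std::size_t q = 0; q < needs.size(); ++q) {
    specs.push_back({"pool_p" + std::to_string(port) + "_q" + std::to_string(q),
                     needs[q].n_mbufs, needs[q].data_room, needs[q].socket_id});
  }
  return specs;
}
```

With one entry per queue, a queue carrying tiny packets and a queue carrying jumbo
frames simply get different n_mbufs and data_room values.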


On Sat, Jan 29, 2022 at 8:23 PM Dmitry Kozlyuk <dmitry.kozliuk@gmail.com> wrote:
2022-01-29 18:46 (UTC-0500), fwefew 4t4tg:
[...]
> 1. Does cache_size include or exclude data_room_size?
> 2. Does cache_size include or exclude sizeof(struct rte_mbuf)?
> 3. Does cache size include or exclude RTE_PKTMBUF_HEADROOM?

Cache size is measured in the number of elements, irrespective of their size.
It is not a memory size, so the questions above are not really meaningful.

> 4. What lcore is the allocated memory pinned to?

Memory is associated with a NUMA node (DPDK calls it "socket"), not an lcore.
Each lcore belongs to one NUMA node, see rte_lcore_to_socket_id().

> The lcore of the caller
> when this method is run? The answer here is important. If it's the lcore of
> the caller when called, this routine should be called in the lcore's entry
> point so it's on the right lcore the memory is intended. Calling it on the
> lcore that happens to be running main, for example, could have a bad side
> effect if it's different from where the memory will be ultimately used.

The NUMA node is controlled by "socket_id" parameter.
Your considerations are correct, often you should create separate mempools
for each NUMA node to avoid this performance issue. (You should also
consider which NUMA node each device belongs to.)

> 5. Which one of the formal arguments represents tail room indicated in
> https://doc.dpdk.org/guides/prog_guide/mbuf_lib.html#figure-mbuf1
[...]
> 5. Unknown. Perhaps if you want private data which corresponds to tail room
> in the diagram above one has to call rte_mempool_create() instead and focus
> on private_data_size.

Incorrect; tail room is simply an unused part at the end of the data room.
Private data is for the entire mempool, not for individual mbufs.

> Mempool creation is like malloc: you request the total number of absolute
> bytes required. The API will not add or remove bytes to the number you
> specify. Therefore the number you give must be inclusive of all needs
> including your payload, any DPDK overhead, headroom, tailroom, and so on.
> DPDK is not adding to the number you give for its own purposes. Clearer?
> Perhaps ... but what needs? Read on ...

On the contrary: rte_pktmbuf_pool_create() takes the amount
of usable memory (dataroom) and adds space for rte_mbuf and the headroom.
Furthermore, the underlying rte_mempool_create() ensures element (mbuf)
alignment, may spread the elements between pages, etc.

[...]
> No. I might not. I might have half my TXQ and RXQs dealing with tiny
> mbufs/packets, and the other half dealing with completely different traffic
> of a completely different size and structure. So I might want memory pool
> allocation to be done on a smaller scale e.g. per RXQ/TXQ/lcore. DPDK
> doesn't seem to permit this.

You can create different mempools for each purpose
and specify the proper mempool to rte_eth_rx_queue_setup().
When creating them, you can and should also take NUMA into account.
Take a look at init_mem() function of examples/l3fwd.