DPDK usage discussions
From: fwefew 4t4tg <7532yahoo@gmail.com>
To: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>, users@dpdk.org
Subject: Re: allocating a mempool w/ rte_pktmbuf_pool_create()
Date: Sat, 29 Jan 2022 21:29:15 -0500
Message-ID: <CA+Tq66Xb10EZVykTtNNRVQp09v5Y3aQ4S9BqAuc3qUrty4uVoQ@mail.gmail.com>
In-Reply-To: <20220130042309.5e590857@sovereign>

Dmitry,

"On the contrary: rte_pktmbuf_pool_create() takes the amount
of usable memory (dataroom) and adds space for rte_mbuf and the headroom.
Furthermore, the underlying rte_mempool_create() ensures element (mbuf)
alignment, may spread the elements between pages, etc."

Thanks. This is a crucial correction to my erroneous statement.

I'd like to press on, then, with one of my questions that, after some
additional thought, is answered, however implicitly. For the benefit of
other programmers who are new to this work, I'll explain. If wrong,
please hammer on it.

The other crucial insight is: so long as memory is allocated on the same
NUMA node as the lcore that services the RXQ/TXQ which ultimately uses
it, there is only marginal performance advantage to having per-lcore
caching of mbufs in a mempool, as controlled by the cache_size formal
argument in rte_mempool_create() here:

https://doc.dpdk.org/api/rte__mempool_8h.html#a503f2f889043a48ca9995878846db2fd

In fact the API doc should really point out the advantage; perhaps it
eliminates some cache sloshing to get the last few percent of
performance. It probably is not a major factor in latency or bandwidth
with or without cache_size==0.

Memory access from an lcore x (aka H/W thread, vCPU) on NUMA node N
costs about the same as from any other lcore y != x, provided y also
runs on N *and the memory was allocated for N*. Therefore, lcore
affinity to a mempool is pretty much a red herring.

Consider this code, which I originally took as indicative of good
mempool creation but which, on further thought, confused me:

https://github.com/erpc-io/eRPC/blob/master/src/transport_impl/dpdk/dpdk_init.cc#L76

  for (size_t i = 0; i < kMaxQueuesPerPort; i++) {
    const std::string pname = get_mempool_name(phy_port, i);
    rte_mempool *mempool =
        rte_pktmbuf_pool_create(pname.c_str(), kNumMbufs, 0 /* cache */,
                                0 /* priv size */, kMbufSize, numa_node);

This has the appearance of creating one mempool for each RXQ and each
TXQ, and in fact that is what it does. The programmer here ensures that
the numa_node passed in as the last argument is the NUMA node on which
the RXQ/TXQ eventually runs. Since each lcore has its own mempool, and
because the call passes 0 for cache_size, per-lcore caching doesn't
arise. (I briefly checked mbuf/rte_mbuf.c to confirm that cache_size is
handed through to the underlying mempool.) Indeed *lcore v. mempool
affinity is irrelevant* provided the RXQ for a given mempool runs on the
same NUMA node as specified in the last argument to
rte_pktmbuf_pool_create().

Let's turn then to a larger issue: what happens if different RXQ/TXQs
have radically different needs?

As the code above illustrates, one merely allocates a pool appropriate
to an individual RXQ/TXQ by changing the count and size of mbufs, which
is as simple as it can get. You have 10 queues, each with its own memory
needs? OK, then allocate one memory pool for each. None of the other 9
queues will have that mempool pointer. Each queue will use only the
mempool that was specified for it. To beat a dead horse: just make sure
the numa_node in the allocation and the NUMA node which will ultimately
run the RXQ/TXQ are the same.


On Sat, Jan 29, 2022 at 8:23 PM Dmitry Kozlyuk <dmitry.kozliuk@gmail.com> wrote:

> 2022-01-29 18:46 (UTC-0500), fwefew 4t4tg:
> [...]
> > 1. Does cache_size include or exclude data_room_size?
> > 2. Does cache_size include or exclude sizeof(struct rte_mbuf)?
> > 3. Does cache size include or exclude RTE_PKTMBUF_HEADROOM?
>
> Cache size is measured in the number of elements, irrespective of their size.
> It is not a memory size, so the questions above are not really meaningful.
>
> > 4. What lcore is the allocated memory pinned to?
>
> Memory is associated with a NUMA node (DPDK calls it "socket"), not an
> lcore. Each lcore belongs to one NUMA node, see
> rte_lcore_to_socket_id().
>
> > The lcore of the caller
> > when this method is run? The answer here is important. If it's the
> > lcore of the caller when called, this routine should be called in the
> > lcore's entry point so it's on the lcore the memory is intended for.
> > Calling it on the lcore that happens to be running main, for example,
> > could have a bad side effect if it's different from where the memory
> > will be ultimately used.
>
> The NUMA node is controlled by "socket_id" parameter.
> Your considerations are correct, often you should create separate mempools
> for each NUMA node to avoid this performance issue. (You should also
> consider which NUMA node each device belongs to.)
>
> > 5. Which one of the formal arguments represents tail room indicated in
> > https://doc.dpdk.org/guides/prog_guide/mbuf_lib.html#figure-mbuf1
> [...]
> > 5. Unknown. Perhaps if you want private data which corresponds to tail
> > room in the diagram above one has to call rte_mempool_create() instead
> > and focus on private_data_size.
>
> Incorrect; tail room is simply an unused part at the end of the data room.
> Private data is for the entire mempool, not for individual mbufs.
>
> > Mempool creation is like malloc: you request the total number of
> > absolute bytes required. The API will not add or remove bytes to the
> > number you specify. Therefore the number you give must be inclusive of
> > all needs including your payload, any DPDK overhead, headroom,
> > tailroom, and so on. DPDK is not adding to the number you give for its
> > own purposes. Clearer? Perhaps ... but what needs? Read on ...
>
> On the contrary: rte_pktmbuf_pool_create() takes the amount
> of usable memory (dataroom) and adds space for rte_mbuf and the headroom.
> Furthermore, the underlying rte_mempool_create() ensures element (mbuf)
> alignment, may spread the elements between pages, etc.
>
> [...]
> > No. I might not. I might have half my TXQ and RXQs dealing with tiny
> > mbufs/packets, and the other half dealing with completely different
> > traffic of a completely different size and structure. So I might want
> > memory pool allocation to be done on a smaller scale e.g. per
> > RXQ/TXQ/lcore. DPDK doesn't seem to permit this.
>
> You can create different mempools for each purpose
> and specify the proper mempool to rte_eth_rx_queue_setup().
> When creating them, you can and should also take NUMA into account.
> Take a look at init_mem() function of examples/l3fwd.
>
