From: fwefew 4t4tg <7532yahoo@gmail.com>
Date: Sat, 29 Jan 2022 21:29:15 -0500
Subject: Re: allocating a mempool w/ rte_pktmbuf_pool_create()
To: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>, users@dpdk.org
List-Id: DPDK usage discussions

Dmitry,

"On the contrary: rte_pktmbuf_pool_create() takes the amount of usable
memory (dataroom) and adds space for rte_mbuf and the headroom.
Furthermore, the underlying rte_mempool_create() ensures element (mbuf)
alignment, may spread the elements between pages, etc."

Thanks. This is a crucial correction to my erroneous statement.

I'd like to press on, then, with one of my questions, which on further
thought is answered, if only implicitly. I'll spell it out for the benefit
of other programmers new to this work; if I'm wrong, please hammer on it.

The other crucial insight is this: so long as memory is allocated on the
same NUMA node as the RXQ/TXQ that ultimately uses it, there is only a
marginal performance advantage to per-lcore caching of mbufs in a mempool,
as provided by the cache_size formal argument to rte_mempool_create() here:

https://doc.dpdk.org/api/rte__mempool_8h.html#a503f2f889043a48ca9995878846db2fd

In fact the API doc should really point out that advantage; perhaps the
cache eliminates some cache sloshing to buy the last few percent of
performance.
It is probably not a major factor in latency or bandwidth whether
cache_size is zero or not. Memory access from an lcore x (aka H/W thread,
vCPU) on NUMA node N costs about the same as from any other distinct lcore
y != x, provided y also runs on N *and the memory was allocated for N*.
Therefore, lcore affinity to a mempool is pretty much a red herring.

Consider this code, which I originally took as indicative of good mempool
creation but which got me confused on further thought:

https://github.com/erpc-io/eRPC/blob/master/src/transport_impl/dpdk/dpdk_init.cc#L76

  for (size_t i = 0; i < kMaxQueuesPerPort; i++) {
    const std::string pname = get_mempool_name(phy_port, i);
    rte_mempool *mempool =
        rte_pktmbuf_pool_create(pname.c_str(), kNumMbufs, 0 /* cache */,
                                0 /* priv size */, kMbufSize, numa_node);
    ...
  }

This has the appearance of creating one mempool per RXQ and per TXQ, and
in fact that is what it does. The programmer ensures the numa_node passed
as the last argument is the same numa_node on which the RXQ/TXQ eventually
runs. Since each queue has its own mempool, and because the call above
passes 0 for the cache argument to rte_pktmbuf_pool_create(), per-lcore
caching doesn't arise. (I briefly checked mbuf/rte_mbuf.c to confirm.)
Indeed, *lcore v. mempool affinity is irrelevant* provided the RXQ for a
given mempool runs on the same numa_node as specified in the last argument
to rte_pktmbuf_pool_create().

Let's turn then to a larger issue: what happens if different RXQ/TXQs have
radically different needs? As the code above illustrates, one merely
allocates a size appropriate to an individual RXQ/TXQ by changing the
count and size of mbufs, which is as simple as it can get. You have 10
queues, each with its own memory needs? OK, then allocate one mempool for
each. None of the other nine queues will hold that mempool pointer; each
queue uses only the mempool specified for it.
To beat a dead horse: just make sure the numa_node in the allocation and
the NUMA node that will ultimately run the RXQ/TXQ are the same.

On Sat, Jan 29, 2022 at 8:23 PM Dmitry Kozlyuk <dmitry.kozliuk@gmail.com> wrote:
> 2022-01-29 18:46 (UTC-0500), fwefew 4t4tg:
> [...]
> > 1. Does cache_size include or exclude data_room_size?
> > 2. Does cache_size include or exclude sizeof(struct rte_mbuf)?
> > 3. Does cache size include or exclude RTE_PKTMBUF_HEADROOM?
>
> Cache size is measured in the number of elements, irrespective of their
> size. It is not a memory size, so the questions above are not really
> meaningful.
>
> > 4. What lcore is the allocated memory pinned to?
>
> Memory is associated with a NUMA node (DPDK calls it "socket"), not an
> lcore. Each lcore belongs to one NUMA node, see rte_lcore_to_socket_id().
>
> > The lcore of the caller
> > when this method is run? The answer here is important. If it's the
> > lcore of the caller when called, this routine should be called in the
> > lcore's entry point so it's on the right lcore the memory is intended.
> > Calling it on the lcore that happens to be running main, for example,
> > could have a bad side effect if it's different from where the memory
> > will be ultimately used.
>
> The NUMA node is controlled by the "socket_id" parameter.
> Your considerations are correct; often you should create separate
> mempools for each NUMA node to avoid this performance issue. (You should
> also consider which NUMA node each device belongs to.)
>
> > 5. Which one of the formal arguments represents tail room indicated in
> > https://doc.dpdk.org/guides/prog_guide/mbuf_lib.html#figure-mbuf1
> [...]
> > 5. Unknown. Perhaps if you want private data which corresponds to tail
> > room in the diagram above one has to call rte_mempool_create() instead
> > and focus on private_data_size.
>
> Incorrect; tail room is simply an unused part at the end of the data
> room. Private data is for the entire mempool, not for individual mbufs.
>
> > Mempool creation is like malloc: you request the total number of
> > absolute bytes required. The API will not add or remove bytes to the
> > number you specify. Therefore the number you give must be inclusive of
> > all needs including your payload, any DPDK overhead, headroom,
> > tailroom, and so on. DPDK is not adding to the number you give for its
> > own purposes. Clearer?
> > Perhaps ... but what needs? Read on ...
>
> On the contrary: rte_pktmbuf_pool_create() takes the amount
> of usable memory (dataroom) and adds space for rte_mbuf and the headroom.
> Furthermore, the underlying rte_mempool_create() ensures element (mbuf)
> alignment, may spread the elements between pages, etc.
>
> [...]
> > No. I might not. I might have half my TXQ and RXQs dealing with tiny
> > mbufs/packets, and the other half dealing with completely different
> > traffic of a completely different size and structure. So I might want
> > memory pool allocation to be done on a smaller scale e.g. per
> > RXQ/TXQ/lcore. DPDK doesn't seem to permit this.
>
> You can create different mempools for each purpose
> and specify the proper mempool to rte_eth_rx_queue_setup().
> When creating them, you can and should also take NUMA into account.
> Take a look at init_mem() function of examples/l3fwd.