From: fwefew 4t4tg <7532yahoo@gmail.com>
Date: Sat, 29 Jan 2022 21:33:00 -0500
Subject: Re: allocating a mempool w/ rte_pktmbuf_pool_create()
To: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>, users@dpdk.org
List-Id: DPDK usage discussions

Apologies, reader: I realize too late that my reference to private_data_size
in connection with rte_mempool_create() below is a typo. I meant cache_size,
for which the doc reads:

  If cache_size is non-zero, the rte_mempool library will try to limit the
  accesses to the common lockless pool, by maintaining a per-lcore object
  cache. This argument must be lower or equal to RTE_MEMPOOL_CACHE_MAX_SIZE
  and n / 1.5. It is advised to choose cache_size to have "n modulo
  cache_size == 0": if this is not the case, some elements will always stay
  in the pool and will never be used. The access to the per-lcore table is
  of course faster than the multi-producer/consumer pool. The cache can be
  disabled if the cache_size argument is set to 0; it can be useful to
  avoid losing objects in cache.
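For concreteness, here is a minimal sketch of following that advice. The
names (make_pool, NB_MBUF, MBUF_CACHE) and sizes are mine, purely
illustrative; the point is that 256 <= RTE_MEMPOOL_CACHE_MAX_SIZE,
256 <= 8192 / 1.5, and 8192 % 256 == 0, so no elements are stranded in the
pool:

    #include <stdio.h>
    #include <rte_errno.h>
    #include <rte_mbuf.h>

    #define NB_MBUF    8192 /* n: total mbufs in the pool (illustrative) */
    #define MBUF_CACHE 256  /* 8192 % 256 == 0 and 256 <= 8192 / 1.5 */

    /* Create a pool whose per-lcore cache obeys the documented
     * constraints, on the caller-supplied NUMA node. */
    static struct rte_mempool *
    make_pool(int socket_id)
    {
        struct rte_mempool *mp;

        mp = rte_pktmbuf_pool_create("pkt_pool", NB_MBUF, MBUF_CACHE,
                                     0 /* priv size */,
                                     RTE_MBUF_DEFAULT_BUF_SIZE, socket_id);
        if (mp == NULL)
            fprintf(stderr, "pool create failed: %s\n",
                    rte_strerror(rte_errno));
        return mp;
    }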
On Sat, Jan 29, 2022 at 9:29 PM fwefew 4t4tg <7532yahoo@gmail.com> wrote:

> Dmitry,
>
> "On the contrary: rte_pktmbuf_pool_create() takes the amount
> of usable memory (dataroom) and adds space for rte_mbuf and the headroom.
> Furthermore, the underlying rte_mempool_create() ensures element (mbuf)
> alignment, may spread the elements between pages, etc."
>
> Thanks. This is a crucial correction to my erroneous statement.
>
> I'd like to press on, then, with one of my questions that, after some
> additional thought, is answered, if only implicitly. I'll spell it out
> for the benefit of other programmers who are new to this work. If I'm
> wrong, please hammer on it.
>
> The other crucial insight is this: as long as memory is allocated on the
> same NUMA node as the RXQ/TXQ that ultimately uses it, there is only a
> marginal performance advantage to the per-lcore caching of mbufs in a
> mempool provided by the private_data_size formal argument to
> rte_mempool_create() here:
>
> https://doc.dpdk.org/api/rte__mempool_8h.html#a503f2f889043a48ca9995878846db2fd
>
> In fact, the API doc should really point out the advantage; perhaps the
> cache eliminates some cache sloshing to win the last few percent of
> performance. It probably is not a major factor in latency or bandwidth
> either way.
>
> Memory access from an lcore x (a.k.a. H/W thread, vCPU) on NUMA node N is
> essentially unchanged for any other distinct lcore y != x, provided y
> also runs on N *and the memory was allocated for N*. Therefore, lcore
> affinity to a mempool is pretty much a red herring.
>
> Consider this code, which I originally took as indicative of good mempool
> creation but which, on further thought, confused me:
>
> https://github.com/erpc-io/eRPC/blob/master/src/transport_impl/dpdk/dpdk_init.cc#L76
>
>   for (size_t i = 0; i < kMaxQueuesPerPort; i++) {
>     const std::string pname = get_mempool_name(phy_port, i);
>     rte_mempool *mempool =
>         rte_pktmbuf_pool_create(pname.c_str(), kNumMbufs, 0 /* cache */,
>                                 0 /* priv size */, kMbufSize, numa_node);
>
> This has the appearance of creating one mempool for each RXQ and each
> TXQ, and in fact that is what it does. The programmer here ensures that
> the numa_node passed as the last argument is the same NUMA node on which
> the RXQ/TXQ eventually runs. Since each queue has its own mempool, and
> because this code passes 0 for the cache argument
> (rte_pktmbuf_pool_create() forwards it to rte_mempool_create() unchanged;
> I briefly checked mbuf/rte_mbuf.c to confirm), per-lcore caching doesn't
> arise. Indeed, *lcore v. mempool affinity is irrelevant* provided the RXQ
> for a given mempool runs on the same numa_node as specified in the last
> argument to rte_pktmbuf_pool_create().
>
> Let's turn, then, to a larger issue: what happens if different RXQ/TXQs
> have radically different needs?
>
> As the code above illustrates, one merely allocates a size appropriate to
> an individual RXQ/TXQ by changing the count and size of the mbufs, which
> is as simple as it can get. You have 10 queues, each with its own memory
> needs? OK, then allocate one memory pool for each. None of the other 9
> queues will hold that mempool pointer; each queue will use only the
> mempool specified for it. To beat a dead horse: just make sure the
> numa_node in the allocation and the NUMA node that will ultimately run
> the RXQ/TXQ are the same.
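To make that per-queue pattern concrete, here is a sketch against the plain
DPDK API; the queue count, ring size, and mbuf count are invented for the
example. Each RXQ gets its own pool, created on the port's NUMA node, and
only that queue is handed the pointer:

    #include <stdio.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define NB_RXQ  4      /* RX queues per port (illustrative) */
    #define NB_MBUF 8192   /* mbufs per queue's pool (illustrative) */
    #define RX_DESC 1024   /* RX ring size (illustrative) */

    /* Create one mempool per RX queue on the port's NUMA node and hand
     * each pool to exactly one queue. */
    static int
    setup_rx_queues(uint16_t port_id)
    {
        int socket_id = rte_eth_dev_socket_id(port_id); /* port's node */
        uint16_t q;

        for (q = 0; q < NB_RXQ; q++) {
            char name[RTE_MEMPOOL_NAMESIZE];
            struct rte_mempool *mp;

            snprintf(name, sizeof(name), "rx_pool_p%u_q%u",
                     (unsigned)port_id, (unsigned)q);
            mp = rte_pktmbuf_pool_create(name, NB_MBUF, 0 /* cache */,
                                         0 /* priv size */,
                                         RTE_MBUF_DEFAULT_BUF_SIZE,
                                         socket_id);
            if (mp == NULL)
                return -1;
            /* The queue, its descriptors, and its mbufs share a node. */
            if (rte_eth_rx_queue_setup(port_id, q, RX_DESC, socket_id,
                                       NULL /* default rxconf */, mp) != 0)
                return -1;
        }
        return 0;
    }

As in the eRPC code, the cache argument is 0, so per-lcore caching never
enters the picture.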
> On Sat, Jan 29, 2022 at 8:23 PM Dmitry Kozlyuk
> <dmitry.kozliuk@gmail.com> wrote:
>
>> 2022-01-29 18:46 (UTC-0500), fwefew 4t4tg:
>> [...]
>> > 1. Does cache_size include or exclude data_room_size?
>> > 2. Does cache_size include or exclude sizeof(struct rte_mbuf)?
>> > 3. Does cache_size include or exclude RTE_PKTMBUF_HEADROOM?
>>
>> Cache size is measured in number of elements, irrespective of their
>> size. It is not a memory size, so the questions above are not really
>> meaningful.
>>
>> > 4. What lcore is the allocated memory pinned to?
>>
>> Memory is associated with a NUMA node (DPDK calls it a "socket"), not
>> with an lcore. Each lcore belongs to one NUMA node; see
>> rte_lcore_to_socket_id().
>>
>> > The lcore of the caller when this method is run? The answer here is
>> > important. If it's the lcore of the caller, this routine should be
>> > called from the lcore's entry point so that the memory lands where it
>> > is intended. Calling it on the lcore that happens to be running main,
>> > for example, could have a bad side effect if that differs from where
>> > the memory will ultimately be used.
>>
>> The NUMA node is controlled by the "socket_id" parameter.
>> Your considerations are correct; you should often create separate
>> mempools for each NUMA node to avoid this performance issue. (You
>> should also consider which NUMA node each device belongs to.)
>>
>> > 5. Which one of the formal arguments represents the tail room
>> > indicated in
>> > https://doc.dpdk.org/guides/prog_guide/mbuf_lib.html#figure-mbuf1
>> [...]
>> > 5. Unknown. Perhaps if you want private data, which corresponds to
>> > the tail room in the diagram above, one has to call
>> > rte_mempool_create() instead and focus on private_data_size.
>>
>> Incorrect; tail room is simply the unused part at the end of the data
>> room. Private data belongs to the entire mempool, not to individual
>> mbufs.
>>
>> > Mempool creation is like malloc: you request the total number of
>> > absolute bytes required. The API will not add or remove bytes from
>> > the number you specify. Therefore the number you give must be
>> > inclusive of all needs: your payload, any DPDK overhead, headroom,
>> > tail room, and so on. DPDK does not add to the number you give for
>> > its own purposes. Clearer? Perhaps ... but what needs? Read on ...
>>
>> On the contrary: rte_pktmbuf_pool_create() takes the amount of usable
>> memory (dataroom) and adds space for rte_mbuf and the headroom.
>> Furthermore, the underlying rte_mempool_create() ensures element (mbuf)
>> alignment, may spread the elements between pages, etc.
>>
>> [...]
>> > No, I might not. I might have half my TXQs and RXQs dealing with tiny
>> > mbufs/packets, and the other half dealing with completely different
>> > traffic of a completely different size and structure. So I might want
>> > memory pool allocation done on a smaller scale, e.g. per
>> > RXQ/TXQ/lcore. DPDK doesn't seem to permit this.
>>
>> You can create a different mempool for each purpose and specify the
>> proper mempool to rte_eth_rx_queue_setup(). When creating them, you can
>> and should also take NUMA into account. Take a look at the init_mem()
>> function of examples/l3fwd.
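For completeness, a minimal sketch of the per-NUMA-node pattern Dmitry
points to, loosely modeled on init_mem() in examples/l3fwd; MAX_SOCKETS,
the pool sizes, and the helper names are mine, purely illustrative:

    #include <stdio.h>
    #include <rte_lcore.h>
    #include <rte_mbuf.h>

    #define MAX_SOCKETS 8    /* upper bound on NUMA nodes (illustrative) */
    #define NB_MBUF     8192
    #define MBUF_CACHE  256  /* 8192 % 256 == 0 */

    static struct rte_mempool *pools[MAX_SOCKETS];

    /* One pool per NUMA node that hosts at least one enabled lcore. */
    static int
    create_per_socket_pools(void)
    {
        unsigned int lcore;

        RTE_LCORE_FOREACH(lcore) {
            unsigned int socket = rte_lcore_to_socket_id(lcore);
            char name[RTE_MEMPOOL_NAMESIZE];

            if (socket >= MAX_SOCKETS || pools[socket] != NULL)
                continue; /* out of range, or node already has a pool */
            snprintf(name, sizeof(name), "pool_s%u", socket);
            pools[socket] = rte_pktmbuf_pool_create(name, NB_MBUF,
                                                    MBUF_CACHE,
                                                    0 /* priv size */,
                                                    RTE_MBUF_DEFAULT_BUF_SIZE,
                                                    (int)socket);
            if (pools[socket] == NULL)
                return -1;
        }
        return 0;
    }

    /* On a worker lcore: the pool local to this lcore's NUMA node. */
    static struct rte_mempool *
    my_pool(void)
    {
        return pools[rte_lcore_to_socket_id(rte_lcore_id())];
    }

Workers then take mbufs only from the pool local to their own socket, which
is exactly the "memory was allocated for N" condition discussed above.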