Subject: Re: [PATCH 3/7] net/bonding: change mbuf pool and ring allocation
From: "Min Hu (Connor)"
To: "Sanford, Robert", Robert Sanford, "dev@dpdk.org"
CC: "chas3@att.com"
Date: Sat, 18 Dec 2021 11:44:47 +0800

Hi, Sanford,

Thanks for your detailed description; some questions follow.

On 2021/12/18 3:49, Sanford, Robert wrote:
> Hello Connor,
>
> Thank you for the questions and comments. I will repeat the questions,
> followed by my answers.
>
> Q: Could you be more detailed: why is mbuf pool caching not needed?
>
> A: The short answer: under certain conditions, we can run out of
> buffers from that small LACPDU mempool. We actually saw this occur
> in production, on mostly-idle links.
>
> For a long explanation, let's assume the following:
> 1. One tx-queue per bond and per underlying ethdev port.
> 2. 256 tx-descriptors (per ethdev port).
> 3. 257 mbufs in each port's LACPDU pool, as computed by
>    bond_mode_8023ad_activate_slave(), and a cache size of 32.
> 4. The app xmits zero packets to this bond for a long time.
> 5. In EAL intr-thread context, LACP tx_machine() allocates one mbuf
>    (LACPDU) per second from the pool and puts it on the LACP tx-ring.
> 6. Every second, another thread (call it the tx-core) calls tx-burst
>    (with zero packets to xmit), finds one mbuf on the LACP tx-ring,
>    and the underlying ethdev PMD puts the mbuf data into a tx-desc.
> 7. The PMD's tx-burst is configured not to clean up used tx-descs
>    until there are almost none free, e.g., fewer than the pool's
>    cache-size * CACHE_FLUSH_THRESH_MULTIPLIER (1.5).
> 8. When cleaning up tx-descs, we may leave up to 47 mbufs in the
>    tx-core's LACPDU-pool cache (not accessible from the intr thread).
>
> When the number of used tx-descs (0..255) plus the number of mbufs in
> the cache (0..47) reaches 257, allocation fails.
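If I follow the accounting in points 1-8, the failure condition reduces
to a back-of-the-envelope check like the one below (a minimal sketch in
plain C; the macro names are mine, not taken from the driver):

#include <stdio.h>

/* Worst-case accounting for the per-port LACPDU pool described in
 * points 1-8 above.  Illustration only; the numbers come from the
 * mail, the names are invented for this sketch. */
#define LACPDU_POOL_SIZE	257	/* mbufs in the LACPDU pool */
#define TX_DESCS		256	/* tx-descriptors per ethdev port */
#define POOL_CACHE_SIZE		32
#define FLUSH_MULT		1.5	/* CACHE_FLUSH_THRESH_MULTIPLIER */

int main(void)
{
	/* The tx-core's per-lcore cache can hold just under
	 * POOL_CACHE_SIZE * FLUSH_MULT mbufs before it is flushed. */
	int max_cached = (int)(POOL_CACHE_SIZE * FLUSH_MULT) - 1;  /* 47 */

	/* tx_machine() runs in the intr thread and cannot drain that
	 * cache, so allocation fails once used tx-descs plus cached
	 * mbufs cover the whole pool. */
	int worst_case = TX_DESCS + max_cached;                    /* 303 */

	printf("pool=%d, worst case tied up=%d -> alloc %s\n",
	       LACPDU_POOL_SIZE, worst_case,
	       worst_case >= LACPDU_POOL_SIZE ? "can fail" : "is safe");
	return 0;
}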
> If I understand the LACP tx-burst code correctly, it would be worse
> if nb_tx_queues > 1, because (assuming multiple tx-cores) any
> queue/lcore could xmit an LACPDU. Thus, up to nb_tx_queues * 47 mbufs
> could be cached, and not accessible from tx_machine().
>
> You would not see this problem if the app xmits other (non-LACP)
> mbufs on a regular basis, to expedite the clean-up of tx-descs
> including LACPDU mbufs (unless nb_tx_queues tx-core caches could hold
> all LACPDU mbufs).
>
I think we normally do not see this problem only because, in the
non-LACP case, the mempool offers far more mbufs than the cache size.

> If we make the mempool's cache size 0, then allocation will not fail.
How about enlarging the mempool instead, e.g., to 4096 mbufs? I think
that could also avoid this bug.

> A mempool cache for LACPDUs does not offer much additional speed:
> during alloc, the intr thread does not have default mempool caches
Why? As far as I know, every core has its own default mempool cache?

> (AFAIK); and the average time between frees is either 1 second (LACP
> short timeouts) or 10 seconds (long timeouts), i.e., infrequent.
>
> --------
>
> Q: Why reserve one additional slot in the rx and tx rings?
>
> A: rte_ring_create() requires the ring size N to be a power of 2,
> but it can only store N-1 items. Thus, if we want to store X items,
Hi Robert, could you explain this part to me? I cannot understand why
it can "only store N-1 items". I checked the source code, which says:
"The real usable ring size is *count-1* instead of *count* to
differentiate a free ring from an empty ring." But I still cannot see
what that means (my attempt to work it out is the toy ring sketch
further down, just before the quoted patch).

> we need to ask for (at least) X+1. The original code fails when the
> real desired size is a power of 2, because in such a case align32pow2
> does not round up.
>
> For example, say we want a ring to hold 4:
>
>     rte_ring_create(... rte_align32pow2(4) ...)
>
> rte_align32pow2(4) returns 4, and we end up with a ring that only
> stores 3 items.
>
>     rte_ring_create(... rte_align32pow2(4+1) ...)
>
> rte_align32pow2(5) returns 8, and we end up with a ring that stores
> up to 7 items, more than we need, but acceptable.
To fix the bug, how about just setting the RING_F_EXACT_SZ flag?
(A sketch of that alternative follows the quoted patch at the end of
this mail.)

> --------
>
> Q: I found the comment for BOND_MODE_8023AX_SLAVE_RX_PKTS is
> wrong, could you fix it in this patch?
>
> A: Yes, I will fix it in the next version of the patch.
Thanks.

> --
> Regards,
> Robert Sanford
>
>
> On 12/16/21, 4:01 AM, "Min Hu (Connor)" wrote:
>
> Hi, Robert,
>
> On 2021/12/16 2:19, Robert Sanford wrote:
> > - Turn off mbuf pool caching to avoid mbufs lingering in pool caches.
> > At most, we transmit one LACPDU per second, per port.
> Could you be more detailed about why mbuf pool caching is not needed?
>
> > - Fix calculation of ring sizes, taking into account that a ring of
> > size N holds up to N-1 items.
> Similarly, why should we reserve another item?
>
> By the way, I found the comment for BOND_MODE_8023AX_SLAVE_RX_PKTS
> is wrong; could you fix it in this patch?
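To check my own understanding of the *count-1* comment, here is a toy
single-producer/single-consumer ring showing the head/tail bookkeeping
(an illustration only, not the rte_ring implementation):

#define TOY_RING_SIZE 4			/* must be a power of 2 */

struct toy_ring {
	void *slots[TOY_RING_SIZE];
	unsigned int head;		/* index of next slot to write */
	unsigned int tail;		/* index of next slot to read */
};

/* "Empty" is head == tail.  If the ring were allowed to hold all
 * TOY_RING_SIZE items, "full" would also be head == tail, and the
 * two states could not be told apart without extra bookkeeping. */
static int toy_ring_empty(const struct toy_ring *r)
{
	return r->head == r->tail;
}

/* So "full" is declared one slot early: a ring of size 4 stores at
 * most 3 items, i.e., the usable size is count-1. */
static int toy_ring_full(const struct toy_ring *r)
{
	return ((r->head + 1) & (TOY_RING_SIZE - 1)) == r->tail;
}

static int toy_ring_enqueue(struct toy_ring *r, void *obj)
{
	if (toy_ring_full(r))
		return -1;
	r->slots[r->head] = obj;
	r->head = (r->head + 1) & (TOY_RING_SIZE - 1);
	return 0;
}

If this matches rte_ring's bookkeeping, asking rte_ring_create() for
size 4 yields a usable capacity of 3, which is exactly the off-by-one
the patch compensates for.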
> > Signed-off-by: Robert Sanford
> > ---
> >  drivers/net/bonding/rte_eth_bond_8023ad.c | 14 ++++++++------
> >  1 file changed, 8 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/net/bonding/rte_eth_bond_8023ad.c b/drivers/net/bonding/rte_eth_bond_8023ad.c
> > index 43231bc..83d3938 100644
> > --- a/drivers/net/bonding/rte_eth_bond_8023ad.c
> > +++ b/drivers/net/bonding/rte_eth_bond_8023ad.c
> > @@ -1101,9 +1101,7 @@ bond_mode_8023ad_activate_slave(struct rte_eth_dev *bond_dev,
> >  	}
> >
> >  	snprintf(mem_name, RTE_DIM(mem_name), "slave_port%u_pool", slave_id);
> > -	port->mbuf_pool = rte_pktmbuf_pool_create(mem_name, total_tx_desc,
> > -		RTE_MEMPOOL_CACHE_MAX_SIZE >= 32 ?
> > -			32 : RTE_MEMPOOL_CACHE_MAX_SIZE,
> > +	port->mbuf_pool = rte_pktmbuf_pool_create(mem_name, total_tx_desc, 0,
> >  		0, element_size, socket_id);
> >
> >  	/* Any memory allocation failure in initialization is critical because
> > @@ -1113,19 +1111,23 @@ bond_mode_8023ad_activate_slave(struct rte_eth_dev *bond_dev,
> >  			slave_id, mem_name, rte_strerror(rte_errno));
> >  	}
> >
> > +	/* Add one extra because ring reserves one. */
> >  	snprintf(mem_name, RTE_DIM(mem_name), "slave_%u_rx", slave_id);
> >  	port->rx_ring = rte_ring_create(mem_name,
> > -			rte_align32pow2(BOND_MODE_8023AX_SLAVE_RX_PKTS), socket_id, 0);
> > +			rte_align32pow2(BOND_MODE_8023AX_SLAVE_RX_PKTS + 1),
> > +			socket_id, 0);
> >
> >  	if (port->rx_ring == NULL) {
> >  		rte_panic("Slave %u: Failed to create rx ring '%s': %s\n", slave_id,
> >  			mem_name, rte_strerror(rte_errno));
> >  	}
> >
> > -	/* TX ring is at least one pkt longer to make room for marker packet. */
> > +	/* TX ring is at least one pkt longer to make room for marker packet.
> > +	 * Add one extra because ring reserves one. */
> >  	snprintf(mem_name, RTE_DIM(mem_name), "slave_%u_tx", slave_id);
> >  	port->tx_ring = rte_ring_create(mem_name,
> > -			rte_align32pow2(BOND_MODE_8023AX_SLAVE_TX_PKTS + 1), socket_id, 0);
> > +			rte_align32pow2(BOND_MODE_8023AX_SLAVE_TX_PKTS + 2),
> > +			socket_id, 0);
> >
> >  	if (port->tx_ring == NULL) {
> >  		rte_panic("Slave %u: Failed to create tx ring '%s': %s\n", slave_id,
> >
>
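For reference, the RING_F_EXACT_SZ alternative I suggested above could
look roughly like this inside bond_mode_8023ad_activate_slave() (an
untested sketch against the same function, not part of the posted
patch):

/* Untested sketch: request the usable count directly and let
 * rte_ring round its internal size up to a power of two itself,
 * instead of calling rte_align32pow2(X + 1) by hand. */
snprintf(mem_name, RTE_DIM(mem_name), "slave_%u_rx", slave_id);
port->rx_ring = rte_ring_create(mem_name,
		BOND_MODE_8023AX_SLAVE_RX_PKTS,
		socket_id, RING_F_EXACT_SZ);

/* The TX ring still needs one extra slot for the marker packet. */
snprintf(mem_name, RTE_DIM(mem_name), "slave_%u_tx", slave_id);
port->tx_ring = rte_ring_create(mem_name,
		BOND_MODE_8023AX_SLAVE_TX_PKTS + 1,
		socket_id, RING_F_EXACT_SZ);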