From: Olivier MATZ
To: "Sanford, Robert", "dev@dpdk.org"
Cc: "declan.doherty@intel.com", "pablo.de.lara.guarch@intel.com"
Date: Wed, 24 Aug 2016 18:14:07 +0200
Subject: Re: [dpdk-dev] [PATCH 3/4] net/bonding: another fix to LACP mempool size

Hi Robert,

On 08/23/2016 10:01 PM, Sanford, Robert wrote:
> Hi Olivier,
>
> On 8/23/16, 11:09 AM, "Olivier MATZ" wrote:
>
> Hi Robert,
>
> On 08/01/2016 10:42 PM, Robert Sanford wrote:
> > The following log message may appear after a slave is idle (or nearly
> > idle) for a few minutes: "PMD: Failed to allocate LACP packet from
> > pool".
> >
> > Problem: All mbufs from a slave's private pool (used exclusively for
> > transmitting LACPDUs) have been allocated and are still sitting in
> > the device's tx descriptor ring and other cores' mempool caches.
> >
> > Solution: Ensure that each slave's tx (LACPDU) mempool owns more than
> > n-tx-queues * (n-tx-descriptors + per-core-mempool-flush-threshold)
> > mbufs.
> >
> > Note that the LACP tx machine function is the only code that allocates
> > from a slave's private pool. It runs in the context of the interrupt
> > thread, and thus it has no mempool cache of its own.
> >
> > Signed-off-by: Robert Sanford
> > ---
> >  drivers/net/bonding/rte_eth_bond_8023ad.c | 10 +++++++---
> >  1 files changed, 7 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/net/bonding/rte_eth_bond_8023ad.c b/drivers/net/bonding/rte_eth_bond_8023ad.c
> > index 2f7ae70..1207896 100644
> > --- a/drivers/net/bonding/rte_eth_bond_8023ad.c
> > +++ b/drivers/net/bonding/rte_eth_bond_8023ad.c
> > @@ -854,6 +854,8 @@ bond_mode_8023ad_activate_slave(struct rte_eth_dev *bond_dev, uint8_t slave_id)
> >          char mem_name[RTE_ETH_NAME_MAX_LEN];
> >          int socket_id;
> >          unsigned element_size;
> > +        unsigned cache_size;
> > +        unsigned cache_flushthresh;
> >          uint32_t total_tx_desc;
> >          struct bond_tx_queue *bd_tx_q;
> >          uint16_t q_id;
> > @@ -890,19 +892,21 @@ bond_mode_8023ad_activate_slave(struct rte_eth_dev *bond_dev, uint8_t slave_id)
> >
> >          element_size = sizeof(struct slow_protocol_frame) + sizeof(struct rte_mbuf)
> >                  + RTE_PKTMBUF_HEADROOM;
> > +        cache_size = RTE_MEMPOOL_CACHE_MAX_SIZE >= 32 ?
> > +                32 : RTE_MEMPOOL_CACHE_MAX_SIZE;
> > +        cache_flushthresh = RTE_MEMPOOL_CALC_CACHE_FLUSHTHRESH(cache_size);
> >
> >          /* The size of the mempool should be at least:
> >           * the sum of the TX descriptors + BOND_MODE_8023AX_SLAVE_TX_PKTS */
> >          total_tx_desc = BOND_MODE_8023AX_SLAVE_TX_PKTS;
> >          for (q_id = 0; q_id < bond_dev->data->nb_tx_queues; q_id++) {
> >                  bd_tx_q = (struct bond_tx_queue*)bond_dev->data->tx_queues[q_id];
> > -                total_tx_desc += bd_tx_q->nb_tx_desc;
> > +                total_tx_desc += bd_tx_q->nb_tx_desc + cache_flushthresh;
> >          }
> >
> >          snprintf(mem_name, RTE_DIM(mem_name), "slave_port%u_pool", slave_id);
> >          port->mbuf_pool = rte_mempool_create(mem_name,
> > -                total_tx_desc, element_size,
> > -                RTE_MEMPOOL_CACHE_MAX_SIZE >= 32 ? 32 : RTE_MEMPOOL_CACHE_MAX_SIZE,
> > +                total_tx_desc, element_size, cache_size,
> >                  sizeof(struct rte_pktmbuf_pool_private), rte_pktmbuf_pool_init,
> >                  NULL, rte_pktmbuf_init, NULL, socket_id, MEMPOOL_F_NO_SPREAD);
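For concreteness, here is what the new sizing works out to with illustrative
numbers, assuming RTE_MEMPOOL_CACHE_MAX_SIZE >= 32 and the 1.5x multiplier
that RTE_MEMPOOL_CALC_CACHE_FLUSHTHRESH() applied at the time (the queue and
descriptor counts below are made up, not from the patch):

    cache_size        = 32
    cache_flushthresh = RTE_MEMPOOL_CALC_CACHE_FLUSHTHRESH(32) = 48

    /* a slave with 2 TX queues of 512 descriptors each */
    total_tx_desc = BOND_MODE_8023AX_SLAVE_TX_PKTS
                  + 2 * (512 + 48)
                  = BOND_MODE_8023AX_SLAVE_TX_PKTS + 1120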
> I'm not very familiar with bonding code, so maybe your patch is correct.
> I think the size of the mempool should be:
>
>   BOND_MODE_8023AX_SLAVE_TX_PKTS +
>     n_cores * RTE_MEMPOOL_CALC_CACHE_FLUSHTHRESH(cache_size)
>
> With n_cores = the number of cores that can dequeue from the mempool.
>
> The safest thing to do would be to have n_cores = RTE_MAX_LCORE. I don't
> know if bond_dev->data->nb_tx_queues corresponds to this definition; if
> yes, you can ignore my comment ;)
>
> Regards,
> Olivier
>
> --
>
> Only one non-EAL thread dequeues from a per-slave, private (LACPDU)
> mempool. That thread calls mbuf-alloc in the context of an
> rte_eal_alarm_set() callback function and enqueues the mbuf to a
> multi-consumer ring.
>
> The consumers of those mbufs can be any cores that invoke the bond PMD's
> tx-burst function. As we know, mbufs can sit on TX descriptor rings long
> after we transmit them, often until we need to reuse the descriptors for
> subsequent TX packets. When the TX cores finally free some of the mbufs,
> some of them are left in a pool's per-core cache. So, the maximum number
> of mbufs that may be lying around doing nothing is the number of "done"
> mbufs in a TX desc ring plus those in a per-core cache, times the number
> of TX cores, or approximately
> nb_tx_queues * (nb_tx_desc + CACHE_FLUSHTHRESH(cache_size)).
> (I probably should also update the comment above the for-loop that
> calculates total_tx_desc.)
>
> When we include normal TX mbufs and private LACPDU mbufs displacing each
> other in the TX desc ring, there are many (if not infinite)
> possibilities. I wrote a program to simulate all this interaction,
> trying different values for cache_size, nb_tx_desc, etc., because with
> long-interval pings alone it took a long time to run out of LACPDU mbufs
> (or it never happened).
>
> Anyway, I hope that this is a nice, long way of saying that I will
> ignore your comment. ;-)

OK, thanks for the detailed explanation.

> Another idea, related to mempools (but not this LACP patch) ...
>
> Back when I first investigated the LACP pool problem, I got an idea
> related to the mempool per-core caches. It seems like a waste that in
> some usage patterns (like the one above), a core may "put" objects to a
> pool but rarely/never "get" objects from it, the result being that it
> rarely/never reuses the objects stuck in its per-core cache. We could
> recognize this kind of behavior by adding a small counter to the
> rte_mempool_cache struct: during a put, we increment the counter; during
> a get, we set it to zero. If, during a put that is going to flush some
> of the objects to the global part of the pool, the counter is at or
> above some threshold (8, 16, 32, or whatever), then we flush ALL of the
> objects, since this core hasn't been using them. What do you think?
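To make the proposal concrete, here is a minimal sketch of that counter
logic in C. The names (toy_cache, put_streak, STALE_PUT_THRESHOLD) are
hypothetical illustrations, not the actual rte_mempool_cache layout:

    #include <stdint.h>

    #define STALE_PUT_THRESHOLD 32  /* "8, 16, 32, or whatever" */

    struct toy_cache {
            uint32_t len;          /* objects currently in the cache */
            uint32_t size;         /* nominal cache size */
            uint32_t flushthresh;  /* flush once len crosses this */
            uint32_t put_streak;   /* consecutive puts since the last get */
    };

    /* get path: this core is consuming again, so reset the streak */
    static void
    toy_cache_note_get(struct toy_cache *c)
    {
            c->put_streak = 0;
    }

    /* put path: return how many objects to flush to the global part of
     * the pool; 0 means keep everything cached */
    static uint32_t
    toy_cache_note_put(struct toy_cache *c)
    {
            c->put_streak++;
            if (c->len < c->flushthresh)
                    return 0;                /* cache not full yet */
            if (c->put_streak >= STALE_PUT_THRESHOLD)
                    return c->len;           /* put-only core: flush ALL */
            return c->len - c->size;         /* normal partial flush */
    }

Tuning the threshold trades how quickly a mostly-idle producer returns its
cached objects against extra enqueue traffic for cores that alternate puts
and gets.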
While I understand the need, it looks a bit complicated to me.

When I read your first patch, I was thinking about automatically
increasing the size of the pool in mempool_create(), according to the
number of cores and the size of the cache. But it would have too many
drawbacks:

- waste of memory in many cases, especially for large objects
- would not take the external caches (a new 16.07 feature) into account
- behavior change
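For reference, that rejected alternative would have amounted to padding
the element count at pool creation, roughly like the sketch below. The
helper mempool_padded_size() is hypothetical, and the 1.5x flush-threshold
multiplier is assumed; none of this was merged:

    #include <rte_config.h>  /* RTE_MAX_LCORE */

    /* Over-provision a pool so that even if every lcore parks a full
     * flush-threshold's worth of objects in its per-core cache, the
     * common pool cannot be exhausted. */
    static unsigned
    mempool_padded_size(unsigned n, unsigned cache_size)
    {
            unsigned flushthresh = (unsigned)(cache_size * 1.5);
            return n + RTE_MAX_LCORE * flushthresh;
    }

Padding for RTE_MAX_LCORE cores means the pool can never run dry, but for
large objects the padding alone may dwarf the useful part of the pool,
hence the first drawback above.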