From: Morten Brørup
To: Honnappa Nagarahalli, Kamalakshitha Aligeri, Bruce Richardson
Cc: nd
Subject: RE: [RFC] mempool: zero-copy cache put bulk
Date: Thu, 10 Nov 2022 11:15:27 +0100
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D874AB@smartserver.smartshare.dk>

> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Wednesday, 9 November 2022 23.46
>
> > +To: Bruce also showed interest in this topic, and might have more
> > insights.
> >
> > > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > > Sent: Wednesday, 9 November 2022 18.58
> > >
> > > > > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > > > > Sent: Sunday, 6 November 2022 00.11
> > > > >
> > > > > + Akshitha, she is working on a similar patch.
> > > > >
> > > > > Few comments inline.
> > > > >
> > > > > > From: Morten Brørup
> > > > > > Sent: Saturday, November 5, 2022 8:40 AM
> > > > > >
> > > > > > Zero-copy access to the mempool cache is beneficial for PMD
> > > > > > performance, and must be provided by the mempool library to
> > > > > > fix [Bug 1052] without a performance regression.
> > > > > >
> > > > > > [Bug 1052]: https://bugs.dpdk.org/show_bug.cgi?id=1052
> > > > > >
> > > > > > This RFC offers a conceptual zero-copy put function, where the
> > > > > > application promises to store some objects, and in return gets
> > > > > > an address where to store them.
> > > > > >
> > > > > > I would like some early feedback.
> > > > > >
> > > > > > Notes:
> > > > > > * Allowing the 'cache' parameter to be NULL, and getting it from
> > > > > >   the mempool instead, was inspired by rte_mempool_cache_flush().
> > > > > I am not sure why the 'cache' parameter is required for this API.
> > > > > This API should take the mempool as the parameter.
> > > > >
> > > > > We have based our API on 'rte_mempool_do_generic_put' and removed
> > > > > the 'cache' parameter.
> > > > I thoroughly considered omitting the 'cache' parameter, but included
> > > > it for two reasons:
> > > >
> > > > 1. The function is a "mempool cache" function (i.e. primarily working
> > > > on the mempool cache), not a "mempool" function.
> > > >
> > > > So it is appropriate to have a pointer directly to the structure it
> > > > is working on. Following this through, I also made 'cache' the first
> > > > parameter and 'mp' the second, like in rte_mempool_cache_flush().
> > > I am wondering if the PMD should be aware of the cache or not. For ex:
> > > in the case of pipeline mode, the RX and TX side of the PMD are
> > > running on different cores.
> >
> > In that example, the PMD can store two cache pointers, one for each of
> > the RX and TX side.
> I did not understand this. If the RX core and TX core have their own
> per-core caches the logic would not work. For ex: the RX core cache
> would not get filled.
>
> In the case of pipeline mode, there will not be a per-core cache. The
> buffers would be allocated and freed from a global ring or a global
> lockless stack.

Aha... Now I understand what you mean: You are referring to use cases where
the mempool is configured to *not* have a mempool cache.

For a mempool without a mempool cache, the proposed "mempool cache" zero-copy
functions can obviously not be used.

We need "mempool" zero-copy functions for the mempools that have no mempool
cache.

However, those functions depend on the mempool's underlying backing store.
E.g. zero-copy access to a ring has certain requirements [1].

[1]: http://doc.dpdk.org/guides/prog_guide/ring_lib.html#ring-peek-zero-copy-api

For a stack, I think it is possible to locklessly zero-copy pop objects. But
it is impossible to locklessly zero-copy push elements to a stack; another
thread can race to pop some objects from the stack before the pushing thread
has finished writing them into the stack.

Furthermore, the ring zero-copy get function cannot return a consecutive
array of objects when wrapping, and PMD functions using vector instructions
usually rely on handling chunks of e.g. 8 objects.

Just for a second, let me theorize into the absurd: Even worse, if a
mempool's underlying backing store does not use an array of pointers as its
internal storage structure, it is impossible to use a pointer to an array of
pointers for zero-copy transactions. E.g. if the backing store uses a list or
a tree structure for its storage, a pointer to somewhere in the list or tree
structure is not an array of object pointers.

Anyway, we could consider designing a generic API for zero-copy mempool
get/put; but it should be compatible with all underlying backing stores - or
return failure, so the PMD can fall back to the standard functions, if the
mempool is in a state where zero-copy access to a contiguous burst cannot be
provided. E.g. zero-copy get from a ring can return failure when zero-copy
access to the ring is temporarily unavailable due to being at a point where
it would wrap.

Here is a conceptual proposal for such an API.

/* Mempool zero-copy transaction state. Opaque outside the mempool API. */
struct rte_mempool_zc_transaction_state {
    char opaque[RTE_CACHE_LINE_SIZE];
};

/** Start zero-copy get/put bulk transaction.
 *
 * @param[in] mp
 *   Pointer to the mempool.
 * @param[out] obj_table_ptr
 *   Where to store the pointer to
 *   the zero-copy array of objects in the mempool.
 * @param[in] n
 *   Number of objects in the transaction.
 * @param[in] cache
 *   Pointer to the mempool cache. May be NULL if unknown.
 * @param[out] transaction
 *   Where to store the opaque transaction information.
 *   Used internally by the mempool library.
 * @return
 *   - 1: Transaction completed;
 *        '_finish' must not be called.
 *   - 0: Transaction started;
 *        '_finish' must be called to complete the transaction.
 *   - <0: Error; failure code.
 */
static __rte_always_inline int
rte_mempool_get/put_zc_bulk_start(
        struct rte_mempool *mp,
        void ***obj_table_ptr,
        unsigned int n,
        struct rte_mempool_cache *cache,
        struct rte_mempool_zc_transaction_state *transaction);

/** Finish zero-copy get/put bulk transaction.
 *
 * @param[in] mp
 *   Pointer to the mempool.
 * @param[in] obj_table
 *   Pointer to the zero-copy array of objects in the mempool,
 *   returned by the 'start' function.
 * @param[in] n
 *   Number of objects in the transaction.
 *   Must be the same as for the 'start' function.
 * @param[in] transaction
 *   Opaque transaction information,
 *   returned by the 'start' function.
 *   Used internally by the mempool library.
 */
static __rte_always_inline void
rte_mempool_get/put_zc_bulk_finish(
        struct rte_mempool *mp,
        void **obj_table,
        unsigned int n,
        struct rte_mempool_zc_transaction_state *transaction);

Note that these are *bulk* functions, so 'n' has to remain the same for a
'finish' call as it was for the 'start' call of a transaction.

And then the underlying backing stores would need to provide callbacks that
implement these functions, if they offer zero-copy functionality.

The mempool implementation of these could start by checking for a mempool
cache, and use the "mempool cache" zero-copy if present.

Some internal state information (from the mempool library or the underlying
mempool backing store) may need to be carried over from the 'start' to the
'finish' function, so I have added a transaction state parameter. The
transaction state must be held by the application for thread safety reasons.
Think of this like the 'flags' parameter to the Linux kernel's
spin_lock_irqsave/irqrestore() functions.

We could omit the 'obj_table' and 'n' parameters from the 'finish' functions
and store them in the transaction state if needed; but we might possibly
achieve higher performance by passing them as parameters instead.
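To make the intended call flow clearer, here is a rough sketch of how a PMD
could use the 'put' variant of these functions. Like the prototypes above, it
is purely conceptual and not compile tested; 'my_objs' is just a placeholder
for wherever the PMD keeps the pointers of the objects it wants to return,
and 'mp', 'n' and 'cache' are whatever the caller already has at hand:

        struct rte_mempool_zc_transaction_state zc;
        void **objs;
        unsigned int i;
        int ret;

        ret = rte_mempool_put_zc_bulk_start(mp, &objs, n, cache, &zc);
        if (unlikely(ret < 0)) {
                /* Zero-copy currently not possible; use the standard put. */
                rte_mempool_generic_put(mp, my_objs, n, cache);
        } else {
                /* Write the object pointers directly into the mempool. */
                for (i = 0; i < n; i++)
                        objs[i] = my_objs[i];
                /* Only a started (0) transaction needs to be completed. */
                if (ret == 0)
                        rte_mempool_put_zc_bulk_finish(mp, objs, n, &zc);
        }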
>
> > And if the PMD is unaware of the cache pointer, it can look it up at
> > runtime using rte_lcore_id(), like it does in the current Intel PMDs.
> >
> > > However, since the rte_mempool_cache_flush API is provided, maybe
> > > that decision is already done? Interestingly, rte_mempool_cache_flush
> > > is called by just a single PMD.
> >
> > I intentionally aligned this RFC with rte_mempool_cache_flush() to
> > maintain consistency.
> >
> > However, the API is not set in stone. It should always be acceptable to
> > consider improved alternatives.
> >
> > > So, the question is, should we allow zero-copy only for the per-core
> > > cache or for other cases as well.
> >
> > I suppose that the mempool library was designed to have a mempool
> > associated with exactly one mempool cache per core. (Alternatively, the
> > mempool can be configured with no mempool caches at all.)
> >
> > We should probably stay loyal to that design concept, and only allow
> > zero-copy for the per-core cache.
> >
> > If you can come up with an example of the opposite, I would like to
> > explore that option too... I can't think of a good example myself, and
> > perhaps I'm overlooking a relevant use case.
> The use case I am talking about is the pipeline mode as I mentioned
> above. Let me know if you agree.

I see what you mean, and I don't object. :-)

However, I still think the "mempool cache" zero-copy functions could be
useful.

They would be needed for the generic "mempool" zero-copy functions anyway.

And the "mempool cache" zero-copy functions are much simpler to design,
implement and use than the "mempool" zero-copy functions, so they are a good
first step.

>
> >
> > > > 2. In most cases, the function only accesses the mempool structure
> > > > in order to get the cache pointer. Skipping this step improves
> > > > performance.
> > > >
> > > > And since the cache is created along with the mempool itself (and
> > > > thus never changes for a mempool), it would be safe for the PMD to
> > > > store the 'cache' pointer along with the 'mp' pointer in the PMD's
> > > > queue structure.
> > > Agreed
> > > >
> > > > E.g. in the i40e PMD the i40e_rx_queue structure could include a
> > > > "struct rte_mempool_cache *cache" field, which could be used in
> > > > i40e_rxq_rearm() [1] instead of
> > > > "cache = rte_mempool_default_cache(rxq->mp, rte_lcore_id())".
> > > >
> > > > [1] https://elixir.bootlin.com/dpdk/v22.11-rc2/source/drivers/net/i40e/i40e_rxtx_vec_avx512.c#L31
> > > >
> > > > > This new API, on success, returns the pointer to memory where the
> > > > > objects are copied. On failure it returns NULL and the caller has
> > > > > to call 'rte_mempool_ops_enqueue_bulk'. Alternatively, the new API
> > > > > could do this as well and the PMD does not need to do anything if
> > > > > it gets a NULL pointer.
> > > >
> > > > Yes, we agree about these two details:
> > > >
> > > > 1. The function should return a pointer, not an integer.
> > > > It would be a waste to use another CPU register to convey a
> > > > success/error integer value, when the success/failure information is
> > > > just as easily conveyed by the pointer return value (non-NULL/NULL),
> > > > and rte_errno for the various error values in the unlikely cases.
> > > >
> > > > 2. The function should leave it up to the PMD what to do if direct
> > > > access to the cache is unavailable.
> > > Just wondering about the advantage of this. I do not think PMDs have
> > > much of a choice other than calling 'rte_mempool_ops_enqueue_bulk'.
> >
> > I agree, but that was not my point. Let me try to rephrase:
> >
> > The PMD is more likely to know how to efficiently build the array of
> > mbufs to pass to rte_mempool_ops_enqueue_bulk() than the mempool library
> > - many PMDs already implement a variety of vector instruction variants
> > to do exactly that. So we should not try to be clever and add a fallback
> > path - this job belongs in the PMD.
> >
> > The PMD might not even have the array of mbufs lined up when calling
> > rte_mempool_cache_put_bulk_promise(). The PMD could have an array of
> > internal structures, where the mbuf pointer is an element in that
> > structure.
> Agree, you are correct. We should leave it to the PMD to handle the
> failure case.
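Exactly. To illustrate with a rough sketch (conceptual only, not compile
tested): a PMD's TX free path could use the 'promise' function and handle
the failure case itself, e.g. like this, where 'txq', its 'sw_ring' of
internal per-descriptor structures and the 'first' index are placeholders
for whatever the PMD keeps in its queue structure:

        void **cache_objs;
        unsigned int i;

        cache_objs = rte_mempool_cache_put_bulk_promise(txq->cache, txq->mp, n);
        if (likely(cache_objs != NULL)) {
                /* Zero-copy: write the mbuf pointers directly into the cache. */
                for (i = 0; i < n; i++)
                        cache_objs[i] = txq->sw_ring[first + i].mbuf;
        } else {
                /* Fallback, handled by the PMD: build the array itself,
                 * e.g. using its vector code, and enqueue to the backend. */
                void *objs[n];

                for (i = 0; i < n; i++)
                        objs[i] = txq->sw_ring[first + i].mbuf;
                rte_mempool_ops_enqueue_bulk(txq->mp, objs, n);
        }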
>
> > > > > We should think about providing a similar API on the RX side to
> > > > > keep it symmetric.
> > > >
> > > > I sent an RFC for that too:
> > > > http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87488@smartserver.smartshare.dk/T/#u
> > > >
> > > > > > * Asserting that the 'mp' parameter is not NULL is not done by
> > > > > >   other functions, so I omitted it here too.
> > > > > >
> > > > > > NB: Please ignore formatting. Also, this code has not even been
> > > > > > compile tested.
> > > > > We are a little bit ahead: we have tested the changes with the
> > > > > i40e PF PMD, written unit test cases, and are going through
> > > > > internal review; we will send out an RFC on Monday.
> > > >
> > > > Sounds good. Looking forward to review.
> > > >
> > > > > > /**
> > > > > >  * Promise to put objects in a mempool via zero-copy access to a
> > > > > >  * user-owned mempool cache.
> > > > > >  *
> > > > > >  * @param cache
> > > > > >  *   A pointer to the mempool cache.
> > > > > >  * @param mp
> > > > > >  *   A pointer to the mempool.
> > > > > >  * @param n
> > > > > >  *   The number of objects to be put in the mempool cache.
> > > > > >  * @return
> > > > > >  *   The pointer to where to put the objects in the mempool cache.
> > > > > >  *   NULL on error, with rte_errno set appropriately.
> > > > > >  */
> > > > > > static __rte_always_inline void *
> > > > > > rte_mempool_cache_put_bulk_promise(struct rte_mempool_cache *cache,
> > > > > >         struct rte_mempool *mp,
> > > > > >         unsigned int n)
> > > > > > {
> > > > > >     void **cache_objs;
> > > > > >
> > > > > >     if (cache == NULL)
> > > > > >         cache = rte_mempool_default_cache(mp, rte_lcore_id());
> > > Any reason we need this? If we are expecting the PMD to store the
> > > pointer to the cache and a NULL is passed, it would mean it is a
> > > mempool with no per-core cache?
> > > We could also leave the NULL check to the PMD.
> >
> > Personally, I would strongly prefer requiring the cache pointer to be
> > valid, and use RTE_ASSERT() here, instead of allowing a NULL pointer as
> > a special case to look it up inside the function. I consider this
> > special NULL case useless bloat, which does not belong in a fast path
> > library function.
> >
> > But I copied this approach from rte_mempool_cache_flush().
> The API definition does not bind it to do this check. We might be able
> to delete the check in rte_mempool_cache_flush.
>
> > We could expose an "unsafe" function where it is not allowed to pass
> > NULL pointers, and a "safe" function (fixing the cache pointer if NULL)
> > for consistency.
> >
> > If the rte_mempool_cache_flush() function is popular, we could also
> > expose an "unsafe" variant where passing NULL pointers is disallowed.
> >
> > I wonder if there are any examples of such safe/unsafe variants in DPDK?
> > It would be nice with a common naming convention for such function
> > variants.
> >
> > > > > >     if (cache == NULL) {
> > > > > >         rte_errno = EINVAL;
> > > > > >         return NULL;
> > > > > >     }
> > > > > >
> > > > > >     rte_mempool_trace_cache_put_bulk_promise(cache, mp, n);
> > > > > >
> > > > > >     /* The request itself is too big for the cache */
> > > > > >     if (unlikely(n > cache->flushthresh)) {
> > > > > >         rte_errno = EINVAL;
> > > > > >         return NULL;
> > > > > >     }
> > > > > >
> > > > > >     /*
> > > > > >      * The cache follows the following algorithm:
> > > > > >      *   1. If the objects cannot be added to the cache without
> > > > > >      *      crossing the flush threshold, flush the cache to the
> > > > > >      *      backend.
> > > > > >      *   2. Add the objects to the cache.
> > > > > >      */
> > > > > >
> > > > > >     if (cache->len + n <= cache->flushthresh) {
> > > > > >         cache_objs = &cache->objs[cache->len];
> > > > > >         cache->len += n;
> > > > > >     } else {
> > > > > >         cache_objs = &cache->objs[0];
> > > > > >         rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> > > > > >         cache->len = n;
> > > > > >     }
> > > > > >
> > > > > >     RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
> > > > > >     RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
> > > These are new stats. Do these break ABI compatibility (though they
> > > are under the DEBUG flag)?
> >
> > They are not mempool cache stats; they are only kept in the cache
> > structure to provide alternative (i.e. faster) update access to some
> > (i.e. the most often updated) of the existing mempool stats. The patch
> > is [1], and is part of a series currently being discussed regarding
> > whether it should go into 22.11-rc3 or not [2].
> >
> > [1]: https://patchwork.dpdk.org/project/dpdk/patch/20221109181852.109856-3-mb@smartsharesystems.com/
> > [2]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D874A6@smartserver.smartshare.dk/T/#m41bf4e8bd886db49f11c8dbd63741b353277082f
> >
> > > > > >
> > > > > >     return cache_objs;
> > > > > > }
> > > > > >
> > > > > > Med venlig hilsen / Kind regards,
> > > > > > -Morten Brørup