From: "Eads, Gage"
To: Honnappa Nagarahalli, "dev@dpdk.org"
CC: "olivier.matz@6wind.com", "arybchenko@solarflare.com", "Richardson, Bruce",
 "Ananyev, Konstantin", "Gavin Hu (Arm Technology China)", nd, nd
Date: Sat, 19 Jan 2019 00:00:35 +0000
Message-ID: <9184057F7FC11744A2107296B6B8EB1E541C9612@FMSMSX108.amr.corp.intel.com>
References: <20190116151835.22424-1-gage.eads@intel.com>
 <20190117153659.28477-1-gage.eads@intel.com>
 <20190117153659.28477-3-gage.eads@intel.com>
Subject: Re: [dpdk-dev] [PATCH v4 2/2] mempool/nb_stack: add non-blocking stack mempool

> -----Original Message-----
> From: Eads, Gage
> Sent: Friday, January 18, 2019 2:10 PM
> To: 'Honnappa Nagarahalli'; dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce;
> Ananyev, Konstantin; Gavin Hu (Arm Technology China); nd; nd
> Subject: RE: [dpdk-dev] [PATCH v4 2/2] mempool/nb_stack: add non-blocking
> stack mempool
>
>
>
> > -----Original Message-----
> > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > Sent: Thursday, January 17, 2019 11:05 PM
> > To: Eads, Gage; dev@dpdk.org
> > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce;
> > Ananyev, Konstantin; Gavin Hu (Arm Technology China); nd;
> > Honnappa Nagarahalli; nd
> > Subject: RE: [dpdk-dev] [PATCH v4 2/2] mempool/nb_stack: add
> > non-blocking stack mempool
> >
> > Hi Gage,
> > Thank you for your contribution on non-blocking data structures.
> > I think they are important to extend DPDK into additional use cases.
> >
>
> Glad to hear it. Be sure to check out my non-blocking ring patchset as well, if
> you haven't already: http://mails.dpdk.org/archives/dev/2019-January/123774.html
>
> > I am wondering if it makes sense to decouple the NB stack data
> > structure from the mempool driver (similar to rte_ring)? I see that the
> > stack-based mempool implements the stack data structure in the driver.
> > But an NB stack might not be such a trivial data structure. It might be
> > useful for applications or other use cases as well.
> >
>
> I agree -- and you're not the first to suggest this :).
>
> I'm going to defer that work to a later patchset; creating a new lib/ directory
> requires tech board approval (IIRC), which would unnecessarily delay getting
> this mempool handler merged.
>
> > I also suggest that we use the C11 __atomic_xxx APIs for memory
> > operations. The rte_atomic64_xxx APIs use the __sync_xxx APIs, which do
> > not provide the capability to express memory orderings.
> >
>
> Ok, I will add those (dependent on RTE_USE_C11_MEM_MODEL).
>
> > Please find a few comments inline.
> >
> > >
> > > This commit adds support for a non-blocking (linked-list-based) stack
> > > mempool handler. The stack uses a 128-bit compare-and-swap
> > > instruction, and thus is limited to x86_64. The 128-bit CAS
> > > atomically updates the stack top pointer and a modification counter,
> > > which protects against the ABA problem.
> > >
> > > In mempool_perf_autotest the lock-based stack outperforms the
> > > non-blocking handler*, however:
> > > - For applications with preemptible pthreads, a lock-based stack's
> > >   worst-case performance (i.e. one thread being preempted while
> > >   holding the spinlock) is much worse than the non-blocking stack's.
> > > - Using per-thread mempool caches will largely mitigate the performance
> > >   difference.
> > >
> > > *Test setup: x86_64 build with the default config, dual-socket Xeon
> > > E5-2699 v4, running on isolcpus cores with a tickless scheduler. The
> > > lock-based stack's rate_persec was 1x-3.5x the non-blocking stack's.
> > >
> > > Signed-off-by: Gage Eads
> > > Acked-by: Andrew Rybchenko
> > > ---
> > >  MAINTAINERS                                      |   4 +
> > >  config/common_base                               |   1 +
> > >  doc/guides/prog_guide/env_abstraction_layer.rst  |   5 +
> > >  drivers/mempool/Makefile                         |   3 +
> > >  drivers/mempool/meson.build                      |   3 +-
> > >  drivers/mempool/nb_stack/Makefile                |  23 ++++
> > >  drivers/mempool/nb_stack/meson.build             |   6 +
> > >  drivers/mempool/nb_stack/nb_lifo.h               | 147 +++++++++++++++
> > >  drivers/mempool/nb_stack/rte_mempool_nb_stack.c  | 125 ++++++++++++++
> > >  .../nb_stack/rte_mempool_nb_stack_version.map    |   4 +
> > >  mk/rte.app.mk                                    |   7 +-
> > >  11 files changed, 325 insertions(+), 3 deletions(-)
> > >  create mode 100644 drivers/mempool/nb_stack/Makefile
> > >  create mode 100644 drivers/mempool/nb_stack/meson.build
> > >  create mode 100644 drivers/mempool/nb_stack/nb_lifo.h
> > >  create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> > >  create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
> > >
> > > diff --git a/MAINTAINERS b/MAINTAINERS
> > > index 470f36b9c..5519d3323 100644
> > > --- a/MAINTAINERS
> > > +++ b/MAINTAINERS
> > > @@ -416,6 +416,10 @@ M: Artem V.
> > > Andreev
> > >
> > >  M: Andrew Rybchenko
> > >  F: drivers/mempool/bucket/
> > >
> > > +Non-blocking stack memory pool
> > > +M: Gage Eads
> > > +F: drivers/mempool/nb_stack/
> > > +
> > >
> > >  Bus Drivers
> > >  -----------
> > > diff --git a/config/common_base b/config/common_base
> > > index 964a6956e..8a51f36b1 100644
> > > --- a/config/common_base
> > > +++ b/config/common_base
> > > @@ -726,6 +726,7 @@ CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=n
> > >  #
> > >  CONFIG_RTE_DRIVER_MEMPOOL_BUCKET=y
> > >  CONFIG_RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB=64
> > > +CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK=y
> > >  CONFIG_RTE_DRIVER_MEMPOOL_RING=y
> > >  CONFIG_RTE_DRIVER_MEMPOOL_STACK=y
> > >
> > > diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst
> > > b/doc/guides/prog_guide/env_abstraction_layer.rst
> > > index 929d76dba..9497b879c 100644
> > > --- a/doc/guides/prog_guide/env_abstraction_layer.rst
> > > +++ b/doc/guides/prog_guide/env_abstraction_layer.rst
> > > @@ -541,6 +541,11 @@ Known Issues
> > >
> > >    5. It MUST not be used by multi-producer/consumer pthreads, whose
> > >       scheduling policies are SCHED_FIFO or SCHED_RR.
> > >
> > > +  Alternatively, x86_64 applications can use the non-blocking stack mempool handler. When considering this handler, note that:
> > > +
> > > +  - it is limited to the x86_64 platform, because it uses an instruction (16-byte compare-and-swap) that is not available on other platforms.
> >                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > The Arm architecture supports similar instructions. I suggest simplifying this
> > statement to indicate that 'the nb_stack feature is currently available for
> > x86_64 platforms'.
> >
>
> Will do.
>
> > > +  - it has worse average-case performance than the non-preemptive rte_ring, but software caching (e.g. the mempool cache) can mitigate this by reducing the number of handler operations.
> > > +
> > > + + rte_timer
> > >
> > >    Running ``rte_timer_manage()`` on a non-EAL pthread is not allowed.
> > > However, resetting/stopping the timer from a non-EAL pthread is allowed.
> > > diff --git a/drivers/mempool/Makefile b/drivers/mempool/Makefile
> > > index 28c2e8360..895cf8a34 100644
> > > --- a/drivers/mempool/Makefile
> > > +++ b/drivers/mempool/Makefile
> > > @@ -10,6 +10,9 @@ endif
> > >  ifeq ($(CONFIG_RTE_EAL_VFIO)$(CONFIG_RTE_LIBRTE_FSLMC_BUS),yy)
> > >  DIRS-$(CONFIG_RTE_LIBRTE_DPAA2_MEMPOOL) += dpaa2
> > >  endif
> > > +ifeq ($(CONFIG_RTE_ARCH_X86_64),y)
> > > +DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += nb_stack
> > > +endif
> > >  DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_RING) += ring
> > >  DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK) += stack
> > >  DIRS-$(CONFIG_RTE_LIBRTE_OCTEONTX_MEMPOOL) += octeontx
> > > diff --git a/drivers/mempool/meson.build b/drivers/mempool/meson.build
> > > index 4527d9806..220cfaf63 100644
> > > --- a/drivers/mempool/meson.build
> > > +++ b/drivers/mempool/meson.build
> > > @@ -1,7 +1,8 @@
> > >  # SPDX-License-Identifier: BSD-3-Clause
> > >  # Copyright(c) 2017 Intel Corporation
> > >
> > > -drivers = ['bucket', 'dpaa', 'dpaa2', 'octeontx', 'ring', 'stack']
> > > +drivers = ['bucket', 'dpaa', 'dpaa2', 'nb_stack', 'octeontx',
> > > +'ring', 'stack']
> > > +
> > >  std_deps = ['mempool']
> > >  config_flag_fmt = 'RTE_LIBRTE_@0@_MEMPOOL'
> > >  driver_name_fmt = 'rte_mempool_@0@'
> > > diff --git a/drivers/mempool/nb_stack/Makefile
> > > b/drivers/mempool/nb_stack/Makefile
> > > new file mode 100644
> > > index 000000000..318b18283
> > > --- /dev/null
> > > +++ b/drivers/mempool/nb_stack/Makefile
> > > @@ -0,0 +1,23 @@
> > > +# SPDX-License-Identifier: BSD-3-Clause
> > > +# Copyright(c) 2019 Intel Corporation
> > > +
> > > +include $(RTE_SDK)/mk/rte.vars.mk
> > > +
> > > +#
> > > +# library name
> > > +#
> > > +LIB = librte_mempool_nb_stack.a
> > > +
> > > +CFLAGS += -O3
> > > +CFLAGS += $(WERROR_FLAGS)
> > > +
> > > +# Headers
> > > +LDLIBS += -lrte_eal -lrte_mempool
> > > +
> > > +EXPORT_MAP := rte_mempool_nb_stack_version.map
> > > +
> > > +LIBABIVER := 1
> > > +
> > > +SRCS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += rte_mempool_nb_stack.c
> > > +
> > > +include $(RTE_SDK)/mk/rte.lib.mk
> > > diff --git a/drivers/mempool/nb_stack/meson.build
> > > b/drivers/mempool/nb_stack/meson.build
> > > new file mode 100644
> > > index 000000000..7dec72242
> > > --- /dev/null
> > > +++ b/drivers/mempool/nb_stack/meson.build
> > > @@ -0,0 +1,6 @@
> > > +# SPDX-License-Identifier: BSD-3-Clause
> > > +# Copyright(c) 2019 Intel Corporation
> > > +
> > > +build = dpdk_conf.has('RTE_ARCH_X86_64')
> > > +
> > > +sources = files('rte_mempool_nb_stack.c')
> > > diff --git a/drivers/mempool/nb_stack/nb_lifo.h
> > > b/drivers/mempool/nb_stack/nb_lifo.h
> > > new file mode 100644
> > > index 000000000..ad4a3401f
> > > --- /dev/null
> > > +++ b/drivers/mempool/nb_stack/nb_lifo.h
> > > @@ -0,0 +1,147 @@
> > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > + * Copyright(c) 2019 Intel Corporation
> > > + */
> > > +
> > > +#ifndef _NB_LIFO_H_
> > > +#define _NB_LIFO_H_
> > > +
> > > +struct nb_lifo_elem {
> > > +	void *data;
> > > +	struct nb_lifo_elem *next;
> > > +};
> > > +
> > > +struct nb_lifo_head {
> > > +	struct nb_lifo_elem *top; /**< Stack top */
> > > +	uint64_t cnt; /**< Modification counter */
> > > +};
> >
> > Minor comment: mentioning the ABA problem in the comment for 'cnt' would
> > be helpful.
> >
>
> Sure.
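
For the 'cnt' comment, I'm thinking of something along these lines for the
next version (just a sketch -- the exact wording isn't final):

	struct nb_lifo_head {
		struct nb_lifo_elem *top; /**< Stack top */
		uint64_t cnt; /**< Modification counter. It is incremented on
			       *   every push and pop and is compared as part
			       *   of the 128-bit CAS, so a top pointer that
			       *   was popped and later re-pushed (ABA) no
			       *   longer matches and the CAS fails.
			       */
	};
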
>
> > > +
> > > +struct nb_lifo {
> > > +	volatile struct nb_lifo_head head __rte_aligned(16);
> > > +	rte_atomic64_t len;
> > > +} __rte_cache_aligned;
> > > +
> > > +static __rte_always_inline void
> > > +nb_lifo_init(struct nb_lifo *lifo)
> > > +{
> > > +	memset(lifo, 0, sizeof(*lifo));
> > > +	rte_atomic64_set(&lifo->len, 0);
> > > +}
> > > +
> > > +static __rte_always_inline unsigned int
> > > +nb_lifo_len(struct nb_lifo *lifo)
> > > +{
> > > +	/* nb_lifo_push() and nb_lifo_pop() do not update the list's contents
> > > +	 * and lifo->len atomically, which can cause the list to appear shorter
> > > +	 * than it actually is if this function is called while other threads
> > > +	 * are modifying the list.
> > > +	 *
> > > +	 * However, given the inherently approximate nature of the get_count
> > > +	 * callback -- even if the list and its size were updated atomically,
> > > +	 * the size could change between when get_count executes and when the
> > > +	 * value is returned to the caller -- this is acceptable.
> > > +	 *
> > > +	 * The lifo->len updates are placed such that the list may appear to
> > > +	 * have fewer elements than it does, but will never appear to have more
> > > +	 * elements. If the mempool is near-empty to the point that this is a
> > > +	 * concern, the user should consider increasing the mempool size.
> > > +	 */
> > > +	return (unsigned int)rte_atomic64_read(&lifo->len);
> > > +}
> > > +
> > > +static __rte_always_inline void
> > > +nb_lifo_push(struct nb_lifo *lifo,
> > > +	     struct nb_lifo_elem *first,
> > > +	     struct nb_lifo_elem *last,
> > > +	     unsigned int num)
> > > +{
> > > +	while (1) {
> > > +		struct nb_lifo_head old_head, new_head;
> > > +
> > > +		old_head = lifo->head;
> > > +
> > > +		/* Swing the top pointer to the first element in the list and
> > > +		 * make the last element point to the old top.
> > > +		 */
> > > +		new_head.top = first;
> > > +		new_head.cnt = old_head.cnt + 1;
> > > +
> > > +		last->next = old_head.top;
> > > +
> > > +		if (rte_atomic128_cmpset((volatile uint64_t *)&lifo->head,
> > > +					 (uint64_t *)&old_head,
> > > +					 (uint64_t *)&new_head))
> > > +			break;
> > > +	}
> >
> > Minor comment: this can be a do-while loop (for example, similar to the
> > one in __rte_ring_move_prod_head).
> >
>
> Sure.
>
> > > +
> > > +	rte_atomic64_add(&lifo->len, num);
> > > +}
> > > +
> > > +static __rte_always_inline void
> > > +nb_lifo_push_single(struct nb_lifo *lifo, struct nb_lifo_elem *elem)
> > > +{
> > > +	nb_lifo_push(lifo, elem, elem, 1);
> > > +}
> > > +
> > > +static __rte_always_inline struct nb_lifo_elem *
> > > +nb_lifo_pop(struct nb_lifo *lifo,
> > > +	    unsigned int num,
> > > +	    void **obj_table,
> > > +	    struct nb_lifo_elem **last)
> > > +{
> > > +	struct nb_lifo_head old_head;
> > > +
> > > +	/* Reserve num elements, if available */
> > > +	while (1) {
> > > +		uint64_t len = rte_atomic64_read(&lifo->len);
> > > +
> > > +		/* Does the list contain enough elements? */
> > > +		if (len < num)
> > > +			return NULL;
> > > +
> > > +		if (rte_atomic64_cmpset((volatile uint64_t *)&lifo->len,
> > > +					len, len - num))
> > > +			break;
> > > +	}
> > > +
> > > +	/* Pop num elements */
> > > +	while (1) {
> > > +		struct nb_lifo_head new_head;
> > > +		struct nb_lifo_elem *tmp;
> > > +		unsigned int i;
> > > +
> > > +		old_head = lifo->head;
> > > +
> > > +		tmp = old_head.top;
> > > +
> > > +		/* Traverse the list to find the new head. A next pointer will
> > > +		 * either point to another element or NULL; if a thread
> > > +		 * encounters a pointer that has already been popped, the CAS
> > > +		 * will fail.
> > > +		 */
> > > +		for (i = 0; i < num && tmp != NULL; i++) {
> > > +			if (obj_table)
> >
> > This 'if' check can be outside the for loop. Maybe use RTE_ASSERT at the
> > beginning of the function?
> >
>
> A NULL obj_table pointer isn't an error -- nb_stack_enqueue() calls this function
> with NULL because it doesn't need the popped elements added to a table. When
> the compiler inlines this function into nb_stack_enqueue(), it can use constant
> propagation to optimize away the if-statement.
>
> I don't think that's possible for the other caller, nb_stack_dequeue(), though,
> unless we add a NULL pointer check to the beginning of that function. Then it
> would be guaranteed that obj_table is non-NULL, and the compiler can optimize
> away the if-statement. I'll add that.
>
> > > +				obj_table[i] = tmp->data;
> > > +			if (last)
> > > +				*last = tmp;
> > > +			tmp = tmp->next;
> > > +		}
> > > +
> > > +		/* If NULL was encountered, the list was modified while
> > > +		 * traversing it. Retry.
> > > +		 */
> > > +		if (i != num)
> > > +			continue;
> > > +
> > > +		new_head.top = tmp;
> > > +		new_head.cnt = old_head.cnt + 1;
> > > +
> > > +		if (rte_atomic128_cmpset((volatile uint64_t *)&lifo->head,
> > > +					 (uint64_t *)&old_head,
> > > +					 (uint64_t *)&new_head))
> > > +			break;
> > > +	}
> > > +
> > > +	return old_head.top;
> > > +}
> > > +
> > > +#endif /* _NB_LIFO_H_ */
> > > diff --git a/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> > > b/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> > > new file mode 100644
> > > index 000000000..1818a2cfa
> > > --- /dev/null
> > > +++ b/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> > > @@ -0,0 +1,125 @@
> > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > + * Copyright(c) 2019 Intel Corporation
> > > + */
> > > +
> > > +#include
> > > +#include
> > > +#include
> > > +
> > > +#include "nb_lifo.h"
> > > +
> > > +struct rte_mempool_nb_stack {
> > > +	uint64_t size;
> > > +	struct nb_lifo used_lifo; /**< LIFO containing mempool pointers */
> > > +	struct nb_lifo free_lifo; /**< LIFO containing unused LIFO elements */
> > > +};
> > > +
> > > +static int
> > > +nb_stack_alloc(struct rte_mempool *mp)
> > > +{
> > > +	struct rte_mempool_nb_stack *s;
> > > +	struct nb_lifo_elem *elems;
> > > +	unsigned int n = mp->size;
> > > +	unsigned int size, i;
> > > +
> > > +	size = sizeof(*s) + n * sizeof(struct nb_lifo_elem);
> >
> > IMO, the allocation of the stack elements can be moved under the
> > nb_lifo_init API; it would make the nb stack code more modular.
> >
>
> (see below)
>
> > > +
> > > +	/* Allocate our local memory structure */
> > > +	s = rte_zmalloc_socket("mempool-nb_stack",
> > > +			       size,
> > > +			       RTE_CACHE_LINE_SIZE,
> > > +			       mp->socket_id);
> > > +	if (s == NULL) {
> > > +		RTE_LOG(ERR, MEMPOOL, "Cannot allocate nb_stack!\n");
> > > +		return -ENOMEM;
> > > +	}
> > > +
> > > +	s->size = n;
> > > +
> > > +	nb_lifo_init(&s->used_lifo);
> > > +	nb_lifo_init(&s->free_lifo);
> > > +
> > > +	elems = (struct nb_lifo_elem *)&s[1];
> > > +	for (i = 0; i < n; i++)
> > > +		nb_lifo_push_single(&s->free_lifo, &elems[i]);
> >
> > This also can be added to the nb_lifo_init API.
> >
>
> Sure, good suggestions. I'll address this.
>

On second thought, moving this push code into nb_lifo_init() doesn't work as-is,
since we seed one LIFO and not the other.
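
One possible compromise (just a sketch, not something I'm committing to) is to
let the caller pass an optional element array to seed the LIFO, so the used LIFO
would pass NULL/0 and the free LIFO would pass the element array:

	static __rte_always_inline void
	nb_lifo_init(struct nb_lifo *lifo, struct nb_lifo_elem *elems,
		     unsigned int n)
	{
		unsigned int i;

		memset(lifo, 0, sizeof(*lifo));
		rte_atomic64_set(&lifo->len, 0);

		/* Optionally seed the LIFO with free elements */
		for (i = 0; i < n; i++)
			nb_lifo_push_single(lifo, &elems[i]);
	}

nb_stack_alloc() would then call nb_lifo_init(&s->used_lifo, NULL, 0) and
nb_lifo_init(&s->free_lifo, elems, n). (The function would have to move below
nb_lifo_push_single() in nb_lifo.h, or keep the pushes open-coded in the caller.)
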
I'll think on it, but I'm not going to spend too many cycles -- modularizing
nb_lifo can be deferred to the patchset that moves it to a separate library.
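
P.S. Regarding the do-while suggestion, here's roughly what I have in mind for
nb_lifo_push() -- an untested sketch with the same logic, just restructured:

	static __rte_always_inline void
	nb_lifo_push(struct nb_lifo *lifo,
		     struct nb_lifo_elem *first,
		     struct nb_lifo_elem *last,
		     unsigned int num)
	{
		struct nb_lifo_head old_head, new_head;
		int success;

		do {
			old_head = lifo->head;

			/* Swing the top pointer to the first element in the
			 * list and make the last element point to the old top.
			 */
			new_head.top = first;
			new_head.cnt = old_head.cnt + 1;

			last->next = old_head.top;

			success = rte_atomic128_cmpset(
					(volatile uint64_t *)&lifo->head,
					(uint64_t *)&old_head,
					(uint64_t *)&new_head);
		} while (success == 0);

		rte_atomic64_add(&lifo->len, num);
	}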