From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Eads, Gage"
To: Honnappa Nagarahalli, "dev@dpdk.org"
CC: "olivier.matz@6wind.com", "arybchenko@solarflare.com", "Richardson, Bruce", "Ananyev, Konstantin", "Gavin Hu (Arm Technology China)", nd
Thread-Topic: [dpdk-dev] [PATCH v4 2/2] mempool/nb_stack: add non-blocking stack mempool
Date: Fri, 18 Jan 2019 20:09:34 +0000
Message-ID: <9184057F7FC11744A2107296B6B8EB1E541C9383@FMSMSX108.amr.corp.intel.com>
References: <20190116151835.22424-1-gage.eads@intel.com> <20190117153659.28477-1-gage.eads@intel.com> <20190117153659.28477-3-gage.eads@intel.com>
Subject: Re: [dpdk-dev] [PATCH v4 2/2] mempool/nb_stack: add non-blocking stack mempool
List-Id: DPDK patches and discussions

> -----Original Message-----
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Thursday, January 17, 2019 11:05 PM
> To: Eads, Gage; dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce;
> Ananyev, Konstantin; Gavin Hu (Arm Technology China); nd;
> Honnappa Nagarahalli; nd
> Subject: RE: [dpdk-dev] [PATCH v4 2/2] mempool/nb_stack: add non-blocking
> stack mempool
>
> Hi Gage,
> Thank you for your contribution on non-blocking data structures. I think
> they are important to extend DPDK into additional use cases.
>

Glad to hear it. Be sure to check out my non-blocking ring patchset as well, if you haven't already: http://mails.dpdk.org/archives/dev/2019-January/123774.html

> I am wondering if it makes sense to decouple the NB stack data structure
> from the mempool driver (similar to rte_ring)? I see that the stack-based
> mempool implements the stack data structure in the driver.
> But, NB stack might not be such a trivial data structure. It might be
> useful for applications or other use cases as well.
>

I agree -- and you're not the first to suggest this :).

I'm going to defer that work to a later patchset; creating a new lib/ directory requires tech board approval (IIRC), which would unnecessarily slow down this mempool handler from getting merged.

> I also suggest that we use the C11 __atomic_xxx APIs for memory operations.
> The rte_atomic64_xxx APIs use the __sync_xxx APIs, which do not provide the
> capability to express memory orderings.
>

Ok, I will add those (dependent on RTE_USE_C11_MEM_MODEL).

> Please find a few comments inline.
>
> >
> > This commit adds support for a non-blocking (linked list based) stack
> > mempool handler. The stack uses a 128-bit compare-and-swap
> > instruction, and thus is limited to x86_64. The 128-bit CAS atomically
> > updates the stack top pointer and a modification counter, which
> > protects against the ABA problem.
> >
> > In mempool_perf_autotest the lock-based stack outperforms the non-
> > blocking handler*, however:
> > - For applications with preemptible pthreads, a lock-based stack's
> >   worst-case performance (i.e. one thread being preempted while
> >   holding the spinlock) is much worse than the non-blocking stack's.
> > - Using per-thread mempool caches will largely mitigate the performance
> >   difference.
> >
> > *Test setup: x86_64 build with default config, dual-socket Xeon
> > E5-2699 v4, running on isolcpus cores with a tickless scheduler. The
> > lock-based stack's rate_persec was 1x-3.5x the non-blocking stack's.
> >
> > Signed-off-by: Gage Eads
> > Acked-by: Andrew Rybchenko
> > ---
> >  MAINTAINERS                                     |   4 +
> >  config/common_base                              |   1 +
> >  doc/guides/prog_guide/env_abstraction_layer.rst |   5 +
> >  drivers/mempool/Makefile                        |   3 +
> >  drivers/mempool/meson.build                     |   3 +-
> >  drivers/mempool/nb_stack/Makefile               |  23 ++++
> >  drivers/mempool/nb_stack/meson.build            |   6 +
> >  drivers/mempool/nb_stack/nb_lifo.h              | 147 +++++++++++++++++++++
> >  drivers/mempool/nb_stack/rte_mempool_nb_stack.c | 125 ++++++++++++++++++
> >  .../nb_stack/rte_mempool_nb_stack_version.map   |   4 +
> >  mk/rte.app.mk                                   |   7 +-
> >  11 files changed, 325 insertions(+), 3 deletions(-)
> >  create mode 100644 drivers/mempool/nb_stack/Makefile
> >  create mode 100644 drivers/mempool/nb_stack/meson.build
> >  create mode 100644 drivers/mempool/nb_stack/nb_lifo.h
> >  create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> >  create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 470f36b9c..5519d3323 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -416,6 +416,10 @@ M: Artem V.
> > Andreev
> > M: Andrew Rybchenko
> > F: drivers/mempool/bucket/
> >
> > +Non-blocking stack memory pool
> > +M: Gage Eads
> > +F: drivers/mempool/nb_stack/
> > +
> >
> > Bus Drivers
> > -----------
> > diff --git a/config/common_base b/config/common_base
> > index 964a6956e..8a51f36b1 100644
> > --- a/config/common_base
> > +++ b/config/common_base
> > @@ -726,6 +726,7 @@ CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=n
> > #
> > CONFIG_RTE_DRIVER_MEMPOOL_BUCKET=y
> > CONFIG_RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB=64
> > +CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK=y
> > CONFIG_RTE_DRIVER_MEMPOOL_RING=y
> > CONFIG_RTE_DRIVER_MEMPOOL_STACK=y
> >
> > diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst
> > b/doc/guides/prog_guide/env_abstraction_layer.rst
> > index 929d76dba..9497b879c 100644
> > --- a/doc/guides/prog_guide/env_abstraction_layer.rst
> > +++ b/doc/guides/prog_guide/env_abstraction_layer.rst
> > @@ -541,6 +541,11 @@ Known Issues
> >
> >   5. It MUST not be used by multi-producer/consumer pthreads, whose
> >      scheduling policies are SCHED_FIFO or SCHED_RR.
> >
> > +  Alternatively, x86_64 applications can use the non-blocking stack
> > +  mempool handler. When considering this handler, note that:
> > +
> > +  - it is limited to the x86_64 platform, because it uses an
> > +    instruction (16-byte compare-and-swap) that is not available on
> > +    other platforms.
>        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> The Arm architecture supports similar instructions. I suggest simplifying
> this statement to indicate that 'the nb_stack feature is available for
> x86_64 platforms currently'
>

Will do.

> > +  - it has worse average-case performance than the non-preemptive
> > +    rte_ring, but software caching (e.g. the mempool cache) can
> > +    mitigate this by reducing the number of handler operations.
> > +
> >
> > + rte_timer
> >
> >   Running ``rte_timer_manage()`` on a non-EAL pthread is not allowed.
> >   However, resetting/stopping the timer from a non-EAL pthread is allowed.
> > diff --git a/drivers/mempool/Makefile b/drivers/mempool/Makefile
> > index 28c2e8360..895cf8a34 100644
> > --- a/drivers/mempool/Makefile
> > +++ b/drivers/mempool/Makefile
> > @@ -10,6 +10,9 @@ endif
> > ifeq ($(CONFIG_RTE_EAL_VFIO)$(CONFIG_RTE_LIBRTE_FSLMC_BUS),yy)
> > DIRS-$(CONFIG_RTE_LIBRTE_DPAA2_MEMPOOL) += dpaa2
> > endif
> > +ifeq ($(CONFIG_RTE_ARCH_X86_64),y)
> > +DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += nb_stack
> > +endif
> > DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_RING) += ring
> > DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK) += stack
> > DIRS-$(CONFIG_RTE_LIBRTE_OCTEONTX_MEMPOOL) += octeontx
> > diff --git a/drivers/mempool/meson.build b/drivers/mempool/meson.build
> > index 4527d9806..220cfaf63 100644
> > --- a/drivers/mempool/meson.build
> > +++ b/drivers/mempool/meson.build
> > @@ -1,7 +1,8 @@
> > # SPDX-License-Identifier: BSD-3-Clause
> > # Copyright(c) 2017 Intel Corporation
> >
> > -drivers = ['bucket', 'dpaa', 'dpaa2', 'octeontx', 'ring', 'stack']
> > +drivers = ['bucket', 'dpaa', 'dpaa2', 'nb_stack', 'octeontx', 'ring',
> > +'stack']
> > +
> > std_deps = ['mempool']
> > config_flag_fmt = 'RTE_LIBRTE_@0@_MEMPOOL'
> > driver_name_fmt = 'rte_mempool_@0@'
> > diff --git a/drivers/mempool/nb_stack/Makefile
> > b/drivers/mempool/nb_stack/Makefile
> > new file mode 100644
> > index 000000000..318b18283
> > --- /dev/null
> > +++ b/drivers/mempool/nb_stack/Makefile
> > @@ -0,0 +1,23 @@
> > +# SPDX-License-Identifier: BSD-3-Clause
> > +# Copyright(c) 2019 Intel Corporation
> > +
> > +include $(RTE_SDK)/mk/rte.vars.mk
> > +
> > +#
> > +# library name
> > +#
> > +LIB = librte_mempool_nb_stack.a
> > +
> > +CFLAGS += -O3
> > +CFLAGS += $(WERROR_FLAGS)
> > +
> > +# Headers
> > +LDLIBS += -lrte_eal -lrte_mempool
> > +
> > +EXPORT_MAP := rte_mempool_nb_stack_version.map
> > +
> > +LIBABIVER := 1
> > +
> > +SRCS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += rte_mempool_nb_stack.c
> > +
> > +include $(RTE_SDK)/mk/rte.lib.mk
> > diff --git
> > a/drivers/mempool/nb_stack/meson.build
> > b/drivers/mempool/nb_stack/meson.build
> > new file mode 100644
> > index 000000000..7dec72242
> > --- /dev/null
> > +++ b/drivers/mempool/nb_stack/meson.build
> > @@ -0,0 +1,6 @@
> > +# SPDX-License-Identifier: BSD-3-Clause
> > +# Copyright(c) 2019 Intel Corporation
> > +
> > +build = dpdk_conf.has('RTE_ARCH_X86_64')
> > +
> > +sources = files('rte_mempool_nb_stack.c')
> > diff --git a/drivers/mempool/nb_stack/nb_lifo.h
> > b/drivers/mempool/nb_stack/nb_lifo.h
> > new file mode 100644
> > index 000000000..ad4a3401f
> > --- /dev/null
> > +++ b/drivers/mempool/nb_stack/nb_lifo.h
> > @@ -0,0 +1,147 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2019 Intel Corporation
> > + */
> > +
> > +#ifndef _NB_LIFO_H_
> > +#define _NB_LIFO_H_
> > +
> > +struct nb_lifo_elem {
> > +	void *data;
> > +	struct nb_lifo_elem *next;
> > +};
> > +
> > +struct nb_lifo_head {
> > +	struct nb_lifo_elem *top; /**< Stack top */
> > +	uint64_t cnt; /**< Modification counter */
> > +};
>
> Minor comment, mentioning the ABA problem in the comments for 'cnt' will
> be helpful.
>

Sure.

> > +
> > +struct nb_lifo {
> > +	volatile struct nb_lifo_head head __rte_aligned(16);
> > +	rte_atomic64_t len;
> > +} __rte_cache_aligned;
> > +
> > +static __rte_always_inline void
> > +nb_lifo_init(struct nb_lifo *lifo)
> > +{
> > +	memset(lifo, 0, sizeof(*lifo));
> > +	rte_atomic64_set(&lifo->len, 0);
> > +}
> > +
> > +static __rte_always_inline unsigned int
> > +nb_lifo_len(struct nb_lifo *lifo)
> > +{
> > +	/* nb_lifo_push() and nb_lifo_pop() do not update the list's contents
> > +	 * and lifo->len atomically, which can cause the list to appear shorter
> > +	 * than it actually is if this function is called while other threads
> > +	 * are modifying the list.
> > +	 *
> > +	 * However, given the inherently approximate nature of the get_count
> > +	 * callback -- even if the list and its size were updated atomically,
> > +	 * the size could change between when get_count executes and when the
> > +	 * value is returned to the caller -- this is acceptable.
> > +	 *
> > +	 * The lifo->len updates are placed such that the list may appear to
> > +	 * have fewer elements than it does, but will never appear to have more
> > +	 * elements. If the mempool is near-empty to the point that this is a
> > +	 * concern, the user should consider increasing the mempool size.
> > +	 */
> > +	return (unsigned int)rte_atomic64_read(&lifo->len);
> > +}
> > +
> > +static __rte_always_inline void
> > +nb_lifo_push(struct nb_lifo *lifo,
> > +	     struct nb_lifo_elem *first,
> > +	     struct nb_lifo_elem *last,
> > +	     unsigned int num)
> > +{
> > +	while (1) {
> > +		struct nb_lifo_head old_head, new_head;
> > +
> > +		old_head = lifo->head;
> > +
> > +		/* Swing the top pointer to the first element in the list and
> > +		 * make the last element point to the old top.
> > +		 */
> > +		new_head.top = first;
> > +		new_head.cnt = old_head.cnt + 1;
> > +
> > +		last->next = old_head.top;
> > +
> > +		if (rte_atomic128_cmpset((volatile uint64_t *)&lifo->head,
> > +					 (uint64_t *)&old_head,
> > +					 (uint64_t *)&new_head))
> > +			break;
> > +	}
>
> Minor comment, this can be a do-while loop (for ex: similar to the one in
> __rte_ring_move_prod_head)
>

Sure.
> > +
> > +	rte_atomic64_add(&lifo->len, num);
> > +}
> > +
> > +static __rte_always_inline void
> > +nb_lifo_push_single(struct nb_lifo *lifo, struct nb_lifo_elem *elem)
> > +{
> > +	nb_lifo_push(lifo, elem, elem, 1);
> > +}
> > +
> > +static __rte_always_inline struct nb_lifo_elem *
> > +nb_lifo_pop(struct nb_lifo *lifo,
> > +	    unsigned int num,
> > +	    void **obj_table,
> > +	    struct nb_lifo_elem **last)
> > +{
> > +	struct nb_lifo_head old_head;
> > +
> > +	/* Reserve num elements, if available */
> > +	while (1) {
> > +		uint64_t len = rte_atomic64_read(&lifo->len);
> > +
> > +		/* Does the list contain enough elements? */
> > +		if (len < num)
> > +			return NULL;
> > +
> > +		if (rte_atomic64_cmpset((volatile uint64_t *)&lifo->len,
> > +					len, len - num))
> > +			break;
> > +	}
> > +
> > +	/* Pop num elements */
> > +	while (1) {
> > +		struct nb_lifo_head new_head;
> > +		struct nb_lifo_elem *tmp;
> > +		unsigned int i;
> > +
> > +		old_head = lifo->head;
> > +
> > +		tmp = old_head.top;
> > +
> > +		/* Traverse the list to find the new head. A next pointer will
> > +		 * either point to another element or NULL; if a thread
> > +		 * encounters a pointer that has already been popped, the CAS
> > +		 * will fail.
> > +		 */
> > +		for (i = 0; i < num && tmp != NULL; i++) {
> > +			if (obj_table)
>
> This 'if' check can be outside the for loop. Maybe use RTE_ASSERT in the
> beginning of the function?
>

A NULL obj_table pointer isn't an error -- nb_stack_enqueue() calls this function with NULL because it doesn't need the popped elements added to a table. When the compiler inlines this function into nb_stack_enqueue(), it can use constant propagation to optimize away the if-statement.

I don't think that's possible for the other caller, nb_stack_dequeue, though, unless we add a NULL pointer check to the beginning of that function. Then it would be guaranteed that obj_table is non-NULL, and the compiler can optimize away the if-statement. I'll add that.
> > +				obj_table[i] = tmp->data;
> > +			if (last)
> > +				*last = tmp;
> > +			tmp = tmp->next;
> > +		}
> > +
> > +		/* If NULL was encountered, the list was modified while
> > +		 * traversing it. Retry.
> > +		 */
> > +		if (i != num)
> > +			continue;
> > +
> > +		new_head.top = tmp;
> > +		new_head.cnt = old_head.cnt + 1;
> > +
> > +		if (rte_atomic128_cmpset((volatile uint64_t *)&lifo->head,
> > +					 (uint64_t *)&old_head,
> > +					 (uint64_t *)&new_head))
> > +			break;
> > +	}
> > +
> > +	return old_head.top;
> > +}
> > +
> > +#endif /* _NB_LIFO_H_ */
> > diff --git a/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> > b/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> > new file mode 100644
> > index 000000000..1818a2cfa
> > --- /dev/null
> > +++ b/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> > @@ -0,0 +1,125 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2019 Intel Corporation
> > + */
> > +
> > +#include
> > +#include
> > +#include
> > +
> > +#include "nb_lifo.h"
> > +
> > +struct rte_mempool_nb_stack {
> > +	uint64_t size;
> > +	struct nb_lifo used_lifo; /**< LIFO containing mempool pointers */
> > +	struct nb_lifo free_lifo; /**< LIFO containing unused LIFO elements */
> > +};
> > +
> > +static int
> > +nb_stack_alloc(struct rte_mempool *mp)
> > +{
> > +	struct rte_mempool_nb_stack *s;
> > +	struct nb_lifo_elem *elems;
> > +	unsigned int n = mp->size;
> > +	unsigned int size, i;
> > +
> > +	size = sizeof(*s) + n * sizeof(struct nb_lifo_elem);
>
> IMO, the allocation of the stack elements can be moved under the
> nb_lifo_init API; it would make the nb stack code modular.
>

(see below)

> > +
> > +	/* Allocate our local memory structure */
> > +	s = rte_zmalloc_socket("mempool-nb_stack",
> > +			       size,
> > +			       RTE_CACHE_LINE_SIZE,
> > +			       mp->socket_id);
> > +	if (s == NULL) {
> > +		RTE_LOG(ERR, MEMPOOL, "Cannot allocate nb_stack!\n");
> > +		return -ENOMEM;
> > +	}
> > +
> > +	s->size = n;
> > +
> > +	nb_lifo_init(&s->used_lifo);
> > +	nb_lifo_init(&s->free_lifo);
> > +
> > +	elems = (struct nb_lifo_elem *)&s[1];
> > +	for (i = 0; i < n; i++)
> > +		nb_lifo_push_single(&s->free_lifo, &elems[i]);
>
> This also can be added to the nb_lifo_init API.
>

Sure, good suggestions. I'll address this.

Appreciate the feedback!

Thanks,
Gage