From: "Eads, Gage"
To: Honnappa Nagarahalli, dev@dpdk.org
Cc: olivier.matz@6wind.com, arybchenko@solarflare.com, "Richardson, Bruce",
 "Ananyev, Konstantin", "Gavin Hu (Arm Technology China)", nd,
 thomas@monjalon.net, Thomas Monjalon
Subject: Re: [dpdk-dev] [PATCH v3 6/8] stack: add C11 atomic implementation
Date: Fri, 29 Mar 2019 19:24:45 +0000
Message-ID: <9184057F7FC11744A2107296B6B8EB1E5420D940@FMSMSX108.amr.corp.intel.com>
References: <20190305164256.2367-1-gage.eads@intel.com>
 <20190306144559.391-1-gage.eads@intel.com>
 <20190306144559.391-7-gage.eads@intel.com>
List-Id: DPDK patches and discussions

[snip]

> > +static __rte_always_inline void
> > +__rte_stack_lf_push(struct rte_stack_lf_list *list,
> > +		    struct rte_stack_lf_elem *first,
> > +		    struct rte_stack_lf_elem *last,
> > +		    unsigned int num)
> > +{
> > +#ifndef RTE_ARCH_X86_64
> > +	RTE_SET_USED(first);
> > +	RTE_SET_USED(last);
> > +	RTE_SET_USED(list);
> > +	RTE_SET_USED(num);
> > +#else
> > +	struct rte_stack_lf_head old_head;
> > +	int success;
> > +
> > +	old_head = list->head;
> This can be a torn read (same as you have mentioned in
> __rte_stack_lf_pop). I suggest we use acquire thread fence here as well
> (please see the comments in __rte_stack_lf_pop).

Agreed. I'll add the acquire fence.
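
To make that concrete, here is a rough sketch of the revised push path as I
currently picture it (my thinking only, not the final patch; this also folds
in the fence-in-the-loop suggestion you make below):

	/* A torn read of old_head is acceptable here: if it is torn, the
	 * 128-bit CAS below fails and writes back the current head value.
	 */
	old_head = list->head;

	do {
		struct rte_stack_lf_head new_head;

		/* Order the (re)loaded old_head before the list writes that
		 * follow on each attempt.
		 */
		__atomic_thread_fence(__ATOMIC_ACQUIRE);

		new_head.top = first;
		new_head.cnt = old_head.cnt + 1;

		last->next = old_head.top;

		success = rte_atomic128_cmp_exchange(
				(rte_int128_t *)&list->head,
				(rte_int128_t *)&old_head,
				(rte_int128_t *)&new_head,
				1, __ATOMIC_RELEASE,
				__ATOMIC_RELAXED);
	} while (success == 0);
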
> > +
> > +	do {
> > +		struct rte_stack_lf_head new_head;
> > +
> We can add __atomic_thread_fence(__ATOMIC_ACQUIRE) here (please see
> the comments in __rte_stack_lf_pop).

Will add the fence here.

> > +		/* Swing the top pointer to the first element in the list and
> > +		 * make the last element point to the old top.
> > +		 */
> > +		new_head.top = first;
> > +		new_head.cnt = old_head.cnt + 1;
> > +
> > +		last->next = old_head.top;
> > +
> > +		/* Use the release memmodel to ensure the writes to the LF LIFO
> > +		 * elements are visible before the head pointer write.
> > +		 */
> > +		success = rte_atomic128_cmp_exchange(
> > +				(rte_int128_t *)&list->head,
> > +				(rte_int128_t *)&old_head,
> > +				(rte_int128_t *)&new_head,
> > +				1, __ATOMIC_RELEASE,
> > +				__ATOMIC_RELAXED);
> Success memory order can be RELAXED as the store to list->len.cnt is
> RELEASE.

The RELEASE success order here ensures that the store to 'last->next' is
visible before the head update. The RELEASE in the list->len.cnt store only
guarantees that the preceding stores are visible before list->len.cnt's
store, but doesn't guarantee any ordering between the 'last->next' store and
the head update, so we can't rely on that.

> > +	} while (success == 0);
> > +
> > +	/* Ensure the stack modifications are not reordered with respect
> > +	 * to the LIFO len update.
> > +	 */
> > +	__atomic_add_fetch(&list->len.cnt, num, __ATOMIC_RELEASE);
> > +#endif
> > +}
> > +
> > +static __rte_always_inline struct rte_stack_lf_elem *
> > +__rte_stack_lf_pop(struct rte_stack_lf_list *list,
> > +		   unsigned int num,
> > +		   void **obj_table,
> > +		   struct rte_stack_lf_elem **last)
> > +{
> > +#ifndef RTE_ARCH_X86_64
> > +	RTE_SET_USED(obj_table);
> > +	RTE_SET_USED(last);
> > +	RTE_SET_USED(list);
> > +	RTE_SET_USED(num);
> > +
> > +	return NULL;
> > +#else
> > +	struct rte_stack_lf_head old_head;
> > +	int success;
> > +
> > +	/* Reserve num elements, if available */
> > +	while (1) {
> > +		uint64_t len = __atomic_load_n(&list->len.cnt,
> > +					       __ATOMIC_ACQUIRE);
> This can be outside the loop.

Good idea. I'll move this out of the loop and change the
atomic_compare_exchange_n's failure memory order to ACQUIRE.

> > +
> > +		/* Does the list contain enough elements? */
> > +		if (unlikely(len < num))
> > +			return NULL;
> > +
> > +		if (__atomic_compare_exchange_n(&list->len.cnt,
> > +						&len, len - num,
> > +						0, __ATOMIC_RELAXED,
> > +						__ATOMIC_RELAXED))
> > +			break;
> > +	}
> > +
> > +#ifndef RTE_ARCH_X86_64
> > +	/* Use the acquire memmodel to ensure the reads to the LF LIFO elements
> > +	 * are properly ordered with respect to the head pointer read.
> > +	 *
> > +	 * Note that for aarch64, GCC's implementation of __atomic_load_16 in
> > +	 * libatomic uses locks, and so this function should be replaced by
> > +	 * a new function (e.g. "rte_atomic128_load()").
> aarch64 does not have 'pure' atomic 128b load instructions. They have to be
> implemented using load/store exclusives.
>
> > +	 */
> > +	__atomic_load((volatile __int128 *)&list->head,
> > +		      &old_head,
> > +		      __ATOMIC_ACQUIRE);
> Since, we know of x86/aarch64 (power?) that cannot implement pure atomic
> 128b loads, should we just use relaxed reads and assume the possibility of
> torn reads for all architectures? Then, we can use acquire fence to prevent
> the reordering (see below)

That's a cleaner solution. I'll implement that, dropping the architecture
distinction.
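
To illustrate, the head read in pop would then look roughly like this on all
architectures (a sketch only, not the final patch):

	/* A plain, possibly torn read of the head; if old_head is torn,
	 * the 128-bit CAS at the bottom of the retry loop fails and writes
	 * back the current head value.
	 */
	old_head = list->head;

	/* Pop num elements */
	do {
		struct rte_stack_lf_head new_head;
		struct rte_stack_lf_elem *tmp;
		unsigned int i;

		/* Order the (re)load of old_head before the tmp->next and
		 * tmp->data reads in the traversal below.
		 */
		__atomic_thread_fence(__ATOMIC_ACQUIRE);

		/* ... traverse the list and attempt the CAS as in the
		 * quoted patch below ...
		 */
	} while (success == 0);
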
> > +#else
> > +	/* x86-64 does not require an atomic load here; if a torn read occurs,
> IMO, we should not make architecture specific distinctions as this algorithm is
> based on C11 memory model.
>
> > +	 * the CAS will fail and set old_head to the correct/latest value.
> > +	 */
> > +	old_head = list->head;
> > +#endif
> > +
> > +	/* Pop num elements */
> > +	do {
> > +		struct rte_stack_lf_head new_head;
> > +		struct rte_stack_lf_elem *tmp;
> > +		unsigned int i;
> > +
> We can add __atomic_thread_fence(__ATOMIC_ACQUIRE) here.

Will do.

> > +		rte_prefetch0(old_head.top);
> > +
> > +		tmp = old_head.top;
> > +
> > +		/* Traverse the list to find the new head. A next pointer will
> > +		 * either point to another element or NULL; if a thread
> > +		 * encounters a pointer that has already been popped, the CAS
> > +		 * will fail.
> > +		 */
> > +		for (i = 0; i < num && tmp != NULL; i++) {
> > +			rte_prefetch0(tmp->next);
> > +			if (obj_table)
> > +				obj_table[i] = tmp->data;
> > +			if (last)
> > +				*last = tmp;
> > +			tmp = tmp->next;
> > +		}
> > +
> > +		/* If NULL was encountered, the list was modified while
> > +		 * traversing it. Retry.
> > +		 */
> > +		if (i != num)
> > +			continue;
> > +
> > +		new_head.top = tmp;
> > +		new_head.cnt = old_head.cnt + 1;
> > +
> > +		success = rte_atomic128_cmp_exchange(
> > +				(rte_int128_t *)&list->head,
> > +				(rte_int128_t *)&old_head,
> > +				(rte_int128_t *)&new_head,
> > +				1, __ATOMIC_ACQUIRE,
> > +				__ATOMIC_ACQUIRE);
> The success order should be __ATOMIC_RELEASE as the write to list->len.cnt
> is relaxed.
> The failure order can be __ATOMIC_RELAXED if the thread fence is added.

Agreed on both counts. Will address.

> > +	} while (success == 0);
> > +
> > +	return old_head.top;
> > +#endif
> > +}
> > +
> > +#endif /* _RTE_STACK_C11_MEM_H_ */
> > --
> > 2.13.6
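
For reference, with the orders agreed above, the head CAS in pop would end up
looking roughly like this (a sketch, not the final patch):

	/* Orders agreed above: RELEASE on success (the list->len.cnt
	 * reservation is not a release operation), RELAXED on failure
	 * (the acquire thread fence at the top of the retry loop provides
	 * the needed ordering).
	 */
	success = rte_atomic128_cmp_exchange(
			(rte_int128_t *)&list->head,
			(rte_int128_t *)&old_head,
			(rte_int128_t *)&new_head,
			1, __ATOMIC_RELEASE,
			__ATOMIC_RELAXED);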