From: "Medvedkin, Vladimir" 
To: "Ruifeng Wang (Arm Technology China)" , Honnappa Nagarahalli , "bruce.richardson@intel.com" 
Cc: "dev@dpdk.org" , "Gavin Hu (Arm Technology China)" , nd
Subject: Re: [dpdk-dev] [PATCH v1 1/2] lib/lpm: memory orderings to avoid race conditions for v1604
Date: Mon, 17 Jun 2019 16:33:51 +0100
Message-ID: <10f3353e-5dab-0bd3-3e4c-b42080e76fd0@intel.com>
In-Reply-To: 
References: <20190605055451.30473-1-ruifeng.wang@arm.com>
List-Id: DPDK patches and discussions

Hi Wang,

On 17/06/2019 16:27, Ruifeng Wang (Arm Technology China) wrote:
> Hi Vladimir,
>
> From: Medvedkin, Vladimir
> Sent: Monday, June 10, 2019 23:23
> To: Honnappa Nagarahalli ; Ruifeng Wang (Arm Technology China) ; bruce.richardson@intel.com
> Cc: dev@dpdk.org; Gavin Hu (Arm Technology China) ; nd
> Subject: Re: [PATCH v1 1/2] lib/lpm: memory orderings to avoid race conditions for v1604
>
> Hi Honnappa, Wang,
>
> On 05/06/2019 20:23, Honnappa Nagarahalli wrote:
>
> Hi Wang,
>
> On 05/06/2019 06:54, Ruifeng Wang wrote:
> When a tbl8 group is getting attached to a tbl24 entry, lookup might
> fail even though the entry is configured in the table.
>
> For ex: consider an LPM table configured with 10.10.10.1/24.
> When a new entry 10.10.10.32/28 is being added, a new tbl8 group is
> allocated and the tbl24 entry is changed to point to the tbl8 group. If
> the tbl24 entry is written without the tbl8 group entries updated, a
> lookup on 10.10.10.9 will return failure.
>
> Correct memory orderings are required to ensure that the store to
> tbl24 does not happen before the stores to the tbl8 group entries
> complete.
>
> The orderings have an impact on the LPM performance test.
> On the Arm A72 platform, the delete operation has 2.7% degradation, while
> add / lookup has no notable performance change.
> On the x86 E5 platform, the add operation has 4.3% degradation, the delete
> operation has 2.2% - 10.2% degradation, and lookup has no performance
> change.
>
> I think it is possible to avoid add/del performance degradation
>
> My understanding was that the degradation on x86 is happening because of
> the additional compiler barriers this patch introduces. For the Arm
> platform the degradation is caused by the store-release memory barriers.
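
For illustration, a minimal sketch of the ordering requirement described in
the commit message above. The types and names here are simplified stand-ins,
not the real rte_lpm structures; the point is only that the tbl8 group must
be fully written before the store that makes tbl24 point at it, which is
what a release store provides.

/* Sketch only: simplified stand-in types, not the actual DPDK code. */
#include <stdint.h>

struct tbl_entry {
        uint32_t next_hop : 24;
        uint32_t valid    :  1;
        uint32_t ext      :  1;
        uint32_t depth    :  6;
};

/* Fill a freshly allocated tbl8 group, then publish it via tbl24. */
static void
attach_group(struct tbl_entry *tbl24_slot, struct tbl_entry *tbl8_group,
             unsigned int group_size, struct tbl_entry fill,
             struct tbl_entry points_to_group)
{
        unsigned int i;

        for (i = 0; i < group_size; i++)
                tbl8_group[i] = fill;   /* plain stores to the new group */

        /*
         * Release store: all stores above are visible to a reader that
         * observes the new tbl24 value. On x86 this only restricts the
         * compiler; on aarch64 it emits a store-release.
         */
        __atomic_store(tbl24_slot, &points_to_group, __ATOMIC_RELEASE);
}

Without the release ordering, nothing prevents the tbl24 store from becoming
visible before the group is fully written, which is exactly the failed
lookup described in the commit message.
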
> Just made some tests on Skylake and Sandy Bridge. On the Skylake there is
> no performance degradation after applying this patchset. On the Sandy
> Bridge there is a performance drop for rte_lpm_add() (from 460k cycles to
> 530k cycles in the lpm_performance unit test). This is caused by one chunk
> of this patchset (add_depth_small_v1604()). And it looks like after
> uninlining of this function performance gets back to the original 460k
> cycles it was before the patch.
>
> [Ruifeng] Are you suggesting to un-inline add_depth_small_v1604()? I'm OK
> with such change since the function is too big and is not necessary to be
> inlined.

That's right. Try to uninline it (and maybe add_depth_big(), it depends on
your set of prefixes) and run your performance tests.

> > 1. Explicitly mark struct rte_lpm_tbl_entry 4-byte aligned
> The 'rte_lpm_tbl_entry' is already 32b, shouldn't it be aligned on a
> 4-byte boundary already?
>
> > 2. Cast value to uint32_t (uint16_t for 2.0 version) on memory write
> > 3. Use rte_wmb() after memory write
> (It would be good to point out the locations in the patch). I assume you
> are referring to __atomic_store(__ATOMIC_RELEASE). I am wondering if
> rte_wmb() is required? My understanding is that x86 would require just a
> compiler barrier. So, should it be rte_smp_wmb()?
> __atomic_store(__ATOMIC_RELEASE) just adds a compiler barrier for x86.
>
> You're right, it needs just a compiler barrier for x86 and a memory
> barrier instruction (dmb ?) for arm, so rte_smp_wmb() looks appropriate
> here as well as __atomic_store(__ATOMIC_RELEASE).
>
> Thanks for your suggestions.
> Point 1 & 2 make sense.
>
> For point 3, are you suggesting using rte_wmb() instead of
> __atomic_store()? rte_wmb() is DPDK's own memory model. Maybe we can use
> __atomic_store() with 'RTE_USE_C11_MEM_MODEL=y', and use rte_wmb()
> otherwise? IMO, code becomes difficult to manage.
>
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> ---
>  lib/librte_lpm/rte_lpm.c | 32 +++++++++++++++++++++++++-------
>  lib/librte_lpm/rte_lpm.h |  4 ++++
>  2 files changed, 29 insertions(+), 7 deletions(-)
>
> diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c
> index 6b7b28a2e..6ec450a08 100644
> --- a/lib/librte_lpm/rte_lpm.c
> +++ b/lib/librte_lpm/rte_lpm.c
> @@ -806,7 +806,8 @@ add_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth,
>              /* Setting tbl24 entry in one go to avoid race
>               * conditions
>               */
> -            lpm->tbl24[i] = new_tbl24_entry;
> +            __atomic_store(&lpm->tbl24[i], &new_tbl24_entry,
> +                    __ATOMIC_RELEASE);
>
> I don't see a reordering issue here in this patch chunk. However, the
> direct assignment was translated to 2 MOV ops:
> mov    (%rdi,%rcx,4),%edx      <-- get lpm->tbl24[i]
> and    $0xff000000,%edx        <-- clean .next_hop
> or     %r9d,%edx               <-- save new next_hop
> mov    %edx,(%rdi,%rcx,4)      <-- save an entry with new next_hop but old depth and valid bitfields
> mov    %r11b,0x3(%rdi,%rcx,4)  <-- save new depth and valid bitfields
> so agree with __atomic_store() here.
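
To make the torn-store point above concrete, here is a sketch with
simplified stand-in types (not the actual rte_lpm code). A plain assignment
of a 32-bit bit-field struct is not required to be a single store, as the
mov sequence quoted above shows, while the generic __atomic_store() forces
one 32-bit store and, with __ATOMIC_RELEASE, also orders earlier stores.

/* Sketch only: simplified stand-in types, not the actual DPDK code. */
#include <stdint.h>

struct tbl_entry {
        uint32_t next_hop : 24;
        uint32_t valid    :  1;
        uint32_t ext      :  1;
        uint32_t depth    :  6;
};

static void
set_entry_plain(struct tbl_entry *slot, uint32_t next_hop, uint8_t depth)
{
        struct tbl_entry new_entry = {
                .next_hop = next_hop,
                .valid = 1,
                .ext = 0,
                .depth = depth,
        };

        /*
         * Nothing requires the compiler to emit a single 32-bit store here:
         * it may build the entry directly in *slot, e.g. as the
         * read-modify-write plus byte store shown in the disassembly above,
         * so a reader can see the new next_hop with old depth/valid bits.
         */
        *slot = new_entry;
}

static void
set_entry_atomic(struct tbl_entry *slot, uint32_t next_hop, uint8_t depth)
{
        struct tbl_entry new_entry = {
                .next_hop = next_hop,
                .valid = 1,
                .ext = 0,
                .depth = depth,
        };

        /*
         * Single 32-bit store of the whole entry: a reader sees either the
         * old or the new entry, never a mix. __ATOMIC_RELEASE additionally
         * orders earlier stores; on x86 that part is only a compiler barrier.
         */
        __atomic_store(slot, &new_entry, __ATOMIC_RELEASE);
}

Note that rte_smp_wmb() before a plain assignment would give the ordering
discussed above (compiler barrier on x86, dmb on arm), but it would not by
itself guarantee that the entry is updated with a single 32-bit store.
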
>
>              continue;
>          }
> @@ -1017,7 +1018,11 @@ add_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth,
>              .depth = 0,
>          };
>
> -        lpm->tbl24[tbl24_index] = new_tbl24_entry;
> +        /* The tbl24 entry must be written only after the
> +         * tbl8 entries are written.
> +         */
> +        __atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry,
> +                __ATOMIC_RELEASE);
>
>      } /* If valid entry but not extended calculate the index into Table8. */
>      else if (lpm->tbl24[tbl24_index].valid_group == 0) {
> @@ -1063,7 +1068,11 @@ add_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth,
>              .depth = 0,
>          };
>
> -        lpm->tbl24[tbl24_index] = new_tbl24_entry;
> +        /* The tbl24 entry must be written only after the
> +         * tbl8 entries are written.
> +         */
> +        __atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry,
> +                __ATOMIC_RELEASE);
>
>      } else { /*
>           * If it is valid, extended entry calculate the index into tbl8.
> @@ -1391,6 +1400,7 @@ delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
>      /* Calculate the range and index into Table24. */
>      tbl24_range = depth_to_range(depth);
>      tbl24_index = (ip_masked >> 8);
> +    struct rte_lpm_tbl_entry zero_tbl24_entry = {0};
>
>      /*
>       * Firstly check the sub_rule_index. A -1 indicates no replacement rule
> @@ -1405,7 +1415,8 @@ delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
>
>          if (lpm->tbl24[i].valid_group == 0 &&
>                  lpm->tbl24[i].depth <= depth) {
> -            lpm->tbl24[i].valid = INVALID;
> +            __atomic_store(&lpm->tbl24[i],
> +                &zero_tbl24_entry, __ATOMIC_RELEASE);
>          } else if (lpm->tbl24[i].valid_group == 1) {
>              /*
>               * If TBL24 entry is extended, then there has
> @@ -1450,7 +1461,8 @@ delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
>
>          if (lpm->tbl24[i].valid_group == 0 &&
>                  lpm->tbl24[i].depth <= depth) {
> -            lpm->tbl24[i] = new_tbl24_entry;
> +            __atomic_store(&lpm->tbl24[i], &new_tbl24_entry,
> +                    __ATOMIC_RELEASE);
>          } else if (lpm->tbl24[i].valid_group == 1) {
>              /*
>               * If TBL24 entry is extended, then there has
> @@ -1713,8 +1725,11 @@ delete_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
>      tbl8_recycle_index = tbl8_recycle_check_v1604(lpm->tbl8, tbl8_group_start);
>
>      if (tbl8_recycle_index == -EINVAL) {
> -        /* Set tbl24 before freeing tbl8 to avoid race condition. */
> +        /* Set tbl24 before freeing tbl8 to avoid race condition.
> +         * Prevent the free of the tbl8 group from hoisting.
> +         */
>          lpm->tbl24[tbl24_index].valid = 0;
> +        __atomic_thread_fence(__ATOMIC_RELEASE);
>          tbl8_free_v1604(lpm->tbl8, tbl8_group_start);
>      } else if (tbl8_recycle_index > -1) {
>          /* Update tbl24 entry. */
> @@ -1725,8 +1740,11 @@ delete_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
>              .depth = lpm->tbl8[tbl8_recycle_index].depth,
>          };
>
> -        /* Set tbl24 before freeing tbl8 to avoid race condition. */
> +        /* Set tbl24 before freeing tbl8 to avoid race condition.
> +         * Prevent the free of the tbl8 group from hoisting.
> +         */
>          lpm->tbl24[tbl24_index] = new_tbl24_entry;
> +        __atomic_thread_fence(__ATOMIC_RELEASE);
>          tbl8_free_v1604(lpm->tbl8, tbl8_group_start);
>      }
>  #undef group_idx
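
For the two delete hunks just above, a minimal sketch of the fence-before-free
pattern (stand-in types and names, not the real tbl8_free_v1604()): the
release fence keeps the recycling of the tbl8 group from being hoisted above
the tbl24 update.

/* Sketch only: stand-in types and names, not the actual DPDK code. */
#include <stdint.h>

struct tbl_entry { uint32_t data; };

/* Stand-in for tbl8_free_v1604(): mark the group header as reusable. */
static void
group_free(struct tbl_entry *tbl8, uint32_t group_start)
{
        tbl8[group_start].data = 0;
}

static void
detach_and_free(struct tbl_entry *tbl24, uint32_t tbl24_index,
                struct tbl_entry *tbl8, uint32_t group_start,
                struct tbl_entry replacement)
{
        /* Stop directing readers into the group... */
        tbl24[tbl24_index] = replacement;

        /*
         * ...and only then allow the group to be recycled. The release
         * fence is there to keep the compiler and CPU from moving the
         * stores done by group_free() ahead of the tbl24 update, which is
         * the stated intent of the fence in the patch.
         */
        __atomic_thread_fence(__ATOMIC_RELEASE);

        group_free(tbl8, group_start);
}

On x86 this fence is only a compiler barrier; on arm it emits a dmb, which
is consistent with the small delete-path cost reported in the commit message.
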
> diff --git a/lib/librte_lpm/rte_lpm.h b/lib/librte_lpm/rte_lpm.h
> index b886f54b4..6f5704c5c 100644
> --- a/lib/librte_lpm/rte_lpm.h
> +++ b/lib/librte_lpm/rte_lpm.h
> @@ -354,6 +354,10 @@ rte_lpm_lookup(struct rte_lpm *lpm, uint32_t ip, uint32_t *next_hop)
>      ptbl = (const uint32_t *)(&lpm->tbl24[tbl24_index]);
>      tbl_entry = *ptbl;
>
> +    /* Memory ordering is not required in lookup. Because dataflow
> +     * dependency exists, compiler or HW won't be able to re-order
> +     * the operations.
> +     */
>      /* Copy tbl8 entry (only if needed) */
>      if (unlikely((tbl_entry & RTE_LPM_VALID_EXT_ENTRY_BITMASK) ==
>              RTE_LPM_VALID_EXT_ENTRY_BITMASK)) {
>
> --
> Regards,
> Vladimir
>
> Regards,
> /Ruifeng

-- 
Regards,
Vladimir