From: "Wang, Yipeng1"
To: Honnappa Nagarahalli, Jerin Jacob
Cc: "Richardson, Bruce", "De Lara Guarch, Pablo", dev@dpdk.org, Dharmik Thakkar, "Gavin Hu (Arm Technology China)", nd, thomas@monjalon.net, "Yigit, Ferruh", hemant.agrawal@nxp.com, chaozhu@linux.vnet.ibm.com, "Gobriel, Sameh"
Date: Wed, 7 Nov 2018 02:15:33 +0000
Subject: Re: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write concurrency

>-----Original Message-----
>From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
>Sent: Monday, November 5, 2018 10:08 PM
>To: Jerin Jacob
>Cc: Richardson, Bruce; De Lara Guarch, Pablo; dev@dpdk.org; Wang, Yipeng1; Dharmik Thakkar; Gavin Hu (Arm Technology China); nd; thomas@monjalon.net; Yigit, Ferruh; hemant.agrawal@nxp.com; chaozhu@linux.vnet.ibm.com; nd
>Subject: RE: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write concurrency
>
>> > 9) Does anyone else facing this problem?
>Any data on x86?
>
[Wang, Yipeng]
I tried Jerin's tests on x86.
By default, l3fwd on x86 uses lookup_bulk and SIMD instructions, so there is no
obvious throughput drop in either the hit or the miss case (the hit case does
show about a 2.5% drop, though).

I manually changed l3fwd to do single-packet lookup instead of bulk. For the
hit case there is no throughput drop. For the miss case, there is a 10%
throughput drop.

I dug into it and, as expected, the atomic load indeed translates to a regular
mov on x86. But because of the instruction reordering, the compiler (gcc 5.4)
can no longer unroll the for loop into the switch-case-like assembly it
generated before. So I believe the reason for the performance drop on x86 is
that the compiler cannot optimize the code as well as it did previously. I
guess this is a completely different reason from the performance drop on your
non-TSO machine, where the excessive number of atomic loads probably causes a
lot of overhead.

A quick fix I found useful on x86 is to read all the indexes together. I am no
expert on the use of atomic intrinsics, but I assume adding a fence should
still maintain the correct ordering?

-       uint32_t key_idx;
+       uint32_t key_idx[RTE_HASH_BUCKET_ENTRIES];
        void *pdata;
        struct rte_hash_key *k, *keys = h->key_store;

+       memcpy(key_idx, bkt->key_idx, 4 * RTE_HASH_BUCKET_ENTRIES);
+       __atomic_thread_fence(__ATOMIC_ACQUIRE);
+
        for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-               key_idx = __atomic_load_n(&bkt->key_idx[i],
-                               __ATOMIC_ACQUIRE);
-               if (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT) {
+               if (bkt->sig_current[i] == sig && key_idx[i] != EMPTY_SLOT) {

Yipeng
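For illustration, below is a minimal standalone sketch of the two load patterns
being compared. It is not the actual rte_hash code: the bucket layout, ENTRIES,
EMPTY_SLOT value, and function names are simplified stand-ins for the real
definitions in rte_cuckoo_hash, and whether the memcpy-plus-fence variant
provides the required ordering is exactly the open question raised above.

    /* Standalone sketch only; names and layout are hypothetical. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define ENTRIES    8        /* stand-in for RTE_HASH_BUCKET_ENTRIES */
    #define EMPTY_SLOT 0

    struct bucket {
            uint16_t sig_current[ENTRIES];
            uint32_t key_idx[ENTRIES];
    };

    /* Per-entry acquire loads: the pattern in the patch under review.
     * Correct, but the per-iteration atomic load can keep the compiler
     * from unrolling/vectorizing the loop. */
    static int
    search_per_entry(const struct bucket *bkt, uint16_t sig)
    {
            unsigned int i;
            uint32_t key_idx;

            for (i = 0; i < ENTRIES; i++) {
                    key_idx = __atomic_load_n(&bkt->key_idx[i],
                                              __ATOMIC_ACQUIRE);
                    if (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT)
                            return (int)i;
            }
            return -1;
    }

    /* Bulk copy of all indexes followed by a single acquire fence:
     * the pattern proposed in the diff above. */
    static int
    search_bulk(const struct bucket *bkt, uint16_t sig)
    {
            unsigned int i;
            uint32_t key_idx[ENTRIES];

            memcpy(key_idx, bkt->key_idx, sizeof(key_idx));
            __atomic_thread_fence(__ATOMIC_ACQUIRE);

            for (i = 0; i < ENTRIES; i++) {
                    if (bkt->sig_current[i] == sig && key_idx[i] != EMPTY_SLOT)
                            return (int)i;
            }
            return -1;
    }

    int
    main(void)
    {
            struct bucket bkt = {
                    .sig_current = { 5, 7 },
                    .key_idx = { 1, 2 },
            };

            printf("per-entry: %d, bulk: %d\n",
                   search_per_entry(&bkt, 7), search_bulk(&bkt, 7));
            return 0;
    }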