From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jerin Jacob
To: Honnappa Nagarahalli
CC: "bruce.richardson@intel.com", "pablo.de.lara.guarch@intel.com",
 "dev@dpdk.org", "yipeng1.wang@intel.com", Dharmik Thakkar,
 "Gavin Hu (Arm Technology China)", nd, "thomas@monjalon.net",
 "ferruh.yigit@intel.com", "hemant.agrawal@nxp.com",
 "chaozhu@linux.vnet.ibm.com", "Kapoor, Prasun"
Thread-Topic: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write concurrency
Date: Tue, 6 Nov 2018 09:10:08 +0000
Message-ID: <20181106090953.GA5671@jerin>
References: <1540532253-112591-1-git-send-email-honnappa.nagarahalli@arm.com>
 <1540532253-112591-5-git-send-email-honnappa.nagarahalli@arm.com>
 <20181103115240.GA3608@jerin>
 <20181103154039.GA25488@jerin>
In-Reply-To:
MIME-Version: 1.0
Subject: Re: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write concurrency
List-Id: DPDK patches and discussions

-----Original Message-----
> Date: Tue, 6 Nov 2018 06:07:43 +0000
> From: Honnappa Nagarahalli
> To: Jerin Jacob
> CC: "bruce.richardson@intel.com", "pablo.de.lara.guarch@intel.com",
>  "dev@dpdk.org", "yipeng1.wang@intel.com", Dharmik Thakkar,
>  "Gavin Hu (Arm Technology China)", nd, "thomas@monjalon.net",
>  "ferruh.yigit@intel.com", "hemant.agrawal@nxp.com",
>  "chaozhu@linux.vnet.ibm.com", nd
> Subject: RE: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write
>  concurrency
>
>
> > > > Add lock-free read-write concurrency. This is achieved by the
> > > > following changes.
> > > >
> > > > 1) Add memory ordering to avoid race conditions. The only race
> > > > condition that can occur is using the key store element before
> > > > the key write is completed. Hence, while inserting the element the
> > > > release memory order is used. Any other race condition is caught by
> > > > the key comparison. Memory orderings are added only where needed.
> > > > For example, reads in the writer's context do not need memory
> > > > ordering, as there is a single writer.
> > > >
> > > > key_idx in the bucket entry and pdata in the key store element are
> > > > used for synchronisation.
> > > > key_idx is used to release an inserted
> > > > entry in the bucket to the reader. Use of pdata for synchronisation
> > > > is required due to the update of an existing entry, where only the
> > > > pdata is updated without updating key_idx.
> > > >
> > > > 2) The reader-writer concurrency issue, caused by moving the keys to
> > > > their alternative locations during key insert, is solved by
> > > > introducing a global counter (tbl_chng_cnt) indicating a change in
> > > > the table.
> > > >
> > > > 3) Add the flag to enable reader-writer concurrency during run time.
> > > >
> > > > Signed-off-by: Honnappa Nagarahalli
> > >
> > > Hi Honnappa,
>
> Jerin, thank you for running this test and all the analysis. I have not
> run this test. I was focused on simultaneous reads and writes. You can
> look at the file test_hash_readwrite_lf.c to see the kinds of use cases
> covered.
>
> I am trying to reproduce this, I will get back with more details soon.

OK. Since RC2 is approaching, I am a bit worried about the next steps if
we cannot avoid the regression with this new feature for this release. A
24% performance regression has not been seen in the past for a specific
use case, so I am not sure how we handled similar situations before.

Thomas, any thoughts on this?

>
> > > This patch is causing a _~24%_ performance regression in mpps/core
> > > with 64B packets with l3fwd in EM mode on octeontx.
> > >
> > > Example command to reproduce with 2 cores + 2 ports, l3fwd in hash
> > > mode (-E):
> >
> > Have you run with more cores (8, 16)?

Tried with 4, 8, 16 cores; the performance drop is linear.

>
> > > # l3fwd -v -c 0xf00000 -n 4 -- -P -E -p 0x3 --config="(0, 0, 23),(1, 0, 22)"
> > >
> > > Observations:
> > > 1) When hash lookup is a _success_, the regression is only 3%.
> > > Which kind of makes sense, given the additional new atomic
> > > instructions.
> > >
> > > What I meant by lookup _success_ is:
> > > configuring the traffic gen as below, to match the lookup entries
> > > defined in ipv4_l3fwd_em_route_array() in examples/l3fwd/l3fwd_em.c
> > >
> > > dest.ip port0 201.0.0.0
> > > src.ip port0 200.20.0.1
> > > dest.port port0 102
> > > src.port port0 12
> > >
> > > dest.ip port1 101.0.0.0
> > > src.ip port1 100.10.0.1
> > > dest.port port1 101
> > > src.port port1 11
> > >
> > > tx.type IPv4+TCP
> > >
> > >
> > > 2) When hash lookup _fails_, the per-core mpps regression comes to
> > > around 24% with 64B packet size.
> > >
> > > What I meant by lookup _failure_ is:
> > > configuring the traffic gen not to hit the 5-tuples defined in
> > > ipv4_l3fwd_em_route_array() in examples/l3fwd/l3fwd_em.c
> > >
> > >
> > > 3) perf top _without_ this patch:
> > > 37.30% l3fwd [.] em_main_loop
> > > 22.40% l3fwd [.] rte_hash_lookup
> > > 13.05% l3fwd [.] nicvf_recv_pkts_cksum
> > > 9.70% l3fwd [.] nicvf_xmit_pkts
> > > 6.18% l3fwd [.] ipv4_hash_crc
> > > 4.77% l3fwd [.] nicvf_fill_rbdr
> > > 4.50% l3fwd [.] nicvf_single_pool_free_xmited_buffers
> > > 1.16% libc-2.28.so [.] memcpy
> > > 0.47% l3fwd [.] common_ring_mp_enqueue
> > > 0.44% l3fwd [.] common_ring_mc_dequeue
> > > 0.03% l3fwd [.] strerror_r@plt
> > >
> > > 4) perf top _with_ this patch:
> > >
> > > 47.41% l3fwd [.] rte_hash_lookup
> > > 23.55% l3fwd [.] em_main_loop
> > > 9.53% l3fwd [.] nicvf_recv_pkts_cksum
> > > 6.95% l3fwd [.] nicvf_xmit_pkts
> > > 4.63% l3fwd [.] ipv4_hash_crc
> > > 3.30% l3fwd [.] nicvf_fill_rbdr
> > > 3.29% l3fwd [.] nicvf_single_pool_free_xmited_buffers
> > > 0.76% libc-2.28.so [.] memcpy
> > > 0.30% l3fwd [.] common_ring_mp_enqueue
> > > 0.25% l3fwd [.] common_ring_mc_dequeue
> > > 0.04% l3fwd [.] strerror_r@plt
> > >
> > >
> > > 5) Based on assembly, most of the cycles are spent in rte_hash_lookup
> > > around key_idx = __atomic_load_n(&bkt->key_idx[i]) (which is an LDAR)
> > > and "if (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT) {"
> > >
> > >
> > > 6) Since this patch is big and does the 3 things mentioned above, it
> > > is difficult to pinpoint what is causing the exact issue.
> > >
> > > But my primary analysis points to item (1) (adding the atomic
> > > barriers). I need to spend more cycles to find out the exact causes.
> > >
> > > + Adding the POWERPC maintainer, as POWERPC is most likely also
> > > impacted by this patch.
> >
> > Looks like __atomic_load_n(__ATOMIC_ACQUIRE) will be just a mov
> > instruction on x86, so x86 may not be much impacted.
> >
> > I analyzed it further; it is a plain LD vs
> > __atomic_load_n(__ATOMIC_ACQUIRE) issue.
> >
> > The outer __rte_hash_lookup_with_hash has only 2-ish __atomic_load_n
> > operations, which cause only around 1% regression.
> >
> > But since this patch has "two" __atomic_load_n in each
> > search_one_bucket(), and in the worst case it loops around 16 times,
> > i.e. "32 LDAR per packet", that explains the 24% drop in the lookup
> > miss case and the ~3% drop in the lookup success case.
>
> Agree, 'search_one_bucket' has 2 atomic loads. However, only 1 of them
> should be called during a lookup miss. Ideally, the second one should
> not be called, as the 'if' statement at the top is supposed to block it.
> Are you seeing the 2nd atomic load in your perf output?
> Is your failure traffic carrying multiple flows or a single flow?

l3fwd has only 4 entries. So yes, it is a single flow.

>
> > So this patch's regression will depend on how many cycles an LDAR
> > takes on a given ARMv8 platform and on how many LDAR instructions it
> > can issue at a given point in time.
>
> Agree. There are multiple micro-architectures in the Arm eco-system.
> We should establish a few simple rules to make sure algorithms perform
> well on all the available platforms. I established a few rules in VPP
> and they are working fine so far.

Can you share those rules for everyone's benefit?

>
> > IMO, this scheme won't work. I think, since we are introducing such a
> > performance-critical feature, we need to put it under a function
> > pointer scheme, so that an application that does not need the feature
> > can use plain loads.
> >
> IMO, we should do some more debugging before going into exploring other
> options.

Yes. But how do we align with the v18.11 release?

>
> > We already have a lot of flags in the hash library to define the
> > runtime behavior. I think it makes sense to select a function pointer
> > based on such flags and have a performance-effective solution based on
> > the application's requirements.
> >
> IMO, many flags can be removed by doing some cleanup exercise.

>
> > Just to prove the above root cause analysis, the following patch can
> > fix the performance issue. I know it is NOT correct in the context of
> > this patch. Just pasting it in case someone wants to see the cost of
> > LD vs __atomic_load_n(__ATOMIC_ACQUIRE) on a given platform.
> >
> >
> > On a different note, I think it makes sense to use an RCU-based
> > structure in these cases to avoid the performance issue. liburcu has a
> > good hash library for such cases (very few writes and many reads).
> >
> I think we miss a point here. These changes are based on customer
> feedback and requests. DPDK is not only providing the implementation, it
> is providing the APIs as well. IMO, we should make sure the code we have
> in DPDK supports multiple use cases rather than addressing a narrow use
> case and pointing somewhere else for others.

Sure. But introducing a use case should NOT harm the existing use cases
exercised by the current example applications.
>
> > > The use case like l3fwd in hash mode, where the writer does not
> > > update entries in the fast path (aka insert op), will be impacted by
> > > this patch.
> > >
> I do not think the 'l3fwd in hash mode' application is practical. If the
> aim of this application is to showcase performance, it should represent
> a real-life use case. It should have hash inserts from the control plane
> along with lookups. We also have to note that the algorithm without the
> lock-free patch was not usable for certain use cases on all platforms.
> It was not usable for even more use cases on Arm platforms. So, I am not
> sure if we should be comparing the numbers against the previous
> algorithm.

In the worst case, we need to make it compile-time or runtime with a
function pointer, or whatever it takes to fix it.

> However, there seems to be some scope for improvement with the data
> structure layout. I am looking into this further.

OK

>
> > > 7) Have you checked the l3fwd lookup failure use case in your
> > > environment? If so, please share your observations, and if not,
> > > could you please check it?
> > >
> > > 8) IMO, such a performance regression is not acceptable for the l3fwd
> > > use case, where the hash insert op is done in the slow path.
>
> What is missing in this application is continuous hash adds from the
> slow path.

Yes. That's one use case for the hash library. Some applications may not
need such a use case, so the requirements should be driven by the
application, especially for a feature impacting another use case's
performance.

>
> > > 9) Is anyone else facing this problem?
>
> Any data on x86?