From: Honnappa Nagarahalli
To: Jerin Jacob
CC: "bruce.richardson@intel.com", "pablo.de.lara.guarch@intel.com", "dev@dpdk.org",
    "yipeng1.wang@intel.com", Dharmik Thakkar, "Gavin Hu (Arm Technology China)", nd,
    "thomas@monjalon.net", "ferruh.yigit@intel.com", "hemant.agrawal@nxp.com",
    "chaozhu@linux.vnet.ibm.com", nd
Subject: Re: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write concurrency
Date: Tue, 6 Nov 2018 06:07:43 +0000
In-Reply-To: <20181103154039.GA25488@jerin>
References: <1540532253-112591-1-git-send-email-honnappa.nagarahalli@arm.com>
 <1540532253-112591-5-git-send-email-honnappa.nagarahalli@arm.com>
 <20181103115240.GA3608@jerin>
 <20181103154039.GA25488@jerin>
List-Id: DPDK patches and discussions

> > >
> > > Add lock-free read-write concurrency. This is achieved by the
> > > following changes.
> > >
> > > 1) Add memory ordering to avoid race conditions. The only race
> > > condition that can occur is using the key store element before
> > > the key write is completed. Hence, while inserting the element the
> > > release memory order is used. Any other race condition is caught by
> > > the key comparison. Memory orderings are added only where needed.
> > > For example, reads in the writer's context do not need memory
> > > ordering as there is a single writer.
> > >
> > > key_idx in the bucket entry and pdata in the key store element are
> > > used for synchronisation. key_idx is used to release an inserted
> > > entry in the bucket to the reader. Use of pdata for synchronisation
> > > is required due to the update of an existing entry, wherein only the
> > > pdata is updated without updating key_idx.
> > >
> > > 2) The reader-writer concurrency issue, caused by moving keys to
> > > their alternative locations during key insert, is solved by
> > > introducing a global counter (tbl_chng_cnt) indicating a change in
> > > the table.
> > >
> > > 3) Add the flag to enable reader-writer concurrency at run time.
> > >
> > > Signed-off-by: Honnappa Nagarahalli
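
To make the ordering in (1) concrete, here is a minimal sketch of the
writer/reader pairing on key_idx. The struct layouts and helper names
(writer_publish, reader_read_idx) are simplified stand-ins, not the
actual rte_cuckoo_hash code:

#include <stdint.h>
#include <string.h>

#define ENTRIES 8

/* Simplified stand-ins for the bucket and key-store layout; the real
 * definitions live in rte_cuckoo_hash.h.
 */
struct bucket {
	uint16_t sig_current[ENTRIES];
	uint32_t key_idx[ENTRIES];
};

struct key_entry {
	void *pdata;
	uint8_t key[16];
};

/* Writer (single writer): complete the key-store writes first, then
 * publish the slot with a release store on key_idx. A reader that
 * observes the new key_idx is then guaranteed to also observe the key.
 */
static void
writer_publish(struct bucket *bkt, int i, struct key_entry *k,
	       const void *key, void *pdata, uint16_t sig, uint32_t new_idx)
{
	memcpy(k->key, key, sizeof(k->key));
	__atomic_store_n(&k->pdata, pdata, __ATOMIC_RELEASE);
	bkt->sig_current[i] = sig;
	__atomic_store_n(&bkt->key_idx[i], new_idx, __ATOMIC_RELEASE);
}

/* Reader: the acquire load on key_idx pairs with the release store
 * above; if the slot is seen as non-empty, the key contents are visible.
 */
static uint32_t
reader_read_idx(const struct bucket *bkt, int i)
{
	return __atomic_load_n(&bkt->key_idx[i], __ATOMIC_ACQUIRE);
}

For the key-move case in (2), the idea is roughly that the writer bumps
tbl_chng_cnt around the move and the reader re-checks the counter after
scanning the buckets, retrying the lookup if it changed.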

> >
> > Hi Honnappa,
> >

Jerin, thank you for running this test and all the analysis. I have not
run this test. I was focused on simultaneous reads and writes. You can
look at the file test_hash_readwrite_lf.c for the kind of use cases
targeted. I am trying to reproduce this; I will get back with more
details soon.

> > This patch is causing a _~24%_ performance regression in mpps/core with
> > 64B packets with l3fwd in EM mode on octeontx.
> >
> > Example command to reproduce with 2 core + 2 port l3fwd in hash mode (-E):

Have you run with more cores (8, 16)?

> > # l3fwd -v -c 0xf00000 -n 4 -- -P -E -p 0x3 --config="(0, 0, 23),(1, 0, 22)"
> >
> > Observations:
> > 1) When the hash lookup is a _success_, the regression is only 3%, which
> > kind of makes sense given the additional new atomic instructions.
> >
> > What I mean by lookup _success_ is: configuring the traffic gen as below
> > to match the lookups defined in ipv4_l3fwd_em_route_array() in
> > examples/l3fwd/l3fwd_em.c
> >
> > dest.ip   port0 201.0.0.0
> > src.ip    port0 200.20.0.1
> > dest.port port0 102
> > src.port  port0 12
> >
> > dest.ip   port1 101.0.0.0
> > src.ip    port1 100.10.0.1
> > dest.port port1 101
> > src.port  port1 11
> >
> > tx.type IPv4+TCP
> >
> > 2) When the hash lookup _fails_, the per-core mpps regression comes to
> > around 24% with 64B packet size.
> >
> > What I mean by lookup _failure_ is: configuring the traffic gen not to
> > hit the 5-tuples defined in ipv4_l3fwd_em_route_array() in
> > examples/l3fwd/l3fwd_em.c
> >
> > 3) perf top _without_ this patch
> >    37.30%  l3fwd         [.] em_main_loop
> >    22.40%  l3fwd         [.] rte_hash_lookup
> >    13.05%  l3fwd         [.] nicvf_recv_pkts_cksum
> >     9.70%  l3fwd         [.] nicvf_xmit_pkts
> >     6.18%  l3fwd         [.] ipv4_hash_crc
> >     4.77%  l3fwd         [.] nicvf_fill_rbdr
> >     4.50%  l3fwd         [.] nicvf_single_pool_free_xmited_buffers
> >     1.16%  libc-2.28.so  [.] memcpy
> >     0.47%  l3fwd         [.] common_ring_mp_enqueue
> >     0.44%  l3fwd         [.] common_ring_mc_dequeue
> >     0.03%  l3fwd         [.] strerror_r@plt
> >
> > 4) perf top _with_ this patch
> >    47.41%  l3fwd         [.] rte_hash_lookup
> >    23.55%  l3fwd         [.] em_main_loop
> >     9.53%  l3fwd         [.] nicvf_recv_pkts_cksum
> >     6.95%  l3fwd         [.] nicvf_xmit_pkts
> >     4.63%  l3fwd         [.] ipv4_hash_crc
> >     3.30%  l3fwd         [.] nicvf_fill_rbdr
> >     3.29%  l3fwd         [.] nicvf_single_pool_free_xmited_buffers
> >     0.76%  libc-2.28.so  [.] memcpy
> >     0.30%  l3fwd         [.] common_ring_mp_enqueue
> >     0.25%  l3fwd         [.] common_ring_mc_dequeue
> >     0.04%  l3fwd         [.] strerror_r@plt
> >
> > 5) Based on the assembly, most of the cycles are spent in rte_hash_lookup
> > around key_idx = __atomic_load_n(&bkt->key_idx[i], __ATOMIC_ACQUIRE)
> > (whose LDAR) and "if (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT) {"
> >
> > 6) Since this patch is big and does the three things mentioned above, it
> > is difficult to pinpoint what exactly is causing the issue.
> >
> > But my primary analysis points to item (1) (adding the atomic barriers).
> > I need to spend more cycles to find out the exact causes.
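
For readers following item (5), the two flavours of the key_idx read being
compared in this thread boil down to something like the following
(hypothetical helpers; the LDR vs LDAR mapping is what GCC typically emits
on ARMv8, while on x86 both end up as plain MOV loads):

#include <stdint.h>

static inline uint32_t
key_idx_load_plain(const uint32_t *p)
{
	return *p;                                    /* AArch64: LDR  */
}

static inline uint32_t
key_idx_load_acquire(const uint32_t *p)
{
	return __atomic_load_n(p, __ATOMIC_ACQUIRE);  /* AArch64: LDAR */
}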
>
> + Adding the POWERPC maintainer, as POWERPC is also likely impacted by
> this patch. Looks like __atomic_load_n(__ATOMIC_ACQUIRE) will be just a
> mov instruction on x86, so x86 may not be much impacted.
>
> I analyzed it further; it is a plain LD vs __atomic_load_n(__ATOMIC_ACQUIRE)
> issue.
>
> The outer __rte_hash_lookup_with_hash has only ~2 __atomic_load_n
> operations, which cause only around 1% regression.
>
> But since this patch has "two" __atomic_load_n in each
> search_one_bucket(), and in the worst case it loops around 16 times, i.e.
> "32 LDAR per packet", that explains the 24% drop in the lookup-miss case
> and the ~3% drop in the lookup-success case.

Agree, 'search_one_bucket' has 2 atomic loads. However, only 1 of them
should be executed during a lookup miss. Ideally, the second one should
not be executed, as the 'if' statement at the top is supposed to block it.
Are you seeing the 2nd atomic load in your perf output? Is your failure
traffic made up of multiple flows or a single flow?

> So this patch's regression will depend on how many cycles an LDAR takes
> on a given ARMv8 platform and on how many LDAR instructions it can issue
> at a given point in time.

Agree. There are multiple micro-architectures in the Arm eco-system. We
should establish a few simple rules to make sure algorithms perform well
on all the available platforms. I established a few such rules in VPP and
they are working fine so far.

> IMO, this scheme won't work. I think, since we are introducing such a
> performance-critical feature, we need to put it under a function pointer
> scheme so that if an application does not need the feature it can use
> plain loads.

IMO, we should do some more debugging before exploring other options.

> We already have a lot of flags in the hash library to define the runtime
> behavior. I think it makes sense to select the function pointer based on
> such flags and have a performance-effective solution based on application
> requirements.

IMO, many flags can be removed by doing some cleanup exercise.

> Just to prove the above root cause analysis, the following patch can fix
> the performance issue. I know it is NOT correct in the context of this
> patch. Just pasting it in case someone wants to see the cost of LD vs
> __atomic_load_n(__ATOMIC_ACQUIRE) on a given platform.
>
> On a different note, I think it makes sense to use an RCU based structure
> in these cases to avoid the performance issue. liburcu has a good hash
> library for such cases (very few writes and mostly reads).

I think we miss a point here. These changes are based on customer feedback
and requests. DPDK is not only providing the implementation, it is
providing the APIs as well. IMO, we should make sure the code we have in
DPDK supports multiple use cases rather than addressing a narrow use case
and pointing somewhere else for the others.
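
For reference, the flag-driven dispatch suggested above would look roughly
like this. It is a hypothetical sketch (none of these names, including
FLAG_LF_CONCURRENCY, exist in rte_hash), shown only to illustrate the
trade-off being debated:

#include <stdint.h>

typedef uint32_t (*slot_read_fn)(const uint32_t *key_idx_slot);

static uint32_t
slot_read_plain(const uint32_t *p)      /* plain load, no barrier */
{
	return *p;
}

static uint32_t
slot_read_acquire(const uint32_t *p)    /* acquire load, LDAR on ARMv8 */
{
	return __atomic_load_n(p, __ATOMIC_ACQUIRE);
}

struct table {
	slot_read_fn read_slot;         /* chosen once at creation time */
	/* ... buckets, key store, ... */
};

#define FLAG_LF_CONCURRENCY 0x20        /* hypothetical flag value */

static void
table_init(struct table *t, unsigned int flags)
{
	/* Applications that do not request lock-free read/write
	 * concurrency keep the cheaper plain-load path.
	 */
	t->read_slot = (flags & FLAG_LF_CONCURRENCY) ?
			slot_read_acquire : slot_read_plain;
}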
> /Jerin
>
> @@ -1135,27 +1134,21 @@ search_one_bucket(const struct rte_hash *h,
> const void *key, uint16_t sig,
>                         void **data, const struct rte_hash_bucket *bkt)
>  {
>         int i;
> -       uint32_t key_idx;
> -       void *pdata;
>         struct rte_hash_key *k, *keys = h->key_store;
>
>         for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
> -               key_idx = __atomic_load_n(&bkt->key_idx[i],
> -                                       __ATOMIC_ACQUIRE);
> -               if (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT) {
> +               if (bkt->sig_current[i] == sig &&
> +                               bkt->key_idx[i] != EMPTY_SLOT) {
>                         k = (struct rte_hash_key *) ((char *)keys +
> -                                       key_idx * h->key_entry_size);
> -                       pdata = __atomic_load_n(&k->pdata,
> -                                       __ATOMIC_ACQUIRE);
> -
> +                                       bkt->key_idx[i] *
> +                                       h->key_entry_size);

Does this make a difference for the lookup miss test case?

>                         if (rte_hash_cmp_eq(key, k->key, h) == 0) {
>                                 if (data != NULL)
> -                                       *data = pdata;
> +                                       *data = k->pdata;
>                                 /*
>                                  * Return index where key is stored,
>                                  * subtracting the first dummy index
>                                  */
> -                               return key_idx - 1;
> +                               return bkt->key_idx[i] - 1;
>                         }
>                 }
>         }
>
>
> > The use case like l3fwd in hash mode, where the writer does not update
> > anything in the fast path (i.e. no insert op), will be impacted by this
> > patch.
> >

I do not think the 'l3fwd in hash mode' application is practical. If the
aim of this application is to showcase performance, it should represent a
real-life use case. It should have hash inserts from the control plane
along with lookups. We also have to note that the algorithm without the
lock-free patch was not usable for certain use cases on all platforms. It
was not usable for even more use cases on Arm platforms. So, I am not sure
we should be comparing the numbers with the previous algorithm.

However, there seems to be some scope for improvement in the data
structure layout. I am looking into this further.

> > 7) Have you checked the l3fwd lookup failure use case in your
> > environment? If so, please share your observations, and if not, could
> > you please check it?
> >
> > 8) IMO, such a performance regression is not acceptable for the l3fwd
> > use case, where the hash insert op will be done in the slow path.

What is missing in this application is continuous hash adds from the slow
path.

> >
> > 9) Is anyone else facing this problem? Any data on x86?