From: Honnappa Nagarahalli
To: Jerin Jacob
CC: "bruce.richardson@intel.com", "pablo.de.lara.guarch@intel.com", "dev@dpdk.org",
    "yipeng1.wang@intel.com", Dharmik Thakkar, "Gavin Hu (Arm Technology China)", nd,
    "thomas@monjalon.net", "ferruh.yigit@intel.com", "hemant.agrawal@nxp.com",
    "chaozhu@linux.vnet.ibm.com", nd
Subject: Re: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write concurrency
Date: Tue, 6 Nov 2018 06:07:43 +0000
In-Reply-To: <20181103154039.GA25488@jerin>
References: <1540532253-112591-1-git-send-email-honnappa.nagarahalli@arm.com>
 <1540532253-112591-5-git-send-email-honnappa.nagarahalli@arm.com>
 <20181103115240.GA3608@jerin>
 <20181103154039.GA25488@jerin>
List-Id: DPDK patches and discussions

> > >
> > > Add lock-free read-write concurrency. This is achieved by the
> > > following changes.
> > >
> > > 1) Add memory ordering to avoid race conditions. The only race
> > > condition that can occur is using the key store element before
> > > the key write is completed. Hence, while inserting the element the
> > > release memory order is used. Any other race condition is caught by
> > > the key comparison. Memory orderings are added only where needed.
> > > For example, reads in the writer's context do not need memory
> > > ordering as there is a single writer.
> > >
> > > key_idx in the bucket entry and pdata in the key store element are
> > > used for synchronisation. key_idx is used to release an inserted
> > > entry in the bucket to the reader. Use of pdata for synchronisation
> > > is required due to the update of an existing entry, wherein only the
> > > pdata is updated without updating key_idx.
> > >
> > > 2) The reader-writer concurrency issue, caused by moving keys to
> > > their alternative locations during key insert, is solved by
> > > introducing a global counter (tbl_chng_cnt) indicating a change in
> > > the table.
> > >
> > > 3) Add the flag to enable reader-writer concurrency at run time.
> > >
> > > Signed-off-by: Honnappa Nagarahalli
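
To make the ordering in (1) concrete, here is a minimal sketch of the
writer/reader pairing on key_idx. The struct layouts and helper names
(writer_publish, reader_read_idx) are simplified stand-ins, not the
actual rte_cuckoo_hash code:

#include <stdint.h>
#include <string.h>

#define ENTRIES 8

/* Simplified stand-ins for the bucket and key-store layout; the real
 * definitions live in rte_cuckoo_hash.h.
 */
struct bucket {
	uint16_t sig_current[ENTRIES];
	uint32_t key_idx[ENTRIES];
};

struct key_entry {
	void *pdata;
	uint8_t key[16];
};

/* Writer (single writer): complete the key-store writes first, then
 * publish the slot with a release store on key_idx. A reader that
 * observes the new key_idx is then guaranteed to also observe the key.
 */
static void
writer_publish(struct bucket *bkt, int i, struct key_entry *k,
	       const void *key, void *pdata, uint16_t sig, uint32_t new_idx)
{
	memcpy(k->key, key, sizeof(k->key));
	__atomic_store_n(&k->pdata, pdata, __ATOMIC_RELEASE);
	bkt->sig_current[i] = sig;
	__atomic_store_n(&bkt->key_idx[i], new_idx, __ATOMIC_RELEASE);
}

/* Reader: the acquire load on key_idx pairs with the release store
 * above; if the slot is seen as non-empty, the key contents are visible.
 */
static uint32_t
reader_read_idx(const struct bucket *bkt, int i)
{
	return __atomic_load_n(&bkt->key_idx[i], __ATOMIC_ACQUIRE);
}

For the key-move case in (2), the idea is roughly that the writer bumps
tbl_chng_cnt around the move and the reader re-checks the counter after
scanning the buckets, retrying the lookup if it changed.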

> >
> > Hi Honnappa,
> >

Jerin, thank you for running this test and all the analysis. I have not
run this test. I was focused on simultaneous reads and writes. You can
look at the file test_hash_readwrite_lf.c for the kind of use cases
targeted. I am trying to reproduce this; I will get back with more
details soon.

> > This patch is causing a _~24%_ performance regression in mpps/core with
> > 64B packets with l3fwd in EM mode on octeontx.
> >
> > Example command to reproduce with 2 core + 2 port l3fwd in hash mode (-E):

Have you run with more cores (8, 16)?

> > # l3fwd -v -c 0xf00000 -n 4 -- -P -E -p 0x3 --config="(0, 0, 23),(1, 0, 22)"
> >
> > Observations:
> > 1) When the hash lookup is a _success_, the regression is only 3%, which
> > kind of makes sense given the additional new atomic instructions.
> >
> > What I mean by lookup _success_ is: configuring the traffic gen as below
> > to match the lookups defined in ipv4_l3fwd_em_route_array() in
> > examples/l3fwd/l3fwd_em.c
> >
> > dest.ip   port0 201.0.0.0
> > src.ip    port0 200.20.0.1
> > dest.port port0 102
> > src.port  port0 12
> >
> > dest.ip   port1 101.0.0.0
> > src.ip    port1 100.10.0.1
> > dest.port port1 101
> > src.port  port1 11
> >
> > tx.type IPv4+TCP
> >
> > 2) When the hash lookup _fails_, the per-core mpps regression comes to
> > around 24% with 64B packet size.
> >
> > What I mean by lookup _failure_ is: configuring the traffic gen not to
> > hit the 5-tuples defined in ipv4_l3fwd_em_route_array() in
> > examples/l3fwd/l3fwd_em.c
> >
> > 3) perf top _without_ this patch
> >    37.30%  l3fwd         [.] em_main_loop
> >    22.40%  l3fwd         [.] rte_hash_lookup
> >    13.05%  l3fwd         [.] nicvf_recv_pkts_cksum
> >     9.70%  l3fwd         [.] nicvf_xmit_pkts
> >     6.18%  l3fwd         [.] ipv4_hash_crc
> >     4.77%  l3fwd         [.] nicvf_fill_rbdr
> >     4.50%  l3fwd         [.] nicvf_single_pool_free_xmited_buffers
> >     1.16%  libc-2.28.so  [.] memcpy
> >     0.47%  l3fwd         [.] common_ring_mp_enqueue
> >     0.44%  l3fwd         [.] common_ring_mc_dequeue
> >     0.03%  l3fwd         [.] strerror_r@plt
> >
> > 4) perf top _with_ this patch
> >    47.41%  l3fwd         [.] rte_hash_lookup
> >    23.55%  l3fwd         [.] em_main_loop
> >     9.53%  l3fwd         [.] nicvf_recv_pkts_cksum
> >     6.95%  l3fwd         [.] nicvf_xmit_pkts
> >     4.63%  l3fwd         [.] ipv4_hash_crc
> >     3.30%  l3fwd         [.] nicvf_fill_rbdr
> >     3.29%  l3fwd         [.] nicvf_single_pool_free_xmited_buffers
> >     0.76%  libc-2.28.so  [.] memcpy
> >     0.30%  l3fwd         [.] common_ring_mp_enqueue
> >     0.25%  l3fwd         [.] common_ring_mc_dequeue
> >     0.04%  l3fwd         [.] strerror_r@plt
> >
> > 5) Based on the assembly, most of the cycles are spent in rte_hash_lookup
> > around key_idx = __atomic_load_n(&bkt->key_idx[i], __ATOMIC_ACQUIRE)
> > (whose LDAR) and "if (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT) {"
> >
> > 6) Since this patch is big and does the three things mentioned above, it
> > is difficult to pinpoint what exactly is causing the issue.
> >
> > But my primary analysis points to item (1) (adding the atomic barriers).
> > I need to spend more cycles to find out the exact causes.
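
For readers following item (5), the two flavours of the key_idx read being
compared in this thread boil down to something like the following
(hypothetical helpers; the LDR vs LDAR mapping is what GCC typically emits
on ARMv8, while on x86 both end up as plain MOV loads):

#include <stdint.h>

static inline uint32_t
key_idx_load_plain(const uint32_t *p)
{
	return *p;                                    /* AArch64: LDR  */
}

static inline uint32_t
key_idx_load_acquire(const uint32_t *p)
{
	return __atomic_load_n(p, __ATOMIC_ACQUIRE);  /* AArch64: LDAR */
}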
>
> + Adding the POWERPC maintainer, as POWERPC is also likely impacted by
> this patch. Looks like __atomic_load_n(__ATOMIC_ACQUIRE) will be just a
> mov instruction on x86, so x86 may not be much impacted.
>
> I analyzed it further; it is a plain LD vs __atomic_load_n(__ATOMIC_ACQUIRE)
> issue.
>
> The outer __rte_hash_lookup_with_hash has only ~2 __atomic_load_n
> operations, which cause only around 1% regression.
>
> But since this patch has "two" __atomic_load_n in each
> search_one_bucket(), and in the worst case it loops around 16 times, i.e.
> "32 LDAR per packet", that explains the 24% drop in the lookup-miss case
> and the ~3% drop in the lookup-success case.

Agree, 'search_one_bucket' has 2 atomic loads. However, only 1 of them
should be executed during a lookup miss. Ideally, the second one should
not be executed, as the 'if' statement at the top is supposed to block it.
Are you seeing the 2nd atomic load in your perf output? Is your failure
traffic made up of multiple flows or a single flow?

> So this patch's regression will depend on how many cycles an LDAR takes
> on a given ARMv8 platform and on how many LDAR instructions it can issue
> at a given point in time.

Agree. There are multiple micro-architectures in the Arm eco-system. We
should establish a few simple rules to make sure algorithms perform well
on all the available platforms. I established a few such rules in VPP and
they are working fine so far.

> IMO, this scheme won't work. I think, since we are introducing such a
> performance-critical feature, we need to put it under a function pointer
> scheme so that if an application does not need the feature it can use
> plain loads.

IMO, we should do some more debugging before exploring other options.

> We already have a lot of flags in the hash library to define the runtime
> behavior. I think it makes sense to select the function pointer based on
> such flags and have a performance-effective solution based on application
> requirements.

IMO, many flags can be removed by doing some cleanup exercise.

> Just to prove the above root cause analysis, the following patch can fix
> the performance issue. I know it is NOT correct in the context of this
> patch. Just pasting it in case someone wants to see the cost of LD vs
> __atomic_load_n(__ATOMIC_ACQUIRE) on a given platform.
>
> On a different note, I think it makes sense to use an RCU based structure
> in these cases to avoid the performance issue. liburcu has a good hash
> library for such cases (very few writes and mostly reads).

I think we miss a point here. These changes are based on customer feedback
and requests. DPDK is not only providing the implementation, it is
providing the APIs as well. IMO, we should make sure the code we have in
DPDK supports multiple use cases rather than addressing a narrow use case
and pointing somewhere else for the others.
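
For reference, the flag-driven dispatch suggested above would look roughly
like this. It is a hypothetical sketch (none of these names, including
FLAG_LF_CONCURRENCY, exist in rte_hash), shown only to illustrate the
trade-off being debated:

#include <stdint.h>

typedef uint32_t (*slot_read_fn)(const uint32_t *key_idx_slot);

static uint32_t
slot_read_plain(const uint32_t *p)      /* plain load, no barrier */
{
	return *p;
}

static uint32_t
slot_read_acquire(const uint32_t *p)    /* acquire load, LDAR on ARMv8 */
{
	return __atomic_load_n(p, __ATOMIC_ACQUIRE);
}

struct table {
	slot_read_fn read_slot;         /* chosen once at creation time */
	/* ... buckets, key store, ... */
};

#define FLAG_LF_CONCURRENCY 0x20        /* hypothetical flag value */

static void
table_init(struct table *t, unsigned int flags)
{
	/* Applications that do not request lock-free read/write
	 * concurrency keep the cheaper plain-load path.
	 */
	t->read_slot = (flags & FLAG_LF_CONCURRENCY) ?
			slot_read_acquire : slot_read_plain;
}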
> /Jerin
>
> @@ -1135,27 +1134,21 @@ search_one_bucket(const struct rte_hash *h,
> const void *key, uint16_t sig,
>                         void **data, const struct rte_hash_bucket *bkt)
>  {
>         int i;
> -       uint32_t key_idx;
> -       void *pdata;
>         struct rte_hash_key *k, *keys = h->key_store;
>
>         for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
> -               key_idx = __atomic_load_n(&bkt->key_idx[i],
> -                                       __ATOMIC_ACQUIRE);
> -               if (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT) {
> +               if (bkt->sig_current[i] == sig &&
> +                               bkt->key_idx[i] != EMPTY_SLOT) {
>                         k = (struct rte_hash_key *) ((char *)keys +
> -                                       key_idx * h->key_entry_size);
> -                       pdata = __atomic_load_n(&k->pdata,
> -                                       __ATOMIC_ACQUIRE);
> -
> +                                       bkt->key_idx[i] *
> +                                       h->key_entry_size);

Does this make a difference for the lookup miss test case?

>                         if (rte_hash_cmp_eq(key, k->key, h) == 0) {
>                                 if (data != NULL)
> -                                       *data = pdata;
> +                                       *data = k->pdata;
>                                 /*
>                                  * Return index where key is stored,
>                                  * subtracting the first dummy index
>                                  */
> -                               return key_idx - 1;
> +                               return bkt->key_idx[i] - 1;
>                         }
>                 }
>         }
>
>
> > The use case like l3fwd in hash mode, where the writer does not update
> > anything in the fast path (i.e. no insert op), will be impacted by this
> > patch.
> >

I do not think the 'l3fwd in hash mode' application is practical. If the
aim of this application is to showcase performance, it should represent a
real-life use case. It should have hash inserts from the control plane
along with lookups. We also have to note that the algorithm without the
lock-free patch was not usable for certain use cases on all platforms. It
was not usable for even more use cases on Arm platforms. So, I am not sure
we should be comparing the numbers with the previous algorithm.

However, there seems to be some scope for improvement in the data
structure layout. I am looking into this further.

> > 7) Have you checked the l3fwd lookup failure use case in your
> > environment? If so, please share your observations, and if not, could
> > you please check it?
> >
> > 8) IMO, such a performance regression is not acceptable for the l3fwd
> > use case, where the hash insert op will be done in the slow path.

What is missing in this application is continuous hash adds from the slow
path.

> >
> > 9) Is anyone else facing this problem? Any data on x86?