From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Ananyev, Konstantin"
To: Honnappa Nagarahalli, "dev@dpdk.org"
CC: "olivier.matz@6wind.com", nd, nd
Date: Thu, 26 Mar 2020 01:50:11 +0000
References: <20200224113515.1744-1-konstantin.ananyev@intel.com>
Subject: Re: [dpdk-dev] [RFC 0/6] New sync modes for ring
List-Id: DPDK patches and discussions

>
> >
> > Subject: [dpdk-dev] [RFC 0/6] New sync modes for ring
> >
> > Upfront note - that RFC is not a complete patch.
> > It introduces an ABI breakage, plus it doesn't update the ring_elem code
> > properly,
> As per the current rules, these changes (in the current form) will be
> accepted only for the 20.11 release. How do we address this for immediate
> requirements like RCU defer APIs?

I think I found a way to introduce these new modes without an API/ABI breakage.
Working on v1 right now. Plan to submit it by the end of this week/start of
the next one.

> I suggest that we move forward with my RFC (taking into consideration your
> feedback) to make progress on the RCU APIs.
>
> > etc.
> > I plan to deal with all these things in later versions.
> > Right now I seek initial feedback about the proposed ideas.
> > Would also ask people to repeat the performance tests (see below) on their
> > platforms to confirm the impact.
> >
> > More and more customers use (or try to use) DPDK-based apps within
> > overcommitted systems (multiple active threads over the same physical
> > cores): VM and container deployments, etc.
> > One quite common problem they hit is Lock-Holder-Preemption (LHP) with
> > rte_ring.
> > LHP is quite a common problem for spin-based sync primitives (spin-locks,
> > etc.) on overcommitted systems.
> > The situation gets much worse when some sort of fair-locking technique is
> > used (ticket-lock, etc.), as then not only the lock-owner's but also the
> > lock-waiters' scheduling order matters a lot.
> > This is a well-known problem for kernels running within VMs:
> > http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
> > https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf
> > The problem with rte_ring is that while head acquisition is a sort of
> > unfair locking, waiting on the tail is very similar to a ticket-lock
> > scheme - the tail has to be updated in a particular order.
> > That makes the current rte_ring implementation perform really poorly in
> > some overcommitted scenarios.
> > While it is probably not possible to completely resolve this problem in
> > userspace only (without some kernel communication/intervention), removing
> > fairness in the tail update can mitigate it significantly.
> > So this RFC proposes two new optional ring synchronization modes:
> >
> > 1) Head/Tail Sync (HTS) mode
> > In that mode the enqueue/dequeue operation is fully serialized:
> > only one thread at a time is allowed to perform the given op.
> > As another enhancement, it provides the ability to split an
> > enqueue/dequeue operation into two phases:
> > - enqueue/dequeue start
> > - enqueue/dequeue finish
> > That allows the user to inspect objects in the ring without removing
> > them from it (aka MT-safe peek).
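To give a better idea of how that two-phase scheme could look from the
application side, here is a rough usage sketch. The function names and the
inspect()/consume() helpers below are illustrative only - the actual API is
not finalized in this RFC:

#include <errno.h>
#include <stdint.h>
#include <rte_ring.h>

int inspect(void *obj);   /* app-provided: decide whether to take the object */
void consume(void *obj);  /* app-provided: use the object */

static int
peek_one(struct rte_ring *r)
{
	void *obj;
	uint32_t n;

	/* phase 1: reserve one object; it stays inside the ring */
	n = rte_ring_hts_dequeue_start(r, &obj, 1, NULL);
	if (n == 0)
		return -ENOENT;

	if (inspect(obj) != 0) {
		/* finish with zero: the object remains in the ring */
		rte_ring_hts_dequeue_finish(r, 0);
		return -EAGAIN;
	}

	/* phase 2: actually remove the object from the ring */
	rte_ring_hts_dequeue_finish(r, n);
	consume(obj);
	return 0;
}

Until finish() is called no other consumer can proceed, which is what makes
the peek MT-safe - a direct consequence of the dequeue being fully serialized
in HTS mode.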
> IMO, this will not address the problem described above.

It does, please see the results produced by ring_stress_*autotest below.
Let's say for the test-case 8thread@2core(--lcores='6,(10-13)@7,(20-23)@8')
it shows the following avg number of cycles per object for enqueue/dequeue:

MP/MC: 280314.32
HTS:      294.72
RTS:      318.79

A customer who tried it reported a similar level of improvement.
Actually, if you have time, it would be very interesting to see what the
numbers will be on ARM boxes.
To reproduce, just:

$ cat ring_tests_u4
ring_stress_autotest
ring_stress_hts_autotest
ring_stress_rts_autotest

$ /app/test/dpdk-test --lcores='6,(10-13)@7,(20-23)@8' -n 4 < ring_tests_u4 2>&1 | tee res1

Then look at the 'AGGREGATE' stats.
Right now it is a bit too verbose, so probably the easiest way to extract the
same numbers quickly:

$ grep 'cycles/obj' res1 | cat -n | awk '{if ($(1)%9==0) print $(NF);}'
280314.32
1057833.55
294.72
480.10
318.79
461.52

The first 2 numbers are for MP/MC, the next 2 for HTS, the last 2 for RTS.

> For ex: when a producer updates the head and gets scheduled out, other
> producers have to spin.

Sure, as I wrote in the original cover letter:
"While it is probably not possible to completely resolve this problem in
userspace only (without some kernel communication/intervention),
removing fairness in the tail update can mitigate it significantly."
Results from the ring_stress_*_autotest confirm that.

> The problem is probably worse, as in the non-HTS case moving of the head and
> copying of the ring elements can happen in parallel between the producers
> (similarly for consumers).

Yes, as we serialize the ring, we remove the possibility of a simultaneous
copy. That's why for 'normal' cases (one thread per core) the original MP/MC
is usually faster.
Though in overcommitted scenarios the current MP/MC performance degrades
dramatically.
The main problem with the current MP/MC implementation is that the tail update
has to be done in strict order (a sort of fair locking scheme).
Which means we get a much, much worse LHP manifestation than when we use
unfair schemes.
With a serialized ring (HTS) we remove that ordering completely
(same idea as the switch from fair to unfair locking for PV spin-locks).

> IMO, HTS should not be a configurable flag.

Why?

> In the RCU requirement, an MP enqueue and an HTS dequeue are required.

This is supported; the user can specify different modes for consumer and
producer: (0 | RING_F_MC_HTS_DEQ).
Then it is up to the user either to call the generic
rte_ring_enqueue/rte_ring_dequeue, or to specify the mode explicitly by
function name: rte_ring_mp_enqueue_bulk/rte_ring_hts_dequeue_bulk.
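A minimal sketch of that combination, assuming the flag and function naming
from this RFC (the exact prototypes, especially rte_ring_hts_dequeue_bulk(),
may still change in v1):

#include <rte_ring.h>

/* ring with the default MP producer and a serialized (HTS) consumer */
static struct rte_ring *
create_mp_hts_ring(void)
{
	return rte_ring_create("rcu_defer_ring", 1024, SOCKET_ID_ANY,
			RING_F_MC_HTS_DEQ);
}

static void
use_ring(struct rte_ring *r, void *objs[], unsigned int n)
{
	/* producers: classic lock-free MP enqueue, any number of threads */
	rte_ring_mp_enqueue_bulk(r, objs, n, NULL);

	/* consumers: fully serialized HTS dequeue */
	rte_ring_hts_dequeue_bulk(r, objs, n, NULL);
}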
>
> > 2) Relaxed Tail Sync (RTS)
> > The main difference from the original MP/MC algorithm is that the tail
> > value is increased not by every thread that finished enqueue/dequeue,
> > but only by the last one.
> > That allows threads to avoid spinning on the ring tail value, leaving the
> > actual tail value change to the last thread in the update queue.
> This can be a configurable flag on the ring.
> I am not sure how this solves the problem you have stated above completely.
> Updating the count from all intermediate threads is still required to update
> the value of the head. But yes, it reduces the severity of the problem by
> not enforcing the order in which the tail is updated.

As I said above, the main source of slowdown here is that the tail has to be
updated in a particular order.
So the main objective (same as for HTS) is to remove that ordering.
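To illustrate, below is a heavily simplified sketch of the "only the last
thread moves the tail" idea, using C11 atomics. This is not the code from the
patches (the real implementation also enforces htd_max, handles wrap-around,
etc.); treat it as a concept sketch only:

#include <stdatomic.h>
#include <stdint.h>

/* position and an ops counter packed into one 64-bit word, so that both
 * can be updated together with a single CAS */
union poscnt {
	uint64_t raw;
	struct {
		uint32_t cnt; /* started (head) / finished (tail) ops */
		uint32_t pos; /* ring position */
	} val;
};

struct rts_headtail {
	_Atomic uint64_t head; /* advanced by each thread at acquire time */
	_Atomic uint64_t tail; /* position visible to the opposite side */
};

/* Called by each thread once its copy to/from the ring is complete.
 * The acquire side (not shown) advances head.val.pos by n objects and
 * head.val.cnt by one, with a similar CAS loop. */
static void
rts_update_tail(struct rts_headtail *ht)
{
	union poscnt h, ot, nt;

	ot.raw = atomic_load_explicit(&ht->tail, memory_order_acquire);
	do {
		h.raw = atomic_load_explicit(&ht->head, memory_order_relaxed);
		nt = ot;
		/* every finishing thread bumps the completion counter, but
		 * only the one that makes it catch up with the number of
		 * started ops publishes a new position - and it jumps
		 * straight to the current head, no per-thread ordering */
		if (++nt.val.cnt == h.val.cnt)
			nt.val.pos = h.val.pos;
	} while (!atomic_compare_exchange_weak_explicit(&ht->tail,
			&ot.raw, nt.raw,
			memory_order_release, memory_order_acquire));
}

A thread that is not the last one does a single CAS and moves on; nobody
spins waiting for their "turn" to update the tail, which is exactly the
fairness removal discussed above.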
> I also think it introduces a problem on the other side of the ring, because
> the tail is not updated soon enough (the other side has to wait longer for
> the elements to become available).

Yes, producer/consumer starvation.
That's why we need a max allowed Head-Tail-Distance (htd_max) - to limit how
far the head can go away from the tail.

> It also introduces another configuration parameter (HTD_MAX_DEF) which they
> have to deal with.

If the user doesn't provide any value, it will be set by default to
ring.capacity / 8.
In my measurements that works quite well.
Though there is the possibility for the user to set another value, if needed.

> Users still have to implement the current hypervisor related solutions.

Didn't get what you are trying to say with that phrase.

> IMO, we should run the benchmark for this on an overcommitted setup to
> understand the benefits.

That's why I created the ring_stress_*autotest test-cases and collected the
numbers provided below.
I suppose they clearly show the problem in overcommitted scenarios,
and how RTS/HTS improve that situation.
Would appreciate it if you repeat these tests on your machines.

>
> >
> > Test results on IA (see below) show significant improvements in average
> > enqueue/dequeue op times on overcommitted systems.
> > For 'classic' DPDK deployments (one thread per core) the original MP/MC
> > algorithm still shows the best numbers, though for the 64-bit target the
> > RTS numbers are not that far away.
> > Numbers were produced by ring_stress_*autotest (first patch in this
> > series).
> >
> > X86_64 @ Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
> > DEQ+ENQ average cycles/obj
> >
> >                                                      MP/MC      HTS      RTS
> > 1thread@1core(--lcores=6-7)                           8.00     8.15     8.99
> > 2thread@2core(--lcores=6-8)                          19.14    19.61    20.35
> > 4thread@4core(--lcores=6-10)                         29.43    29.79    31.82
> > 8thread@8core(--lcores=6-14)                        110.59   192.81   119.50
> > 16thread@16core(--lcores=6-22)                      461.03   813.12   495.59
> > 32thread@32core(--lcores='6-22,55-70')              982.90  1972.38  1160.51
> >
> > 2thread@1core(--lcores='6,(10-11)@7')             20140.50    23.58    25.14
> > 4thread@2core(--lcores='6,(10-11)@7,(20-21)@8')  153680.60    76.88    80.05
> > 8thread@2core(--lcores='6,(10-13)@7,(20-23)@8')  280314.32   294.72   318.79
> > 16thread@2core(--lcores='6,(10-17)@7,(20-27)@8') 643176.59  1144.02  1175.14
> > 32thread@2core(--lcores='6,(10-25)@7,(30-45)@8') 4264238.80 4627.48  4892.68
> >
> > 8thread@2core(--lcores='6,(10-17)@(7,8)')         321085.98   298.59   307.47
> > 16thread@4core(--lcores='6,(20-35)@(7-10)')      1900705.61   575.35   678.29
> > 32thread@4core(--lcores='6,(20-51)@(7-10)')      5510445.85  2164.36  2714.12
> >
> > i686 @ Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
> > DEQ+ENQ average cycles/obj
> >
> >                                                      MP/MC      HTS      RTS
> > 1thread@1core(--lcores=6-7)                           7.85    12.13    11.31
> > 2thread@2core(--lcores=6-8)                          17.89    24.52    21.86
> > 8thread@8core(--lcores=6-14)                         32.58   354.20    54.58
> > 32thread@32core(--lcores='6-22,55-70')              813.77  6072.41  2169.91
> >
> > 2thread@1core(--lcores='6,(10-11)@7')             16095.00    36.06    34.74
> > 8thread@2core(--lcores='6,(10-13)@7,(20-23)@8')  1140354.54  346.61   361.57
> > 16thread@2core(--lcores='6,(10-17)@7,(20-27)@8') 1920417.86 1314.90  1416.65
> >
> > 8thread@2core(--lcores='6,(10-17)@(7,8)')         594358.61   332.70   357.74
> > 32thread@4core(--lcores='6,(20-51)@(7-10)')      5319896.86  2836.44  3028.87
> >
> > Konstantin Ananyev (6):
> >   test/ring: add contention stress test
> >   ring: rework ring layout to allow new sync schemes
> >   ring: introduce RTS ring mode
> >   test/ring: add contention stress test for RTS ring
> >   ring: introduce HTS ring mode
> >   test/ring: add contention stress test for HTS ring
> >
> >  app/test/Makefile                      |   3 +
> >  app/test/meson.build                   |   3 +
> >  app/test/test_pdump.c                  |   6 +-
> >  app/test/test_ring_hts_stress.c        |  28 ++
> >  app/test/test_ring_rts_stress.c        |  28 ++
> >  app/test/test_ring_stress.c            |  27 ++
> >  app/test/test_ring_stress.h            | 477 +++++++++++++++++++
> >  lib/librte_pdump/rte_pdump.c           |   2 +-
> >  lib/librte_port/rte_port_ring.c        |  12 +-
> >  lib/librte_ring/Makefile               |   4 +-
> >  lib/librte_ring/meson.build            |   4 +-
> >  lib/librte_ring/rte_ring.c             |  84 +++-
> >  lib/librte_ring/rte_ring.h             | 619 +++++++++++++++++++++++--
> >  lib/librte_ring/rte_ring_elem.h        |   8 +-
> >  lib/librte_ring/rte_ring_hts_generic.h | 228 +++++++++
> >  lib/librte_ring/rte_ring_rts_generic.h | 240 ++++++++++
> >  16 files changed, 1721 insertions(+), 52 deletions(-)
> >  create mode 100644 app/test/test_ring_hts_stress.c
> >  create mode 100644 app/test/test_ring_rts_stress.c
> >  create mode 100644 app/test/test_ring_stress.c
> >  create mode 100644 app/test/test_ring_stress.h
> >  create mode 100644 lib/librte_ring/rte_ring_hts_generic.h
> >  create mode 100644 lib/librte_ring/rte_ring_rts_generic.h
> >
> > --
> > 2.17.1