From mboxrd@z Thu Jan 1 00:00:00 1970
From: Konstantin Ananyev
To: dev@dpdk.org
Cc: honnappa.nagarahalli@arm.com, Konstantin Ananyev
Date: Tue, 31 Mar 2020 17:43:22 +0100
Message-Id: <20200331164330.28854-1-konstantin.ananyev@intel.com>
X-Mailer: git-send-email 2.18.0
In-Reply-To: <20200224113515.1744-1-konstantin.ananyev@intel.com>
References: <20200224113515.1744-1-konstantin.ananyev@intel.com>
Subject: [dpdk-dev] [PATCH v1 0/8] New sync modes for ring
List-Id: DPDK patches and discussions

RFC - V1 changes:
1. Remove ABI breakage (at least I hope I did)
2. Add support for ring_elem
3. Rework peek related API a bit
4. Rework test to make it less verbose and unite all test-cases
   in one command
5. Add new test-case for MT peek API

TODO list:
1. Add C11 atomics support
2. Update docs

These days many customers use (or try to use) DPDK-based apps within
overcommitted systems (multiple active threads over the same physical
cores): VM, container deployments, etc.
One quite common problem they hit:
Lock-Holder-Preemption/Lock-Waiter-Preemption (LHP/LWP) with rte_ring.
LHP/LWP are quite a common problem for spin-based sync primitives
(spin-locks, etc.) on overcommitted systems.
The situation gets much worse when some sort of fair-locking technique
is used (ticket-lock, etc.), as now not only the lock-owner but also
the lock-waiters' scheduling order matters a lot.
This is a well-known problem for kernels within VMs:
http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf

The problem with rte_ring is that while head acquisition is a sort of
un-fair locking, waiting on tail is very similar to the ticket lock
schema - tail has to be updated in a particular order.
That makes the current rte_ring implementation perform really poorly
in some overcommitted scenarios.
While it is probably not possible to completely resolve the LHP problem
in userspace only (without some kernel communication/intervention),
removing fairness in tail update can mitigate it significantly.
So this RFC proposes two new optional ring synchronization modes:

1) Head/Tail Sync (HTS) mode
In that mode the enqueue/dequeue operation is fully serialized:
only one thread at a time is allowed to perform a given op.
As another enhancement, provide the ability to split the
enqueue/dequeue operation into two phases:
  - enqueue/dequeue start
  - enqueue/dequeue finish
That allows the user to inspect objects in the ring without removing
them from it (aka MT safe peek).

2) Relaxed Tail Sync (RTS)
The main difference from the original MP/MC algorithm is that
the tail value is increased not by every thread that finished
enqueue/dequeue, but only by the last one.
That allows threads to avoid spinning on the ring tail value,
leaving the actual tail value change to the last thread in the
update queue.

Note that these new sync modes are optional.
For current rte_ring users nothing should change
(both in terms of API/ABI and performance).
Existing sync modes MP/MC, SP/SC are kept untouched, set up in the
same way (via flags and _init_), and MP/MC remains the default one.
The only thing that changed:
the format of prod/cons now could differ depending on the mode
selected at _init_. So the user has to stick with one sync model
through the whole ring lifetime.
In other words, the user can't create a ring for, let's say, SP mode
and then in the middle of the data-path change his mind and start
using MP_RTS mode.
For existing modes (SP/MP, SC/MC) the format remains the same and
the user can still use them interchangeably, though of course it is
an error-prone practice.

Test results on IA (see below) show significant improvements
for average enqueue/dequeue op times on overcommitted systems.
For 'classic' DPDK deployments (one thread per core) the original
MP/MC algorithm still shows the best numbers, though for the 64-bit
target the RTS numbers are not that far away.
Numbers were produced by the new UT test-case ring_stress_autotest, i.e.:
echo ring_stress_autotest | ./dpdk-test -n 4 --lcores='...'
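
To make the RTS idea above a bit more concrete, here is a minimal
sketch of a "last finisher moves the tail" update, written with plain
C11 atomics. It is only an illustration of the description in this
cover letter, not the code from the patches: all names (poscnt,
rts_headtail, rts_start, rts_finish, rts_tail_pos) are hypothetical,
the free-space check against the opposite tail is left out, and the
memory orderings are a conservative guess.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout: a position plus a counter packed into 64 bits,
 * so that both can be updated with a single compare-and-swap. */
union poscnt {
	uint64_t raw;
	struct {
		uint32_t cnt; /* head: ops started; tail: ops finished */
		uint32_t pos; /* ring position */
	} val;
};

struct rts_headtail {
	_Atomic uint64_t head;
	_Atomic uint64_t tail;
};

/* Phase 1: reserve n slots. Moves head.pos forward and bumps head.cnt.
 * Returns the old head position (start of the reserved range).
 * The capacity check a real ring needs is omitted in this sketch. */
static uint32_t
rts_start(struct rts_headtail *ht, uint32_t n)
{
	union poscnt oh, nh;

	oh.raw = atomic_load_explicit(&ht->head, memory_order_acquire);
	do {
		nh.val.pos = oh.val.pos + n;
		nh.val.cnt = oh.val.cnt + 1;
	} while (!atomic_compare_exchange_weak_explicit(&ht->head,
			&oh.raw, nh.raw,
			memory_order_acquire, memory_order_acquire));
	return oh.val.pos;
}

/* Phase 2: mark one op as finished. Every finisher bumps tail.cnt,
 * but tail.pos only jumps to head.pos when the counters match,
 * i.e. when the caller is the last op still in flight.
 * Nobody spins waiting for earlier ops to complete. */
static void
rts_finish(struct rts_headtail *ht)
{
	union poscnt h, ot, nt;

	ot.raw = atomic_load_explicit(&ht->tail, memory_order_acquire);
	do {
		h.raw = atomic_load_explicit(&ht->head, memory_order_relaxed);
		/* the caller itself must still be counted as in flight */
		assert(ot.val.cnt != h.val.cnt);
		nt.raw = ot.raw;
		if (++nt.val.cnt == h.val.cnt)
			nt.val.pos = h.val.pos;
	} while (!atomic_compare_exchange_weak_explicit(&ht->tail,
			&ot.raw, nt.raw,
			memory_order_acq_rel, memory_order_acquire));
}

/* Helper: read the currently published tail position. */
static uint32_t
rts_tail_pos(struct rts_headtail *ht)
{
	union poscnt t;

	t.raw = atomic_load_explicit(&ht->tail, memory_order_acquire);
	return t.val.pos;
}
```

With two ops in flight (rts_start() called twice, for 4 and 2 slots),
the first rts_finish() leaves the published tail position unchanged,
and the second one moves it past both reservations in one step. That
is the trade-off behind RTS: no thread spins on the tail, at the price
of the tail becoming visible a bit later.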

X86_64 @ Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
DEQ+ENQ average cycles/obj
                                                       MP/MC       HTS       RTS
1thread@1core(--lcores=6-7)                             8.00      8.15      8.99
2thread@2core(--lcores=6-8)                            19.14     19.61     20.35
4thread@4core(--lcores=6-10)                           29.43     29.79     31.82
8thread@8core(--lcores=6-14)                          110.59    192.81    119.50
16thread@16core(--lcores=6-22)                        461.03    813.12    495.59
32thread/@32core(--lcores='6-22,55-70')               982.90   1972.38   1160.51

2thread@1core(--lcores='6,(10-11)@7'                20140.50     23.58     25.14
4thread@2core(--lcores='6,(10-11)@7,(20-21)@8'     153680.60     76.88     80.05
8thread@2core(--lcores='6,(10-13)@7,(20-23)@8'     280314.32    294.72    318.79
16thread@2core(--lcores='6,(10-17)@7,(20-27)@8'    643176.59   1144.02   1175.14
32thread@2core(--lcores='6,(10-25)@7,(30-45)@8'   4264238.80   4627.48   4892.68

8thread@2core(--lcores='6,(10-17)@(7,8))'          321085.98    298.59    307.47
16thread@4core(--lcores='6,(20-35)@(7-10))'       1900705.61    575.35    678.29
32thread@4core(--lcores='6,(20-51)@(7-10))'       5510445.85   2164.36   2714.12

i686 @ Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
DEQ+ENQ average cycles/obj
                                                       MP/MC       HTS       RTS
1thread@1core(--lcores=6-7)                             7.85     12.13     11.31
2thread@2core(--lcores=6-8)                            17.89     24.52     21.86
8thread@8core(--lcores=6-14)                           32.58    354.20     54.58
32thread/@32core(--lcores='6-22,55-70')               813.77   6072.41   2169.91

2thread@1core(--lcores='6,(10-11)@7'                16095.00     36.06     34.74
8thread@2core(--lcores='6,(10-13)@7,(20-23)@8'    1140354.54    346.61    361.57
16thread@2core(--lcores='6,(10-17)@7,(20-27)@8'   1920417.86   1314.90   1416.65

8thread@2core(--lcores='6,(10-17)@(7,8))'          594358.61    332.70    357.74
32thread@4core(--lcores='6,(20-51)@(7-10))'       5319896.86   2836.44   3028.87

Konstantin Ananyev (8):
  test/ring: add contention stress test
  ring: prepare ring to allow new sync schemes
  ring: introduce RTS ring mode
  test/ring: add contention stress test for RTS ring
  ring: introduce HTS ring mode
  test/ring: add contention stress test for HTS ring
  ring: introduce peek style API
  test/ring: add stress test for MT peek API

 app/test/Makefile                      |   5 +
 app/test/meson.build                   |   5 +
 app/test/test_pdump.c                  |   6 +-
 app/test/test_ring_hts_stress.c        |  32 ++
 app/test/test_ring_mpmc_stress.c       |  31 ++
 app/test/test_ring_peek_stress.c       |  43 +++
 app/test/test_ring_rts_stress.c        |  32 ++
 app/test/test_ring_stress.c            |  57 ++++
 app/test/test_ring_stress.h            |  37 +++
 app/test/test_ring_stress_impl.h       | 436 +++++++++++++++++++++++++
 lib/librte_pdump/rte_pdump.c           |   2 +-
 lib/librte_port/rte_port_ring.c        |  12 +-
 lib/librte_ring/Makefile               |   9 +-
 lib/librte_ring/meson.build            |   9 +-
 lib/librte_ring/rte_ring.c             | 114 ++++++-
 lib/librte_ring/rte_ring.h             | 243 ++++++++++++--
 lib/librte_ring/rte_ring_c11_mem.h     |  44 +++
 lib/librte_ring/rte_ring_elem.h        | 105 +++++-
 lib/librte_ring/rte_ring_generic.h     |  48 +++
 lib/librte_ring/rte_ring_hts.h         | 210 ++++++++++++
 lib/librte_ring/rte_ring_hts_elem.h    | 205 ++++++++++++
 lib/librte_ring/rte_ring_hts_generic.h | 235 +++++++++++++
 lib/librte_ring/rte_ring_peek.h        | 379 +++++++++++++++++++++
 lib/librte_ring/rte_ring_rts.h         | 316 ++++++++++++++++++
 lib/librte_ring/rte_ring_rts_elem.h    | 205 ++++++++++++
 lib/librte_ring/rte_ring_rts_generic.h | 210 ++++++++++++
 26 files changed, 2974 insertions(+), 56 deletions(-)
 create mode 100644 app/test/test_ring_hts_stress.c
 create mode 100644 app/test/test_ring_mpmc_stress.c
 create mode 100644 app/test/test_ring_peek_stress.c
 create mode 100644 app/test/test_ring_rts_stress.c
 create mode 100644 app/test/test_ring_stress.c
 create mode 100644 app/test/test_ring_stress.h
 create mode 100644 app/test/test_ring_stress_impl.h
 create mode 100644 lib/librte_ring/rte_ring_hts.h
 create mode 100644 lib/librte_ring/rte_ring_hts_elem.h
 create mode 100644 lib/librte_ring/rte_ring_hts_generic.h
 create mode 100644 lib/librte_ring/rte_ring_peek.h
 create mode 100644 lib/librte_ring/rte_ring_rts.h
 create mode 100644 lib/librte_ring/rte_ring_rts_elem.h
 create mode 100644 lib/librte_ring/rte_ring_rts_generic.h

-- 
2.17.1