From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id 6561842CCA; Thu, 15 Jun 2023 22:13:55 +0200 (CEST) Received: from mails.dpdk.org (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id 4DECF41104; Thu, 15 Jun 2023 22:13:52 +0200 (CEST) Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by mails.dpdk.org (Postfix) with ESMTP id 8FA3440FAE for ; Thu, 15 Jun 2023 22:13:50 +0200 (CEST) Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 3CF231FB; Thu, 15 Jun 2023 13:14:34 -0700 (PDT) Received: from ampere-altra-2-1.usa.Arm.com (ampere-altra-2-1.usa.arm.com [10.118.91.158]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPA id EF77D3F663; Thu, 15 Jun 2023 13:13:49 -0700 (PDT) From: Wathsala Vithanage To: honnappa.nagarahalli@arm.com, konstantin.v.ananyev@yandex.ru, thomas@monjalon.net, ruifeng.wang@arm.com Cc: dev@dpdk.org, nd@arm.com, Wathsala Vithanage Subject: [RFC] ring: Further performance improvements with C11 Date: Thu, 15 Jun 2023 20:13:34 +0000 Message-Id: <20230615201335.919563-1-wathsala.vithanage@arm.com> X-Mailer: git-send-email 2.25.1 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org (1) Replace tail store with RELEASE semantics with a RELEASE fence and load of the tail in producer/consumer head update with ACQUIRE semantics with ACQUIRE fences. ------------------------------------------------------------------------ Use __ATOMIC_ACQUIRE fences between producer/consumer head updates and ring element accesses. CAS instructions always load the memory address it is supposed to conditionally update. An ACQUIRE fence between the CAS (LOAD) and the LOADs to the ring elements prevents reordering of these operations in a producer thread. Similarly, it also prevents reordering of CAS and the STOREs to the ring elements that follow in a consumer thread. An __ATOMIC_RELEASE fence in __rte_ring_update_tail prevents reordering of tail update (STORE) and LOADs to the ring elements in conumer threads. It also prevents reordering of the tail update and the STOREs to the ring in producer threads. This pairing can improve performance by eliminating the loading of the tail with ACQUIRE semantics in the consumer/producer head update loop. (2) The removal of ACQUIRE fence between the Load of the old_head and load of the cons_tail in __rte_ring_move_prod_head and __rte_ring_move_cons_head functions. ----------------------------------------------------------------------- ACQUIRE fence between the two loads established program order to prevent use of an outdated consumer or producer tail values in the computation of free_entries and entries in the two functions respectively. However, (A) the problem described above cannot be solved by establishing the program order by using a fence, and (B) the resulting values for free_entries and entries are safe for use even with an outdated head value. (A) First example below shows an example where load of the old_head and the cons_tail is reordered resulting in a free_entries computation that involves an outdated consumer tail. The second is a counter example where the same outdated consumer tail is used to compute the free_entries even with an ACQUIRE fence. Initial values for both (i) example, and (ii) counter example are capacity = 32 r->cons.tail = 5 r->cons.head = 5 r->prod.head = 10 r-prod.tail = 10 Assume CPU2 adds 2 elements to the ring and CPU3 removes 5 elements from the ring. (i) Example: The two loads are reversed. Program order violated. --------------------------------------------------------------------- cpu1(producer) cpu2(producer) cpu3(consumer) load r->cons.tail=5 ^ | | | load r->prod.head=10 | load r->cons.tail=5 | store r->prod.head(+n=2)=12 stalled <-- enqueue -----> | store r->prod.tail(+n)=12 | | | load r->cons.head = 5 | load r->prod.tail = 12 | store r->cons.head(+n=5)=10 | <...dequeue.....> | store r->cons.tail(+n)=10 | v load r->prod.head=12 Above shows that the value read from r->cons.tail to cons_tail in cpu1 is 5 which is outdated because r->cons.tail was later updated by cpu3. In this instance the state is... CPU1: cons_tail=5, old_head=12, free_entries=(32+5-12)=25 RING: r->cons.tail=10, r->prod.head=12 (ii) Example: ACQUIRE fence between the loads. Program order preserved. ----------------------------------------------------------------------- cpu1(producer) cpu2(producer) cpu3(consumer) | | | load r->prod.head=10 | load r->cons.tail=5 | store r->prod.head(+n=2)=12 | <-- enqueue -----> | store r->prod.tail(+n)=12 | | load r->prod.head=12 ^ | | load r->cons.head = 5 | load r->prod.tail = 12 stalled store r->cons.head(+n=5)=10 | <...dequeue.....> | ^ | | v | load r->cons.tail=5 stalled | | v store r->cons.tail(+n)= 10 Above shows that the value read from r->cons.tail to cons_tail in cpu1 is 5 which is outdated because r->cons.tail was later updated by cpu3. In this instance with the ACQUIRE fence the state is... CPU1: cons_tail=5, old_head=12, free_entries=(32+5-12)=25 RING: r->cons.tail=10, r->prod.head=12 Example ii even with an ACQUIRE fence to prevent reordering of the loads of old_head and cons_tail produces the same result for the free_entries variable, and the system ends up in the same state up to that point as example i. Same can be shown for the consumer threads as well. Therefore, the ACQUIRE fences do not solve the problem they intended to solve. If each thread is required to use the latest r->cons.head,r->cons.tail, r->prod.head, and r->prod.tail values in their computations of free_entries and entries, there should be a real-time ordering of all the threads. Establishing program order in a single thread only provides guarantees of sequential-consistency which does not include real-time ordering of all threads. (B) The values computed for free_entries, and entries are always safe without the ACQUIRE fence, because the resulting values are equal or less than the number of actual free solts or elements in the ring. This can be seen in both example i and ii, where free_entries computation ended up with 25, even when ring has 30 free_entries. Enqueueing 25 elements into a ring that has space for 30 is safe. Same can be shown for consumer threads as well. Therefore, the ACQUIRE fence is not required for safety reasons. Due to (A) and (B) above the ACQUIRE fence between the load of the head and load of the tail in both consumer and producer head updates can be removed to improve the performance of the ring library. Performance on Arm N1 Gain relative to generic implementation +-------------------------------------------------------------------+ | Bulk enq/dequeue count on size 8 (Arm N1) | +-------------------------------------------------------------------+ | Generic | C11 atomics | C11 atomics improved | +-------------------------------------------------------------------+ | Total count: 766730 | Total count: 651686 | Total count: 959621 | | | Gain: -15% | Gain: 25% | +-------------------------------------------------------------------+ +-------------------------------------------------------------------+ | Bulk enq/dequeue count on size 32 (Arm N1) | +-------------------------------------------------------------------+ | Generic | C11 atomics | C11 atomics improved | +-------------------------------------------------------------------+ | Total count: 816745 | Total count: 646385 | Total count: 766730 | | | Gain: -21% | Gain: -6% | +-------------------------------------------------------------------+ Performance on x86-64 Cascade Lake Gain relative to generic implementation +-------------------------------------------------------------------+ | Bulk enq/dequeue count on size 8 | +-------------------------------------------------------------------+ | Generic | C11 atomics | C11 atomics improved | +-------------------------------------------------------------------+ | Total count: 165042 | Total count: 168971 | Total count: 168089 | | | Gain: 2.3% | Gain: 1.8% | +-------------------------------------------------------------------+ +-------------------------------------------------------------------+ | Bulk enq/dequeue count on size 32 | +-------------------------------------------------------------------+ | Generic | C11 atomics | C11 atomics improved | +-------------------------------------------------------------------+ | Total count: 150884 | Total count: 153162 | Total count: 159661 | | | Gain: 1.5% | Gain: 5.8% | +-------------------------------------------------------------------+ Wathsala Vithanage (1): ring: improving performance with C11 .mailmap | 1 + lib/ring/rte_ring_c11_pvt.h | 35 ++++++++++++++++++++--------------- 2 files changed, 21 insertions(+), 15 deletions(-) -- 2.25.1