From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Eads, Gage"
To: Ola Liljedahl, Honnappa Nagarahalli, "Richardson, Bruce", dev@dpdk.org
CC: nd
Date: Wed, 30 Jan 2019 05:56:24 +0000
Message-ID: <33CD476B-C5B5-45E4-B8CE-8754735A4385@intel.com>
In-Reply-To: <9184057F7FC11744A2107296B6B8EB1E541CD01D@FMSMSX108.amr.corp.intel.com>
Subject: Re: [dpdk-dev] [RFC] lfring: lock-free ring buffer
List-Id: DPDK patches and discussions
> On Jan 29, 2019, at 11:17 PM, Eads, Gage wrote:
>
>> -----Original Message-----
>> From: Ola Liljedahl [mailto:Ola.Liljedahl@arm.com]
>> Sent: Monday, January 28, 2019 4:26 PM
>> To: Honnappa Nagarahalli; Richardson, Bruce; Eads, Gage; dev@dpdk.org
>> Cc: nd
>> Subject: Re: [RFC] lfring: lock-free ring buffer
>>
>>> On Mon, 2019-01-28 at 21:04 +0000, Eads, Gage wrote:
>>> Hey Ola,
>>>
>>>> -----Original Message-----
>>>> From: Ola Liljedahl [mailto:Ola.Liljedahl@arm.com]
>>>> Sent: Monday, January 28, 2019 6:29 AM
>>>> To: dev@dpdk.org; Eads, Gage; Honnappa Nagarahalli; Richardson, Bruce
>>>> Cc: nd; Ola Liljedahl
>>>> Subject: [RFC] lfring: lock-free ring buffer
>>>>
>>>> Lock-free MP/MC ring buffer with SP/SC options.
>>>> The ring buffer metadata only maintains one set of head and tail
>>>> pointers. Tail is updated on enqueue, head is updated on dequeue.
>>>> The ring slots between head and tail always contain valid
>>>> (unconsumed) elements.
>>>> Each ring slot consists of a struct of a data pointer ("ptr") and a
>>>> ring index ("idx"). Slot.idx is only updated on enqueue with the new
>>>> ring index. The slot index is used to ensure that slots are written
>>>> sequentially (important for FIFO ordering). It is also used to
>>>> synchronise multiple producers.
>>>>
>>>> MP enqueue scans the ring from the tail until head+size, looking for
>>>> an empty slot as indicated by slot.idx equalling the ring index of
>>>> the previous lap (index - ring size). The empty slot is written
>>>> (.ptr = data, .idx = index) using CAS. If the CAS fails, some other
>>>> thread wrote the slot and the thread continues with the next slot.
>>>>
>>>> When all elements have been enqueued or if the ring buffer is full,
>>>> the thread updates the ring buffer tail pointer (unless it has
>>>> already been updated by some other thread). Thus this thread will
>>>> help other threads that have written ring slots but not yet updated
>>>> the tail pointer.
>>>>
>>>> If a slot with an index different from the current or previous lap
>>>> is found, it is assumed that the thread has fallen behind (e.g.
>>>> been preempted) and the local tail pointer is (conditionally)
>>>> updated from the ring buffer metadata in order to quickly catch up.
>>>>
>>>> MP dequeue reads (up to) N slots from the head index and attempts
>>>> to commit the operation using CAS on the head pointer.
>>>>
>>>> Enqueue N elements: N+1 CAS operations (but the last CAS operation
>>>> might not be necessary during contention as producers will help
>>>> each other). Dequeue N elements: 1 CAS operation.
>>>>
>>>> As the lock-free ring supports both MT-safe (MP/MC) and
>>>> single-threaded (SP/SC) operations, there is a question whether the
>>>> modes should be chosen only when creating a ring buffer or if the
>>>> application can mix MT-safe and single-threaded enqueue (or
>>>> dequeue) calls. To follow the pattern set by rte_ring, specific
>>>> functions for MP and SP enqueue and MC and SC dequeue are made
>>>> available. The wisdom of this ability can be questioned. The
>>>> lock-free design allows threads to be blocked for long periods of
>>>> time without interfering with the operation of the ring buffer, but
>>>> mixing (lock-free) MT-safe and single-threaded calls requires that
>>>> there are no such threads that wake up at the wrong moment (when a
>>>> single-threaded operation is in progress).
>>>>
>>> I see this as a user error.
>>> The rte_ring documentation is clear that the ring_sp_enqueue and
>>> ring_sc_dequeue functions are not MT-safe, and (if I'm reading this
>>> correctly) this scenario involves having 2+ threads executing an
>>> operation in parallel.
>> So the individual MP/SP/MC/SC functions should also be directly
>> available to the user? This is what I have implemented here (but not
>> in my original PROGRESS64 implementation, where MP/SP/MC/SC mode is
>> fixed at creation). But I don't like it.
>>
>>>
>>>> 128-bit lock-free atomic compare exchange operations for AArch64
>>>> (ARMv8.0) and x86-64 are provided in lockfree16.h. I expect these
>>>> functions to find a different home and possibly a different API.
>>>> Note that a 'frail' version of atomic_compare_exchange is
>>>> implemented; this means that in the case of failure, the old value
>>>> returned is not guaranteed to be atomically read (as this is not
>>>> possible on ARMv8.0 without also writing to the location).
>>>> Atomically read values are often not necessary; it is a successful
>>>> compare-and-exchange operation which provides atomicity.
>>>>
>>>> Signed-off-by: Ola Liljedahl
>>> A few high-level thoughts:
>>> 1. The enqueue and dequeue functions are not compatible with the
>>> rte_ring's bulk semantics,
>> I wasn't considering 100% compatibility with other modules that
>> already use rte_ring; the code rather reflects the requirements of my
>> own use cases. If we want this, then we need to provide the same bulk
>> enqueue/dequeue behaviour as rte_ring.
>>
>>> where either all pointers or none are operated on. This is necessary
>>> to implement the ring API of course, but in a broader sense I think
>>> it's necessary for usability in general. For example, how would a
>>> user (or the ring mempool handler) distinguish between partial
>>> dequeues that occur because the ring is empty vs. those that occur
>>> due to a race?
>>> dequeue_bulk_mc could return 0 even though the ring is full, if its
>>> first CAS fails and the head has advanced beyond its local tail
>>> variable.
>> Tail could be re-read on a CAS(&head) failure. This is how I
>> originally coded the ring buffer enqueue and dequeue functions. But a
>> CAS(&head) failure doesn't by itself mean that *tail* has changed. So
>> I hoisted the read of tail before the CAS-loop, which made the code
>> slightly faster since we avoid unnecessarily reading a location which
>> might have been modified and thus could cause a cache miss. Maybe
>> this micro-optimisation isn't always a good idea (it could lead to
>> spurious queue full/empty failures). Another solution is to only
>> re-read tail if we see the queue as empty (or we can't acquire as
>> many slots as the user requested):
>>
>> uint64_t head = __atomic_load_n(&r->head, __ATOMIC_RELAXED);
>> uint64_t tail = __atomic_load_n(&r->tail, __ATOMIC_ACQUIRE);
>> do {
>>     actual = RTE_MIN((int64_t)(tail - head), (int64_t)nelems);
>>     if (unlikely(actual <= 0)) {
>>         /* Re-load tail only when queue looks empty */
>>         tail = __atomic_load_n(&r->tail, __ATOMIC_ACQUIRE);
>>         actual = RTE_MIN((int64_t)(tail - head), (int64_t)nelems);
>>         if (unlikely(actual <= 0))
>>             return 0;
>>     }
>>     rte_lfring_deq_elems(elems, r->ring, mask, head, actual);
>> } while (!__atomic_compare_exchange_n(&r->head,
>>                                       &head,
>>                                       head + actual,
>>                                       /*weak*/ false,
>>                                       __ATOMIC_RELEASE,
>>                                       __ATOMIC_RELAXED));
>>
>>> The user could retry dequeue, of course, but how many times until
>>> they feel confident that the ring is truly empty and give up? I
>>> don't think we can avoid reserving ring slots prior to
>>> enqueueing/dequeueing.
>> If tail is (re-)read inside the CAS-loop, there should be no spurious
>> queue empty failures and no need to repeatedly call the dequeue
>> function "just to be sure". The bulk dequeue/enqueue behaviour is a
>> separate issue.
>>
>>>
>>> 2.
>>> On the plus side, the enqueue function design that allows it to use
>>> half the atomics of mine appears to be independent of reserving ring
>>> slots, and should transfer over fairly cleanly. I'm a little
>>> concerned about the performance variability it introduces (i.e. if
>>> the thread gets into "catch up" mode),
>> If a thread has to catch up, it means it is using a stale (head/tail)
>> index (more than one ring lap behind). Better to try to load a fresh
>> value (if one is available) than to iterate over the ring until it
>> has caught up. So I think this is the better/faster design.
>>
>> Catch-up mode is not triggered by finding an occupied slot for the
>> current lap (one that was just written by some other thread). Or at
>> least this is the idea. If we find a freshly written slot, we just
>> move to the next slot.
>>
>>> particularly for larger rings, since real-time software values
>>> predictability. What if the reload criteria was instead something
>>> like:
>>>
>>> #define ENQ_RETRY_LIMIT 32 //arbitrary
>>>
>>> if (old.e.idx != tail - size) {
>>>     if (++fail_cnt < ENQ_RETRY_LIMIT) {
>>>         tail++;
>>>     } else {
>>>         fail_cnt = 0;
>>>         tail = rte_lfring_reload(...);
>>>     }
>>>     goto restart;
>>> }
>>> fail_cnt = 0;
>> There are three cases (the slot index must be between q->tail and
>> q->head + q->size):
>> slot.idx == tail - size: slot is free, try to write it
>> slot.idx == tail: slot has just been written (by another thread),
>> skip to the next slot (tail++)
>> none of the above: the thread is behind (at least one lap), re-load
>> tail from q->tail
>>
>> I think using the retry count actually delays catching up to a fresh
>> position.
>>
>
> Miscommunication here -- by "catch up", I meant the case where the
> thread is behind but by less than one lap (the second case you
> describe).
> In the worst case, the thread would have to read N-1 (N = ring size)
> entries before reaching the next available entry, and N could easily
> be in the thousands. That's the performance variability I was
> referring to, and why I suggested capping the failed slot reads at 32.
> Maintaining a local fail_cnt variable is a small price to pay
> (relative to a CAS) to prevent a ring enqueue latency spike.
>
> But you're right that we should still catch the 1+ lap behind case,
> so the reload criteria could be:
>
> #define ENQ_RETRY_LIMIT 32 //arbitrary
>
> if (old.e.idx != tail - size) {
>     if (++fail_cnt < ENQ_RETRY_LIMIT && old.e.idx == tail) {
>         tail++;
>     } else {
>         fail_cnt = 0;
>         tail = rte_lfring_reload(...);
>     }
>     goto restart;
> }
> fail_cnt = 0;
>

Scratch that. This proposal breaks lock-freedom.

>>>
>>> 3. Using a zero-length array to mark the start of the ring is a nice
>>> approach -- I'll incorporate that into the patchset.
>>>
>>> At an algorithm level, I don't see any other differences.
>>> Implementation-wise, we'll need to nail the memory ordering flags to
>>> best support weak consistency machines, as you pointed out elsewhere.
>> There is no pre-acquisition of slots in enqueue and dequeue. That
>> separate step makes lock-freedom impossible (I think).
>>
>
> Can you elaborate? I don't currently see any other way to support
> rte_ring bulk semantics, which is necessary for creating a
> non-blocking mempool handler, so we should clear up any doubt.
>
> In the NB ring patchset each thread reserves a number of slots before
> performing the enq/deq, but doesn't reserve *specific* slots (unlike
> rte_ring). This reservation is atomic, so that we never oversubscribe
> valid ring entries (for dequeue) or unused ring entries (for enqueue).
> This guarantees that the enq/deq operation will eventually complete,
> regardless of the behavior of other threads.
> This is why the enqueue loop doesn't check if space is available and
> the dequeue loop doesn't check if any more valid entries remain.
>
> This sort of reservation should be compatible with lfring, but
> requires changes (e.g. two sets of head/tail pointers).
>
>> With a merged implementation, I think the NON-BLOCKING/LOCK-FREE mode
>> must have separate implementations also for SP/SC. In
>> non-blocking/lock-free mode (whether MP/MC or SP/SC), only one set of
>> head/tail pointers is used. In blocking mode, two sets of head/tail
>> pointers are used. There isn't a set of SP/SC code that supports both
>> non-blocking/lock-free and blocking behaviours.
>>
>
> Sure, I don't see a problem with that. The NB ring patchset has
> separate NB SP/SC code from the blocking SP/SC code as well.
>
>>>
>>> Out of curiosity, is p64_lfring based on the nb ring patchset? I
>>> noticed it used a different algorithm until pretty recently.
>> I created the original lfring design (which isn't that different, I
>> think) back in November last year and it seemed to work, but I didn't
>> think it could guarantee FIFO order as elements were enqueued and
>> dequeued wherever a thread found a suitable slot, and in my use
>> cases, FIFO order is an important characteristic (need to maintain
>> ingress-to-egress packet order). Already then I was considering
>> adding the index to the ring slots to enforce FIFO enqueue/dequeue
>> order. I knew this would be possible on ARMv7a (which has 2x32-bit
>> CAS using exclusives) and ARMv8/AArch64 (which has 2x64-bit CAS using
>> exclusives). I had seen this design idea on the Internet before (e.g.
>> here: https://www.youtube.com/watch?v=HP2InVqgBFM, I actually stopped
>> viewing this presentation half-way because I wanted to iron out the
>> details myself), so it's not a novel idea (sorry).
>> However I didn't have time to do this change before the Christmas
>> holidays and just came back to work two weeks ago.
>> So I would rather call this simultaneous "invention" (or
>> publication). Difficult to prove this, of course...
>>
>>> Thanks,
>>> Gage
>>
>> --
>> Ola Liljedahl, Networking System Architect, Arm
>> Phone +46706866373, Skype ola.liljedahl
>