From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mga11.intel.com (mga11.intel.com [192.55.52.93]) by dpdk.org (Postfix) with ESMTP id E4E525F24 for ; Mon, 28 Jan 2019 22:04:47 +0100 (CET) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga008.fm.intel.com ([10.253.24.58]) by fmsmga102.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 28 Jan 2019 13:04:46 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.56,534,1539673200"; d="scan'208";a="120181649" Received: from fmsmsx106.amr.corp.intel.com ([10.18.124.204]) by fmsmga008.fm.intel.com with ESMTP; 28 Jan 2019 13:04:46 -0800 Received: from fmsmsx108.amr.corp.intel.com ([169.254.9.99]) by FMSMSX106.amr.corp.intel.com ([169.254.5.135]) with mapi id 14.03.0415.000; Mon, 28 Jan 2019 13:04:46 -0800 From: "Eads, Gage" To: Ola Liljedahl , "dev@dpdk.org" , Honnappa Nagarahalli , "Richardson, Bruce" CC: nd Thread-Topic: [RFC] lfring: lock-free ring buffer Thread-Index: AQHUtwUD27HnS9OItk2iqwDKLNhf3qXFCdhg Date: Mon, 28 Jan 2019 21:04:46 +0000 Message-ID: <9184057F7FC11744A2107296B6B8EB1E541CC413@FMSMSX108.amr.corp.intel.com> References: <1548678513-14348-1-git-send-email-ola.liljedahl@arm.com> In-Reply-To: <1548678513-14348-1-git-send-email-ola.liljedahl@arm.com> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-titus-metadata-40: eyJDYXRlZ29yeUxhYmVscyI6IiIsIk1ldGFkYXRhIjp7Im5zIjoiaHR0cDpcL1wvd3d3LnRpdHVzLmNvbVwvbnNcL0ludGVsMyIsImlkIjoiNGJjYjUxNTItMGU5ZC00NjM0LTkxODEtNjMzZmVhN2U1ODE2IiwicHJvcHMiOlt7Im4iOiJDVFBDbGFzc2lmaWNhdGlvbiIsInZhbHMiOlt7InZhbHVlIjoiQ1RQX05UIn1dfV19LCJTdWJqZWN0TGFiZWxzIjpbXSwiVE1DVmVyc2lvbiI6IjE3LjEwLjE4MDQuNDkiLCJUcnVzdGVkTGFiZWxIYXNoIjoiSW5ZM2pjVWw2WmFKRHd2NFJVQmpBejZ0dVMwcFFzK1R2ZmFKczRNVndKUHh5VEg3XC95NEFZT3hEZFRLbXJxZm8ifQ== x-ctpclassification: CTP_NT dlp-product: dlpe-windows dlp-version: 11.0.400.15 dlp-reaction: no-action x-originating-ip: [10.1.200.107] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Subject: Re: [dpdk-dev] [RFC] lfring: lock-free ring buffer X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 28 Jan 2019 21:04:48 -0000 Hey Ola, > -----Original Message----- > From: Ola Liljedahl [mailto:Ola.Liljedahl@arm.com] > Sent: Monday, January 28, 2019 6:29 AM > To: dev@dpdk.org; Eads, Gage ; Honnappa Nagarahalli > ; Richardson, Bruce > > Cc: nd ; Ola Liljedahl > Subject: [RFC] lfring: lock-free ring buffer >=20 > Lock-free MP/MC ring buffer with SP/SC options. > The ring buffer metadata only maintains one set of head and tail pointers= . Tail is > updated on enqueue, head is updated on dequeue. > The ring slots between head and tail always contains valid (unconsumed) s= lots. > Each ring slot consists of a struct of data pointer ("ptr") and ring inde= x ("idx"). > Slot.idx is only updated on enqueue with the new ring index. The slot ind= ex is > used to ensure that slots are written sequentially (important for FIFO or= dering). > It is also used to synchronise multiple producers. >=20 > MP enqueue scans the ring from the tail until head+size, looking for an e= mpty > slot as indicated by slot.idx equaling the ring index of the previous lap= (index - > ring size). The empty slot is written (.ptr =3D data, .idx =3D index) usi= ng CAS. If CAS > fails, some other thread wrote the slot and the thread continues with the= next > slot. >=20 > When all elements have been enqueued or if the ring buffer is full, the t= hread > updates the ring buffer tail pointer (unless this has not already been up= dated by > some other thread). Thus this thread will help other threads that have wr= itten > ring slots but not yet updated the tail pointer. >=20 > If a slot with an index different from current or previous lap is found, = it is > assumed that the thread has fallen behind (e.g. been preempted) and the l= ocal > tail pointer is (conditionally) updated from the ring buffer metadata in = order to > quickly catch up. >=20 > MP dequeue reads (up to) N slots from the head index and attempts to comm= it > the operation using CAS on the head pointer. >=20 > Enqueue N elements: N+1 CAS operations (but the last CAS operation might = not > be necessary during contention as producers will help each other) Dequeue= N > elements: 1 CAS operation >=20 > As the lock-free ring supports both MT-safe (MP/MC) and single-threaded > (SP/SC) operations, there is a question whether the modes should be chose= n > only when creating a ring buffer or if the application can mix MT-safe an= d single- > threaded enqueue (or dequeue) calls. To follow the pattern set by rte_rin= g, > specific functions for MP and SP enqueue and MC and SC dequeue are made > available. The wisdom of this ability can be questioned. The lock-free de= sign > allows threads to be blocked for long periods of time without interfering= with > the operation of the ring buffer but mixing (lock-free) MT-safe and singl= e- > threaded calls requires that there are on such threads that wake up at th= e wrong > moment (when a single-threaded operation is in progress). >=20 I see this as a user error. The rte_ring documentation is clear that ring_s= p_enqueue and ring_sc_dequeue functions are not MT-safe, and (if I'm readin= g this correctly) this scenario involves having 2+ threads executing an ope= ration in parallel. > 128-bit lock-free atomic compare exchange operations for AArch64 > (ARMv8.0) and x86-64 are provided in lockfree16.h. I expect these functio= ns to > find a different home and possibly a different API. > Note that a 'frail' version atomic_compare_exchange is implemented, this > means that in the case of failure, the old value returned is not guarante= ed to be > atomically read (as this is not possible on ARMv8.0 without also writing = to the > location). Atomically read values are often not necessary, it is a succes= sful > compare and exchange operation which provides atomicity. >=20 > Signed-off-by: Ola Liljedahl A few high-level thoughts: 1. The enqueue and dequeue functions are not compatible with the rte_ring's= bulk semantics, where either all pointers or none are operated on. This is= necessary to implement the ring API of course, but in a broader sense I th= ink it's necessary for usability in general. For example, how would a user = (or the ring mempool handler) distinguish between partial dequeues that occ= ur because the ring is empty vs. those that occur due to a race? dequeue_bu= lk_mc could return 0 even though the ring is full, if its first CAS fails a= nd the head has advanced beyond its local tail variable. The user could ret= ry dequeue, of course, but how many times until they feel confident that th= e ring is truly empty and give up? I don't think we can avoid reserving rin= g slots prior to enqueueing/dequeueing. 2. On the plus side, the enqueue function design that allows it to use 1/2 = the atomics of mine appears to be independent of reserving ring slots, and = should transfer over fairly cleanly. I'm a little concerned about the perfo= rmance variability it introduces (i.e. if the thread gets into "catch up" m= ode), particularly for larger rings, since real-time software values predic= tability. What if the reload criteria was instead something like: #define ENQ_RETRY_LIMIT 32 //arbitrary if (old.e.idx !=3D tail - size) { if (++fail_cnt < ENQ_RETRY_LIMIT) { tail++; } else { fail_cnt =3D 0; tail =3D rte_lfring_reload(...);=20 } goto restart; } fail_cnt =3D 0; 3. Using a zero-length array to mark the start of the ring is a nice approa= ch -- I'll incorporate that into the patchset. At an algorithm level, I don't see any other differences. Implementation-wi= se, we'll need to nail the memory ordering flags to best support weak consi= stency machines, as you pointed out elsewhere. Out of curiosity, is p64_lfring based on the nb ring patchset? I noticed it= used a different algorithm until pretty recently. Thanks, Gage