From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gage.eads@intel.com>
Received: from mga11.intel.com (mga11.intel.com [192.55.52.93])
 by dpdk.org (Postfix) with ESMTP id E4E525F24
 for <dev@dpdk.org>; Mon, 28 Jan 2019 22:04:47 +0100 (CET)
X-Amp-Result: SKIPPED(no attachment in message)
X-Amp-File-Uploaded: False
Received: from fmsmga008.fm.intel.com ([10.253.24.58])
 by fmsmga102.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384;
 28 Jan 2019 13:04:46 -0800
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.56,534,1539673200"; d="scan'208";a="120181649"
Received: from fmsmsx106.amr.corp.intel.com ([10.18.124.204])
 by fmsmga008.fm.intel.com with ESMTP; 28 Jan 2019 13:04:46 -0800
Received: from fmsmsx108.amr.corp.intel.com ([169.254.9.99]) by
 FMSMSX106.amr.corp.intel.com ([169.254.5.135]) with mapi id 14.03.0415.000;
 Mon, 28 Jan 2019 13:04:46 -0800
From: "Eads, Gage" <gage.eads@intel.com>
To: Ola Liljedahl <Ola.Liljedahl@arm.com>, "dev@dpdk.org" <dev@dpdk.org>,
 Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>, "Richardson, Bruce"
 <bruce.richardson@intel.com>
CC: nd <nd@arm.com>
Thread-Topic: [RFC] lfring: lock-free ring buffer
Thread-Index: AQHUtwUD27HnS9OItk2iqwDKLNhf3qXFCdhg
Date: Mon, 28 Jan 2019 21:04:46 +0000
Message-ID: <9184057F7FC11744A2107296B6B8EB1E541CC413@FMSMSX108.amr.corp.intel.com>
References: <1548678513-14348-1-git-send-email-ola.liljedahl@arm.com>
In-Reply-To: <1548678513-14348-1-git-send-email-ola.liljedahl@arm.com>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-titus-metadata-40: eyJDYXRlZ29yeUxhYmVscyI6IiIsIk1ldGFkYXRhIjp7Im5zIjoiaHR0cDpcL1wvd3d3LnRpdHVzLmNvbVwvbnNcL0ludGVsMyIsImlkIjoiNGJjYjUxNTItMGU5ZC00NjM0LTkxODEtNjMzZmVhN2U1ODE2IiwicHJvcHMiOlt7Im4iOiJDVFBDbGFzc2lmaWNhdGlvbiIsInZhbHMiOlt7InZhbHVlIjoiQ1RQX05UIn1dfV19LCJTdWJqZWN0TGFiZWxzIjpbXSwiVE1DVmVyc2lvbiI6IjE3LjEwLjE4MDQuNDkiLCJUcnVzdGVkTGFiZWxIYXNoIjoiSW5ZM2pjVWw2WmFKRHd2NFJVQmpBejZ0dVMwcFFzK1R2ZmFKczRNVndKUHh5VEg3XC95NEFZT3hEZFRLbXJxZm8ifQ==
x-ctpclassification: CTP_NT
dlp-product: dlpe-windows
dlp-version: 11.0.400.15
dlp-reaction: no-action
x-originating-ip: [10.1.200.107]
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Subject: Re: [dpdk-dev] [RFC] lfring: lock-free ring buffer
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Mon, 28 Jan 2019 21:04:48 -0000

Hey Ola,

> -----Original Message-----
> From: Ola Liljedahl [mailto:Ola.Liljedahl@arm.com]
> Sent: Monday, January 28, 2019 6:29 AM
> To: dev@dpdk.org; Eads, Gage <gage.eads@intel.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: nd <nd@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>
> Subject: [RFC] lfring: lock-free ring buffer
>=20
> Lock-free MP/MC ring buffer with SP/SC options.
> The ring buffer metadata only maintains one set of head and tail pointers=
. Tail is
> updated on enqueue, head is updated on dequeue.
> The ring slots between head and tail always contains valid (unconsumed) s=
lots.
> Each ring slot consists of a struct of data pointer ("ptr") and ring inde=
x ("idx").
> Slot.idx is only updated on enqueue with the new ring index. The slot ind=
ex is
> used to ensure that slots are written sequentially (important for FIFO or=
dering).
> It is also used to synchronise multiple producers.
>=20
> MP enqueue scans the ring from the tail until head+size, looking for an e=
mpty
> slot as indicated by slot.idx equaling the ring index of the previous lap=
 (index -
> ring size). The empty slot is written (.ptr =3D data, .idx =3D index) usi=
ng CAS. If CAS
> fails, some other thread wrote the slot and the thread continues with the=
 next
> slot.
>=20
> When all elements have been enqueued or if the ring buffer is full, the t=
hread
> updates the ring buffer tail pointer (unless this has not already been up=
dated by
> some other thread). Thus this thread will help other threads that have wr=
itten
> ring slots but not yet updated the tail pointer.
>=20
> If a slot with an index different from current or previous lap is found, =
it is
> assumed that the thread has fallen behind (e.g. been preempted) and the l=
ocal
> tail pointer is (conditionally) updated from the ring buffer metadata in =
order to
> quickly catch up.
>=20
> MP dequeue reads (up to) N slots from the head index and attempts to comm=
it
> the operation using CAS on the head pointer.
>=20
> Enqueue N elements: N+1 CAS operations (but the last CAS operation might =
not
> be necessary during contention as producers will help each other) Dequeue=
 N
> elements: 1 CAS operation
>=20
> As the lock-free ring supports both MT-safe (MP/MC) and single-threaded
> (SP/SC) operations, there is a question whether the modes should be chose=
n
> only when creating a ring buffer or if the application can mix MT-safe an=
d single-
> threaded enqueue (or dequeue) calls. To follow the pattern set by rte_rin=
g,
> specific functions for MP and SP enqueue and MC and SC dequeue are made
> available. The wisdom of this ability can be questioned. The lock-free de=
sign
> allows threads to be blocked for long periods of time without interfering=
 with
> the operation of the ring buffer but mixing (lock-free) MT-safe and singl=
e-
> threaded calls requires that there are on such threads that wake up at th=
e wrong
> moment (when a single-threaded operation is in progress).
>=20

I see this as a user error. The rte_ring documentation is clear that ring_s=
p_enqueue and ring_sc_dequeue functions are not MT-safe, and (if I'm readin=
g this correctly) this scenario involves having 2+ threads executing an ope=
ration in parallel.

> 128-bit lock-free atomic compare exchange operations for AArch64
> (ARMv8.0) and x86-64 are provided in lockfree16.h. I expect these functio=
ns to
> find a different home and possibly a different API.
> Note that a 'frail' version atomic_compare_exchange is implemented, this
> means that in the case of failure, the old value returned is not guarante=
ed to be
> atomically read (as this is not possible on ARMv8.0 without also writing =
to the
> location). Atomically read values are often not necessary, it is a succes=
sful
> compare and exchange operation which provides atomicity.
>=20
> Signed-off-by: Ola Liljedahl <ola.liljedahl@arm.com>

A few high-level thoughts:
1. The enqueue and dequeue functions are not compatible with the rte_ring's=
 bulk semantics, where either all pointers or none are operated on. This is=
 necessary to implement the ring API of course, but in a broader sense I th=
ink it's necessary for usability in general. For example, how would a user =
(or the ring mempool handler) distinguish between partial dequeues that occ=
ur because the ring is empty vs. those that occur due to a race? dequeue_bu=
lk_mc could return 0 even though the ring is full, if its first CAS fails a=
nd the head has advanced beyond its local tail variable. The user could ret=
ry dequeue, of course, but how many times until they feel confident that th=
e ring is truly empty and give up? I don't think we can avoid reserving rin=
g slots prior to enqueueing/dequeueing.

2. On the plus side, the enqueue function design that allows it to use 1/2 =
the atomics of mine appears to be independent of reserving ring slots, and =
should transfer over fairly cleanly. I'm a little concerned about the perfo=
rmance variability it introduces (i.e. if the thread gets into "catch up" m=
ode), particularly for larger rings, since real-time software values predic=
tability. What if the reload criteria was instead something like:

#define ENQ_RETRY_LIMIT 32 //arbitrary

if (old.e.idx !=3D tail - size) {
    if (++fail_cnt < ENQ_RETRY_LIMIT) {
        tail++;
    } else {
        fail_cnt =3D 0;
        tail =3D rte_lfring_reload(...);=20
    }
    goto restart;
}
fail_cnt =3D 0;

3. Using a zero-length array to mark the start of the ring is a nice approa=
ch -- I'll incorporate that into the patchset.

At an algorithm level, I don't see any other differences. Implementation-wi=
se, we'll need to nail the memory ordering flags to best support weak consi=
stency machines, as you pointed out elsewhere.

Out of curiosity, is p64_lfring based on the nb ring patchset? I noticed it=
 used a different algorithm until pretty recently.

Thanks,
Gage

</snip>