From: Ola Liljedahl <Ola.Liljedahl@arm.com>
To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>,
	"bruce.richardson@intel.com" <bruce.richardson@intel.com>,
	"gage.eads@intel.com" <gage.eads@intel.com>,
	"dev@dpdk.org" <dev@dpdk.org>
Cc: nd <nd@arm.com>
Subject: Re: [dpdk-dev] [RFC] lfring: lock-free ring buffer
Date: Fri, 1 Feb 2019 18:19:58 +0000	[thread overview]
Message-ID: <1549045210.20325.15.camel@arm.com> (raw)
In-Reply-To: <9184057F7FC11744A2107296B6B8EB1E541CDFAB@FMSMSX108.amr.corp.intel.com>

On Fri, 2019-02-01 at 15:40 +0000, Eads, Gage wrote:
> 
> > 
> > -----Original Message-----
> > From: Ola Liljedahl [mailto:Ola.Liljedahl@arm.com]
> > Sent: Wednesday, January 30, 2019 5:36 AM
> > To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Richardson,
> > Bruce <bruce.richardson@intel.com>; Eads, Gage <gage.eads@intel.com>;
> > dev@dpdk.org
> > Cc: nd <nd@arm.com>
> > Subject: Re: [RFC] lfring: lock-free ring buffer
> > 
> > On Wed, 2019-01-30 at 05:17 +0000, Eads, Gage wrote:
> > <SNIP>
> > > 
> > > 
> > > > 
> > > > > 
> > > > > 2. On the plus side, the enqueue function design that allows it to
> > > > > use 1/2 the atomics of mine appears to be independent of reserving
> > > > > ring slots, and should transfer over fairly cleanly. I'm a little
> > > > > concerned about the performance variability it introduces (i.e. if
> > > > > the thread gets into "catch up" mode),
> > > > If a thread has to catch up, it means it is using a stale
> > > > (head/tail) index (more than one ring lap behind). Better to try to
> > > > load a fresh value (if one is available) than to iterate over the
> > > > ring until it has caught up. So I think this is the better/faster
> > > > design.
> > > > 
> > > > Catch-up mode is not triggered by finding an occupied slot for the
> > > > current lap (one that was just written by some other thread). Or at
> > > > least this is the idea. If we find a freshly written slot, we just
> > > > move to the next slot.
> > > > 
> > > > > 
> > > > > 
> > > > > particularly for larger rings, since real-time software values
> > > > > predictability.
> > > > > What if the reload criteria was instead something like:
> > > > > 
> > > > > #define ENQ_RETRY_LIMIT 32 //arbitrary
> > > > > 
> > > > > if (old.e.idx != tail - size) {
> > > > >     if (++fail_cnt < ENQ_RETRY_LIMIT) {
> > > > >         tail++;
> > > > >     } else {
> > > > >         fail_cnt = 0;
> > > > >         tail = rte_lfring_reload(...);
> > > > >     }
> > > > >     goto restart;
> > > > > }
> > > > > fail_cnt = 0;
> > > > There are three cases (slot index must be between q->tail and
> > > > q->head + q->size):
> > > > slot.idx == tail - size: slot is free, try to write it
> > > > slot.idx == tail: slot has just been written (by other thread),
> > > > skip to next slot (tail++)
> > > > none of the above: thread is behind (at least one lap), re-load
> > > > tail from q->tail
> > > > I think using the retry count actually delays catching up to a fresh
> > > > position.
> > > > 
> > > Miscommunication here -- by "catch up", I meant the case where the
> > > thread is behind but by less than one lap (the second case you
> > > describe). In the worst case, the thread would have to read N-1 (N =
> > > ring size) entries before reaching the next available entry, and N
> > > could easily be in the thousands.
> > > 
> > > That's the performance variability I was referring to, and why I
> > > suggested capping the failed slot reads at 32. Maintaining a local
> > > fail_cnt variable is a small price to pay (relative to a CAS) to
> > > prevent a ring enqueue latency spike.
> > OK, I misunderstood. We can reload the private tail from the shared
> > ring buffer tail earlier. You could actually do this on every failed
> > CAS, but I think that would be overkill. Your design with a retry
> > limit is better; we need to find out what a suitable value for the
> > retry limit is.
> > 
> Too late for lfring v2 unfortunately, but this idea will actually break lock-
> freedom. For example, say the tail index is 33 slots behind the next available
> slot. An enqueueing thread would get stuck retrying slots from tail index ->
> tail index + 32 until another thread updates the tail index. 
I don't think this is true.

If we hit the retry limit, we do: tail = _rte_lfring_reload(&r->tail, tail);
where we obtain a fresh copy of the shared ring->tail value but only use this if
it is "larger" (using serial number arithmetic) than our private tail variable.
If ring->tail hasn't been updated yet, we continue to use our private tail but
increment it to the next slot. We never go backwards.

static __rte_always_inline uint64_t
_rte_lfring_reload(const uint64_t *loc, uint64_t idx)
{
        uint64_t fresh = __atomic_load_n(loc, __ATOMIC_RELAXED);
        if (_rte_lfring_before(idx, fresh)) {
                /* fresh is after idx, use this instead */
                idx = fresh;
        } else {
                /* Continue with next slot */
                idx++;
        }
        return idx;
}
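
The _rte_lfring_before() helper is not shown above; it is just serial number
arithmetic on the 64-bit indices. A minimal sketch (my paraphrase, the exact
lfring code may differ slightly):

static __rte_always_inline bool
_rte_lfring_before(uint64_t a, uint64_t b)
{
        /* "a is before b" when the signed difference is negative; this
         * wraps correctly even if the 64-bit indices overflow. */
        return (int64_t)(a - b) < 0;
}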
> 
> > 
> > > 
> > > 
> > > But you're right that we should still catch the 1+ lap behind case, so
> > > the reload criteria could be:
> > > 
> > > #define ENQ_RETRY_LIMIT 32 //arbitrary
> > > 
> > > if (old.e.idx != tail - size) {
> > >     if (++fail_cnt < ENQ_RETRY_LIMIT && old.e.idx == tail) {
> > >         tail++;
> > >     } else {
> > >         fail_cnt = 0;
> > >         tail = rte_lfring_reload(...);
> > >     }
> > >     goto restart;
> > > }
> > > fail_cnt = 0;
> > Agree.
> > 
> > > > > 3. Using a zero-length array to mark the start of the ring is a
> > > > > nice approach -- I'll incorporate that into the patchset.
> > > > > 
> > > > > At an algorithm level, I don't see any other differences.
> > > > > Implementation-wise, we'll need to nail the memory ordering flags
> > > > > to best support weak consistency machines, as you pointed out
> > > > > elsewhere.
> > > > There is no pre-acquisition of slots in enqueue and dequeue. That
> > > > separate step makes lock-freedom impossible (I think).
> > > > 
> > > Can you elaborate?
> > I think lock-freedom is impossible with the initial acquisition of
> > elements. This acquisition creates a side effect that cannot be undone
> > or helped by other threads.
> > 
> > You can still implement a "non-blocking" ring buffer (like your
> > original design) which hides any delay in threads that access the ring
> > buffer but isn't properly lock-free (which could be considered
> > unnecessary in a DPDK environment; threads may get delayed but
> > shouldn't die or block forever).
> > 
> I think I see what you're saying. For example, if a dequeueing thread reserves
> N elements and then is preempted, it has effectively reduced the number of
> available elements by N during that period.
> 
> Practically speaking, I don't think this is a problem. Not just because
> threads shouldn't die or block forever, but if a thread can be preempted after
> reserving N ring slots (but before enqueueing to or dequeueing from them), the
> net effect is those N buffers are taken out of the system. This isn't really
> different than if that thread was preempted *outside* of the ring functions
> while holding N buffers -- the application should be designed to be resilient
> to that.
Yes, practically speaking, this behaviour is unlikely to pose a problem for DPDK
applications. But updating another piece of metadata will decrease scalability
(unless each of prod/cons head/tail is in a separate cache line, but that's a
lot of mostly empty cache lines).
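
Just to illustrate what that layout would look like (a sketch only, the names
are made up, using DPDK's __rte_cache_aligned from rte_common.h):

struct ring_metadata {
        /* One cache line per index to avoid false sharing between
         * producers and consumers: four cache lines (typically 256 B)
         * holding only 32 bytes of useful data. */
        uint64_t prod_head __rte_cache_aligned;
        uint64_t prod_tail __rte_cache_aligned;
        uint64_t cons_head __rte_cache_aligned;
        uint64_t cons_tail __rte_cache_aligned;
};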

If we only need the non-blocking behaviour in DPDK (to handle variable latency
and unexpected delays from threads that access the ring buffer or any data
structure that uses the ring buffer), there are other ways to achieve that
which will be more efficient than a lock-free ring like this one or your nblk
ring, both of which need to perform a CAS per element plus some extra CAS
operations.
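
For reference, a rough sketch of the per-element CAS I am referring to,
combining the three-case check and the retry limit discussed above (names,
types and memory orderings are illustrative, not the actual lfring v2 code):

struct lfring_elem {
        void *ptr;      /* user pointer */
        uint64_t idx;   /* tail index of the lap that wrote this slot */
};

#define ENQ_RETRY_LIMIT 32      /* arbitrary, as discussed above */

static inline uint64_t
lfring_enqueue_one(struct lfring_elem *ring, uint64_t size,
                   uint64_t *shared_tail, uint64_t tail, void *obj)
{
        uint64_t mask = size - 1;       /* size assumed to be a power of two */
        uint32_t fail_cnt = 0;

        for (;;) {
                /* Non-atomic read shown for brevity; a torn read is caught
                 * by the 128-bit CAS below failing on the stale value. */
                struct lfring_elem old = ring[tail & mask];
                struct lfring_elem neu;

                if (old.idx == tail - size) {
                        /* Slot is free for this lap, try to claim it.
                         * Requires a 128-bit compare-exchange (e.g. -mcx16
                         * on x86, CASP on AArch64 with LSE). */
                        neu.ptr = obj;
                        neu.idx = tail;
                        if (__atomic_compare_exchange(&ring[tail & mask],
                                                      &old, &neu, false,
                                                      __ATOMIC_RELEASE,
                                                      __ATOMIC_RELAXED))
                                return tail;    /* caller updates shared tail */
                        /* Lost the race, re-examine the same slot. */
                } else if (old.idx == tail && ++fail_cnt < ENQ_RETRY_LIMIT) {
                        /* Slot freshly written by another thread, skip it. */
                        tail++;
                } else {
                        /* Behind by a lap, or retry limit hit: reload. */
                        fail_cnt = 0;
                        tail = _rte_lfring_reload(shared_tail, tail);
                }
        }
}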

> 
> > 
> > > 
> > >  I don't currently see any other way to support rte_ring bulk
> > > semantics, which is necessary for creating a non-blocking mempool
> > > handler, so we should clear up any doubt.
> > OK, I wasn't aware of this strict dependency on bulk enqueue and
> > dequeue. Bulk dequeue (mempool allocates buffers from the ring) should
> > be easy to support though. Bulk enqueue (mempool frees buffers to the
> > ring) will work as long as the ring is large enough to hold all free
> > buffers so it can never become full (we need to reload the head/tail
> > pointers at the appropriate places to avoid spurious full/empty
> > status). I assume this is the case: initially, all free buffers will
> > be stored in the ring?
> > 
> > > 
> > > 
> > > In the NB ring patchset each thread reserves a number of slots before
> > > performing the enq/deq, but doesn't reserve *specific* slots (unlike
> > > rte_ring). This reservation is atomic, so that we never over-subscribe
> > > valid ring entries (for dequeue) or unused ring entries (for enqueue).
> > > This guarantees that the enq/deq operation will eventually complete,
> > > regardless of the behavior of other threads. This is why the enqueue
> > > loop doesn't check if space is available and the dequeue loop doesn't
> > > check if any more valid entries remain.
> > I initially thought these acquisitions were just for compatibility with
> > rte_ring (update the same metadata so that the user can mix MP/MC and
> > SP/SC calls) but realise they are there for the bulk enqueue/dequeue.
> > But doing this acquisition means each enqueue or dequeue will update
> > two metadata variables, which creates more contention and less
> > scalability. I think it would be good if we could provide the bulk
> > behaviour *without* this initial acquisition and only update one
> > metadata location per enqueue/dequeue operation.
> > 
> Agreed. I see lfring v2 has avoided this on the dequeue side, but I think it's
> unavoidable on the enqueue side -- since we can't write ring entries and CAS
> the tail pointer in one big atomic operation. AFAICT, slot reservations are
> unavoidable there to achieve bulk semantics.
For enqueue, yes, unless you trust the ring to be large enough to hold all
elements that might be enqueued. Isn't this the case for a mempool? All
buffers will originally be on the ring?
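
(Just as a data point on sizing, assuming the intent above is right: if every
buffer starts out on the ring, something like the following would guarantee
that a free/enqueue can never fail for lack of space.)

        /* Power-of-two ring large enough to hold all buffers at once,
         * plus the one-slot slack a classic rte_ring needs. */
        uint32_t ring_size = rte_align32pow2(num_bufs + 1);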

> 
> > 
> > > 
> > > 
> > > This sort of reservation should be compatible with lfring, but
> > > requires changes (e.g. two sets of head/tail pointers).
> > > 
> > <SNIP>
> > > 
> > > 
-- 
Ola Liljedahl, Networking System Architect, Arm
Phone +46706866373, Skype ola.liljedahl


