DPDK patches and discussions
 help / color / mirror / Atom feed
From: Ola Liljedahl <Ola.Liljedahl@arm.com>
To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>,
	"bruce.richardson@intel.com" <bruce.richardson@intel.com>,
	"gage.eads@intel.com" <gage.eads@intel.com>,
	"dev@dpdk.org" <dev@dpdk.org>
Cc: nd <nd@arm.com>
Subject: Re: [dpdk-dev] [RFC] lfring: lock-free ring buffer
Date: Wed, 30 Jan 2019 11:36:07 +0000	[thread overview]
Message-ID: <1548848178.2915.40.camel@arm.com> (raw)
In-Reply-To: <9184057F7FC11744A2107296B6B8EB1E541CD01D@FMSMSX108.amr.corp.intel.com>

On Wed, 2019-01-30 at 05:17 +0000, Eads, Gage wrote:
<SNIP>
> 
> > > 2. On the plus side, the enqueue function design that allows it to use
> > > 1/2 the atomics of mine appears to be independent of reserving ring
> > > slots, and should transfer over fairly cleanly. I'm a little concerned
> > > about the performance variability it introduces (i.e. if the thread
> > > gets into "catch up" mode),
> > If a thread has to catch up, it means it is using a stale (head/tail) index
> > (more
> > than one ring lap behaind). Better to try to load a fresh value (if one is
> > available) than to iterate over the ring until it has caught up. So I think
> > this is the
> > better/faster design.
> > 
> > Catch up mode is not triggered by finding an occupied slot for the current
> > lap
> > (that was just written by some other thread). Or at least this is the idea.
> > If we
> > find a freshly written slot, we just move to the next slot.
> > 
> > > 
> > > particularly for larger rings, since real-time software values
> > > predictability.
> > > What if the reload criteria was instead something like:
> > > 
> > > #define ENQ_RETRY_LIMIT 32 //arbitrary
> > > 
> > > if (old.e.idx != tail - size) {
> > >     if (++fail_cnt < ENQ_RETRY_LIMIT) {
> > >         tail++;
> > >     } else {
> > >         fail_cnt = 0;
> > >         tail = rte_lfring_reload(...);
> > >     }
> > >     goto restart;
> > > }
> > > fail_cnt = 0;
> > There are three cases (slot index must be between q->tail and q->head + q-
> > > 
> > > size):
> > slot.idx == tail - size: slot is free, try to write it slot.idx == tail:
> > slot has just been
> > written (by other thread), skip to next slot (tail++) none of the above:
> > thread is
> > behind (at least one lap), re-load tail from q-
> > > 
> > > tail
> > I think using the retry count actually delays catching up to a fresh
> > position.
> > 
> Miscommunication here -- by "catch up", I meant the case where the thread is
> behind but by less than one lap (the second case you describe). In the worst
> case, the thread would have to read N-1 (N = ring size) entries before
> reaching the next available entry, and N could easily be in the thousands.
> That's the performance variability I was referring to, and why I suggested
> capping the failed slot reads at 32. Maintaining a local fail_cnt variable is
> a small price to pay (relative to a CAS) to prevent a ring enqueue latency
> spike.
OK, I misunderstood. We can reload the private tail from the shared ring buffer
tail earlier. You could actually do this on every failed CAS but I think that
would be overkill. Your design with a retry limit is better, we need to find out
what is a suitable value for the retry limit.

> 
> But you're right that we should still catch the 1+ lap behind case, so the
> reload criteria could be:
> 
> #define ENQ_RETRY_LIMIT 32 //arbitrary
> 
> if (old.e.idx != tail - size) {
>     if (++fail_cnt < ENQ_RETRY_LIMIT && old.e.idx == tail) {
>         tail++;
>     } else {
>         fail_cnt = 0;
>         tail = rte_lfring_reload(...);
>     }
>     goto restart;
> }
> fail_cnt = 0;
Agree.

> 
> > 
> > > 
> > > 
> > > 3. Using a zero-length array to mark the start of the ring is a nice
> > > approach
> > > -- I'll incorporate that into the patchset.
> > > 
> > > At an algorithm level, I don't see any other differences.
> > > Implementation-wise, we'll need to nail the memory ordering flags to
> > > best support weak consistency machines, as you pointed out elsewhere.
> > There is no pre-acquisition of slots in enqueue and dequeue. That separate
> > step
> > makes lock-freedom impossible (I think).
> > 
> Can you elaborate?
I think lock-freedom is impossible with the initial acquisition of elements.
This acquisition creates a side effect that cannot be undone or helped by other
threads.

You can still implement a "non-blocking" ring buffer (like your original design) which hides any delay in threads that access the ring buffer but isn't properly lock-free (which could be considered unnecessary in a DPDK environment, threads may get delayed but shouldn't die or block forever).

>  I don't currently see any other way to support rte_ring bulk semantics, which
> is necessary for creating a non-blocking mempool handler, so we should clear
> up any doubt.
OK, I wasn't aware of this strict dependency on bulk enqueue and dequeue. Bulk
dequeue (mempool allocates buffers from the ring) should be easy to support
though. Bulk enqueue (mempool frees buffers to the ring) will work as long as
the ring is large enough to hold all free buffers so it can never become full
(need to reload head/tail pointers at the appropriate places to avoid spurious
full/empty status). I assume this is the case, initially all free buffers will
be stored in the ring?

> 
> In the NB ring patchset each thread reserves a number of slots before
> performing the enq/deq, but doesn't reserve *specific* slots (unlike
> rte_ring). This reservation is atomic, so that we never over-subscribe valid
> ring entries (for dequeue) or unused ring entries (for enqueue). This
> guarantees that the enq/deq operation will eventually complete, regardless of
> the behavior of other threads. This is why the enqueue loop doesn't check if
> space is available and the dequeue loop doesn't check if any more valid
> entries remain.
I initially thought these acquisitions were just for compatibility with rte_ring
(update the same metadata so that the user can mix MP/MC and SP/SC calls) but
realise they are there for the bulk enqueue/dequeue. But doing this acquisition
means each enqueue or dequeue will update two metadata variables so creates more
contention and less scalability. I think it would be good if we could provide
the bulk behaviour *without* this initial acquisition and only update one
metadata location per enqueue/dequeue operation.

> 
> This sort of reservation should be compatible with lfring, but requires
> changes (e.g. two sets of head/tail pointers).
> 
<SNIP>
> 
-- 
Ola Liljedahl, Networking System Architect, Arm
Phone +46706866373, Skype ola.liljedahl


  parent reply	other threads:[~2019-01-30 11:36 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-01-28 12:28 Ola Liljedahl
2019-01-28 21:04 ` Eads, Gage
2019-01-28 22:26   ` Ola Liljedahl
2019-01-30  5:17     ` Eads, Gage
2019-01-30  5:56       ` Eads, Gage
2019-01-30 11:36       ` Ola Liljedahl [this message]
2019-02-01 15:40         ` Eads, Gage
2019-02-01 18:19           ` Ola Liljedahl

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1548848178.2915.40.camel@arm.com \
    --to=ola.liljedahl@arm.com \
    --cc=Honnappa.Nagarahalli@arm.com \
    --cc=bruce.richardson@intel.com \
    --cc=dev@dpdk.org \
    --cc=gage.eads@intel.com \
    --cc=nd@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).