From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
To: Ola Liljedahl <Ola.Liljedahl@arm.com>,
Stephen Hemminger <stephen@networkplumber.org>
Cc: Wathsala Vithanage <wathsala.vithanage@arm.com>,
Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>,
Tyler Retzlaff <roretzla@linux.microsoft.com>,
"dev@dpdk.org" <dev@dpdk.org>,
"vattunuru@marvell.com" <vattunuru@marvell.com>, nd <nd@arm.com>
Subject: RE: [PATCH 1/1] eal: correct memory ordering in MCS lock
Date: Thu, 6 Nov 2025 07:48:13 +0000
Message-ID: <97ceab55634246a8a3b778880ca0810d@huawei.com>
In-Reply-To: <1CB25A39-722A-47C0-B531-5A7A52C2F80C@arm.com>
> > > > > >> MCS lock is broken, it's just a matter of time before it runs into a deadlock.
> > > > > >>
> > > > > >> drivers/dma/cnxk is a user of MCS lock.
> > > > > > I am surprised that a driver would use mcslock.
> > > > > > MCS lock is targeted at the case of a large number of CPUs with lots of
> > > > > > contention.
> > > > > > It will likely be slower than spinlock or ticketlock for a driver's use case.
> > > > > It appears in drivers/dma/cnxk/cnxk_dmadev_fp.c, perhaps the
> > > > > maintainer can clarify.
> > > > >
> > > >
> > > > If MCS lock is really broken, it needs to be fixed anyway.
> > > > It might be used by third-party libs/apps that build on DPDK.
> > >
> > > 100% agree it must be fixed.
> > > It would be good to have a test or a static analysis tool that could validate all
> > > the lock types.
> >
> >
> > +1
> > Another option would be to have a sort of stress test for all the locking types we
> > have in our UT.
> > At least for ring I found it very useful.
> A stress test only tests the specific hardware and compiler you are using. E.g. the
> LDAPR problem in rte_ring was hidden for years because compilers (GCC) did not
> support generating LDAPR instead of LDAR for load-acquire. Arm processors have
> implemented LDAPR since Neoverse N1, which was released in 2019. To compare, it was
> only in 2023, with GCC 13, that support for FEAT_RCPC (i.e. the LDAPR instruction)
> was added, and you also must specify +rcpc for -march or specify a CPU that supports
> FEAT_RCPC; building with defaults won't generate LDAPR and the bug would remain
> hidden. A stress test that is run on the target every time DPDK is rebuilt for some
> target would cover some of that.
> A stress test (like any test) is also not guaranteed to provoke all code paths and
> thread interleavings, so there are still no guarantees.
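Good point about FEAT_RCPC. To make the codegen dependency concrete, here is a
minimal sketch (the -march value below is just one example of a target with
FEAT_RCPC):

#include <stdatomic.h>

/* Per the above: with GCC 13+ and e.g. -march=armv8.2-a+rcpc this
 * load-acquire may be compiled to LDAPR; with default flags it becomes
 * LDAR. A stress test of a binary built with default flags never
 * exercises the LDAPR behaviour, which is how the rte_ring issue
 * stayed hidden for years.
 */
int
load_acquire(const _Atomic int *p)
{
    return atomic_load_explicit(p, memory_order_acquire);
}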
Regarding the code coverage - these days there are several tools that can
collect such data for you.
I understand that a passing stress test doesn't guarantee that your code is completely bug free,
and of course it will only be able to catch bugs on the platform you run it on.
But it is still way better than nothing.
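To give an idea, a rough sketch of such a stress test for rte_mcslock (plain
pthreads just for brevity; a real UT would use rte_eal_remote_launch() and our
existing test harness, and NTHREADS/ITERS are just example values):

#include <pthread.h>
#include <stdint.h>
#include <rte_mcslock.h>

#define NTHREADS 4
#define ITERS 1000000

static rte_mcslock_t *lock; /* tail pointer, NULL means unlocked */
static uint64_t counter;    /* plain variable, protected only by the lock */

static void *
worker(void *arg)
{
    rte_mcslock_t node; /* per-thread queue node */

    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        rte_mcslock_lock(&lock, &node);
        counter++; /* lost updates here mean the lock is broken */
        rte_mcslock_unlock(&lock, &node);
    }
    return NULL;
}

int
main(void)
{
    pthread_t t[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    /* every increment was done under the lock, so none may be lost */
    return counter == (uint64_t)NTHREADS * ITERS ? 0 : 1;
}

Of course, as you say, this only exercises the hardware/compiler it runs on.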
> I claim that it is important to verify that the code follows the C/C++ memory model
> and implements all the necessary happens-before relations (synchronize-with edges).
> Then it will work on all hardware and compilers, now and in the future, that
> implement the C/C++ memory model (and Java, which is quite similar AFAIK, all
> atomics being seq_cst).
>
> One such verifier exists in Progress64. It verifies the actual implementation by
> executing all possible interleavings (permutations) of two threads running a test
> program and ensures all required synchronize-with edges exist (as implemented by
> load-acquire/store-release). Lack of synchronize-with between two accesses to the
> same location by different threads means a data race has been found.
>
> As an example, make the load-acquire in the MCS unlock function a load-relaxed
> instead (as it was before the patch) and run the verifier:
> $ ./verify -am mcslock
> Verifying mcslock
> (lots of output of failed verifications)
> <CTRL-C>
> Interrupted
> Histogram over number of steps:
> 20: 594
> 21: 0
> 22: 0
> 23: 0
> 24: 0
> 25: 0
> 26: 566
> 27: 0
> 28: 684
> 29: 0
> 30: 155
> 31: 0
> 32: 593
> 33: 0
> 34: 420
> 35: 0
> 36: 155
> Summary:
> succeeded: 3167
> interrupted: 0
> failed: 5871
> total: 9038 (0x234e)
> Synchronization analysis:
> load @ src/p64_mcslock.c:28 synchronizes-with store @ src/p64_mcslock.c:56 (count 594)
> load @ src/p64_mcslock.c:42 synchronizes-with store @ src/p64_mcslock.c:70 (count 2573)
> load @ src/p64_mcslock.c:65 synchronizes-with store @ src/p64_mcslock.c:38 (count 2573)
> load @ src/p64_mcslock.c:28 synchronizes-with store @ src/p64_mcslock.c:28 (count 8444)
> src/p64_mcslock.c:69 data-races-with src/p64_mcslock.c:26 (count 5871)
> Elapsed 1.028909533 seconds, average 113842ns/permutation, average 11530ns/step
>
> The first failed permutation is 0x7. Re-run that with the verbose flag.
> $ ./verify -avm -p7 mcslock
> Verifying mcslock using permutation 0x7
> Step 0: thread 1 file src/p64_mcslock.c line 25 -w write_8(0x7c5dd070,0,regular)
> Step 1: thread 1 file src/p64_mcslock.c line 26 -w write_1(0x7c5dd078,0x1,regular)
> Step 2: thread 1 file src/p64_mcslock.c line 28 rw exchange_8(0x7c5dd2b0,0x7c5dd070,acq_rls)=0
> Step 3: thread 0 file src/p64_mcslock.c line 25 -w write_8(0x7c5d8c70,0,regular)
> Step 4: thread 0 file src/p64_mcslock.c line 26 -w write_1(0x7c5d8c78,0x1,regular)
> Step 5: thread 0 file src/p64_mcslock.c line 28 rw exchange_8(0x7c5dd2b0,0x7c5d8c70,acq_rls)=0x7c5dd070
> Atomic read_8 on step 5 matches atomic write_8 from other thread on step 2
> Step 5 (src/p64_mcslock.c:28) synchronizes-with step 2 (src/p64_mcslock.c:28)
> Step 6: thread 0 file src/p64_mcslock.c line 37 r- read_8(0x7c5dd070,regular)=0
> Regular read_8 on step 6 matches regular write_8 from other thread on step 0
> Read on step 6 matching write on step 0 saved by synchronizes-with on steps 5-2
> Step 7: thread 0 file src/p64_mcslock.c line 38 -w store_8(0x7c5dd070,0x7c5d8c70,rls)
> Step 8: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
> Atomic read_1 on step 8 matches regular write_1 from same thread on step 4
> Step 9: thread 0 file src/p64_mcslock.c line 42 -- yield()
> Yielding to other thread
> Step 10: thread 1 file src/ver_mcslock.c line 42 r- read_1(0x7c5dd2b8,regular)=0
> Step 11: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
> Atomic read_1 on step 11 matches regular write_1 from same thread on step 4
> Step 12: thread 0 file src/p64_mcslock.c line 42 -- yield()
> Yielding to other thread
> Step 13: thread 1 file src/ver_mcslock.c line 43 -w write_1(0x7c5dd2b8,0x1,regular)
> Step 14: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
> Atomic read_1 on step 14 matches regular write_1 from same thread on step 4
> Step 15: thread 0 file src/p64_mcslock.c line 42 -- yield()
> Yielding to other thread
> Step 16: thread 1 file src/ver_mcslock.c line 44 r- read_1(0x7c5dd2b8,regular)=0x1
> Regular read_1 on step 16 matches regular write_1 from same thread on step 13
> Step 17: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
> Atomic read_1 on step 17 matches regular write_1 from same thread on step 4
> Step 18: thread 0 file src/p64_mcslock.c line 42 -- yield()
> Yielding to other thread
> Step 19: thread 1 file src/ver_mcslock.c line 45 -w write_1(0x7c5dd2b8,0,regular)
> Step 20: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
> Atomic read_1 on step 20 matches regular write_1 from same thread on step 4
> Step 21: thread 0 file src/p64_mcslock.c line 42 -- yield()
> Yielding to other thread
> Step 22: thread 1 file src/p64_mcslock.c line 51 r- load_8(0x7c5dd070,rlx)=0x7c5d8c70
> Step 23: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
> Atomic read_1 on step 23 matches regular write_1 from same thread on step 4
> Step 24: thread 0 file src/p64_mcslock.c line 42 -- yield()
> Yielding to other thread
> Step 25: thread 1 file src/p64_mcslock.c line 69 r- read_1(0x7c5d8c78,regular)=0x1
> Regular read_1 on step 25 matches regular write_1 from other thread on step 4
> ERROR: Read on step 25 matching write on step 4 missing synchronize-with!
> Module mcslock permutation 0x7 step 26: Verification failed
>
> Change load-relaxed to load-acquire in unlock and rerun the verifier. By default it
> runs 4G (2^32) permutations, but this can be changed.
> $ ./verify -am mcslock
> Verifying mcslock
> Verifying permutation 0...
> ...
> Histogram over number of steps:
> 20: 281018368
> 21: 0
> 22: 0
> 23: 0
> 24: 143130624
> 25: 0
> 26: 906690560
> 27: 137625600
> 28: 1200701440
> 29: 252887040
> 30: 752590848
> 31: 181182464
> 32: 279252992
> 33: 74387456
> 34: 59373568
> 35: 17340416
> 36: 6410240
> 37: 2017280
> 38: 268800
> 39: 89600
> Summary:
> succeeded: 4294967296
> interrupted: 0
> failed: 0
> total: 4294967296 (0x100000000)
> Synchronization analysis:
> load @ src/p64_mcslock.c:28 synchronizes-with store @ src/p64_mcslock.c:56 (count 281018368)
> load @ src/p64_mcslock.c:42 synchronizes-with store @ src/p64_mcslock.c:70 (count 4013948928)
> load @ src/p64_mcslock.c:51 synchronizes-with store @ src/p64_mcslock.c:38 (count 2808086528)
> load @ src/p64_mcslock.c:65 synchronizes-with store @ src/p64_mcslock.c:38 (count 1205862400)
> load @ src/p64_mcslock.c:28 synchronizes-with store @ src/p64_mcslock.c:28 (count 4013948928)
> No data races detected
> Elapsed 5742.058291250 seconds, average 1336ns/permutation, average 47ns/step
>
> The verifier is still work in progress and has some limitations. It implements an SC
> (sequential consistency)-like memory model but then analyses all regular accesses to
> ensure the necessary synchronize-with (acquire/release) edges are present, or a data
> race is reported. One problem is that atomic accesses by definition never cause data
> races, so the verifier needs some help, which is done using additional regular
> (non-atomic) accesses. I think more work can be done here to detect data races
> without requiring additional regular accesses. Possibly an RCpc-like model could
> also be simulated, e.g. a load-acquire could return not the SC (current) value of a
> memory location but the earliest value that existing synchronize-with edges allow.
> This would require more intrusive changes to the verifier.
> Thread fences are not supported. I don't mind this too much as I don't think thread
> fences should be used unless really necessary, as their semantics are often
> misunderstood (as we have seen). However, that means code like the seqlock can't be
> verified. I have a herd7 litmus test for seqlock (p64_rwsync) instead. Matching
> thread fences with (relaxed) atomic loads and stores to identify synchronize-with
> edges is a possibility, so perhaps they will be supported at some point.
> Only two threads are currently supported but extending this to more threads is
> trivial. Verification speed will suffer though. I am planning to make the verifier
> multithreaded so it can use all processor cores in a system.
> I have many more ideas on how to improve the verifier.
Did I understand you right: you are working on a verifier tool that will be
publicly available and could be incorporated into the DPDK CI process?
If so - that would be great.
Though, I still think we can have both: stress tests and verifier :)
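BTW, for the archive, the edge your verifier caught, as I understand it, in a
simplified C11 sketch of the unlock path (illustrative types/names, not the exact
p64 or rte_mcslock code):

#include <stdatomic.h>
#include <stddef.h>

struct node {
    struct node *_Atomic next; /* successor in the queue, NULL if none */
    atomic_uint locked;        /* successor spins on this flag */
};

void
unlock(struct node *_Atomic *tail, struct node *me)
{
    /* The acquire is the fix under discussion: it synchronizes with the
     * successor's store-release of me->next in lock(). Without that edge,
     * nothing orders the successor's initialization of its node (e.g.
     * locked = 1) before our accesses to *next below - the data race the
     * verifier reported.
     */
    struct node *next = atomic_load_explicit(&me->next, memory_order_acquire);

    if (next == NULL) {
        struct node *expected = me;
        /* no visible successor; try to swing the tail back to empty */
        if (atomic_compare_exchange_strong(tail, &expected, NULL))
            return;
        /* a successor is mid-lock(); wait for it to link itself in */
        while ((next = atomic_load_explicit(&me->next,
                                            memory_order_acquire)) == NULL)
            ;
    }
    /* hand the lock over; the release pairs with the successor's
     * load-acquire spin on its own locked flag
     */
    atomic_store_explicit(&next->locked, 0, memory_order_release);
}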
Konstantin