From: Ola Liljedahl <Ola.Liljedahl@arm.com>
To: Konstantin Ananyev <konstantin.ananyev@huawei.com>,
Stephen Hemminger <stephen@networkplumber.org>
Cc: Wathsala Vithanage <wathsala.vithanage@arm.com>,
Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>,
Tyler Retzlaff <roretzla@linux.microsoft.com>,
"dev@dpdk.org" <dev@dpdk.org>,
"vattunuru@marvell.com" <vattunuru@marvell.com>, nd <nd@arm.com>
Subject: Re: [PATCH 1/1] eal: correct memory ordering in MCS lock
Date: Wed, 5 Nov 2025 19:39:23 +0000
Message-ID: <1CB25A39-722A-47C0-B531-5A7A52C2F80C@arm.com>
In-Reply-To: <9a7a720e91f546879368a0ed83dd7224@huawei.com>
> On 2025-11-04, 09:18, "Konstantin Ananyev" <konstantin.ananyev@huawei.com> wrote:
>
> > On Mon, 3 Nov 2025 18:06:05 +0000
> > Konstantin Ananyev <konstantin.ananyev@huawei.com> wrote:
> >
> > > >
> > > > On 11/3/25 11:07, Stephen Hemminger wrote:
> > > > > On Mon, 3 Nov 2025 09:12:39 -0600
> > > > > Wathsala Vithanage <wathsala.vithanage@arm.com> wrote:
> > > > >
> > > > >> MCS lock is broken; it's just a matter of time before it runs into a deadlock.
> > > > >>
> > > > >> drivers/dma/cnxk is a user of MCS lock.
> > > > > I am surprised that a driver would use mcslock.
> > > > > MCSlock is targeted at case of large number of CPU's with lots of
> > contention.
> > > > > It will likely be slower than spinlock or ticketlock for a driver's use case.
> > > > It appears in drivers/dma/cnxk/cnxk_dmadev_fp.c; perhaps the
> > > > maintainer can clarify.
> > > >
> > >
> > > If MCS lock is really broken, it needs to be fixed anyway.
> > > It might be used by third-party libs/apps built on DPDK.
> >
> > 100% agree it must be fixed.
> > It would be good to have test or static analysis tool that could validate all the lock
> > types.
>
>
> +1
> Another option would be to have some sort of stress test for all the locking types we have in our UT.
> At least for ring I found it very useful.
A stress test only tests the specific hardware and compiler you are using. E.g. the LDAPR
problem in rte_ring was hidden for years because compilers (GCC) did not support generating
LDAPR instead of LDAR for load-acquire. Arm processors have implemented LDAPR since the
Neoverse N1, which was released in 2019. By comparison, it was only in 2023, with GCC 13, that
support for FEAT_RCPC (i.e. the LDAPR instruction) was added, and you must also specify +rcpc
for -march or specify a CPU that supports FEAT_RCPC; building with the defaults won't generate
LDAPR and the bug would remain hidden. A stress test that is run on the target every time DPDK
is rebuilt would cover some of that.
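For illustration, one can check the generated code directly (a sketch; the file name is made up
and the exact compiler binary and output depend on your toolchain, here an AArch64 GCC 13):

/* acq.c */
#include <stdatomic.h>

int load_acq(atomic_int *p)
{
    /* load-acquire: compiles to LDAR, or LDAPR when FEAT_RCPC is enabled */
    return atomic_load_explicit(p, memory_order_acquire);
}

$ gcc -O2 -march=armv8-a -S acq.c        # emits LDAR
$ gcc -O2 -march=armv8-a+rcpc -S acq.c   # emits LDAPR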
A stress test (like any test) is also not guaranteed to provoke all code paths and thread
interleavings, so there are still no guarantees.
I claim that it is important to verify that the code follows the C/C++ memory model and implements
all the necessary happens-before relations (synchronize-with edges). Then it will work on all
hardware and compilers, now and in the future, that implement the C/C++ memory model (and
Java, which is quite similar AFAIK, all atomics being seq_cst).
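To make the terminology concrete, the minimal pattern looks like this (a generic C11 sketch, not
code from DPDK or Progress64):

#include <assert.h>
#include <stdatomic.h>

int msg;          /* regular, non-atomic data */
atomic_int flag;  /* guards the publication of msg */

/* thread A (writer) */
msg = 17;                                              /* regular write */
atomic_store_explicit(&flag, 1, memory_order_release); /* publish */

/* thread B (reader) */
if (atomic_load_explicit(&flag, memory_order_acquire) == 1)
    /* the load-acquire synchronizes-with the store-release, so the write
       of msg happens-before this read and msg == 17 is guaranteed */
    assert(msg == 17);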
One such verifier exists in Progress64. It verifies the actual implementation by executing all
possible interleavings (permutations) of two threads running a test program and ensures that all
required synchronize-with edges exist (as implemented by load-acquire/store-release). Lack of
a synchronize-with edge between two accesses to the same location by different threads means
a data race has been found.
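For context, this is the shape of the unlock side of an MCS lock in C11 (an illustrative sketch
with made-up names, not the actual DPDK or Progress64 code):

#include <stdatomic.h>
#include <stddef.h>

struct mcsnode {
    struct mcsnode *_Atomic next;
    _Atomic int wait;
};

void mcs_unlock(struct mcsnode *_Atomic *tail, struct mcsnode *me)
{
    /* This load must be acquire: it synchronizes-with the successor's
       store-release that links its node into me->next, so the successor's
       node (e.g. its wait flag) is fully initialized before we access it.
       With a relaxed load, the accesses to *succ below race with the
       successor's initialization of its node. */
    struct mcsnode *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        struct mcsnode *expected = me;
        /* No successor visible: try to swing the tail back to NULL */
        if (atomic_compare_exchange_strong_explicit(tail, &expected, NULL,
                memory_order_release, memory_order_relaxed))
            return;
        /* A successor is linking in; wait until it has published itself */
        while ((succ = atomic_load_explicit(&me->next,
                                            memory_order_acquire)) == NULL)
            ;
    }
    /* Hand over the lock; the release publishes our critical section */
    atomic_store_explicit(&succ->wait, 0, memory_order_release);
}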
As an example, make that load-acquire in the MCS unlock function a load-relaxed instead (as
it was before that patch) and run the verifier:
$ ./verify -am mcslock
Verifying mcslock
(lots of output of failed verifications)
<CTRL-C>
Interrupted
Histogram over number of steps:
20: 594
21: 0
22: 0
23: 0
24: 0
25: 0
26: 566
27: 0
28: 684
29: 0
30: 155
31: 0
32: 593
33: 0
34: 420
35: 0
36: 155
Summary:
succeeded: 3167
interrupted: 0
failed: 5871
total: 9038 (0x234e)
Synchronization analysis:
load @ src/p64_mcslock.c:28 synchronizes-with store @ src/p64_mcslock.c:56 (count 594)
load @ src/p64_mcslock.c:42 synchronizes-with store @ src/p64_mcslock.c:70 (count 2573)
load @ src/p64_mcslock.c:65 synchronizes-with store @ src/p64_mcslock.c:38 (count 2573)
load @ src/p64_mcslock.c:28 synchronizes-with store @ src/p64_mcslock.c:28 (count 8444)
src/p64_mcslock.c:69 data-races-with src/p64_mcslock.c:26 (count 5871)
Elapsed 1.028909533 seconds, average 113842ns/permutation, average 11530ns/step
The first failed permutation is 0x7. Re-run that with the verbose flag.
$ ./verify -avm -p7 mcslock
Verifying mcslock using permutation 0x7
Step 0: thread 1 file src/p64_mcslock.c line 25 -w write_8(0x7c5dd070,0,regular)
Step 1: thread 1 file src/p64_mcslock.c line 26 -w write_1(0x7c5dd078,0x1,regular)
Step 2: thread 1 file src/p64_mcslock.c line 28 rw exchange_8(0x7c5dd2b0,0x7c5dd070,acq_rls)=0
Step 3: thread 0 file src/p64_mcslock.c line 25 -w write_8(0x7c5d8c70,0,regular)
Step 4: thread 0 file src/p64_mcslock.c line 26 -w write_1(0x7c5d8c78,0x1,regular)
Step 5: thread 0 file src/p64_mcslock.c line 28 rw exchange_8(0x7c5dd2b0,0x7c5d8c70,acq_rls)=0x7c5dd070
Atomic read_8 on step 5 matches atomic write_8 from other thread on step 2
Step 5 (src/p64_mcslock.c:28) synchronizes-with step 2 (src/p64_mcslock.c:28)
Step 6: thread 0 file src/p64_mcslock.c line 37 r- read_8(0x7c5dd070,regular)=0
Regular read_8 on step 6 matches regular write_8 from other thread on step 0
Read on step 6 matching write on step 0 saved by synchronizes-with on steps 5-2
Step 7: thread 0 file src/p64_mcslock.c line 38 -w store_8(0x7c5dd070,0x7c5d8c70,rls)
Step 8: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 8 matches regular write_1 from same thread on step 4
Step 9: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 10: thread 1 file src/ver_mcslock.c line 42 r- read_1(0x7c5dd2b8,regular)=0
Step 11: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 11 matches regular write_1 from same thread on step 4
Step 12: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 13: thread 1 file src/ver_mcslock.c line 43 -w write_1(0x7c5dd2b8,0x1,regular)
Step 14: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 14 matches regular write_1 from same thread on step 4
Step 15: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 16: thread 1 file src/ver_mcslock.c line 44 r- read_1(0x7c5dd2b8,regular)=0x1
Regular read_1 on step 16 matches regular write_1 from same thread on step 13
Step 17: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 17 matches regular write_1 from same thread on step 4
Step 18: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 19: thread 1 file src/ver_mcslock.c line 45 -w write_1(0x7c5dd2b8,0,regular)
Step 20: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 20 matches regular write_1 from same thread on step 4
Step 21: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 22: thread 1 file src/p64_mcslock.c line 51 r- load_8(0x7c5dd070,rlx)=0x7c5d8c70
Step 23: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 23 matches regular write_1 from same thread on step 4
Step 24: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 25: thread 1 file src/p64_mcslock.c line 69 r- read_1(0x7c5d8c78,regular)=0x1
Regular read_1 on step 25 matches regular write_1 from other thread on step 4
ERROR: Read on step 25 matching write on step 4 missing synchronize-with!
Module mcslock permutation 0x7 step 26: Verification failed
Change the load-relaxed to a load-acquire in unlock and rerun the verifier. By default it
runs 4G (2^32) permutations but this can be changed.
$ ./verify -am mcslock
Verifying mcslock
Verifying permutation 0...
...
Histogram over number of steps:
20: 281018368
21: 0
22: 0
23: 0
24: 143130624
25: 0
26: 906690560
27: 137625600
28: 1200701440
29: 252887040
30: 752590848
31: 181182464
32: 279252992
33: 74387456
34: 59373568
35: 17340416
36: 6410240
37: 2017280
38: 268800
39: 89600
Summary:
succeeded: 4294967296
interrupted: 0
failed: 0
total: 4294967296 (0x100000000)
Synchronization analysis:
load @ src/p64_mcslock.c:28 synchronizes-with store @ src/p64_mcslock.c:56 (count 281018368)
load @ src/p64_mcslock.c:42 synchronizes-with store @ src/p64_mcslock.c:70 (count 4013948928)
load @ src/p64_mcslock.c:51 synchronizes-with store @ src/p64_mcslock.c:38 (count 2808086528)
load @ src/p64_mcslock.c:65 synchronizes-with store @ src/p64_mcslock.c:38 (count 1205862400)
load @ src/p64_mcslock.c:28 synchronizes-with store @ src/p64_mcslock.c:28 (count 4013948928)
No data races detected
Elapsed 5742.058291250 seconds, average 1336ns/permutation, average 47ns/step
The verifier is still a work in progress and has some limitations. It implements an SC
(sequential consistency)-like memory model but then analyses all regular accesses to
ensure the necessary synchronize-with (acquire/release) edges are present, or a data
race is reported. One problem is that atomic accesses by definition never cause data
races, so the verifier needs some help, which is done using additional regular (non-
atomic) accesses. I think more work can be done here to detect data races without
requiring additional regular accesses. Possibly an RCpc-like model could also be simulated,
e.g. a load-acquire could return not the SC (current) value of a memory location but the
earliest value that existing synchronize-with edges allow. This would require more
intrusive changes to the verifier.
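For reference, the test program this implies looks roughly like the following (a sketch modeled
on the ver_mcslock.c accesses visible in the trace above; the Progress64 function and type names
are assumptions, check the actual headers):

#include "p64_mcslock.h"

static p64_mcslock_t lock;
static char shared;  /* regular, non-atomic: this is the "help" mentioned
                        above, any unsynchronized access to it is a data race */

static void critical_section(void)
{
    p64_mcsnode_t node;
    p64_mcslock_acquire(&lock, &node);
    /* regular read-modify-write of a plain variable inside the critical
       section (cf. the ver_mcslock.c:42-45 accesses in the trace above);
       the verifier reports a data race here unless the lock provides the
       required synchronize-with edges */
    char tmp = shared;
    shared = (char)(tmp + 1);
    p64_mcslock_release(&lock, &node);
}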
Thread fences are not supported. I don't mind this too much as I don't think thread fences
should be used unless really necessary, as their semantics are often misunderstood (as we
have seen). However, that means code like the seqlock can't be verified. I have a herd7
litmus test for the seqlock (p64_rwsync) instead. Matching thread fences with (relaxed) atomic
loads and stores to identify synchronize-with edges is a possibility, so perhaps they will be
supported at some point.
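As an example of how fences are misunderstood: a release fence publishes nothing by itself; the
relaxed atomic store after it is the publication point, and it must pair with an acquire fence (or
a load-acquire) on the reader side (a generic C11 sketch):

#include <stdatomic.h>

int data;          /* regular */
atomic_int ready;

/* writer */
data = 42;
atomic_thread_fence(memory_order_release);               /* orders the write to data... */
atomic_store_explicit(&ready, 1, memory_order_relaxed);  /* ...before this store */

/* reader */
while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
    ;
atomic_thread_fence(memory_order_acquire);  /* pairs with the release fence */
/* data == 42 is now guaranteed */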
Only two threads are currently supported, but extending this to more threads is trivial.
Verification speed will suffer, though. I am planning to make the verifier multithreaded so it
can use all processor cores in a system.
I have many more ideas on how to improve the verifier.
- Ola