From: Ola Liljedahl <Ola.Liljedahl@arm.com>
To: Konstantin Ananyev <konstantin.ananyev@huawei.com>,
Stephen Hemminger <stephen@networkplumber.org>
Cc: Wathsala Vithanage <wathsala.vithanage@arm.com>,
Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>,
Tyler Retzlaff <roretzla@linux.microsoft.com>,
"dev@dpdk.org" <dev@dpdk.org>,
"vattunuru@marvell.com" <vattunuru@marvell.com>, nd <nd@arm.com>
Subject: Re: [PATCH 1/1] eal: correct memory ordering in MCS lock
Date: Wed, 5 Nov 2025 19:39:23 +0000
Message-ID: <1CB25A39-722A-47C0-B531-5A7A52C2F80C@arm.com>
In-Reply-To: <9a7a720e91f546879368a0ed83dd7224@huawei.com>
> On 2025-11-04, 09:18, "Konstantin Ananyev" <konstantin.ananyev@huawei.com> wrote:
>
> > On Mon, 3 Nov 2025 18:06:05 +0000
> > Konstantin Ananyev <konstantin.ananyev@huawei.com> wrote:
> >
> > > >
> > > > On 11/3/25 11:07, Stephen Hemminger wrote:
> > > > > On Mon, 3 Nov 2025 09:12:39 -0600
> > > > > Wathsala Vithanage <wathsala.vithanage@arm.com> wrote:
> > > > >
> > > > >> MCS lock is broken; it's just a matter of time before it runs into a deadlock.
> > > > >>
> > > > >> drivers/dma/cnxk is a user of MCS lock.
> > > > > I am surprised that a driver would use mcslock.
> > > > > MCSlock is targeted at case of large number of CPU's with lots of
> > contention.
> > > > > It will likely be slower than spinlock or ticketlock for a driver's use case.
> > > > It appears in drivers/dma/cnxk/cnxk_dmadev_fp.c; perhaps the
> > > > maintainer can clarify.
> > > >
> > >
> > > If MCS lock is really broken, it needs to be fixed anyway.
> > > It might be used by third-party libs/apps built on DPDK.
> >
> > 100% agree it must be fixed.
> > It would be good to have test or static analysis tool that could validate all the lock
> > types.
>
>
> +1
> Another option would be to have some sort of stress test for all the locking types we have in our UT.
> At least for ring I found it very useful.
A stress test only tests the specific hardware and compiler you are using. E.g. the LDAPR
problem in rte_ring was hidden for years because compilers (GCC) did not support generating
LDAPR instead of LDAR for load-acquire. Arm processors have implemented LDAPR since the
Neoverse N1, which was released in 2019. By comparison, it was only in 2023, with GCC 13, that
support for FEAT_RCPC (i.e. the LDAPR instruction) was added, and you must also specify +rcpc
for -march or specify a CPU that supports FEAT_RCPC; building with the defaults won't generate
LDAPR and the bug would remain hidden. A stress test that is run on the target every time DPDK
is rebuilt would cover some of that.
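For illustration, one can check the generated code directly (a sketch; the file name is made up
and the exact compiler binary and output depend on your toolchain, here an AArch64 GCC 13):

/* acq.c */
#include <stdatomic.h>

int load_acq(atomic_int *p)
{
    /* load-acquire: compiles to LDAR, or LDAPR when FEAT_RCPC is enabled */
    return atomic_load_explicit(p, memory_order_acquire);
}

$ gcc -O2 -march=armv8-a -S acq.c        # emits LDAR
$ gcc -O2 -march=armv8-a+rcpc -S acq.c   # emits LDAPR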
A stress test (like any test) is also not guaranteed to provoke all code paths and thread
interleavings, so there are still no guarantees.
I claim that it is important to verify that the code follows the C/C++ memory model and implements
all the necessary happens-before relations (synchronize-with edges). Then it will work on all
hardware and compilers, now and in the future, that implement the C/C++ memory model (and
Java, which is quite similar AFAIK, all atomics being seq_cst).
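To make the terminology concrete, the minimal pattern looks like this (a generic C11 sketch, not
code from DPDK or Progress64):

#include <assert.h>
#include <stdatomic.h>

int msg;          /* regular, non-atomic data */
atomic_int flag;  /* guards the publication of msg */

/* thread A (writer) */
msg = 17;                                              /* regular write */
atomic_store_explicit(&flag, 1, memory_order_release); /* publish */

/* thread B (reader) */
if (atomic_load_explicit(&flag, memory_order_acquire) == 1)
    /* the load-acquire synchronizes-with the store-release, so the write
       of msg happens-before this read and msg == 17 is guaranteed */
    assert(msg == 17);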
One such verifier exists in Progress64. It verifies the actual implementation by executing all
possible interleavings (permutations) of two threads running a test program and ensures that all
required synchronize-with edges exist (as implemented by load-acquire/store-release). Lack of
a synchronize-with edge between two accesses to the same location by different threads means
a data race has been found.
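For context, this is the shape of the unlock side of an MCS lock in C11 (an illustrative sketch
with made-up names, not the actual DPDK or Progress64 code):

#include <stdatomic.h>
#include <stddef.h>

struct mcsnode {
    struct mcsnode *_Atomic next;
    _Atomic int wait;
};

void mcs_unlock(struct mcsnode *_Atomic *tail, struct mcsnode *me)
{
    /* This load must be acquire: it synchronizes-with the successor's
       store-release that links its node into me->next, so the successor's
       node (e.g. its wait flag) is fully initialized before we access it.
       With a relaxed load, the accesses to *succ below race with the
       successor's initialization of its node. */
    struct mcsnode *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        struct mcsnode *expected = me;
        /* No successor visible: try to swing the tail back to NULL */
        if (atomic_compare_exchange_strong_explicit(tail, &expected, NULL,
                memory_order_release, memory_order_relaxed))
            return;
        /* A successor is linking in; wait until it has published itself */
        while ((succ = atomic_load_explicit(&me->next,
                                            memory_order_acquire)) == NULL)
            ;
    }
    /* Hand over the lock; the release publishes our critical section */
    atomic_store_explicit(&succ->wait, 0, memory_order_release);
}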
As an example, make that load-acquire in the MCS unlock function a load-relaxed instead (as
it was before that patch) and run the verifier:
$ ./verify -am mcslock
Verifying mcslock
(lots of output of failed verifications)
<CTRL-C>
Interrupted
Histogram over number of steps:
20: 594
21: 0
22: 0
23: 0
24: 0
25: 0
26: 566
27: 0
28: 684
29: 0
30: 155
31: 0
32: 593
33: 0
34: 420
35: 0
36: 155
Summary:
succeeded: 3167
interrupted: 0
failed: 5871
total: 9038 (0x234e)
Synchronization analysis:
load @ src/p64_mcslock.c:28 synchronizes-with store @ src/p64_mcslock.c:56 (count 594)
load @ src/p64_mcslock.c:42 synchronizes-with store @ src/p64_mcslock.c:70 (count 2573)
load @ src/p64_mcslock.c:65 synchronizes-with store @ src/p64_mcslock.c:38 (count 2573)
load @ src/p64_mcslock.c:28 synchronizes-with store @ src/p64_mcslock.c:28 (count 8444)
src/p64_mcslock.c:69 data-races-with src/p64_mcslock.c:26 (count 5871)
Elapsed 1.028909533 seconds, average 113842ns/permutation, average 11530ns/step
The first failed permutation is 0x7. Re-run that with the verbose flag.
$ ./verify -avm -p7 mcslock
Verifying mcslock using permutation 0x7
Step 0: thread 1 file src/p64_mcslock.c line 25 -w write_8(0x7c5dd070,0,regular)
Step 1: thread 1 file src/p64_mcslock.c line 26 -w write_1(0x7c5dd078,0x1,regular)
Step 2: thread 1 file src/p64_mcslock.c line 28 rw exchange_8(0x7c5dd2b0,0x7c5dd070,acq_rls)=0
Step 3: thread 0 file src/p64_mcslock.c line 25 -w write_8(0x7c5d8c70,0,regular)
Step 4: thread 0 file src/p64_mcslock.c line 26 -w write_1(0x7c5d8c78,0x1,regular)
Step 5: thread 0 file src/p64_mcslock.c line 28 rw exchange_8(0x7c5dd2b0,0x7c5d8c70,acq_rls)=0x7c5dd070
Atomic read_8 on step 5 matches atomic write_8 from other thread on step 2
Step 5 (src/p64_mcslock.c:28) synchronizes-with step 2 (src/p64_mcslock.c:28)
Step 6: thread 0 file src/p64_mcslock.c line 37 r- read_8(0x7c5dd070,regular)=0
Regular read_8 on step 6 matches regular write_8 from other thread on step 0
Read on step 6 matching write on step 0 saved by synchronizes-with on steps 5-2
Step 7: thread 0 file src/p64_mcslock.c line 38 -w store_8(0x7c5dd070,0x7c5d8c70,rls)
Step 8: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 8 matches regular write_1 from same thread on step 4
Step 9: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 10: thread 1 file src/ver_mcslock.c line 42 r- read_1(0x7c5dd2b8,regular)=0
Step 11: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 11 matches regular write_1 from same thread on step 4
Step 12: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 13: thread 1 file src/ver_mcslock.c line 43 -w write_1(0x7c5dd2b8,0x1,regular)
Step 14: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 14 matches regular write_1 from same thread on step 4
Step 15: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 16: thread 1 file src/ver_mcslock.c line 44 r- read_1(0x7c5dd2b8,regular)=0x1
Regular read_1 on step 16 matches regular write_1 from same thread on step 13
Step 17: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 17 matches regular write_1 from same thread on step 4
Step 18: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 19: thread 1 file src/ver_mcslock.c line 45 -w write_1(0x7c5dd2b8,0,regular)
Step 20: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 20 matches regular write_1 from same thread on step 4
Step 21: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 22: thread 1 file src/p64_mcslock.c line 51 r- load_8(0x7c5dd070,rlx)=0x7c5d8c70
Step 23: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 23 matches regular write_1 from same thread on step 4
Step 24: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 25: thread 1 file src/p64_mcslock.c line 69 r- read_1(0x7c5d8c78,regular)=0x1
Regular read_1 on step 25 matches regular write_1 from other thread on step 4
ERROR: Read on step 25 matching write on step 4 missing synchronize-with!
Module mcslock permutation 0x7 step 26: Verification failed
Change the load-relaxed to a load-acquire in unlock and rerun the verifier. By default it
runs 4G (2^32) permutations but this can be changed.
$ ./verify -am mcslock
Verifying mcslock
Verifying permutation 0...
...
Histogram over number of steps:
20: 281018368
21: 0
22: 0
23: 0
24: 143130624
25: 0
26: 906690560
27: 137625600
28: 1200701440
29: 252887040
30: 752590848
31: 181182464
32: 279252992
33: 74387456
34: 59373568
35: 17340416
36: 6410240
37: 2017280
38: 268800
39: 89600
Summary:
succeeded: 4294967296
interrupted: 0
failed: 0
total: 4294967296 (0x100000000)
Synchronization analysis:
load @ src/p64_mcslock.c:28 synchronizes-with store @ src/p64_mcslock.c:56 (count 281018368)
load @ src/p64_mcslock.c:42 synchronizes-with store @ src/p64_mcslock.c:70 (count 4013948928)
load @ src/p64_mcslock.c:51 synchronizes-with store @ src/p64_mcslock.c:38 (count 2808086528)
load @ src/p64_mcslock.c:65 synchronizes-with store @ src/p64_mcslock.c:38 (count 1205862400)
load @ src/p64_mcslock.c:28 synchronizes-with store @ src/p64_mcslock.c:28 (count 4013948928)
No data races detected
Elapsed 5742.058291250 seconds, average 1336ns/permutation, average 47ns/step
The verifier is still a work in progress and has some limitations. It implements an SC
(sequential consistency)-like memory model but then analyses all regular accesses to
ensure the necessary synchronize-with (acquire/release) edges are present, or a data
race is reported. One problem is that atomic accesses by definition never cause data
races, so the verifier needs some help, which is done using additional regular (non-
atomic) accesses. I think more work can be done here to detect data races without
requiring additional regular accesses. Possibly an RCpc-like model could also be simulated,
e.g. a load-acquire could return not the SC (current) value of a memory location but the
earliest value that existing synchronize-with edges allow. This would require more
intrusive changes to the verifier.
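For reference, the test program this implies looks roughly like the following (a sketch modeled
on the ver_mcslock.c accesses visible in the trace above; the Progress64 function and type names
are assumptions, check the actual headers):

#include "p64_mcslock.h"

static p64_mcslock_t lock;
static char shared;  /* regular, non-atomic: this is the "help" mentioned
                        above, any unsynchronized access to it is a data race */

static void critical_section(void)
{
    p64_mcsnode_t node;
    p64_mcslock_acquire(&lock, &node);
    /* regular read-modify-write of a plain variable inside the critical
       section (cf. the ver_mcslock.c:42-45 accesses in the trace above);
       the verifier reports a data race here unless the lock provides the
       required synchronize-with edges */
    char tmp = shared;
    shared = (char)(tmp + 1);
    p64_mcslock_release(&lock, &node);
}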
Thread fences are not supported. I don't mind this too much as I don't think thread fences
should be used unless really necessary, as their semantics are often misunderstood (as we
have seen). However, that means code like the seqlock can't be verified. I have a herd7
litmus test for the seqlock (p64_rwsync) instead. Matching thread fences with (relaxed) atomic
loads and stores to identify synchronize-with edges is a possibility, so perhaps they will be
supported at some point.
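As an example of how fences are misunderstood: a release fence publishes nothing by itself; the
relaxed atomic store after it is the publication point, and it must pair with an acquire fence (or
a load-acquire) on the reader side (a generic C11 sketch):

#include <stdatomic.h>

int data;          /* regular */
atomic_int ready;

/* writer */
data = 42;
atomic_thread_fence(memory_order_release);               /* orders the write to data... */
atomic_store_explicit(&ready, 1, memory_order_relaxed);  /* ...before this store */

/* reader */
while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
    ;
atomic_thread_fence(memory_order_acquire);  /* pairs with the release fence */
/* data == 42 is now guaranteed */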
Only two threads are currently supported, but extending this to more threads is trivial.
Verification speed will suffer, though. I am planning to make the verifier multithreaded so it
can use all processor cores in a system.
I have many more ideas on how to improve the verifier.
- Ola