* [PATCH 1/1] eal: correct memory ordering in MCS lock
From: Wathsala Vithanage @ 2025-10-23 18:47 UTC
To: Honnappa Nagarahalli, Tyler Retzlaff
Cc: dev, Wathsala Vithanage, Ola Liljedahl
Fix incorrect memory ordering in the MCS lock implementation by
adding proper synchronizing edges to establish clear happens-before
relationships between threads invoking lock() and unlock(). These
synchronizing edges prevent potential deadlocks caused by improper
ordering and are documented in detail through in-code comments.
The previously relaxed load of the successor’s lock object pointer
in unlock() has been upgraded to a load-acquire operation. This
change ensures that the successor’s initialization does not
overwrite the current lock holder’s update to the locked field,
which could otherwise lead to deadlocks.
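
For illustration, the problematic interleaving with the relaxed load
looks roughly like this (a simplified sketch, not an actual trace):

  successor: lock()              holder: unlock()
  -----------------              ----------------
  me->locked = 1    (relaxed)
  me->next = NULL   (relaxed)
  xchg(msl, me)     (acq_rel)
  prev->next = me   (release)
                                 load me->next == me  (relaxed: does not
                                                       synchronize with the
                                                       release above)
                                 me->next->locked = 0 (release)
  spin until me->locked == 0

Without the acquire, the C11 model places no happens-before between the
successor's store of 1 and the holder's store of 0 to the same locked
field; the store of 1 may be ordered after the store of 0, and the
successor spins forever.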
Remove two unnecessary fences:
The acquire fence in unlock() had no matching release fence, making
it ineffective for enforcing memory order. The associated comment
suggested it prevented speculative reordering, but such fences (data
memory barriers) only establish memory ordering and do not control
instruction speculation.
The release-acquire fence pair in lock() was previously justified as
preventing reordering between the load-acquire loop of me->locked
and the store-release of prev->next. This is no longer needed, as the
new synchronizing edges ensure a chain of happens-before
relationships between memory operations of threads calling lock() and
unlock().
Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
---
lib/eal/include/rte_mcslock.h | 100 ++++++++++++++++++++++------------
1 file changed, 64 insertions(+), 36 deletions(-)
diff --git a/lib/eal/include/rte_mcslock.h b/lib/eal/include/rte_mcslock.h
index bb218d2e50..0af7a94a06 100644
--- a/lib/eal/include/rte_mcslock.h
+++ b/lib/eal/include/rte_mcslock.h
@@ -57,11 +57,21 @@ rte_mcslock_lock(RTE_ATOMIC(rte_mcslock_t *) *msl, rte_mcslock_t *me)
rte_atomic_store_explicit(&me->locked, 1, rte_memory_order_relaxed);
rte_atomic_store_explicit(&me->next, NULL, rte_memory_order_relaxed);
- /* If the queue is empty, the exchange operation is enough to acquire
- * the lock. Hence, the exchange operation requires acquire semantics.
- * The store to me->next above should complete before the node is
- * visible to other CPUs/threads. Hence, the exchange operation requires
- * release semantics as well.
+ /*
+ * A0/R0: The queue might be empty; perform the exchange (RMW) with both acquire
+ * and release semantics:
+ * A0: Acquire — synchronizes with both R0 and R2.
+ * Must synchronize with R0 so that this thread observes the predecessor's
+ * initialization of its lock object; otherwise the predecessor's store of NULL
+ * to its next field could overwrite this thread's update via the store to prev->next.
+ *
+ * Must synchronize with R2, the releasing CAS in unlock(); this ensures that
+ * all prior critical-section writes become visible to this thread.
+ *
+ * R0: Release — ensures the successor observes our initialization of me->next;
+ * without it, me->next could be overwritten to NULL after the successor
+ * sets its own address, causing deadlock. This release synchronizes with
+ * A0 above.
*/
prev = rte_atomic_exchange_explicit(msl, me, rte_memory_order_acq_rel);
if (likely(prev == NULL)) {
@@ -70,24 +80,26 @@ rte_mcslock_lock(RTE_ATOMIC(rte_mcslock_t *) *msl, rte_mcslock_t *me)
*/
return;
}
- /* The store to me->next above should also complete before the node is
- * visible to predecessor thread releasing the lock. Hence, the store
- * prev->next also requires release semantics. Note that, for example,
- * on ARM, the release semantics in the exchange operation is not
- * strong as a release fence and is not sufficient to enforce the
- * desired order here.
+
+ /*
+ * R1: Under the relaxed C/C++ memory model, once we link ourselves by
+ * storing prev->next = me, the owner of prev must observe our prior
+ * initialization of me->locked. Otherwise it could clear me->locked
+ * before we set it to 1, which may deadlock.
+ * Perform a releasing store so that the predecessor's acquire loads
+ * (A2 and A3) observe our initialization, establishing a happens-before
+ * edge from those writes.
*/
rte_atomic_store_explicit(&prev->next, me, rte_memory_order_release);
- /* The while-load of me->locked should not move above the previous
- * store to prev->next. Otherwise it will cause a deadlock. Need a
- * store-load barrier.
- */
- rte_atomic_thread_fence(rte_memory_order_acq_rel);
- /* If the lock has already been acquired, it first atomically
+ /*
+ * A1: If the lock has already been acquired, it first atomically
* places the node at the end of the queue and then proceeds
* to spin on me->locked until the previous lock holder resets
- * the me->locked using mcslock_unlock().
+ * the me->locked in rte_mcslock_unlock().
+ * This load must synchronize with the store-release R3 to ensure that
+ * all updates to the critical section by the previous lock holder are
+ * visible to this thread after acquiring the lock.
*/
rte_wait_until_equal_32((uint32_t *)(uintptr_t)&me->locked, 0, rte_memory_order_acquire);
}
@@ -103,30 +115,46 @@ rte_mcslock_lock(RTE_ATOMIC(rte_mcslock_t *) *msl, rte_mcslock_t *me)
static inline void
rte_mcslock_unlock(RTE_ATOMIC(rte_mcslock_t *) *msl, RTE_ATOMIC(rte_mcslock_t *) me)
{
- /* Check if there are more nodes in the queue. */
- if (likely(rte_atomic_load_explicit(&me->next, rte_memory_order_relaxed) == NULL)) {
+ /*
+ * A2: Check whether a successor is already queued.
+ * Load me->next with acquire semantics so it can synchronize with the
+ * successor’s release store R1. This guarantees that the successor’s
+ * initialization of its lock object (me) is completed before we observe
+ * it here, preventing a race between this thread’s store-release to
+ * me->next->locked and the successor’s store to me->locked.
+ */
+ if (likely(rte_atomic_load_explicit(&me->next, rte_memory_order_acquire) == NULL)) {
/* No, last member in the queue. */
- rte_mcslock_t *save_me = rte_atomic_load_explicit(&me, rte_memory_order_relaxed);
+ rte_mcslock_t *save_me = me;
- /* Release the lock by setting it to NULL */
+ /*
+ * R2: Try to release the lock by swinging *msl from save_me to NULL.
+ * Use release semantics so all critical section writes become
+ * visible to the next lock acquirer.
+ */
if (likely(rte_atomic_compare_exchange_strong_explicit(msl, &save_me, NULL,
rte_memory_order_release, rte_memory_order_relaxed)))
return;
- /* Speculative execution would be allowed to read in the
- * while-loop first. This has the potential to cause a
- * deadlock. Need a load barrier.
- */
- rte_atomic_thread_fence(rte_memory_order_acquire);
- /* More nodes added to the queue by other CPUs.
- * Wait until the next pointer is set.
+ /*
+ * A3: Another thread was enqueued concurrently, so the CAS and the lock
+ * release failed. Wait until the successor sets our 'next' pointer.
+ * This load must synchronize with the successor’s release store (R1) to
+ * ensure that the successor’s initialization completes before we observe
+ * it here. This ordering prevents a race between this thread’s later
+ * store-release to me->next->locked and the successor’s store to me->locked.
*/
RTE_ATOMIC(uintptr_t) *next;
next = (__rte_atomic uintptr_t *)&me->next;
- RTE_WAIT_UNTIL_MASKED(next, UINTPTR_MAX, !=, 0, rte_memory_order_relaxed);
+ RTE_WAIT_UNTIL_MASKED(next, UINTPTR_MAX, !=, 0, rte_memory_order_acquire);
}
- /* Pass lock to next waiter. */
+ /*
+ * R3: Pass the lock to the successor.
+ * Use a release store to synchronize with A1 when clearing me->next->locked
+ * so the successor observes our critical section writes after it sees locked
+ * become 0.
+ */
rte_atomic_store_explicit(&me->next->locked, 0, rte_memory_order_release);
}
@@ -149,11 +177,11 @@ rte_mcslock_trylock(RTE_ATOMIC(rte_mcslock_t *) *msl, rte_mcslock_t *me)
/* Try to lock */
rte_mcslock_t *expected = NULL;
- /* The lock can be taken only when the queue is empty. Hence,
- * the compare-exchange operation requires acquire semantics.
- * The store to me->next above should complete before the node
- * is visible to other CPUs/threads. Hence, the compare-exchange
- * operation requires release semantics as well.
+ /*
+ * A4/R4: The lock can be acquired only when the queue is empty.
+ * The compare-and-exchange operation must use acquire and release
+ * semantics for the same reasons described in the rte_mcslock_lock()
+ * function’s empty-queue case (see A0/R0 for details).
*/
return rte_atomic_compare_exchange_strong_explicit(msl, &expected, me,
rte_memory_order_acq_rel, rte_memory_order_relaxed);
--
2.43.0
* Re: [PATCH 1/1] eal: correct memory ordering in MCS lock
From: Wathsala Vithanage @ 2025-11-03 15:12 UTC
To: Honnappa Nagarahalli, Tyler Retzlaff; +Cc: dev, Ola Liljedahl, vattunuru
MCS lock is broken; it's just a matter of time before it runs into a deadlock.
drivers/dma/cnxk is a user of MCS lock.
--wathsala
On 10/23/25 13:47, Wathsala Vithanage wrote:
> Fix incorrect memory ordering in the MCS lock implementation by
> adding proper synchronizing edges to establish clear happens-before
> relationships between threads invoking lock() and unlock(). These
> synchronizing edges prevent potential deadlocks caused by improper
> ordering and are documented in detail through in-code comments.
> [...]
* Re: [PATCH 1/1] eal: correct memory ordering in MCS lock
From: Stephen Hemminger @ 2025-11-03 17:07 UTC
To: Wathsala Vithanage
Cc: Honnappa Nagarahalli, Tyler Retzlaff, dev, Ola Liljedahl, vattunuru
On Mon, 3 Nov 2025 09:12:39 -0600
Wathsala Vithanage <wathsala.vithanage@arm.com> wrote:
> MCS lock is broken; it's just a matter of time before it runs into a deadlock.
>
> drivers/dma/cnxk is a user of MCS lock.
I am surprised that a driver would use mcslock.
MCS lock is targeted at the case of a large number of CPUs with lots of contention.
It will likely be slower than spinlock or ticketlock for a driver's use case.
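
For comparison, a minimal mcslock caller has to supply its own queue node
(an illustrative sketch; the function and variable names are made up):

    #include <rte_mcslock.h>

    static RTE_ATOMIC(rte_mcslock_t *) p_ml;   /* shared tail pointer, NULL == unlocked */

    static void
    update_shared_state(void)
    {
    	rte_mcslock_t ml_me;               /* per-caller queue node */

    	rte_mcslock_lock(&p_ml, &ml_me);
    	/* ... critical section ... */
    	rte_mcslock_unlock(&p_ml, &ml_me);
    }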
* Re: [PATCH 1/1] eal: correct memory ordering in MCS lock
From: Wathsala Vithanage @ 2025-11-03 17:30 UTC
To: Stephen Hemminger
Cc: Honnappa Nagarahalli, Tyler Retzlaff, dev, Ola Liljedahl, vattunuru
On 11/3/25 11:07, Stephen Hemminger wrote:
> On Mon, 3 Nov 2025 09:12:39 -0600
> Wathsala Vithanage <wathsala.vithanage@arm.com> wrote:
>
>> MCS lock is broken; it's just a matter of time before it runs into a deadlock.
>>
>> drivers/dma/cnxk is a user of MCS lock.
> I am surprised that a driver would use mcslock.
> MCS lock is targeted at the case of a large number of CPUs with lots of contention.
> It will likely be slower than spinlock or ticketlock for a driver's use case.
It appears in drivers/dma/cnxk/cnxk_dmadev_fp.c; perhaps the
maintainer can clarify.
* RE: [PATCH 1/1] eal: correct memory ordering in MCS lock
From: Konstantin Ananyev @ 2025-11-03 18:06 UTC
To: Wathsala Vithanage, Stephen Hemminger
Cc: Honnappa Nagarahalli, Tyler Retzlaff, dev, Ola Liljedahl, vattunuru
> On 11/3/25 11:07, Stephen Hemminger wrote:
> > I am surprised that a driver would use mcslock.
> > MCS lock is targeted at the case of a large number of CPUs with lots of contention.
> > It will likely be slower than spinlock or ticketlock for a driver's use case.
> It appears in drivers/dma/cnxk/cnxk_dmadev_fp.c; perhaps the
> maintainer can clarify.
>
If MCS lock is really broken, it needs to be fixed anyway.
It might be used by other third-party libs/apps that build on DPDK.
* Re: [PATCH 1/1] eal: correct memory ordering in MCS lock
From: Wathsala Vithanage @ 2025-11-03 18:47 UTC
To: Konstantin Ananyev, Stephen Hemminger
Cc: Honnappa Nagarahalli, Tyler Retzlaff, dev, Ola Liljedahl, vattunuru
On 11/3/25 12:06, Konstantin Ananyev wrote:
>
>> It appears in drivers/dma/cnxk/cnxk_dmadev_fp.c; perhaps the
>> maintainer can clarify.
>>
> If MCS lock is really broken, it needs to be fixed anyway.
> It might be used by other third-party libs/apps that build on DPDK.
It’s really broken, because the release from lock() isn’t properly acquired
by the thread calling unlock(). This can lead to a situation where the
unlock() caller’s write to the locked field gets overwritten by the lock()
caller’s write, resulting in a potential deadlock.

This patch adds the missing synchronization edge, ensuring that all writes
made before the lock() caller updates prev->next are visible to the unlock()
caller once it reads me->next.
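
The rule relied on here is the standard C11 one; as a generic sketch with
plain stdatomic (illustrative, not the DPDK wrappers):

    #include <stdatomic.h>
    #include <assert.h>

    static atomic_int flag;    /* guards 'data' */
    static int data;

    void producer(void)        /* release store publishes the write to 'data'... */
    {
    	data = 42;
    	atomic_store_explicit(&flag, 1, memory_order_release);
    }

    void consumer(void)        /* ...to any thread whose acquire load sees flag == 1 */
    {
    	if (atomic_load_explicit(&flag, memory_order_acquire) == 1)
    		assert(data == 42);    /* guaranteed by the synchronizes-with edge */
    }

Downgrade either the release or the acquire to relaxed and the assertion is
no longer guaranteed; that is exactly the shape of the missing edge in unlock().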
* Re: [PATCH 1/1] eal: correct memory ordering in MCS lock
From: Stephen Hemminger @ 2025-11-03 18:48 UTC
To: Konstantin Ananyev
Cc: Wathsala Vithanage, Honnappa Nagarahalli, Tyler Retzlaff, dev,
Ola Liljedahl, vattunuru
On Mon, 3 Nov 2025 18:06:05 +0000
Konstantin Ananyev <konstantin.ananyev@huawei.com> wrote:
> If MCS lock is really broken, it needs to be fixed anyway.
> It might be used by other third-party libs/apps that build on DPDK.
100% agree it must be fixed.
It would be good to have a test or static-analysis tool that could validate all the lock types.
* Re: [PATCH 1/1] eal: correct memory ordering in MCS lock
From: Wathsala Vithanage @ 2025-11-03 19:13 UTC
To: Stephen Hemminger, Konstantin Ananyev
Cc: Honnappa Nagarahalli, Tyler Retzlaff, dev, Ola Liljedahl, vattunuru
On 11/3/25 12:48, Stephen Hemminger wrote:
> On Mon, 3 Nov 2025 18:06:05 +0000
> Konstantin Ananyev <konstantin.ananyev@huawei.com> wrote:
>
>> If MCS lock is really broken, it needs to be fixed anyway.
>> It might be used by other third-party libs/apps that build on DPDK.
> 100% agree it must be fixed.
> It would be good to have a test or static-analysis tool that could validate all the lock types.
Looked at seqlock; it seems correct.

Herd7 has been useful, but it's not a static analysis tool. It requires
manually identifying a trouble spot and then writing a litmus test to
check the hypothesis.
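
For instance, the classic message-passing shape as a herd7 C litmus test
(a generic sketch, not one of the MCS lock tests):

    C mp_acq_rel

    {}

    P0 (atomic_int *x, atomic_int *flag) {
    	atomic_store_explicit(x, 1, memory_order_relaxed);
    	atomic_store_explicit(flag, 1, memory_order_release);
    }

    P1 (atomic_int *x, atomic_int *flag) {
    	int r0 = atomic_load_explicit(flag, memory_order_acquire);
    	int r1 = atomic_load_explicit(x, memory_order_relaxed);
    }

    exists (1:r0=1 /\ 1:r1=0)

herd7 reports the exists clause as unreachable with the release/acquire pair;
weaken either side to relaxed and the forbidden outcome becomes observable.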
* RE: [PATCH 1/1] eal: correct memory ordering in MCS lock
From: Konstantin Ananyev @ 2025-11-04 8:18 UTC
To: Stephen Hemminger
Cc: Wathsala Vithanage, Honnappa Nagarahalli, Tyler Retzlaff, dev,
Ola Liljedahl, vattunuru
> On Mon, 3 Nov 2025 18:06:05 +0000
> Konstantin Ananyev <konstantin.ananyev@huawei.com> wrote:
>
> > If MCS lock is really broken, it needs to be fixed anyway.
> > It might be used by other third-party libs/apps that build on DPDK.
>
> 100% agree it must be fixed.
> It would be good to have a test or static-analysis tool that could validate all the lock types.
+1
Another option would be to have some sort of stress test for all locking types we have in our UT.
At least for the ring I found it very useful.
* Re: [PATCH 1/1] eal: correct memory ordering in MCS lock
From: Ola Liljedahl @ 2025-11-05 19:39 UTC
To: Konstantin Ananyev, Stephen Hemminger
Cc: Wathsala Vithanage, Honnappa Nagarahalli, Tyler Retzlaff, dev,
vattunuru, nd
> On 2025-11-04, 09:18, "Konstantin Ananyev" <konstantin.ananyev@huawei.com> wrote:
>
> > > 100% agree it must be fixed.
> > > It would be good to have a test or static-analysis tool that could validate all the lock types.
> >
> > +1
> > Another option would be to have some sort of stress test for all locking types we have in our UT.
> > At least for the ring I found it very useful.
A stress test only tests the specific hardware and compiler you are using. E.g. the LDAPR
problem in rte_ring was hidden for years because compilers (GCC) did not support generating
LDAPR instead of LDAR for load-acquire. Arm processors have implemented LDAPR since
Neoverse N1, which was released in 2019. By comparison, it was only in 2023, with GCC 13,
that support for FEAT_RCPC (i.e. the LDAPR instruction) was added, and you must also specify
+rcpc for -march or specify a CPU that supports FEAT_RCPC; building with defaults won't
generate LDAPR and the bug would remain hidden. A stress test that is run on the target every
time DPDK is rebuilt for some target would cover some of that.
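
For example, to get LDAPR you would need something like (illustrative command
line; the file name is made up):

    $ gcc -O2 -march=armv8.2-a+rcpc -c locktest.c    # without +rcpc, GCC emits plain LDAR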
A stress test (like any test) is also not guaranteed to provoke all code paths and thread
interleavings, so there are still no guarantees.
I claim that it is important to verify that the code follows the C/C++ memory model and implements
all the necessary happens-before relations (synchronize-with edges). Then it will work on all
hardware and compilers, now and in the future, that implement the C/C++ memory model (and
Java, which is quite similar AFAIK, all atomics being seq_cst).

One such verifier exists in Progress64. It verifies the actual implementation by executing all
possible interleavings (permutations) of two threads running a test program and ensures all
required synchronize-with edges exist (as implemented by load-acquire/store-release). Lack
of synchronize-with between two accesses to the same location by different threads means
a data race has been found.
As an example, make that load-acquire in the MCS unlock function a load-relaxed instead (as
it was before that patch) and run the verifier:
$ ./verify -am mcslock
Verifying mcslock
(lots of output of failed verifications)
<CTRL-C>
Interrupted
Histogram over number of steps:
20: 594
21: 0
22: 0
23: 0
24: 0
25: 0
26: 566
27: 0
28: 684
29: 0
30: 155
31: 0
32: 593
33: 0
34: 420
35: 0
36: 155
Summary:
succeeded: 3167
interrupted: 0
failed: 5871
total: 9038 (0x234e)
Synchronization analysis:
load @ src/p64_mcslock.c:28 synchronizes-with store @ src/p64_mcslock.c:56 (count 594)
load @ src/p64_mcslock.c:42 synchronizes-with store @ src/p64_mcslock.c:70 (count 2573)
load @ src/p64_mcslock.c:65 synchronizes-with store @ src/p64_mcslock.c:38 (count 2573)
load @ src/p64_mcslock.c:28 synchronizes-with store @ src/p64_mcslock.c:28 (count 8444)
src/p64_mcslock.c:69 data-races-with src/p64_mcslock.c:26 (count 5871)
Elapsed 1.028909533 seconds, average 113842ns/permutation, average 11530ns/step
The first failed permutation is 0x7. Re-run that with the verbose flag.
$ ./verify -avm -p7 mcslock
Verifying mcslock using permutation 0x7
Step 0: thread 1 file src/p64_mcslock.c line 25 -w write_8(0x7c5dd070,0,regular)
Step 1: thread 1 file src/p64_mcslock.c line 26 -w write_1(0x7c5dd078,0x1,regular)
Step 2: thread 1 file src/p64_mcslock.c line 28 rw exchange_8(0x7c5dd2b0,0x7c5dd070,acq_rls)=0
Step 3: thread 0 file src/p64_mcslock.c line 25 -w write_8(0x7c5d8c70,0,regular)
Step 4: thread 0 file src/p64_mcslock.c line 26 -w write_1(0x7c5d8c78,0x1,regular)
Step 5: thread 0 file src/p64_mcslock.c line 28 rw exchange_8(0x7c5dd2b0,0x7c5d8c70,acq_rls)=0x7c5dd070
Atomic read_8 on step 5 matches atomic write_8 from other thread on step 2
Step 5 (src/p64_mcslock.c:28) synchronizes-with step 2 (src/p64_mcslock.c:28)
Step 6: thread 0 file src/p64_mcslock.c line 37 r- read_8(0x7c5dd070,regular)=0
Regular read_8 on step 6 matches regular write_8 from other thread on step 0
Read on step 6 matching write on step 0 saved by synchronizes-with on steps 5-2
Step 7: thread 0 file src/p64_mcslock.c line 38 -w store_8(0x7c5dd070,0x7c5d8c70,rls)
Step 8: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 8 matches regular write_1 from same thread on step 4
Step 9: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 10: thread 1 file src/ver_mcslock.c line 42 r- read_1(0x7c5dd2b8,regular)=0
Step 11: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 11 matches regular write_1 from same thread on step 4
Step 12: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 13: thread 1 file src/ver_mcslock.c line 43 -w write_1(0x7c5dd2b8,0x1,regular)
Step 14: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 14 matches regular write_1 from same thread on step 4
Step 15: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 16: thread 1 file src/ver_mcslock.c line 44 r- read_1(0x7c5dd2b8,regular)=0x1
Regular read_1 on step 16 matches regular write_1 from same thread on step 13
Step 17: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 17 matches regular write_1 from same thread on step 4
Step 18: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 19: thread 1 file src/ver_mcslock.c line 45 -w write_1(0x7c5dd2b8,0,regular)
Step 20: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 20 matches regular write_1 from same thread on step 4
Step 21: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 22: thread 1 file src/p64_mcslock.c line 51 r- load_8(0x7c5dd070,rlx)=0x7c5d8c70
Step 23: thread 0 file src/p64_mcslock.c line 42 r- load_1(0x7c5d8c78,acq)=0x1
Atomic read_1 on step 23 matches regular write_1 from same thread on step 4
Step 24: thread 0 file src/p64_mcslock.c line 42 -- yield()
Yielding to other thread
Step 25: thread 1 file src/p64_mcslock.c line 69 r- read_1(0x7c5d8c78,regular)=0x1
Regular read_1 on step 25 matches regular write_1 from other thread on step 4
ERROR: Read on step 25 matching write on step 4 missing synchronize-with!
Module mcslock permutation 0x7 step 26: Verification failed
Change load-relaxed to load-acquire in unlock and rerun the verifier. By default it
runs 4G (2^32) permutations, but this can be changed.
$ ./verify -am mcslock
Verifying mcslock
Verifying permutation 0...
...
Histogram over number of steps:
20: 281018368
21: 0
22: 0
23: 0
24: 143130624
25: 0
26: 906690560
27: 137625600
28: 1200701440
29: 252887040
30: 752590848
31: 181182464
32: 279252992
33: 74387456
34: 59373568
35: 17340416
36: 6410240
37: 2017280
38: 268800
39: 89600
Summary:
succeeded: 4294967296
interrupted: 0
failed: 0
total: 4294967296 (0x100000000)
Synchronization analysis:
load @ src/p64_mcslock.c:28 synchronizes-with store @ src/p64_mcslock.c:56 (count 281018368)
load @ src/p64_mcslock.c:42 synchronizes-with store @ src/p64_mcslock.c:70 (count 4013948928)
load @ src/p64_mcslock.c:51 synchronizes-with store @ src/p64_mcslock.c:38 (count 2808086528)
load @ src/p64_mcslock.c:65 synchronizes-with store @ src/p64_mcslock.c:38 (count 1205862400)
load @ src/p64_mcslock.c:28 synchronizes-with store @ src/p64_mcslock.c:28 (count 4013948928)
No data races detected
Elapsed 5742.058291250 seconds, average 1336ns/permutation, average 47ns/step
The verifier is still work in progress and has some limitations. It implements an SC
(sequential consistency)-like memory model but then analyses all regular accesses to
ensure the necessary synchronize-with (acquire/release) edges are present or a data
race is reported. One problem is that atomic accesses by definition never cause data
races so the verifier needs some help which is done using additional regular (non-
atomic) accesses. I think more work can be done here to detect data races without
requiring additional regular accesses. Possibly an RCpc-like model could also be simulated,
e.g. a load-acquire could return not the SC (current) value of a memory location but the
earliest value that existing synchronize-with edges allow. This would require more
intrusive changes to the verifier.
Thread fences are not supported. I don't mind this too much as I don't think thread fences
should be used unless really necessary, as their semantics often are misunderstood (as we
have seen). However, that means code like the seqlock can't be verified. I have a herd7
litmus test for seqlock (p64_rwsync) instead. Matching thread fences with (relaxed) atomic
loads and stores to identify synchronize-with edges is a possibility so perhaps they will be
supported at some point.
Only two threads are currently supported but extending this to more threads is trivial.
Verification speed will suffer though. I am planning to make the verifier multithreaded so
it can use all processor cores in a system.
I have many more ideas on how to improve the verifier.
- Ola
* RE: [PATCH 1/1] eal: correct memory ordering in MCS lock
From: Konstantin Ananyev @ 2025-11-06 7:48 UTC
To: Ola Liljedahl, Stephen Hemminger
Cc: Wathsala Vithanage, Honnappa Nagarahalli, Tyler Retzlaff, dev,
vattunuru, nd
> > > 100% agree it must be fixed.
> > > It would be good to have a test or static-analysis tool that could validate all the lock types.
> >
> > +1
> > Another option would be to have some sort of stress test for all locking types we have in our UT.
> > At least for the ring I found it very useful.
> A stress test only tests the specific hardware and compiler you are using. E.g. the
> LDAPR problem in rte_ring was hidden for years because compilers (GCC) did not
> support generating LDAPR instead of LDAR for load-acquire.
> [...]
> A stress test (like any test) is also not guaranteed to provoke all code paths and
> thread interleavings, so there are still no guarantees.
Regarding the code coverage - these days there are several tools that can
collect such data for you.
I understand that a passing stress test doesn't guarantee that your code is completely bug free,
and of course it will be able to catch bugs only on the platform you run it on.
But it is still way better than nothing.
> I claim that it is important to verify that the code follows the C/C++ memory model and
> implements all the necessary happens-before relations (synchronize-with edges).
>
> One such verifier exists in Progress64. It verifies the actual implementation by executing
> all possible interleavings (permutations) of two threads running a test program and ensures
> all required synchronize-with edges exist (as implemented by load-acquire/store-release).
> [...]
> I have many more ideas on how to improve the verifier.
Did I understand you right: you are working on some verifier tool that will be
publicly available and can be incorporated into the DPDK CI process?
If so - that would be great.
Though, I still think we can have both: stress tests and verifier :)
Konstantin
* Re: [PATCH 1/1] eal: correct memory ordering in MCS lock
From: Stephen Hemminger @ 2025-11-03 23:48 UTC
To: Wathsala Vithanage
Cc: Honnappa Nagarahalli, Tyler Retzlaff, dev, Ola Liljedahl
On Thu, 23 Oct 2025 18:47:24 +0000
Wathsala Vithanage <wathsala.vithanage@arm.com> wrote:

> Fix incorrect memory ordering in the MCS lock implementation by
> adding proper synchronizing edges to establish clear happens-before
> relationships between threads invoking lock() and unlock(). These
> synchronizing edges prevent potential deadlocks caused by improper
> ordering and are documented in detail through in-code comments.
> [...]
> Signed-off-by: Wathsala Vithanage <wathsala.vithanage@arm.com>
> Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Thanks for the good explanatory comments.
Could you please add a Fixes: tag and Cc: stable@dpdk.org
so it can go to the right stable releases as well.
I noticed that Progress64 has the same effective code.
* Re: [PATCH 1/1] eal: correct memory ordering in MCS lock
From: Wathsala Vithanage @ 2025-11-04 23:30 UTC
To: Stephen Hemminger
Cc: Honnappa Nagarahalli, Tyler Retzlaff, dev, Ola Liljedahl
On 11/3/25 17:48, Stephen Hemminger wrote:
> On Thu, 23 Oct 2025 18:47:24 +0000
> Wathsala Vithanage <wathsala.vithanage@arm.com> wrote:
>
>> Fix incorrect memory ordering in the MCS lock implementation by
>> adding proper synchronizing edges to establish clear happens-before
>> relationships between threads invoking lock() and unlock().
>> [...]
> Thanks for the good explanatory comments.
>
> Could you please add a Fixes: tag and Cc: stable@dpdk.org
> so it can go to the right stable releases as well.
>
> I noticed that Progress64 has the same effective code.
>
Yes, the P64 MCS implementation is aligned with the DPDK version
after these changes.
Ola’s verification tool confirms that the P64 implementation is correct,
so this should be correct as well.