[AMD Official Use Only - AMD Internal Distribution Only]

Hi Dariusz,


From: Dariusz Sosnowski <dsosnowski@nvidia.com>
Sent: Friday, November 07, 2025 11:29 PM
To: Tummala, Sivaprasad <Sivaprasad.Tummala@amd.com>; Alexander Kozyrev <akozyrev@nvidia.com>; Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Cc: jerinj@marvell.com <jerinj@marvell.com>; kirankumark@marvell.com <kirankumark@marvell.com>; ndabilpuram@marvell.com <ndabilpuram@marvell.com>; yanzhirun_163@163.com <yanzhirun_163@163.com>; david.marchand@redhat.com <david.marchand@redhat.com>; ktraynor@redhat.com <ktraynor@redhat.com>; thomas@monjalon.net <thomas@monjalon.net>; konstantin.ananyev@huawei.com <konstantin.ananyev@huawei.com>; konstantin.v.ananyev@yandex.ru <konstantin.v.ananyev@yandex.ru>; bruce.richardson@intel.com <bruce.richardson@intel.com>; maxime.coquelin@redhat.com <maxime.coquelin@redhat.com>; anatoly.burakov@intel.com <anatoly.burakov@intel.com>; aconole@redhat.com <aconole@redhat.com>; dev@dpdk.org <dev@dpdk.org>; stable@dpdk.org <stable@dpdk.org>
Subject: Re: [PATCH] net/mlx5: fix spurious CPU wakeups caused by invalid CQE

Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.


> Hi,
>
> Thank you for the contribution. Please see comments below.
>
> On Wed, Oct 15, 2025 at 01:39:57PM +0000, Sivaprasad Tummala wrote:
>> Previously, the PMD used a common monitor callback to determine
>> CQE ownership for power-aware polling. However, when a CQE contained
>> an invalid opcode(MLX5_CQE_INVALID), ownership bit was not reliable.
>> As a result, the monitor condition could falsely indicate CQE
>> availability and cause the CPU to wake up unnecessarily during
>> low traffic periods.
>>
>> This resulted in spurious wakeups in monitor-wait mode and reduced
>> the expected power savings, as cores exited the sleep state even
>> when no valid CQEs were available.
>>
>> This patch introduces a dedicated callback that skips invalid CQEs
>> and optimizes power efficiency by preventing false wakeups caused
>> by hardware-owned or invalid entries.
>>
>> Fixes: a8f0df6bf98d ("net/mlx5: support power monitoring")
>> Cc: akozyrev@nvidia.com
>> Cc: stable@dpdk.org
>>
>> Signed-off-by: Sivaprasad Tummala <sivaprasad.tummala@amd.com>
>> ---
>>  drivers/net/mlx5/mlx5_rx.c | 17 ++++++++++++++++-
>>  1 file changed, 16 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
>> index 420a03068d..2765b4b730 100644
>> --- a/drivers/net/mlx5/mlx5_rx.c
>> +++ b/drivers/net/mlx5/mlx5_rx.c
>> @@ -295,6 +295,20 @@ mlx5_monitor_callback(const uint64_t value,
>>       return (value & m) == v ? -1 : 0;
>>  }
>>
>> +static int
>> +mlx5_monitor_cqe_own_callback(const uint64_t value,
>> +             const uint64_t opaque[RTE_POWER_MONITOR_OPAQUE_SZ])
>> +{
>> +     const uint64_t m = opaque[CLB_MSK_IDX];
>> +     const uint64_t v = opaque[CLB_VAL_IDX];
>> +     const uint64_t match = ((value & m) == v);
>
> Could you please rename "match" variable to "sw_owned"?
> This name would better relay the meaning of the checked condition that
> CQE owner bit value signifies that CQE is SW owned.
ACK! Will update this in v2.
>
>> +     const uint64_t opcode = MLX5_CQE_OPCODE(value);
>> +     const uint64_t valid_op = (opcode ^ MLX5_CQE_INVALID);
>
>IMO the usage of bit operations here (although logic is correct) is a bit confusing.
>Could you rewrite it in terms of logical operations so it's easier to
>follow? For example like this:
>
>        const uint64_t valid_op = opcode != MLX5_CQE_INVALID
>
>        return (sw_owned && valid_op) ? -1 : 0;
>
>This also would properly describe in code the required condition:
>CQE can be parsed by SW if and only if owner bit is "SW owned" and CQE
>opcode is valid.
ACK! Will update this in v2.
>
>> +
>> +     /* ownership bit is not valid for invalid opcode; CQE is HW owned */
>> +     return -(match & valid_op);
>> +}
>> +
>> int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
>>  {
>>       struct mlx5_rxq_data *rxq = rx_queue;
>> @@ -312,12 +326,13 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
>>               pmc->addr = &cqe->validity_iteration_count;
>>               pmc->opaque[CLB_VAL_IDX] = vic;
>>               pmc->opaque[CLB_MSK_IDX] = MLX5_CQE_VIC_INIT;
>> +             pmc->fn = mlx5_monitor_callback;
>
>Alex, Slava: Just to double check - in case of enhanced CQE compression
>layout, should both CQE opcode and vic be checked?
>Right now only vic is checked in power monitor callback for that case.
>In Rx datapath both are checked to determine CQE ownership:
>https://github.com/DPDK/dpdk/blob/main/drivers/common/mlx5/mlx5_common.h#L277
>
>>       } else {
>>               pmc->addr = &cqe->op_own;
>>               pmc->opaque[CLB_VAL_IDX] = !!idx;
>>               pmc->opaque[CLB_MSK_IDX] = MLX5_CQE_OWNER_MASK;
>> +             pmc->fn = mlx5_monitor_cqe_own_callback;
>>       }
>> -     pmc->fn = mlx5_monitor_callback;
>>       pmc->size = sizeof(uint8_t);
>>       return 0;
>>  }
>> --
>> 2.43.0
>>
>
>Best regards,
>Dariusz Sosnowski

Thanks & Regards,
Sivaprasad