* [PATCH 1/2] power: introduce PM QoS interface
2024-03-20 10:55 [PATCH 0/2] introduce PM QoS interface Huisong Li
@ 2024-03-20 10:55 ` Huisong Li
2024-03-20 10:55 ` [PATCH 2/2] examples/l3fwd-power: add PM QoS request configuration Huisong Li
` (14 subsequent siblings)
15 siblings, 0 replies; 114+ messages in thread
From: Huisong Li @ 2024-03-20 10:55 UTC (permalink / raw)
To: dev
Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, liuyonglong, lihuisong
The system-wide CPU latency QoS limit affects idle state selection in the
cpuidle governor.
Linux exposes a cpu_dma_latency device under the '/dev' directory through
which userspace can read the system CPU latency QoS limit and send QoS requests.
Please see the PM QoS framework at the following link:
https://docs.kernel.org/power/pm_qos_interface.html?highlight=qos
This feature has been supported since kernel v2.6.25.
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay sensitive and expect a low
resume time, for example in interrupt packet receiving mode.
So this PM QoS API makes it easy to obtain the system CPU latency limit
and to send a CPU latency QoS request for applications that need it.
The recommended usage is as follows:
1) The application process first creates a QoS request.
2) Update the CPU latency request to zero when needed.
3) Restore the default value when no longer needed (this step is optional).
4) Release the QoS request when the process exits.
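The four steps above can be sketched directly against the kernel device, without the proposed DPDK wrappers. This is a minimal illustration, not the patch's implementation: the function names `qos_request_create`, `qos_request_update` and `qos_request_release` are hypothetical, and the device path is passed as a parameter (an assumption made here so the sketch can also be exercised against a regular file). It relies on the kernel behaviour described in the PM QoS documentation: a 32-bit value in microseconds is written to the device, and the request stays active until the descriptor is closed.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Step 1: create the request by opening the device.
 * On a real system the path would be "/dev/cpu_dma_latency"
 * (usually requires root). Returns a descriptor, or -1 on failure.
 */
static int qos_request_create(const char *path)
{
	return open(path, O_RDWR);
}

/*
 * Steps 2 and 3: write the wanted latency as a 32-bit value in
 * microseconds. Zero is the strictest request; writing the default
 * value again relaxes it. Returns 0 on success, -1 on failure.
 */
static int qos_request_update(int fd, int32_t latency_us)
{
	ssize_t n;

	if (fd < 0 || latency_us < 0)
		return -1;

	n = write(fd, &latency_us, sizeof(latency_us));
	return n == (ssize_t)sizeof(latency_us) ? 0 : -1;
}

/* Step 4: closing the descriptor drops the request. */
static void qos_request_release(int fd)
{
	if (fd >= 0)
		close(fd);
}
```

A caller would keep the descriptor open for as long as the strict latency is needed, since the request disappears as soon as the descriptor is closed or the process exits.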
Signed-off-by: Huisong Li <lihuisong@huawei.com>
---
doc/guides/prog_guide/power_man.rst | 16 ++++
doc/guides/rel_notes/release_24_03.rst | 4 +
lib/power/meson.build | 2 +
lib/power/rte_power_qos.c | 98 ++++++++++++++++++++++++
lib/power/rte_power_qos.h | 101 +++++++++++++++++++++++++
lib/power/version.map | 4 +
6 files changed, 225 insertions(+)
create mode 100644 lib/power/rte_power_qos.c
create mode 100644 lib/power/rte_power_qos.h
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index f6674efe2d..493c75bf9d 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -249,6 +249,22 @@ Get Num Pkgs
Get Num Dies
Get the number of die's on a given package.
+PM QoS API
+----------
+The deeper the idle state, the lower the power consumption, but the longer
+the resume time. Some service threads are delay sensitive and expect
+a low resume time, for example in interrupt packet receiving mode.
+
+The PM QoS API lets an application obtain the system CPU latency limit and
+send a CPU latency QoS request when it needs one.
+
+* ``rte_power_qos_get_curr_cpu_latency()`` gets the current CPU
+ latency limit on the system.
+* To send a CPU latency QoS request, first call ``rte_power_create_qos_request()``
+ to create a QoS request, then update the CPU latency value by calling
+ ``rte_power_qos_update_request()``. Call ``rte_power_release_qos_request()``
+ to release the QoS request when the process exits.
+
References
----------
diff --git a/doc/guides/rel_notes/release_24_03.rst b/doc/guides/rel_notes/release_24_03.rst
index 14826ea08f..b5be724133 100644
--- a/doc/guides/rel_notes/release_24_03.rst
+++ b/doc/guides/rel_notes/release_24_03.rst
@@ -196,6 +196,10 @@ New Features
Added DMA producer mode to measure performance of ``OP_FORWARD`` mode
of event DMA adapter.
+* **Added CPU latency PM QoS support.**
+
+ Added an interface to query the system CPU latency PM QoS limit and
+ an interface to send CPU latency QoS requests in the power library.
Removed Items
-------------
diff --git a/lib/power/meson.build b/lib/power/meson.build
index b8426589b2..8222e178b0 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -23,12 +23,14 @@ sources = files(
'rte_power.c',
'rte_power_uncore.c',
'rte_power_pmd_mgmt.c',
+ 'rte_power_qos.c',
)
headers = files(
'rte_power.h',
'rte_power_guest_channel.h',
'rte_power_pmd_mgmt.h',
'rte_power_uncore.h',
+ 'rte_power_qos.h',
)
if cc.has_argument('-Wno-cast-qual')
cflags += '-Wno-cast-qual'
diff --git a/lib/power/rte_power_qos.c b/lib/power/rte_power_qos.c
new file mode 100644
index 0000000000..d2b55923a0
--- /dev/null
+++ b/lib/power/rte_power_qos.c
@@ -0,0 +1,98 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <unistd.h>
+
+#include <rte_log.h>
+
+#include "power_common.h"
+#include "rte_power_qos.h"
+
+#define QOS_CPU_DMA_LATENCY_DEV "/dev/cpu_dma_latency"
+
+struct rte_power_qos_info {
+ /*
+ * Keep the file descriptor open so the QoS request can be updated
+ * until it is no longer needed.
+ */
+ int fd;
+ int cur_cpu_latency; /* unit microseconds */
+ };
+
+static struct rte_power_qos_info g_qos = {
+ .fd = -1,
+ .cur_cpu_latency = -1,
+};
+
+int
+rte_power_qos_get_curr_cpu_latency(int *latency)
+{
+ int fd, ret;
+
+ fd = open(QOS_CPU_DMA_LATENCY_DEV, O_RDONLY);
+ if (fd < 0) {
+ POWER_LOG(ERR, "Failed to open %s", QOS_CPU_DMA_LATENCY_DEV);
+ return -1;
+ }
+
+ ret = read(fd, latency, sizeof(*latency));
+ close(fd); /* close before checking so the descriptor is not leaked */
+ if (ret != (int)sizeof(*latency)) {
+ POWER_LOG(ERR, "Failed to read %s", QOS_CPU_DMA_LATENCY_DEV);
+ return -1;
+ }
+
+ return 0;
+}
+
+int
+rte_power_qos_update_request(int latency)
+{
+ int ret;
+
+ if (g_qos.fd == -1) {
+ POWER_LOG(ERR, "Please create a QoS request first.");
+ return -EINVAL;
+ }
+
+ if (latency < 0) {
+ POWER_LOG(ERR, "Latency should be a non-negative number.");
+ return -EINVAL;
+ }
+
+ if (g_qos.cur_cpu_latency != -1 && latency == g_qos.cur_cpu_latency)
+ return 0;
+
+ ret = write(g_qos.fd, &latency, sizeof(latency));
+ if (ret != (int)sizeof(latency)) {
+ POWER_LOG(ERR, "Failed to write %s", QOS_CPU_DMA_LATENCY_DEV);
+ return -1;
+ }
+ g_qos.cur_cpu_latency = latency;
+
+ return 0;
+}
+
+int
+rte_power_create_qos_request(void)
+{
+ g_qos.fd = open(QOS_CPU_DMA_LATENCY_DEV, O_WRONLY);
+ if (g_qos.fd < 0) {
+ POWER_LOG(ERR, "Failed to open %s.", QOS_CPU_DMA_LATENCY_DEV);
+ return -1;
+ }
+
+ return 0;
+}
+
+void
+rte_power_release_qos_request(void)
+{
+ if (g_qos.fd != -1) {
+ close(g_qos.fd);
+ g_qos.fd = -1;
+ }
+}
diff --git a/lib/power/rte_power_qos.h b/lib/power/rte_power_qos.h
new file mode 100644
index 0000000000..d39f5d0c0f
--- /dev/null
+++ b/lib/power/rte_power_qos.h
@@ -0,0 +1,101 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#ifndef RTE_POWER_QOS_H
+#define RTE_POWER_QOS_H
+
+#include <rte_compat.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * @file rte_power_qos.h
+ *
+ * PM QoS API.
+ *
+ * The system-wide CPU latency QoS limit affects idle state selection in
+ * the cpuidle governor.
+ *
+ * Linux exposes a cpu_dma_latency device under the '/dev' directory through
+ * which userspace can read the system CPU latency QoS limit and send requests.
+ * Please see the PM QoS framework at the following link:
+ * https://docs.kernel.org/power/pm_qos_interface.html?highlight=qos
+ *
+ * The deeper the idle state, the lower the power consumption, but the longer
+ * the resume time. Some services are delay sensitive and expect a low
+ * resume time, for example in interrupt packet receiving mode.
+ *
+ * This PM QoS API makes it easy to obtain the system CPU latency limit and
+ * to send a CPU latency QoS request for applications that need it.
+ *
+ * The recommended usage is as follows:
+ * 1) The application process first creates a QoS request.
+ * 2) Update the CPU latency request to zero when needed.
+ * 3) Restore the default value @see PM_QOS_CPU_LATENCY_DEFAULT_VALUE when
+ * no longer needed (this step is optional).
+ * 4) Release the QoS request when the process exits.
+ */
+
+#define QOS_USEC_PER_SEC 1000000
+#define PM_QOS_CPU_LATENCY_DEFAULT_VALUE (2000 * QOS_USEC_PER_SEC)
+#define PM_QOS_STRICT_LATENCY_VALUE 0
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Create a CPU latency QoS request. Release it with
+ * @see rte_power_release_qos_request.
+ *
+ * @return
+ * 0 on success. Otherwise negative value is returned.
+ */
+__rte_experimental
+int rte_power_create_qos_request(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Release the CPU latency QoS request.
+ */
+__rte_experimental
+void rte_power_release_qos_request(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Get the current CPU latency QoS limit on the system, in microseconds.
+ * The default value in the kernel is @see PM_QOS_CPU_LATENCY_DEFAULT_VALUE.
+ *
+ * @return
+ * 0 on success. Otherwise negative value is returned.
+ */
+__rte_experimental
+int rte_power_qos_get_curr_cpu_latency(int *latency);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Update the CPU latency QoS request.
+ * Note: a QoS request must be created before calling this API.
+ *
+ * @param latency
+ * The latency in microseconds; must be greater than or equal to zero.
+ *
+ * @return
+ * 0 on success. Otherwise negative value is returned.
+ */
+__rte_experimental
+int rte_power_qos_update_request(int latency);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* RTE_POWER_QOS_H */
diff --git a/lib/power/version.map b/lib/power/version.map
index ad92a65f91..42770762b1 100644
--- a/lib/power/version.map
+++ b/lib/power/version.map
@@ -51,4 +51,8 @@ EXPERIMENTAL {
rte_power_set_uncore_env;
rte_power_uncore_freqs;
rte_power_unset_uncore_env;
+ rte_power_create_qos_request;
+ rte_power_release_qos_request;
+ rte_power_qos_get_curr_cpu_latency;
+ rte_power_qos_update_request;
};
--
2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH 2/2] examples/l3fwd-power: add PM QoS request configuration
2024-03-20 10:55 [PATCH 0/2] introduce PM QoS interface Huisong Li
2024-03-20 10:55 ` [PATCH 1/2] power: " Huisong Li
@ 2024-03-20 10:55 ` Huisong Li
2024-03-20 14:05 ` [PATCH 0/2] introduce PM QoS interface Morten Brørup
` (13 subsequent siblings)
15 siblings, 0 replies; 114+ messages in thread
From: Huisong Li @ 2024-03-20 10:55 UTC (permalink / raw)
To: dev
Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, liuyonglong, lihuisong
Add PM QoS request configuration to decrease the process resume latency.
Signed-off-by: Huisong Li <lihuisong@huawei.com>
---
examples/l3fwd-power/main.c | 41 ++++++++++++++++++++++++++++++++++++-
1 file changed, 40 insertions(+), 1 deletion(-)
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index f4adcf41b5..78f292ed02 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
#include <rte_telemetry.h>
#include <rte_power_pmd_mgmt.h>
#include <rte_power_uncore.h>
+#include <rte_power_qos.h>
#include "perf_core.h"
#include "main.h"
@@ -2232,12 +2233,48 @@ static int check_ptype(uint16_t portid)
}
+static int
+pm_qos_init(void)
+{
+ int cur_cpu_latency;
+ int ret;
+
+ ret = rte_power_qos_get_curr_cpu_latency(&cur_cpu_latency);
+ if (ret < 0) {
+ RTE_LOG(ERR, L3FWD_POWER, "failed to get current cpu latency.\n");
+ return ret;
+ }
+ RTE_LOG(INFO, L3FWD_POWER, "current cpu latency is %dus on system.\n",
+ cur_cpu_latency);
+
+ ret = rte_power_create_qos_request();
+ if (ret < 0) {
+ RTE_LOG(ERR, L3FWD_POWER, "Failed to create power QoS request.\n");
+ return ret;
+ }
+
+ /*
+ * Set strict latency requirement to prevent service thread going into
+ * a deeper sleep state whose resume time is longer.
+ */
+ ret = rte_power_qos_update_request(PM_QOS_STRICT_LATENCY_VALUE);
+ if (ret < 0)
+ RTE_LOG(ERR, L3FWD_POWER, "Failed to change cpu latency to 0.\n");
+ return ret;
+}
+
static int
init_power_library(void)
{
enum power_management_env env;
unsigned int lcore_id;
- int ret = 0;
+ int ret;
+
+ ret = pm_qos_init();
+ if (ret != 0) {
+ RTE_LOG(ERR, L3FWD_POWER, "init power QoS failed.\n");
+ return ret;
+ }
RTE_LCORE_FOREACH(lcore_id) {
/* init power management library */
@@ -2268,6 +2305,8 @@ deinit_power_library(void)
unsigned int lcore_id, max_pkg, max_die, die, pkg;
int ret = 0;
+ rte_power_release_qos_request();
+
RTE_LCORE_FOREACH(lcore_id) {
/* deinit power management library */
ret = rte_power_exit(lcore_id);
--
2.22.0
* RE: [PATCH 0/2] introduce PM QoS interface
2024-03-20 10:55 [PATCH 0/2] introduce PM QoS interface Huisong Li
2024-03-20 10:55 ` [PATCH 1/2] power: " Huisong Li
2024-03-20 10:55 ` [PATCH 2/2] examples/l3fwd-power: add PM QoS request configuration Huisong Li
@ 2024-03-20 14:05 ` Morten Brørup
2024-03-21 3:04 ` lihuisong (C)
2024-06-13 11:20 ` [PATCH v2 0/2] power: " Huisong Li
` (12 subsequent siblings)
15 siblings, 1 reply; 114+ messages in thread
From: Morten Brørup @ 2024-03-20 14:05 UTC (permalink / raw)
To: Huisong Li, dev
Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, liuyonglong
> From: Huisong Li [mailto:lihuisong@huawei.com]
> Sent: Wednesday, 20 March 2024 11.55
>
> The system-wide CPU latency QoS limit has a positive impact on the idle
> state selection in cpuidle governor.
>
> Linux creates a cpu_dma_latency device under '/dev' directory to obtain the
> CPU latency QoS limit on system and send the QoS request for userspace.
> Please see the PM QoS framework in the following link:
> https://docs.kernel.org/power/pm_qos_interface.html?highlight=qos
> This feature is supported by kernel-v2.6.25.
>
> The deeper the idle state, the lower the power consumption, but the longer
> the resume time. Some service are delay sensitive and very except the low
> resume time, like interrupt packet receiving mode.
>
> So this series introduce PM QoS interface.
This looks like a 1:1 wrapper for a Linux kernel feature.
Does Windows or BSD offer something similar?
Furthermore, any high-res timing should use nanoseconds, not microseconds or milliseconds.
I realize that the Linux kernel only uses microseconds for these APIs, but the DPDK API should use nanoseconds.
* Re: [PATCH 0/2] introduce PM QoS interface
2024-03-20 14:05 ` [PATCH 0/2] introduce PM QoS interface Morten Brørup
@ 2024-03-21 3:04 ` lihuisong (C)
2024-03-21 13:30 ` Morten Brørup
0 siblings, 1 reply; 114+ messages in thread
From: lihuisong (C) @ 2024-03-21 3:04 UTC (permalink / raw)
To: Morten Brørup, dev
Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, liuyonglong
Hi Morten,
Thanks for your review.
On 2024/3/20 22:05, Morten Brørup wrote:
>> From: Huisong Li [mailto:lihuisong@huawei.com]
>> Sent: Wednesday, 20 March 2024 11.55
>>
>> The system-wide CPU latency QoS limit has a positive impact on the idle
>> state selection in cpuidle governor.
>>
>> Linux creates a cpu_dma_latency device under '/dev' directory to obtain the
>> CPU latency QoS limit on system and send the QoS request for userspace.
>> Please see the PM QoS framework in the following link:
>> https://docs.kernel.org/power/pm_qos_interface.html?highlight=qos
>> This feature is supported by kernel-v2.6.25.
>>
>> The deeper the idle state, the lower the power consumption, but the longer
>> the resume time. Some service are delay sensitive and very except the low
>> resume time, like interrupt packet receiving mode.
>>
>> So this series introduce PM QoS interface.
> This looks like a 1:1 wrapper for a Linux kernel feature.
right
> Does Windows or BSD offer something similar?
How can we find out whether Windows or BSD supports a similar feature?
According to the meson.build under lib/power, the DPDK power lib only works
on Linux.
If they support this feature, they can enable it.
>
> Furthermore, any high-res timing should use nanoseconds, not microseconds or milliseconds.
> I realize that the Linux kernel only uses microseconds for these APIs, but the DPDK API should use nanoseconds.
Nanoseconds are more precise, which is good.
But how can the DPDK API use nanoseconds when, as you said, the Linux kernel
only uses microseconds for these APIs?
The kernel interface only accepts an integer value in microseconds.
/BR
/Huisong
* RE: [PATCH 0/2] introduce PM QoS interface
2024-03-21 3:04 ` lihuisong (C)
@ 2024-03-21 13:30 ` Morten Brørup
2024-03-22 8:54 ` lihuisong (C)
0 siblings, 1 reply; 114+ messages in thread
From: Morten Brørup @ 2024-03-21 13:30 UTC (permalink / raw)
To: lihuisong (C), dev
Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, liuyonglong
> From: lihuisong (C) [mailto:lihuisong@huawei.com]
> Sent: Thursday, 21 March 2024 04.04
>
> Hi Morten,
>
> Thanks for your review.
>
> On 2024/3/20 22:05, Morten Brørup wrote:
> >> From: Huisong Li [mailto:lihuisong@huawei.com]
> >> Sent: Wednesday, 20 March 2024 11.55
> >>
> >> The system-wide CPU latency QoS limit has a positive impact on the idle
> >> state selection in cpuidle governor.
> >>
> >> Linux creates a cpu_dma_latency device under '/dev' directory to obtain the
> >> CPU latency QoS limit on system and send the QoS request for userspace.
> >> Please see the PM QoS framework in the following link:
> >> https://docs.kernel.org/power/pm_qos_interface.html?highlight=qos
> >> This feature is supported by kernel-v2.6.25.
> >>
> >> The deeper the idle state, the lower the power consumption, but the longer
> >> the resume time. Some service are delay sensitive and very except the low
> >> resume time, like interrupt packet receiving mode.
> >>
> >> So this series introduce PM QoS interface.
> > This looks like a 1:1 wrapper for a Linux kernel feature.
> right
> > Does Windows or BSD offer something similar?
> How do we know Windows or BSD support this similar feature?
Ask Windows experts or research using Google.
> The DPDK power lib just work on Linux according to the meson.build under
> lib/power.
> If they support this features, they can open it.
The DPDK power lib currently only works on Linux, yes.
But its API should still be designed to be platform agnostic, so the functions can be implemented on other platforms in the future.
DPDK is on track to work across multiple platforms, including Windows.
We must always consider other platforms, and not design DPDK APIs as if they are for Linux/BSD only.
> >
> > Furthermore, any high-res timing should use nanoseconds, not microseconds or
> milliseconds.
> > I realize that the Linux kernel only uses microseconds for these APIs, but
> the DPDK API should use nanoseconds.
> Nanoseconds is more precise, it's good.
> But DPDK API how use nanoseconds as you said the the Linux kernel only
> uses microseconds for these APIs.
> Kernel interface just know an integer value with microseconds unit.
One solution is to expose nanoseconds in the DPDK API, and in the Linux specific implementation convert from/to microseconds.
You might also want to add a note to the in-line documentation of the relevant functions that the Linux implementation only uses microsecond resolution.
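The suggestion above, a nanosecond-based API converted to microseconds in the Linux-specific code, could look roughly like the following. This is only a sketch under assumptions: the helper names `latency_ns_to_us` and `latency_us_to_ns` are hypothetical and not part of the proposed patch. The nanosecond value is rounded down so that the limit handed to the kernel is never looser than the one the caller asked for.

```c
#include <assert.h>
#include <stdint.h>

#define NS_PER_US 1000

/*
 * Convert a caller-supplied latency limit in nanoseconds to the
 * microsecond value the Linux interface expects.
 * Rounds down (1500 ns becomes 1 us) so the kernel limit is at least
 * as strict as the request. Returns -1 for negative input.
 */
static int32_t latency_ns_to_us(int64_t latency_ns)
{
	int64_t us;

	if (latency_ns < 0)
		return -1;

	us = latency_ns / NS_PER_US;
	return us > INT32_MAX ? INT32_MAX : (int32_t)us;
}

/*
 * Convert a microsecond value read from the kernel back to
 * nanoseconds for the caller. Returns -1 for negative input.
 */
static int64_t latency_us_to_ns(int32_t latency_us)
{
	if (latency_us < 0)
		return -1;

	return (int64_t)latency_us * NS_PER_US;
}
```

With this shape, sub-microsecond requests collapse to zero on Linux, which is one reason the in-line documentation would need the note about microsecond resolution.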
* Re: [PATCH 0/2] introduce PM QoS interface
2024-03-21 13:30 ` Morten Brørup
@ 2024-03-22 8:54 ` lihuisong (C)
2024-03-22 12:35 ` Morten Brørup
2024-03-22 17:55 ` Tyler Retzlaff
0 siblings, 2 replies; 114+ messages in thread
From: lihuisong (C) @ 2024-03-22 8:54 UTC (permalink / raw)
To: Morten Brørup, dev, Tyler Retzlaff, weh,
longli, alan.elder
Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, liuyonglong
+Tyler, +Alan, +Wei, +Long to ask about a similar feature on Windows.
On 2024/3/21 21:30, Morten Brørup wrote:
>> From: lihuisong (C) [mailto:lihuisong@huawei.com]
>> Sent: Thursday, 21 March 2024 04.04
>>
>> Hi Morten,
>>
>> Thanks for your review.
>>
>> On 2024/3/20 22:05, Morten Brørup wrote:
>>>> From: Huisong Li [mailto:lihuisong@huawei.com]
>>>> Sent: Wednesday, 20 March 2024 11.55
>>>>
>>>> The system-wide CPU latency QoS limit has a positive impact on the idle
>>>> state selection in cpuidle governor.
>>>>
>>>> Linux creates a cpu_dma_latency device under '/dev' directory to obtain the
>>>> CPU latency QoS limit on system and send the QoS request for userspace.
>>>> Please see the PM QoS framework in the following link:
>>>> https://docs.kernel.org/power/pm_qos_interface.html?highlight=qos
>>>> This feature is supported by kernel-v2.6.25.
>>>>
>>>> The deeper the idle state, the lower the power consumption, but the longer
>>>> the resume time. Some service are delay sensitive and very except the low
>>>> resume time, like interrupt packet receiving mode.
>>>>
>>>> So this series introduce PM QoS interface.
>>> This looks like a 1:1 wrapper for a Linux kernel feature.
>> right
>>> Does Windows or BSD offer something similar?
>> How do we know Windows or BSD support this similar feature?
> Ask Windows experts or research using Google.
I downloaded the FreeBSD source code and did not find a similar feature.
It does not even support a cpuidle feature (which this QoS feature affects).
I did not find anything useful about this for Windows on Google.
@Tyler, @Alan, @Wei and @Long
Do you know whether Windows lets userspace read and set a CPU latency limit
that affects how deep the CPU idle state can be?
>> The DPDK power lib just work on Linux according to the meson.build under
>> lib/power.
>> If they support this features, they can open it.
> The DPDK power lib currently only works on Linux, yes.
> But its API should still be designed to be platform agnostic, so the functions can be implemented on other platforms in the future.
>
> DPDK is on track to work across multiple platforms, including Windows.
> We must always consider other platforms, and not design DPDK APIs as if they are for Linux/BSD only.
I totally understand.
>
>>> Furthermore, any high-res timing should use nanoseconds, not microseconds or
>> milliseconds.
>>> I realize that the Linux kernel only uses microseconds for these APIs, but
>> the DPDK API should use nanoseconds.
>> Nanoseconds is more precise, it's good.
>> But DPDK API how use nanoseconds as you said the the Linux kernel only
>> uses microseconds for these APIs.
>> Kernel interface just know an integer value with microseconds unit.
> One solution is to expose nanoseconds in the DPDK API, and in the Linux specific implementation convert from/to microseconds.
If so, we would have to modify the interface implementation on Linux, which
changes the input/output unit of the interface.
DPDK would also have to do this based on the kernel version, which is not good.
The cpuidle governor selects an idle state based on the worst-case
latency of each idle state.
The worst-case latencies of C-states reported by the ACPI tables are in
microseconds, as described in sections 8.4.1.1. _CST (C States) and 8.4.3.3. _LPI
(Low Power Idle States) of the ACPI spec [1].
So it is probably not meaningful to change this interface implementation.
For the cases that need PM QoS in DPDK, I think it is better to set the CPU
latency to zero to prevent the service thread from entering deeper idle states.
> You might also want to add a note to the in-line documentation of the relevant functions that the Linux implementation only uses microsecond resolution.
>
[1]
https://uefi.org/specs/ACPI/6.5/08_Processor_Configuration_and_Control.html
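The selection rule mentioned above, where an idle state is allowed only if its worst-case latency fits within the QoS limit, can be illustrated with a deliberately simplified sketch. It is not the kernel's algorithm: real cpuidle governors also weigh other inputs such as the predicted idle duration. The struct, the function name `pick_idle_state` and the latency numbers in `example_states` are all hypothetical values chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One idle state with its worst-case wake-up latency in microseconds. */
struct idle_state {
	const char *name;
	int32_t exit_latency_us;
};

/*
 * Pick the deepest state whose worst-case latency does not exceed the
 * QoS limit. States must be ordered from shallowest to deepest; the
 * shallowest state (index 0) is always allowed.
 */
static size_t pick_idle_state(const struct idle_state *states, size_t n,
			      int32_t limit_us)
{
	size_t i, chosen = 0;

	for (i = 1; i < n; i++) {
		if (states[i].exit_latency_us > limit_us)
			break;
		chosen = i;
	}
	return chosen;
}

/* Hypothetical table; real values come from the platform firmware. */
static const struct idle_state example_states[] = {
	{ "state0", 0 },
	{ "state1", 2 },
	{ "state2", 50 },
	{ "state3", 500 },
};
```

With this table, a limit of zero keeps the CPU in the shallowest state, while a limit of 50 microseconds still allows `state2`, which shows why the chosen request value trades wake-up time against power.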
* RE: [PATCH 0/2] introduce PM QoS interface
2024-03-22 8:54 ` lihuisong (C)
@ 2024-03-22 12:35 ` Morten Brørup
2024-03-26 2:11 ` lihuisong (C)
2024-03-22 17:55 ` Tyler Retzlaff
1 sibling, 1 reply; 114+ messages in thread
From: Morten Brørup @ 2024-03-22 12:35 UTC (permalink / raw)
To: lihuisong (C), dev, Tyler Retzlaff, weh, longli, alan.elder
Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, liuyonglong
> From: lihuisong (C) [mailto:lihuisong@huawei.com]
> Sent: Friday, 22 March 2024 09.54
>
> +Tyler, +Alan, +Wei, +Long for asking this similar feature on Windows.
>
> On 2024/3/21 21:30, Morten Brørup wrote:
> >> From: lihuisong (C) [mailto:lihuisong@huawei.com]
> >> Sent: Thursday, 21 March 2024 04.04
> >>
> >> Hi Morten,
> >>
> >> Thanks for your review.
> >>
> >> On 2024/3/20 22:05, Morten Brørup wrote:
> >>>> From: Huisong Li [mailto:lihuisong@huawei.com]
> >>>> Sent: Wednesday, 20 March 2024 11.55
> >>>>
> >>>> The system-wide CPU latency QoS limit has a positive impact on the idle
> >>>> state selection in cpuidle governor.
> >>>>
> >>>> Linux creates a cpu_dma_latency device under '/dev' directory to obtain
> the
> >>>> CPU latency QoS limit on system and send the QoS request for userspace.
> >>>> Please see the PM QoS framework in the following link:
> >>>> https://docs.kernel.org/power/pm_qos_interface.html?highlight=qos
> >>>> This feature is supported by kernel-v2.6.25.
> >>>>
> >>>> The deeper the idle state, the lower the power consumption, but the
> longer
> >>>> the resume time. Some service are delay sensitive and very except the low
> >>>> resume time, like interrupt packet receiving mode.
> >>>>
> >>>> So this series introduce PM QoS interface.
> >>> This looks like a 1:1 wrapper for a Linux kernel feature.
> >> right
> >>> Does Windows or BSD offer something similar?
> >> How do we know Windows or BSD support this similar feature?
> > Ask Windows experts or research using Google.
> I download freebsd source code, I didn't find this similar feature.
> They don't even support cpuidle feature(this QoS feature affects cpuilde.).
> I don't find any useful about this on Windows from google.
>
>
> @Tyler, @Alan, @Wei and @Long
>
> Do you know windows support that userspace read and send CPU latency
> which has an impact on deep level of CPU idle?
>
> >> The DPDK power lib just work on Linux according to the meson.build under
> >> lib/power.
> >> If they support this features, they can open it.
> > The DPDK power lib currently only works on Linux, yes.
> > But its API should still be designed to be platform agnostic, so the
> functions can be implemented on other platforms in the future.
> >
> > DPDK is on track to work across multiple platforms, including Windows.
> > We must always consider other platforms, and not design DPDK APIs as if they
> are for Linux/BSD only.
> totally understand you.
> >
> >>> Furthermore, any high-res timing should use nanoseconds, not microseconds
> or
> >> milliseconds.
> >>> I realize that the Linux kernel only uses microseconds for these APIs, but
> >> the DPDK API should use nanoseconds.
> >> Nanoseconds is more precise, it's good.
> >> But DPDK API how use nanoseconds as you said the the Linux kernel only
> >> uses microseconds for these APIs.
> >> Kernel interface just know an integer value with microseconds unit.
> > One solution is to expose nanoseconds in the DPDK API, and in the Linux
> specific implementation convert from/to microseconds.
> If so, we have to modify the implementation interface on Linux. This
> change the input/output unit about the interface.
> And DPDK also has to do this based on kernel version. It is not good.
> The cpuidle governor select which idle state based on the worst-case
> latency of idle state.
> These the worst-case latency of Cstate reported by ACPI table is in
> microseconds as the section 8.4.1.1. _CST (C States) and 8.4.3.3. _LPI
> (Low Power Idle States) in ACPI spec [1].
> So it is probably not meaning to change this interface implementation.
OK... Since microsecond resolution is good enough for ACPI and Linux, you have me convinced that it's also good enough for DPDK (for this specific topic).
Thank you for the detailed reply!
>
> For the case need PM QoS in DPDK, I think, it is better to set cpu
> latency to zero to prevent service thread from the deeper the idle state.
It would defeat the purpose (i.e. not saving sufficient amounts of power) if the CPU cannot enter a deeper idle state.
Personally, I would think a wake-up latency of up to 10 microseconds should be fine for most purposes.
Default Linux timerslack is 50 microseconds, so you could also use that value.
> > You might also want to add a note to the in-line documentation of the
> relevant functions that the Linux implementation only uses microsecond
> resolution.
> >
> [1]
> https://uefi.org/specs/ACPI/6.5/08_Processor_Configuration_and_Control.html
* Re: [PATCH 0/2] introduce PM QoS interface
2024-03-22 12:35 ` Morten Brørup
@ 2024-03-26 2:11 ` lihuisong (C)
2024-03-26 8:27 ` Morten Brørup
0 siblings, 1 reply; 114+ messages in thread
From: lihuisong (C) @ 2024-03-26 2:11 UTC (permalink / raw)
To: Morten Brørup, dev
Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, liuyonglong
On 2024/3/22 20:35, Morten Brørup wrote:
>> From: lihuisong (C) [mailto:lihuisong@huawei.com]
>> Sent: Friday, 22 March 2024 09.54
>>
>> +Tyler, +Alan, +Wei, +Long for asking this similar feature on Windows.
>>
>> On 2024/3/21 21:30, Morten Brørup wrote:
>>>> From: lihuisong (C) [mailto:lihuisong@huawei.com]
>>>> Sent: Thursday, 21 March 2024 04.04
>>>>
>>>> Hi Morten,
>>>>
>>>> Thanks for your review.
>>>>
>>>> On 2024/3/20 22:05, Morten Brørup wrote:
>>>>>> From: Huisong Li [mailto:lihuisong@huawei.com]
>>>>>> Sent: Wednesday, 20 March 2024 11.55
>>>>>>
>>>>>> The system-wide CPU latency QoS limit has a positive impact on the idle
>>>>>> state selection in cpuidle governor.
>>>>>>
>>>>>> Linux creates a cpu_dma_latency device under '/dev' directory to obtain
>> the
>>>>>> CPU latency QoS limit on system and send the QoS request for userspace.
>>>>>> Please see the PM QoS framework in the following link:
>>>>>> https://docs.kernel.org/power/pm_qos_interface.html?highlight=qos
>>>>>> This feature is supported by kernel-v2.6.25.
>>>>>>
>>>>>> The deeper the idle state, the lower the power consumption, but the
>> longer
>>>>>> the resume time. Some service are delay sensitive and very except the low
>>>>>> resume time, like interrupt packet receiving mode.
>>>>>>
>>>>>> So this series introduce PM QoS interface.
>>>>> This looks like a 1:1 wrapper for a Linux kernel feature.
>>>> right
>>>>> Does Windows or BSD offer something similar?
>>>> How do we know Windows or BSD support this similar feature?
>>> Ask Windows experts or research using Google.
>> I download freebsd source code, I didn't find this similar feature.
>> They don't even support cpuidle feature(this QoS feature affects cpuilde.).
>> I don't find any useful about this on Windows from google.
>>
>>
>> @Tyler, @Alan, @Wei and @Long
>>
>> Do you know windows support that userspace read and send CPU latency
>> which has an impact on deep level of CPU idle?
>>
>>>> The DPDK power lib just work on Linux according to the meson.build under
>>>> lib/power.
>>>> If they support this features, they can open it.
>>> The DPDK power lib currently only works on Linux, yes.
>>> But its API should still be designed to be platform agnostic, so the
>> functions can be implemented on other platforms in the future.
>>> DPDK is on track to work across multiple platforms, including Windows.
>>> We must always consider other platforms, and not design DPDK APIs as if they
>> are for Linux/BSD only.
>> totally understand you.
>>>>> Furthermore, any high-res timing should use nanoseconds, not microseconds
>> or
>>>> milliseconds.
>>>>> I realize that the Linux kernel only uses microseconds for these APIs, but
>>>> the DPDK API should use nanoseconds.
>>>> Nanoseconds is more precise, it's good.
>>>> But DPDK API how use nanoseconds as you said the the Linux kernel only
>>>> uses microseconds for these APIs.
>>>> Kernel interface just know an integer value with microseconds unit.
>>> One solution is to expose nanoseconds in the DPDK API, and in the Linux
>> specific implementation convert from/to microseconds.
>> If so, we have to modify the implementation interface on Linux. This
>> change the input/output unit about the interface.
>> And DPDK also has to do this based on kernel version. It is not good.
>> The cpuidle governor select which idle state based on the worst-case
>> latency of idle state.
>> These the worst-case latency of Cstate reported by ACPI table is in
>> microseconds as the section 8.4.1.1. _CST (C States) and 8.4.3.3. _LPI
>> (Low Power Idle States) in ACPI spec [1].
>> So it is probably not meaning to change this interface implementation.
> OK... Since microsecond resolution is good enough for ACPI and Linux, you have me convinced that it's also good enough for DPDK (for this specific topic).
>
> Thank you for the detailed reply!
>
>> For the case need PM QoS in DPDK, I think, it is better to set cpu
>> latency to zero to prevent service thread from the deeper the idle state.
> It would defeat the purpose (i.e. not saving sufficient amounts of power) if the CPU cannot enter a deeper idle state.
Yes, it is not good for power.
AFAICS, PM QoS is only meant to reduce the impact on performance.
Anyway, if we set it to zero, the system can still enter C-state 0 at least.
>
> Personally, I would think a wake-up latency of up to 10 microseconds should be fine for must purposes.
> Default Linux timerslack is 50 microseconds, so you could also use that value.
How much CPU latency is OK? Maybe we can leave that decision to the
application.
Linux collects all these QoS requests and uses the minimum latency.
What do you think, Morten?
>
>>> You might also want to add a note to the in-line documentation of the
>> relevant functions that the Linux implementation only uses microsecond
>> resolution.
>> [1]
>> https://uefi.org/specs/ACPI/6.5/08_Processor_Configuration_and_Control.html
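A minimal sketch of the nanosecond/microsecond conversion discussed above, in case the API were to expose nanoseconds. The helper names are hypothetical and are not part of this patch. Rounding down is an assumption made here so that the converted limit is never looser than the requested one.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only; not part of the patch. */
#define LATENCY_NS_PER_US 1000u

/*
 * Convert a latency limit from nanoseconds to microseconds.
 * Rounds down, so the limit handed to the kernel is never looser
 * than what the caller asked for. Saturates at UINT32_MAX.
 */
static inline uint32_t
latency_ns_to_us(uint64_t ns)
{
	uint64_t us = ns / LATENCY_NS_PER_US;

	return us > UINT32_MAX ? UINT32_MAX : (uint32_t)us;
}

/* Convert a latency limit read back from the kernel to nanoseconds. */
static inline uint64_t
latency_us_to_ns(uint32_t us)
{
	return (uint64_t)us * LATENCY_NS_PER_US;
}
```

Note that any request below 1000 ns rounds down to 0 here, so a caller would need to decide how to treat that value.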
^ permalink raw reply [flat|nested] 114+ messages in thread
* RE: [PATCH 0/2] introduce PM QoS interface
2024-03-26 2:11 ` lihuisong (C)
@ 2024-03-26 8:27 ` Morten Brørup
2024-03-26 12:15 ` lihuisong (C)
0 siblings, 1 reply; 114+ messages in thread
From: Morten Brørup @ 2024-03-26 8:27 UTC (permalink / raw)
To: lihuisong (C), dev
Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, liuyonglong
> From: lihuisong (C) [mailto:lihuisong@huawei.com]
> Sent: Tuesday, 26 March 2024 03.12
>
> On 2024/3/22 20:35, Morten Brørup wrote:
> >> From: lihuisong (C) [mailto:lihuisong@huawei.com]
> >> Sent: Friday, 22 March 2024 09.54
[...]
> >> For the case need PM QoS in DPDK, I think, it is better to set cpu
> >> latency to zero to prevent service thread from the deeper the idle
> state.
> > It would defeat the purpose (i.e. not saving sufficient amounts of
> power) if the CPU cannot enter a deeper idle state.
> Yes, it is not good for power.
> AFAIS, PM QoS is just to decrease the influence for performance.
> Anyway, if we set to zero, system can be into Cstates-0 at least.
> >
> > Personally, I would think a wake-up latency of up to 10 microseconds
> should be fine for must purposes.
> > Default Linux timerslack is 50 microseconds, so you could also use
> that value.
> How much CPU latency is ok. Maybe, we can give the decision to the
> application.
Yes, the application should decide the acceptable worst-case latency.
> Linux will collect all these QoS request and use the minimum latency.
> what do you think, Morten?
For the example application, you could use a value of 50 microseconds and refer to this value also being the default timerslack in Linux.
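A minimal sketch of how an application could hold such a request through the cpu_dma_latency device, assuming the kernel accepts a binary 32-bit value in microseconds and keeps the request alive while the file descriptor stays open. The helper names are hypothetical. The path is a parameter so that the code can be exercised against an ordinary file; on a real system it would be "/dev/cpu_dma_latency" and usually needs root.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical helper: open the PM QoS device at 'path' and write the
 * requested latency limit (microseconds) as a binary 32-bit value.
 * Returns the open fd, which must stay open for as long as the request
 * should remain active, or -1 on error.
 */
static int
qos_request_create(const char *path, int32_t latency_us)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;
	if (write(fd, &latency_us, sizeof(latency_us)) !=
			(ssize_t)sizeof(latency_us)) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Closing the fd drops the request. */
static void
qos_request_release(int fd)
{
	if (fd >= 0)
		close(fd);
}
```

With the 50 microsecond value, the call would look like qos_request_create("/dev/cpu_dma_latency", 50), with qos_request_release() on exit.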
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 0/2] introduce PM QoS interface
2024-03-26 8:27 ` Morten Brørup
@ 2024-03-26 12:15 ` lihuisong (C)
2024-03-26 12:46 ` Morten Brørup
0 siblings, 1 reply; 114+ messages in thread
From: lihuisong (C) @ 2024-03-26 12:15 UTC (permalink / raw)
To: Morten Brørup, dev
Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, liuyonglong
On 2024/3/26 16:27, Morten Brørup wrote:
>> From: lihuisong (C) [mailto:lihuisong@huawei.com]
>> Sent: Tuesday, 26 March 2024 03.12
>>
>> On 2024/3/22 20:35, Morten Brørup wrote:
>>>> From: lihuisong (C) [mailto:lihuisong@huawei.com]
>>>> Sent: Friday, 22 March 2024 09.54
> [...]
>
>>>> For the case need PM QoS in DPDK, I think, it is better to set cpu
>>>> latency to zero to prevent service thread from the deeper the idle
>> state.
>>> It would defeat the purpose (i.e. not saving sufficient amounts of
>> power) if the CPU cannot enter a deeper idle state.
>> Yes, it is not good for power.
>> AFAIS, PM QoS is just to decrease the influence for performance.
>> Anyway, if we set to zero, system can be into Cstates-0 at least.
>>> Personally, I would think a wake-up latency of up to 10 microseconds
>> should be fine for must purposes.
>>> Default Linux timerslack is 50 microseconds, so you could also use
>> that value.
>> How much CPU latency is ok. Maybe, we can give the decision to the
>> application.
> Yes, the application should decide the acceptable worst-case latency.
>
>> Linux will collect all these QoS request and use the minimum latency.
>> what do you think, Morten?
> For the example application, you could use a value of 50 microseconds and refer to this value also being the default timerslack in Linux.
There is a description of "/proc/<pid>/timerslack_ns" in the Linux documentation [1]:
"
This file provides the value of the task’s timerslack value in nanoseconds.
This value specifies an amount of time that normal timers may be
deferred in order to coalesce timers and avoid unnecessary wakeups.
This allows a task’s interactivity vs power consumption tradeoff to be
adjusted.
"
I do not understand the relationship between the timerslack in Linux and
the CPU wake-up latency.
It seems that timerslack only defers timers in order to coalesce them
and avoid unnecessary wakeups.
It has little to do with the CPU latency limit, which aims to keep the
CPU out of deeper idle states and so satisfy the application's request.
>
^ permalink raw reply [flat|nested] 114+ messages in thread
* RE: [PATCH 0/2] introduce PM QoS interface
2024-03-26 12:15 ` lihuisong (C)
@ 2024-03-26 12:46 ` Morten Brørup
2024-03-29 1:59 ` lihuisong (C)
0 siblings, 1 reply; 114+ messages in thread
From: Morten Brørup @ 2024-03-26 12:46 UTC (permalink / raw)
To: lihuisong (C), dev
Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, liuyonglong
> From: lihuisong (C) [mailto:lihuisong@huawei.com]
> Sent: Tuesday, 26 March 2024 13.15
>
> On 2024/3/26 16:27, Morten Brørup wrote:
> >> From: lihuisong (C) [mailto:lihuisong@huawei.com]
> >> Sent: Tuesday, 26 March 2024 03.12
> >>
> >> On 2024/3/22 20:35, Morten Brørup wrote:
> >>>> From: lihuisong (C) [mailto:lihuisong@huawei.com]
> >>>> Sent: Friday, 22 March 2024 09.54
> > [...]
> >
> >>>> For the case need PM QoS in DPDK, I think, it is better to set cpu
> >>>> latency to zero to prevent service thread from the deeper the idle
> >> state.
> >>> It would defeat the purpose (i.e. not saving sufficient amounts of
> >> power) if the CPU cannot enter a deeper idle state.
> >> Yes, it is not good for power.
> >> AFAIS, PM QoS is just to decrease the influence for performance.
> >> Anyway, if we set to zero, system can be into Cstates-0 at least.
> >>> Personally, I would think a wake-up latency of up to 10 microseconds
> >> should be fine for must purposes.
> >>> Default Linux timerslack is 50 microseconds, so you could also use
> >> that value.
> >> How much CPU latency is ok. Maybe, we can give the decision to the
> >> application.
> > Yes, the application should decide the acceptable worst-case latency.
> >
> >> Linux will collect all these QoS request and use the minimum latency.
> >> what do you think, Morten?
> > For the example application, you could use a value of 50 microseconds
> and refer to this value also being the default timerslack in Linux.
> There is a description for "/proc/<pid>/timerslack_ns" in Linux document
> [1]
> "
> This file provides the value of the task’s timerslack value in
> nanoseconds.
> This value specifies an amount of time that normal timers may be
> deferred in order to coalesce timers and avoid unnecessary wakeups.
> This allows a task’s interactivity vs power consumption tradeoff to be
> adjusted.
> "
> I cannot understand what the relationship is between the timerslack in
> Linux and cpu latency to wake up.
> It seems that timerslack is just to defer the timer in order to coalesce
> timers and avoid unnecessary wakeups.
> And it has not a lot to do with the CPU latency which is aimed to avoid
> task to enter deeper idle state and satify application request.
Correct. They control two different things.
However, both can cause latency for the application, so my rationale for the relationship was:
If the application accepts X us of latency caused by kernel scheduling delays (caused by timerslack), the application should accept the same amount of latency caused by CPU wake-up latency.
This also means that if you want lower latency than 50 us, you should not only set cpu wake-up latency, you should also set timerslack.
Obviously, if the application is only affected by one of the two, the application only needs to adjust that one of them.
As for the 50 us value, someone in the Linux kernel team decided that 50 us was an acceptable amount of latency for the kernel; we could use the same value, referring to that. Or we could choose some other value, and describe how we came up with our own value. And if necessary, also adjust timerslack accordingly.
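A minimal sketch of adjusting timerslack from within the process, assuming Linux and the prctl() options PR_SET_TIMERSLACK and PR_GET_TIMERSLACK, which take and return nanoseconds. The helper names are hypothetical.

```c
#include <sys/prctl.h>

/*
 * Hypothetical helper: set the calling thread's timer slack in
 * nanoseconds. Returns 0 on success, -1 on error.
 */
static int
timerslack_set_ns(unsigned long slack_ns)
{
	return prctl(PR_SET_TIMERSLACK, slack_ns, 0, 0, 0);
}

/*
 * Hypothetical helper: read back the calling thread's current timer
 * slack in nanoseconds, or -1 on error.
 */
static long
timerslack_get_ns(void)
{
	return prctl(PR_GET_TIMERSLACK, 0, 0, 0, 0);
}
```

For a latency target of 20 us, for example, the application would call timerslack_set_ns(20000) in addition to setting the CPU wake-up latency limit to 20 us.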
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 0/2] introduce PM QoS interface
2024-03-26 12:46 ` Morten Brørup
@ 2024-03-29 1:59 ` lihuisong (C)
0 siblings, 0 replies; 114+ messages in thread
From: lihuisong (C) @ 2024-03-29 1:59 UTC (permalink / raw)
To: Morten Brørup, dev
Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, liuyonglong
On 2024/3/26 20:46, Morten Brørup wrote:
>> From: lihuisong (C) [mailto:lihuisong@huawei.com]
>> Sent: Tuesday, 26 March 2024 13.15
>>
>> On 2024/3/26 16:27, Morten Brørup wrote:
>>>> From: lihuisong (C) [mailto:lihuisong@huawei.com]
>>>> Sent: Tuesday, 26 March 2024 03.12
>>>>
>>>> On 2024/3/22 20:35, Morten Brørup wrote:
>>>>>> From: lihuisong (C) [mailto:lihuisong@huawei.com]
>>>>>> Sent: Friday, 22 March 2024 09.54
>>> [...]
>>>
>>>>>> For the case need PM QoS in DPDK, I think, it is better to set cpu
>>>>>> latency to zero to prevent service thread from the deeper the idle
>>>> state.
>>>>> It would defeat the purpose (i.e. not saving sufficient amounts of
>>>> power) if the CPU cannot enter a deeper idle state.
>>>> Yes, it is not good for power.
>>>> AFAIS, PM QoS is just to decrease the influence for performance.
>>>> Anyway, if we set to zero, system can be into Cstates-0 at least.
>>>>> Personally, I would think a wake-up latency of up to 10 microseconds
>>>> should be fine for must purposes.
>>>>> Default Linux timerslack is 50 microseconds, so you could also use
>>>> that value.
>>>> How much CPU latency is ok. Maybe, we can give the decision to the
>>>> application.
>>> Yes, the application should decide the acceptable worst-case latency.
>>>
>>>> Linux will collect all these QoS request and use the minimum latency.
>>>> what do you think, Morten?
>>> For the example application, you could use a value of 50 microseconds
>> and refer to this value also being the default timerslack in Linux.
>> There is a description for "/proc/<pid>/timerslack_ns" in Linux document
>> [1]
>> "
>> This file provides the value of the task’s timerslack value in
>> nanoseconds.
>> This value specifies an amount of time that normal timers may be
>> deferred in order to coalesce timers and avoid unnecessary wakeups.
>> This allows a task’s interactivity vs power consumption tradeoff to be
>> adjusted.
>> "
>> I cannot understand what the relationship is between the timerslack in
>> Linux and cpu latency to wake up.
>> It seems that timerslack is just to defer the timer in order to coalesce
>> timers and avoid unnecessary wakeups.
>> And it has not a lot to do with the CPU latency which is aimed to avoid
>> task to enter deeper idle state and satify application request.
> Correct. They control two different things.
>
> However, both can cause latency for the application, so my rationale for the relationship was:
> If the application accepts X us of latency caused by kernel scheduling delays (caused by timerslack), the application should accept the same amount of latency caused by CPU wake-up latency.
Understood, thanks for the explanation.
>
> This also means that if you want lower latency than 50 us, you should not only set cpu wake-up latency, you should also set timerslack.
>
> Obviously, if the application is only affected by one of the two, the application only needs to adjust that one of them.
Yes, I think so.
>
> As for the 50 us value, someone in the Linux kernel team decided that 50 us was an acceptable amount of latency for the kernel; we could use the same value, referring to that. Or we could choose some other value, and describe how we came up with our own value. And if necessary, also adjust timerslack accordingly.
So how about using the default timerslack value of 50 us in l3fwd-power?
We can also add a note about this in the code or the documentation, for
example suggesting that users also lower this process's timerslack if
they want a lower latency.
>
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 0/2] introduce PM QoS interface
2024-03-22 8:54 ` lihuisong (C)
2024-03-22 12:35 ` Morten Brørup
@ 2024-03-22 17:55 ` Tyler Retzlaff
2024-03-26 2:20 ` lihuisong (C)
1 sibling, 1 reply; 114+ messages in thread
From: Tyler Retzlaff @ 2024-03-22 17:55 UTC (permalink / raw)
To: lihuisong (C)
Cc: Morten Brørup, dev, weh,
longli@microsoft.com >> Long Li, alan.elder, thomas,
ferruh.yigit, anatoly.burakov, david.hunt, sivaprasad.tummala,
liuyonglong
On Fri, Mar 22, 2024 at 04:54:01PM +0800, lihuisong (C) wrote:
> +Tyler, +Alan, +Wei, +Long for asking this similar feature on Windows.
>
> On 2024/3/21 21:30, Morten Brørup wrote:
> >>From: lihuisong (C) [mailto:lihuisong@huawei.com]
> >>Sent: Thursday, 21 March 2024 04.04
> >>
> >>Hi Moren,
> >>
> >>Thanks for your revew.
> >>
> >>On 2024/3/20 22:05, Morten Brørup wrote:
> >>>>From: Huisong Li [mailto:lihuisong@huawei.com]
> >>>>Sent: Wednesday, 20 March 2024 11.55
> >>>>
> >>>>The system-wide CPU latency QoS limit has a positive impact on the idle
> >>>>state selection in cpuidle governor.
> >>>>
> >>>>Linux creates a cpu_dma_latency device under '/dev' directory to obtain the
> >>>>CPU latency QoS limit on system and send the QoS request for userspace.
> >>>>Please see the PM QoS framework in the following link:
> >>>>https://docs.kernel.org/power/pm_qos_interface.html?highlight=qos
> >>>>This feature is supported by kernel-v2.6.25.
> >>>>
> >>>>The deeper the idle state, the lower the power consumption, but the longer
> >>>>the resume time. Some service are delay sensitive and very except the low
> >>>>resume time, like interrupt packet receiving mode.
> >>>>
> >>>>So this series introduce PM QoS interface.
> >>>This looks like a 1:1 wrapper for a Linux kernel feature.
> >>right
> >>>Does Windows or BSD offer something similar?
> >>How do we know Windows or BSD support this similar feature?
> >Ask Windows experts or research using Google.
> I download freebsd source code, I didn't find this similar feature.
> They don't even support cpuidle feature(this QoS feature affects cpuilde.).
> I don't find any useful about this on Windows from google.
>
>
> @Tyler, @Alan, @Wei and @Long
>
> Do you know windows support that userspace read and send CPU latency
> which has an impact on deep level of CPU idle?
it is unlikely you'll find an api that lets you manage things in terms
of raw latency values as the linux knobs here do. windows more often employs
policy centric schemes to permit the system to abstract implementation detail.
powercfg is probably the closest thing you can use to tune the same
things on windows: you select e.g. the 'performance' scheme, but it
won't allow you to pick specific latency numbers.
https://learn.microsoft.com/en-us/windows-hardware/design/device-experiences/powercfg-command-line-options
>
> >>The DPDK power lib just work on Linux according to the meson.build under
> >>lib/power.
> >>If they support this features, they can open it.
> >The DPDK power lib currently only works on Linux, yes.
> >But its API should still be designed to be platform agnostic, so the functions can be implemented on other platforms in the future.
> >
> >DPDK is on track to work across multiple platforms, including Windows.
> >We must always consider other platforms, and not design DPDK APIs as if they are for Linux/BSD only.
> totally understand you.
since lib/power isn't built for windows at this time i don't think it's
appropriate to constrain your innovation. i do appreciate the engagement
though and would just offer general guidance that if you can design your
api with some kind of abstraction in mind that would be great and by all
means if you can figure out how to wrangle powercfg /Qh into satisfying the
api in a policy centric way it might be kind of nice.
i'll let other windows experts chime in here if they choose.
thanks!
> >
> >>>Furthermore, any high-res timing should use nanoseconds, not microseconds or
> >>milliseconds.
> >>>I realize that the Linux kernel only uses microseconds for these APIs, but
> >>the DPDK API should use nanoseconds.
> >>Nanoseconds is more precise, it's good.
> >>But DPDK API how use nanoseconds as you said the the Linux kernel only
> >>uses microseconds for these APIs.
> >>Kernel interface just know an integer value with microseconds unit.
> >One solution is to expose nanoseconds in the DPDK API, and in the Linux specific implementation convert from/to microseconds.
> If so, we have to modify the implementation interface on Linux. This
> change the input/output unit about the interface.
> And DPDK also has to do this based on kernel version. It is not good.
> The cpuidle governor select which idle state based on the worst-case
> latency of idle state.
> These the worst-case latency of Cstate reported by ACPI table is in
> microseconds as the section 8.4.1.1. _CST (C States) and 8.4.3.3.
> _LPI (Low Power Idle States) in ACPI spec [1].
> So it is probably not meaning to change this interface implementation.
>
> For the case need PM QoS in DPDK, I think, it is better to set cpu
> latency to zero to prevent service thread from the deeper the idle
> state.
> >You might also want to add a note to the in-line documentation of the relevant functions that the Linux implementation only uses microsecond resolution.
> >
> [1] https://uefi.org/specs/ACPI/6.5/08_Processor_Configuration_and_Control.html
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 0/2] introduce PM QoS interface
2024-03-22 17:55 ` Tyler Retzlaff
@ 2024-03-26 2:20 ` lihuisong (C)
2024-03-26 16:04 ` Tyler Retzlaff
0 siblings, 1 reply; 114+ messages in thread
From: lihuisong (C) @ 2024-03-26 2:20 UTC (permalink / raw)
To: Tyler Retzlaff
Cc: Morten Brørup, dev, weh,
longli@microsoft.com >> Long Li, alan.elder, thomas,
ferruh.yigit, anatoly.burakov, david.hunt, sivaprasad.tummala,
liuyonglong
Hi Tyler,
On 2024/3/23 1:55, Tyler Retzlaff wrote:
> On Fri, Mar 22, 2024 at 04:54:01PM +0800, lihuisong (C) wrote:
>> +Tyler, +Alan, +Wei, +Long for asking this similar feature on Windows.
>>
>> On 2024/3/21 21:30, Morten Brørup wrote:
>>>> From: lihuisong (C) [mailto:lihuisong@huawei.com]
>>>> Sent: Thursday, 21 March 2024 04.04
>>>>
>>>> Hi Moren,
>>>>
>>>> Thanks for your revew.
>>>>
>>>> On 2024/3/20 22:05, Morten Brørup wrote:
>>>>>> From: Huisong Li [mailto:lihuisong@huawei.com]
>>>>>> Sent: Wednesday, 20 March 2024 11.55
>>>>>>
>>>>>> The system-wide CPU latency QoS limit has a positive impact on the idle
>>>>>> state selection in cpuidle governor.
>>>>>>
>>>>>> Linux creates a cpu_dma_latency device under '/dev' directory to obtain the
>>>>>> CPU latency QoS limit on system and send the QoS request for userspace.
>>>>>> Please see the PM QoS framework in the following link:
>>>>>> https://docs.kernel.org/power/pm_qos_interface.html?highlight=qos
>>>>>> This feature is supported by kernel-v2.6.25.
>>>>>>
>>>>>> The deeper the idle state, the lower the power consumption, but the longer
>>>>>> the resume time. Some service are delay sensitive and very except the low
>>>>>> resume time, like interrupt packet receiving mode.
>>>>>>
>>>>>> So this series introduce PM QoS interface.
>>>>> This looks like a 1:1 wrapper for a Linux kernel feature.
>>>> right
>>>>> Does Windows or BSD offer something similar?
>>>> How do we know Windows or BSD support this similar feature?
>>> Ask Windows experts or research using Google.
>> I download freebsd source code, I didn't find this similar feature.
>> They don't even support cpuidle feature(this QoS feature affects cpuilde.).
>> I don't find any useful about this on Windows from google.
>>
>>
>> @Tyler, @Alan, @Wei and @Long
>>
>> Do you know windows support that userspace read and send CPU latency
>> which has an impact on deep level of CPU idle?
> it is unlikely you'll find an api that let's you manage things in terms
> of raw latency values as the linux knobs here do. windows more often employs
> policy centric schemes to permit the system to abstract implementation detail.
>
> powercfg is probably the closest thing you can use to tune the same
> things on windows. where you select e.g. the 'performance' scheme but it
> won't allow you to pick specific latency numbers.
>
> https://learn.microsoft.com/en-us/windows-hardware/design/device-experiences/powercfg-command-line-options
Thanks for your feedback. I will take a look at this tool.
>
>>>> The DPDK power lib just work on Linux according to the meson.build under
>>>> lib/power.
>>>> If they support this features, they can open it.
>>> The DPDK power lib currently only works on Linux, yes.
>>> But its API should still be designed to be platform agnostic, so the functions can be implemented on other platforms in the future.
>>>
>>> DPDK is on track to work across multiple platforms, including Windows.
>>> We must always consider other platforms, and not design DPDK APIs as if they are for Linux/BSD only.
>> totally understand you.
> since lib/power isn't built for windows at this time i don't think it's
> appropriate to constrain your innovation. i do appreciate the engagement
> though and would just offer general guidance that if you can design your
> api with some kind of abstraction in mind that would be great and by all
> means if you can figure out how to wrangle powercfg /Qh into satisfying the
> api in a policy centric way it might be kind of nice.
Testing this with powercfg on Windows would be very challenging for me,
so I don't plan to do this on Windows. If you need it, you can add it, OK?
>
> i'll let other windows experts chime in here if they choose.
>
> thanks!
>
>>>>> Furthermore, any high-res timing should use nanoseconds, not microseconds or
>>>> milliseconds.
>>>>> I realize that the Linux kernel only uses microseconds for these APIs, but
>>>> the DPDK API should use nanoseconds.
>>>> Nanoseconds is more precise, it's good.
>>>> But DPDK API how use nanoseconds as you said the the Linux kernel only
>>>> uses microseconds for these APIs.
>>>> Kernel interface just know an integer value with microseconds unit.
>>> One solution is to expose nanoseconds in the DPDK API, and in the Linux specific implementation convert from/to microseconds.
>> If so, we have to modify the implementation interface on Linux. This
>> change the input/output unit about the interface.
>> And DPDK also has to do this based on kernel version. It is not good.
>> The cpuidle governor select which idle state based on the worst-case
>> latency of idle state.
>> These the worst-case latency of Cstate reported by ACPI table is in
>> microseconds as the section 8.4.1.1. _CST (C States) and 8.4.3.3.
>> _LPI (Low Power Idle States) in ACPI spec [1].
>> So it is probably not meaning to change this interface implementation.
>>
>> For the case need PM QoS in DPDK, I think, it is better to set cpu
>> latency to zero to prevent service thread from the deeper the idle
>> state.
>>> You might also want to add a note to the in-line documentation of the relevant functions that the Linux implementation only uses microsecond resolution.
>>>
>> [1] https://uefi.org/specs/ACPI/6.5/08_Processor_Configuration_and_Control.html
> .
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 0/2] introduce PM QoS interface
2024-03-26 2:20 ` lihuisong (C)
@ 2024-03-26 16:04 ` Tyler Retzlaff
0 siblings, 0 replies; 114+ messages in thread
From: Tyler Retzlaff @ 2024-03-26 16:04 UTC (permalink / raw)
To: lihuisong (C)
Cc: Morten Brørup, dev, weh,
longli@microsoft.com >> Long Li, alan.elder, thomas,
ferruh.yigit, anatoly.burakov, david.hunt, sivaprasad.tummala,
liuyonglong
On Tue, Mar 26, 2024 at 10:20:45AM +0800, lihuisong (C) wrote:
> Hi Tyler,
>
> On 2024/3/23 1:55, Tyler Retzlaff wrote:
> >On Fri, Mar 22, 2024 at 04:54:01PM +0800, lihuisong (C) wrote:
> >>+Tyler, +Alan, +Wei, +Long for asking this similar feature on Windows.
> >>
> >>On 2024/3/21 21:30, Morten Brørup wrote:
> >>>>From: lihuisong (C) [mailto:lihuisong@huawei.com]
> >>>>Sent: Thursday, 21 March 2024 04.04
> >>>>
> >>>>Hi Moren,
> >>>>
> >>>>Thanks for your revew.
> >>>>
> >>>>On 2024/3/20 22:05, Morten Brørup wrote:
> >>>>>>From: Huisong Li [mailto:lihuisong@huawei.com]
> >>>>>>Sent: Wednesday, 20 March 2024 11.55
> >>>>>>
> >>>>>>The system-wide CPU latency QoS limit has a positive impact on the idle
> >>>>>>state selection in cpuidle governor.
> >>>>>>
> >>>>>>Linux creates a cpu_dma_latency device under '/dev' directory to obtain the
> >>>>>>CPU latency QoS limit on system and send the QoS request for userspace.
> >>>>>>Please see the PM QoS framework in the following link:
> >>>>>>https://docs.kernel.org/power/pm_qos_interface.html?highlight=qos
> >>>>>>This feature is supported by kernel-v2.6.25.
> >>>>>>
> >>>>>>The deeper the idle state, the lower the power consumption, but the longer
> >>>>>>the resume time. Some service are delay sensitive and very except the low
> >>>>>>resume time, like interrupt packet receiving mode.
> >>>>>>
> >>>>>>So this series introduce PM QoS interface.
> >>>>>This looks like a 1:1 wrapper for a Linux kernel feature.
> >>>>right
> >>>>>Does Windows or BSD offer something similar?
> >>>>How do we know Windows or BSD support this similar feature?
> >>>Ask Windows experts or research using Google.
> >>I download freebsd source code, I didn't find this similar feature.
> >>They don't even support cpuidle feature(this QoS feature affects cpuilde.).
> >>I don't find any useful about this on Windows from google.
> >>
> >>
> >>@Tyler, @Alan, @Wei and @Long
> >>
> >>Do you know windows support that userspace read and send CPU latency
> >>which has an impact on deep level of CPU idle?
> >it is unlikely you'll find an api that let's you manage things in terms
> >of raw latency values as the linux knobs here do. windows more often employs
> >policy centric schemes to permit the system to abstract implementation detail.
> >
> >powercfg is probably the closest thing you can use to tune the same
> >things on windows. where you select e.g. the 'performance' scheme but it
> >won't allow you to pick specific latency numbers.
> >
> >https://learn.microsoft.com/en-us/windows-hardware/design/device-experiences/powercfg-command-line-options
>
> Thanks for your feedback. I will take a look at this tool.
>
> >
> >>>>The DPDK power lib just work on Linux according to the meson.build under
> >>>>lib/power.
> >>>>If they support this features, they can open it.
> >>>The DPDK power lib currently only works on Linux, yes.
> >>>But its API should still be designed to be platform agnostic, so the functions can be implemented on other platforms in the future.
> >>>
> >>>DPDK is on track to work across multiple platforms, including Windows.
> >>>We must always consider other platforms, and not design DPDK APIs as if they are for Linux/BSD only.
> >>totally understand you.
> >since lib/power isn't built for windows at this time i don't think it's
> >appropriate to constrain your innovation. i do appreciate the engagement
> >though and would just offer general guidance that if you can design your
> >api with some kind of abstraction in mind that would be great and by all
> >means if you can figure out how to wrangle powercfg /Qh into satisfying the
> >api in a policy centric way it might be kind of nice.
> Testing this by using powercfg on Windows creates a very challenge for me.
> So I don't plan to do this on Windows. If you need, you can add it, ok?
ordinarily i would say it is appropriate to, however in this
circumstance i agree. there is quite possibly significant porting work
to be done so i would have to address it if we ever include it for
windows.
thanks
> >
> >i'll let other windows experts chime in here if they choose.
> >
> >thanks!
> >
> >>>>>Furthermore, any high-res timing should use nanoseconds, not microseconds or
> >>>>milliseconds.
> >>>>>I realize that the Linux kernel only uses microseconds for these APIs, but
> >>>>the DPDK API should use nanoseconds.
> >>>>Nanoseconds is more precise, it's good.
> >>>>But DPDK API how use nanoseconds as you said the the Linux kernel only
> >>>>uses microseconds for these APIs.
> >>>>Kernel interface just know an integer value with microseconds unit.
> >>>One solution is to expose nanoseconds in the DPDK API, and in the Linux specific implementation convert from/to microseconds.
> >>If so, we have to modify the implementation interface on Linux. This
> >>change the input/output unit about the interface.
> >>And DPDK also has to do this based on kernel version. It is not good.
> >>The cpuidle governor select which idle state based on the worst-case
> >>latency of idle state.
> >>These the worst-case latency of Cstate reported by ACPI table is in
> >>microseconds as the section 8.4.1.1. _CST (C States) and 8.4.3.3.
> >>_LPI (Low Power Idle States) in ACPI spec [1].
> >>So it is probably not meaning to change this interface implementation.
> >>
> >>For the case need PM QoS in DPDK, I think, it is better to set cpu
> >>latency to zero to prevent service thread from the deeper the idle
> >>state.
> >>>You might also want to add a note to the in-line documentation of the relevant functions that the Linux implementation only uses microsecond resolution.
> >>>
> >>[1] https://uefi.org/specs/ACPI/6.5/08_Processor_Configuration_and_Control.html
> >.
^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH v2 0/2] power: introduce PM QoS interface
2024-03-20 10:55 [PATCH 0/2] introduce PM QoS interface Huisong Li
` (2 preceding siblings ...)
2024-03-20 14:05 ` [PATCH 0/2] introduce PM QoS interface Morten Brørup
@ 2024-06-13 11:20 ` Huisong Li
2024-06-13 11:20 ` [PATCH v2 1/2] power: introduce PM QoS API on CPU wide Huisong Li
2024-06-13 11:20 ` [PATCH v2 2/2] examples/l3fwd-power: add PM QoS configuration Huisong Li
2024-06-19 6:31 ` [PATCH v3 0/2] power: introduce PM QoS interface Huisong Li
` (11 subsequent siblings)
15 siblings, 2 replies; 114+ messages in thread
From: Huisong Li @ 2024-06-13 11:20 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, liuyonglong, lihuisong
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay-sensitive and require a low
resume time, for example the interrupt packet receiving mode.
The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
interface lets userspace set and get the resume latency limit on cpuX.
Please see the description in the kernel documentation [1].
Each cpuidle governor in Linux selects which idle state to enter based on
this CPU resume latency in its idle task.
The per-CPU PM QoS API can be used to control a CPU's idle state
selection. Setting a strict resume latency (zero value) restricts the CPU
to the shallowest idle state, which lowers the delay after sleep.
[1] https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
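A minimal sketch of how the per-CPU sysfs file above could be addressed and written from userspace. This is an illustration under stated assumptions, not the code in this series: the helper names are hypothetical, and the value is passed as text so that the caller picks the encoding defined in the kernel ABI document [1].

```c
#include <stdio.h>

#define PM_QOS_RESUME_LATENCY_FMT \
	"/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us"

/*
 * Hypothetical helper: format the sysfs path for 'cpu' into 'buf'.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int
resume_latency_path(unsigned int cpu, char *buf, size_t len)
{
	int n = snprintf(buf, len, PM_QOS_RESUME_LATENCY_FMT, cpu);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: write 'value' as text to the file at 'path'.
 * Returns 0 on success, -1 on error.
 */
static int
resume_latency_write(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (f == NULL)
		return -1;
	ok = fputs(value, f) >= 0;
	/* fclose() flushes the buffer, so a rejected write shows up here. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

Writing to the real sysfs file normally requires root privileges.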
Huisong Li (2):
power: introduce PM QoS API on CPU wide
examples/l3fwd-power: add PM QoS configuration
doc/guides/prog_guide/power_man.rst | 22 +++++
doc/guides/rel_notes/release_24_07.rst | 4 +
examples/l3fwd-power/main.c | 29 +++++++
lib/power/meson.build | 2 +
lib/power/rte_power_qos.c | 116 +++++++++++++++++++++++++
lib/power/rte_power_qos.h | 70 +++++++++++++++
lib/power/version.map | 2 +
7 files changed, 245 insertions(+)
create mode 100644 lib/power/rte_power_qos.c
create mode 100644 lib/power/rte_power_qos.h
--
2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH v2 1/2] power: introduce PM QoS API on CPU wide
2024-06-13 11:20 ` [PATCH v2 0/2] power: " Huisong Li
@ 2024-06-13 11:20 ` Huisong Li
2024-06-14 8:04 ` Morten Brørup
2024-06-13 11:20 ` [PATCH v2 2/2] examples/l3fwd-power: add PM QoS configuration Huisong Li
1 sibling, 1 reply; 114+ messages in thread
From: Huisong Li @ 2024-06-13 11:20 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, liuyonglong, lihuisong
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay sensitive and expect a low resume
time, for example the interrupt packet receiving mode.
The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
interface lets userspace set and get the resume latency limit on cpuX.
Each cpuidle governor in Linux selects which idle state to enter based on
this CPU resume latency in its idle task.
The per-CPU PM QoS API can be used to control this CPU's idle state
selection. Setting a strict resume latency (zero value) limits the CPU to
the shallowest idle state, which lowers the delay after sleep.
Signed-off-by: Huisong Li <lihuisong@huawei.com>
---
doc/guides/prog_guide/power_man.rst | 22 +++++
doc/guides/rel_notes/release_24_07.rst | 4 +
lib/power/meson.build | 2 +
lib/power/rte_power_qos.c | 116 +++++++++++++++++++++++++
lib/power/rte_power_qos.h | 70 +++++++++++++++
lib/power/version.map | 2 +
6 files changed, 216 insertions(+)
create mode 100644 lib/power/rte_power_qos.c
create mode 100644 lib/power/rte_power_qos.h
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index f6674efe2d..3ff46f06c1 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -249,6 +249,28 @@ Get Num Pkgs
Get Num Dies
Get the number of die's on a given package.
+
+PM QoS
+------
+
+The deeper the idle state, the lower the power consumption, but the longer
+the resume time. Some services are delay sensitive and expect a low resume
+time, for example the interrupt packet receiving mode.
+
+The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
+interface lets userspace set and get the resume latency limit on cpuX.
+Each cpuidle governor in Linux selects which idle state to enter based on
+this CPU resume latency in its idle task.
+
+The per-CPU PM QoS API can be used to set and get the CPU resume latency.
+
+The ``rte_power_qos_set_cpu_resume_latency()`` function affects the idle state
+selection of the worker CPU. Setting zero (strict resume latency) only allows
+this CPU to enter the shallowest idle state.
+
+The ``rte_power_qos_get_cpu_resume_latency()`` function returns the resume
+latency of the specified CPU.
+
References
----------
diff --git a/doc/guides/rel_notes/release_24_07.rst b/doc/guides/rel_notes/release_24_07.rst
index e68a53d757..7c0d36e389 100644
--- a/doc/guides/rel_notes/release_24_07.rst
+++ b/doc/guides/rel_notes/release_24_07.rst
@@ -89,6 +89,10 @@ New Features
* Added SSE/NEON vector datapath.
+* **Introduce PM QoS interface.**
+
+ * Introduce PM QoS interface to lower the delay after sleep.
+
Removed Items
-------------
diff --git a/lib/power/meson.build b/lib/power/meson.build
index b8426589b2..8222e178b0 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -23,12 +23,14 @@ sources = files(
'rte_power.c',
'rte_power_uncore.c',
'rte_power_pmd_mgmt.c',
+ 'rte_power_qos.c',
)
headers = files(
'rte_power.h',
'rte_power_guest_channel.h',
'rte_power_pmd_mgmt.h',
'rte_power_uncore.h',
+ 'rte_power_qos.h',
)
if cc.has_argument('-Wno-cast-qual')
cflags += '-Wno-cast-qual'
diff --git a/lib/power/rte_power_qos.c b/lib/power/rte_power_qos.c
new file mode 100644
index 0000000000..706f8432ee
--- /dev/null
+++ b/lib/power/rte_power_qos.c
@@ -0,0 +1,116 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#include <errno.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <rte_log.h>
+
+#include "power_common.h"
+#include "rte_power_qos.h"
+
+#define PM_QOS_RESUME_LATENCY_NO_CONSTRAINT ((int)(UINT32_MAX >> 1))
+#define PM_QOS_SYSFILE_RESUME_LATENCY_US \
+ "/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us"
+
+int
+rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency)
+{
+ char buf[BUFSIZ] = {0};
+ FILE *f;
+ int ret;
+
+ if (lcore_id >= RTE_MAX_LCORE) {
+ POWER_LOG(ERR, "Lcore id %u cannot exceed %u",
+ lcore_id, RTE_MAX_LCORE - 1U);
+ return -EINVAL;
+ }
+
+ if (latency < 0) {
+ POWER_LOG(ERR, "latency should be greater than or equal to 0");
+ return -EINVAL;
+ }
+
+ ret = open_core_sysfs_file(&f, "w", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ return ret;
+ }
+
+ /*
+ * Based on the sysfs interface pm_qos_resume_latency_us at the
+ * @PM_QOS_SYSFILE_RESUME_LATENCY_US path in the kernel, the meaning
+ * of each input string is as follows:
+ * 1> the resume latency is 0 if the input is "n/a".
+ * 2> the resume latency is no constraint if the input is "0".
+ * 3> otherwise the resume latency is the actual value to be set.
+ */
+ if (latency == 0)
+ sprintf(buf, "%s", "n/a");
+ else if (latency == PM_QOS_RESUME_LATENCY_NO_CONSTRAINT)
+ sprintf(buf, "%u", 0);
+ else
+ sprintf(buf, "%u", latency);
+
+ ret = write_core_sysfs_s(f, buf);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to write "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ goto out;
+ }
+
+out:
+ if (f != NULL)
+ fclose(f);
+
+ return ret;
+}
+
+int
+rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id)
+{
+ char buf[BUFSIZ];
+ int latency = -1;
+ FILE *f;
+ int ret;
+
+ if (lcore_id >= RTE_MAX_LCORE) {
+ POWER_LOG(ERR, "Lcore id %u cannot exceed %u",
+ lcore_id, RTE_MAX_LCORE - 1U);
+ return -EINVAL;
+ }
+
+ ret = open_core_sysfs_file(&f, "r", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ return ret;
+ }
+
+ ret = read_core_sysfs_s(f, buf, sizeof(buf));
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to read "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ goto out;
+ }
+
+ /*
+ * Based on the sysfs interface pm_qos_resume_latency_us at the
+ * @PM_QOS_SYSFILE_RESUME_LATENCY_US path in the kernel, the meaning
+ * of each output string is as follows:
+ * 1> the resume latency is 0 if the output is "n/a".
+ * 2> the resume latency is no constraint if the output is "0".
+ * 3> otherwise the output is the actual resume latency in use.
+ */
+ if (strcmp(buf, "n/a") == 0)
+ latency = 0;
+ else {
+ latency = strtoul(buf, NULL, 10);
+ latency = latency == 0 ? PM_QOS_RESUME_LATENCY_NO_CONSTRAINT : latency;
+ }
+
+out:
+ if (f != NULL)
+ fclose(f);
+
+ return latency != -1 ? latency : ret;
+}
diff --git a/lib/power/rte_power_qos.h b/lib/power/rte_power_qos.h
new file mode 100644
index 0000000000..1ba9568d1b
--- /dev/null
+++ b/lib/power/rte_power_qos.h
@@ -0,0 +1,70 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#ifndef RTE_POWER_QOS_H
+#define RTE_POWER_QOS_H
+
+#include <rte_compat.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * @file rte_power_qos.h
+ *
+ * PM QoS API.
+ *
+ * The CPU-wide resume latency limit affects this CPU's idle state selection
+ * in each cpuidle governor.
+ * Please see the CPU-wide PM QoS description at the following link:
+ * https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
+ *
+ * The deeper the idle state, the lower the power consumption, but the
+ * longer the resume time. Some services are delay sensitive and expect a
+ * low resume time, for example the interrupt packet receiving mode.
+ *
+ * In these cases, the per-CPU PM QoS API can be used to control this CPU's
+ * idle state selection. Setting a strict resume latency (zero value) limits
+ * the CPU to the shallowest idle state, which lowers the delay after sleep.
+ */
+
+#define PM_QOS_STRICT_LATENCY_VALUE 0
+#define PM_QOS_RESUME_LATENCY_NO_CONSTRAINT ((int)(UINT32_MAX >> 1))
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * @param lcore_id
+ * target logical core id
+ *
+ * @param latency
+ * The latency in microseconds. It must be greater than or equal to zero.
+ *
+ * @return
+ * 0 on success. Otherwise a negative value is returned.
+ */
+__rte_experimental
+int rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Get the current resume latency of this logical core.
+ * The default value in the kernel is PM_QOS_RESUME_LATENCY_NO_CONSTRAINT if it has not been set.
+ *
+ * @return
+ * Negative value on failure.
+ * >= 0 means the actual resume latency limit on this core.
+ */
+__rte_experimental
+int rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* RTE_POWER_QOS_H */
diff --git a/lib/power/version.map b/lib/power/version.map
index ad92a65f91..81b8ff11b7 100644
--- a/lib/power/version.map
+++ b/lib/power/version.map
@@ -51,4 +51,6 @@ EXPERIMENTAL {
rte_power_set_uncore_env;
rte_power_uncore_freqs;
rte_power_unset_uncore_env;
+ rte_power_qos_set_cpu_resume_latency;
+ rte_power_qos_get_cpu_resume_latency;
};
--
2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* RE: [PATCH v2 1/2] power: introduce PM QoS API on CPU wide
2024-06-13 11:20 ` [PATCH v2 1/2] power: introduce PM QoS API on CPU wide Huisong Li
@ 2024-06-14 8:04 ` Morten Brørup
2024-06-18 12:19 ` lihuisong (C)
0 siblings, 1 reply; 114+ messages in thread
From: Morten Brørup @ 2024-06-14 8:04 UTC (permalink / raw)
To: Huisong Li, dev, david.marchand
Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, liuyonglong
> +#define PM_QOS_SYSFILE_RESUME_LATENCY_US \
> + "/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us"
Is it OK to access this path using the lcore_id as CPU parameter to open_core_sysfs_file(), or must it be mapped through rte_lcore_to_cpu_id(lcore_id) first?
@David, do you know?
> +
> +int
> +rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency)
> +{
> + char buf[BUFSIZ] = {0};
> + FILE *f;
> + int ret;
> +
> + if (lcore_id >= RTE_MAX_LCORE) {
> + POWER_LOG(ERR, "Lcore id %u cannot exceed %u",
> + lcore_id, RTE_MAX_LCORE - 1U);
> + return -EINVAL;
> + }
The lcore_id could be a registered non-EAL thread.
You should probably fail in that case.
Same comment for rte_power_qos_get_cpu_resume_latency().
> +#define PM_QOS_STRICT_LATENCY_VALUE 0
> +#define PM_QOS_RESUME_LATENCY_NO_CONSTRAINT ((int)(UINT32_MAX >> 1))
These definitions are in the public header file, and thus should be RTE_POWER_ prefixed and have comments describing them.
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v2 1/2] power: introduce PM QoS API on CPU wide
2024-06-14 8:04 ` Morten Brørup
@ 2024-06-18 12:19 ` lihuisong (C)
2024-06-18 12:53 ` Morten Brørup
0 siblings, 1 reply; 114+ messages in thread
From: lihuisong (C) @ 2024-06-18 12:19 UTC (permalink / raw)
To: Morten Brørup, dev, david.marchand
Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, liuyonglong
Hi Morten,
Thanks for your review.
On 2024/6/14 16:04, Morten Brørup wrote:
>> +#define PM_QOS_SYSFILE_RESUME_LATENCY_US \
>> + "/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us"
> Is it OK to access this path using the lcore_id as CPU parameter to open_core_sysfs_file(), or must it be mapped through rte_lcore_to_cpu_id(lcore_id) first?
The cpu_id returned by rte_lcore_to_cpu_id() comes from
lcore_config[lcore_id].core_id, which is read from the
"/sys/devices/system/cpu/cpuX/topology/core_id" file; please see the
function eal_cpu_core_id().
So I think the number in the above "cpuX" must be the lcore_id in DPDK.
Similar interfaces in the power lib also use the lcore_id directly.
>
> @David, do you know?
>
>> +
>> +int
>> +rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency)
>> +{
>> + char buf[BUFSIZ] = {0};
>> + FILE *f;
>> + int ret;
>> +
>> + if (lcore_id >= RTE_MAX_LCORE) {
>> + POWER_LOG(ERR, "Lcore id %u cannot exceed %u",
>> + lcore_id, RTE_MAX_LCORE - 1U);
>> + return -EINVAL;
>> + }
> The lcore_id could be a registered non-EAL thread.
> You should probably fail in that case.
Right, how about using rte_lcore_is_enabled(lcore_id)?
>
> Same comment for rte_power_qos_get_cpu_resume_latency().
>
>
>> +#define PM_QOS_STRICT_LATENCY_VALUE 0
>> +#define PM_QOS_RESUME_LATENCY_NO_CONSTRAINT ((int)(UINT32_MAX >> 1))
> These definitions are in the public header file, and thus should be RTE_POWER_ prefixed and have comments describing them.
Ack
>
>
> .
^ permalink raw reply [flat|nested] 114+ messages in thread
* RE: [PATCH v2 1/2] power: introduce PM QoS API on CPU wide
2024-06-18 12:19 ` lihuisong (C)
@ 2024-06-18 12:53 ` Morten Brørup
0 siblings, 0 replies; 114+ messages in thread
From: Morten Brørup @ 2024-06-18 12:53 UTC (permalink / raw)
To: lihuisong (C), dev
Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, liuyonglong, david.marchand
> From: lihuisong (C) [mailto:lihuisong@huawei.com]
>
> Hi Morten,
>
> Thanks for your review.
>
>
> On 2024/6/14 16:04, Morten Brørup wrote:
> >> +#define PM_QOS_SYSFILE_RESUME_LATENCY_US \
> >> + "/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us"
> > Is it OK to access this path using the lcore_id as CPU parameter to
> open_core_sysfs_file(), or must it be mapped through
> rte_lcore_to_cpu_id(lcore_id) first?
> The cpu_id returned by rte_lcore_to_cpu_id() comes from
> lcore_config[lcore_id].core_id, which is read from the
> "/sys/devices/system/cpu/cpuX/topology/core_id" file; please see the
> function eal_cpu_core_id().
> So I think the number in the above "cpuX" must be the lcore_id in DPDK.
> Similar interfaces in the power lib also use the lcore_id directly.
Then it should be OK.
Thanks for the detailed answer.
> >
> > @David, do you know?
> >
> >> +
> >> +int
> >> +rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency)
> >> +{
> >> + char buf[BUFSIZ] = {0};
> >> + FILE *f;
> >> + int ret;
> >> +
> >> + if (lcore_id >= RTE_MAX_LCORE) {
> >> + POWER_LOG(ERR, "Lcore id %u cannot exceed %u",
> >> + lcore_id, RTE_MAX_LCORE - 1U);
> >> + return -EINVAL;
> >> + }
> > The lcore_id could be a registered non-EAL thread.
> > You should probably fail in that case.
> Right, how about using rte_lcore_is_enabled(lcore_id)?
I suppose setting latency for service cores should be forbidden too,
so using rte_lcore_is_enabled() to check for ROLE_RTE is correct.
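To illustrate why a single role check can cover the cases discussed here, the following is a minimal, self-contained model in plain C. It is only a sketch under stated assumptions: the role table, MODEL_MAX_LCORE, the model_role enum and model_lcore_is_enabled() are hypothetical names invented for this illustration, and the real rte_lcore_is_enabled() reads the EAL configuration instead of taking a table argument.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for RTE_MAX_LCORE in this model. */
#define MODEL_MAX_LCORE 8

/* Hypothetical roles; the order is chosen so zero-initialised slots are OFF. */
enum model_role {
	MODEL_ROLE_OFF,		/* lcore not used by the application */
	MODEL_ROLE_RTE,		/* regular EAL worker or main lcore */
	MODEL_ROLE_SERVICE,	/* service core */
	MODEL_ROLE_NON_EAL	/* registered non-EAL thread */
};

/*
 * Return 1 only for an in-range lcore id whose role is MODEL_ROLE_RTE.
 * Out-of-range ids, service cores and non-EAL threads all yield 0.
 */
int
model_lcore_is_enabled(const enum model_role *roles, unsigned int lcore_id)
{
	assert(roles != NULL);
	if (lcore_id >= MODEL_MAX_LCORE)
		return 0;
	return roles[lcore_id] == MODEL_ROLE_RTE;
}
```

With such a check at the top of the setter and getter, a separate range test on lcore_id is no longer needed in this model, because the out-of-range case is rejected by the same call.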
> >
> > Same comment for rte_power_qos_get_cpu_resume_latency().
> >
> >
> >> +#define PM_QOS_STRICT_LATENCY_VALUE 0
> >> +#define PM_QOS_RESUME_LATENCY_NO_CONSTRAINT ((int)(UINT32_MAX >> 1))
> > These definitions are in the public header file, and thus should be
> RTE_POWER_ prefixed and have comments describing them.
> Ack
> >
> >
> > .
^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH v2 2/2] examples/l3fwd-power: add PM QoS configuration
2024-06-13 11:20 ` [PATCH v2 0/2] power: " Huisong Li
2024-06-13 11:20 ` [PATCH v2 1/2] power: introduce PM QoS API on CPU wide Huisong Li
@ 2024-06-13 11:20 ` Huisong Li
1 sibling, 0 replies; 114+ messages in thread
From: Huisong Li @ 2024-06-13 11:20 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, liuyonglong, lihuisong
Add PM QoS configuration to decrease the delay after sleep that is caused
by entering a deeper idle state.
Signed-off-by: Huisong Li <lihuisong@huawei.com>
---
examples/l3fwd-power/main.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index fba11da7ca..96980352ee 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
#include <rte_telemetry.h>
#include <rte_power_pmd_mgmt.h>
#include <rte_power_uncore.h>
+#include <rte_power_qos.h>
#include "perf_core.h"
#include "main.h"
@@ -2259,6 +2260,25 @@ init_power_library(void)
return -1;
}
}
+
+ RTE_LCORE_FOREACH(lcore_id) {
+ if (rte_lcore_is_enabled(lcore_id) == 0)
+ continue;
+ /*
+ * Set a strict latency limit on the worker CPU so that the
+ * CPU running the process can only enter the shallowest
+ * idle state.
+ */
+ ret = rte_power_qos_set_cpu_resume_latency(lcore_id,
+ PM_QOS_STRICT_LATENCY_VALUE);
+ if (ret < 0) {
+ RTE_LOG(ERR, L3FWD_POWER,
+ "Failed to set strict resume latency on CPU%u.\n",
+ lcore_id);
+ return ret;
+ }
+ }
+
return ret;
}
@@ -2298,6 +2318,15 @@ deinit_power_library(void)
}
}
}
+
+ RTE_LCORE_FOREACH(lcore_id) {
+ if (rte_lcore_is_enabled(lcore_id) == 0)
+ continue;
+ /* Restore the kernel default value (no constraint). */
+ rte_power_qos_set_cpu_resume_latency(lcore_id,
+ PM_QOS_RESUME_LATENCY_NO_CONSTRAINT);
+ }
+
return ret;
}
--
2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH v3 0/2] power: introduce PM QoS interface
2024-03-20 10:55 [PATCH 0/2] introduce PM QoS interface Huisong Li
` (3 preceding siblings ...)
2024-06-13 11:20 ` [PATCH v2 0/2] power: " Huisong Li
@ 2024-06-19 6:31 ` Huisong Li
2024-06-19 6:31 ` [PATCH v3 1/2] power: introduce PM QoS API on CPU wide Huisong Li
` (2 more replies)
2024-06-27 6:00 ` [PATCH v4 " Huisong Li
` (10 subsequent siblings)
15 siblings, 3 replies; 114+ messages in thread
From: Huisong Li @ 2024-06-19 6:31 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, david.marchand, liuyonglong, lihuisong
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay sensitive and expect a low resume
time, for example the interrupt packet receiving mode.
The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
interface lets userspace set and get the resume latency limit on cpuX.
Please see the description in the kernel document[1].
Each cpuidle governor in Linux selects which idle state to enter based on
this CPU resume latency in its idle task.
The per-CPU PM QoS API can be used to control this CPU's idle state
selection. Setting a strict resume latency (zero value) limits the CPU to
the shallowest idle state, which lowers the delay after sleep.
[1] https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
---
v3:
- add the RTE_POWER_xxx prefix to some macros in the header
- add the check for lcore_id with rte_lcore_is_enabled
v2:
- use the CPU-wide PM QoS instead of the system-wide one
Huisong Li (2):
power: introduce PM QoS API on CPU wide
examples/l3fwd-power: add PM QoS configuration
doc/guides/prog_guide/power_man.rst | 22 +++++
doc/guides/rel_notes/release_24_07.rst | 4 +
examples/l3fwd-power/main.c | 29 +++++++
lib/power/meson.build | 2 +
lib/power/rte_power_qos.c | 114 +++++++++++++++++++++++++
lib/power/rte_power_qos.h | 71 +++++++++++++++
lib/power/version.map | 2 +
7 files changed, 244 insertions(+)
create mode 100644 lib/power/rte_power_qos.c
create mode 100644 lib/power/rte_power_qos.h
--
2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH v3 1/2] power: introduce PM QoS API on CPU wide
2024-06-19 6:31 ` [PATCH v3 0/2] power: introduce PM QoS interface Huisong Li
@ 2024-06-19 6:31 ` Huisong Li
2024-06-19 14:56 ` Stephen Hemminger
2024-06-19 15:32 ` Thomas Monjalon
2024-06-19 6:31 ` [PATCH v3 2/2] examples/l3fwd-power: add PM QoS configuration Huisong Li
2024-06-19 6:59 ` [PATCH v3 0/2] power: introduce PM QoS interface Morten Brørup
2 siblings, 2 replies; 114+ messages in thread
From: Huisong Li @ 2024-06-19 6:31 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, david.marchand, liuyonglong, lihuisong
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay sensitive and expect a low resume
time, for example the interrupt packet receiving mode.
The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
interface lets userspace set and get the resume latency limit on cpuX.
Each cpuidle governor in Linux selects which idle state to enter based on
this CPU resume latency in its idle task.
The per-CPU PM QoS API can be used to control this CPU's idle state
selection. Setting a strict resume latency (zero value) limits the CPU to
the shallowest idle state, which lowers the delay after sleep.
Signed-off-by: Huisong Li <lihuisong@huawei.com>
---
doc/guides/prog_guide/power_man.rst | 22 +++++
doc/guides/rel_notes/release_24_07.rst | 4 +
lib/power/meson.build | 2 +
lib/power/rte_power_qos.c | 114 +++++++++++++++++++++++++
lib/power/rte_power_qos.h | 71 +++++++++++++++
lib/power/version.map | 2 +
6 files changed, 215 insertions(+)
create mode 100644 lib/power/rte_power_qos.c
create mode 100644 lib/power/rte_power_qos.h
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index f6674efe2d..3ff46f06c1 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -249,6 +249,28 @@ Get Num Pkgs
Get Num Dies
Get the number of die's on a given package.
+
+PM QoS
+------
+
+The deeper the idle state, the lower the power consumption, but the longer
+the resume time. Some services are delay sensitive and expect a low resume
+time, for example the interrupt packet receiving mode.
+
+The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
+interface lets userspace set and get the resume latency limit on cpuX.
+Each cpuidle governor in Linux selects which idle state to enter based on
+this CPU resume latency in its idle task.
+
+The per-CPU PM QoS API can be used to set and get the CPU resume latency.
+
+The ``rte_power_qos_set_cpu_resume_latency()`` function affects the idle state
+selection of the worker CPU. Setting zero (strict resume latency) only allows
+this CPU to enter the shallowest idle state.
+
+The ``rte_power_qos_get_cpu_resume_latency()`` function returns the resume
+latency of the specified CPU.
+
References
----------
diff --git a/doc/guides/rel_notes/release_24_07.rst b/doc/guides/rel_notes/release_24_07.rst
index e68a53d757..7c0d36e389 100644
--- a/doc/guides/rel_notes/release_24_07.rst
+++ b/doc/guides/rel_notes/release_24_07.rst
@@ -89,6 +89,10 @@ New Features
* Added SSE/NEON vector datapath.
+* **Introduce PM QoS interface.**
+
+ * Introduce PM QoS interface to lower the delay after sleep.
+
Removed Items
-------------
diff --git a/lib/power/meson.build b/lib/power/meson.build
index b8426589b2..8222e178b0 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -23,12 +23,14 @@ sources = files(
'rte_power.c',
'rte_power_uncore.c',
'rte_power_pmd_mgmt.c',
+ 'rte_power_qos.c',
)
headers = files(
'rte_power.h',
'rte_power_guest_channel.h',
'rte_power_pmd_mgmt.h',
'rte_power_uncore.h',
+ 'rte_power_qos.h',
)
if cc.has_argument('-Wno-cast-qual')
cflags += '-Wno-cast-qual'
diff --git a/lib/power/rte_power_qos.c b/lib/power/rte_power_qos.c
new file mode 100644
index 0000000000..b131cf58e7
--- /dev/null
+++ b/lib/power/rte_power_qos.c
@@ -0,0 +1,114 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#include <errno.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <rte_lcore.h>
+#include <rte_log.h>
+
+#include "power_common.h"
+#include "rte_power_qos.h"
+
+#define PM_QOS_SYSFILE_RESUME_LATENCY_US \
+ "/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us"
+
+int
+rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency)
+{
+ char buf[BUFSIZ] = {0};
+ FILE *f;
+ int ret;
+
+ if (!rte_lcore_is_enabled(lcore_id)) {
+ POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
+ return -EINVAL;
+ }
+
+ if (latency < 0) {
+ POWER_LOG(ERR, "latency should be greater than or equal to 0");
+ return -EINVAL;
+ }
+
+ ret = open_core_sysfs_file(&f, "w", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ return ret;
+ }
+
+ /*
+ * Based on the sysfs interface pm_qos_resume_latency_us at the
+ * @PM_QOS_SYSFILE_RESUME_LATENCY_US path in the kernel, the meaning
+ * of each input string is as follows:
+ * 1> the resume latency is 0 if the input is "n/a".
+ * 2> the resume latency is no constraint if the input is "0".
+ * 3> otherwise the resume latency is the actual value to be set.
+ */
+ if (latency == 0)
+ sprintf(buf, "%s", "n/a");
+ else if (latency == RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT)
+ sprintf(buf, "%u", 0);
+ else
+ sprintf(buf, "%u", latency);
+
+ ret = write_core_sysfs_s(f, buf);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to write "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ goto out;
+ }
+
+out:
+ if (f != NULL)
+ fclose(f);
+
+ return ret;
+}
+
+int
+rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id)
+{
+ char buf[BUFSIZ];
+ int latency = -1;
+ FILE *f;
+ int ret;
+
+ if (!rte_lcore_is_enabled(lcore_id)) {
+ POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
+ return -EINVAL;
+ }
+
+ ret = open_core_sysfs_file(&f, "r", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ return ret;
+ }
+
+ ret = read_core_sysfs_s(f, buf, sizeof(buf));
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to read "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ goto out;
+ }
+
+ /*
+ * Based on the sysfs interface pm_qos_resume_latency_us at the
+ * @PM_QOS_SYSFILE_RESUME_LATENCY_US path in the kernel, the meaning
+ * of each output string is as follows:
+ * 1> the resume latency is 0 if the output is "n/a".
+ * 2> the resume latency is no constraint if the output is "0".
+ * 3> otherwise the output is the actual resume latency in use.
+ */
+ if (strcmp(buf, "n/a") == 0)
+ latency = 0;
+ else {
+ latency = strtoul(buf, NULL, 10);
+ latency = latency == 0 ? RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT : latency;
+ }
+
+out:
+ if (f != NULL)
+ fclose(f);
+
+ return latency != -1 ? latency : ret;
+}
diff --git a/lib/power/rte_power_qos.h b/lib/power/rte_power_qos.h
new file mode 100644
index 0000000000..2b25d0d4c1
--- /dev/null
+++ b/lib/power/rte_power_qos.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#ifndef RTE_POWER_QOS_H
+#define RTE_POWER_QOS_H
+
+#include <rte_compat.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * @file rte_power_qos.h
+ *
+ * PM QoS API.
+ *
+ * The CPU-wide resume latency limit affects this CPU's idle state selection
+ * in each cpuidle governor.
+ * Please see the CPU-wide PM QoS description at the following link:
+ * https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
+ *
+ * The deeper the idle state, the lower the power consumption, but the
+ * longer the resume time. Some services are delay sensitive and expect a
+ * low resume time, for example the interrupt packet receiving mode.
+ *
+ * In these cases, the per-CPU PM QoS API can be used to control this CPU's
+ * idle state selection. Setting a strict resume latency (zero value) limits
+ * the CPU to the shallowest idle state, which lowers the delay after sleep.
+ */
+
+#define RTE_POWER_QOS_STRICT_LATENCY_VALUE 0
+#define RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT ((int)(UINT32_MAX >> 1))
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * @param lcore_id
+ * target logical core id
+ *
+ * @param latency
+ * The latency in microseconds. It must be greater than or equal to zero.
+ *
+ * @return
+ * 0 on success. Otherwise a negative value is returned.
+ */
+__rte_experimental
+int rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Get the current resume latency of this logical core.
+ * The default value in the kernel is RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT
+ * if it has not been set.
+ *
+ * @return
+ * Negative value on failure.
+ * >= 0 means the actual resume latency limit on this core.
+ */
+__rte_experimental
+int rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* RTE_POWER_QOS_H */
diff --git a/lib/power/version.map b/lib/power/version.map
index ad92a65f91..81b8ff11b7 100644
--- a/lib/power/version.map
+++ b/lib/power/version.map
@@ -51,4 +51,6 @@ EXPERIMENTAL {
rte_power_set_uncore_env;
rte_power_uncore_freqs;
rte_power_unset_uncore_env;
+ rte_power_qos_set_cpu_resume_latency;
+ rte_power_qos_get_cpu_resume_latency;
};
--
2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v3 1/2] power: introduce PM QoS API on CPU wide
2024-06-19 6:31 ` [PATCH v3 1/2] power: introduce PM QoS API on CPU wide Huisong Li
@ 2024-06-19 14:56 ` Stephen Hemminger
2024-06-20 2:22 ` lihuisong (C)
2024-06-19 15:32 ` Thomas Monjalon
1 sibling, 1 reply; 114+ messages in thread
From: Stephen Hemminger @ 2024-06-19 14:56 UTC (permalink / raw)
To: Huisong Li
Cc: dev, mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, david.marchand, liuyonglong
On Wed, 19 Jun 2024 14:31:43 +0800
Huisong Li <lihuisong@huawei.com> wrote:
> +PM QoS
> +------
> +
> +The deeper the idle state, the lower the power consumption, but the longer
> +the resume time. Some services are delay sensitive and expect a low resume
> +time, for example the interrupt packet receiving mode.
> +
> +The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
> +interface lets userspace set and get the resume latency limit on cpuX.
> +Each cpuidle governor in Linux selects which idle state to enter based on
> +this CPU resume latency in its idle task.
> +
> +The per-CPU PM QoS API can be used to set and get the CPU resume latency.
> +
> +The ``rte_power_qos_set_cpu_resume_latency()`` function affects the idle state
> +selection of the worker CPU. Setting zero (strict resume latency) only allows
> +this CPU to enter the shallowest idle state.
> +
> +The ``rte_power_qos_get_cpu_resume_latency()`` function returns the resume
> +latency of the specified CPU.
> +
Wording of this is hard to read and needs to be reworded for clarity.
Explain more what PM QoS is to the user.
Also, not sure if details about the sysfs implementation are helpful.
Should also say this is Linux only.
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v3 1/2] power: introduce PM QoS API on CPU wide
2024-06-19 14:56 ` Stephen Hemminger
@ 2024-06-20 2:22 ` lihuisong (C)
0 siblings, 0 replies; 114+ messages in thread
From: lihuisong (C) @ 2024-06-20 2:22 UTC (permalink / raw)
To: Stephen Hemminger
Cc: dev, mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, david.marchand, liuyonglong
Hi Stephen,
Thanks for your review.
On 2024/6/19 22:56, Stephen Hemminger wrote:
> On Wed, 19 Jun 2024 14:31:43 +0800
> Huisong Li <lihuisong@huawei.com> wrote:
>
>> +PM QoS
>> +------
>> +
>> +The deeper the idle state, the lower the power consumption, but the longer
>> +the resume time. Some service are delay sensitive and very except the low
>> +resume time, like interrupt packet receiving mode.
>> +
>> +And the "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
>> +interface is used to set and get the resume latency limit on the cpuX for
>> +userspace. Each cpuidle governor in Linux select which idle state to enter
>> +based on this CPU resume latency in their idle task.
>> +
>> +The per-CPU PM QoS API can be used to set and get the CPU resume latency.
>> +
>> +The ``rte_power_qos_set_cpu_resume_latency()`` function can effect the work
>> +CPU's idle state selection and just allow to enter the shallowest idle state
>> +if set to zero (strict resume latency) for this CPU.
>> +
>> +The ``rte_power_qos_get_cpu_resume_latency()`` function can obtain the resume
>> +latency on specified CPU.
>> +
> Wording of this is hard to read and needs to be reworded for clarity.
> Explain more what PM QoS is to the user.
Yes, it's very important. How do you feel about the following description?
-->
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay sensitive and expect a low resume
time, such as the interrupt packet receiving mode.
On Linux, the "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us"
sysfs interface lets userspace set and get the resume latency limit of cpuX.
Each cpuidle governor in Linux selects which idle state to enter based on
this CPU resume latency in its idle task.
The per-CPU PM QoS API can be used to set and get the CPU resume latency
based on this sysfs interface.
The ``rte_power_qos_set_cpu_resume_latency()`` function controls the CPU's
idle state selection in Linux. Setting a strict resume latency (zero value)
limits the CPU to the shallowest idle state, which lowers the delay of
resuming service after sleeping.
The ``rte_power_qos_get_cpu_resume_latency()`` function gets the resume
latency on the specified CPU.
> Also, not sure if details about sysfs implementation is helpful.
It's just some short background.
IMO, it helps users make sense of this API.
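To make the sysfs background concrete, here is a minimal sketch of how the strings read from pm_qos_resume_latency_us map onto the values the proposed API reports. The function name qos_sysfs_str_to_latency() and the QOS_* macros are hypothetical stand-ins for illustration only and are not part of the patch; the mapping itself follows the comment in rte_power_qos.c ("n/a" means zero latency, "0" means no constraint).

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the macros proposed in rte_power_qos.h. */
#define QOS_STRICT_LATENCY 0
#define QOS_NO_CONSTRAINT ((int)(UINT32_MAX >> 1))

/*
 * Convert the text read from pm_qos_resume_latency_us into a latency
 * value in microseconds. A trailing newline is tolerated.
 * Returns -1 if the text is not a recognised value.
 */
static int
qos_sysfs_str_to_latency(const char *s)
{
	unsigned long v;
	char *end;

	/* "n/a" means that no resume latency is acceptable at all. */
	if (strncmp(s, "n/a", 3) == 0 && (s[3] == '\0' || s[3] == '\n'))
		return QOS_STRICT_LATENCY;

	v = strtoul(s, &end, 10);
	if (end == s || (*end != '\0' && *end != '\n') ||
	    v > (unsigned long)QOS_NO_CONSTRAINT)
		return -1;

	/* "0" means that there is no constraint on the resume latency. */
	return v == 0 ? QOS_NO_CONSTRAINT : (int)v;
}
```

The inversion between "n/a" and "0" is the part that is easy to get wrong, which is why a short note on the sysfs semantics seems worth keeping in the guide.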
> Should also say this is Linux only.
> .
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v3 1/2] power: introduce PM QoS API on CPU wide
2024-06-19 6:31 ` [PATCH v3 1/2] power: introduce PM QoS API on CPU wide Huisong Li
2024-06-19 14:56 ` Stephen Hemminger
@ 2024-06-19 15:32 ` Thomas Monjalon
2024-06-20 2:32 ` lihuisong (C)
1 sibling, 1 reply; 114+ messages in thread
From: Thomas Monjalon @ 2024-06-19 15:32 UTC (permalink / raw)
To: Huisong Li
Cc: dev, mb, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, david.marchand, liuyonglong, lihuisong
19/06/2024 08:31, Huisong Li:
> --- /dev/null
> +++ b/lib/power/rte_power_qos.h
> @@ -0,0 +1,71 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2024 HiSilicon Limited
> + */
> +
> +#ifndef RTE_POWER_QOS_H
> +#define RTE_POWER_QOS_H
> +
> +#include <rte_compat.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/**
> + * @file rte_power_qos.h
> + *
> + * PM QoS API.
> + *
> + * The CPU-wide resume latency limit has a positive impact on this CPU's idle
> + * state selection in each cpuidle governor.
> + * Please see the PM QoS on CPU wide in the following link:
> + * https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
> + *
> + * The deeper the idle state, the lower the power consumption, but the
> + * longer the resume time. Some service are delay sensitive and very except the
> + * low resume time, like interrupt packet receiving mode.
> + *
> + * In these case, per-CPU PM QoS API can be used to control this CPU's idle
> + * state selection and limit just enter the shallowest idle state to low the
> + * delay after sleep by setting strict resume latency (zero value).
> + */
> +
> +#define RTE_POWER_QOS_STRICT_LATENCY_VALUE 0
> +#define RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT ((int)(UINT32_MAX >> 1))
stdint.h include is missing
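For illustration, a minimal sketch of the self-contained form, assuming only the two macros quoted above. UINT32_MAX comes from stdint.h, so a file that includes only this header would fail to compile without the include. The static assertion is an addition for this sketch and is not part of the patch.

```c
#include <stdint.h>

#define RTE_POWER_QOS_STRICT_LATENCY_VALUE 0
#define RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT ((int)(UINT32_MAX >> 1))

/*
 * UINT32_MAX >> 1 is 0x7fffffff, so the cast to int keeps the value
 * wherever int is at least 32 bits wide.
 */
_Static_assert(RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT == INT32_MAX,
	       "no-constraint value is expected to equal INT32_MAX");
```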
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v3 1/2] power: introduce PM QoS API on CPU wide
2024-06-19 15:32 ` Thomas Monjalon
@ 2024-06-20 2:32 ` lihuisong (C)
0 siblings, 0 replies; 114+ messages in thread
From: lihuisong (C) @ 2024-06-20 2:32 UTC (permalink / raw)
To: Thomas Monjalon
Cc: dev, mb, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, david.marchand, liuyonglong
On 2024/6/19 23:32, Thomas Monjalon wrote:
> 19/06/2024 08:31, Huisong Li:
>> --- /dev/null
>> +++ b/lib/power/rte_power_qos.h
>> @@ -0,0 +1,71 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2024 HiSilicon Limited
>> + */
>> +
>> +#ifndef RTE_POWER_QOS_H
>> +#define RTE_POWER_QOS_H
>> +
>> +#include <rte_compat.h>
>> +
>> +#ifdef __cplusplus
>> +extern "C" {
>> +#endif
>> +
>> +/**
>> + * @file rte_power_qos.h
>> + *
>> + * PM QoS API.
>> + *
>> + * The CPU-wide resume latency limit has a positive impact on this CPU's idle
>> + * state selection in each cpuidle governor.
>> + * Please see the PM QoS on CPU wide in the following link:
>> + * https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
>> + *
>> + * The deeper the idle state, the lower the power consumption, but the
>> + * longer the resume time. Some service are delay sensitive and very except the
>> + * low resume time, like interrupt packet receiving mode.
>> + *
>> + * In these case, per-CPU PM QoS API can be used to control this CPU's idle
>> + * state selection and limit just enter the shallowest idle state to low the
>> + * delay after sleep by setting strict resume latency (zero value).
>> + */
>> +
>> +#define RTE_POWER_QOS_STRICT_LATENCY_VALUE 0
>> +#define RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT ((int)(UINT32_MAX >> 1))
> stdint.h include is missing
Yes, without it the header file is not self-contained.
I will add it in the next version, thanks.
>
>
>
> .
^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH v3 2/2] examples/l3fwd-power: add PM QoS configuration
2024-06-19 6:31 ` [PATCH v3 0/2] power: introduce PM QoS interface Huisong Li
2024-06-19 6:31 ` [PATCH v3 1/2] power: introduce PM QoS API on CPU wide Huisong Li
@ 2024-06-19 6:31 ` Huisong Li
2024-06-19 14:54 ` Stephen Hemminger
2024-06-19 6:59 ` [PATCH v3 0/2] power: introduce PM QoS interface Morten Brørup
2 siblings, 1 reply; 114+ messages in thread
From: Huisong Li @ 2024-06-19 6:31 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, david.marchand, liuyonglong, lihuisong
Add PM QoS configuration to decrease the delay after sleep in case of
entering a deeper idle state.
Signed-off-by: Huisong Li <lihuisong@huawei.com>
---
examples/l3fwd-power/main.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index fba11da7ca..d263cfef0a 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
#include <rte_telemetry.h>
#include <rte_power_pmd_mgmt.h>
#include <rte_power_uncore.h>
+#include <rte_power_qos.h>
#include "perf_core.h"
#include "main.h"
@@ -2259,6 +2260,25 @@ init_power_library(void)
return -1;
}
}
+
+ RTE_LCORE_FOREACH(lcore_id) {
+ if (rte_lcore_is_enabled(lcore_id) == 0)
+ continue;
+ /*
+ * Set the work CPU with strict latency limit to allow the
+ * process running on the CPU can only enter the shallowest
+ * idle state.
+ */
+ ret = rte_power_qos_set_cpu_resume_latency(lcore_id,
+ RTE_POWER_QOS_STRICT_LATENCY_VALUE);
+ if (ret < 0) {
+ RTE_LOG(ERR, L3FWD_POWER,
+ "Failed to set strict resume latency on CPU%u.\n",
+ lcore_id);
+ return ret;
+ }
+ }
+
return ret;
}
@@ -2298,6 +2318,15 @@ deinit_power_library(void)
}
}
}
+
+ RTE_LCORE_FOREACH(lcore_id) {
+ if (rte_lcore_is_enabled(lcore_id) == 0)
+ continue;
+ /* Restore the original value in kernel. */
+ rte_power_qos_set_cpu_resume_latency(lcore_id,
+ RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT);
+ }
+
return ret;
}
--
2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v3 2/2] examples/l3fwd-power: add PM QoS configuration
2024-06-19 6:31 ` [PATCH v3 2/2] examples/l3fwd-power: add PM QoS configuration Huisong Li
@ 2024-06-19 14:54 ` Stephen Hemminger
2024-06-20 2:24 ` lihuisong (C)
0 siblings, 1 reply; 114+ messages in thread
From: Stephen Hemminger @ 2024-06-19 14:54 UTC (permalink / raw)
To: Huisong Li
Cc: dev, mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, david.marchand, liuyonglong
On Wed, 19 Jun 2024 14:31:44 +0800
Huisong Li <lihuisong@huawei.com> wrote:
> + /*
> + * Set the work CPU with strict latency limit to allow the
> + * process running on the CPU can only enter the shallowest
> + * idle state.
> + */
Wording of that comment is awkward to read.
Suggest:
Set the worker lcore's to have strict latency limit to allow the
CPU to enter the shallowest idle state.
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v3 2/2] examples/l3fwd-power: add PM QoS configuration
2024-06-19 14:54 ` Stephen Hemminger
@ 2024-06-20 2:24 ` lihuisong (C)
0 siblings, 0 replies; 114+ messages in thread
From: lihuisong (C) @ 2024-06-20 2:24 UTC (permalink / raw)
To: Stephen Hemminger
Cc: dev, mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, david.marchand, liuyonglong
On 2024/6/19 22:54, Stephen Hemminger wrote:
> On Wed, 19 Jun 2024 14:31:44 +0800
> Huisong Li <lihuisong@huawei.com> wrote:
>
>> + /*
>> + * Set the work CPU with strict latency limit to allow the
>> + * process running on the CPU can only enter the shallowest
>> + * idle state.
>> + */
> Wording of that comment is awkward to read.
>
> Suggest:
>
> Set the worker lcore's to have strict latency limit to allow the
> CPU to enter the shallowest idle state.
Ack. Thanks for your suggestion.
> .
^ permalink raw reply [flat|nested] 114+ messages in thread
* RE: [PATCH v3 0/2] power: introduce PM QoS interface
2024-06-19 6:31 ` [PATCH v3 0/2] power: introduce PM QoS interface Huisong Li
2024-06-19 6:31 ` [PATCH v3 1/2] power: introduce PM QoS API on CPU wide Huisong Li
2024-06-19 6:31 ` [PATCH v3 2/2] examples/l3fwd-power: add PM QoS configuration Huisong Li
@ 2024-06-19 6:59 ` Morten Brørup
2 siblings, 0 replies; 114+ messages in thread
From: Morten Brørup @ 2024-06-19 6:59 UTC (permalink / raw)
To: Huisong Li, dev
Cc: thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, david.marchand, liuyonglong
> From: Huisong Li [mailto:lihuisong@huawei.com]
> Sent: Wednesday, 19 June 2024 08.32
>
> The deeper the idle state, the lower the power consumption, but the longer
> the resume time. Some service are delay sensitive and very except the low
> resume time, like interrupt packet receiving mode.
>
> And the "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
> interface is used to set and get the resume latency limit on the cpuX for
> userspace. Please see the description in kernel document[1].
> Each cpuidle governor in Linux select which idle state to enter based on
> this CPU resume latency in their idle task.
>
> The per-CPU PM QoS API can be used to control this CPU's idle state
> selection and limit just enter the shallowest idle state to low the delay
> after sleep by setting strict resume latency (zero value).
>
> [1] https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
>
> ---
> v3:
> - add RTE_POWER_xxx prefix for some macro in header
> - add the check for lcore_id with rte_lcore_is_enabled
> v2:
> - use PM QoS on CPU wide to replace the one on system wide
Series-acked-by: Morten Brørup <mb@smartsharesystems.com>
^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH v4 0/2] power: introduce PM QoS interface
2024-03-20 10:55 [PATCH 0/2] introduce PM QoS interface Huisong Li
` (4 preceding siblings ...)
2024-06-19 6:31 ` [PATCH v3 0/2] power: introduce PM QoS interface Huisong Li
@ 2024-06-27 6:00 ` Huisong Li
2024-06-27 6:00 ` [PATCH v4 1/2] power: introduce PM QoS API on CPU wide Huisong Li
2024-06-27 6:00 ` [PATCH v4 2/2] examples/l3fwd-power: add PM QoS configuration Huisong Li
2024-07-02 3:50 ` [PATCH v5 0/2] power: introduce PM QoS interface Huisong Li
` (9 subsequent siblings)
15 siblings, 2 replies; 114+ messages in thread
From: Huisong Li @ 2024-06-27 6:00 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, liuyonglong,
lihuisong
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay sensitive and expect a low resume
time, such as the interrupt packet receiving mode.
The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
interface lets userspace set and get the resume latency limit of cpuX.
Please see the description in the kernel documentation[1].
Each cpuidle governor in Linux selects which idle state to enter based on
this CPU resume latency in its idle task.
The per-CPU PM QoS API can be used to control this CPU's idle state
selection. Setting a strict resume latency (zero value) limits the CPU to
the shallowest idle state, which lowers the delay after sleep.
[1] https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
---
v4:
- fix some comments based on Stephen's review
- add stdint.h include
- add Acked-by Morten Brørup <mb@smartsharesystems.com>
v3:
- add RTE_POWER_xxx prefix for some macro in header
- add the check for lcore_id with rte_lcore_is_enabled
v2:
- use PM QoS on CPU wide to replace the one on system wide
Huisong Li (2):
power: introduce PM QoS API on CPU wide
examples/l3fwd-power: add PM QoS configuration
doc/guides/prog_guide/power_man.rst | 24 ++++++
doc/guides/rel_notes/release_24_07.rst | 4 +
examples/l3fwd-power/main.c | 28 ++++++
lib/power/meson.build | 2 +
lib/power/rte_power_qos.c | 114 +++++++++++++++++++++++++
lib/power/rte_power_qos.h | 73 ++++++++++++++++
lib/power/version.map | 2 +
7 files changed, 247 insertions(+)
create mode 100644 lib/power/rte_power_qos.c
create mode 100644 lib/power/rte_power_qos.h
--
2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH v4 1/2] power: introduce PM QoS API on CPU wide
2024-06-27 6:00 ` [PATCH v4 " Huisong Li
@ 2024-06-27 6:00 ` Huisong Li
2024-06-27 15:06 ` Stephen Hemminger
2024-06-27 6:00 ` [PATCH v4 2/2] examples/l3fwd-power: add PM QoS configuration Huisong Li
1 sibling, 1 reply; 114+ messages in thread
From: Huisong Li @ 2024-06-27 6:00 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, liuyonglong,
lihuisong
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay sensitive and expect a low resume
time, such as the interrupt packet receiving mode.
The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
interface lets userspace set and get the resume latency limit of cpuX.
Each cpuidle governor in Linux selects which idle state to enter based on
this CPU resume latency in its idle task.
The per-CPU PM QoS API can be used to control this CPU's idle state
selection. Setting a strict resume latency (zero value) limits the CPU to
the shallowest idle state, which lowers the delay after sleep.
Signed-off-by: Huisong Li <lihuisong@huawei.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
doc/guides/prog_guide/power_man.rst | 24 ++++++
doc/guides/rel_notes/release_24_07.rst | 4 +
lib/power/meson.build | 2 +
lib/power/rte_power_qos.c | 114 +++++++++++++++++++++++++
lib/power/rte_power_qos.h | 73 ++++++++++++++++
lib/power/version.map | 2 +
6 files changed, 219 insertions(+)
create mode 100644 lib/power/rte_power_qos.c
create mode 100644 lib/power/rte_power_qos.h
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index f6674efe2d..faa32b4320 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -249,6 +249,30 @@ Get Num Pkgs
Get Num Dies
Get the number of die's on a given package.
+
+PM QoS
+------
+
+The deeper the idle state, the lower the power consumption, but the longer
+the resume time. Some services are delay sensitive and expect a low resume
+time, such as the interrupt packet receiving mode.
+
+On Linux, the "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us"
+sysfs interface lets userspace set and get the resume latency limit of cpuX.
+Each cpuidle governor in Linux selects which idle state to enter based on
+this CPU resume latency in its idle task.
+
+The per-CPU PM QoS API can be used to set and get the CPU resume latency based
+on this sysfs interface.
+
+The ``rte_power_qos_set_cpu_resume_latency()`` function controls the CPU's
+idle state selection in Linux. Setting a strict resume latency (zero value)
+limits the CPU to the shallowest idle state, which lowers the delay of
+resuming service after sleeping.
+
+The ``rte_power_qos_get_cpu_resume_latency()`` function gets the resume
+latency on the specified CPU.
+
References
----------
diff --git a/doc/guides/rel_notes/release_24_07.rst b/doc/guides/rel_notes/release_24_07.rst
index e68a53d757..4de96f60ac 100644
--- a/doc/guides/rel_notes/release_24_07.rst
+++ b/doc/guides/rel_notes/release_24_07.rst
@@ -89,6 +89,10 @@ New Features
* Added SSE/NEON vector datapath.
+* **Introduced PM QoS interface.**
+
+ * Introduced per-CPU PM QoS interface to lower the delay after sleep.
+
Removed Items
-------------
diff --git a/lib/power/meson.build b/lib/power/meson.build
index b8426589b2..8222e178b0 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -23,12 +23,14 @@ sources = files(
'rte_power.c',
'rte_power_uncore.c',
'rte_power_pmd_mgmt.c',
+ 'rte_power_qos.c',
)
headers = files(
'rte_power.h',
'rte_power_guest_channel.h',
'rte_power_pmd_mgmt.h',
'rte_power_uncore.h',
+ 'rte_power_qos.h',
)
if cc.has_argument('-Wno-cast-qual')
cflags += '-Wno-cast-qual'
diff --git a/lib/power/rte_power_qos.c b/lib/power/rte_power_qos.c
new file mode 100644
index 0000000000..b131cf58e7
--- /dev/null
+++ b/lib/power/rte_power_qos.c
@@ -0,0 +1,114 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#include <errno.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <rte_lcore.h>
+#include <rte_log.h>
+
+#include "power_common.h"
+#include "rte_power_qos.h"
+
+#define PM_QOS_SYSFILE_RESUME_LATENCY_US \
+ "/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us"
+
+int
+rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency)
+{
+ char buf[BUFSIZ] = {0};
+ FILE *f;
+ int ret;
+
+ if (!rte_lcore_is_enabled(lcore_id)) {
+ POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
+ return -EINVAL;
+ }
+
+ if (latency < 0) {
+ POWER_LOG(ERR, "latency should be greater than or equal to 0");
+ return -EINVAL;
+ }
+
+ ret = open_core_sysfs_file(&f, "w", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ return ret;
+ }
+
+ /*
+ * Based on the sysfs interface pm_qos_resume_latency_us at the
+ * @PM_QOS_SYSFILE_RESUME_LATENCY_US path in the kernel, the meaning
+ * of each input string is as follows.
+ * 1> the resume latency is 0 if the input is "n/a".
+ * 2> the resume latency is no constraint if the input is "0".
+ * 3> the resume latency is the actual value to be set.
+ */
+ if (latency == 0)
+ sprintf(buf, "%s", "n/a");
+ else if (latency == RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT)
+ sprintf(buf, "%u", 0);
+ else
+ sprintf(buf, "%u", latency);
+
+ ret = write_core_sysfs_s(f, buf);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to write "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ goto out;
+ }
+
+out:
+ if (f != NULL)
+ fclose(f);
+
+ return ret;
+}
+
+int
+rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id)
+{
+ char buf[BUFSIZ];
+ int latency = -1;
+ FILE *f;
+ int ret;
+
+ if (!rte_lcore_is_enabled(lcore_id)) {
+ POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
+ return -EINVAL;
+ }
+
+ ret = open_core_sysfs_file(&f, "r", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ return ret;
+ }
+
+ ret = read_core_sysfs_s(f, buf, sizeof(buf));
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to read "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ goto out;
+ }
+
+ /*
+ * Based on the sysfs interface pm_qos_resume_latency_us at the
+ * @PM_QOS_SYSFILE_RESUME_LATENCY_US path in the kernel, the meaning
+ * of each output string is as follows.
+ * 1> the resume latency is 0 if the output is "n/a".
+ * 2> the resume latency is no constraint if the output is "0".
+ * 3> the resume latency is the actual value in use for any other string.
+ */
+ if (strcmp(buf, "n/a") == 0)
+ latency = 0;
+ else {
+ latency = strtoul(buf, NULL, 10);
+ latency = latency == 0 ? RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT : latency;
+ }
+
+out:
+ if (f != NULL)
+ fclose(f);
+
+ return latency != -1 ? latency : ret;
+}
diff --git a/lib/power/rte_power_qos.h b/lib/power/rte_power_qos.h
new file mode 100644
index 0000000000..990c488373
--- /dev/null
+++ b/lib/power/rte_power_qos.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#ifndef RTE_POWER_QOS_H
+#define RTE_POWER_QOS_H
+
+#include <stdint.h>
+
+#include <rte_compat.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * @file rte_power_qos.h
+ *
+ * PM QoS API.
+ *
+ * The CPU-wide resume latency limit has a positive impact on this CPU's idle
+ * state selection in each cpuidle governor.
+ * Please see the description of the per-CPU PM QoS at the following link:
+ * https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
+ *
+ * The deeper the idle state, the lower the power consumption, but the
+ * longer the resume time. Some services are delay sensitive and expect a
+ * low resume time, such as the interrupt packet receiving mode.
+ *
+ * In these cases, the per-CPU PM QoS API can be used to control this CPU's
+ * idle state selection. Setting a strict resume latency (zero value) limits
+ * the CPU to the shallowest idle state, which lowers the delay after sleep.
+ */
+
+#define RTE_POWER_QOS_STRICT_LATENCY_VALUE 0
+#define RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT ((int)(UINT32_MAX >> 1))
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * @param lcore_id
+ * target logical core id
+ *
+ * @param latency
+ * The latency in microseconds. It must be greater than or equal to zero.
+ *
+ * @return
+ * 0 on success. Otherwise negative value is returned.
+ */
+__rte_experimental
+int rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Get the current resume latency of this logical core.
+ * The default value in the kernel is RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT
+ * if it has not been set.
+ *
+ * @return
+ * Negative value on failure.
+ * >= 0 means the actual resume latency limit on this core.
+ */
+__rte_experimental
+int rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* RTE_POWER_QOS_H */
diff --git a/lib/power/version.map b/lib/power/version.map
index ad92a65f91..81b8ff11b7 100644
--- a/lib/power/version.map
+++ b/lib/power/version.map
@@ -51,4 +51,6 @@ EXPERIMENTAL {
rte_power_set_uncore_env;
rte_power_uncore_freqs;
rte_power_unset_uncore_env;
+ rte_power_qos_set_cpu_resume_latency;
+ rte_power_qos_get_cpu_resume_latency;
};
--
2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v4 1/2] power: introduce PM QoS API on CPU wide
2024-06-27 6:00 ` [PATCH v4 1/2] power: introduce PM QoS API on CPU wide Huisong Li
@ 2024-06-27 15:06 ` Stephen Hemminger
2024-06-28 4:07 ` lihuisong (C)
0 siblings, 1 reply; 114+ messages in thread
From: Stephen Hemminger @ 2024-06-27 15:06 UTC (permalink / raw)
To: Huisong Li
Cc: dev, mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, david.marchand, liuyonglong
On Thu, 27 Jun 2024 14:00:10 +0800
Huisong Li <lihuisong@huawei.com> wrote:
> + char buf[BUFSIZ] = {0};
BUFSIZ is 4K and you probably don't need all of that.
And initializing to 0 here should not be needed.
Why not:
char buf[LINE_MAX];
> + if (latency == 0)
> + sprintf(buf, "%s", "n/a");
> + else if (latency == RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT)
> + sprintf(buf, "%u", 0);
> + else
> + sprintf(buf, "%u", latency);
Use snprintf instead.
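A minimal sketch of what the bounded formatting could look like, assuming the caller supplies the buffer (for example a char buf[LINE_MAX] as suggested above). The helper name qos_latency_to_sysfs_str() and the QOS_NO_CONSTRAINT macro are hypothetical and only stand in for the code under review; %d is used because latency is an int.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT. */
#define QOS_NO_CONSTRAINT ((int)(UINT32_MAX >> 1))

/*
 * Format the string to be written to pm_qos_resume_latency_us.
 * Returns 0 on success, or -1 if latency is negative or the output
 * does not fit into buf.
 */
static int
qos_latency_to_sysfs_str(int latency, char *buf, size_t len)
{
	int n;

	if (latency < 0)
		return -1;

	if (latency == 0)
		n = snprintf(buf, len, "%s", "n/a");
	else if (latency == QOS_NO_CONSTRAINT)
		n = snprintf(buf, len, "%d", 0);
	else
		n = snprintf(buf, len, "%d", latency);

	/* snprintf() returns the length it wanted to write, so check it. */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Checking the return value means a too-small buffer is reported as an error instead of a truncated value being written to sysfs.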
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v4 1/2] power: introduce PM QoS API on CPU wide
2024-06-27 15:06 ` Stephen Hemminger
@ 2024-06-28 4:07 ` lihuisong (C)
0 siblings, 0 replies; 114+ messages in thread
From: lihuisong (C) @ 2024-06-28 4:07 UTC (permalink / raw)
To: Stephen Hemminger
Cc: dev, mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, david.marchand, liuyonglong
On 2024/6/27 23:06, Stephen Hemminger wrote:
> On Thu, 27 Jun 2024 14:00:10 +0800
> Huisong Li <lihuisong@huawei.com> wrote:
>
>> + char buf[BUFSIZ] = {0};
> BUFSIZ is 4K and you probably don't need all of that.
As I remember, the maximum buffer length of a sysfs show in Linux is BUFSIZ,
so a buffer of the same size is sure to hold the data received from Linux.
But LINE_MAX is also large enough.
> And initializing to 0 here should not be needed.
Ack
>
> Why not:
> char buf[LINE_MAX];
Thanks for your suggestion. I will use it in the next version.
>> + if (latency == 0)
>> + sprintf(buf, "%s", "n/a");
>> + else if (latency == RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT)
>> + sprintf(buf, "%u", 0);
>> + else
>> + sprintf(buf, "%u", latency);
> Use snprintf instead.
Ack
> .
^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH v4 2/2] examples/l3fwd-power: add PM QoS configuration
2024-06-27 6:00 ` [PATCH v4 " Huisong Li
2024-06-27 6:00 ` [PATCH v4 1/2] power: introduce PM QoS API on CPU wide Huisong Li
@ 2024-06-27 6:00 ` Huisong Li
1 sibling, 0 replies; 114+ messages in thread
From: Huisong Li @ 2024-06-27 6:00 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, liuyonglong,
lihuisong
Add PM QoS configuration to decrease the delay after sleep in case of
entering a deeper idle state.
Signed-off-by: Huisong Li <lihuisong@huawei.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
examples/l3fwd-power/main.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index fba11da7ca..74a07afc6c 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
#include <rte_telemetry.h>
#include <rte_power_pmd_mgmt.h>
#include <rte_power_uncore.h>
+#include <rte_power_qos.h>
#include "perf_core.h"
#include "main.h"
@@ -2259,6 +2260,24 @@ init_power_library(void)
return -1;
}
}
+
+ RTE_LCORE_FOREACH(lcore_id) {
+ if (rte_lcore_is_enabled(lcore_id) == 0)
+ continue;
+ /*
+ * Set a strict latency limit on the worker lcores so that the
+ * CPU can only enter the shallowest idle state.
+ */
+ ret = rte_power_qos_set_cpu_resume_latency(lcore_id,
+ RTE_POWER_QOS_STRICT_LATENCY_VALUE);
+ if (ret < 0) {
+ RTE_LOG(ERR, L3FWD_POWER,
+ "Failed to set strict resume latency on CPU%u.\n",
+ lcore_id);
+ return ret;
+ }
+ }
+
return ret;
}
@@ -2298,6 +2317,15 @@ deinit_power_library(void)
}
}
}
+
+ RTE_LCORE_FOREACH(lcore_id) {
+ if (rte_lcore_is_enabled(lcore_id) == 0)
+ continue;
+ /* Restore the original value in kernel. */
+ rte_power_qos_set_cpu_resume_latency(lcore_id,
+ RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT);
+ }
+
return ret;
}
--
2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH v5 0/2] power: introduce PM QoS interface
2024-03-20 10:55 [PATCH 0/2] introduce PM QoS interface Huisong Li
` (5 preceding siblings ...)
2024-06-27 6:00 ` [PATCH v4 " Huisong Li
@ 2024-07-02 3:50 ` Huisong Li
2024-07-02 3:50 ` [PATCH v5 1/2] power: introduce PM QoS API on CPU wide Huisong Li
2024-07-02 3:50 ` [PATCH v5 2/2] examples/l3fwd-power: add PM QoS configuration Huisong Li
2024-07-09 2:29 ` [PATCH v6 0/2] power: introduce PM QoS interface Huisong Li
` (8 subsequent siblings)
15 siblings, 2 replies; 114+ messages in thread
From: Huisong Li @ 2024-07-02 3:50 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, liuyonglong,
lihuisong
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay sensitive and expect a low resume
time, such as the interrupt packet receiving mode.
The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
interface lets userspace set and get the resume latency limit of cpuX.
Please see the description in the kernel documentation[1].
Each cpuidle governor in Linux selects which idle state to enter based on
this CPU resume latency in its idle task.
The per-CPU PM QoS API can be used to control this CPU's idle state
selection. Setting a strict resume latency (zero value) limits the CPU to
the shallowest idle state, which lowers the delay after sleep.
[1] https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
---
v5:
- use LINE_MAX to replace BUFSIZ, and use snprintf to replace sprintf.
v4:
- fix some comments based on Stephen's review
- add stdint.h include
- add Acked-by Morten Brørup <mb@smartsharesystems.com>
v3:
- add RTE_POWER_xxx prefix for some macro in header
- add the check for lcore_id with rte_lcore_is_enabled
v2:
- use PM QoS on CPU wide to replace the one on system wide
Huisong Li (2):
power: introduce PM QoS API on CPU wide
examples/l3fwd-power: add PM QoS configuration
doc/guides/prog_guide/power_man.rst | 24 ++++++
doc/guides/rel_notes/release_24_07.rst | 4 +
examples/l3fwd-power/main.c | 28 ++++++
lib/power/meson.build | 2 +
lib/power/rte_power_qos.c | 114 +++++++++++++++++++++++++
lib/power/rte_power_qos.h | 73 ++++++++++++++++
lib/power/version.map | 2 +
7 files changed, 247 insertions(+)
create mode 100644 lib/power/rte_power_qos.c
create mode 100644 lib/power/rte_power_qos.h
--
2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH v5 1/2] power: introduce PM QoS API on CPU wide
2024-07-02 3:50 ` [PATCH v5 0/2] power: introduce PM QoS interface Huisong Li
@ 2024-07-02 3:50 ` Huisong Li
2024-07-03 1:32 ` zhoumin
2024-07-02 3:50 ` [PATCH v5 2/2] examples/l3fwd-power: add PM QoS configuration Huisong Li
1 sibling, 1 reply; 114+ messages in thread
From: Huisong Li @ 2024-07-02 3:50 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, liuyonglong,
lihuisong
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay-sensitive and expect a low resume
time, such as interrupt packet receiving mode.
The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
interface is used by userspace to set and get the resume latency limit on
cpuX. Each cpuidle governor in Linux selects which idle state to enter
based on this CPU resume latency in its idle task.
The per-CPU PM QoS API can be used to control this CPU's idle state
selection. Setting a strict resume latency (zero value) limits the CPU to
the shallowest idle state, which lowers the delay after sleep.
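A minimal sketch of the value mapping between the API's integer latency and the string exchanged with the sysfs file, following the comments in rte_power_qos.c in this patch. The names qos_latency_to_sysfs(), qos_sysfs_to_latency() and QOS_NO_CONSTRAINT are hypothetical and used only for illustration; the sketch touches neither sysfs nor DPDK.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT. */
#define QOS_NO_CONSTRAINT ((int)(UINT32_MAX >> 1))

/*
 * Encode an API latency value into the string the kernel expects:
 * 0 (strict) -> "n/a", no constraint -> "0", otherwise the decimal value.
 * Returns 0 on success, -1 if latency is negative or buf is too small.
 */
static int
qos_latency_to_sysfs(int latency, char *buf, size_t len)
{
	int n;

	if (latency < 0)
		return -1;
	if (latency == 0)
		n = snprintf(buf, len, "%s", "n/a");
	else if (latency == QOS_NO_CONSTRAINT)
		n = snprintf(buf, len, "%d", 0);
	else
		n = snprintf(buf, len, "%d", latency);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Decode a sysfs string (trailing newline already stripped) into an API value. */
static int
qos_sysfs_to_latency(const char *s)
{
	int latency;

	if (strcmp(s, "n/a") == 0)
		return 0;
	latency = (int)strtoul(s, NULL, 10);
	return latency == 0 ? QOS_NO_CONSTRAINT : latency;
}
```

Under these assumptions, a strict request (0) is written as "n/a", while reading back "0" is reported as the no-constraint value, so the two zero-like cases are never confused by the caller.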
Signed-off-by: Huisong Li <lihuisong@huawei.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
doc/guides/prog_guide/power_man.rst | 24 ++++++
doc/guides/rel_notes/release_24_07.rst | 4 +
lib/power/meson.build | 2 +
lib/power/rte_power_qos.c | 114 +++++++++++++++++++++++++
lib/power/rte_power_qos.h | 73 ++++++++++++++++
lib/power/version.map | 2 +
6 files changed, 219 insertions(+)
create mode 100644 lib/power/rte_power_qos.c
create mode 100644 lib/power/rte_power_qos.h
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index f6674efe2d..faa32b4320 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -249,6 +249,30 @@ Get Num Pkgs
Get Num Dies
Get the number of die's on a given package.
+
+PM QoS
+------
+
+The deeper the idle state, the lower the power consumption, but the longer
+the resume time. Some services are delay-sensitive and expect a low resume
+time, such as interrupt packet receiving mode.
+
+The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
+interface is used by userspace to set and get the resume latency limit on
+cpuX. Each cpuidle governor in Linux selects which idle state to enter
+based on this CPU resume latency in its idle task.
+
+The per-CPU PM QoS API can be used to set and get the CPU resume latency based
+on this sysfs interface.
+
+The ``rte_power_qos_set_cpu_resume_latency()`` function can control the CPU's
+idle state selection in Linux. Setting a strict resume latency (zero value)
+limits the CPU to the shallowest idle state, which lowers the delay of
+resuming service after sleeping.
+
+The ``rte_power_qos_get_cpu_resume_latency()`` function gets the resume
+latency of the specified CPU.
+
References
----------
diff --git a/doc/guides/rel_notes/release_24_07.rst b/doc/guides/rel_notes/release_24_07.rst
index e68a53d757..4de96f60ac 100644
--- a/doc/guides/rel_notes/release_24_07.rst
+++ b/doc/guides/rel_notes/release_24_07.rst
@@ -89,6 +89,10 @@ New Features
* Added SSE/NEON vector datapath.
+* **Introduce PM QoS interface.**
+
+ * Introduce per-CPU PM QoS interface to lower the delay after sleep.
+
Removed Items
-------------
diff --git a/lib/power/meson.build b/lib/power/meson.build
index b8426589b2..8222e178b0 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -23,12 +23,14 @@ sources = files(
'rte_power.c',
'rte_power_uncore.c',
'rte_power_pmd_mgmt.c',
+ 'rte_power_qos.c',
)
headers = files(
'rte_power.h',
'rte_power_guest_channel.h',
'rte_power_pmd_mgmt.h',
'rte_power_uncore.h',
+ 'rte_power_qos.h',
)
if cc.has_argument('-Wno-cast-qual')
cflags += '-Wno-cast-qual'
diff --git a/lib/power/rte_power_qos.c b/lib/power/rte_power_qos.c
new file mode 100644
index 0000000000..375746f832
--- /dev/null
+++ b/lib/power/rte_power_qos.c
@@ -0,0 +1,114 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#include <errno.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <rte_lcore.h>
+#include <rte_log.h>
+
+#include "power_common.h"
+#include "rte_power_qos.h"
+
+#define PM_QOS_SYSFILE_RESUME_LATENCY_US \
+ "/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us"
+
+int
+rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency)
+{
+ char buf[LINE_MAX];
+ FILE *f;
+ int ret;
+
+ if (!rte_lcore_is_enabled(lcore_id)) {
+ POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
+ return -EINVAL;
+ }
+
+ if (latency < 0) {
+ POWER_LOG(ERR, "latency should be greater than or equal to 0");
+ return -EINVAL;
+ }
+
+ ret = open_core_sysfs_file(&f, "w", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ return ret;
+ }
+
+ /*
+ * The pm_qos_resume_latency_us sysfs file in the kernel, see
+ * @PM_QOS_SYSFILE_RESUME_LATENCY_US, interprets the input string as
+ * follows:
+ * 1> "n/a" sets a resume latency of 0 (strict).
+ * 2> "0" sets no resume latency constraint.
+ * 3> any other number is the resume latency to set, in microseconds.
+ */
+ if (latency == 0)
+ snprintf(buf, sizeof(buf), "%s", "n/a");
+ else if (latency == RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT)
+ snprintf(buf, sizeof(buf), "%u", 0);
+ else
+ snprintf(buf, sizeof(buf), "%u", latency);
+
+ ret = write_core_sysfs_s(f, buf);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to write "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ goto out;
+ }
+
+out:
+ if (f != NULL)
+ fclose(f);
+
+ return ret;
+}
+
+int
+rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id)
+{
+ char buf[LINE_MAX];
+ int latency = -1;
+ FILE *f;
+ int ret;
+
+ if (!rte_lcore_is_enabled(lcore_id)) {
+ POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
+ return -EINVAL;
+ }
+
+ ret = open_core_sysfs_file(&f, "r", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ return ret;
+ }
+
+ ret = read_core_sysfs_s(f, buf, sizeof(buf));
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to read "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ goto out;
+ }
+
+ /*
+ * The pm_qos_resume_latency_us sysfs file in the kernel, see
+ * @PM_QOS_SYSFILE_RESUME_LATENCY_US, reports the output string as
+ * follows:
+ * 1> "n/a" means a resume latency of 0 (strict).
+ * 2> "0" means no resume latency constraint.
+ * 3> any other number is the resume latency in use, in microseconds.
+ */
+ if (strcmp(buf, "n/a") == 0)
+ latency = 0;
+ else {
+ latency = strtoul(buf, NULL, 10);
+ latency = latency == 0 ? RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT : latency;
+ }
+
+out:
+ if (f != NULL)
+ fclose(f);
+
+ return latency != -1 ? latency : ret;
+}
diff --git a/lib/power/rte_power_qos.h b/lib/power/rte_power_qos.h
new file mode 100644
index 0000000000..990c488373
--- /dev/null
+++ b/lib/power/rte_power_qos.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#ifndef RTE_POWER_QOS_H
+#define RTE_POWER_QOS_H
+
+#include <stdint.h>
+
+#include <rte_compat.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * @file rte_power_qos.h
+ *
+ * PM QoS API.
+ *
+ * The per-CPU resume latency limit affects this CPU's idle state selection
+ * in each cpuidle governor.
+ * See the kernel documentation of the per-CPU PM QoS interface:
+ * https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
+ *
+ * The deeper the idle state, the lower the power consumption, but the
+ * longer the resume time. Some services are delay-sensitive and expect a
+ * low resume time, such as interrupt packet receiving mode.
+ *
+ * In these cases, the per-CPU PM QoS API can be used to control this CPU's
+ * idle state selection. Setting a strict resume latency (zero value) limits
+ * the CPU to the shallowest idle state, which lowers the delay after sleep.
+ */
+
+#define RTE_POWER_QOS_STRICT_LATENCY_VALUE 0
+#define RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT ((int)(UINT32_MAX >> 1))
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * @param lcore_id
+ * target logical core id
+ *
+ * @param latency
+ * The resume latency in microseconds; must be greater than or equal to zero.
+ *
+ * @return
+ * 0 on success. Otherwise negative value is returned.
+ */
+__rte_experimental
+int rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Get the current resume latency of this logical core.
+ * If it has not been set, the default value in the kernel is
+ * @see RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT.
+ *
+ * @return
+ * Negative value on failure.
+ * >= 0 means the actual resume latency limit on this core.
+ */
+__rte_experimental
+int rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* RTE_POWER_QOS_H */
diff --git a/lib/power/version.map b/lib/power/version.map
index ad92a65f91..81b8ff11b7 100644
--- a/lib/power/version.map
+++ b/lib/power/version.map
@@ -51,4 +51,6 @@ EXPERIMENTAL {
rte_power_set_uncore_env;
rte_power_uncore_freqs;
rte_power_unset_uncore_env;
+ rte_power_qos_set_cpu_resume_latency;
+ rte_power_qos_get_cpu_resume_latency;
};
--
2.22.0
* Re: [PATCH v5 1/2] power: introduce PM QoS API on CPU wide
2024-07-02 3:50 ` [PATCH v5 1/2] power: introduce PM QoS API on CPU wide Huisong Li
@ 2024-07-03 1:32 ` zhoumin
2024-07-03 2:52 ` lihuisong (C)
0 siblings, 1 reply; 114+ messages in thread
From: zhoumin @ 2024-07-03 1:32 UTC (permalink / raw)
To: Huisong Li, dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, liuyonglong
Hi Huisong,
I know from our previous communication that this patchset is based on the
*dpdk-next-net* repository. However, it may be better to rebase this
patchset on the *dpdk* repository, because the CI system is not smart
enough to recognize the patchset as a change for the *dpdk-next-net*
repository. I personally feel this patchset is more likely a change for
the *dpdk* repository because it modifies the `lib` directory, which is
the infrastructure of DPDK, rather than a feature for *dpdk-next-net*.
Best regards,
Min Zhou
On Tue, July 2, 2024 at 11:50AM, Huisong Li wrote:
> The deeper the idle state, the lower the power consumption, but the longer
> the resume time. Some services are delay-sensitive and expect a low resume
> time, such as interrupt packet receiving mode.
>
> The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
> interface is used by userspace to set and get the resume latency limit on
> cpuX. Each cpuidle governor in Linux selects which idle state to enter
> based on this CPU resume latency in its idle task.
>
> The per-CPU PM QoS API can be used to control this CPU's idle state
> selection. Setting a strict resume latency (zero value) limits the CPU to
> the shallowest idle state, which lowers the delay after sleep.
>
> Signed-off-by: Huisong Li <lihuisong@huawei.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> ---
> doc/guides/prog_guide/power_man.rst | 24 ++++++
> doc/guides/rel_notes/release_24_07.rst | 4 +
> lib/power/meson.build | 2 +
> lib/power/rte_power_qos.c | 114 +++++++++++++++++++++++++
> lib/power/rte_power_qos.h | 73 ++++++++++++++++
> lib/power/version.map | 2 +
> 6 files changed, 219 insertions(+)
> create mode 100644 lib/power/rte_power_qos.c
> create mode 100644 lib/power/rte_power_qos.h
<snip>
* Re: [PATCH v5 1/2] power: introduce PM QoS API on CPU wide
2024-07-03 1:32 ` zhoumin
@ 2024-07-03 2:52 ` lihuisong (C)
0 siblings, 0 replies; 114+ messages in thread
From: lihuisong (C) @ 2024-07-03 2:52 UTC (permalink / raw)
To: zhoumin, dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, liuyonglong
Hi
On 2024/7/3 9:32, zhoumin wrote:
> Hi Huisong,
>
> I know from our previous communication that this patchset is based on
> the *dpdk-next-net* repository. However, it may be better to rebase
> this patchset on the *dpdk* repository, because the CI system is not
> smart enough to recognize the patchset as a change for the
> *dpdk-next-net* repository. I personally feel this patchset is more
> likely a change for the *dpdk* repository because it modifies the
> `lib` directory, which is the infrastructure of DPDK, rather than a
> feature for *dpdk-next-net*.
>
Generally, the 'dpdk-next-net' repo is ahead of the 'dpdk' repo, and we do
some development based on the 'dpdk-next-net' repo.
I found that some other patches also hit the same issue, where CI failed
to apply the patch.
So I think the reason why this series triggers the warning should be
analyzed and resolved.
>
> On Tue, July 2, 2024 at 11:50AM, Huisong Li wrote:
>> The deeper the idle state, the lower the power consumption, but the longer
>> the resume time. Some services are delay-sensitive and expect a low resume
>> time, such as interrupt packet receiving mode.
>>
>> The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
>> interface is used by userspace to set and get the resume latency limit on
>> cpuX. Each cpuidle governor in Linux selects which idle state to enter
>> based on this CPU resume latency in its idle task.
>>
>> The per-CPU PM QoS API can be used to control this CPU's idle state
>> selection. Setting a strict resume latency (zero value) limits the CPU to
>> the shallowest idle state, which lowers the delay after sleep.
>>
>> Signed-off-by: Huisong Li <lihuisong@huawei.com>
>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>> ---
>> doc/guides/prog_guide/power_man.rst | 24 ++++++
>> doc/guides/rel_notes/release_24_07.rst | 4 +
>> lib/power/meson.build | 2 +
>> lib/power/rte_power_qos.c | 114 +++++++++++++++++++++++++
>> lib/power/rte_power_qos.h | 73 ++++++++++++++++
>> lib/power/version.map | 2 +
>> 6 files changed, 219 insertions(+)
>> create mode 100644 lib/power/rte_power_qos.c
>> create mode 100644 lib/power/rte_power_qos.h
> <snip>
>
> .
* [PATCH v5 2/2] examples/l3fwd-power: add PM QoS configuration
2024-07-02 3:50 ` [PATCH v5 0/2] power: introduce PM QoS interface Huisong Li
2024-07-02 3:50 ` [PATCH v5 1/2] power: introduce PM QoS API on CPU wide Huisong Li
@ 2024-07-02 3:50 ` Huisong Li
1 sibling, 0 replies; 114+ messages in thread
From: Huisong Li @ 2024-07-02 3:50 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, liuyonglong,
lihuisong
Add PM QoS configuration to decrease the resume delay after sleep by
keeping the CPU out of deeper idle states.
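The pattern used by the example application can be sketched roughly as follows: request a strict latency on every lcore at init, and restore "no constraint" at exit. This is only an illustration. The setter is passed as a function pointer (a hypothetical set_latency_fn type) so the sketch runs without DPDK; in the patch that role is played by rte_power_qos_set_cpu_resume_latency(). The names qos_init(), qos_deinit(), mock_set() and the two macros are likewise made up here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirrors of the two macros introduced in rte_power_qos.h. */
#define STRICT_LATENCY 0
#define NO_CONSTRAINT ((int)(UINT32_MAX >> 1))

/* Hypothetical setter type standing in for the real API call. */
typedef int (*set_latency_fn)(unsigned int core, int latency);

/* Init: request strict latency on every core and stop at the first failure. */
static int
qos_init(const unsigned int *cores, size_t n, set_latency_fn set)
{
	size_t i;
	int ret;

	for (i = 0; i < n; i++) {
		ret = set(cores[i], STRICT_LATENCY);
		if (ret < 0)
			return ret;
	}
	return 0;
}

/* Deinit: restore "no constraint" on every core, ignoring errors. */
static void
qos_deinit(const unsigned int *cores, size_t n, set_latency_fn set)
{
	size_t i;

	for (i = 0; i < n; i++)
		(void)set(cores[i], NO_CONSTRAINT);
}

/* Mock setter that records the last value per core, for demonstration only. */
static int mock_state[4] = { -1, -1, -1, -1 };

static int
mock_set(unsigned int core, int latency)
{
	if (core >= 4 || latency < 0)
		return -1;
	mock_state[core] = latency;
	return 0;
}
```

Note that this sketch, like the patch, restores the kernel default rather than whatever value was present before init; saving and restoring the previous value would be a different design.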
Signed-off-by: Huisong Li <lihuisong@huawei.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
examples/l3fwd-power/main.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index fba11da7ca..74a07afc6c 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
#include <rte_telemetry.h>
#include <rte_power_pmd_mgmt.h>
#include <rte_power_uncore.h>
+#include <rte_power_qos.h>
#include "perf_core.h"
#include "main.h"
@@ -2259,6 +2260,24 @@ init_power_library(void)
return -1;
}
}
+
+ RTE_LCORE_FOREACH(lcore_id) {
+ if (rte_lcore_is_enabled(lcore_id) == 0)
+ continue;
+ /*
+ * Set a strict latency limit on the worker lcores so that
+ * the CPU can only enter the shallowest idle state.
+ */
+ ret = rte_power_qos_set_cpu_resume_latency(lcore_id,
+ RTE_POWER_QOS_STRICT_LATENCY_VALUE);
+ if (ret < 0) {
+ RTE_LOG(ERR, L3FWD_POWER,
+ "Failed to set strict resume latency on CPU%u.\n",
+ lcore_id);
+ return ret;
+ }
+ }
+
return ret;
}
@@ -2298,6 +2317,15 @@ deinit_power_library(void)
}
}
}
+
+ RTE_LCORE_FOREACH(lcore_id) {
+ if (rte_lcore_is_enabled(lcore_id) == 0)
+ continue;
+ /* Restore the original value in kernel. */
+ rte_power_qos_set_cpu_resume_latency(lcore_id,
+ RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT);
+ }
+
return ret;
}
--
2.22.0
* [PATCH v6 0/2] power: introduce PM QoS interface
2024-03-20 10:55 [PATCH 0/2] introduce PM QoS interface Huisong Li
` (6 preceding siblings ...)
2024-07-02 3:50 ` [PATCH v5 0/2] power: introduce PM QoS interface Huisong Li
@ 2024-07-09 2:29 ` Huisong Li
2024-07-09 2:29 ` [PATCH v6 1/2] power: introduce PM QoS API on CPU wide Huisong Li
2024-07-09 2:29 ` [PATCH v6 2/2] examples/l3fwd-power: add PM QoS configuration Huisong Li
2024-07-09 6:31 ` [PATCH v7 0/2] power: introduce PM QoS interface Huisong Li
` (7 subsequent siblings)
15 siblings, 2 replies; 114+ messages in thread
From: Huisong Li @ 2024-07-09 2:29 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, liuyonglong,
lihuisong
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay-sensitive and expect a low resume
time, such as interrupt packet receiving mode.
The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
interface is used by userspace to set and get the resume latency limit on
cpuX. Please see the description in the kernel documentation [1].
Each cpuidle governor in Linux selects which idle state to enter based on
this CPU resume latency in its idle task.
The per-CPU PM QoS API can be used to control this CPU's idle state
selection. Setting a strict resume latency (zero value) limits the CPU to
the shallowest idle state, which lowers the delay after sleep.
[1] https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
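For readers who want to inspect the sysfs file named above directly, a minimal sketch in plain C (no DPDK) might look like the following. The helper names resume_latency_path() and read_first_line() are hypothetical; only the path format comes from this series.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define RESUME_LATENCY_PATH_FMT \
	"/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us"

/* Build the sysfs path for one CPU; returns 0 on success, -1 on truncation. */
static int
resume_latency_path(unsigned int cpu, char *path, size_t len)
{
	int n = snprintf(path, len, RESUME_LATENCY_PATH_FMT, cpu);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Read the first line of the file at path into buf and strip the newline.
 * Returns 0 on success, -1 if the file cannot be opened or read.
 */
static int
read_first_line(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");

	if (f == NULL)
		return -1;
	if (fgets(buf, (int)len, f) == NULL) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}
```

The string obtained this way is the raw kernel value ("n/a", "0", or a number of microseconds), not the integer returned by the API in this series.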
---
v6:
- update release_24_07.rst based on dpdk repo to resolve CI warning.
v5:
- use LINE_MAX to replace BUFSIZ, and use snprintf to replace sprintf.
v4:
- fix some comments based on Stephen's feedback
- add stdint.h include
- add Acked-by Morten Brørup <mb@smartsharesystems.com>
v3:
- add RTE_POWER_xxx prefix for some macro in header
- add the check for lcore_id with rte_lcore_is_enabled
v2:
- use PM QoS on CPU wide to replace the one on system wide
Huisong Li (2):
power: introduce PM QoS API on CPU wide
examples/l3fwd-power: add PM QoS configuration
doc/guides/prog_guide/power_man.rst | 24 ++++++
doc/guides/rel_notes/release_24_07.rst | 4 +
examples/l3fwd-power/main.c | 28 ++++++
lib/power/meson.build | 2 +
lib/power/rte_power_qos.c | 114 +++++++++++++++++++++++++
lib/power/rte_power_qos.h | 73 ++++++++++++++++
lib/power/version.map | 2 +
7 files changed, 247 insertions(+)
create mode 100644 lib/power/rte_power_qos.c
create mode 100644 lib/power/rte_power_qos.h
--
2.22.0
* [PATCH v6 1/2] power: introduce PM QoS API on CPU wide
2024-07-09 2:29 ` [PATCH v6 0/2] power: introduce PM QoS interface Huisong Li
@ 2024-07-09 2:29 ` Huisong Li
2024-07-09 2:29 ` [PATCH v6 2/2] examples/l3fwd-power: add PM QoS configuration Huisong Li
1 sibling, 0 replies; 114+ messages in thread
From: Huisong Li @ 2024-07-09 2:29 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, liuyonglong,
lihuisong
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay-sensitive and expect a low resume
time, such as interrupt packet receiving mode.
The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
interface is used by userspace to set and get the resume latency limit on
cpuX. Each cpuidle governor in Linux selects which idle state to enter
based on this CPU resume latency in its idle task.
The per-CPU PM QoS API can be used to control this CPU's idle state
selection. Setting a strict resume latency (zero value) limits the CPU to
the shallowest idle state, which lowers the delay after sleep.
Signed-off-by: Huisong Li <lihuisong@huawei.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
doc/guides/prog_guide/power_man.rst | 24 ++++++
doc/guides/rel_notes/release_24_07.rst | 4 +
lib/power/meson.build | 2 +
lib/power/rte_power_qos.c | 114 +++++++++++++++++++++++++
lib/power/rte_power_qos.h | 73 ++++++++++++++++
lib/power/version.map | 2 +
6 files changed, 219 insertions(+)
create mode 100644 lib/power/rte_power_qos.c
create mode 100644 lib/power/rte_power_qos.h
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index f6674efe2d..faa32b4320 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -249,6 +249,30 @@ Get Num Pkgs
Get Num Dies
Get the number of die's on a given package.
+
+PM QoS
+------
+
+The deeper the idle state, the lower the power consumption, but the longer
+the resume time. Some services are delay-sensitive and expect a low resume
+time, such as interrupt packet receiving mode.
+
+The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
+interface is used by userspace to set and get the resume latency limit on
+cpuX. Each cpuidle governor in Linux selects which idle state to enter
+based on this CPU resume latency in its idle task.
+
+The per-CPU PM QoS API can be used to set and get the CPU resume latency based
+on this sysfs interface.
+
+The ``rte_power_qos_set_cpu_resume_latency()`` function can control the CPU's
+idle state selection in Linux. Setting a strict resume latency (zero value)
+limits the CPU to the shallowest idle state, which lowers the delay of
+resuming service after sleeping.
+
+The ``rte_power_qos_get_cpu_resume_latency()`` function gets the resume
+latency of the specified CPU.
+
References
----------
diff --git a/doc/guides/rel_notes/release_24_07.rst b/doc/guides/rel_notes/release_24_07.rst
index 1dd842df3a..af6fd82a3c 100644
--- a/doc/guides/rel_notes/release_24_07.rst
+++ b/doc/guides/rel_notes/release_24_07.rst
@@ -155,6 +155,10 @@ New Features
Added an API that allows the user to reclaim the defer queue with RCU.
+* **Introduce per-CPU PM QoS interface.**
+
+ * Introduce per-CPU PM QoS interface to lower the delay after sleep.
+
Removed Items
-------------
diff --git a/lib/power/meson.build b/lib/power/meson.build
index b8426589b2..8222e178b0 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -23,12 +23,14 @@ sources = files(
'rte_power.c',
'rte_power_uncore.c',
'rte_power_pmd_mgmt.c',
+ 'rte_power_qos.c',
)
headers = files(
'rte_power.h',
'rte_power_guest_channel.h',
'rte_power_pmd_mgmt.h',
'rte_power_uncore.h',
+ 'rte_power_qos.h',
)
if cc.has_argument('-Wno-cast-qual')
cflags += '-Wno-cast-qual'
diff --git a/lib/power/rte_power_qos.c b/lib/power/rte_power_qos.c
new file mode 100644
index 0000000000..375746f832
--- /dev/null
+++ b/lib/power/rte_power_qos.c
@@ -0,0 +1,114 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#include <errno.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <rte_lcore.h>
+#include <rte_log.h>
+
+#include "power_common.h"
+#include "rte_power_qos.h"
+
+#define PM_QOS_SYSFILE_RESUME_LATENCY_US \
+ "/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us"
+
+int
+rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency)
+{
+ char buf[LINE_MAX];
+ FILE *f;
+ int ret;
+
+ if (!rte_lcore_is_enabled(lcore_id)) {
+ POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
+ return -EINVAL;
+ }
+
+ if (latency < 0) {
+ POWER_LOG(ERR, "latency should be greater than or equal to 0");
+ return -EINVAL;
+ }
+
+ ret = open_core_sysfs_file(&f, "w", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ return ret;
+ }
+
+ /*
+ * The pm_qos_resume_latency_us sysfs file in the kernel, see
+ * @PM_QOS_SYSFILE_RESUME_LATENCY_US, interprets the input string as
+ * follows:
+ * 1> "n/a" sets a resume latency of 0 (strict).
+ * 2> "0" sets no resume latency constraint.
+ * 3> any other number is the resume latency to set, in microseconds.
+ */
+ if (latency == 0)
+ snprintf(buf, sizeof(buf), "%s", "n/a");
+ else if (latency == RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT)
+ snprintf(buf, sizeof(buf), "%u", 0);
+ else
+ snprintf(buf, sizeof(buf), "%u", latency);
+
+ ret = write_core_sysfs_s(f, buf);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to write "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ goto out;
+ }
+
+out:
+ if (f != NULL)
+ fclose(f);
+
+ return ret;
+}
+
+int
+rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id)
+{
+ char buf[LINE_MAX];
+ int latency = -1;
+ FILE *f;
+ int ret;
+
+ if (!rte_lcore_is_enabled(lcore_id)) {
+ POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
+ return -EINVAL;
+ }
+
+ ret = open_core_sysfs_file(&f, "r", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ return ret;
+ }
+
+ ret = read_core_sysfs_s(f, buf, sizeof(buf));
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to read "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ goto out;
+ }
+
+ /*
+ * The pm_qos_resume_latency_us sysfs file in the kernel, see
+ * @PM_QOS_SYSFILE_RESUME_LATENCY_US, reports the output string as
+ * follows:
+ * 1> "n/a" means a resume latency of 0 (strict).
+ * 2> "0" means no resume latency constraint.
+ * 3> any other number is the resume latency in use, in microseconds.
+ */
+ if (strcmp(buf, "n/a") == 0)
+ latency = 0;
+ else {
+ latency = strtoul(buf, NULL, 10);
+ latency = latency == 0 ? RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT : latency;
+ }
+
+out:
+ if (f != NULL)
+ fclose(f);
+
+ return latency != -1 ? latency : ret;
+}
diff --git a/lib/power/rte_power_qos.h b/lib/power/rte_power_qos.h
new file mode 100644
index 0000000000..990c488373
--- /dev/null
+++ b/lib/power/rte_power_qos.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#ifndef RTE_POWER_QOS_H
+#define RTE_POWER_QOS_H
+
+#include <stdint.h>
+
+#include <rte_compat.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * @file rte_power_qos.h
+ *
+ * PM QoS API.
+ *
+ * The per-CPU resume latency limit affects this CPU's idle state selection
+ * in each cpuidle governor.
+ * See the kernel documentation of the per-CPU PM QoS interface:
+ * https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
+ *
+ * The deeper the idle state, the lower the power consumption, but the
+ * longer the resume time. Some services are delay-sensitive and expect a
+ * low resume time, such as interrupt packet receiving mode.
+ *
+ * In these cases, the per-CPU PM QoS API can be used to control this CPU's
+ * idle state selection. Setting a strict resume latency (zero value) limits
+ * the CPU to the shallowest idle state, which lowers the delay after sleep.
+ */
+
+#define RTE_POWER_QOS_STRICT_LATENCY_VALUE 0
+#define RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT ((int)(UINT32_MAX >> 1))
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * @param lcore_id
+ * target logical core id
+ *
+ * @param latency
+ * The resume latency in microseconds; must be greater than or equal to zero.
+ *
+ * @return
+ * 0 on success. Otherwise negative value is returned.
+ */
+__rte_experimental
+int rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Get the current resume latency of this logical core.
+ * If it has not been set, the default value in the kernel is
+ * @see RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT.
+ *
+ * @return
+ * Negative value on failure.
+ * >= 0 means the actual resume latency limit on this core.
+ */
+__rte_experimental
+int rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* RTE_POWER_QOS_H */
diff --git a/lib/power/version.map b/lib/power/version.map
index ad92a65f91..81b8ff11b7 100644
--- a/lib/power/version.map
+++ b/lib/power/version.map
@@ -51,4 +51,6 @@ EXPERIMENTAL {
rte_power_set_uncore_env;
rte_power_uncore_freqs;
rte_power_unset_uncore_env;
+ rte_power_qos_set_cpu_resume_latency;
+ rte_power_qos_get_cpu_resume_latency;
};
--
2.22.0
* [PATCH v6 2/2] examples/l3fwd-power: add PM QoS configuration
2024-07-09 2:29 ` [PATCH v6 0/2] power: introduce PM QoS interface Huisong Li
2024-07-09 2:29 ` [PATCH v6 1/2] power: introduce PM QoS API on CPU wide Huisong Li
@ 2024-07-09 2:29 ` Huisong Li
2024-07-09 3:07 ` Stephen Hemminger
1 sibling, 1 reply; 114+ messages in thread
From: Huisong Li @ 2024-07-09 2:29 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, liuyonglong,
lihuisong
Add PM QoS configuration to decrease the resume delay after sleep by
keeping the CPU out of deeper idle states.
Signed-off-by: Huisong Li <lihuisong@huawei.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
examples/l3fwd-power/main.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index fba11da7ca..74a07afc6c 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
#include <rte_telemetry.h>
#include <rte_power_pmd_mgmt.h>
#include <rte_power_uncore.h>
+#include <rte_power_qos.h>
#include "perf_core.h"
#include "main.h"
@@ -2259,6 +2260,24 @@ init_power_library(void)
return -1;
}
}
+
+ RTE_LCORE_FOREACH(lcore_id) {
+ if (rte_lcore_is_enabled(lcore_id) == 0)
+ continue;
+ /*
+ * Set a strict latency limit on the worker lcores so that
+ * the CPU can only enter the shallowest idle state.
+ */
+ ret = rte_power_qos_set_cpu_resume_latency(lcore_id,
+ RTE_POWER_QOS_STRICT_LATENCY_VALUE);
+ if (ret < 0) {
+ RTE_LOG(ERR, L3FWD_POWER,
+ "Failed to set strict resume latency on CPU%u.\n",
+ lcore_id);
+ return ret;
+ }
+ }
+
return ret;
}
@@ -2298,6 +2317,15 @@ deinit_power_library(void)
}
}
}
+
+ RTE_LCORE_FOREACH(lcore_id) {
+ if (rte_lcore_is_enabled(lcore_id) == 0)
+ continue;
+ /* Restore the original value in kernel. */
+ rte_power_qos_set_cpu_resume_latency(lcore_id,
+ RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT);
+ }
+
return ret;
}
--
2.22.0
* Re: [PATCH v6 2/2] examples/l3fwd-power: add PM QoS configuration
2024-07-09 2:29 ` [PATCH v6 2/2] examples/l3fwd-power: add PM QoS configuration Huisong Li
@ 2024-07-09 3:07 ` Stephen Hemminger
2024-07-09 3:18 ` lihuisong (C)
0 siblings, 1 reply; 114+ messages in thread
From: Stephen Hemminger @ 2024-07-09 3:07 UTC (permalink / raw)
To: Huisong Li
Cc: dev, mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, david.marchand, liuyonglong
On Tue, 9 Jul 2024 10:29:27 +0800
Huisong Li <lihuisong@huawei.com> wrote:
> + RTE_LCORE_FOREACH(lcore_id) {
> + if (rte_lcore_is_enabled(lcore_id) == 0)
> + continue;
>
Why do you need this check? RTE_LCORE_FOREACH calls rte_next_lcore which
already skips lcores that are not enabled.
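The point can be illustrated with a small self-contained model of an iterator that already skips disabled entries. The names below (enabled[], next_enabled(), FOREACH_ENABLED) are hypothetical and only mimic the behaviour described for RTE_LCORE_FOREACH; this is not DPDK code.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CORES 8

/* Toy model: which core ids are enabled. */
static const bool enabled[MAX_CORES] = {
	true, false, true, true, false, false, true, false
};

/* Return the next enabled id after i, or MAX_CORES when none is left. */
static unsigned int
next_enabled(unsigned int i)
{
	for (i++; i < MAX_CORES; i++)
		if (enabled[i])
			return i;
	return MAX_CORES;
}

/* Visit enabled ids only; (unsigned int)-1 wraps to 0 on the first i++. */
#define FOREACH_ENABLED(i) \
	for ((i) = next_enabled((unsigned int)-1); \
	     (i) < MAX_CORES; \
	     (i) = next_enabled(i))

/* Count how often an extra "is it enabled?" check inside the loop rejects an id. */
static unsigned int
count_rejected_inside_loop(void)
{
	unsigned int i, rejected = 0;

	FOREACH_ENABLED(i) {
		if (!enabled[i])
			rejected++; /* never reached: the iterator already filtered */
	}
	return rejected;
}

/* Count how many ids the loop visits. */
static unsigned int
count_visited(void)
{
	unsigned int i, n = 0;

	FOREACH_ENABLED(i)
		n++;
	return n;
}
```

In this model the inner check never rejects anything, which is why the equivalent check inside the loop is dead code.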
* Re: [PATCH v6 2/2] examples/l3fwd-power: add PM QoS configuration
2024-07-09 3:07 ` Stephen Hemminger
@ 2024-07-09 3:18 ` lihuisong (C)
0 siblings, 0 replies; 114+ messages in thread
From: lihuisong (C) @ 2024-07-09 3:18 UTC (permalink / raw)
To: Stephen Hemminger
Cc: dev, mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, david.marchand, liuyonglong
On 2024/7/9 11:07, Stephen Hemminger wrote:
> On Tue, 9 Jul 2024 10:29:27 +0800
> Huisong Li <lihuisong@huawei.com> wrote:
>
>> + RTE_LCORE_FOREACH(lcore_id) {
>> + if (rte_lcore_is_enabled(lcore_id) == 0)
>> + continue;
>>
> Why do you need this check? RTE_LCORE_FOREACH calls rte_next_lcore which
> already skips lcores that are not enabled.
Yes, it is dead code. I will delete it in the next version. Thanks.
> .
* [PATCH v7 0/2] power: introduce PM QoS interface
2024-03-20 10:55 [PATCH 0/2] introduce PM QoS interface Huisong Li
` (7 preceding siblings ...)
2024-07-09 2:29 ` [PATCH v6 0/2] power: introduce PM QoS interface Huisong Li
@ 2024-07-09 6:31 ` Huisong Li
2024-07-09 6:31 ` [PATCH v7 1/2] power: introduce PM QoS API on CPU wide Huisong Li
2024-07-09 6:31 ` [PATCH v7 2/2] examples/l3fwd-power: add PM QoS configuration Huisong Li
2024-07-09 7:25 ` [PATCH v8 0/2] power: introduce PM QoS interface Huisong Li
` (6 subsequent siblings)
15 siblings, 2 replies; 114+ messages in thread
From: Huisong Li @ 2024-07-09 6:31 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, liuyonglong,
lihuisong
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay sensitive and expect a low
resume time, such as interrupt packet receiving mode.
The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
interface is used by userspace to set and get the resume latency limit on
cpuX. Please see the description in the kernel documentation[1].
Each cpuidle governor in Linux selects which idle state to enter based on
this CPU resume latency in its idle task.
The per-CPU PM QoS API can be used to control this CPU's idle state
selection. Setting a strict resume latency (zero value) restricts the CPU
to the shallowest idle state, which lowers the delay after sleep.
[1] https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
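As a rough illustration of the value mapping this series relies on, the sketch below converts between the API's integer latency and the strings used by the sysfs file. The helper names `qos_latency_to_sysfs` and `qos_sysfs_to_latency` and the `QOS_*` macros are hypothetical and exist only for this example; the mapping itself ("n/a" for zero latency, "0" for no constraint) follows the comments in patch 1/2. The stricter parsing with an end pointer is an assumption of this sketch, not what the patch does.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical local copies of the values defined in rte_power_qos.h. */
#define QOS_STRICT 0
#define QOS_NO_CONSTRAINT ((int)(UINT32_MAX >> 1))

/*
 * Convert an API latency (microseconds) to the string written to sysfs.
 * Returns 0 on success and -1 if the latency is negative.
 */
static int
qos_latency_to_sysfs(int latency, char *buf, size_t len)
{
	if (latency < 0)
		return -1;

	if (latency == QOS_STRICT)
		snprintf(buf, len, "n/a");      /* zero latency tolerated */
	else if (latency == QOS_NO_CONSTRAINT)
		snprintf(buf, len, "0");        /* no constraint at all */
	else
		snprintf(buf, len, "%d", latency);

	return 0;
}

/*
 * Convert a string read from sysfs (newline already stripped) back to
 * an API latency. Returns -1 if the string is not a valid value.
 */
static int
qos_sysfs_to_latency(const char *s)
{
	char *end;
	long v;

	if (strcmp(s, "n/a") == 0)
		return QOS_STRICT;

	v = strtol(s, &end, 10);
	if (end == s || *end != '\0' || v < 0 || v > QOS_NO_CONSTRAINT)
		return -1;

	return v == 0 ? QOS_NO_CONSTRAINT : (int)v;
}
```

With these helpers, a request of 0 becomes "n/a", a request of 20 becomes "20", and reading back "0" yields the no-constraint value.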
---
v7:
- remove the dead rte_lcore_is_enabled check in patch[2/2]
v6:
- update release_24_07.rst based on dpdk repo to resolve CI warning.
v5:
- use LINE_MAX to replace BUFSIZ, and use snprintf to replace sprintf.
v4:
- fix some comments based on Stephen's review
- add stdint.h include
- add Acked-by Morten Brørup <mb@smartsharesystems.com>
v3:
- add RTE_POWER_xxx prefix for some macro in header
- add the check for lcore_id with rte_lcore_is_enabled
v2:
- use PM QoS on CPU wide to replace the one on system wide
Huisong Li (2):
power: introduce PM QoS API on CPU wide
examples/l3fwd-power: add PM QoS configuration
doc/guides/prog_guide/power_man.rst | 24 ++++++
doc/guides/rel_notes/release_24_07.rst | 4 +
examples/l3fwd-power/main.c | 24 ++++++
lib/power/meson.build | 2 +
lib/power/rte_power_qos.c | 114 +++++++++++++++++++++++++
lib/power/rte_power_qos.h | 73 ++++++++++++++++
lib/power/version.map | 2 +
7 files changed, 243 insertions(+)
create mode 100644 lib/power/rte_power_qos.c
create mode 100644 lib/power/rte_power_qos.h
--
2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH v7 1/2] power: introduce PM QoS API on CPU wide
2024-07-09 6:31 ` [PATCH v7 0/2] power: introduce PM QoS interface Huisong Li
@ 2024-07-09 6:31 ` Huisong Li
2024-07-09 6:31 ` [PATCH v7 2/2] examples/l3fwd-power: add PM QoS configuration Huisong Li
1 sibling, 0 replies; 114+ messages in thread
From: Huisong Li @ 2024-07-09 6:31 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, liuyonglong,
lihuisong
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay sensitive and expect a low
resume time, such as interrupt packet receiving mode.
The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
interface is used by userspace to set and get the resume latency limit on
cpuX. Each cpuidle governor in Linux selects which idle state to enter
based on this CPU resume latency in its idle task.
The per-CPU PM QoS API can be used to control this CPU's idle state
selection. Setting a strict resume latency (zero value) restricts the CPU
to the shallowest idle state, which lowers the delay after sleep.
Signed-off-by: Huisong Li <lihuisong@huawei.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
doc/guides/prog_guide/power_man.rst | 24 ++++++
doc/guides/rel_notes/release_24_07.rst | 4 +
lib/power/meson.build | 2 +
lib/power/rte_power_qos.c | 114 +++++++++++++++++++++++++
lib/power/rte_power_qos.h | 73 ++++++++++++++++
lib/power/version.map | 2 +
6 files changed, 219 insertions(+)
create mode 100644 lib/power/rte_power_qos.c
create mode 100644 lib/power/rte_power_qos.h
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index f6674efe2d..faa32b4320 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -249,6 +249,30 @@ Get Num Pkgs
Get Num Dies
Get the number of die's on a given package.
+
+PM QoS
+------
+
+The deeper the idle state, the lower the power consumption, but the longer
+the resume time. Some services are delay sensitive and expect a low
+resume time, such as interrupt packet receiving mode.
+
+The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
+interface is used by userspace to set and get the resume latency limit on
+cpuX. Each cpuidle governor in Linux selects which idle state to enter
+based on this CPU resume latency in its idle task.
+
+The per-CPU PM QoS API can be used to set and get the CPU resume latency based
+on this sysfs interface.
+
+The ``rte_power_qos_set_cpu_resume_latency()`` function can control the CPU's
+idle state selection in Linux. Setting a strict resume latency (zero value)
+restricts the CPU to the shallowest idle state, which lowers the delay of
+resuming service after sleeping.
+
+The ``rte_power_qos_get_cpu_resume_latency()`` function gets the resume
+latency of the specified CPU.
+
References
----------
diff --git a/doc/guides/rel_notes/release_24_07.rst b/doc/guides/rel_notes/release_24_07.rst
index 1dd842df3a..af6fd82a3c 100644
--- a/doc/guides/rel_notes/release_24_07.rst
+++ b/doc/guides/rel_notes/release_24_07.rst
@@ -155,6 +155,10 @@ New Features
Added an API that allows the user to reclaim the defer queue with RCU.
+* **Introduced per-CPU PM QoS interface.**
+
+ * Introduced per-CPU PM QoS interface to lower the delay after sleep.
+
Removed Items
-------------
diff --git a/lib/power/meson.build b/lib/power/meson.build
index b8426589b2..8222e178b0 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -23,12 +23,14 @@ sources = files(
'rte_power.c',
'rte_power_uncore.c',
'rte_power_pmd_mgmt.c',
+ 'rte_power_qos.c',
)
headers = files(
'rte_power.h',
'rte_power_guest_channel.h',
'rte_power_pmd_mgmt.h',
'rte_power_uncore.h',
+ 'rte_power_qos.h',
)
if cc.has_argument('-Wno-cast-qual')
cflags += '-Wno-cast-qual'
diff --git a/lib/power/rte_power_qos.c b/lib/power/rte_power_qos.c
new file mode 100644
index 0000000000..375746f832
--- /dev/null
+++ b/lib/power/rte_power_qos.c
@@ -0,0 +1,114 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#include <errno.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <rte_lcore.h>
+#include <rte_log.h>
+
+#include "power_common.h"
+#include "rte_power_qos.h"
+
+#define PM_QOS_SYSFILE_RESUME_LATENCY_US \
+ "/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us"
+
+int
+rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency)
+{
+ char buf[LINE_MAX];
+ FILE *f;
+ int ret;
+
+ if (!rte_lcore_is_enabled(lcore_id)) {
+ POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
+ return -EINVAL;
+ }
+
+ if (latency < 0) {
+ POWER_LOG(ERR, "latency should be greater than or equal to 0");
+ return -EINVAL;
+ }
+
+ ret = open_core_sysfs_file(&f, "w", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ return ret;
+ }
+
+ /*
+ * Based on the sysfs interface pm_qos_resume_latency_us under
+ * @PM_QOS_SYSFILE_RESUME_LATENCY_US directory in kernel, the meaning
+ * of each input string is as follows.
+ * 1> the resume latency is 0 if the input is "n/a".
+ * 2> there is no resume latency constraint if the input is "0".
+ * 3> any other input is the actual resume latency to be set.
+ */
+ if (latency == 0)
+ snprintf(buf, sizeof(buf), "%s", "n/a");
+ else if (latency == RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT)
+ snprintf(buf, sizeof(buf), "%u", 0);
+ else
+ snprintf(buf, sizeof(buf), "%u", latency);
+
+ ret = write_core_sysfs_s(f, buf);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to write "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ goto out;
+ }
+
+out:
+ if (f != NULL)
+ fclose(f);
+
+ return ret;
+}
+
+int
+rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id)
+{
+ char buf[LINE_MAX];
+ int latency = -1;
+ FILE *f;
+ int ret;
+
+ if (!rte_lcore_is_enabled(lcore_id)) {
+ POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
+ return -EINVAL;
+ }
+
+ ret = open_core_sysfs_file(&f, "r", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ return ret;
+ }
+
+ ret = read_core_sysfs_s(f, buf, sizeof(buf));
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to read "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ goto out;
+ }
+
+ /*
+ * Based on the sysfs interface pm_qos_resume_latency_us under
+ * @PM_QOS_SYSFILE_RESUME_LATENCY_US directory in kernel, the meaning
+ * of each output string is as follows.
+ * 1> the resume latency is 0 if the output is "n/a".
+ * 2> there is no resume latency constraint if the output is "0".
+ * 3> any other output is the actual resume latency in use.
+ */
+ if (strcmp(buf, "n/a") == 0)
+ latency = 0;
+ else {
+ latency = strtoul(buf, NULL, 10);
+ latency = latency == 0 ? RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT : latency;
+ }
+
+out:
+ if (f != NULL)
+ fclose(f);
+
+ return latency != -1 ? latency : ret;
+}
diff --git a/lib/power/rte_power_qos.h b/lib/power/rte_power_qos.h
new file mode 100644
index 0000000000..990c488373
--- /dev/null
+++ b/lib/power/rte_power_qos.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#ifndef RTE_POWER_QOS_H
+#define RTE_POWER_QOS_H
+
+#include <stdint.h>
+
+#include <rte_compat.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * @file rte_power_qos.h
+ *
+ * PM QoS API.
+ *
+ * The CPU-wide resume latency limit has a positive impact on this CPU's idle
+ * state selection in each cpuidle governor.
+ * Please see the description of per-CPU PM QoS at the following link:
+ * https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
+ *
+ * The deeper the idle state, the lower the power consumption, but the
+ * longer the resume time. Some services are delay sensitive and expect a
+ * low resume time, such as interrupt packet receiving mode.
+ *
+ * In these cases, the per-CPU PM QoS API can be used to control this CPU's
+ * idle state selection. Setting a strict resume latency (zero value) limits
+ * the CPU to the shallowest idle state, which lowers the delay after sleep.
+ */
+
+#define RTE_POWER_QOS_STRICT_LATENCY_VALUE 0
+#define RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT ((int)(UINT32_MAX >> 1))
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * @param lcore_id
+ * target logical core id
+ *
+ * @param latency
+ * The latency, in microseconds, must be greater than or equal to zero.
+ *
+ * @return
+ * 0 on success. Otherwise negative value is returned.
+ */
+__rte_experimental
+int rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Get the current resume latency of this logical core.
+ * The default value in kernel is @see RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT
+ * if it has not been set.
+ *
+ * @return
+ * Negative value on failure.
+ * >= 0 means the actual resume latency limit on this core.
+ */
+__rte_experimental
+int rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* RTE_POWER_QOS_H */
diff --git a/lib/power/version.map b/lib/power/version.map
index ad92a65f91..81b8ff11b7 100644
--- a/lib/power/version.map
+++ b/lib/power/version.map
@@ -51,4 +51,6 @@ EXPERIMENTAL {
rte_power_set_uncore_env;
rte_power_uncore_freqs;
rte_power_unset_uncore_env;
+ rte_power_qos_set_cpu_resume_latency;
+ rte_power_qos_get_cpu_resume_latency;
};
--
2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH v7 2/2] examples/l3fwd-power: add PM QoS configuration
2024-07-09 6:31 ` [PATCH v7 0/2] power: introduce PM QoS interface Huisong Li
2024-07-09 6:31 ` [PATCH v7 1/2] power: introduce PM QoS API on CPU wide Huisong Li
@ 2024-07-09 6:31 ` Huisong Li
1 sibling, 0 replies; 114+ messages in thread
From: Huisong Li @ 2024-07-09 6:31 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, liuyonglong,
lihuisong
Add PM QoS configuration to decrease the delay after sleep in case of
entering a deeper idle state.
Signed-off-by: Huisong Li <lihuisong@huawei.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
examples/l3fwd-power/main.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index fba11da7ca..d518e19467 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
#include <rte_telemetry.h>
#include <rte_power_pmd_mgmt.h>
#include <rte_power_uncore.h>
+#include <rte_power_qos.h>
#include "perf_core.h"
#include "main.h"
@@ -2259,6 +2260,22 @@ init_power_library(void)
return -1;
}
}
+
+ RTE_LCORE_FOREACH(lcore_id) {
+ /*
+ * Set a strict latency limit on the worker lcores so that the
+ * CPU only enters the shallowest idle state.
+ */
+ ret = rte_power_qos_set_cpu_resume_latency(lcore_id,
+ RTE_POWER_QOS_STRICT_LATENCY_VALUE);
+ if (ret < 0) {
+ RTE_LOG(ERR, L3FWD_POWER,
+ "Failed to set strict resume latency on CPU%u.\n",
+ lcore_id);
+ return ret;
+ }
+ }
+
return ret;
}
@@ -2298,6 +2315,13 @@ deinit_power_library(void)
}
}
}
+
+ RTE_LCORE_FOREACH(lcore_id) {
+ /* Restore the original value in kernel. */
+ rte_power_qos_set_cpu_resume_latency(lcore_id,
+ RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT);
+ }
+
return ret;
}
--
2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH v8 0/2] power: introduce PM QoS interface
2024-03-20 10:55 [PATCH 0/2] introduce PM QoS interface Huisong Li
` (8 preceding siblings ...)
2024-07-09 6:31 ` [PATCH v7 0/2] power: introduce PM QoS interface Huisong Li
@ 2024-07-09 7:25 ` Huisong Li
2024-07-09 7:25 ` [PATCH v8 1/2] power: introduce PM QoS API on CPU wide Huisong Li
2024-07-09 7:25 ` [PATCH v8 2/2] examples/l3fwd-power: add PM QoS configuration Huisong Li
2024-08-09 9:50 ` [PATCH v9 0/2] power: introduce PM QoS interface Huisong Li
` (5 subsequent siblings)
15 siblings, 2 replies; 114+ messages in thread
From: Huisong Li @ 2024-07-09 7:25 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, liuyonglong,
lihuisong
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay sensitive and expect a low
resume time, such as interrupt packet receiving mode.
The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
interface is used by userspace to set and get the resume latency limit on
cpuX. Please see the description in the kernel documentation[1].
Each cpuidle governor in Linux selects which idle state to enter based on
this CPU resume latency in its idle task.
The per-CPU PM QoS API can be used to control this CPU's idle state
selection. Setting a strict resume latency (zero value) restricts the CPU
to the shallowest idle state, which lowers the delay after sleep.
[1] https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
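For anyone who wants to experiment with the sysfs file outside of DPDK, the following is a minimal sketch of building the per-CPU path and doing a plain string write and read. The helpers `qos_sysfs_path`, `qos_write_str` and `qos_read_str` and the `QOS_PATH_FMT` macro are hypothetical names made up for this example; the series itself uses the internal open_core_sysfs_file(), write_core_sysfs_s() and read_core_sysfs_s() helpers instead. The path is passed in as a parameter so the same code can be tried on a scratch file first.

```c
#include <stdio.h>
#include <string.h>

/* Same path format as PM_QOS_SYSFILE_RESUME_LATENCY_US in patch 1/2. */
#define QOS_PATH_FMT \
	"/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us"

/* Build the sysfs path for one CPU; returns -1 if buf is too small. */
static int
qos_sysfs_path(char *buf, size_t len, unsigned int cpu)
{
	int n = snprintf(buf, len, QOS_PATH_FMT, cpu);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write one string to the given file; returns 0 on success. */
static int
qos_write_str(const char *path, const char *s)
{
	FILE *f = fopen(path, "w");
	int ret = 0;

	if (f == NULL)
		return -1;
	if (fputs(s, f) == EOF)
		ret = -1;
	/* Buffered write errors only surface when the stream is closed. */
	if (fclose(f) != 0)
		ret = -1;
	return ret;
}

/* Read the first line of the given file and strip the newline. */
static int
qos_read_str(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");

	if (f == NULL)
		return -1;
	if (fgets(buf, (int)len, f) == NULL) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}
```

Writing to the real sysfs file normally requires sufficient privileges, so trying the helpers against a temporary file is the safer first step.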
---
v8:
- update the latest code to resolve CI warning
v7:
- remove the dead rte_lcore_is_enabled check in patch[2/2]
v6:
- update release_24_07.rst based on dpdk repo to resolve CI warning.
v5:
- use LINE_MAX to replace BUFSIZ, and use snprintf to replace sprintf.
v4:
- fix some comments based on Stephen's review
- add stdint.h include
- add Acked-by Morten Brørup <mb@smartsharesystems.com>
v3:
- add RTE_POWER_xxx prefix for some macro in header
- add the check for lcore_id with rte_lcore_is_enabled
v2:
- use PM QoS on CPU wide to replace the one on system wide
Huisong Li (2):
power: introduce PM QoS API on CPU wide
examples/l3fwd-power: add PM QoS configuration
doc/guides/prog_guide/power_man.rst | 24 ++++++
doc/guides/rel_notes/release_24_07.rst | 4 +
examples/l3fwd-power/main.c | 24 ++++++
lib/power/meson.build | 2 +
lib/power/rte_power_qos.c | 114 +++++++++++++++++++++++++
lib/power/rte_power_qos.h | 73 ++++++++++++++++
lib/power/version.map | 2 +
7 files changed, 243 insertions(+)
create mode 100644 lib/power/rte_power_qos.c
create mode 100644 lib/power/rte_power_qos.h
--
2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH v8 1/2] power: introduce PM QoS API on CPU wide
2024-07-09 7:25 ` [PATCH v8 0/2] power: introduce PM QoS interface Huisong Li
@ 2024-07-09 7:25 ` Huisong Li
2024-07-09 7:25 ` [PATCH v8 2/2] examples/l3fwd-power: add PM QoS configuration Huisong Li
1 sibling, 0 replies; 114+ messages in thread
From: Huisong Li @ 2024-07-09 7:25 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, liuyonglong,
lihuisong
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay sensitive and expect a low
resume time, such as interrupt packet receiving mode.
The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
interface is used by userspace to set and get the resume latency limit on
cpuX. Each cpuidle governor in Linux selects which idle state to enter
based on this CPU resume latency in its idle task.
The per-CPU PM QoS API can be used to control this CPU's idle state
selection. Setting a strict resume latency (zero value) restricts the CPU
to the shallowest idle state, which lowers the delay after sleep.
Signed-off-by: Huisong Li <lihuisong@huawei.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
doc/guides/prog_guide/power_man.rst | 24 ++++++
doc/guides/rel_notes/release_24_07.rst | 4 +
lib/power/meson.build | 2 +
lib/power/rte_power_qos.c | 114 +++++++++++++++++++++++++
lib/power/rte_power_qos.h | 73 ++++++++++++++++
lib/power/version.map | 2 +
6 files changed, 219 insertions(+)
create mode 100644 lib/power/rte_power_qos.c
create mode 100644 lib/power/rte_power_qos.h
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index f6674efe2d..faa32b4320 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -249,6 +249,30 @@ Get Num Pkgs
Get Num Dies
Get the number of die's on a given package.
+
+PM QoS
+------
+
+The deeper the idle state, the lower the power consumption, but the longer
+the resume time. Some services are delay sensitive and expect a low
+resume time, such as interrupt packet receiving mode.
+
+The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
+interface is used by userspace to set and get the resume latency limit on
+cpuX. Each cpuidle governor in Linux selects which idle state to enter
+based on this CPU resume latency in its idle task.
+
+The per-CPU PM QoS API can be used to set and get the CPU resume latency based
+on this sysfs interface.
+
+The ``rte_power_qos_set_cpu_resume_latency()`` function can control the CPU's
+idle state selection in Linux. Setting a strict resume latency (zero value)
+restricts the CPU to the shallowest idle state, which lowers the delay of
+resuming service after sleeping.
+
+The ``rte_power_qos_get_cpu_resume_latency()`` function gets the resume
+latency of the specified CPU.
+
References
----------
diff --git a/doc/guides/rel_notes/release_24_07.rst b/doc/guides/rel_notes/release_24_07.rst
index 50ffc1f74a..e771868d9f 100644
--- a/doc/guides/rel_notes/release_24_07.rst
+++ b/doc/guides/rel_notes/release_24_07.rst
@@ -156,6 +156,10 @@ New Features
* Added defer queue reclamation via RCU.
* Added SVE support for bulk lookup.
+* **Introduced per-CPU PM QoS interface.**
+
+ * Introduced per-CPU PM QoS interface to lower the delay after sleep.
+
Removed Items
-------------
diff --git a/lib/power/meson.build b/lib/power/meson.build
index b8426589b2..8222e178b0 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -23,12 +23,14 @@ sources = files(
'rte_power.c',
'rte_power_uncore.c',
'rte_power_pmd_mgmt.c',
+ 'rte_power_qos.c',
)
headers = files(
'rte_power.h',
'rte_power_guest_channel.h',
'rte_power_pmd_mgmt.h',
'rte_power_uncore.h',
+ 'rte_power_qos.h',
)
if cc.has_argument('-Wno-cast-qual')
cflags += '-Wno-cast-qual'
diff --git a/lib/power/rte_power_qos.c b/lib/power/rte_power_qos.c
new file mode 100644
index 0000000000..375746f832
--- /dev/null
+++ b/lib/power/rte_power_qos.c
@@ -0,0 +1,114 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#include <errno.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <rte_lcore.h>
+#include <rte_log.h>
+
+#include "power_common.h"
+#include "rte_power_qos.h"
+
+#define PM_QOS_SYSFILE_RESUME_LATENCY_US \
+ "/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us"
+
+int
+rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency)
+{
+ char buf[LINE_MAX];
+ FILE *f;
+ int ret;
+
+ if (!rte_lcore_is_enabled(lcore_id)) {
+ POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
+ return -EINVAL;
+ }
+
+ if (latency < 0) {
+ POWER_LOG(ERR, "latency should be greater than or equal to 0");
+ return -EINVAL;
+ }
+
+ ret = open_core_sysfs_file(&f, "w", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ return ret;
+ }
+
+ /*
+ * Based on the sysfs interface pm_qos_resume_latency_us under
+ * @PM_QOS_SYSFILE_RESUME_LATENCY_US directory in kernel, the meaning
+ * of each input string is as follows.
+ * 1> the resume latency is 0 if the input is "n/a".
+ * 2> there is no resume latency constraint if the input is "0".
+ * 3> any other input is the actual resume latency to be set.
+ */
+ if (latency == 0)
+ snprintf(buf, sizeof(buf), "%s", "n/a");
+ else if (latency == RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT)
+ snprintf(buf, sizeof(buf), "%u", 0);
+ else
+ snprintf(buf, sizeof(buf), "%u", latency);
+
+ ret = write_core_sysfs_s(f, buf);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to write "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ goto out;
+ }
+
+out:
+ if (f != NULL)
+ fclose(f);
+
+ return ret;
+}
+
+int
+rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id)
+{
+ char buf[LINE_MAX];
+ int latency = -1;
+ FILE *f;
+ int ret;
+
+ if (!rte_lcore_is_enabled(lcore_id)) {
+ POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
+ return -EINVAL;
+ }
+
+ ret = open_core_sysfs_file(&f, "r", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ return ret;
+ }
+
+ ret = read_core_sysfs_s(f, buf, sizeof(buf));
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to read "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ goto out;
+ }
+
+ /*
+ * Based on the sysfs interface pm_qos_resume_latency_us under
+ * @PM_QOS_SYSFILE_RESUME_LATENCY_US directory in kernel, the meaning
+ * of each output string is as follows.
+ * 1> the resume latency is 0 if the output is "n/a".
+ * 2> there is no resume latency constraint if the output is "0".
+ * 3> any other output is the actual resume latency in use.
+ */
+ if (strcmp(buf, "n/a") == 0)
+ latency = 0;
+ else {
+ latency = strtoul(buf, NULL, 10);
+ latency = latency == 0 ? RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT : latency;
+ }
+
+out:
+ if (f != NULL)
+ fclose(f);
+
+ return latency != -1 ? latency : ret;
+}
diff --git a/lib/power/rte_power_qos.h b/lib/power/rte_power_qos.h
new file mode 100644
index 0000000000..990c488373
--- /dev/null
+++ b/lib/power/rte_power_qos.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#ifndef RTE_POWER_QOS_H
+#define RTE_POWER_QOS_H
+
+#include <stdint.h>
+
+#include <rte_compat.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * @file rte_power_qos.h
+ *
+ * PM QoS API.
+ *
+ * The CPU-wide resume latency limit has a positive impact on this CPU's idle
+ * state selection in each cpuidle governor.
+ * Please see the description of per-CPU PM QoS at the following link:
+ * https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
+ *
+ * The deeper the idle state, the lower the power consumption, but the
+ * longer the resume time. Some services are delay sensitive and expect a
+ * low resume time, such as interrupt packet receiving mode.
+ *
+ * In these cases, the per-CPU PM QoS API can be used to control this CPU's
+ * idle state selection. Setting a strict resume latency (zero value) limits
+ * the CPU to the shallowest idle state, which lowers the delay after sleep.
+ */
+
+#define RTE_POWER_QOS_STRICT_LATENCY_VALUE 0
+#define RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT ((int)(UINT32_MAX >> 1))
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * @param lcore_id
+ * target logical core id
+ *
+ * @param latency
+ * The latency, in microseconds, must be greater than or equal to zero.
+ *
+ * @return
+ * 0 on success. Otherwise negative value is returned.
+ */
+__rte_experimental
+int rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Get the current resume latency of this logical core.
+ * The default value in kernel is @see RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT
+ * if it has not been set.
+ *
+ * @return
+ * Negative value on failure.
+ * >= 0 means the actual resume latency limit on this core.
+ */
+__rte_experimental
+int rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* RTE_POWER_QOS_H */
diff --git a/lib/power/version.map b/lib/power/version.map
index ad92a65f91..81b8ff11b7 100644
--- a/lib/power/version.map
+++ b/lib/power/version.map
@@ -51,4 +51,6 @@ EXPERIMENTAL {
rte_power_set_uncore_env;
rte_power_uncore_freqs;
rte_power_unset_uncore_env;
+ rte_power_qos_set_cpu_resume_latency;
+ rte_power_qos_get_cpu_resume_latency;
};
--
2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH v8 2/2] examples/l3fwd-power: add PM QoS configuration
2024-07-09 7:25 ` [PATCH v8 0/2] power: introduce PM QoS interface Huisong Li
2024-07-09 7:25 ` [PATCH v8 1/2] power: introduce PM QoS API on CPU wide Huisong Li
@ 2024-07-09 7:25 ` Huisong Li
1 sibling, 0 replies; 114+ messages in thread
From: Huisong Li @ 2024-07-09 7:25 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, liuyonglong,
lihuisong
Add PM QoS configuration to decrease the delay after sleep in case of
entering a deeper idle state.
Signed-off-by: Huisong Li <lihuisong@huawei.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
examples/l3fwd-power/main.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index fba11da7ca..d518e19467 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
#include <rte_telemetry.h>
#include <rte_power_pmd_mgmt.h>
#include <rte_power_uncore.h>
+#include <rte_power_qos.h>
#include "perf_core.h"
#include "main.h"
@@ -2259,6 +2260,22 @@ init_power_library(void)
return -1;
}
}
+
+ RTE_LCORE_FOREACH(lcore_id) {
+ /*
+ * Set a strict latency limit on the worker lcores so that the
+ * CPU only enters the shallowest idle state.
+ */
+ ret = rte_power_qos_set_cpu_resume_latency(lcore_id,
+ RTE_POWER_QOS_STRICT_LATENCY_VALUE);
+ if (ret < 0) {
+ RTE_LOG(ERR, L3FWD_POWER,
+ "Failed to set strict resume latency on CPU%u.\n",
+ lcore_id);
+ return ret;
+ }
+ }
+
return ret;
}
@@ -2298,6 +2315,13 @@ deinit_power_library(void)
}
}
}
+
+ RTE_LCORE_FOREACH(lcore_id) {
+ /* Restore the original value in kernel. */
+ rte_power_qos_set_cpu_resume_latency(lcore_id,
+ RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT);
+ }
+
return ret;
}
--
2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH v9 0/2] power: introduce PM QoS interface
2024-03-20 10:55 [PATCH 0/2] introduce PM QoS interface Huisong Li
` (9 preceding siblings ...)
2024-07-09 7:25 ` [PATCH v8 0/2] power: introduce PM QoS interface Huisong Li
@ 2024-08-09 9:50 ` Huisong Li
2024-08-09 9:50 ` [PATCH v9 1/2] power: introduce PM QoS API on CPU wide Huisong Li
2024-08-09 9:50 ` [PATCH v9 2/2] examples/l3fwd-power: add PM QoS configuration Huisong Li
2024-09-12 2:38 ` [PATCH v10 0/2] power: introduce PM QoS interface Huisong Li
` (4 subsequent siblings)
15 siblings, 2 replies; 114+ messages in thread
From: Huisong Li @ 2024-08-09 9:50 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, liuyonglong,
lihuisong
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay sensitive and expect a low
resume time, such as interrupt packet receiving mode.
The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
interface is used by userspace to set and get the resume latency limit on
cpuX. Please see the description in the kernel documentation[1].
Each cpuidle governor in Linux selects which idle state to enter based on
this CPU resume latency in its idle task.
The per-CPU PM QoS API can be used to control this CPU's idle state
selection. Setting a strict resume latency (zero value) restricts the CPU
to the shallowest idle state, which lowers the delay after sleep.
[1] https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
---
v9:
- move new feature description from release_24_07.rst to release_24_11.rst.
v8:
- update the latest code to resolve CI warning
v7:
- remove the dead rte_lcore_is_enabled check in patch[2/2]
v6:
- update release_24_07.rst based on dpdk repo to resolve CI warning.
v5:
- use LINE_MAX to replace BUFSIZ, and use snprintf to replace sprintf.
v4:
- fix some comments based on Stephen's review
- add stdint.h include
- add Acked-by Morten Brørup <mb@smartsharesystems.com>
v3:
- add RTE_POWER_xxx prefix for some macro in header
- add the check for lcore_id with rte_lcore_is_enabled
v2:
- use PM QoS on CPU wide to replace the one on system wide
Huisong Li (2):
power: introduce PM QoS API on CPU wide
examples/l3fwd-power: add PM QoS configuration
doc/guides/prog_guide/power_man.rst | 24 ++++++
doc/guides/rel_notes/release_24_11.rst | 5 ++
examples/l3fwd-power/main.c | 24 ++++++
lib/power/meson.build | 2 +
lib/power/rte_power_qos.c | 114 +++++++++++++++++++++++++
lib/power/rte_power_qos.h | 73 ++++++++++++++++
lib/power/version.map | 4 +
7 files changed, 246 insertions(+)
create mode 100644 lib/power/rte_power_qos.c
create mode 100644 lib/power/rte_power_qos.h
--
2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH v9 1/2] power: introduce PM QoS API on CPU wide
2024-08-09 9:50 ` [PATCH v9 0/2] power: introduce PM QoS interface Huisong Li
@ 2024-08-09 9:50 ` Huisong Li
2024-09-10 2:00 ` fengchengwen
2024-08-09 9:50 ` [PATCH v9 2/2] examples/l3fwd-power: add PM QoS configuration Huisong Li
1 sibling, 1 reply; 114+ messages in thread
From: Huisong Li @ 2024-08-09 9:50 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, liuyonglong,
lihuisong
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay sensitive and expect a very low
resume time, like interrupt packet receiving mode.
The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
interface is used by userspace to set and get the resume latency limit on
cpuX. Each cpuidle governor in Linux selects which idle state to enter
based on this CPU resume latency in its idle task.
The per-CPU PM QoS API can be used to control a CPU's idle state
selection. Setting a strict resume latency (zero value) restricts the CPU
to the shallowest idle state, which lowers the delay after sleep.
Signed-off-by: Huisong Li <lihuisong@huawei.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
doc/guides/prog_guide/power_man.rst | 24 ++++++
doc/guides/rel_notes/release_24_11.rst | 5 ++
lib/power/meson.build | 2 +
lib/power/rte_power_qos.c | 114 +++++++++++++++++++++++++
lib/power/rte_power_qos.h | 73 ++++++++++++++++
lib/power/version.map | 4 +
6 files changed, 222 insertions(+)
create mode 100644 lib/power/rte_power_qos.c
create mode 100644 lib/power/rte_power_qos.h
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index f6674efe2d..faa32b4320 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -249,6 +249,30 @@ Get Num Pkgs
Get Num Dies
Get the number of die's on a given package.
+
+PM QoS
+------
+
+The deeper the idle state, the lower the power consumption, but the longer
+the resume time. Some services are delay sensitive and expect a very low
+resume time, like interrupt packet receiving mode.
+
+The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
+interface is used by userspace to set and get the resume latency limit on
+cpuX. Each cpuidle governor in Linux selects which idle state to enter
+based on this CPU resume latency in its idle task.
+
+The per-CPU PM QoS API can be used to set and get the CPU resume latency based
+on this sysfs interface.
+
+The ``rte_power_qos_set_cpu_resume_latency()`` function controls the CPU's
+idle state selection in Linux. Setting a strict resume latency (zero value)
+restricts the CPU to the shallowest idle state, which lowers the delay of
+resuming service after sleeping.
+
+The ``rte_power_qos_get_cpu_resume_latency()`` function gets the resume
+latency on the specified CPU.
+
References
----------
diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
index 0ff70d9057..bd72d0a595 100644
--- a/doc/guides/rel_notes/release_24_11.rst
+++ b/doc/guides/rel_notes/release_24_11.rst
@@ -55,6 +55,11 @@ New Features
Also, make sure to start the actual text at the margin.
=======================================================
+* **Introduce per-CPU PM QoS interface.**
+
+ * Add per-CPU PM QoS interface to lower the delay after sleep by controlling
+ CPU idle state selection.
+
Removed Items
-------------
diff --git a/lib/power/meson.build b/lib/power/meson.build
index b8426589b2..8222e178b0 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -23,12 +23,14 @@ sources = files(
'rte_power.c',
'rte_power_uncore.c',
'rte_power_pmd_mgmt.c',
+ 'rte_power_qos.c',
)
headers = files(
'rte_power.h',
'rte_power_guest_channel.h',
'rte_power_pmd_mgmt.h',
'rte_power_uncore.h',
+ 'rte_power_qos.h',
)
if cc.has_argument('-Wno-cast-qual')
cflags += '-Wno-cast-qual'
diff --git a/lib/power/rte_power_qos.c b/lib/power/rte_power_qos.c
new file mode 100644
index 0000000000..375746f832
--- /dev/null
+++ b/lib/power/rte_power_qos.c
@@ -0,0 +1,114 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#include <errno.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <rte_lcore.h>
+#include <rte_log.h>
+
+#include "power_common.h"
+#include "rte_power_qos.h"
+
+#define PM_QOS_SYSFILE_RESUME_LATENCY_US \
+ "/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us"
+
+int
+rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency)
+{
+ char buf[LINE_MAX];
+ FILE *f;
+ int ret;
+
+ if (!rte_lcore_is_enabled(lcore_id)) {
+ POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
+ return -EINVAL;
+ }
+
+ if (latency < 0) {
+ POWER_LOG(ERR, "latency should be greater than or equal to 0");
+ return -EINVAL;
+ }
+
+ ret = open_core_sysfs_file(&f, "w", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ return ret;
+ }
+
+ /*
+ * Based on the sysfs interface pm_qos_resume_latency_us under
+ * @PM_QOS_SYSFILE_RESUME_LATENCY_US directory in kernel, their meanning
+ * is as follows for different input string.
+ * 1> the resume latency is 0 if the input is "n/a".
+ * 2> the resume latency is no constraint if the input is "0".
+ * 3> the resume latency is the actual value to be set.
+ */
+ if (latency == 0)
+ snprintf(buf, sizeof(buf), "%s", "n/a");
+ else if (latency == RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT)
+ snprintf(buf, sizeof(buf), "%u", 0);
+ else
+ snprintf(buf, sizeof(buf), "%u", latency);
+
+ ret = write_core_sysfs_s(f, buf);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to write "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ goto out;
+ }
+
+out:
+ if (f != NULL)
+ fclose(f);
+
+ return ret;
+}
+
+int
+rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id)
+{
+ char buf[LINE_MAX];
+ int latency = -1;
+ FILE *f;
+ int ret;
+
+ if (!rte_lcore_is_enabled(lcore_id)) {
+ POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
+ return -EINVAL;
+ }
+
+ ret = open_core_sysfs_file(&f, "r", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ return ret;
+ }
+
+ ret = read_core_sysfs_s(f, buf, sizeof(buf));
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to read "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ goto out;
+ }
+
+ /*
+ * Based on the sysfs interface pm_qos_resume_latency_us under
+ * @PM_QOS_SYSFILE_RESUME_LATENCY_US directory in kernel, their meanning
+ * is as follows for different output string.
+ * 1> the resume latency is 0 if the output is "n/a".
+ * 2> the resume latency is no constraint if the output is "0".
+ * 3> the resume latency is the actual value in used for other string.
+ */
+ if (strcmp(buf, "n/a") == 0)
+ latency = 0;
+ else {
+ latency = strtoul(buf, NULL, 10);
+ latency = latency == 0 ? RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT : latency;
+ }
+
+out:
+ if (f != NULL)
+ fclose(f);
+
+ return latency != -1 ? latency : ret;
+}
diff --git a/lib/power/rte_power_qos.h b/lib/power/rte_power_qos.h
new file mode 100644
index 0000000000..990c488373
--- /dev/null
+++ b/lib/power/rte_power_qos.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#ifndef RTE_POWER_QOS_H
+#define RTE_POWER_QOS_H
+
+#include <stdint.h>
+
+#include <rte_compat.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * @file rte_power_qos.h
+ *
+ * PM QoS API.
+ *
+ * The CPU-wide resume latency limit has a positive impact on this CPU's idle
+ * state selection in each cpuidle governor.
+ * Please see the PM QoS on CPU wide in the following link:
+ * https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
+ *
+ * The deeper the idle state, the lower the power consumption, but the
+ * longer the resume time. Some services are delay sensitive and expect a very
+ * low resume time, like interrupt packet receiving mode.
+ *
+ * In these cases, the per-CPU PM QoS API can be used to control a CPU's idle
+ * state selection. Setting a strict resume latency (zero value) restricts the
+ * CPU to the shallowest idle state, which lowers the delay after sleep.
+ */
+
+#define RTE_POWER_QOS_STRICT_LATENCY_VALUE 0
+#define RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT ((int)(UINT32_MAX >> 1))
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * @param lcore_id
+ * target logical core id
+ *
+ * @param latency
+ * The resume latency in microseconds; it should be greater than or equal to zero.
+ *
+ * @return
+ * 0 on success. Otherwise negative value is returned.
+ */
+__rte_experimental
+int rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Get the current resume latency of this logical core.
+ * The default value in the kernel is @see RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT
+ * if it has not been set.
+ *
+ * @return
+ * Negative value on failure.
+ * >= 0 means the actual resume latency limit on this core.
+ */
+__rte_experimental
+int rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* RTE_POWER_QOS_H */
diff --git a/lib/power/version.map b/lib/power/version.map
index c9a226614e..4e4955a4cf 100644
--- a/lib/power/version.map
+++ b/lib/power/version.map
@@ -51,4 +51,8 @@ EXPERIMENTAL {
rte_power_set_uncore_env;
rte_power_uncore_freqs;
rte_power_unset_uncore_env;
+
+ # added in 24.11
+ rte_power_qos_set_cpu_resume_latency;
+ rte_power_qos_get_cpu_resume_latency;
};
--
2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v9 1/2] power: introduce PM QoS API on CPU wide
2024-08-09 9:50 ` [PATCH v9 1/2] power: introduce PM QoS API on CPU wide Huisong Li
@ 2024-09-10 2:00 ` fengchengwen
2024-09-10 9:32 ` lihuisong (C)
0 siblings, 1 reply; 114+ messages in thread
From: fengchengwen @ 2024-09-10 2:00 UTC (permalink / raw)
To: Huisong Li, dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, liuyonglong
Hi Huisong
Please see comments inline.
Thanks
On 2024/8/9 17:50, Huisong Li wrote:
> The deeper the idle state, the lower the power consumption, but the longer
> the resume time. Some services are delay sensitive and expect a very low
> resume time, like interrupt packet receiving mode.
>
> The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
> interface is used by userspace to set and get the resume latency limit on
> cpuX. Each cpuidle governor in Linux selects which idle state to enter
> based on this CPU resume latency in its idle task.
>
> The per-CPU PM QoS API can be used to control a CPU's idle state
> selection. Setting a strict resume latency (zero value) restricts the CPU
> to the shallowest idle state, which lowers the delay after sleep.
>
> Signed-off-by: Huisong Li <lihuisong@huawei.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> ---
...
> diff --git a/lib/power/rte_power_qos.c b/lib/power/rte_power_qos.c
> new file mode 100644
> index 0000000000..375746f832
> --- /dev/null
> +++ b/lib/power/rte_power_qos.c
> @@ -0,0 +1,114 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2024 HiSilicon Limited
> + */
> +
> +#include <errno.h>
> +#include <stdlib.h>
> +#include <string.h>
> +
> +#include <rte_lcore.h>
> +#include <rte_log.h>
> +
> +#include "power_common.h"
> +#include "rte_power_qos.h"
> +
> +#define PM_QOS_SYSFILE_RESUME_LATENCY_US \
> + "/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us"
> +
> +int
> +rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency)
> +{
> + char buf[LINE_MAX];
No need for LINE_MAX; [32] would be enough.
> + FILE *f;
> + int ret;
> +
> + if (!rte_lcore_is_enabled(lcore_id)) {
> + POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
> + return -EINVAL;
> + }
> +
> + if (latency < 0) {
> + POWER_LOG(ERR, "latency should be greater than and equal to 0");
> + return -EINVAL;
> + }
> +
> + ret = open_core_sysfs_file(&f, "w", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
> + if (ret != 0) {
> + POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
> + return ret;
> + }
> +
> + /*
> + * Based on the sysfs interface pm_qos_resume_latency_us under
> + * @PM_QOS_SYSFILE_RESUME_LATENCY_US directory in kernel, their meanning
meanning -> meaning
> + * is as follows for different input string.
> + * 1> the resume latency is 0 if the input is "n/a".
> + * 2> the resume latency is no constraint if the input is "0".
> + * 3> the resume latency is the actual value to be set.
> + */
> + if (latency == 0)
> + snprintf(buf, sizeof(buf), "%s", "n/a");
> + else if (latency == RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT)
> + snprintf(buf, sizeof(buf), "%u", 0);
> + else
> + snprintf(buf, sizeof(buf), "%u", latency);
> +
> + ret = write_core_sysfs_s(f, buf);
> + if (ret != 0) {
> + POWER_LOG(ERR, "Failed to write "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
> + goto out;
No need for the goto here.
> + }
> +
> +out:
> + if (f != NULL)
> + fclose(f);
Just call fclose(f), because f is valid here.
> +
> + return ret;
> +}
> +
> +int
> +rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id)
> +{
> + char buf[LINE_MAX];
> + int latency = -1;
> + FILE *f;
> + int ret;
> +
> + if (!rte_lcore_is_enabled(lcore_id)) {
> + POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
> + return -EINVAL;
> + }
> +
> + ret = open_core_sysfs_file(&f, "r", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
> + if (ret != 0) {
> + POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
> + return ret;
> + }
> +
> + ret = read_core_sysfs_s(f, buf, sizeof(buf));
> + if (ret != 0) {
> + POWER_LOG(ERR, "Failed to read "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
> + goto out;
> + }
> +
> + /*
> + * Based on the sysfs interface pm_qos_resume_latency_us under
> + * @PM_QOS_SYSFILE_RESUME_LATENCY_US directory in kernel, their meanning
meanning -> meaning
> + * is as follows for different output string.
> + * 1> the resume latency is 0 if the output is "n/a".
> + * 2> the resume latency is no constraint if the output is "0".
> + * 3> the resume latency is the actual value in used for other string.
> + */
> + if (strcmp(buf, "n/a") == 0)
> + latency = 0;
> + else {
> + latency = strtoul(buf, NULL, 10);
> + latency = latency == 0 ? RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT : latency;
> + }
> +
> +out:
> + if (f != NULL)
> + fclose(f);
Just call fclose(f), because f is valid here.
> +
> + return latency != -1 ? latency : ret;
> +}
> diff --git a/lib/power/rte_power_qos.h b/lib/power/rte_power_qos.h
> new file mode 100644
> index 0000000000..990c488373
> --- /dev/null
> +++ b/lib/power/rte_power_qos.h
> @@ -0,0 +1,73 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2024 HiSilicon Limited
> + */
> +
> +#ifndef RTE_POWER_QOS_H
> +#define RTE_POWER_QOS_H
> +
> +#include <stdint.h>
> +
> +#include <rte_compat.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/**
> + * @file rte_power_qos.h
> + *
> + * PM QoS API.
> + *
> + * The CPU-wide resume latency limit has a positive impact on this CPU's idle
> + * state selection in each cpuidle governor.
> + * Please see the PM QoS on CPU wide in the following link:
> + * https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
> + *
> + * The deeper the idle state, the lower the power consumption, but the
> + * longer the resume time. Some service are delay sensitive and very except the
> + * low resume time, like interrupt packet receiving mode.
> + *
> + * In these case, per-CPU PM QoS API can be used to control this CPU's idle
> + * state selection and limit just enter the shallowest idle state to low the
> + * delay after sleep by setting strict resume latency (zero value).
> + */
> +
> +#define RTE_POWER_QOS_STRICT_LATENCY_VALUE 0
> +#define RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT ((int)(UINT32_MAX >> 1))
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * @param lcore_id
> + * target logical core id
> + *
> + * @param latency
> + * The latency should be greater than and equal to zero in microseconds unit.
> + *
> + * @return
> + * 0 on success. Otherwise negative value is returned.
> + */
> +__rte_experimental
> +int rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Get the current resume latency of this logical core.
> + * The default value in kernel is @see RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT
> + * if don't set it.
> + *
> + * @return
> + * Negative value on failure.
> + * >= 0 means the actual resume latency limit on this core.
> + */
> +__rte_experimental
> +int rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* RTE_POWER_QOS_H */
> diff --git a/lib/power/version.map b/lib/power/version.map
> index c9a226614e..4e4955a4cf 100644
> --- a/lib/power/version.map
> +++ b/lib/power/version.map
> @@ -51,4 +51,8 @@ EXPERIMENTAL {
> rte_power_set_uncore_env;
> rte_power_uncore_freqs;
> rte_power_unset_uncore_env;
> +
> + # added in 24.11
> + rte_power_qos_set_cpu_resume_latency;
> + rte_power_qos_get_cpu_resume_latency;
Please order these alphabetically.
Another question: I think renaming cpu to core may be more accurate. Although sysfs exports it as cpu, in DPDK it means core.
There are also some rte_power_core_xxx names in the rte_power library, so I think it is better to stay consistent with them.
> };
>
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v9 1/2] power: introduce PM QoS API on CPU wide
2024-09-10 2:00 ` fengchengwen
@ 2024-09-10 9:32 ` lihuisong (C)
2024-09-12 1:14 ` fengchengwen
0 siblings, 1 reply; 114+ messages in thread
From: lihuisong (C) @ 2024-09-10 9:32 UTC (permalink / raw)
To: fengchengwen, dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, liuyonglong
Hi Chengwen,
Thanks for your review.
On 2024/9/10 10:00, fengchengwen wrote:
> Hi Huisong
>
> Please see comments inline.
>
> Thanks
>
> On 2024/8/9 17:50, Huisong Li wrote:
>> The deeper the idle state, the lower the power consumption, but the longer
>> the resume time. Some services are delay sensitive and expect a very low
>> resume time, like interrupt packet receiving mode.
>>
>> The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
>> interface is used by userspace to set and get the resume latency limit on
>> cpuX. Each cpuidle governor in Linux selects which idle state to enter
>> based on this CPU resume latency in its idle task.
>>
>> The per-CPU PM QoS API can be used to control a CPU's idle state
>> selection. Setting a strict resume latency (zero value) restricts the CPU
>> to the shallowest idle state, which lowers the delay after sleep.
>>
>> Signed-off-by: Huisong Li <lihuisong@huawei.com>
>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>> ---
> ...
>
>> diff --git a/lib/power/rte_power_qos.c b/lib/power/rte_power_qos.c
>> new file mode 100644
>> index 0000000000..375746f832
>> --- /dev/null
>> +++ b/lib/power/rte_power_qos.c
>> @@ -0,0 +1,114 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2024 HiSilicon Limited
>> + */
>> +
>> +#include <errno.h>
>> +#include <stdlib.h>
>> +#include <string.h>
>> +
>> +#include <rte_lcore.h>
>> +#include <rte_log.h>
>> +
>> +#include "power_common.h"
>> +#include "rte_power_qos.h"
>> +
>> +#define PM_QOS_SYSFILE_RESUME_LATENCY_US \
>> + "/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us"
>> +
>> +int
>> +rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency)
>> +{
>> + char buf[LINE_MAX];
> no need LINE_MAX, [32] would enough.
Ack
>
>> + FILE *f;
>> + int ret;
>> +
>> + if (!rte_lcore_is_enabled(lcore_id)) {
>> + POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
>> + return -EINVAL;
>> + }
>> +
>> + if (latency < 0) {
>> + POWER_LOG(ERR, "latency should be greater than and equal to 0");
>> + return -EINVAL;
>> + }
>> +
>> + ret = open_core_sysfs_file(&f, "w", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
>> + if (ret != 0) {
>> + POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
>> + return ret;
>> + }
>> +
>> + /*
>> + * Based on the sysfs interface pm_qos_resume_latency_us under
>> + * @PM_QOS_SYSFILE_RESUME_LATENCY_US directory in kernel, their meanning
> meanning -> meaning
Ack
>
>> + * is as follows for different input string.
>> + * 1> the resume latency is 0 if the input is "n/a".
>> + * 2> the resume latency is no constraint if the input is "0".
>> + * 3> the resume latency is the actual value to be set.
>> + */
>> + if (latency == 0)
>> + snprintf(buf, sizeof(buf), "%s", "n/a");
>> + else if (latency == RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT)
>> + snprintf(buf, sizeof(buf), "%u", 0);
>> + else
>> + snprintf(buf, sizeof(buf), "%u", latency);
>> +
>> + ret = write_core_sysfs_s(f, buf);
>> + if (ret != 0) {
>> + POWER_LOG(ERR, "Failed to write "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
>> + goto out;
> no need of goto
Ack
>
>> + }
>> +
>> +out:
>> + if (f != NULL)
>> + fclose(f);
> just fclose(f) because f is valid here.
Ack
>> +
>> + return ret;
>> +}
>> +
>> +int
>> +rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id)
>> +{
>> + char buf[LINE_MAX];
>> + int latency = -1;
>> + FILE *f;
>> + int ret;
>> +
>> + if (!rte_lcore_is_enabled(lcore_id)) {
>> + POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
>> + return -EINVAL;
>> + }
>> +
>> + ret = open_core_sysfs_file(&f, "r", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
>> + if (ret != 0) {
>> + POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
>> + return ret;
>> + }
>> +
>> + ret = read_core_sysfs_s(f, buf, sizeof(buf));
>> + if (ret != 0) {
>> + POWER_LOG(ERR, "Failed to read "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
>> + goto out;
>> + }
>> +
>> + /*
>> + * Based on the sysfs interface pm_qos_resume_latency_us under
>> + * @PM_QOS_SYSFILE_RESUME_LATENCY_US directory in kernel, their meanning
> meanning -> meaning
Ack
>
>> + * is as follows for different output string.
>> + * 1> the resume latency is 0 if the output is "n/a".
>> + * 2> the resume latency is no constraint if the output is "0".
>> + * 3> the resume latency is the actual value in used for other string.
>> + */
>> + if (strcmp(buf, "n/a") == 0)
>> + latency = 0;
>> + else {
>> + latency = strtoul(buf, NULL, 10);
>> + latency = latency == 0 ? RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT : latency;
>> + }
>> +
>> +out:
>> + if (f != NULL)
>> + fclose(f);
> just fclose(f) because f is valid here.
Ack
>
>> +
>> + return latency != -1 ? latency : ret;
>> +}
>> diff --git a/lib/power/rte_power_qos.h b/lib/power/rte_power_qos.h
>> new file mode 100644
>> index 0000000000..990c488373
>> --- /dev/null
>> +++ b/lib/power/rte_power_qos.h
>> @@ -0,0 +1,73 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2024 HiSilicon Limited
>> + */
>> +
>> +#ifndef RTE_POWER_QOS_H
>> +#define RTE_POWER_QOS_H
>> +
>> +#include <stdint.h>
>> +
>> +#include <rte_compat.h>
>> +
>> +#ifdef __cplusplus
>> +extern "C" {
>> +#endif
>> +
>> +/**
>> + * @file rte_power_qos.h
>> + *
>> + * PM QoS API.
>> + *
>> + * The CPU-wide resume latency limit has a positive impact on this CPU's idle
>> + * state selection in each cpuidle governor.
>> + * Please see the PM QoS on CPU wide in the following link:
>> + * https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
>> + *
>> + * The deeper the idle state, the lower the power consumption, but the
>> + * longer the resume time. Some service are delay sensitive and very except the
>> + * low resume time, like interrupt packet receiving mode.
>> + *
>> + * In these case, per-CPU PM QoS API can be used to control this CPU's idle
>> + * state selection and limit just enter the shallowest idle state to low the
>> + * delay after sleep by setting strict resume latency (zero value).
>> + */
>> +
>> +#define RTE_POWER_QOS_STRICT_LATENCY_VALUE 0
>> +#define RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT ((int)(UINT32_MAX >> 1))
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice.
>> + *
>> + * @param lcore_id
>> + * target logical core id
>> + *
>> + * @param latency
>> + * The latency should be greater than and equal to zero in microseconds unit.
>> + *
>> + * @return
>> + * 0 on success. Otherwise negative value is returned.
>> + */
>> +__rte_experimental
>> +int rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency);
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice.
>> + *
>> + * Get the current resume latency of this logical core.
>> + * The default value in kernel is @see RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT
>> + * if don't set it.
>> + *
>> + * @return
>> + * Negative value on failure.
>> + * >= 0 means the actual resume latency limit on this core.
>> + */
>> +__rte_experimental
>> +int rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id);
>> +
>> +#ifdef __cplusplus
>> +}
>> +#endif
>> +
>> +#endif /* RTE_POWER_QOS_H */
>> diff --git a/lib/power/version.map b/lib/power/version.map
>> index c9a226614e..4e4955a4cf 100644
>> --- a/lib/power/version.map
>> +++ b/lib/power/version.map
>> @@ -51,4 +51,8 @@ EXPERIMENTAL {
>> rte_power_set_uncore_env;
>> rte_power_uncore_freqs;
>> rte_power_unset_uncore_env;
>> +
>> + # added in 24.11
>> + rte_power_qos_set_cpu_resume_latency;
>> + rte_power_qos_get_cpu_resume_latency;
> order by alphabetic.
Ack
>
> another question, I think rename cpu with core maybe more accurate, despite sysfs export with cpu, but in DPDK it means core.
> and there are some rte_power_core_xxx name in rte_power library, I think better to keep the same.
Firstly, the rte_power_qos_set/get_cpu_resume_latency names are consistent
with the Linux sysfs interface. Having the same name makes it easier for
users to relate the two.
In addition, Sivaprasad Tummala is reworking the power library, and the
rte_power_core_xxx names might also be changed.
>
>> };
>>
> .
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v9 1/2] power: introduce PM QoS API on CPU wide
2024-09-10 9:32 ` lihuisong (C)
@ 2024-09-12 1:14 ` fengchengwen
0 siblings, 0 replies; 114+ messages in thread
From: fengchengwen @ 2024-09-12 1:14 UTC (permalink / raw)
To: lihuisong (C), dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, liuyonglong
On 2024/9/10 17:32, lihuisong (C) wrote:
> Hi Chengwen,
>
> Thanks for your review.
>
> On 2024/9/10 10:00, fengchengwen wrote:
>> Hi Huisong
>>
>> Please see comments inline.
>>
>> Thanks
>>
>> On 2024/8/9 17:50, Huisong Li wrote:
>>> The deeper the idle state, the lower the power consumption, but the longer
>>> the resume time. Some services are delay sensitive and expect a very low
>>> resume time, like interrupt packet receiving mode.
>>>
>>> The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
>>> interface is used by userspace to set and get the resume latency limit on
>>> cpuX. Each cpuidle governor in Linux selects which idle state to enter
>>> based on this CPU resume latency in its idle task.
>>>
>>> The per-CPU PM QoS API can be used to control a CPU's idle state
>>> selection. Setting a strict resume latency (zero value) restricts the CPU
>>> to the shallowest idle state, which lowers the delay after sleep.
>>>
>>> Signed-off-by: Huisong Li <lihuisong@huawei.com>
>>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>>> ---
>> ...
>>
...
>>> diff --git a/lib/power/version.map b/lib/power/version.map
>>> index c9a226614e..4e4955a4cf 100644
>>> --- a/lib/power/version.map
>>> +++ b/lib/power/version.map
>>> @@ -51,4 +51,8 @@ EXPERIMENTAL {
>>> rte_power_set_uncore_env;
>>> rte_power_uncore_freqs;
>>> rte_power_unset_uncore_env;
>>> +
>>> + # added in 24.11
>>> + rte_power_qos_set_cpu_resume_latency;
>>> + rte_power_qos_get_cpu_resume_latency;
>> order by alphabetic.
> Ack
>>
>> another question, I think rename cpu with core maybe more accurate, despite sysfs export with cpu, but in DPDK it means core.
>> and there are some rte_power_core_xxx name in rte_power library, I think better to keep the same.
> Firstly, the rte_power_qos_set/get_cpu_resume_latency names are consistent with the Linux sysfs interface. Having the same name makes it easier for users to relate the two.
> In addition, Sivaprasad Tummala is reworking the power library, and the rte_power_core_xxx names might also be changed.
ok
>>
>>> };
>>>
>> .
^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH v9 2/2] examples/l3fwd-power: add PM QoS configuration
2024-08-09 9:50 ` [PATCH v9 0/2] power: introduce PM QoS interface Huisong Li
2024-08-09 9:50 ` [PATCH v9 1/2] power: introduce PM QoS API on CPU wide Huisong Li
@ 2024-08-09 9:50 ` Huisong Li
2024-09-10 2:26 ` fengchengwen
1 sibling, 1 reply; 114+ messages in thread
From: Huisong Li @ 2024-08-09 9:50 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, liuyonglong,
lihuisong
Add PM QoS configuration to decrease the delay after sleep caused by
entering a deeper idle state.
Signed-off-by: Huisong Li <lihuisong@huawei.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
examples/l3fwd-power/main.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 2bb6b092c3..9b386c3710 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
#include <rte_telemetry.h>
#include <rte_power_pmd_mgmt.h>
#include <rte_power_uncore.h>
+#include <rte_power_qos.h>
#include "perf_core.h"
#include "main.h"
@@ -2260,6 +2261,22 @@ init_power_library(void)
return -1;
}
}
+
+ RTE_LCORE_FOREACH(lcore_id) {
+ /*
+ * Set a strict latency limit on the worker lcores so that
+ * the CPU only enters the shallowest idle state.
+ */
+ ret = rte_power_qos_set_cpu_resume_latency(lcore_id,
+ RTE_POWER_QOS_STRICT_LATENCY_VALUE);
+ if (ret < 0) {
+ RTE_LOG(ERR, L3FWD_POWER,
+ "Failed to set strict resume latency on CPU%u.\n",
+ lcore_id);
+ return ret;
+ }
+ }
+
return ret;
}
@@ -2299,6 +2316,13 @@ deinit_power_library(void)
}
}
}
+
+ RTE_LCORE_FOREACH(lcore_id) {
+ /* Restore the original value in kernel. */
+ rte_power_qos_set_cpu_resume_latency(lcore_id,
+ RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT);
+ }
+
return ret;
}
--
2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v9 2/2] examples/l3fwd-power: add PM QoS configuration
2024-08-09 9:50 ` [PATCH v9 2/2] examples/l3fwd-power: add PM QoS configuration Huisong Li
@ 2024-09-10 2:26 ` fengchengwen
2024-09-10 12:07 ` lihuisong (C)
0 siblings, 1 reply; 114+ messages in thread
From: fengchengwen @ 2024-09-10 2:26 UTC (permalink / raw)
To: Huisong Li, dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, liuyonglong
Hi Huisong,
On 2024/8/9 17:50, Huisong Li wrote:
> Add PM QoS configuration to decrease the delay after sleep caused by
> entering a deeper idle state.
>
> Signed-off-by: Huisong Li <lihuisong@huawei.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> ---
> examples/l3fwd-power/main.c | 24 ++++++++++++++++++++++++
> 1 file changed, 24 insertions(+)
>
> diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
> index 2bb6b092c3..9b386c3710 100644
> --- a/examples/l3fwd-power/main.c
> +++ b/examples/l3fwd-power/main.c
> @@ -47,6 +47,7 @@
> #include <rte_telemetry.h>
> #include <rte_power_pmd_mgmt.h>
> #include <rte_power_uncore.h>
> +#include <rte_power_qos.h>
>
> #include "perf_core.h"
> #include "main.h"
> @@ -2260,6 +2261,22 @@ init_power_library(void)
> return -1;
> }
> }
> +
> + RTE_LCORE_FOREACH(lcore_id) {
> + /*
> + * Set the worker lcore's to have strict latency limit to allow
> + * the CPU to enter the shallowest idle state.
> + */
> + ret = rte_power_qos_set_cpu_resume_latency(lcore_id,
> + RTE_POWER_QOS_STRICT_LATENCY_VALUE);
> + if (ret < 0) {
> + RTE_LOG(ERR, L3FWD_POWER,
> + "Failed to set strict resume latency on CPU%u.\n",
Suggest using "core%u" here and checking if (ret != 0).
Also, how about using a warning instead? If the current system doesn't support it, we just print a warning message
and let it continue.
> + lcore_id);
> + return ret;
> + }
> + }
> +
> return ret;
> }
>
> @@ -2299,6 +2316,13 @@ deinit_power_library(void)
> }
> }
> }
> +
> + RTE_LCORE_FOREACH(lcore_id) {
> + /* Restore the original value in kernel. */
> + rte_power_qos_set_cpu_resume_latency(lcore_id,
> + RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT);
> + }
> +
> return ret;
> }
>
>
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v9 2/2] examples/l3fwd-power: add PM QoS configuration
2024-09-10 2:26 ` fengchengwen
@ 2024-09-10 12:07 ` lihuisong (C)
2024-09-12 1:15 ` fengchengwen
0 siblings, 1 reply; 114+ messages in thread
From: lihuisong (C) @ 2024-09-10 12:07 UTC (permalink / raw)
To: fengchengwen, dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, liuyonglong
Hi chengwen,
On 2024/9/10 10:26, fengchengwen wrote:
> Hi Huisong,
>
> On 2024/8/9 17:50, Huisong Li wrote:
>> Add PM QoS configuration to decrease the delay after sleep in case of
>> entering deeper idle state.
>>
>> Signed-off-by: Huisong Li <lihuisong@huawei.com>
>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>> ---
>> examples/l3fwd-power/main.c | 24 ++++++++++++++++++++++++
>> 1 file changed, 24 insertions(+)
>>
>> diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
>> index 2bb6b092c3..9b386c3710 100644
>> --- a/examples/l3fwd-power/main.c
>> +++ b/examples/l3fwd-power/main.c
>> @@ -47,6 +47,7 @@
>> #include <rte_telemetry.h>
>> #include <rte_power_pmd_mgmt.h>
>> #include <rte_power_uncore.h>
>> +#include <rte_power_qos.h>
>>
>> #include "perf_core.h"
>> #include "main.h"
>> @@ -2260,6 +2261,22 @@ init_power_library(void)
>> return -1;
>> }
>> }
>> +
>> + RTE_LCORE_FOREACH(lcore_id) {
>> + /*
>> + * Set the worker lcore's to have strict latency limit to allow
>> + * the CPU to enter the shallowest idle state.
>> + */
>> + ret = rte_power_qos_set_cpu_resume_latency(lcore_id,
>> + RTE_POWER_QOS_STRICT_LATENCY_VALUE);
>> + if (ret < 0) {
>> + RTE_LOG(ERR, L3FWD_POWER,
>> + "Failed to set strict resume latency on CPU%u.\n",
> suggest on core%u and use if (ret != 0)
Ack
>
> and how about use warning, if current system don't support it, we just give a warning message
> but let's it continue.
The power lib is only supported and compiled on Linux,
and Linux always enables this feature, so it is always supported there.
I don't know what would happen if l3fwd-power were compiled on Windows,
but that is a separate issue: other power APIs, like rte_power_init,
are already used directly in l3fwd-power without any condition.
So how about continuing to use the error message?
>
>
>> + lcore_id);
>> + return ret;
>> + }
>> + }
>> +
>> return ret;
>> }
>>
>> @@ -2299,6 +2316,13 @@ deinit_power_library(void)
>> }
>> }
>> }
>> +
>> + RTE_LCORE_FOREACH(lcore_id) {
>> + /* Restore the original value in kernel. */
>> + rte_power_qos_set_cpu_resume_latency(lcore_id,
>> + RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT);
>> + }
>> +
>> return ret;
>> }
>>
>>
> .
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v9 2/2] examples/l3fwd-power: add PM QoS configuration
2024-09-10 12:07 ` lihuisong (C)
@ 2024-09-12 1:15 ` fengchengwen
0 siblings, 0 replies; 114+ messages in thread
From: fengchengwen @ 2024-09-12 1:15 UTC (permalink / raw)
To: lihuisong (C), dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, liuyonglong
On 2024/9/10 20:07, lihuisong (C) wrote:
> Hi chengwen,
>
> On 2024/9/10 10:26, fengchengwen wrote:
>> Hi Huisong,
>>
>> On 2024/8/9 17:50, Huisong Li wrote:
>>> Add PM QoS configuration to decrease the delay after sleep in case of
>>> entering deeper idle state.
>>>
>>> Signed-off-by: Huisong Li <lihuisong@huawei.com>
>>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>>> ---
>>> examples/l3fwd-power/main.c | 24 ++++++++++++++++++++++++
>>> 1 file changed, 24 insertions(+)
>>>
>>> diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
>>> index 2bb6b092c3..9b386c3710 100644
>>> --- a/examples/l3fwd-power/main.c
>>> +++ b/examples/l3fwd-power/main.c
>>> @@ -47,6 +47,7 @@
>>> #include <rte_telemetry.h>
>>> #include <rte_power_pmd_mgmt.h>
>>> #include <rte_power_uncore.h>
>>> +#include <rte_power_qos.h>
>>> #include "perf_core.h"
>>> #include "main.h"
>>> @@ -2260,6 +2261,22 @@ init_power_library(void)
>>> return -1;
>>> }
>>> }
>>> +
>>> + RTE_LCORE_FOREACH(lcore_id) {
>>> + /*
>>> + * Set the worker lcore's to have strict latency limit to allow
>>> + * the CPU to enter the shallowest idle state.
>>> + */
>>> + ret = rte_power_qos_set_cpu_resume_latency(lcore_id,
>>> + RTE_POWER_QOS_STRICT_LATENCY_VALUE);
>>> + if (ret < 0) {
>>> + RTE_LOG(ERR, L3FWD_POWER,
>>> + "Failed to set strict resume latency on CPU%u.\n",
>> suggest on core%u and use if (ret != 0)
> Ack
>>
>> and how about use warning, if current system don't support it, we just give a warning message
>> but let's it continue.
>
>
> Because power lib is just supported and compiled on Linux.
> And Linux always enable this feature. So Linux always support it.
> I don't know what it would be like to compile l3fwd-power on windows.
> But this is the another issue and other power APIs, like rte_power_init, are used directly in l3fwd-power without any condition.
> So how about contiue to use error message?
ok
>>
>>
>>> + lcore_id);
>>> + return ret;
>>> + }
>>> + }
>>> +
>>> return ret;
>>> }
>>> @@ -2299,6 +2316,13 @@ deinit_power_library(void)
>>> }
>>> }
>>> }
>>> +
>>> + RTE_LCORE_FOREACH(lcore_id) {
>>> + /* Restore the original value in kernel. */
>>> + rte_power_qos_set_cpu_resume_latency(lcore_id,
>>> + RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT);
>>> + }
>>> +
>>> return ret;
>>> }
>>>
>> .
^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH v10 0/2] power: introduce PM QoS interface
2024-03-20 10:55 [PATCH 0/2] introduce PM QoS interface Huisong Li
` (10 preceding siblings ...)
2024-08-09 9:50 ` [PATCH v9 0/2] power: introduce PM QoS interface Huisong Li
@ 2024-09-12 2:38 ` Huisong Li
2024-09-12 2:38 ` [PATCH v10 1/2] power: introduce PM QoS API on CPU wide Huisong Li
` (3 more replies)
2024-10-21 11:42 ` [PATCH v11 " Huisong Li
` (3 subsequent siblings)
15 siblings, 4 replies; 114+ messages in thread
From: Huisong Li @ 2024-09-12 2:38 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, fengchengwen,
liuyonglong, lihuisong
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay sensitive and expect a very low
resume time, like interrupt packet receiving mode.
The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
interface lets userspace set and get the resume latency limit on cpuX.
Please see the description in the kernel documentation[1].
Each cpuidle governor in Linux selects which idle state to enter based on
this CPU resume latency in its idle task.
The per-CPU PM QoS API can be used to control this CPU's idle state
selection and, by setting a strict resume latency (zero value), restrict it
to the shallowest idle state to lower the delay after sleep.
[1] https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
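The string convention of this sysfs file is easy to get backwards, so here
is a minimal sketch of the conversion in both directions. It is not the code
from the patch: the names qos_latency_to_sysfs(), qos_sysfs_to_latency() and
QOS_NO_CONSTRAINT are made up for illustration, and the sketch assumes that
"n/a" means no resume latency is tolerated and "0" means no constraint, as
the kernel ABI document describes.

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sentinel for "no constraint"; chosen for this sketch only. */
#define QOS_NO_CONSTRAINT INT_MAX

/*
 * Convert a requested latency in microseconds into the string that would be
 * written to pm_qos_resume_latency_us. Returns 0 on success, -1 on bad input
 * or if the buffer is too small.
 */
static int
qos_latency_to_sysfs(int latency_us, char *buf, size_t len)
{
	int n;

	if (latency_us < 0)
		return -1;

	if (latency_us == 0)
		n = snprintf(buf, len, "n/a");	/* strict: no latency tolerated */
	else if (latency_us == QOS_NO_CONSTRAINT)
		n = snprintf(buf, len, "0");	/* "0" means no constraint */
	else
		n = snprintf(buf, len, "%d", latency_us);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Convert a string read from pm_qos_resume_latency_us back into a latency.
 * A trailing newline is tolerated. Returns -1 if the string is not valid.
 */
static int
qos_sysfs_to_latency(const char *s)
{
	char *end;
	long v;

	if (strncmp(s, "n/a", 3) == 0)
		return 0;

	v = strtol(s, &end, 10);
	if (end == s || v < 0 || v > INT_MAX)
		return -1;

	return v == 0 ? QOS_NO_CONSTRAINT : (int)v;
}
```

Keeping the conversion separate from the file I/O makes it possible to check
the "n/a" and "0" cases without touching sysfs.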
---
v10:
- replace LINE_MAX with a custom macro and fix two typos.
v9:
- move new feature description from release_24_07.rst to release_24_11.rst.
v8:
- update the latest code to resolve CI warning
v7:
- remove the dead rte_lcore_is_enabled code in patch[2/2]
v6:
- update release_24_07.rst based on dpdk repo to resolve CI warning.
v5:
- use LINE_MAX to replace BUFSIZ, and use snprintf to replace sprintf.
v4:
- fix some comments based on Stephen's review
- add stdint.h include
- add Acked-by Morten Brørup <mb@smartsharesystems.com>
v3:
- add RTE_POWER_xxx prefix for some macro in header
- add the check for lcore_id with rte_lcore_is_enabled
v2:
- use per-CPU PM QoS to replace the system-wide one
Huisong Li (2):
power: introduce PM QoS API on CPU wide
examples/l3fwd-power: add PM QoS configuration
doc/guides/prog_guide/power_man.rst | 24 ++++++
doc/guides/rel_notes/release_24_11.rst | 5 ++
examples/l3fwd-power/main.c | 24 ++++++
lib/power/meson.build | 2 +
lib/power/rte_power_qos.c | 111 +++++++++++++++++++++++++
lib/power/rte_power_qos.h | 73 ++++++++++++++++
lib/power/version.map | 4 +
7 files changed, 243 insertions(+)
create mode 100644 lib/power/rte_power_qos.c
create mode 100644 lib/power/rte_power_qos.h
--
2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH v10 1/2] power: introduce PM QoS API on CPU wide
2024-09-12 2:38 ` [PATCH v10 0/2] power: introduce PM QoS interface Huisong Li
@ 2024-09-12 2:38 ` Huisong Li
2024-10-13 1:10 ` Stephen Hemminger
2024-10-14 8:29 ` Konstantin Ananyev
2024-09-12 2:38 ` [PATCH v10 2/2] examples/l3fwd-power: add PM QoS configuration Huisong Li
` (2 subsequent siblings)
3 siblings, 2 replies; 114+ messages in thread
From: Huisong Li @ 2024-09-12 2:38 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, fengchengwen,
liuyonglong, lihuisong
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay sensitive and expect a very low
resume time, like interrupt packet receiving mode.
The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
interface lets userspace set and get the resume latency limit on cpuX.
Each cpuidle governor in Linux selects which idle state to enter
based on this CPU resume latency in its idle task.
The per-CPU PM QoS API can be used to control this CPU's idle state
selection and, by setting a strict resume latency (zero value), restrict it
to the shallowest idle state to lower the delay after sleep.
Signed-off-by: Huisong Li <lihuisong@huawei.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
doc/guides/prog_guide/power_man.rst | 24 ++++++
doc/guides/rel_notes/release_24_11.rst | 5 ++
lib/power/meson.build | 2 +
lib/power/rte_power_qos.c | 111 +++++++++++++++++++++++++
lib/power/rte_power_qos.h | 73 ++++++++++++++++
lib/power/version.map | 4 +
6 files changed, 219 insertions(+)
create mode 100644 lib/power/rte_power_qos.c
create mode 100644 lib/power/rte_power_qos.h
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index f6674efe2d..faa32b4320 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -249,6 +249,30 @@ Get Num Pkgs
Get Num Dies
Get the number of die's on a given package.
+
+PM QoS
+------
+
+The deeper the idle state, the lower the power consumption, but the longer
+the resume time. Some services are delay sensitive and expect a very low
+resume time, like interrupt packet receiving mode.
+
+The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
+interface is used to set and get the resume latency limit on the cpuX for
+userspace. Each cpuidle governor in Linux selects which idle state to enter
+based on this CPU resume latency in its idle task.
+
+The per-CPU PM QoS API can be used to set and get the CPU resume latency based
+on this sysfs.
+
+The ``rte_power_qos_set_cpu_resume_latency()`` function can control the CPU's
+idle state selection in Linux and, by setting a strict resume latency (zero
+value), restrict the CPU to the shallowest idle state to lower the delay of
+resuming service after sleeping.
+
+The ``rte_power_qos_get_cpu_resume_latency()`` function can get the resume
+latency on the specified CPU.
+
References
----------
diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
index 0ff70d9057..bd72d0a595 100644
--- a/doc/guides/rel_notes/release_24_11.rst
+++ b/doc/guides/rel_notes/release_24_11.rst
@@ -55,6 +55,11 @@ New Features
Also, make sure to start the actual text at the margin.
=======================================================
+* **Introduce per-CPU PM QoS interface.**
+
+ * Add per-CPU PM QoS interface to lower the delay after sleep by controlling
+ CPU idle state selection.
+
Removed Items
-------------
diff --git a/lib/power/meson.build b/lib/power/meson.build
index b8426589b2..8222e178b0 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -23,12 +23,14 @@ sources = files(
'rte_power.c',
'rte_power_uncore.c',
'rte_power_pmd_mgmt.c',
+ 'rte_power_qos.c',
)
headers = files(
'rte_power.h',
'rte_power_guest_channel.h',
'rte_power_pmd_mgmt.h',
'rte_power_uncore.h',
+ 'rte_power_qos.h',
)
if cc.has_argument('-Wno-cast-qual')
cflags += '-Wno-cast-qual'
diff --git a/lib/power/rte_power_qos.c b/lib/power/rte_power_qos.c
new file mode 100644
index 0000000000..8eb26cd41a
--- /dev/null
+++ b/lib/power/rte_power_qos.c
@@ -0,0 +1,111 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#include <errno.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <rte_lcore.h>
+#include <rte_log.h>
+
+#include "power_common.h"
+#include "rte_power_qos.h"
+
+#define PM_QOS_SYSFILE_RESUME_LATENCY_US \
+ "/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us"
+
+#define PM_QOS_CPU_RESUME_LATENCY_BUF_LEN 32
+
+int
+rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency)
+{
+ char buf[PM_QOS_CPU_RESUME_LATENCY_BUF_LEN];
+ FILE *f;
+ int ret;
+
+ if (!rte_lcore_is_enabled(lcore_id)) {
+ POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
+ return -EINVAL;
+ }
+
+ if (latency < 0) {
+ POWER_LOG(ERR, "latency should be greater than or equal to 0");
+ return -EINVAL;
+ }
+
+ ret = open_core_sysfs_file(&f, "w", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ return ret;
+ }
+
+ /*
+ * The kernel's pm_qos_resume_latency_us sysfs interface (see
+ * @PM_QOS_SYSFILE_RESUME_LATENCY_US) interprets the input string
+ * as follows:
+ * 1> "n/a" means the resume latency is 0.
+ * 2> "0" means there is no resume latency constraint.
+ * 3> any other value is the actual resume latency to be set.
+ */
+ if (latency == 0)
+ snprintf(buf, sizeof(buf), "%s", "n/a");
+ else if (latency == RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT)
+ snprintf(buf, sizeof(buf), "%u", 0);
+ else
+ snprintf(buf, sizeof(buf), "%u", latency);
+
+ ret = write_core_sysfs_s(f, buf);
+ if (ret != 0)
+ POWER_LOG(ERR, "Failed to write "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+
+ fclose(f);
+
+ return ret;
+}
+
+int
+rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id)
+{
+ char buf[PM_QOS_CPU_RESUME_LATENCY_BUF_LEN];
+ int latency = -1;
+ FILE *f;
+ int ret;
+
+ if (!rte_lcore_is_enabled(lcore_id)) {
+ POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
+ return -EINVAL;
+ }
+
+ ret = open_core_sysfs_file(&f, "r", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ return ret;
+ }
+
+ ret = read_core_sysfs_s(f, buf, sizeof(buf));
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to read "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
+ goto out;
+ }
+
+ /*
+ * The kernel's pm_qos_resume_latency_us sysfs interface (see
+ * @PM_QOS_SYSFILE_RESUME_LATENCY_US) reports the output string
+ * as follows:
+ * 1> "n/a" means the resume latency is 0.
+ * 2> "0" means there is no resume latency constraint.
+ * 3> any other value is the actual resume latency in use.
+ */
+ if (strcmp(buf, "n/a") == 0)
+ latency = 0;
+ else {
+ latency = strtoul(buf, NULL, 10);
+ latency = latency == 0 ? RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT : latency;
+ }
+
+out:
+ fclose(f);
+
+ return latency != -1 ? latency : ret;
+}
diff --git a/lib/power/rte_power_qos.h b/lib/power/rte_power_qos.h
new file mode 100644
index 0000000000..990c488373
--- /dev/null
+++ b/lib/power/rte_power_qos.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#ifndef RTE_POWER_QOS_H
+#define RTE_POWER_QOS_H
+
+#include <stdint.h>
+
+#include <rte_compat.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * @file rte_power_qos.h
+ *
+ * PM QoS API.
+ *
+ * The CPU-wide resume latency limit has a positive impact on this CPU's idle
+ * state selection in each cpuidle governor.
+ * Please see the per-CPU PM QoS description at the following link:
+ * https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
+ *
+ * The deeper the idle state, the lower the power consumption, but the
+ * longer the resume time. Some services are delay sensitive and expect a very
+ * low resume time, like interrupt packet receiving mode.
+ *
+ * In these cases, the per-CPU PM QoS API can be used to control this CPU's idle
+ * state selection and, by setting a strict resume latency (zero value), restrict
+ * it to the shallowest idle state to lower the delay after sleep.
+ */
+
+#define RTE_POWER_QOS_STRICT_LATENCY_VALUE 0
+#define RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT ((int)(UINT32_MAX >> 1))
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * @param lcore_id
+ * target logical core id
+ *
+ * @param latency
+ * The latency in microseconds; it should be greater than or equal to zero.
+ *
+ * @return
+ * 0 on success. Otherwise negative value is returned.
+ */
+__rte_experimental
+int rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Get the current resume latency of this logical core.
+ * The default value in the kernel is @see RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT
+ * if it has not been set.
+ *
+ * @return
+ * Negative value on failure.
+ * >= 0 means the actual resume latency limit on this core.
+ */
+__rte_experimental
+int rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* RTE_POWER_QOS_H */
diff --git a/lib/power/version.map b/lib/power/version.map
index c9a226614e..08f178a39d 100644
--- a/lib/power/version.map
+++ b/lib/power/version.map
@@ -51,4 +51,8 @@ EXPERIMENTAL {
rte_power_set_uncore_env;
rte_power_uncore_freqs;
rte_power_unset_uncore_env;
+
+ # added in 24.11
+ rte_power_qos_get_cpu_resume_latency;
+ rte_power_qos_set_cpu_resume_latency;
};
--
2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v10 1/2] power: introduce PM QoS API on CPU wide
2024-09-12 2:38 ` [PATCH v10 1/2] power: introduce PM QoS API on CPU wide Huisong Li
@ 2024-10-13 1:10 ` Stephen Hemminger
2024-10-14 12:19 ` lihuisong (C)
2024-10-14 8:29 ` Konstantin Ananyev
1 sibling, 1 reply; 114+ messages in thread
From: Stephen Hemminger @ 2024-10-13 1:10 UTC (permalink / raw)
To: Huisong Li
Cc: dev, mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, david.marchand, fengchengwen, liuyonglong
On Thu, 12 Sep 2024 10:38:11 +0800
Huisong Li <lihuisong@huawei.com> wrote:
> +
> +PM QoS
> +------
> +
> +The deeper the idle state, the lower the power consumption, but the longer
> +the resume time. Some service are delay sensitive and very except the low
> +resume time, like interrupt packet receiving mode.
> +
> +And the "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
> +interface is used to set and get the resume latency limit on the cpuX for
> +userspace. Each cpuidle governor in Linux select which idle state to enter
> +based on this CPU resume latency in their idle task.
> +
> +The per-CPU PM QoS API can be used to set and get the CPU resume latency based
> +on this sysfs.
> +
> +The ``rte_power_qos_set_cpu_resume_latency()`` function can control the CPU's
> +idle state selection in Linux and limit just to enter the shallowest idle state
> +to low the delay of resuming service after sleeping by setting strict resume
> +latency (zero value).
> +
> +The ``rte_power_qos_get_cpu_resume_latency()`` function can get the resume
> +latency on specified CPU.
> +
These paragraphs need some editing help. The wording is awkward,
it uses passive voice, and does not seem directed at a user audience.
If you need help, a writer or AI might help clarify.
It also ends up in the section associated with Intel UnCore.
It would be better after the Turbo Boost section.
> + ret = open_core_sysfs_file(&f, "w", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
> + if (ret != 0) {
> + POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
> + return ret;
> + }
The function open_core_sysfs_file() should have been written to return FILE *
and then it could have the same attributes as fopen().
The message should include the error reason.
if (open_core_sysfs_file(&f, "w", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id)) {
POWER_LOG(ERR, "Failed to open " PM_QOS_SYSFILE_RESUME_LATENCY_US ": %s",
lcore_id, strerror(errno));
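
To make the two suggestions above concrete, here is a rough sketch of a
helper that returns FILE * with fopen() semantics (NULL on failure, errno
set) and of a caller that reports the reason. This is only an illustration
under assumptions: sysfs_fopen(), write_resume_latency_string() and
RESUME_LATENCY_PATH are hypothetical names, not the existing
open_core_sysfs_file() from power_common.h, and fprintf() stands in for
POWER_LOG().

```c
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define RESUME_LATENCY_PATH \
	"/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us"

/*
 * Hypothetical helper: build the path from a printf-style format and open it.
 * Like fopen(), it returns NULL on failure and leaves the reason in errno.
 */
static FILE *
sysfs_fopen(const char *mode, const char *format, ...)
{
	char path[256];
	va_list ap;
	int n;

	va_start(ap, format);
	n = vsnprintf(path, sizeof(path), format, ap);
	va_end(ap);

	if (n < 0 || (size_t)n >= sizeof(path)) {
		errno = ENAMETOOLONG;
		return NULL;
	}

	return fopen(path, mode);
}

/* Hypothetical caller: write one string and report why a failure happened. */
static int
write_resume_latency_string(unsigned int cpu, const char *value)
{
	FILE *f = sysfs_fopen("w", RESUME_LATENCY_PATH, cpu);
	int ret = 0;

	if (f == NULL) {
		ret = -errno;	/* save errno before any other library call */
		fprintf(stderr, "Failed to open " RESUME_LATENCY_PATH ": %s\n",
			cpu, strerror(-ret));
		return ret;
	}

	if (fputs(value, f) == EOF)
		ret = -EIO;
	/* With buffered stdio a sysfs write error may only show up here. */
	if (fclose(f) != 0 && ret == 0)
		ret = -errno;

	return ret;
}
```

Saving errno into ret before logging keeps the returned code correct even if
the logging call itself changes errno.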
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v10 1/2] power: introduce PM QoS API on CPU wide
2024-10-13 1:10 ` Stephen Hemminger
@ 2024-10-14 12:19 ` lihuisong (C)
2024-10-15 9:41 ` lihuisong (C)
0 siblings, 1 reply; 114+ messages in thread
From: lihuisong (C) @ 2024-10-14 12:19 UTC (permalink / raw)
To: Stephen Hemminger
Cc: dev, mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, david.marchand, fengchengwen, liuyonglong
Hi Stephen,
On 2024/10/13 9:10, Stephen Hemminger wrote:
> On Thu, 12 Sep 2024 10:38:11 +0800
> Huisong Li <lihuisong@huawei.com> wrote:
>
>> +
>> +PM QoS
>> +------
>> +
>> +The deeper the idle state, the lower the power consumption, but the longer
>> +the resume time. Some service are delay sensitive and very except the low
>> +resume time, like interrupt packet receiving mode.
>> +
>> +And the "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
>> +interface is used to set and get the resume latency limit on the cpuX for
>> +userspace. Each cpuidle governor in Linux select which idle state to enter
>> +based on this CPU resume latency in their idle task.
>> +
>> +The per-CPU PM QoS API can be used to set and get the CPU resume latency based
>> +on this sysfs.
>> +
>> +The ``rte_power_qos_set_cpu_resume_latency()`` function can control the CPU's
>> +idle state selection in Linux and limit just to enter the shallowest idle state
>> +to low the delay of resuming service after sleeping by setting strict resume
>> +latency (zero value).
>> +
>> +The ``rte_power_qos_get_cpu_resume_latency()`` function can get the resume
>> +latency on specified CPU.
>> +
> These paragraphs need some editing help. The wording is awkward,
> it uses passive voice, and does not seemed directed at a user audience.
> If you need help, a writer or AI might help clarify.
How about the following description:
-->
The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
interface is used to set and get the resume latency limit on the cpuX for
userspace. Each cpuidle governor in Linux selects which idle state to enter
based on this CPU resume latency in its idle task.
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay sensitive and expect a very low
resume time, like interrupt packet receiving mode.
The ``rte_power_qos_set_cpu_resume_latency()`` and
``rte_power_qos_get_cpu_resume_latency()``
can set and get the CPU resume latency based on this per-CPU sysfs.
The ``rte_power_qos_set_cpu_resume_latency()`` can control the CPU's
idle state selection and, by setting a strict resume latency (zero value),
restrict the CPU to the shallowest idle state to lower the delay of
resuming service and so get better performance.
>
> It also ends up in the section associated with Intel UnCore.
> It would be better after the Turbo Boost section.
Ack
>
>
>
>> + ret = open_core_sysfs_file(&f, "w", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
>> + if (ret != 0) {
>> + POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
>> + return ret;
>> + }
> The function open_core_sysfs_file() should have been written to return FILE *
> and then it could have same attributes as fopen().
Do you mean we should save this handle and use it directly to get or
set this resume latency?
If I understand correctly, we cannot do it like this, because the QoS driver in
Linux uses "noop_llseek()" as its .llseek API.
>
> The message should include the error reason.
>
> if (open_core_sysfs_file(&f, "w", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id)) {
> POWER_LOG(ERR, "Failed to open " PM_QOS_SYSFILE_RESUME_LATENCY_US ": %s",
> lcore_id, strerror(errno));
Good idea. Thanks.
>
> .
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v10 1/2] power: introduce PM QoS API on CPU wide
2024-10-14 12:19 ` lihuisong (C)
@ 2024-10-15 9:41 ` lihuisong (C)
2024-10-15 15:45 ` Stephen Hemminger
0 siblings, 1 reply; 114+ messages in thread
From: lihuisong (C) @ 2024-10-15 9:41 UTC (permalink / raw)
To: Stephen Hemminger
Cc: dev, mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, david.marchand, fengchengwen, liuyonglong
Hi Stephen,
Could you take a look at this reply so that I can send out the next version ASAP?
Thanks.😁
/Huisong
On 2024/10/14 20:19, lihuisong (C) wrote:
> Hi Stephen,
>
>
> On 2024/10/13 9:10, Stephen Hemminger wrote:
>> On Thu, 12 Sep 2024 10:38:11 +0800
>> Huisong Li <lihuisong@huawei.com> wrote:
>>
>>> +
>>> +PM QoS
>>> +------
>>> +
>>> +The deeper the idle state, the lower the power consumption, but the
>>> longer
>>> +the resume time. Some service are delay sensitive and very except
>>> the low
>>> +resume time, like interrupt packet receiving mode.
>>> +
>>> +And the
>>> "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
>>> +interface is used to set and get the resume latency limit on the
>>> cpuX for
>>> +userspace. Each cpuidle governor in Linux select which idle state
>>> to enter
>>> +based on this CPU resume latency in their idle task.
>>> +
>>> +The per-CPU PM QoS API can be used to set and get the CPU resume
>>> latency based
>>> +on this sysfs.
>>> +
>>> +The ``rte_power_qos_set_cpu_resume_latency()`` function can control
>>> the CPU's
>>> +idle state selection in Linux and limit just to enter the
>>> shallowest idle state
>>> +to low the delay of resuming service after sleeping by setting
>>> strict resume
>>> +latency (zero value).
>>> +
>>> +The ``rte_power_qos_get_cpu_resume_latency()`` function can get the
>>> resume
>>> +latency on specified CPU.
>>> +
>> These paragraphs need some editing help. The wording is awkward,
>> it uses passive voice, and does not seemed directed at a user audience.
>> If you need help, a writer or AI might help clarify.
> How about the following description:
> -->
> The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
> interface is used to set and get the resume latency limit on the cpuX for
> userspace. Each cpuidle governor in Linux select which idle state to
> enter
> based on this CPU resume latency in their idle task.
>
> The deeper the idle state, the lower the power consumption, but the
> longer
> the resume time. Some service are delay sensitive and very except the low
> resume time, like interrupt packet receiving mode.
>
> The ``rte_power_qos_set_cpu_resume_latency()`` and
> ``rte_power_qos_get_cpu_resume_latency()``
> can set and get the CPU resume latency based on this per-CPU sysfs.
>
> The ``rte_power_qos_set_cpu_resume_latency()`` can control the CPU's
> idle state selection and limit to enter the shallowest idle state
> to low the delay of resuming service by setting strict resume
> latency (zero value) so as to get better performance.
>>
>> It also ends up in the section associated with Intel UnCore.
>> It would be better after the Turbo Boost section.
> Ack
>>
>>
>>
>>> + ret = open_core_sysfs_file(&f, "w",
>>> PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
>>> + if (ret != 0) {
>>> + POWER_LOG(ERR, "Failed to open
>>> "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
>>> + return ret;
>>> + }
>> The function open_core_sysfs_file() should have been written to
>> return FILE *
>> and then it could have same attributes as fopen().
> Do you mean we should save this handle for directly using it to get or
> set this resume latency?
> If it is I understand, we cannot do it like this. Because qos driver
> in Linux use the "noop_llseek()" as the .llseek API.
>>
>> The message should include the error reason.
>>
>> if (open_core_sysfs_file(&f, "w",
>> PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id)) {
>> POWER_LOG(ERR, "Failed to open "
>> PM_QOS_SYSFILE_RESUME_LATENCY_US ": %s",
>> lcore_id, strerror(errno));
> Good idea. Thanks.
>
>>
>> .
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v10 1/2] power: introduce PM QoS API on CPU wide
2024-10-15 9:41 ` lihuisong (C)
@ 2024-10-15 15:45 ` Stephen Hemminger
2024-10-17 2:11 ` lihuisong (C)
2024-10-22 3:10 ` lihuisong (C)
0 siblings, 2 replies; 114+ messages in thread
From: Stephen Hemminger @ 2024-10-15 15:45 UTC (permalink / raw)
To: lihuisong (C)
Cc: dev, mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, david.marchand, fengchengwen, liuyonglong
On Tue, 15 Oct 2024 17:41:39 +0800
"lihuisong (C)" <lihuisong@huawei.com> wrote:
> Hi Stephen,
>
> Can you take a look at this reply so as to send out the next version ASAP?
> Thanks.😁
>
> /Huisong
> On 2024/10/14 20:19, lihuisong (C) wrote:
The biggest issue is that an lcore is not the same as a CPU as far as the kernel is concerned.
DPDK supports mapping an lcore to a cpuset, and that is not necessarily the same one-to-one mapping
as the values in sysfs. See the EAL documentation.
For example, "--lcores='1,2@(5-7),(3-5)@(0,2),(0,6),7-8'" which means start 9 EAL thread;
lcore 0 runs on cpuset 0x41 (cpu 0,6);
lcore 1 runs on cpuset 0x2 (cpu 1);
lcore 2 runs on cpuset 0xe0 (cpu 5,6,7);
lcore 3,4,5 runs on cpuset 0x5 (cpu 0,2);
lcore 6 runs on cpuset 0x41 (cpu 0,6);
lcore 7 runs on cpuset 0x80 (cpu 7);
lcore 8 runs on cpuset 0x100 (cpu 8).
This problem already exists in the power library, and this new API still has it.
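
As an illustration of the mapping above, here is a small sketch that walks a
cpuset given as a bitmask (the same 0x41 / 0xe0 notation as in the example)
and calls a function once per physical CPU. It is an assumption-laden
sketch, not DPDK code: for_each_cpu_in_mask(), per_cpu_fn, struct cpu_list
and collect_cpu() are hypothetical names, the mask is limited to 64 CPUs,
and a real implementation would take the set from the lcore's cpuset rather
than from a plain integer.

```c
#include <stdint.h>

/* Hypothetical callback: apply one per-CPU operation, return 0 on success. */
typedef int (*per_cpu_fn)(unsigned int cpu, void *arg);

/*
 * Call fn once for every CPU whose bit is set in mask, lowest CPU first.
 * Stops at the first failure and returns that error code.
 */
static int
for_each_cpu_in_mask(uint64_t mask, per_cpu_fn fn, void *arg)
{
	unsigned int cpu;

	for (cpu = 0; cpu < 64; cpu++) {
		int ret;

		if ((mask & (UINT64_C(1) << cpu)) == 0)
			continue;

		ret = fn(cpu, arg);
		if (ret != 0)
			return ret;
	}

	return 0;
}

/* Example callback state: the list of CPU ids that were visited. */
struct cpu_list {
	unsigned int ids[64];
	unsigned int n;
};

/* Example callback: record the CPU id instead of touching sysfs. */
static int
collect_cpu(unsigned int cpu, void *arg)
{
	struct cpu_list *list = arg;

	list->ids[list->n++] = cpu;
	return 0;
}
```

With mask 0x41 the callback runs for CPU 0 and CPU 6, which are the sysfs
"cpuX" entries that a per-CPU setting for lcore 0 would have to cover.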
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v10 1/2] power: introduce PM QoS API on CPU wide
2024-10-15 15:45 ` Stephen Hemminger
@ 2024-10-17 2:11 ` lihuisong (C)
2024-10-17 3:20 ` Stephen Hemminger
2024-10-22 3:10 ` lihuisong (C)
1 sibling, 1 reply; 114+ messages in thread
From: lihuisong (C) @ 2024-10-17 2:11 UTC (permalink / raw)
To: Stephen Hemminger
Cc: dev, mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, david.marchand, fengchengwen, liuyonglong,
konstantin.ananyev
Hi Stephen,
On 2024/10/15 23:45, Stephen Hemminger wrote:
> On Tue, 15 Oct 2024 17:41:39 +0800
> "lihuisong (C)" <lihuisong@huawei.com> wrote:
>
>> Hi Stephen,
>>
>> Can you take a look at this reply so as to send out the next version ASAP?
>> Thanks.😁
>>
>> /Huisong
>> On 2024/10/14 20:19, lihuisong (C) wrote:
> The biggest issue is that lcore is not the same as cpu as far as kernel is concerned.
> DPDK support mapping lcore to a cpuset, and that is not necessarily the same one-to-one mapping
> as values in sysfs. In documentation of eal see.
Yes, you are right.
>
> For example, "--lcores='1,2@(5-7),(3-5)@(0,2),(0,6),7-8'" which means start 9 EAL thread;
> lcore 0 runs on cpuset 0x41 (cpu 0,6);
> lcore 1 runs on cpuset 0x2 (cpu 1);
> lcore 2 runs on cpuset 0xe0 (cpu 5,6,7);
> lcore 3,4,5 runs on cpuset 0x5 (cpu 0,2);
> lcore 6 runs on cpuset 0x41 (cpu 0,6);
> lcore 7 runs on cpuset 0x80 (cpu 7);
> lcore 8 runs on cpuset 0x100 (cpu 8).
>
> This problem existed in power library and this new API still has it.
How about using lcore_config[lcore_id].cpuset to get the real cpu_id?
For the case where the application uses '--lcores', we simply apply the
operation in the power lib to all CPUs mapped in the lcore's cpuset.
If that is OK, I will fix it for the entire power library and this new API.
>
> .
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v10 1/2] power: introduce PM QoS API on CPU wide
2024-10-17 2:11 ` lihuisong (C)
@ 2024-10-17 3:20 ` Stephen Hemminger
2024-10-17 8:37 ` lihuisong (C)
0 siblings, 1 reply; 114+ messages in thread
From: Stephen Hemminger @ 2024-10-17 3:20 UTC (permalink / raw)
To: lihuisong (C)
Cc: dev, mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, david.marchand, fengchengwen, liuyonglong,
konstantin.ananyev
On Thu, 17 Oct 2024 10:11:13 +0800
"lihuisong (C)" <lihuisong@huawei.com> wrote:
> Hi Stephen,
>
> On 2024/10/15 23:45, Stephen Hemminger wrote:
> > On Tue, 15 Oct 2024 17:41:39 +0800
> > "lihuisong (C)" <lihuisong@huawei.com> wrote:
> >
> >> Hi Stephen,
> >>
> >> Can you take a look at this reply so as to send out the next version ASAP?
> >> Thanks.😁
> >>
> >> /Huisong
> >> On 2024/10/14 20:19, lihuisong (C) wrote:
> > The biggest issue is that lcore is not the same as cpu as far as kernel is concerned.
> > DPDK support mapping lcore to a cpuset, and that is not necessarily the same one-to-one mapping
> > as values in sysfs. In documentation of eal see.
> Yes, you are right.
> >
> > For example, "--lcores='1,2@(5-7),(3-5)@(0,2),(0,6),7-8'" which means start 9 EAL thread;
> > lcore 0 runs on cpuset 0x41 (cpu 0,6);
> > lcore 1 runs on cpuset 0x2 (cpu 1);
> > lcore 2 runs on cpuset 0xe0 (cpu 5,6,7);
> > lcore 3,4,5 runs on cpuset 0x5 (cpu 0,2);
> > lcore 6 runs on cpuset 0x41 (cpu 0,6);
> > lcore 7 runs on cpuset 0x80 (cpu 7);
> > lcore 8 runs on cpuset 0x100 (cpu 8).
> >
> > This problem existed in power library and this new API still has it.
> How about use lcore_config[lcore_id].cpuset to get the real cpu_id?
> And for this case that application use '--lcores', we simply do some
> operations in power lib for all mapping CPUs in lcore's cpuset.
> If it is ok, I will fix it for the entire power library and this new API.
> >
Using lcore_config is the right direction, but the cpuset may have more than
one CPU, so the code needs to iterate over those CPUs. It is probably safe to ignore
the case where the user misconfigures two lcores to use an overlapping set of CPUs,
like the example in the doc.
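The iteration suggested here could be sketched roughly as follows. This is only an illustration under stated assumptions: a 64-bit mask stands in for the lcore's cpuset, and for_each_cpu_in_mask, cpu_path_op_t and count_cpu are hypothetical names, not DPDK APIs. The sysfs path is the one used by the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define RESUME_LATENCY_PATH_FMT \
	"/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us"

/* Hypothetical per-CPU operation; it receives the sysfs path for one CPU. */
typedef int (*cpu_path_op_t)(unsigned int cpu_id, const char *path, void *arg);

/*
 * Hypothetical helper (not a DPDK API): call op once for every CPU whose bit
 * is set in mask. Stops at the first failure and returns its error code.
 */
static int
for_each_cpu_in_mask(uint64_t mask, cpu_path_op_t op, void *arg)
{
	char path[128];
	unsigned int cpu;
	int ret;

	assert(op != NULL);
	for (cpu = 0; cpu < 64; cpu++) {
		if (!(mask & (UINT64_C(1) << cpu)))
			continue;
		snprintf(path, sizeof(path), RESUME_LATENCY_PATH_FMT, cpu);
		ret = op(cpu, path, arg);
		if (ret != 0)
			return ret;
	}
	return 0;
}

/* Example operation: only counts the CPUs it is called for. */
static int
count_cpu(unsigned int cpu_id, const char *path, void *arg)
{
	(void)cpu_id;
	(void)path;
	(*(unsigned int *)arg)++;
	return 0;
}
```

With overlapping cpusets, the same CPU would simply be visited once per lcore that maps to it, so the last write wins; that is the misconfiguration case that can be left to documentation.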
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v10 1/2] power: introduce PM QoS API on CPU wide
2024-10-17 3:20 ` Stephen Hemminger
@ 2024-10-17 8:37 ` lihuisong (C)
0 siblings, 0 replies; 114+ messages in thread
From: lihuisong (C) @ 2024-10-17 8:37 UTC (permalink / raw)
To: Stephen Hemminger
Cc: dev, mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, david.marchand, fengchengwen, liuyonglong,
konstantin.ananyev
On 2024/10/17 11:20, Stephen Hemminger wrote:
> On Thu, 17 Oct 2024 10:11:13 +0800
> "lihuisong (C)" <lihuisong@huawei.com> wrote:
>
>> Hi Stephen,
>>
>> On 2024/10/15 23:45, Stephen Hemminger wrote:
>>> On Tue, 15 Oct 2024 17:41:39 +0800
>>> "lihuisong (C)" <lihuisong@huawei.com> wrote:
>>>
>>>> Hi Stephen,
>>>>
>>>> Can you take a look at this reply so as to send out the next version ASAP?
>>>> Thanks.😁
>>>>
>>>> /Huisong
>>>> On 2024/10/14 20:19, lihuisong (C) wrote:
>>> The biggest issue is that lcore is not the same as cpu as far as kernel is concerned.
>>> DPDK support mapping lcore to a cpuset, and that is not necessarily the same one-to-one mapping
>>> as values in sysfs. In documentation of eal see.
>> Yes, you are right.
>>> For example, "--lcores='1,2@(5-7),(3-5)@(0,2),(0,6),7-8'" which means start 9 EAL thread;
>>> lcore 0 runs on cpuset 0x41 (cpu 0,6);
>>> lcore 1 runs on cpuset 0x2 (cpu 1);
>>> lcore 2 runs on cpuset 0xe0 (cpu 5,6,7);
>>> lcore 3,4,5 runs on cpuset 0x5 (cpu 0,2);
>>> lcore 6 runs on cpuset 0x41 (cpu 0,6);
>>> lcore 7 runs on cpuset 0x80 (cpu 7);
>>> lcore 8 runs on cpuset 0x100 (cpu 8).
>>>
>>> This problem existed in power library and this new API still has it.
>> How about use lcore_config[lcore_id].cpuset to get the real cpu_id?
>> And for this case that application use '--lcores', we simply do some
>> operations in power lib for all mapping CPUs in lcore's cpuset.
>> If it is ok, I will fix it for the entire power library and this new API.
> Using the lcore_config is the right direction but the cpuset may have more than
> one cpu, so the code needs to iterate over those cpus. Probably safe to ignore problems
> the case where user misconfigures to have two lcores using an overlapping set of cpu's
> like the example in the doc.
> .
Yes, so we don't need to handle this overlapping-set case.
It is a usage issue, and we just need to clearly document
its impact in the doc, OK?
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v10 1/2] power: introduce PM QoS API on CPU wide
2024-10-15 15:45 ` Stephen Hemminger
2024-10-17 2:11 ` lihuisong (C)
@ 2024-10-22 3:10 ` lihuisong (C)
1 sibling, 0 replies; 114+ messages in thread
From: lihuisong (C) @ 2024-10-22 3:10 UTC (permalink / raw)
To: Stephen Hemminger
Cc: dev, mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, david.marchand, fengchengwen, liuyonglong
Hi Stephen,
The fix for this issue has been sent out.
Please have a look at v11, thanks.
/Huisong
On 2024/10/15 23:45, Stephen Hemminger wrote:
> On Tue, 15 Oct 2024 17:41:39 +0800
> "lihuisong (C)" <lihuisong@huawei.com> wrote:
>
>> Hi Stephen,
>>
>> Can you take a look at this reply so as to send out the next version ASAP?
>> Thanks.😁
>>
>> /Huisong
>> On 2024/10/14 20:19, lihuisong (C) wrote:
> The biggest issue is that lcore is not the same as cpu as far as kernel is concerned.
> DPDK support mapping lcore to a cpuset, and that is not necessarily the same one-to-one mapping
> as values in sysfs. In documentation of eal see.
>
>
> For example, "--lcores='1,2@(5-7),(3-5)@(0,2),(0,6),7-8'" which means start 9 EAL thread;
> lcore 0 runs on cpuset 0x41 (cpu 0,6);
> lcore 1 runs on cpuset 0x2 (cpu 1);
> lcore 2 runs on cpuset 0xe0 (cpu 5,6,7);
> lcore 3,4,5 runs on cpuset 0x5 (cpu 0,2);
> lcore 6 runs on cpuset 0x41 (cpu 0,6);
> lcore 7 runs on cpuset 0x80 (cpu 7);
> lcore 8 runs on cpuset 0x100 (cpu 8).
>
> This problem existed in power library and this new API still has it.
>
> .
^ permalink raw reply [flat|nested] 114+ messages in thread
* RE: [PATCH v10 1/2] power: introduce PM QoS API on CPU wide
2024-09-12 2:38 ` [PATCH v10 1/2] power: introduce PM QoS API on CPU wide Huisong Li
2024-10-13 1:10 ` Stephen Hemminger
@ 2024-10-14 8:29 ` Konstantin Ananyev
2024-10-15 7:47 ` lihuisong (C)
1 sibling, 1 reply; 114+ messages in thread
From: Konstantin Ananyev @ 2024-10-14 8:29 UTC (permalink / raw)
To: lihuisong (C), dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, Fengchengwen,
liuyonglong, lihuisong (C)
> The deeper the idle state, the lower the power consumption, but the longer
> the resume time. Some service are delay sensitive and very except the low
> resume time, like interrupt packet receiving mode.
>
> And the "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
> interface is used to set and get the resume latency limit on the cpuX for
> userspace. Each cpuidle governor in Linux select which idle state to enter
> based on this CPU resume latency in their idle task.
>
> The per-CPU PM QoS API can be used to control this CPU's idle state
> selection and limit just enter the shallowest idle state to low the delay
> after sleep by setting strict resume latency (zero value).
>
> Signed-off-by: Huisong Li <lihuisong@huawei.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> ---
> doc/guides/prog_guide/power_man.rst | 24 ++++++
> doc/guides/rel_notes/release_24_11.rst | 5 ++
> lib/power/meson.build | 2 +
> lib/power/rte_power_qos.c | 111 +++++++++++++++++++++++++
> lib/power/rte_power_qos.h | 73 ++++++++++++++++
> lib/power/version.map | 4 +
> 6 files changed, 219 insertions(+)
> create mode 100644 lib/power/rte_power_qos.c
> create mode 100644 lib/power/rte_power_qos.h
>
> diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
> index f6674efe2d..faa32b4320 100644
> --- a/doc/guides/prog_guide/power_man.rst
> +++ b/doc/guides/prog_guide/power_man.rst
> @@ -249,6 +249,30 @@ Get Num Pkgs
> Get Num Dies
> Get the number of die's on a given package.
>
> +
> +PM QoS
> +------
> +
> +The deeper the idle state, the lower the power consumption, but the longer
> +the resume time. Some service are delay sensitive and very except the low
> +resume time, like interrupt packet receiving mode.
> +
> +And the "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
> +interface is used to set and get the resume latency limit on the cpuX for
> +userspace. Each cpuidle governor in Linux select which idle state to enter
> +based on this CPU resume latency in their idle task.
> +
> +The per-CPU PM QoS API can be used to set and get the CPU resume latency based
> +on this sysfs.
> +
> +The ``rte_power_qos_set_cpu_resume_latency()`` function can control the CPU's
> +idle state selection in Linux and limit just to enter the shallowest idle state
> +to low the delay of resuming service after sleeping by setting strict resume
> +latency (zero value).
> +
> +The ``rte_power_qos_get_cpu_resume_latency()`` function can get the resume
> +latency on specified CPU.
> +
> References
> ----------
>
> diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
> index 0ff70d9057..bd72d0a595 100644
> --- a/doc/guides/rel_notes/release_24_11.rst
> +++ b/doc/guides/rel_notes/release_24_11.rst
> @@ -55,6 +55,11 @@ New Features
> Also, make sure to start the actual text at the margin.
> =======================================================
>
> +* **Introduce per-CPU PM QoS interface.**
> +
> + * Add per-CPU PM QoS interface to low the delay after sleep by controlling
> + CPU idle state selection.
> +
>
> Removed Items
> -------------
> diff --git a/lib/power/meson.build b/lib/power/meson.build
> index b8426589b2..8222e178b0 100644
> --- a/lib/power/meson.build
> +++ b/lib/power/meson.build
> @@ -23,12 +23,14 @@ sources = files(
> 'rte_power.c',
> 'rte_power_uncore.c',
> 'rte_power_pmd_mgmt.c',
> + 'rte_power_qos.c',
> )
> headers = files(
> 'rte_power.h',
> 'rte_power_guest_channel.h',
> 'rte_power_pmd_mgmt.h',
> 'rte_power_uncore.h',
> + 'rte_power_qos.h',
> )
> if cc.has_argument('-Wno-cast-qual')
> cflags += '-Wno-cast-qual'
> diff --git a/lib/power/rte_power_qos.c b/lib/power/rte_power_qos.c
> new file mode 100644
> index 0000000000..8eb26cd41a
> --- /dev/null
> +++ b/lib/power/rte_power_qos.c
> @@ -0,0 +1,111 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2024 HiSilicon Limited
> + */
> +
> +#include <errno.h>
> +#include <stdlib.h>
> +#include <string.h>
> +
> +#include <rte_lcore.h>
> +#include <rte_log.h>
> +
> +#include "power_common.h"
> +#include "rte_power_qos.h"
> +
> +#define PM_QOS_SYSFILE_RESUME_LATENCY_US \
> + "/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us"
> +
> +#define PM_QOS_CPU_RESUME_LATENCY_BUF_LEN 32
> +
> +int
> +rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency)
> +{
> + char buf[PM_QOS_CPU_RESUME_LATENCY_BUF_LEN];
> + FILE *f;
> + int ret;
> +
> + if (!rte_lcore_is_enabled(lcore_id)) {
> + POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
> + return -EINVAL;
> + }
> +
> + if (latency < 0) {
> + POWER_LOG(ERR, "latency should be greater than and equal to 0");
> + return -EINVAL;
> + }
> +
> + ret = open_core_sysfs_file(&f, "w", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
This was already brought up by Morten:
lcore_id is not always equal to cpu_core_id (cpu affinity).
Looking through the power library, it is not specific to this particular patch,
but a sort of common limitation (bug?) in the rte_power lib.
> + if (ret != 0) {
> + POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
> + return ret;
> + }
> +
> + /*
> + * Based on the sysfs interface pm_qos_resume_latency_us under
> + * @PM_QOS_SYSFILE_RESUME_LATENCY_US directory in kernel, their meaning
> + * is as follows for different input string.
> + * 1> the resume latency is 0 if the input is "n/a".
> + * 2> the resume latency is no constraint if the input is "0".
> + * 3> the resume latency is the actual value to be set.
> + */
> + if (latency == 0)
> + snprintf(buf, sizeof(buf), "%s", "n/a");
> + else if (latency == RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT)
> + snprintf(buf, sizeof(buf), "%u", 0);
> + else
> + snprintf(buf, sizeof(buf), "%u", latency);
> +
> + ret = write_core_sysfs_s(f, buf);
> + if (ret != 0)
> + POWER_LOG(ERR, "Failed to write "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
> +
> + fclose(f);
> +
> + return ret;
> +}
> +
> +int
> +rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id)
> +{
> + char buf[PM_QOS_CPU_RESUME_LATENCY_BUF_LEN];
> + int latency = -1;
> + FILE *f;
> + int ret;
> +
> + if (!rte_lcore_is_enabled(lcore_id)) {
> + POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
> + return -EINVAL;
> + }
> +
> + ret = open_core_sysfs_file(&f, "r", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
> + if (ret != 0) {
> + POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
> + return ret;
> + }
> +
> + ret = read_core_sysfs_s(f, buf, sizeof(buf));
> + if (ret != 0) {
> + POWER_LOG(ERR, "Failed to read "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
> + goto out;
> + }
> +
> + /*
> + * Based on the sysfs interface pm_qos_resume_latency_us under
> + * @PM_QOS_SYSFILE_RESUME_LATENCY_US directory in kernel, their meaning
> + * is as follows for different output string.
> + * 1> the resume latency is 0 if the output is "n/a".
> + * 2> the resume latency is no constraint if the output is "0".
> + * 3> the resume latency is the actual value in used for other string.
> + */
> + if (strcmp(buf, "n/a") == 0)
> + latency = 0;
> + else {
> + latency = strtoul(buf, NULL, 10);
> + latency = latency == 0 ? RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT : latency;
> + }
> +
> +out:
> + fclose(f);
> +
> + return latency != -1 ? latency : ret;
> +}
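The value translation performed by the two functions above can be isolated into a pair of pure helpers, which makes the "n/a" versus "0" convention easier to check. This is a minimal sketch, not the patch's code: latency_to_sysfs and sysfs_to_latency are hypothetical names, NO_CONSTRAINT stands in for RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT, the string meanings are taken from the patch's own comments, and the input string is assumed to have its trailing newline already stripped.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT in the patch. */
#define NO_CONSTRAINT ((int)(UINT32_MAX >> 1))

/* API value -> string to write to sysfs. Returns 0, or -1 on bad input. */
static int
latency_to_sysfs(int latency, char *buf, size_t len)
{
	assert(buf != NULL);
	if (latency < 0)
		return -1;
	if (latency == 0)
		snprintf(buf, len, "n/a");	/* strict: zero resume latency */
	else if (latency == NO_CONSTRAINT)
		snprintf(buf, len, "0");	/* "0" means no constraint */
	else
		snprintf(buf, len, "%d", latency);
	return 0;
}

/* String read from sysfs -> API value. Returns -1 if it cannot be parsed. */
static int
sysfs_to_latency(const char *buf)
{
	char *end;
	unsigned long val;

	assert(buf != NULL);
	if (strcmp(buf, "n/a") == 0)
		return 0;
	val = strtoul(buf, &end, 10);
	if (end == buf || val > (unsigned long)NO_CONSTRAINT)
		return -1;
	return val == 0 ? NO_CONSTRAINT : (int)val;
}
```

Unlike the patch, this sketch rejects strings that strtoul() cannot parse instead of mapping them to "no constraint"; whether that stricter behaviour is wanted is a design choice for the author.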
> diff --git a/lib/power/rte_power_qos.h b/lib/power/rte_power_qos.h
> new file mode 100644
> index 0000000000..990c488373
> --- /dev/null
> +++ b/lib/power/rte_power_qos.h
> @@ -0,0 +1,73 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2024 HiSilicon Limited
> + */
> +
> +#ifndef RTE_POWER_QOS_H
> +#define RTE_POWER_QOS_H
> +
> +#include <stdint.h>
> +
> +#include <rte_compat.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/**
> + * @file rte_power_qos.h
> + *
> + * PM QoS API.
> + *
> + * The CPU-wide resume latency limit has a positive impact on this CPU's idle
> + * state selection in each cpuidle governor.
> + * Please see the PM QoS on CPU wide in the following link:
> + * https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
> + *
> + * The deeper the idle state, the lower the power consumption, but the
> + * longer the resume time. Some service are delay sensitive and very except the
> + * low resume time, like interrupt packet receiving mode.
> + *
> + * In these case, per-CPU PM QoS API can be used to control this CPU's idle
> + * state selection and limit just enter the shallowest idle state to low the
> + * delay after sleep by setting strict resume latency (zero value).
> + */
> +
> +#define RTE_POWER_QOS_STRICT_LATENCY_VALUE 0
> +#define RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT ((int)(UINT32_MAX >> 1))
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * @param lcore_id
> + * target logical core id
> + *
> + * @param latency
> + * The latency should be greater than and equal to zero in microseconds unit.
> + *
> + * @return
> + * 0 on success. Otherwise negative value is returned.
> + */
> +__rte_experimental
> +int rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Get the current resume latency of this logical core.
> + * The default value in kernel is @see RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT
> + * if don't set it.
> + *
> + * @return
> + * Negative value on failure.
> + * >= 0 means the actual resume latency limit on this core.
> + */
> +__rte_experimental
> +int rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* RTE_POWER_QOS_H */
> diff --git a/lib/power/version.map b/lib/power/version.map
> index c9a226614e..08f178a39d 100644
> --- a/lib/power/version.map
> +++ b/lib/power/version.map
> @@ -51,4 +51,8 @@ EXPERIMENTAL {
> rte_power_set_uncore_env;
> rte_power_uncore_freqs;
> rte_power_unset_uncore_env;
> +
> + # added in 24.11
> + rte_power_qos_get_cpu_resume_latency;
> + rte_power_qos_set_cpu_resume_latency;
> };
> --
> 2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v10 1/2] power: introduce PM QoS API on CPU wide
2024-10-14 8:29 ` Konstantin Ananyev
@ 2024-10-15 7:47 ` lihuisong (C)
0 siblings, 0 replies; 114+ messages in thread
From: lihuisong (C) @ 2024-10-15 7:47 UTC (permalink / raw)
To: Konstantin Ananyev, dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, Fengchengwen,
liuyonglong
Hi Konstantin Ananyev,
Thanks for your review.
On 2024/10/14 16:29, Konstantin Ananyev wrote:
>> The deeper the idle state, the lower the power consumption, but the longer
>> the resume time. Some service are delay sensitive and very except the low
>> resume time, like interrupt packet receiving mode.
>>
>> And the "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
>> interface is used to set and get the resume latency limit on the cpuX for
>> userspace. Each cpuidle governor in Linux select which idle state to enter
>> based on this CPU resume latency in their idle task.
>>
>> The per-CPU PM QoS API can be used to control this CPU's idle state
>> selection and limit just enter the shallowest idle state to low the delay
>> after sleep by setting strict resume latency (zero value).
>>
>> Signed-off-by: Huisong Li <lihuisong@huawei.com>
>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>> ---
>> doc/guides/prog_guide/power_man.rst | 24 ++++++
>> doc/guides/rel_notes/release_24_11.rst | 5 ++
>> lib/power/meson.build | 2 +
>> lib/power/rte_power_qos.c | 111 +++++++++++++++++++++++++
>> lib/power/rte_power_qos.h | 73 ++++++++++++++++
>> lib/power/version.map | 4 +
>> 6 files changed, 219 insertions(+)
>> create mode 100644 lib/power/rte_power_qos.c
>> create mode 100644 lib/power/rte_power_qos.h
>>
>> diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
>> index f6674efe2d..faa32b4320 100644
>> --- a/doc/guides/prog_guide/power_man.rst
>> +++ b/doc/guides/prog_guide/power_man.rst
>> @@ -249,6 +249,30 @@ Get Num Pkgs
>> Get Num Dies
>> Get the number of die's on a given package.
>>
>> +
>> +PM QoS
>> +------
>> +
>> +The deeper the idle state, the lower the power consumption, but the longer
>> +the resume time. Some service are delay sensitive and very except the low
>> +resume time, like interrupt packet receiving mode.
>> +
>> +And the "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
>> +interface is used to set and get the resume latency limit on the cpuX for
>> +userspace. Each cpuidle governor in Linux select which idle state to enter
>> +based on this CPU resume latency in their idle task.
>> +
>> +The per-CPU PM QoS API can be used to set and get the CPU resume latency based
>> +on this sysfs.
>> +
>> +The ``rte_power_qos_set_cpu_resume_latency()`` function can control the CPU's
>> +idle state selection in Linux and limit just to enter the shallowest idle state
>> +to low the delay of resuming service after sleeping by setting strict resume
>> +latency (zero value).
>> +
>> +The ``rte_power_qos_get_cpu_resume_latency()`` function can get the resume
>> +latency on specified CPU.
>> +
>> References
>> ----------
>>
>> diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
>> index 0ff70d9057..bd72d0a595 100644
>> --- a/doc/guides/rel_notes/release_24_11.rst
>> +++ b/doc/guides/rel_notes/release_24_11.rst
>> @@ -55,6 +55,11 @@ New Features
>> Also, make sure to start the actual text at the margin.
>> =======================================================
>>
>> +* **Introduce per-CPU PM QoS interface.**
>> +
>> + * Add per-CPU PM QoS interface to low the delay after sleep by controlling
>> + CPU idle state selection.
>> +
>>
>> Removed Items
>> -------------
>> diff --git a/lib/power/meson.build b/lib/power/meson.build
>> index b8426589b2..8222e178b0 100644
>> --- a/lib/power/meson.build
>> +++ b/lib/power/meson.build
>> @@ -23,12 +23,14 @@ sources = files(
>> 'rte_power.c',
>> 'rte_power_uncore.c',
>> 'rte_power_pmd_mgmt.c',
>> + 'rte_power_qos.c',
>> )
>> headers = files(
>> 'rte_power.h',
>> 'rte_power_guest_channel.h',
>> 'rte_power_pmd_mgmt.h',
>> 'rte_power_uncore.h',
>> + 'rte_power_qos.h',
>> )
>> if cc.has_argument('-Wno-cast-qual')
>> cflags += '-Wno-cast-qual'
>> diff --git a/lib/power/rte_power_qos.c b/lib/power/rte_power_qos.c
>> new file mode 100644
>> index 0000000000..8eb26cd41a
>> --- /dev/null
>> +++ b/lib/power/rte_power_qos.c
>> @@ -0,0 +1,111 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2024 HiSilicon Limited
>> + */
>> +
>> +#include <errno.h>
>> +#include <stdlib.h>
>> +#include <string.h>
>> +
>> +#include <rte_lcore.h>
>> +#include <rte_log.h>
>> +
>> +#include "power_common.h"
>> +#include "rte_power_qos.h"
>> +
>> +#define PM_QOS_SYSFILE_RESUME_LATENCY_US \
>> + "/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us"
>> +
>> +#define PM_QOS_CPU_RESUME_LATENCY_BUF_LEN 32
>> +
>> +int
>> +rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency)
>> +{
>> + char buf[PM_QOS_CPU_RESUME_LATENCY_BUF_LEN];
>> + FILE *f;
>> + int ret;
>> +
>> + if (!rte_lcore_is_enabled(lcore_id)) {
>> + POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
>> + return -EINVAL;
>> + }
>> +
>> + if (latency < 0) {
>> + POWER_LOG(ERR, "latency should be greater than and equal to 0");
>> + return -EINVAL;
>> + }
>> +
>> + ret = open_core_sysfs_file(&f, "w", PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
> That was already brought by Morten:
> lcore_id is not always equal to cpu_core_id (cpu affinity).
Yes, Morten also mentioned it.
I tried to answer him; please see our previous discussion, thanks.
I think it's ok😁
> Looking through power library it is not specific to that particular patch,
> but sort of common limitation (bug?) in rte_power lib.
Yes, it is very common in the power lib.
>
>
>> + if (ret != 0) {
>> + POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US, lcore_id);
>> + return ret;
>> + }
>> +
<...>
^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH v10 2/2] examples/l3fwd-power: add PM QoS configuration
2024-09-12 2:38 ` [PATCH v10 0/2] power: introduce PM QoS interface Huisong Li
2024-09-12 2:38 ` [PATCH v10 1/2] power: introduce PM QoS API on CPU wide Huisong Li
@ 2024-09-12 2:38 ` Huisong Li
2024-10-14 8:24 ` Konstantin Ananyev
2024-09-12 3:07 ` [PATCH v10 0/2] power: introduce PM QoS interface fengchengwen
2024-10-14 15:27 ` Stephen Hemminger
3 siblings, 1 reply; 114+ messages in thread
From: Huisong Li @ 2024-09-12 2:38 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, fengchengwen,
liuyonglong, lihuisong
Add PM QoS configuration to decrease the delay after sleep in case of
entering a deeper idle state.
Signed-off-by: Huisong Li <lihuisong@huawei.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
examples/l3fwd-power/main.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 2bb6b092c3..b0ddb54ee2 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
#include <rte_telemetry.h>
#include <rte_power_pmd_mgmt.h>
#include <rte_power_uncore.h>
+#include <rte_power_qos.h>
#include "perf_core.h"
#include "main.h"
@@ -2260,6 +2261,22 @@ init_power_library(void)
return -1;
}
}
+
+ RTE_LCORE_FOREACH(lcore_id) {
+ /*
+ * Set a strict latency limit on each enabled lcore so that
+ * the CPU only enters the shallowest idle state.
+ */
+ ret = rte_power_qos_set_cpu_resume_latency(lcore_id,
+ RTE_POWER_QOS_STRICT_LATENCY_VALUE);
+ if (ret != 0) {
+ RTE_LOG(ERR, L3FWD_POWER,
+ "Failed to set strict resume latency on core%u.\n",
+ lcore_id);
+ return ret;
+ }
+ }
+
return ret;
}
@@ -2299,6 +2316,13 @@ deinit_power_library(void)
}
}
}
+
+ RTE_LCORE_FOREACH(lcore_id) {
+ /* Restore the kernel default value (no constraint). */
+ rte_power_qos_set_cpu_resume_latency(lcore_id,
+ RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT);
+ }
+
return ret;
}
--
2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* RE: [PATCH v10 2/2] examples/l3fwd-power: add PM QoS configuration
2024-09-12 2:38 ` [PATCH v10 2/2] examples/l3fwd-power: add PM QoS configuration Huisong Li
@ 2024-10-14 8:24 ` Konstantin Ananyev
2024-10-14 8:46 ` Konstantin Ananyev
2024-10-15 7:32 ` lihuisong (C)
0 siblings, 2 replies; 114+ messages in thread
From: Konstantin Ananyev @ 2024-10-14 8:24 UTC (permalink / raw)
To: lihuisong (C), dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, Fengchengwen,
liuyonglong, lihuisong (C)
> Add PM QoS configuration to declease the delay after sleep in case of
> entering deeper idle state.
>
> Signed-off-by: Huisong Li <lihuisong@huawei.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> ---
> examples/l3fwd-power/main.c | 24 ++++++++++++++++++++++++
> 1 file changed, 24 insertions(+)
>
> diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
> index 2bb6b092c3..b0ddb54ee2 100644
> --- a/examples/l3fwd-power/main.c
> +++ b/examples/l3fwd-power/main.c
> @@ -47,6 +47,7 @@
> #include <rte_telemetry.h>
> #include <rte_power_pmd_mgmt.h>
> #include <rte_power_uncore.h>
> +#include <rte_power_qos.h>
>
> #include "perf_core.h"
> #include "main.h"
> @@ -2260,6 +2261,22 @@ init_power_library(void)
> return -1;
> }
> }
> +
> + RTE_LCORE_FOREACH(lcore_id) {
> + /*
> + * Set the worker lcore's to have strict latency limit to allow
> + * the CPU to enter the shallowest idle state.
> + */
> + ret = rte_power_qos_set_cpu_resume_latency(lcore_id,
> + RTE_POWER_QOS_STRICT_LATENCY_VALUE);
I wonder why it is set to all worker cores silently and unconditionally?
Wouldn't it be a change from current behavior of the power library?
> + if (ret != 0) {
> + RTE_LOG(ERR, L3FWD_POWER,
> + "Failed to set strict resume latency on core%u.\n",
> + lcore_id);
> + return ret;
> + }
> + }
> +
> return ret;
> }
>
> @@ -2299,6 +2316,13 @@ deinit_power_library(void)
> }
> }
> }
> +
> + RTE_LCORE_FOREACH(lcore_id) {
> + /* Restore the original value in kernel. */
> + rte_power_qos_set_cpu_resume_latency(lcore_id,
> + RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT);
> + }
> +
> return ret;
> }
>
> --
> 2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* RE: [PATCH v10 2/2] examples/l3fwd-power: add PM QoS configuration
2024-10-14 8:24 ` Konstantin Ananyev
@ 2024-10-14 8:46 ` Konstantin Ananyev
2024-10-15 7:32 ` lihuisong (C)
1 sibling, 0 replies; 114+ messages in thread
From: Konstantin Ananyev @ 2024-10-14 8:46 UTC (permalink / raw)
To: Konstantin Ananyev, lihuisong (C), dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, Fengchengwen,
liuyonglong, lihuisong (C)
> > Add PM QoS configuration to declease the delay after sleep in case of
> > entering deeper idle state.
> >
> > Signed-off-by: Huisong Li <lihuisong@huawei.com>
> > Acked-by: Morten Brørup <mb@smartsharesystems.com>
> > ---
> > examples/l3fwd-power/main.c | 24 ++++++++++++++++++++++++
> > 1 file changed, 24 insertions(+)
> >
> > diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
> > index 2bb6b092c3..b0ddb54ee2 100644
> > --- a/examples/l3fwd-power/main.c
> > +++ b/examples/l3fwd-power/main.c
> > @@ -47,6 +47,7 @@
> > #include <rte_telemetry.h>
> > #include <rte_power_pmd_mgmt.h>
> > #include <rte_power_uncore.h>
> > +#include <rte_power_qos.h>
> >
> > #include "perf_core.h"
> > #include "main.h"
> > @@ -2260,6 +2261,22 @@ init_power_library(void)
> > return -1;
> > }
> > }
> > +
> > + RTE_LCORE_FOREACH(lcore_id) {
> > + /*
> > + * Set the worker lcore's to have strict latency limit to allow
> > + * the CPU to enter the shallowest idle state.
> > + */
> > + ret = rte_power_qos_set_cpu_resume_latency(lcore_id,
> > + RTE_POWER_QOS_STRICT_LATENCY_VALUE);
>
> I wonder why it is set to all worker cores silently and unconditionally?
> Wouldn't it be a change from current behavior of the power library?
s/library/sample app/
>
> > + if (ret != 0) {
> > + RTE_LOG(ERR, L3FWD_POWER,
> > + "Failed to set strict resume latency on core%u.\n",
> > + lcore_id);
> > + return ret;
> > + }
> > + }
> > +
> > return ret;
> > }
> >
> > @@ -2299,6 +2316,13 @@ deinit_power_library(void)
> > }
> > }
> > }
> > +
> > + RTE_LCORE_FOREACH(lcore_id) {
> > + /* Restore the original value in kernel. */
> > + rte_power_qos_set_cpu_resume_latency(lcore_id,
> > + RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT);
> > + }
> > +
> > return ret;
> > }
> >
> > --
> > 2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v10 2/2] examples/l3fwd-power: add PM QoS configuration
2024-10-14 8:24 ` Konstantin Ananyev
2024-10-14 8:46 ` Konstantin Ananyev
@ 2024-10-15 7:32 ` lihuisong (C)
2024-10-16 0:24 ` Konstantin Ananyev
1 sibling, 1 reply; 114+ messages in thread
From: lihuisong (C) @ 2024-10-15 7:32 UTC (permalink / raw)
To: Konstantin Ananyev, dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, Fengchengwen,
liuyonglong
Hi Konstantin Ananyev,
On 2024/10/14 16:24, Konstantin Ananyev wrote:
>
>> Add PM QoS configuration to declease the delay after sleep in case of
>> entering deeper idle state.
>>
>> Signed-off-by: Huisong Li <lihuisong@huawei.com>
>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>> ---
>> examples/l3fwd-power/main.c | 24 ++++++++++++++++++++++++
>> 1 file changed, 24 insertions(+)
>>
>> diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
>> index 2bb6b092c3..b0ddb54ee2 100644
>> --- a/examples/l3fwd-power/main.c
>> +++ b/examples/l3fwd-power/main.c
>> @@ -47,6 +47,7 @@
>> #include <rte_telemetry.h>
>> #include <rte_power_pmd_mgmt.h>
>> #include <rte_power_uncore.h>
>> +#include <rte_power_qos.h>
>>
>> #include "perf_core.h"
>> #include "main.h"
>> @@ -2260,6 +2261,22 @@ init_power_library(void)
>> return -1;
>> }
>> }
>> +
>> + RTE_LCORE_FOREACH(lcore_id) {
>> + /*
>> + * Set the worker lcore's to have strict latency limit to allow
>> + * the CPU to enter the shallowest idle state.
>> + */
>> + ret = rte_power_qos_set_cpu_resume_latency(lcore_id,
>> + RTE_POWER_QOS_STRICT_LATENCY_VALUE);
> I wonder why it is set to all worker cores silently and unconditionally?
> Wouldn't it be a change from current behavior of the power library?
L3fwd-power uses Rx interrupts to receive packets. Do you mean that this
setting should only be applied to the cores that serve an Rx queue?
This setting doesn't change the behavior of l3fwd-power. It only lowers
the resume latency when a worker core wakes up from sleep.
>
>> + if (ret != 0) {
>> + RTE_LOG(ERR, L3FWD_POWER,
>> + "Failed to set strict resume latency on core%u.\n",
>> + lcore_id);
>> + return ret;
>> + }
>> + }
>> +
>> return ret;
>> }
>>
>> @@ -2299,6 +2316,13 @@ deinit_power_library(void)
>> }
>> }
>> }
>> +
>> + RTE_LCORE_FOREACH(lcore_id) {
>> + /* Restore the original value in kernel. */
>> + rte_power_qos_set_cpu_resume_latency(lcore_id,
>> + RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT);
>> + }
>> +
>> return ret;
>> }
>>
>> --
>> 2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* RE: [PATCH v10 2/2] examples/l3fwd-power: add PM QoS configuration
2024-10-15 7:32 ` lihuisong (C)
@ 2024-10-16 0:24 ` Konstantin Ananyev
2024-10-17 2:25 ` lihuisong (C)
0 siblings, 1 reply; 114+ messages in thread
From: Konstantin Ananyev @ 2024-10-16 0:24 UTC (permalink / raw)
To: lihuisong (C), dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, Fengchengwen,
liuyonglong
> >> Add PM QoS configuration to decrease the delay after sleep in case of
> >> entering a deeper idle state.
> >>
> >> Signed-off-by: Huisong Li <lihuisong@huawei.com>
> >> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> >> ---
> >> examples/l3fwd-power/main.c | 24 ++++++++++++++++++++++++
> >> 1 file changed, 24 insertions(+)
> >>
> >> diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
> >> index 2bb6b092c3..b0ddb54ee2 100644
> >> --- a/examples/l3fwd-power/main.c
> >> +++ b/examples/l3fwd-power/main.c
> >> @@ -47,6 +47,7 @@
> >> #include <rte_telemetry.h>
> >> #include <rte_power_pmd_mgmt.h>
> >> #include <rte_power_uncore.h>
> >> +#include <rte_power_qos.h>
> >>
> >> #include "perf_core.h"
> >> #include "main.h"
> >> @@ -2260,6 +2261,22 @@ init_power_library(void)
> >> return -1;
> >> }
> >> }
> >> +
> >> + RTE_LCORE_FOREACH(lcore_id) {
> >> + /*
> >> + * Set the worker lcore's to have strict latency limit to allow
> >> + * the CPU to enter the shallowest idle state.
> >> + */
> >> + ret = rte_power_qos_set_cpu_resume_latency(lcore_id,
> >> + RTE_POWER_QOS_STRICT_LATENCY_VALUE);
> > I wonder why it is set to all worker cores silently and unconditionally?
> > Wouldn't it be a change from current behavior of the power library?
> L3fwd-power uses Rx interrupt to receive packet.
AFAIK, not exactly.
From what I remember l3fwd-power still runs RX in poll mode,
though it counts the number of idle Rx bursts.
Once that number goes above some threshold, it puts itself to
sleep with some timeout value.
> Do you mean this
> setting should be for the core of Rx queue, right?
> This setting doesn't change the behavior of l3fwd-power. It is just for
> getting low resume latency when worker core wakes up from sleeping.
As I understand your patch, you force the CPU to select a shallower C-state
when entering such a sleep.
That means possible packet loss will be smaller,
but power consumption will probably be higher, correct?
If so, it looks like a change from the current behavior of that app,
and we probably need to document the expected change.
Or, probably a better way: provide the user with a way to choose,
via a new cmdline option or so.
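
For illustration only, an opt-in switch of the kind suggested above could be
parsed as in the sketch below. The option name "strict-resume-latency" and the
helper parse_strict_latency_opt() are hypothetical; neither exists in
l3fwd-power or in this patch. The sketch relies only on the standard
getopt_long() interface.

```c
#include <assert.h>
#include <getopt.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical option name; l3fwd-power has no such flag in this patch. */
#define OPT_STRICT_LATENCY "strict-resume-latency"

/*
 * Return true if the hypothetical --strict-resume-latency flag is present
 * on the command line, false otherwise. Unknown options are ignored.
 */
static bool
parse_strict_latency_opt(int argc, char **argv)
{
	static const struct option lgopts[] = {
		{OPT_STRICT_LATENCY, no_argument, NULL, 1},
		{NULL, 0, NULL, 0},
	};
	bool enabled = false;
	int opt;

	assert(argv != NULL);

	optind = 1; /* restart scanning so the helper can be called again */
	opterr = 0; /* stay quiet about options this sketch does not know */
	while ((opt = getopt_long(argc, argv, "", lgopts, NULL)) != -1) {
		if (opt == 1)
			enabled = true;
	}

	return enabled;
}
```

With such a flag, the application would request the strict resume latency
only when the user asks for it, and the default power/latency tradeoff would
stay as it is today.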
> >> + if (ret != 0) {
> >> + RTE_LOG(ERR, L3FWD_POWER,
> >> + "Failed to set strict resume latency on core%u.\n",
> >> + lcore_id);
> >> + return ret;
> >> + }
> >> + }
> >> +
> >> return ret;
> >> }
> >>
> >> @@ -2299,6 +2316,13 @@ deinit_power_library(void)
> >> }
> >> }
> >> }
> >> +
> >> + RTE_LCORE_FOREACH(lcore_id) {
> >> + /* Restore the original value in kernel. */
> >> + rte_power_qos_set_cpu_resume_latency(lcore_id,
> >> + RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT);
> >> + }
> >> +
> >> return ret;
> >> }
> >>
> >> --
> >> 2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v10 2/2] examples/l3fwd-power: add PM QoS configuration
2024-10-16 0:24 ` Konstantin Ananyev
@ 2024-10-17 2:25 ` lihuisong (C)
2024-10-17 11:14 ` Konstantin Ananyev
0 siblings, 1 reply; 114+ messages in thread
From: lihuisong (C) @ 2024-10-17 2:25 UTC (permalink / raw)
To: Konstantin Ananyev, dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, Fengchengwen,
liuyonglong
On 2024/10/16 8:24, Konstantin Ananyev wrote:
>
>>>> Add PM QoS configuration to decrease the delay after sleep in case of
>>>> entering a deeper idle state.
>>>>
>>>> Signed-off-by: Huisong Li <lihuisong@huawei.com>
>>>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>>>> ---
>>>> examples/l3fwd-power/main.c | 24 ++++++++++++++++++++++++
>>>> 1 file changed, 24 insertions(+)
>>>>
>>>> diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
>>>> index 2bb6b092c3..b0ddb54ee2 100644
>>>> --- a/examples/l3fwd-power/main.c
>>>> +++ b/examples/l3fwd-power/main.c
>>>> @@ -47,6 +47,7 @@
>>>> #include <rte_telemetry.h>
>>>> #include <rte_power_pmd_mgmt.h>
>>>> #include <rte_power_uncore.h>
>>>> +#include <rte_power_qos.h>
>>>>
>>>> #include "perf_core.h"
>>>> #include "main.h"
>>>> @@ -2260,6 +2261,22 @@ init_power_library(void)
>>>> return -1;
>>>> }
>>>> }
>>>> +
>>>> + RTE_LCORE_FOREACH(lcore_id) {
>>>> + /*
>>>> + * Set the worker lcore's to have strict latency limit to allow
>>>> + * the CPU to enter the shallowest idle state.
>>>> + */
>>>> + ret = rte_power_qos_set_cpu_resume_latency(lcore_id,
>>>> + RTE_POWER_QOS_STRICT_LATENCY_VALUE);
>>> I wonder why it is set to all worker cores silently and unconditionally?
>>> Wouldn't it be a change from current behavior of the power library?
>> L3fwd-power uses Rx interrupt to receive packet.
> AFAIK, not exactly.
> From what I remember l3fwd-power still runs RX in poll mode,
> though it counts the number of idle Rx bursts.
> Once that number goes above some threshold, it puts itself to
> sleep with some timeout value.
Exactly.
>
>> Do you mean this
>> setting should be for the core of Rx queue, right?
>> This setting doesn't change the behavior of l3fwd-power. It is just for
>> getting low resume latency when worker core wakes up from sleeping.
> As I understand your patch, you force the CPU to select a shallower C-state
> when entering such a sleep.
> That means possible packet loss will be smaller,
> but power consumption will probably be higher, correct?
correct.
> If so, it looks like a change from the current behavior of that app,
> and we probably need to document the expected change.
> Or, probably a better way: provide the user with a way to choose,
> via a new cmdline option or so.
Yes.
The power consumption may increase, but the performance is better with
this patch if the platform enables the cpuidle function.
After all, this is a minor point. It should be enough to document
this change and its impact in the doc of this API, to make it clearer for users.
What do you think?
>
>
>>>> + if (ret != 0) {
>>>> + RTE_LOG(ERR, L3FWD_POWER,
>>>> + "Failed to set strict resume latency on core%u.\n",
>>>> + lcore_id);
>>>> + return ret;
>>>> + }
>>>> + }
>>>> +
>>>> return ret;
>>>> }
>>>>
>>>> @@ -2299,6 +2316,13 @@ deinit_power_library(void)
>>>> }
>>>> }
>>>> }
>>>> +
>>>> + RTE_LCORE_FOREACH(lcore_id) {
>>>> + /* Restore the original value in kernel. */
>>>> + rte_power_qos_set_cpu_resume_latency(lcore_id,
>>>> + RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT);
>>>> + }
>>>> +
>>>> return ret;
>>>> }
>>>>
>>>> --
>>>> 2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* RE: [PATCH v10 2/2] examples/l3fwd-power: add PM QoS configuration
2024-10-17 2:25 ` lihuisong (C)
@ 2024-10-17 11:14 ` Konstantin Ananyev
0 siblings, 0 replies; 114+ messages in thread
From: Konstantin Ananyev @ 2024-10-17 11:14 UTC (permalink / raw)
To: lihuisong (C), dev, david.hunt
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, Fengchengwen,
liuyonglong
> >
> >>>> Add PM QoS configuration to decrease the delay after sleep in case of
> >>>> entering a deeper idle state.
> >>>>
> >>>> Signed-off-by: Huisong Li <lihuisong@huawei.com>
> >>>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> >>>> ---
> >>>> examples/l3fwd-power/main.c | 24 ++++++++++++++++++++++++
> >>>> 1 file changed, 24 insertions(+)
> >>>>
> >>>> diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
> >>>> index 2bb6b092c3..b0ddb54ee2 100644
> >>>> --- a/examples/l3fwd-power/main.c
> >>>> +++ b/examples/l3fwd-power/main.c
> >>>> @@ -47,6 +47,7 @@
> >>>> #include <rte_telemetry.h>
> >>>> #include <rte_power_pmd_mgmt.h>
> >>>> #include <rte_power_uncore.h>
> >>>> +#include <rte_power_qos.h>
> >>>>
> >>>> #include "perf_core.h"
> >>>> #include "main.h"
> >>>> @@ -2260,6 +2261,22 @@ init_power_library(void)
> >>>> return -1;
> >>>> }
> >>>> }
> >>>> +
> >>>> + RTE_LCORE_FOREACH(lcore_id) {
> >>>> + /*
> >>>> + * Set the worker lcore's to have strict latency limit to allow
> >>>> + * the CPU to enter the shallowest idle state.
> >>>> + */
> >>>> + ret = rte_power_qos_set_cpu_resume_latency(lcore_id,
> >>>> + RTE_POWER_QOS_STRICT_LATENCY_VALUE);
> >>> I wonder why it is set to all worker cores silently and unconditionally?
> >>> Wouldn't it be a change from current behavior of the power library?
> >> L3fwd-power uses Rx interrupt to receive packet.
> > AFAIK, not exactly.
> > From what I remember l3fwd-power still runs RX in poll mode,
> > though it counts the number of idle Rx bursts.
> > Once that number goes above some threshold, it puts itself to
> > sleep with some timeout value.
> Exactly.
> >
> >> Do you mean this
> >> setting should be for the core of Rx queue, right?
> >> This setting doesn't change the behavior of l3fwd-power. It is just for
> >> getting low resume latency when worker core wakes up from sleeping.
> > As I understand your patch, you force the CPU to select a shallower C-state
> > when entering such a sleep.
> > That means possible packet loss will be smaller,
> > but power consumption will probably be higher, correct?
> correct.
> > If so, it looks like a change from the current behavior of that app,
> > and we probably need to document the expected change.
> > Or, probably a better way: provide the user with a way to choose,
> > via a new cmdline option or so.
> Yes.
> The power consumption may increase, but the performance is better with
> this patch if the platform enables the cpuidle function.
Yes, that is what I expect, and personally I am OK with that.
Though I suspect different users who use this sample as a test app
might have different priorities in that tradeoff (power vs performance).
> After all, this is a minor point. It should be enough to document
> this change and its impact in the doc of this API, to make it clearer for users.
> What do you think?
I think yes, probably just updating the docs (rel-notes, SG?) will be enough.
David Hunt, what are your thoughts here?
> >
> >>>> + if (ret != 0) {
> >>>> + RTE_LOG(ERR, L3FWD_POWER,
> >>>> + "Failed to set strict resume latency on core%u.\n",
> >>>> + lcore_id);
> >>>> + return ret;
> >>>> + }
> >>>> + }
> >>>> +
> >>>> return ret;
> >>>> }
> >>>>
> >>>> @@ -2299,6 +2316,13 @@ deinit_power_library(void)
> >>>> }
> >>>> }
> >>>> }
> >>>> +
> >>>> + RTE_LCORE_FOREACH(lcore_id) {
> >>>> + /* Restore the original value in kernel. */
> >>>> + rte_power_qos_set_cpu_resume_latency(lcore_id,
> >>>> + RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT);
> >>>> + }
> >>>> +
> >>>> return ret;
> >>>> }
> >>>>
> >>>> --
> >>>> 2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v10 0/2] power: introduce PM QoS interface
2024-09-12 2:38 ` [PATCH v10 0/2] power: introduce PM QoS interface Huisong Li
2024-09-12 2:38 ` [PATCH v10 1/2] power: introduce PM QoS API on CPU wide Huisong Li
2024-09-12 2:38 ` [PATCH v10 2/2] examples/l3fwd-power: add PM QoS configuration Huisong Li
@ 2024-09-12 3:07 ` fengchengwen
2024-10-12 2:07 ` lihuisong (C)
2024-10-14 15:27 ` Stephen Hemminger
3 siblings, 1 reply; 114+ messages in thread
From: fengchengwen @ 2024-09-12 3:07 UTC (permalink / raw)
To: Huisong Li, dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, liuyonglong
Series-reviewed-by: Chengwen Feng <fengchengwen@huawei.com>
On 2024/9/12 10:38, Huisong Li wrote:
> The deeper the idle state, the lower the power consumption, but the longer
> the resume time. Some service are delay sensitive and very except the low
> resume time, like interrupt packet receiving mode.
>
> And the "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
> interface is used to set and get the resume latency limit on the cpuX for
> userspace. Please see the description in kernel document[1].
> Each cpuidle governor in Linux select which idle state to enter based on
> this CPU resume latency in their idle task.
>
> The per-CPU PM QoS API can be used to control this CPU's idle state
> selection and limit just enter the shallowest idle state to low the delay
> after sleep by setting strict resume latency (zero value).
>
> [1] https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
>
> ---
> v10:
> - replace LINE_MAX with a custom macro and fix two typos.
> v9:
> - move new feature description from release_24_07.rst to release_24_11.rst.
> v8:
> - update the latest code to resolve CI warning
> v7:
> - remove a dead code rte_lcore_is_enabled in patch[2/2]
> v6:
> - update release_24_07.rst based on dpdk repo to resolve CI warning.
> v5:
> - use LINE_MAX to replace BUFSIZ, and use snprintf to replace sprintf.
> v4:
> - fix some comments basd on Stephen
> - add stdint.h include
> - add Acked-by Morten Brørup <mb@smartsharesystems.com>
> v3:
> - add RTE_POWER_xxx prefix for some macro in header
> - add the check for lcore_id with rte_lcore_is_enabled
> v2:
> - use PM QoS on CPU wide to replace the one on system wide
>
> Huisong Li (2):
> power: introduce PM QoS API on CPU wide
> examples/l3fwd-power: add PM QoS configuration
>
> doc/guides/prog_guide/power_man.rst | 24 ++++++
> doc/guides/rel_notes/release_24_11.rst | 5 ++
> examples/l3fwd-power/main.c | 24 ++++++
> lib/power/meson.build | 2 +
> lib/power/rte_power_qos.c | 111 +++++++++++++++++++++++++
> lib/power/rte_power_qos.h | 73 ++++++++++++++++
> lib/power/version.map | 4 +
> 7 files changed, 243 insertions(+)
> create mode 100644 lib/power/rte_power_qos.c
> create mode 100644 lib/power/rte_power_qos.h
>
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v10 0/2] power: introduce PM QoS interface
2024-09-12 3:07 ` [PATCH v10 0/2] power: introduce PM QoS interface fengchengwen
@ 2024-10-12 2:07 ` lihuisong (C)
0 siblings, 0 replies; 114+ messages in thread
From: lihuisong (C) @ 2024-10-12 2:07 UTC (permalink / raw)
To: dev, thomas, ferruh.yigit, fengchengwen
Cc: mb, anatoly.burakov, david.hunt, sivaprasad.tummala, stephen,
david.marchand, liuyonglong
Hi Thomas and Ferruh,
Kindly ping for merge.
/Huisong
On 2024/9/12 11:07, fengchengwen wrote:
> Series-reviewed-by: Chengwen Feng <fengchengwen@huawei.com>
Thanks chengwen.
>
> On 2024/9/12 10:38, Huisong Li wrote:
>> The deeper the idle state, the lower the power consumption, but the longer
>> the resume time. Some service are delay sensitive and very except the low
>> resume time, like interrupt packet receiving mode.
>>
>> And the "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
>> interface is used to set and get the resume latency limit on the cpuX for
>> userspace. Please see the description in kernel document[1].
>> Each cpuidle governor in Linux select which idle state to enter based on
>> this CPU resume latency in their idle task.
>>
>> The per-CPU PM QoS API can be used to control this CPU's idle state
>> selection and limit just enter the shallowest idle state to low the delay
>> after sleep by setting strict resume latency (zero value).
>>
>> [1] https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
>>
>> ---
>> v10:
>> - replace LINE_MAX with a custom macro and fix two typos.
>> v9:
>> - move new feature description from release_24_07.rst to release_24_11.rst.
>> v8:
>> - update the latest code to resolve CI warning
>> v7:
>> - remove a dead code rte_lcore_is_enabled in patch[2/2]
>> v6:
>> - update release_24_07.rst based on dpdk repo to resolve CI warning.
>> v5:
>> - use LINE_MAX to replace BUFSIZ, and use snprintf to replace sprintf.
>> v4:
>> - fix some comments basd on Stephen
>> - add stdint.h include
>> - add Acked-by Morten Brørup <mb@smartsharesystems.com>
>> v3:
>> - add RTE_POWER_xxx prefix for some macro in header
>> - add the check for lcore_id with rte_lcore_is_enabled
>> v2:
>> - use PM QoS on CPU wide to replace the one on system wide
>>
>> Huisong Li (2):
>> power: introduce PM QoS API on CPU wide
>> examples/l3fwd-power: add PM QoS configuration
>>
>> doc/guides/prog_guide/power_man.rst | 24 ++++++
>> doc/guides/rel_notes/release_24_11.rst | 5 ++
>> examples/l3fwd-power/main.c | 24 ++++++
>> lib/power/meson.build | 2 +
>> lib/power/rte_power_qos.c | 111 +++++++++++++++++++++++++
>> lib/power/rte_power_qos.h | 73 ++++++++++++++++
>> lib/power/version.map | 4 +
>> 7 files changed, 243 insertions(+)
>> create mode 100644 lib/power/rte_power_qos.c
>> create mode 100644 lib/power/rte_power_qos.h
>>
> .
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v10 0/2] power: introduce PM QoS interface
2024-09-12 2:38 ` [PATCH v10 0/2] power: introduce PM QoS interface Huisong Li
` (2 preceding siblings ...)
2024-09-12 3:07 ` [PATCH v10 0/2] power: introduce PM QoS interface fengchengwen
@ 2024-10-14 15:27 ` Stephen Hemminger
2024-10-15 9:30 ` lihuisong (C)
3 siblings, 1 reply; 114+ messages in thread
From: Stephen Hemminger @ 2024-10-14 15:27 UTC (permalink / raw)
To: Huisong Li
Cc: dev, mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, david.marchand, fengchengwen, liuyonglong
On Thu, 12 Sep 2024 10:38:10 +0800
Huisong Li <lihuisong@huawei.com> wrote:
> The deeper the idle state, the lower the power consumption, but the longer
> the resume time. Some service are delay sensitive and very except the low
> resume time, like interrupt packet receiving mode.
>
> And the "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
> interface is used to set and get the resume latency limit on the cpuX for
> userspace. Please see the description in kernel document[1].
> Each cpuidle governor in Linux select which idle state to enter based on
> this CPU resume latency in their idle task.
>
> The per-CPU PM QoS API can be used to control this CPU's idle state
> selection and limit just enter the shallowest idle state to low the delay
> after sleep by setting strict resume latency (zero value).
>
> [1] https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
This is not a direct critique of this patch.
The power library should have been designed to take a single configuration structure
specifying CPU frequencies, wake up latency, and all the parameters from the kernel.
And there would be a simple API with: rte_power_config_set() and rte_power_config_get().
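
As a rough illustration of the single-structure idea above, the sketch below
keeps one per-lcore configuration record behind a set/get pair. Everything in
it is hypothetical: struct sketch_power_config, its fields, and the
sketch_power_config_set()/sketch_power_config_get() helpers only stand in for
the proposed rte_power_config_set()/rte_power_config_get(), which do not exist
in DPDK. The sketch stores values in memory and does not touch the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_LCORE 128 /* arbitrary bound for this sketch */

/* Hypothetical unified configuration; field names are assumptions. */
struct sketch_power_config {
	uint32_t min_freq_khz;  /* lowest allowed CPU frequency */
	uint32_t max_freq_khz;  /* highest allowed CPU frequency */
	int resume_latency_us;  /* wake-up latency limit, 0 means strict */
};

static struct sketch_power_config sketch_cfg[SKETCH_MAX_LCORE];

/* Validate and store the whole configuration of one lcore in one call. */
static int
sketch_power_config_set(unsigned int lcore_id,
		const struct sketch_power_config *cfg)
{
	if (lcore_id >= SKETCH_MAX_LCORE || cfg == NULL)
		return -EINVAL;
	if (cfg->resume_latency_us < 0 ||
	    cfg->min_freq_khz > cfg->max_freq_khz)
		return -EINVAL;

	sketch_cfg[lcore_id] = *cfg;
	return 0;
}

/* Read back the configuration stored for one lcore. */
static int
sketch_power_config_get(unsigned int lcore_id,
		struct sketch_power_config *cfg)
{
	if (lcore_id >= SKETCH_MAX_LCORE || cfg == NULL)
		return -EINVAL;

	*cfg = sketch_cfg[lcore_id];
	return 0;
}
```

A real implementation would apply each field through the matching kernel
interface; the point here is only that all parameters travel together and are
validated in one place.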
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v10 0/2] power: introduce PM QoS interface
2024-10-14 15:27 ` Stephen Hemminger
@ 2024-10-15 9:30 ` lihuisong (C)
0 siblings, 0 replies; 114+ messages in thread
From: lihuisong (C) @ 2024-10-15 9:30 UTC (permalink / raw)
To: Stephen Hemminger
Cc: dev, mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, david.marchand, fengchengwen, liuyonglong
On 2024/10/14 23:27, Stephen Hemminger wrote:
> On Thu, 12 Sep 2024 10:38:10 +0800
> Huisong Li <lihuisong@huawei.com> wrote:
>
>> The deeper the idle state, the lower the power consumption, but the longer
>> the resume time. Some service are delay sensitive and very except the low
>> resume time, like interrupt packet receiving mode.
>>
>> And the "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
>> interface is used to set and get the resume latency limit on the cpuX for
>> userspace. Please see the description in kernel document[1].
>> Each cpuidle governor in Linux select which idle state to enter based on
>> this CPU resume latency in their idle task.
>>
>> The per-CPU PM QoS API can be used to control this CPU's idle state
>> selection and limit just enter the shallowest idle state to low the delay
>> after sleep by setting strict resume latency (zero value).
>>
>> [1] https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
>
> This is not a direct critique of this patch.
> The power library should have been designed to take a single configuration structure
> specifying CPU frequencies, wake up latency, and all the parameters from the kernel.
> And there would be a simple API with: rte_power_config_set() and rte_power_config_get().
Agreed. There are several different configuration objects in the power library.
It would be better if we could put the relevant configurations together.
This may become possible after Sivaprasad's patches that rework the core
and uncore code in the power library.
>
> .
^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH v11 0/2] power: introduce PM QoS interface
2024-03-20 10:55 [PATCH 0/2] introduce PM QoS interface Huisong Li
` (11 preceding siblings ...)
2024-09-12 2:38 ` [PATCH v10 0/2] power: introduce PM QoS interface Huisong Li
@ 2024-10-21 11:42 ` Huisong Li
2024-10-21 11:42 ` [PATCH v11 1/2] power: introduce PM QoS API on CPU wide Huisong Li
2024-10-21 11:42 ` [PATCH v11 2/2] examples/l3fwd-power: add PM QoS configuration Huisong Li
2024-10-23 4:09 ` [PATCH v12 0/3] power: introduce PM QoS interface Huisong Li
` (2 subsequent siblings)
15 siblings, 2 replies; 114+ messages in thread
From: Huisong Li @ 2024-10-21 11:42 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, konstantin.ananyev, david.marchand,
fengchengwen, liuyonglong, lihuisong
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay sensitive and expect a low
resume time, like the interrupt packet receiving mode.
The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
interface is used to set and get the resume latency limit on cpuX from
userspace. Please see the description in the kernel document[1].
Each cpuidle governor in Linux selects which idle state to enter based on
this CPU resume latency in its idle task.
The per-CPU PM QoS API can be used to control this CPU's idle state
selection and, by setting a strict resume latency (zero value), to limit it
to the shallowest idle state, which lowers the delay when waking up from an
idle state.
[1] https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
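
The value encoding deserves a short illustration, because the sysfs strings
are easy to misread: as the code comments in rte_power_qos.c of this series
describe, the string "n/a" means a resume latency of 0 (strict), while the
string "0" means no constraint. Below is a minimal standalone sketch of that
mapping. The helper names qos_latency_to_sysfs() and qos_latency_from_sysfs()
and the QOS_* macros are made up for this example; they mirror the values in
the patch but are not part of it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names; the values mirror the macros added by the patch. */
#define QOS_STRICT_LATENCY 0
#define QOS_NO_CONSTRAINT ((int)(UINT32_MAX >> 1))

/* Convert an API latency value into the string written to sysfs. */
static int
qos_latency_to_sysfs(int latency, char *buf, size_t len)
{
	assert(buf != NULL);

	if (latency < 0)
		return -1;

	if (latency == QOS_STRICT_LATENCY)
		snprintf(buf, len, "%s", "n/a"); /* zero latency tolerated */
	else if (latency == QOS_NO_CONSTRAINT)
		snprintf(buf, len, "%d", 0);     /* "0" lifts the constraint */
	else
		snprintf(buf, len, "%d", latency);

	return 0;
}

/* Convert a string read from sysfs back into an API latency value. */
static int
qos_latency_from_sysfs(const char *str)
{
	long val;

	assert(str != NULL);

	/* Compare only the prefix so a trailing newline is tolerated. */
	if (strncmp(str, "n/a", 3) == 0)
		return QOS_STRICT_LATENCY;

	val = strtol(str, NULL, 10);
	return val == 0 ? QOS_NO_CONSTRAINT : (int)val;
}
```

Keeping the two directions symmetric means a value set through the API reads
back unchanged, which is what a caller restoring a saved latency would rely
on.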
---
v11:
- operate on the CPU id that the lcore is mapped to, obtained through the
new function power_get_lcore_mapped_cpu_id().
v10:
- replace LINE_MAX with a custom macro and fix two typos.
v9:
- move new feature description from release_24_07.rst to release_24_11.rst.
v8:
- update the latest code to resolve CI warning
v7:
- remove a dead code rte_lcore_is_enabled in patch[2/2]
v6:
- update release_24_07.rst based on dpdk repo to resolve CI warning.
v5:
- use LINE_MAX to replace BUFSIZ, and use snprintf to replace sprintf.
v4:
- fix some comments based on Stephen's review
- add stdint.h include
- add Acked-by Morten Brørup <mb@smartsharesystems.com>
v3:
- add RTE_POWER_xxx prefix for some macro in header
- add the check for lcore_id with rte_lcore_is_enabled
v2:
- use PM QoS on CPU wide to replace the one on system wide
Huisong Li (2):
power: introduce PM QoS API on CPU wide
examples/l3fwd-power: add PM QoS configuration
doc/guides/prog_guide/power_man.rst | 19 ++++
doc/guides/rel_notes/release_24_11.rst | 5 +
examples/l3fwd-power/main.c | 24 +++++
lib/power/meson.build | 2 +
lib/power/rte_power_qos.c | 123 +++++++++++++++++++++++++
lib/power/rte_power_qos.h | 73 +++++++++++++++
lib/power/version.map | 4 +
7 files changed, 250 insertions(+)
create mode 100644 lib/power/rte_power_qos.c
create mode 100644 lib/power/rte_power_qos.h
--
2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH v11 1/2] power: introduce PM QoS API on CPU wide
2024-10-21 11:42 ` [PATCH v11 " Huisong Li
@ 2024-10-21 11:42 ` Huisong Li
2024-10-22 9:08 ` Konstantin Ananyev
2024-10-21 11:42 ` [PATCH v11 2/2] examples/l3fwd-power: add PM QoS configuration Huisong Li
1 sibling, 1 reply; 114+ messages in thread
From: Huisong Li @ 2024-10-21 11:42 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, konstantin.ananyev, david.marchand,
fengchengwen, liuyonglong, lihuisong
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay sensitive and expect a low
resume time, like the interrupt packet receiving mode.
The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
interface is used to set and get the resume latency limit on cpuX from
userspace. Each cpuidle governor in Linux selects which idle state to enter
based on this CPU resume latency in its idle task.
The per-CPU PM QoS API can be used to control this CPU's idle state
selection and, by setting a strict resume latency (zero value), to limit it
to the shallowest idle state, which lowers the delay when waking up from an
idle state.
Signed-off-by: Huisong Li <lihuisong@huawei.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
doc/guides/prog_guide/power_man.rst | 19 ++++
doc/guides/rel_notes/release_24_11.rst | 5 +
lib/power/meson.build | 2 +
lib/power/rte_power_qos.c | 123 +++++++++++++++++++++++++
lib/power/rte_power_qos.h | 73 +++++++++++++++
lib/power/version.map | 4 +
6 files changed, 226 insertions(+)
create mode 100644 lib/power/rte_power_qos.c
create mode 100644 lib/power/rte_power_qos.h
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index f6674efe2d..91358b04f3 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -107,6 +107,25 @@ User Cases
The power management mechanism is used to save power when performing L3 forwarding.
+PM QoS
+------
+
+The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
+interface is used to set and get the resume latency limit on cpuX from
+userspace. Each cpuidle governor in Linux selects which idle state to enter
+based on this CPU resume latency in its idle task.
+
+The deeper the idle state, the lower the power consumption, but the longer
+the resume time. Some services are latency sensitive and expect a low
+resume time, like the interrupt packet receiving mode.
+
+Applications can set and get the CPU resume latency with
+``rte_power_qos_set_cpu_resume_latency()`` and ``rte_power_qos_get_cpu_resume_latency()``
+respectively. Applications can set a strict resume latency (zero value) with
+``rte_power_qos_set_cpu_resume_latency()`` to lower the resume latency and
+get better performance (at the cost of possibly higher platform power consumption).
+
+
Ethernet PMD Power Management API
---------------------------------
diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
index fa4822d928..d9e268274b 100644
--- a/doc/guides/rel_notes/release_24_11.rst
+++ b/doc/guides/rel_notes/release_24_11.rst
@@ -237,6 +237,11 @@ New Features
This field is used to pass an extra configuration settings such as ability
to lookup IPv4 addresses in network byte order.
+* **Introduce per-CPU PM QoS interface.**
+
+ * Add per-CPU PM QoS interface to lower the resume latency when waking up
+ from an idle state.
+
* **Added new API to register telemetry endpoint callbacks with private arguments.**
A new ``rte_telemetry_register_cmd_arg`` function is available to pass an opaque value to
diff --git a/lib/power/meson.build b/lib/power/meson.build
index 2f0f3d26e9..9b5d3e8315 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -23,12 +23,14 @@ sources = files(
'rte_power.c',
'rte_power_uncore.c',
'rte_power_pmd_mgmt.c',
+ 'rte_power_qos.c',
)
headers = files(
'rte_power.h',
'rte_power_guest_channel.h',
'rte_power_pmd_mgmt.h',
'rte_power_uncore.h',
+ 'rte_power_qos.h',
)
deps += ['timer', 'ethdev']
diff --git a/lib/power/rte_power_qos.c b/lib/power/rte_power_qos.c
new file mode 100644
index 0000000000..09692b2161
--- /dev/null
+++ b/lib/power/rte_power_qos.c
@@ -0,0 +1,123 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#include <errno.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <rte_lcore.h>
+#include <rte_log.h>
+
+#include "power_common.h"
+#include "rte_power_qos.h"
+
+#define PM_QOS_SYSFILE_RESUME_LATENCY_US \
+ "/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us"
+
+#define PM_QOS_CPU_RESUME_LATENCY_BUF_LEN 32
+
+int
+rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency)
+{
+ char buf[PM_QOS_CPU_RESUME_LATENCY_BUF_LEN];
+ uint32_t cpu_id;
+ FILE *f;
+ int ret;
+
+ if (!rte_lcore_is_enabled(lcore_id)) {
+ POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
+ return -EINVAL;
+ }
+ ret = power_get_lcore_mapped_cpu_id(lcore_id, &cpu_id);
+ if (ret != 0)
+ return ret;
+
+ if (latency < 0) {
+ POWER_LOG(ERR, "latency should be greater than or equal to 0");
+ return -EINVAL;
+ }
+
+ ret = open_core_sysfs_file(&f, "w", PM_QOS_SYSFILE_RESUME_LATENCY_US, cpu_id);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US" : %s",
+ cpu_id, strerror(errno));
+ return ret;
+ }
+
+ /*
+ * The pm_qos_resume_latency_us sysfs file at
+ * @PM_QOS_SYSFILE_RESUME_LATENCY_US interprets the input string
+ * as follows:
+ * 1> the resume latency is 0 if the input is "n/a".
+ * 2> there is no resume latency constraint if the input is "0".
+ * 3> otherwise, the resume latency is the value written.
+ */
+ if (latency == 0)
+ snprintf(buf, sizeof(buf), "%s", "n/a");
+ else if (latency == RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT)
+ snprintf(buf, sizeof(buf), "%u", 0);
+ else
+ snprintf(buf, sizeof(buf), "%u", latency);
+
+ ret = write_core_sysfs_s(f, buf);
+ if (ret != 0)
+ POWER_LOG(ERR, "Failed to write "PM_QOS_SYSFILE_RESUME_LATENCY_US" : %s",
+ cpu_id, strerror(errno));
+
+ fclose(f);
+
+ return ret;
+}
+
+int
+rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id)
+{
+ char buf[PM_QOS_CPU_RESUME_LATENCY_BUF_LEN];
+ int latency = -1;
+ uint32_t cpu_id;
+ FILE *f;
+ int ret;
+
+ if (!rte_lcore_is_enabled(lcore_id)) {
+ POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
+ return -EINVAL;
+ }
+ ret = power_get_lcore_mapped_cpu_id(lcore_id, &cpu_id);
+ if (ret != 0)
+ return ret;
+
+ ret = open_core_sysfs_file(&f, "r", PM_QOS_SYSFILE_RESUME_LATENCY_US, cpu_id);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US" : %s",
+ cpu_id, strerror(errno));
+ return ret;
+ }
+
+ ret = read_core_sysfs_s(f, buf, sizeof(buf));
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to read "PM_QOS_SYSFILE_RESUME_LATENCY_US" : %s",
+ cpu_id, strerror(errno));
+ goto out;
+ }
+
+ /*
+ * The kernel's pm_qos_resume_latency_us sysfs file (see
+ * @PM_QOS_SYSFILE_RESUME_LATENCY_US) reports the limit as one of the
+ * following strings:
+ * 1> "n/a": the resume latency is 0 (strict).
+ * 2> "0": there is no constraint on the resume latency.
+ * 3> any other value: the resume latency limit in use, in microseconds.
+ */
+ if (strcmp(buf, "n/a") == 0)
+ latency = 0;
+ else {
+ latency = strtoul(buf, NULL, 10);
+ latency = latency == 0 ? RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT : latency;
+ }
+
+out:
+ fclose(f);
+
+ return latency != -1 ? latency : ret;
+}
diff --git a/lib/power/rte_power_qos.h b/lib/power/rte_power_qos.h
new file mode 100644
index 0000000000..990c488373
--- /dev/null
+++ b/lib/power/rte_power_qos.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#ifndef RTE_POWER_QOS_H
+#define RTE_POWER_QOS_H
+
+#include <stdint.h>
+
+#include <rte_compat.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * @file rte_power_qos.h
+ *
+ * PM QoS API.
+ *
+ * The per-CPU resume latency limit affects this CPU's idle state selection
+ * in each cpuidle governor.
+ * See the kernel documentation of the per-CPU PM QoS interface:
+ * https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
+ *
+ * The deeper the idle state, the lower the power consumption, but the
+ * longer the resume time. Some services are delay sensitive and expect a
+ * low resume time, for example in interrupt packet receiving mode.
+ *
+ * In these cases, the per-CPU PM QoS API can be used to control this CPU's
+ * idle state selection: setting a strict resume latency (zero value) limits
+ * the CPU to the shallowest idle state, which lowers the delay after sleep.
+ */
+
+#define RTE_POWER_QOS_STRICT_LATENCY_VALUE 0
+#define RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT ((int)(UINT32_MAX >> 1))
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * @param lcore_id
+ * target logical core id
+ *
+ * @param latency
+ * The latency in microseconds; it should be greater than or equal to zero.
+ *
+ * @return
+ * 0 on success. Otherwise negative value is returned.
+ */
+__rte_experimental
+int rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Get the current resume latency of this logical core.
+ * If no limit has been set, the kernel default is
+ * @see RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT.
+ *
+ * @return
+ * Negative value on failure.
+ * >= 0 means the actual resume latency limit on this core.
+ */
+__rte_experimental
+int rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* RTE_POWER_QOS_H */
diff --git a/lib/power/version.map b/lib/power/version.map
index c9a226614e..08f178a39d 100644
--- a/lib/power/version.map
+++ b/lib/power/version.map
@@ -51,4 +51,8 @@ EXPERIMENTAL {
rte_power_set_uncore_env;
rte_power_uncore_freqs;
rte_power_unset_uncore_env;
+
+ # added in 24.11
+ rte_power_qos_get_cpu_resume_latency;
+ rte_power_qos_set_cpu_resume_latency;
};
--
2.22.0
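
The mapping between the API's latency value and the string exchanged with the pm_qos_resume_latency_us sysfs file can be sketched on its own, without DPDK. This is only an illustration under stated assumptions: the names qos_latency_to_sysfs(), qos_latency_from_sysfs(), QOS_STRICT_LATENCY and QOS_NO_CONSTRAINT are hypothetical stand-ins and are not part of the patch.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the RTE_POWER_QOS_* macros in the patch. */
#define QOS_STRICT_LATENCY 0
#define QOS_NO_CONSTRAINT ((int)INT32_MAX)

/*
 * Format an API latency value as the string the sysfs file expects.
 * Returns 0 on success, -1 if latency is negative or buf is too small.
 */
int
qos_latency_to_sysfs(int latency, char *buf, size_t len)
{
    int n;

    if (latency < 0)
        return -1;
    if (latency == QOS_STRICT_LATENCY)
        n = snprintf(buf, len, "n/a");  /* exact zero latency */
    else if (latency == QOS_NO_CONSTRAINT)
        n = snprintf(buf, len, "0");    /* "0" means no constraint */
    else
        n = snprintf(buf, len, "%d", latency);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Parse a string read from the sysfs file back into an API latency value.
 * A trailing newline is tolerated. Returns the latency (>= 0), or -1 if
 * the string is not a valid value.
 */
int
qos_latency_from_sysfs(const char *s)
{
    char *end;
    long v;

    if (strncmp(s, "n/a", 3) == 0)
        return QOS_STRICT_LATENCY;

    v = strtol(s, &end, 10);
    if (end == s || v < 0 || v > INT32_MAX)
        return -1;

    return v == 0 ? QOS_NO_CONSTRAINT : (int)v;
}
```

Keeping the two directions next to each other makes the inversion easy to see: the API value 0 is written as "n/a", while the string "0" is read back as the no-constraint value.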
^ permalink raw reply [flat|nested] 114+ messages in thread
* RE: [PATCH v11 1/2] power: introduce PM QoS API on CPU wide
2024-10-21 11:42 ` [PATCH v11 1/2] power: introduce PM QoS API on CPU wide Huisong Li
@ 2024-10-22 9:08 ` Konstantin Ananyev
2024-10-22 9:41 ` lihuisong (C)
0 siblings, 1 reply; 114+ messages in thread
From: Konstantin Ananyev @ 2024-10-22 9:08 UTC (permalink / raw)
To: lihuisong (C), dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, Fengchengwen,
liuyonglong
> The deeper the idle state, the lower the power consumption, but the longer
> the resume time. Some services are delay sensitive and expect a low resume
> time, for example in interrupt packet receiving mode.
>
> The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
> interface lets userspace set and get the resume latency limit on cpuX.
> Each cpuidle governor in Linux selects which idle state to enter based on
> this CPU resume latency in its idle task.
>
> The per-CPU PM QoS API can be used to control this CPU's idle state
> selection: setting a strict resume latency (zero value) limits the CPU to
> the shallowest idle state, which lowers the delay when waking up.
>
> Signed-off-by: Huisong Li <lihuisong@huawei.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
LGTM overall, few nits, see below.
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> ---
> doc/guides/prog_guide/power_man.rst | 19 ++++
> doc/guides/rel_notes/release_24_11.rst | 5 +
> lib/power/meson.build | 2 +
> lib/power/rte_power_qos.c | 123 +++++++++++++++++++++++++
> lib/power/rte_power_qos.h | 73 +++++++++++++++
> lib/power/version.map | 4 +
> 6 files changed, 226 insertions(+)
> create mode 100644 lib/power/rte_power_qos.c
> create mode 100644 lib/power/rte_power_qos.h
>
> diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
> index f6674efe2d..91358b04f3 100644
> --- a/doc/guides/prog_guide/power_man.rst
> +++ b/doc/guides/prog_guide/power_man.rst
> @@ -107,6 +107,25 @@ User Cases
> The power management mechanism is used to save power when performing L3 forwarding.
>
>
> +PM QoS
> +------
> +
> +The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
> +interface lets userspace set and get the resume latency limit on cpuX.
> +Each cpuidle governor in Linux selects which idle state to enter based on
> +this CPU resume latency in its idle task.
> +
> +The deeper the idle state, the lower the power consumption, but the longer
> +the resume time. Some services are latency sensitive and expect a low
> +resume time, for example in interrupt packet receiving mode.
> +
> +Applications can set and get the CPU resume latency with
> +``rte_power_qos_set_cpu_resume_latency()`` and ``rte_power_qos_get_cpu_resume_latency()``
> +respectively. Setting a strict resume latency (zero value) with
> +``rte_power_qos_set_cpu_resume_latency()`` lowers the resume latency and
> +improves performance, at the cost of possibly higher platform power consumption.
> +
> +
> Ethernet PMD Power Management API
> ---------------------------------
>
> diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
> index fa4822d928..d9e268274b 100644
> --- a/doc/guides/rel_notes/release_24_11.rst
> +++ b/doc/guides/rel_notes/release_24_11.rst
> @@ -237,6 +237,11 @@ New Features
> This field is used to pass an extra configuration settings such as ability
> to lookup IPv4 addresses in network byte order.
>
> +* **Introduce per-CPU PM QoS interface.**
> +
> + * Add per-CPU PM QoS interface to lower the resume latency when waking up from
> + idle state.
> +
> * **Added new API to register telemetry endpoint callbacks with private arguments.**
>
> A new ``rte_telemetry_register_cmd_arg`` function is available to pass an opaque value to
> diff --git a/lib/power/meson.build b/lib/power/meson.build
> index 2f0f3d26e9..9b5d3e8315 100644
> --- a/lib/power/meson.build
> +++ b/lib/power/meson.build
> @@ -23,12 +23,14 @@ sources = files(
> 'rte_power.c',
> 'rte_power_uncore.c',
> 'rte_power_pmd_mgmt.c',
> + 'rte_power_qos.c',
> )
> headers = files(
> 'rte_power.h',
> 'rte_power_guest_channel.h',
> 'rte_power_pmd_mgmt.h',
> 'rte_power_uncore.h',
> + 'rte_power_qos.h',
> )
>
> deps += ['timer', 'ethdev']
> diff --git a/lib/power/rte_power_qos.c b/lib/power/rte_power_qos.c
> new file mode 100644
> index 0000000000..09692b2161
> --- /dev/null
> +++ b/lib/power/rte_power_qos.c
> @@ -0,0 +1,123 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2024 HiSilicon Limited
> + */
> +
> +#include <errno.h>
> +#include <stdlib.h>
> +#include <string.h>
> +
> +#include <rte_lcore.h>
> +#include <rte_log.h>
> +
> +#include "power_common.h"
> +#include "rte_power_qos.h"
> +
> +#define PM_QOS_SYSFILE_RESUME_LATENCY_US \
> + "/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us"
> +
> +#define PM_QOS_CPU_RESUME_LATENCY_BUF_LEN 32
> +
> +int
> +rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency)
> +{
> + char buf[PM_QOS_CPU_RESUME_LATENCY_BUF_LEN];
> + uint32_t cpu_id;
> + FILE *f;
> + int ret;
> +
> + if (!rte_lcore_is_enabled(lcore_id)) {
> + POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
> + return -EINVAL;
> + }
> + ret = power_get_lcore_mapped_cpu_id(lcore_id, &cpu_id);
> + if (ret != 0)
> + return ret;
> +
> + if (latency < 0) {
> + POWER_LOG(ERR, "latency should be greater than or equal to 0");
> + return -EINVAL;
> + }
> +
> + ret = open_core_sysfs_file(&f, "w", PM_QOS_SYSFILE_RESUME_LATENCY_US, cpu_id);
> + if (ret != 0) {
> + POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US" : %s",
> + cpu_id, strerror(errno));
> + return ret;
> + }
> +
> + /*
> + * The kernel's pm_qos_resume_latency_us sysfs file (see
> + * @PM_QOS_SYSFILE_RESUME_LATENCY_US) interprets the input string as
> + * follows:
> + * 1> "n/a": the resume latency is 0 (strict).
> + * 2> "0": there is no constraint on the resume latency.
> + * 3> any other value: the resume latency limit in microseconds.
> + */
> + if (latency == 0)
Why not use your own macro,
RTE_POWER_QOS_STRICT_LATENCY_VALUE,
instead of a hard-coded constant here?
> + snprintf(buf, sizeof(buf), "%s", "n/a");
> + else if (latency == RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT)
> + snprintf(buf, sizeof(buf), "%u", 0);
> + else
> + snprintf(buf, sizeof(buf), "%u", latency);
> +
> + ret = write_core_sysfs_s(f, buf);
> + if (ret != 0)
> + POWER_LOG(ERR, "Failed to write "PM_QOS_SYSFILE_RESUME_LATENCY_US" : %s",
> + cpu_id, strerror(errno));
> +
> + fclose(f);
> +
> + return ret;
> +}
> +
> +int
> +rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id)
> +{
> + char buf[PM_QOS_CPU_RESUME_LATENCY_BUF_LEN];
> + int latency = -1;
> + uint32_t cpu_id;
> + FILE *f;
> + int ret;
> +
> + if (!rte_lcore_is_enabled(lcore_id)) {
> + POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
> + return -EINVAL;
> + }
> + ret = power_get_lcore_mapped_cpu_id(lcore_id, &cpu_id);
> + if (ret != 0)
> + return ret;
> +
> + ret = open_core_sysfs_file(&f, "r", PM_QOS_SYSFILE_RESUME_LATENCY_US, cpu_id);
> + if (ret != 0) {
> + POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US" : %s",
> + cpu_id, strerror(errno));
> + return ret;
> + }
> +
> + ret = read_core_sysfs_s(f, buf, sizeof(buf));
> + if (ret != 0) {
> + POWER_LOG(ERR, "Failed to read "PM_QOS_SYSFILE_RESUME_LATENCY_US" : %s",
> + cpu_id, strerror(errno));
> + goto out;
> + }
> +
> + /*
> + * The kernel's pm_qos_resume_latency_us sysfs file (see
> + * @PM_QOS_SYSFILE_RESUME_LATENCY_US) reports the limit as one of the
> + * following strings:
> + * 1> "n/a": the resume latency is 0 (strict).
> + * 2> "0": there is no constraint on the resume latency.
> + * 3> any other value: the resume latency limit in use, in microseconds.
> + */
> + if (strcmp(buf, "n/a") == 0)
> + latency = 0;
RTE_POWER_QOS_STRICT_LATENCY_VALUE
?
> + else {
> + latency = strtoul(buf, NULL, 10);
> + latency = latency == 0 ? RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT : latency;
> + }
> +
> +out:
> + fclose(f);
> +
> + return latency != -1 ? latency : ret;
> +}
> diff --git a/lib/power/rte_power_qos.h b/lib/power/rte_power_qos.h
> new file mode 100644
> index 0000000000..990c488373
> --- /dev/null
> +++ b/lib/power/rte_power_qos.h
> @@ -0,0 +1,73 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2024 HiSilicon Limited
> + */
> +
> +#ifndef RTE_POWER_QOS_H
> +#define RTE_POWER_QOS_H
> +
> +#include <stdint.h>
> +
> +#include <rte_compat.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/**
> + * @file rte_power_qos.h
> + *
> + * PM QoS API.
> + *
> + * The per-CPU resume latency limit affects this CPU's idle state selection
> + * in each cpuidle governor.
> + * See the kernel documentation of the per-CPU PM QoS interface:
> + * https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
> + *
> + * The deeper the idle state, the lower the power consumption, but the
> + * longer the resume time. Some services are delay sensitive and expect a
> + * low resume time, for example in interrupt packet receiving mode.
> + *
> + * In these cases, the per-CPU PM QoS API can be used to control this CPU's
> + * idle state selection: setting a strict resume latency (zero value) limits
> + * the CPU to the shallowest idle state, which lowers the delay after sleep.
> + */
> +
> +#define RTE_POWER_QOS_STRICT_LATENCY_VALUE 0
> +#define RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT ((int)(UINT32_MAX >> 1))
Isn't it just INT32_MAX?
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * @param lcore_id
> + * target logical core id
> + *
> + * @param latency
> + * The latency in microseconds; it should be greater than or equal to zero.
> + *
> + * @return
> + * 0 on success. Otherwise negative value is returned.
> + */
> +__rte_experimental
> +int rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Get the current resume latency of this logical core.
> + * If no limit has been set, the kernel default is
> + * @see RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT.
> + *
> + * @return
> + * Negative value on failure.
> + * >= 0 means the actual resume latency limit on this core.
> + */
> +__rte_experimental
> +int rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* RTE_POWER_QOS_H */
> diff --git a/lib/power/version.map b/lib/power/version.map
> index c9a226614e..08f178a39d 100644
> --- a/lib/power/version.map
> +++ b/lib/power/version.map
> @@ -51,4 +51,8 @@ EXPERIMENTAL {
> rte_power_set_uncore_env;
> rte_power_uncore_freqs;
> rte_power_unset_uncore_env;
> +
> + # added in 24.11
> + rte_power_qos_get_cpu_resume_latency;
> + rte_power_qos_set_cpu_resume_latency;
> };
> --
> 2.22.0
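
For reference, the equivalence asked about above can be checked at compile time. A minimal sketch, assuming a C11 compiler and an int of at least 32 bits; the macro name NO_CONSTRAINT_AS_POSTED is illustrative only and is not part of the patch:

```c
#include <assert.h>
#include <stdint.h>

/* The expression used in the patch, copied under a local, illustrative name. */
#define NO_CONSTRAINT_AS_POSTED ((int)(UINT32_MAX >> 1))

/* Compile-time check: the build fails if the two spellings ever differ. */
static_assert(NO_CONSTRAINT_AS_POSTED == INT32_MAX,
    "(int)(UINT32_MAX >> 1) is expected to equal INT32_MAX");
```

UINT32_MAX is 0xFFFFFFFF, so shifting it right by one bit gives 0x7FFFFFFF, which is the value of INT32_MAX.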
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v11 1/2] power: introduce PM QoS API on CPU wide
2024-10-22 9:08 ` Konstantin Ananyev
@ 2024-10-22 9:41 ` lihuisong (C)
0 siblings, 0 replies; 114+ messages in thread
From: lihuisong (C) @ 2024-10-22 9:41 UTC (permalink / raw)
To: Konstantin Ananyev, dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, Fengchengwen,
liuyonglong
On 2024/10/22 17:08, Konstantin Ananyev wrote:
>
>> The deeper the idle state, the lower the power consumption, but the longer
>> the resume time. Some services are delay sensitive and expect a low resume
>> time, for example in interrupt packet receiving mode.
>>
>> The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
>> interface lets userspace set and get the resume latency limit on cpuX.
>> Each cpuidle governor in Linux selects which idle state to enter based on
>> this CPU resume latency in its idle task.
>>
>> The per-CPU PM QoS API can be used to control this CPU's idle state
>> selection: setting a strict resume latency (zero value) limits the CPU to
>> the shallowest idle state, which lowers the delay when waking up.
>>
>> Signed-off-by: Huisong Li <lihuisong@huawei.com>
>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> LGTM overall, few nits, see below.
> Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
>
>> ---
>> doc/guides/prog_guide/power_man.rst | 19 ++++
>> doc/guides/rel_notes/release_24_11.rst | 5 +
>> lib/power/meson.build | 2 +
>> lib/power/rte_power_qos.c | 123 +++++++++++++++++++++++++
>> lib/power/rte_power_qos.h | 73 +++++++++++++++
>> lib/power/version.map | 4 +
>> 6 files changed, 226 insertions(+)
>> create mode 100644 lib/power/rte_power_qos.c
>> create mode 100644 lib/power/rte_power_qos.h
>>
>> diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
>> index f6674efe2d..91358b04f3 100644
>> --- a/doc/guides/prog_guide/power_man.rst
>> +++ b/doc/guides/prog_guide/power_man.rst
>> @@ -107,6 +107,25 @@ User Cases
>> The power management mechanism is used to save power when performing L3 forwarding.
>>
>>
>> +PM QoS
>> +------
>> +
>> +The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
>> +interface lets userspace set and get the resume latency limit on cpuX.
>> +Each cpuidle governor in Linux selects which idle state to enter based on
>> +this CPU resume latency in its idle task.
>> +
>> +The deeper the idle state, the lower the power consumption, but the longer
>> +the resume time. Some services are latency sensitive and expect a low
>> +resume time, for example in interrupt packet receiving mode.
>> +
>> +Applications can set and get the CPU resume latency with
>> +``rte_power_qos_set_cpu_resume_latency()`` and ``rte_power_qos_get_cpu_resume_latency()``
>> +respectively. Setting a strict resume latency (zero value) with
>> +``rte_power_qos_set_cpu_resume_latency()`` lowers the resume latency and
>> +improves performance, at the cost of possibly higher platform power consumption.
>> +
>> +
>> Ethernet PMD Power Management API
>> ---------------------------------
>>
>> diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
>> index fa4822d928..d9e268274b 100644
>> --- a/doc/guides/rel_notes/release_24_11.rst
>> +++ b/doc/guides/rel_notes/release_24_11.rst
>> @@ -237,6 +237,11 @@ New Features
>> This field is used to pass an extra configuration settings such as ability
>> to lookup IPv4 addresses in network byte order.
>>
>> +* **Introduce per-CPU PM QoS interface.**
>> +
>> + * Add per-CPU PM QoS interface to lower the resume latency when waking up from
>> + idle state.
>> +
>> * **Added new API to register telemetry endpoint callbacks with private arguments.**
>>
>> A new ``rte_telemetry_register_cmd_arg`` function is available to pass an opaque value to
>> diff --git a/lib/power/meson.build b/lib/power/meson.build
>> index 2f0f3d26e9..9b5d3e8315 100644
>> --- a/lib/power/meson.build
>> +++ b/lib/power/meson.build
>> @@ -23,12 +23,14 @@ sources = files(
>> 'rte_power.c',
>> 'rte_power_uncore.c',
>> 'rte_power_pmd_mgmt.c',
>> + 'rte_power_qos.c',
>> )
>> headers = files(
>> 'rte_power.h',
>> 'rte_power_guest_channel.h',
>> 'rte_power_pmd_mgmt.h',
>> 'rte_power_uncore.h',
>> + 'rte_power_qos.h',
>> )
>>
>> deps += ['timer', 'ethdev']
>> diff --git a/lib/power/rte_power_qos.c b/lib/power/rte_power_qos.c
>> new file mode 100644
>> index 0000000000..09692b2161
>> --- /dev/null
>> +++ b/lib/power/rte_power_qos.c
>> @@ -0,0 +1,123 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2024 HiSilicon Limited
>> + */
>> +
>> +#include <errno.h>
>> +#include <stdlib.h>
>> +#include <string.h>
>> +
>> +#include <rte_lcore.h>
>> +#include <rte_log.h>
>> +
>> +#include "power_common.h"
>> +#include "rte_power_qos.h"
>> +
>> +#define PM_QOS_SYSFILE_RESUME_LATENCY_US \
>> + "/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us"
>> +
>> +#define PM_QOS_CPU_RESUME_LATENCY_BUF_LEN 32
>> +
>> +int
>> +rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency)
>> +{
>> + char buf[PM_QOS_CPU_RESUME_LATENCY_BUF_LEN];
>> + uint32_t cpu_id;
>> + FILE *f;
>> + int ret;
>> +
>> + if (!rte_lcore_is_enabled(lcore_id)) {
>> + POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
>> + return -EINVAL;
>> + }
>> + ret = power_get_lcore_mapped_cpu_id(lcore_id, &cpu_id);
>> + if (ret != 0)
>> + return ret;
>> +
>> + if (latency < 0) {
>> + POWER_LOG(ERR, "latency should be greater than or equal to 0");
>> + return -EINVAL;
>> + }
>> +
>> + ret = open_core_sysfs_file(&f, "w", PM_QOS_SYSFILE_RESUME_LATENCY_US, cpu_id);
>> + if (ret != 0) {
>> + POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US" : %s",
>> + cpu_id, strerror(errno));
>> + return ret;
>> + }
>> +
>> + /*
>> + * The kernel's pm_qos_resume_latency_us sysfs file (see
>> + * @PM_QOS_SYSFILE_RESUME_LATENCY_US) interprets the input string as
>> + * follows:
>> + * 1> "n/a": the resume latency is 0 (strict).
>> + * 2> "0": there is no constraint on the resume latency.
>> + * 3> any other value: the resume latency limit in microseconds.
>> + */
>> + if (latency == 0)
>
> Why not use your own macro,
> RTE_POWER_QOS_STRICT_LATENCY_VALUE,
> instead of a hard-coded constant here?
You are right. I will fix it in the next version.
>
>> + snprintf(buf, sizeof(buf), "%s", "n/a");
>> + else if (latency == RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT)
>> + snprintf(buf, sizeof(buf), "%u", 0);
>> + else
>> + snprintf(buf, sizeof(buf), "%u", latency);
>> +
>> + ret = write_core_sysfs_s(f, buf);
>> + if (ret != 0)
>> + POWER_LOG(ERR, "Failed to write "PM_QOS_SYSFILE_RESUME_LATENCY_US" : %s",
>> + cpu_id, strerror(errno));
>> +
>> + fclose(f);
>> +
>> + return ret;
>> +}
>> +
>> +int
>> +rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id)
>> +{
>> + char buf[PM_QOS_CPU_RESUME_LATENCY_BUF_LEN];
>> + int latency = -1;
>> + uint32_t cpu_id;
>> + FILE *f;
>> + int ret;
>> +
>> + if (!rte_lcore_is_enabled(lcore_id)) {
>> + POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
>> + return -EINVAL;
>> + }
>> + ret = power_get_lcore_mapped_cpu_id(lcore_id, &cpu_id);
>> + if (ret != 0)
>> + return ret;
>> +
>> + ret = open_core_sysfs_file(&f, "r", PM_QOS_SYSFILE_RESUME_LATENCY_US, cpu_id);
>> + if (ret != 0) {
>> + POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US" : %s",
>> + cpu_id, strerror(errno));
>> + return ret;
>> + }
>> +
>> + ret = read_core_sysfs_s(f, buf, sizeof(buf));
>> + if (ret != 0) {
>> + POWER_LOG(ERR, "Failed to read "PM_QOS_SYSFILE_RESUME_LATENCY_US" : %s",
>> + cpu_id, strerror(errno));
>> + goto out;
>> + }
>> +
>> + /*
>> + * The kernel's pm_qos_resume_latency_us sysfs file (see
>> + * @PM_QOS_SYSFILE_RESUME_LATENCY_US) reports the limit as one of the
>> + * following strings:
>> + * 1> "n/a": the resume latency is 0 (strict).
>> + * 2> "0": there is no constraint on the resume latency.
>> + * 3> any other value: the resume latency limit in use, in microseconds.
>> + */
>> + if (strcmp(buf, "n/a") == 0)
>> + latency = 0;
>
> RTE_POWER_QOS_STRICT_LATENCY_VALUE
Ack
> ?
>
>> + else {
>> + latency = strtoul(buf, NULL, 10);
>> + latency = latency == 0 ? RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT : latency;
>> + }
>> +
>> +out:
>> + fclose(f);
>> +
>> + return latency != -1 ? latency : ret;
>> +}
>> diff --git a/lib/power/rte_power_qos.h b/lib/power/rte_power_qos.h
>> new file mode 100644
>> index 0000000000..990c488373
>> --- /dev/null
>> +++ b/lib/power/rte_power_qos.h
>> @@ -0,0 +1,73 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2024 HiSilicon Limited
>> + */
>> +
>> +#ifndef RTE_POWER_QOS_H
>> +#define RTE_POWER_QOS_H
>> +
>> +#include <stdint.h>
>> +
>> +#include <rte_compat.h>
>> +
>> +#ifdef __cplusplus
>> +extern "C" {
>> +#endif
>> +
>> +/**
>> + * @file rte_power_qos.h
>> + *
>> + * PM QoS API.
>> + *
>> + * The per-CPU resume latency limit affects this CPU's idle state selection
>> + * in each cpuidle governor.
>> + * See the kernel documentation of the per-CPU PM QoS interface:
>> + * https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
>> + *
>> + * The deeper the idle state, the lower the power consumption, but the
>> + * longer the resume time. Some services are delay sensitive and expect a
>> + * low resume time, for example in interrupt packet receiving mode.
>> + *
>> + * In these cases, the per-CPU PM QoS API can be used to control this CPU's
>> + * idle state selection: setting a strict resume latency (zero value) limits
>> + * the CPU to the shallowest idle state, which lowers the delay after sleep.
>> + */
>> +
>> +#define RTE_POWER_QOS_STRICT_LATENCY_VALUE 0
>> +#define RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT ((int)(UINT32_MAX >> 1))
> Isn't it just INT32_MAX?
will fix it.
>
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice.
>> + *
>> + * @param lcore_id
>> + * target logical core id
>> + *
>> + * @param latency
>> + * The latency in microseconds; it should be greater than or equal to zero.
>> + *
>> + * @return
>> + * 0 on success. Otherwise negative value is returned.
>> + */
>> +__rte_experimental
>> +int rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency);
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice.
>> + *
>> + * Get the current resume latency of this logical core.
>> + * If no limit has been set, the kernel default is
>> + * @see RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT.
>> + *
>> + * @return
>> + * Negative value on failure.
>> + * >= 0 means the actual resume latency limit on this core.
>> + */
>> +__rte_experimental
>> +int rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id);
>> +
>> +#ifdef __cplusplus
>> +}
>> +#endif
>> +
>> +#endif /* RTE_POWER_QOS_H */
>> diff --git a/lib/power/version.map b/lib/power/version.map
>> index c9a226614e..08f178a39d 100644
>> --- a/lib/power/version.map
>> +++ b/lib/power/version.map
>> @@ -51,4 +51,8 @@ EXPERIMENTAL {
>> rte_power_set_uncore_env;
>> rte_power_uncore_freqs;
>> rte_power_unset_uncore_env;
>> +
>> + # added in 24.11
>> + rte_power_qos_get_cpu_resume_latency;
>> + rte_power_qos_set_cpu_resume_latency;
>> };
>> --
>> 2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH v11 2/2] examples/l3fwd-power: add PM QoS configuration
2024-10-21 11:42 ` [PATCH v11 " Huisong Li
2024-10-21 11:42 ` [PATCH v11 1/2] power: introduce PM QoS API on CPU wide Huisong Li
@ 2024-10-21 11:42 ` Huisong Li
2024-10-22 9:10 ` Konstantin Ananyev
1 sibling, 1 reply; 114+ messages in thread
From: Huisong Li @ 2024-10-21 11:42 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, konstantin.ananyev, david.marchand,
fengchengwen, liuyonglong, lihuisong
Add PM QoS configuration to decrease the delay after sleep in case of
entering a deeper idle state.
Signed-off-by: Huisong Li <lihuisong@huawei.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
examples/l3fwd-power/main.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 2bb6b092c3..b0ddb54ee2 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
#include <rte_telemetry.h>
#include <rte_power_pmd_mgmt.h>
#include <rte_power_uncore.h>
+#include <rte_power_qos.h>
#include "perf_core.h"
#include "main.h"
@@ -2260,6 +2261,22 @@ init_power_library(void)
return -1;
}
}
+
+ RTE_LCORE_FOREACH(lcore_id) {
+ /*
+ * Set a strict latency limit on each lcore so that its CPU only
+ * enters the shallowest idle state.
+ */
+ ret = rte_power_qos_set_cpu_resume_latency(lcore_id,
+ RTE_POWER_QOS_STRICT_LATENCY_VALUE);
+ if (ret != 0) {
+ RTE_LOG(ERR, L3FWD_POWER,
+ "Failed to set strict resume latency on core%u.\n",
+ lcore_id);
+ return ret;
+ }
+ }
+
return ret;
}
@@ -2299,6 +2316,13 @@ deinit_power_library(void)
}
}
}
+
+ RTE_LCORE_FOREACH(lcore_id) {
+ /* Restore the original value in kernel. */
+ rte_power_qos_set_cpu_resume_latency(lcore_id,
+ RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT);
+ }
+
return ret;
}
--
2.22.0
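
The patch above restores RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT on exit. If an application instead wanted to put back whatever limit was in effect before it started, one option is to read and save the limit first. The sketch below is only an illustration of that pattern: stub_get_latency() and stub_set_latency() are hypothetical in-memory stand-ins for the two rte_power_qos_* calls, and MAX_LCORES, fake_limit and saved_limit are made-up names, so the code runs without DPDK or sysfs access.

```c
#include <stdint.h>

#define MAX_LCORES 4             /* hypothetical, small for illustration */
#define STRICT_LATENCY 0         /* mirrors RTE_POWER_QOS_STRICT_LATENCY_VALUE */
#define NO_CONSTRAINT INT32_MAX  /* mirrors RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT */

/* In-memory stand-in for the per-CPU sysfs files. */
int fake_limit[MAX_LCORES] = { NO_CONSTRAINT, 50, NO_CONSTRAINT, 200 };

/* Hypothetical stub shaped like rte_power_qos_get_cpu_resume_latency(). */
int
stub_get_latency(uint16_t lcore_id)
{
    return lcore_id < MAX_LCORES ? fake_limit[lcore_id] : -1;
}

/* Hypothetical stub shaped like rte_power_qos_set_cpu_resume_latency(). */
int
stub_set_latency(uint16_t lcore_id, int latency)
{
    if (lcore_id >= MAX_LCORES || latency < 0)
        return -1;
    fake_limit[lcore_id] = latency;
    return 0;
}

/* Limits that were in effect before init; -1 marks "nothing saved". */
int saved_limit[MAX_LCORES] = { -1, -1, -1, -1 };

/* Save each lcore's current limit, then request strict latency. */
int
qos_init(void)
{
    uint16_t id;

    for (id = 0; id < MAX_LCORES; id++) {
        int cur = stub_get_latency(id);

        if (cur < 0 || stub_set_latency(id, STRICT_LATENCY) != 0)
            return -1;
        saved_limit[id] = cur;
    }
    return 0;
}

/* Put back whatever was saved, skipping lcores that were never changed. */
void
qos_deinit(void)
{
    uint16_t id;

    for (id = 0; id < MAX_LCORES; id++) {
        if (saved_limit[id] < 0)
            continue;
        stub_set_latency(id, saved_limit[id]);
        saved_limit[id] = -1;
    }
}
```

Whether saving and restoring is preferable to writing the kernel default depends on whether other software on the system is expected to manage the same sysfs files.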
^ permalink raw reply [flat|nested] 114+ messages in thread
* RE: [PATCH v11 2/2] examples/l3fwd-power: add PM QoS configuration
2024-10-21 11:42 ` [PATCH v11 2/2] examples/l3fwd-power: add PM QoS configuration Huisong Li
@ 2024-10-22 9:10 ` Konstantin Ananyev
2024-10-22 9:44 ` lihuisong (C)
2024-10-23 6:27 ` lihuisong (C)
0 siblings, 2 replies; 114+ messages in thread
From: Konstantin Ananyev @ 2024-10-22 9:10 UTC (permalink / raw)
To: lihuisong (C), dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, Fengchengwen,
liuyonglong
>
> Add PM QoS configuration to decrease the delay after sleep in case of
> entering a deeper idle state.
I still think it is worth mentioning this behavior change somewhere in the docs,
probably in the release notes or the sample app guide.
>
> Signed-off-by: Huisong Li <lihuisong@huawei.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> ---
> examples/l3fwd-power/main.c | 24 ++++++++++++++++++++++++
> 1 file changed, 24 insertions(+)
>
> diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
> index 2bb6b092c3..b0ddb54ee2 100644
> --- a/examples/l3fwd-power/main.c
> +++ b/examples/l3fwd-power/main.c
> @@ -47,6 +47,7 @@
> #include <rte_telemetry.h>
> #include <rte_power_pmd_mgmt.h>
> #include <rte_power_uncore.h>
> +#include <rte_power_qos.h>
>
> #include "perf_core.h"
> #include "main.h"
> @@ -2260,6 +2261,22 @@ init_power_library(void)
> return -1;
> }
> }
> +
> + RTE_LCORE_FOREACH(lcore_id) {
> + /*
> + * Set a strict latency limit on each lcore so that its CPU only
> + * enters the shallowest idle state.
> + */
> + ret = rte_power_qos_set_cpu_resume_latency(lcore_id,
> + RTE_POWER_QOS_STRICT_LATENCY_VALUE);
> + if (ret != 0) {
> + RTE_LOG(ERR, L3FWD_POWER,
> + "Failed to set strict resume latency on core%u.\n",
> + lcore_id);
> + return ret;
> + }
> + }
> +
> return ret;
> }
>
> @@ -2299,6 +2316,13 @@ deinit_power_library(void)
> }
> }
> }
> +
> + RTE_LCORE_FOREACH(lcore_id) {
> + /* Restore the original value in kernel. */
> + rte_power_qos_set_cpu_resume_latency(lcore_id,
> + RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT);
> + }
> +
> return ret;
> }
>
> --
> 2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v11 2/2] examples/l3fwd-power: add PM QoS configuration
2024-10-22 9:10 ` Konstantin Ananyev
@ 2024-10-22 9:44 ` lihuisong (C)
2024-10-22 12:15 ` Konstantin Ananyev
2024-10-23 6:27 ` lihuisong (C)
1 sibling, 1 reply; 114+ messages in thread
From: lihuisong (C) @ 2024-10-22 9:44 UTC (permalink / raw)
To: Konstantin Ananyev, dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, Fengchengwen,
liuyonglong
On 2024/10/22 17:10, Konstantin Ananyev wrote:
>
>> Add PM QoS configuration to decrease the delay after sleep in case of
>> entering a deeper idle state.
> I still think it is worth mentioning this behavior change somewhere in the docs,
> probably in the release notes or the sample app guide.
I already described this effect in power_man.rst.
OK, I will also add some comments about it to l3_forward_power_man.rst.
How about putting these comments at the end of "Overview"?
>> Signed-off-by: Huisong Li <lihuisong@huawei.com>
>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>> ---
>> examples/l3fwd-power/main.c | 24 ++++++++++++++++++++++++
>> 1 file changed, 24 insertions(+)
>>
>> diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
>> index 2bb6b092c3..b0ddb54ee2 100644
>> --- a/examples/l3fwd-power/main.c
>> +++ b/examples/l3fwd-power/main.c
>> @@ -47,6 +47,7 @@
>> #include <rte_telemetry.h>
>> #include <rte_power_pmd_mgmt.h>
>> #include <rte_power_uncore.h>
>> +#include <rte_power_qos.h>
>>
>> #include "perf_core.h"
>> #include "main.h"
>> @@ -2260,6 +2261,22 @@ init_power_library(void)
>> return -1;
>> }
>> }
>> +
>> + RTE_LCORE_FOREACH(lcore_id) {
>> + /*
>> + * Set a strict latency limit on each lcore so that its CPU only
>> + * enters the shallowest idle state.
>> + */
>> + ret = rte_power_qos_set_cpu_resume_latency(lcore_id,
>> + RTE_POWER_QOS_STRICT_LATENCY_VALUE);
>> + if (ret != 0) {
>> + RTE_LOG(ERR, L3FWD_POWER,
>> + "Failed to set strict resume latency on core%u.\n",
>> + lcore_id);
>> + return ret;
>> + }
>> + }
>> +
>> return ret;
>> }
>>
>> @@ -2299,6 +2316,13 @@ deinit_power_library(void)
>> }
>> }
>> }
>> +
>> + RTE_LCORE_FOREACH(lcore_id) {
>> + /* Restore the original value in kernel. */
>> + rte_power_qos_set_cpu_resume_latency(lcore_id,
>> + RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT);
>> + }
>> +
>> return ret;
>> }
>>
>> --
>> 2.22.0
* RE: [PATCH v11 2/2] examples/l3fwd-power: add PM QoS configuration
2024-10-22 9:44 ` lihuisong (C)
@ 2024-10-22 12:15 ` Konstantin Ananyev
0 siblings, 0 replies; 114+ messages in thread
From: Konstantin Ananyev @ 2024-10-22 12:15 UTC (permalink / raw)
To: lihuisong (C), dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, Fengchengwen,
liuyonglong
> >
> >> Add PM QoS configuration to decrease the delay after sleep in case of
> >> entering deeper idle state.
> > I still think it is worth mentioning this behavior change somewhere in the docs.
> > Probably release_notes or sample app guides.
> I have already described this effect in power_man.rst.
> OK, I will also add some comments about it to l3_forward_power_man.rst.
> How about putting these comments at the end of "Overview"?
Works for me.
Konstantin
> >> Signed-off-by: Huisong Li <lihuisong@huawei.com>
> >> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> >> ---
> >> examples/l3fwd-power/main.c | 24 ++++++++++++++++++++++++
> >> 1 file changed, 24 insertions(+)
> >>
> >> diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
> >> index 2bb6b092c3..b0ddb54ee2 100644
> >> --- a/examples/l3fwd-power/main.c
> >> +++ b/examples/l3fwd-power/main.c
> >> @@ -47,6 +47,7 @@
> >> #include <rte_telemetry.h>
> >> #include <rte_power_pmd_mgmt.h>
> >> #include <rte_power_uncore.h>
> >> +#include <rte_power_qos.h>
> >>
> >> #include "perf_core.h"
> >> #include "main.h"
> >> @@ -2260,6 +2261,22 @@ init_power_library(void)
> >> return -1;
> >> }
> >> }
> >> +
> >> + RTE_LCORE_FOREACH(lcore_id) {
> >> + /*
> >> + * Set the worker lcore's to have strict latency limit to allow
> >> + * the CPU to enter the shallowest idle state.
> >> + */
> >> + ret = rte_power_qos_set_cpu_resume_latency(lcore_id,
> >> + RTE_POWER_QOS_STRICT_LATENCY_VALUE);
> >> + if (ret != 0) {
> >> + RTE_LOG(ERR, L3FWD_POWER,
> >> + "Failed to set strict resume latency on core%u.\n",
> >> + lcore_id);
> >> + return ret;
> >> + }
> >> + }
> >> +
> >> return ret;
> >> }
> >>
> >> @@ -2299,6 +2316,13 @@ deinit_power_library(void)
> >> }
> >> }
> >> }
> >> +
> >> + RTE_LCORE_FOREACH(lcore_id) {
> >> + /* Restore the original value in kernel. */
> >> + rte_power_qos_set_cpu_resume_latency(lcore_id,
> >> + RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT);
> >> + }
> >> +
> >> return ret;
> >> }
> >>
> >> --
> >> 2.22.0
* Re: [PATCH v11 2/2] examples/l3fwd-power: add PM QoS configuration
2024-10-22 9:10 ` Konstantin Ananyev
2024-10-22 9:44 ` lihuisong (C)
@ 2024-10-23 6:27 ` lihuisong (C)
1 sibling, 0 replies; 114+ messages in thread
From: lihuisong (C) @ 2024-10-23 6:27 UTC (permalink / raw)
To: Konstantin Ananyev, dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, Fengchengwen,
liuyonglong
Hi Konstantin,
On 2024/10/22 17:10, Konstantin Ananyev wrote:
>
>> Add PM QoS configuration to decrease the delay after sleep in case of
>> entering deeper idle state.
> I still think it is worth mentioning this behavior change somewhere in the docs.
> Probably release_notes or sample app guides.
I see that the "Overview" chapter of l3_forward_power_man.rst already
describes that the application may enter deeper C-states.
So I feel it is better to add a new command-line parameter for it, as you
suggested.
Please have a look again, thanks.😁
>> Signed-off-by: Huisong Li <lihuisong@huawei.com>
>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>> ---
>> examples/l3fwd-power/main.c | 24 ++++++++++++++++++++++++
>> 1 file changed, 24 insertions(+)
>>
>> diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
>> index 2bb6b092c3..b0ddb54ee2 100644
>> --- a/examples/l3fwd-power/main.c
>> +++ b/examples/l3fwd-power/main.c
>> @@ -47,6 +47,7 @@
>> #include <rte_telemetry.h>
>> #include <rte_power_pmd_mgmt.h>
>> #include <rte_power_uncore.h>
>> +#include <rte_power_qos.h>
<...>
>> }
>>
>> --
>> 2.22.0
* [PATCH v12 0/3] power: introduce PM QoS interface
2024-03-20 10:55 [PATCH 0/2] introduce PM QoS interface Huisong Li
` (12 preceding siblings ...)
2024-10-21 11:42 ` [PATCH v11 " Huisong Li
@ 2024-10-23 4:09 ` Huisong Li
2024-10-23 4:09 ` [PATCH v12 1/3] power: introduce PM QoS API on CPU wide Huisong Li
` (2 more replies)
2024-10-25 9:18 ` [PATCH v13 0/3] power: introduce PM QoS interface Huisong Li
2024-10-29 13:28 ` [PATCH v14 0/3] power: introduce PM QoS interface Huisong Li
15 siblings, 3 replies; 114+ messages in thread
From: Huisong Li @ 2024-10-23 4:09 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, konstantin.ananyev, david.marchand,
fengchengwen, liuyonglong, lihuisong
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay sensitive and expect a low
resume time, such as interrupt packet receiving mode.
The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
interface is used by userspace to set and get the resume latency limit on
cpuX. Please see the description in the kernel documentation[1].
Each cpuidle governor in Linux selects which idle state to enter based on
this CPU resume latency in its idle task.
The per-CPU PM QoS API can be used to control a CPU's idle state
selection. Setting a strict resume latency (zero value) restricts the CPU
to the shallowest idle state, which lowers the delay when waking up from
the idle state.
[1] https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
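For reference, a minimal sketch of reading this sysfs file directly from C,
outside of DPDK. The helper names pm_qos_path() and pm_qos_read_raw() are
hypothetical and only used for illustration; the path is the one described
above. The file may be missing or unreadable on some systems, in which case
the read helper simply returns -1.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build the per-CPU sysfs path described above. */
static int
pm_qos_path(char *buf, size_t len, unsigned int cpu_id)
{
	int n = snprintf(buf, len,
		"/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us",
		cpu_id);

	/* Fail if snprintf() reported an error or the path was truncated. */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical helper: read the raw string ("n/a", "0" or a value in us). */
static int
pm_qos_read_raw(unsigned int cpu_id, char *out, size_t len)
{
	char path[128];
	FILE *f;

	if (len == 0 || pm_qos_path(path, sizeof(path), cpu_id) != 0)
		return -1;

	f = fopen(path, "r");
	if (f == NULL)
		return -1; /* file missing or not readable */

	if (fgets(out, (int)len, f) == NULL) {
		fclose(f);
		return -1;
	}
	fclose(f);

	out[strcspn(out, "\n")] = '\0'; /* strip the trailing newline */
	return 0;
}
```

A caller would pass a small buffer (32 bytes is enough for the values the
kernel reports) and interpret the returned string itself.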
---
v12:
- add Acked-by from Chengwen and Konstantin
- fix overflow issue in l3fwd-power when parsing the command line
- add a command parameter to set CPU resume latency
v11:
- operate on the CPU id that the lcore is mapped to, using the new function
power_get_lcore_mapped_cpu_id().
v10:
- replace LINE_MAX with a custom macro and fix two typos.
v9:
- move new feature description from release_24_07.rst to release_24_11.rst.
v8:
- update the latest code to resolve CI warning
v7:
- remove the dead code rte_lcore_is_enabled in patch[2/2]
v6:
- update release_24_07.rst based on dpdk repo to resolve CI warning.
v5:
- use LINE_MAX to replace BUFSIZ, and use snprintf to replace sprintf.
v4:
- fix some comments based on Stephen's review
- add stdint.h include
- add Acked-by Morten Brørup <mb@smartsharesystems.com>
v3:
- add RTE_POWER_xxx prefix for some macro in header
- add the check for lcore_id with rte_lcore_is_enabled
v2:
- use PM QoS on CPU wide to replace the one on system wide
Huisong Li (3):
power: introduce PM QoS API on CPU wide
examples/l3fwd-power: fix data overflow when parse command line
examples/l3fwd-power: add PM QoS configuration
doc/guides/prog_guide/power_man.rst | 19 +++
doc/guides/rel_notes/release_24_11.rst | 5 +
.../sample_app_ug/l3_forward_power_man.rst | 5 +-
examples/l3fwd-power/main.c | 92 +++++++++++--
lib/power/meson.build | 2 +
lib/power/rte_power_qos.c | 123 ++++++++++++++++++
lib/power/rte_power_qos.h | 73 +++++++++++
lib/power/version.map | 4 +
8 files changed, 308 insertions(+), 15 deletions(-)
create mode 100644 lib/power/rte_power_qos.c
create mode 100644 lib/power/rte_power_qos.h
--
2.22.0
* [PATCH v12 1/3] power: introduce PM QoS API on CPU wide
2024-10-23 4:09 ` [PATCH v12 0/3] power: introduce PM QoS interface Huisong Li
@ 2024-10-23 4:09 ` Huisong Li
2024-10-23 4:09 ` [PATCH v12 2/3] examples/l3fwd-power: fix data overflow when parse command line Huisong Li
2024-10-23 4:09 ` [PATCH v12 3/3] examples/l3fwd-power: add PM QoS configuration Huisong Li
2 siblings, 0 replies; 114+ messages in thread
From: Huisong Li @ 2024-10-23 4:09 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, konstantin.ananyev, david.marchand,
fengchengwen, liuyonglong, lihuisong
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay sensitive and expect a low
resume time, such as interrupt packet receiving mode.
The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
interface is used by userspace to set and get the resume latency limit on
cpuX. Each cpuidle governor in Linux selects which idle state to enter
based on this CPU resume latency in its idle task.
The per-CPU PM QoS API can be used to control a CPU's idle state
selection. Setting a strict resume latency (zero value) restricts the CPU
to the shallowest idle state, which lowers the delay when waking up from
the idle state.
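To illustrate the value mapping this patch uses between the API and the
sysfs strings, here is a standalone sketch. The two macros mirror the values
defined in rte_power_qos.h in this patch; the helper names
latency_to_sysfs() and sysfs_to_latency() are hypothetical and are not part
of the library, and the error handling is an assumption of the sketch.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Mirror of the values introduced by this patch. */
#define STRICT_LATENCY_VALUE 0
#define NO_CONSTRAINT INT32_MAX

/* API value -> string written to pm_qos_resume_latency_us. */
static int
latency_to_sysfs(int latency, char *buf, size_t len)
{
	int n;

	if (latency < 0)
		return -1;

	if (latency == STRICT_LATENCY_VALUE)
		n = snprintf(buf, len, "n/a");  /* strict: zero resume latency */
	else if (latency == NO_CONSTRAINT)
		n = snprintf(buf, len, "0");    /* "0" means no constraint */
	else
		n = snprintf(buf, len, "%d", latency);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* String read from pm_qos_resume_latency_us -> API value, -1 on bad input. */
static int
sysfs_to_latency(const char *s)
{
	char *end = NULL;
	unsigned long val;

	if (strcmp(s, "n/a") == 0)
		return STRICT_LATENCY_VALUE;

	val = strtoul(s, &end, 10);
	if (s[0] == '\0' || *end != '\0' || val > INT32_MAX)
		return -1;

	return val == 0 ? NO_CONSTRAINT : (int)val;
}
```

The inversion is the point to keep in mind: an API value of 0 is written as
"n/a", while the string "0" is reported back as the no-constraint value.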
Signed-off-by: Huisong Li <lihuisong@huawei.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
---
doc/guides/prog_guide/power_man.rst | 19 ++++
doc/guides/rel_notes/release_24_11.rst | 5 +
lib/power/meson.build | 2 +
lib/power/rte_power_qos.c | 123 +++++++++++++++++++++++++
lib/power/rte_power_qos.h | 73 +++++++++++++++
lib/power/version.map | 4 +
6 files changed, 226 insertions(+)
create mode 100644 lib/power/rte_power_qos.c
create mode 100644 lib/power/rte_power_qos.h
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index f6674efe2d..91358b04f3 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -107,6 +107,25 @@ User Cases
The power management mechanism is used to save power when performing L3 forwarding.
+PM QoS
+------
+
+The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
+interface is used by userspace to set and get the resume latency limit on
+cpuX. Each cpuidle governor in Linux selects which idle state to enter
+based on this CPU resume latency in its idle task.
+
+The deeper the idle state, the lower the power consumption, but the longer
+the resume time. Some services are latency sensitive and expect a low
+resume time, such as interrupt packet receiving mode.
+
+Applications can set and get the CPU resume latency with
+``rte_power_qos_set_cpu_resume_latency()`` and ``rte_power_qos_get_cpu_resume_latency()``
+respectively. Applications can set a strict resume latency (zero value) with
+``rte_power_qos_set_cpu_resume_latency()`` to lower the resume latency and
+get better performance (in exchange, the power consumption of the platform may increase).
+
+
Ethernet PMD Power Management API
---------------------------------
diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
index fa4822d928..d9e268274b 100644
--- a/doc/guides/rel_notes/release_24_11.rst
+++ b/doc/guides/rel_notes/release_24_11.rst
@@ -237,6 +237,11 @@ New Features
This field is used to pass an extra configuration settings such as ability
to lookup IPv4 addresses in network byte order.
+* **Introduce per-CPU PM QoS interface.**
+
+ * Add per-CPU PM QoS interface to lower the resume latency when waking up from
+ idle state.
+
* **Added new API to register telemetry endpoint callbacks with private arguments.**
A new ``rte_telemetry_register_cmd_arg`` function is available to pass an opaque value to
diff --git a/lib/power/meson.build b/lib/power/meson.build
index 2f0f3d26e9..9b5d3e8315 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -23,12 +23,14 @@ sources = files(
'rte_power.c',
'rte_power_uncore.c',
'rte_power_pmd_mgmt.c',
+ 'rte_power_qos.c',
)
headers = files(
'rte_power.h',
'rte_power_guest_channel.h',
'rte_power_pmd_mgmt.h',
'rte_power_uncore.h',
+ 'rte_power_qos.h',
)
deps += ['timer', 'ethdev']
diff --git a/lib/power/rte_power_qos.c b/lib/power/rte_power_qos.c
new file mode 100644
index 0000000000..4dd0532b36
--- /dev/null
+++ b/lib/power/rte_power_qos.c
@@ -0,0 +1,123 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#include <errno.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <rte_lcore.h>
+#include <rte_log.h>
+
+#include "power_common.h"
+#include "rte_power_qos.h"
+
+#define PM_QOS_SYSFILE_RESUME_LATENCY_US \
+ "/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us"
+
+#define PM_QOS_CPU_RESUME_LATENCY_BUF_LEN 32
+
+int
+rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency)
+{
+ char buf[PM_QOS_CPU_RESUME_LATENCY_BUF_LEN];
+ uint32_t cpu_id;
+ FILE *f;
+ int ret;
+
+ if (!rte_lcore_is_enabled(lcore_id)) {
+ POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
+ return -EINVAL;
+ }
+ ret = power_get_lcore_mapped_cpu_id(lcore_id, &cpu_id);
+ if (ret != 0)
+ return ret;
+
+ if (latency < 0) {
+ POWER_LOG(ERR, "latency should be greater than or equal to 0");
+ return -EINVAL;
+ }
+
+ ret = open_core_sysfs_file(&f, "w", PM_QOS_SYSFILE_RESUME_LATENCY_US, cpu_id);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US" : %s",
+ cpu_id, strerror(errno));
+ return ret;
+ }
+
+ /*
+ * Based on the sysfs interface pm_qos_resume_latency_us under
+ * @PM_QOS_SYSFILE_RESUME_LATENCY_US directory in kernel, their meaning
+ * is as follows for different input string.
+ * 1> the resume latency is 0 if the input is "n/a".
+ * 2> the resume latency is no constraint if the input is "0".
+ * 3> the resume latency is the actual value to be set.
+ */
+ if (latency == RTE_POWER_QOS_STRICT_LATENCY_VALUE)
+ snprintf(buf, sizeof(buf), "%s", "n/a");
+ else if (latency == RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT)
+ snprintf(buf, sizeof(buf), "%u", 0);
+ else
+ snprintf(buf, sizeof(buf), "%u", latency);
+
+ ret = write_core_sysfs_s(f, buf);
+ if (ret != 0)
+ POWER_LOG(ERR, "Failed to write "PM_QOS_SYSFILE_RESUME_LATENCY_US" : %s",
+ cpu_id, strerror(errno));
+
+ fclose(f);
+
+ return ret;
+}
+
+int
+rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id)
+{
+ char buf[PM_QOS_CPU_RESUME_LATENCY_BUF_LEN];
+ int latency = -1;
+ uint32_t cpu_id;
+ FILE *f;
+ int ret;
+
+ if (!rte_lcore_is_enabled(lcore_id)) {
+ POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
+ return -EINVAL;
+ }
+ ret = power_get_lcore_mapped_cpu_id(lcore_id, &cpu_id);
+ if (ret != 0)
+ return ret;
+
+ ret = open_core_sysfs_file(&f, "r", PM_QOS_SYSFILE_RESUME_LATENCY_US, cpu_id);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US" : %s",
+ cpu_id, strerror(errno));
+ return ret;
+ }
+
+ ret = read_core_sysfs_s(f, buf, sizeof(buf));
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to read "PM_QOS_SYSFILE_RESUME_LATENCY_US" : %s",
+ cpu_id, strerror(errno));
+ goto out;
+ }
+
+ /*
+ * Based on the sysfs interface pm_qos_resume_latency_us under
+ * @PM_QOS_SYSFILE_RESUME_LATENCY_US directory in kernel, their meaning
+ * is as follows for different output string.
+ * 1> the resume latency is 0 if the output is "n/a".
+ * 2> the resume latency is no constraint if the output is "0".
+ * 3> the resume latency is the actual value in used for other string.
+ */
+ if (strcmp(buf, "n/a") == 0)
+ latency = RTE_POWER_QOS_STRICT_LATENCY_VALUE;
+ else {
+ latency = strtoul(buf, NULL, 10);
+ latency = latency == 0 ? RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT : latency;
+ }
+
+out:
+ fclose(f);
+
+ return latency != -1 ? latency : ret;
+}
diff --git a/lib/power/rte_power_qos.h b/lib/power/rte_power_qos.h
new file mode 100644
index 0000000000..7a8dab9272
--- /dev/null
+++ b/lib/power/rte_power_qos.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#ifndef RTE_POWER_QOS_H
+#define RTE_POWER_QOS_H
+
+#include <stdint.h>
+
+#include <rte_compat.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * @file rte_power_qos.h
+ *
+ * PM QoS API.
+ *
+ * The CPU-wide resume latency limit has a positive impact on this CPU's idle
+ * state selection in each cpuidle governor.
+ * Please see the PM QoS on CPU wide in the following link:
+ * https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
+ *
+ * The deeper the idle state, the lower the power consumption, but the
+ * longer the resume time. Some services are delay sensitive and expect a
+ * low resume time, such as interrupt packet receiving mode.
+ *
+ * In these cases, the per-CPU PM QoS API can be used to control this CPU's idle
+ * state selection: setting a strict resume latency (zero value) restricts the
+ * CPU to the shallowest idle state, which lowers the delay after sleep.
+ */
+
+#define RTE_POWER_QOS_STRICT_LATENCY_VALUE 0
+#define RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT INT32_MAX
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * @param lcore_id
+ * target logical core id
+ *
+ * @param latency
+ * The latency, in microseconds, should be greater than or equal to zero.
+ *
+ * @return
+ * 0 on success. Otherwise negative value is returned.
+ */
+__rte_experimental
+int rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Get the current resume latency of this logical core.
+ * The default value in the kernel is @see RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT
+ * if it has not been set.
+ *
+ * @return
+ * Negative value on failure.
+ * >= 0 means the actual resume latency limit on this core.
+ */
+__rte_experimental
+int rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* RTE_POWER_QOS_H */
diff --git a/lib/power/version.map b/lib/power/version.map
index c9a226614e..08f178a39d 100644
--- a/lib/power/version.map
+++ b/lib/power/version.map
@@ -51,4 +51,8 @@ EXPERIMENTAL {
rte_power_set_uncore_env;
rte_power_uncore_freqs;
rte_power_unset_uncore_env;
+
+ # added in 24.11
+ rte_power_qos_get_cpu_resume_latency;
+ rte_power_qos_set_cpu_resume_latency;
};
--
2.22.0
* [PATCH v12 2/3] examples/l3fwd-power: fix data overflow when parse command line
2024-10-23 4:09 ` [PATCH v12 0/3] power: introduce PM QoS interface Huisong Li
2024-10-23 4:09 ` [PATCH v12 1/3] power: introduce PM QoS API on CPU wide Huisong Li
@ 2024-10-23 4:09 ` Huisong Li
2024-10-24 16:34 ` Konstantin Ananyev
2024-10-23 4:09 ` [PATCH v12 3/3] examples/l3fwd-power: add PM QoS configuration Huisong Li
2 siblings, 1 reply; 114+ messages in thread
From: Huisong Li @ 2024-10-23 4:09 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, konstantin.ananyev, david.marchand,
fengchengwen, liuyonglong, lihuisong
Many variables, such as 'pause_duration' and 'scale_freq_min', are
'uint32_t'. They are parsed from the command line with parse_int(), which
returns an 'unsigned long' value as an 'int', so large values can overflow
when this function returns.
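A standalone sketch of the range-checked approach taken by this patch.
parse_u32() below follows the shape of the new parse_uint32() but is a
hypothetical, simplified copy without logging; the extra errno check is an
assumption added here so that the sketch also rejects out-of-range input on
platforms where 'unsigned long' is 32 bits wide.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified, hypothetical copy of the parser; returns 0 on success. */
static int
parse_u32(const char *opt, uint32_t *res)
{
	char *end = NULL;
	unsigned long val;

	errno = 0;
	val = strtoul(opt, &end, 10);
	if (opt[0] == '\0' || end == NULL || *end != '\0')
		return -1; /* empty string or trailing garbage */

	/* Reject values that do not fit instead of silently truncating them. */
	if (errno == ERANGE || val > UINT32_MAX)
		return -1;

	*res = (uint32_t)val;
	return 0;
}
```

Returning the status separately from the parsed value is what lets the
caller tell a valid large value apart from an error.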
Fixes: 59f2853c4cae ("examples/l3fwd_power: add configuration options")
Cc: stable@dpdk.org
Signed-off-by: Huisong Li <lihuisong@huawei.com>
---
examples/l3fwd-power/main.c | 36 ++++++++++++++++--------------------
1 file changed, 16 insertions(+), 20 deletions(-)
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 2bb6b092c3..0ce4aa04d4 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -1525,7 +1525,7 @@ print_usage(const char *prgname)
}
static int
-parse_int(const char *opt)
+parse_uint32(const char *opt, uint32_t *res)
{
char *end = NULL;
unsigned long val;
@@ -1535,23 +1535,14 @@ parse_int(const char *opt)
if ((opt[0] == '\0') || (end == NULL) || (*end != '\0'))
return -1;
- return val;
-}
-
-static int parse_max_pkt_len(const char *pktlen)
-{
- char *end = NULL;
- unsigned long len;
-
- /* parse decimal string */
- len = strtoul(pktlen, &end, 10);
- if ((pktlen[0] == '\0') || (end == NULL) || (*end != '\0'))
+ if (val > UINT32_MAX) {
+ RTE_LOG(ERR, L3FWD_POWER, "parameter shouldn't exceed %u.\n", UINT32_MAX);
return -1;
+ }
- if (len == 0)
- return -1;
+ *res = val;
- return len;
+ return 0;
}
static int
@@ -1898,8 +1889,9 @@ parse_args(int argc, char **argv)
if (!strncmp(lgopts[option_index].name,
CMD_LINE_OPT_MAX_PKT_LEN,
sizeof(CMD_LINE_OPT_MAX_PKT_LEN))) {
+ if (parse_uint32(optarg, &max_pkt_len) != 0)
+ return -1;
printf("Custom frame size is configured\n");
- max_pkt_len = parse_max_pkt_len(optarg);
}
if (!strncmp(lgopts[option_index].name,
@@ -1912,29 +1904,33 @@ parse_args(int argc, char **argv)
if (!strncmp(lgopts[option_index].name,
CMD_LINE_OPT_MAX_EMPTY_POLLS,
sizeof(CMD_LINE_OPT_MAX_EMPTY_POLLS))) {
+ if (parse_uint32(optarg, &max_empty_polls) != 0)
+ return -1;
printf("Maximum empty polls configured\n");
- max_empty_polls = parse_int(optarg);
}
if (!strncmp(lgopts[option_index].name,
CMD_LINE_OPT_PAUSE_DURATION,
sizeof(CMD_LINE_OPT_PAUSE_DURATION))) {
+ if (parse_uint32(optarg, &pause_duration) != 0)
+ return -1;
printf("Pause duration configured\n");
- pause_duration = parse_int(optarg);
}
if (!strncmp(lgopts[option_index].name,
CMD_LINE_OPT_SCALE_FREQ_MIN,
sizeof(CMD_LINE_OPT_SCALE_FREQ_MIN))) {
+ if (parse_uint32(optarg, &scale_freq_min) != 0)
+ return -1;
printf("Scaling frequency minimum configured\n");
- scale_freq_min = parse_int(optarg);
}
if (!strncmp(lgopts[option_index].name,
CMD_LINE_OPT_SCALE_FREQ_MAX,
sizeof(CMD_LINE_OPT_SCALE_FREQ_MAX))) {
+ if (parse_uint32(optarg, &scale_freq_max) != 0)
+ return -1;
printf("Scaling frequency maximum configured\n");
- scale_freq_max = parse_int(optarg);
}
break;
--
2.22.0
* RE: [PATCH v12 2/3] examples/l3fwd-power: fix data overflow when parse command line
2024-10-23 4:09 ` [PATCH v12 2/3] examples/l3fwd-power: fix data overflow when parse command line Huisong Li
@ 2024-10-24 16:34 ` Konstantin Ananyev
0 siblings, 0 replies; 114+ messages in thread
From: Konstantin Ananyev @ 2024-10-24 16:34 UTC (permalink / raw)
To: lihuisong (C), dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, Fengchengwen,
liuyonglong
> Many variables, such as 'pause_duration' and 'scale_freq_min', are
> 'uint32_t'. They are parsed from the command line with parse_int(), which
> returns an 'unsigned long' value as an 'int', so large values can overflow
> when this function returns.
>
> Fixes: 59f2853c4cae ("examples/l3fwd_power: add configuration options")
> Cc: stable@dpdk.org
>
> Signed-off-by: Huisong Li <lihuisong@huawei.com>
> ---
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> 2.22.0
* [PATCH v12 3/3] examples/l3fwd-power: add PM QoS configuration
2024-10-23 4:09 ` [PATCH v12 0/3] power: introduce PM QoS interface Huisong Li
2024-10-23 4:09 ` [PATCH v12 1/3] power: introduce PM QoS API on CPU wide Huisong Li
2024-10-23 4:09 ` [PATCH v12 2/3] examples/l3fwd-power: fix data overflow when parse command line Huisong Li
@ 2024-10-23 4:09 ` Huisong Li
2024-10-24 16:44 ` Konstantin Ananyev
2 siblings, 1 reply; 114+ messages in thread
From: Huisong Li @ 2024-10-23 4:09 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, konstantin.ananyev, david.marchand,
fengchengwen, liuyonglong, lihuisong
The '--cpu-resume-latency' option can be used to control C-state selection.
Setting the CPU resume latency to 0 restricts the CPU to the C0 state,
which improves performance but may also increase the power consumption of
the platform.
Signed-off-by: Huisong Li <lihuisong@huawei.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
---
.../sample_app_ug/l3_forward_power_man.rst | 5 +-
examples/l3fwd-power/main.c | 68 +++++++++++++++++++
2 files changed, 72 insertions(+), 1 deletion(-)
diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index 9c9684fea7..70fa83669a 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -67,7 +67,8 @@ based on the speculative sleep duration of the core.
In this application, we introduce a heuristic algorithm that allows packet processing cores to sleep for a short period
if there is no Rx packet received on recent polls.
In this way, CPUIdle automatically forces the corresponding cores to enter deeper C-states
-instead of always running to the C0 state waiting for packets.
+instead of always running to the C0 state waiting for packets. However, users can set the CPU resume latency to control C-state selection.
+Setting the CPU resume latency to 0 restricts the CPU to the C0 state to improve performance, which may increase the power consumption of the platform.
.. note::
@@ -105,6 +106,8 @@ where,
* --config (port,queue,lcore)[,(port,queue,lcore)]: determines which queues from which ports are mapped to which cores.
+* --cpu-resume-latency LATENCY: set CPU resume latency to control C-state selection, 0 : just allow to enter C0-state.
+
* --max-pkt-len: optional, maximum packet length in decimal (64-9600)
* --no-numa: optional, disables numa awareness
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 0ce4aa04d4..e58f4e301c 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
#include <rte_telemetry.h>
#include <rte_power_pmd_mgmt.h>
#include <rte_power_uncore.h>
+#include <rte_power_qos.h>
#include "perf_core.h"
#include "main.h"
@@ -265,6 +266,9 @@ static uint32_t pause_duration = 1;
static uint32_t scale_freq_min;
static uint32_t scale_freq_max;
+static int cpu_resume_latency;
+static bool pm_qos_en;
+
static struct rte_mempool * pktmbuf_pool[NB_SOCKETS];
@@ -1501,6 +1505,8 @@ print_usage(const char *prgname)
" -U: set min/max frequency for uncore to maximum value\n"
" -i (frequency index): set min/max frequency for uncore to specified frequency index\n"
" --config (port,queue,lcore): rx queues configuration\n"
+ " --cpu-resume-latency LATENCY: set CPU resume latency to control C-state selection,"
+ " 0 : just allow to enter C0-state\n"
" --high-perf-cores CORELIST: list of high performance cores\n"
" --perf-config: similar as config, cores specified as indices"
" for bins containing high or regular performance cores\n"
@@ -1545,6 +1551,28 @@ parse_uint32(const char *opt, uint32_t *res)
return 0;
}
+static int
+parse_int(const char *opt, int *res)
+{
+ char *end = NULL;
+ signed long val;
+
+ /* parse integer string */
+ val = strtol(opt, &end, 10);
+ if ((opt[0] == '\0') || (end == NULL) || (*end != '\0'))
+ return -1;
+
+ if (val < INT_MIN || val > INT_MAX) {
+ RTE_LOG(ERR, L3FWD_POWER, "parameter should be range from %d to %d.\n",
+ INT_MIN, INT_MAX);
+ return -1;
+ }
+
+ *res = val;
+
+ return 0;
+}
+
static int
parse_uncore_options(enum uncore_choice choice, const char *argument)
{
@@ -1734,6 +1762,7 @@ parse_pmd_mgmt_config(const char *name)
#define CMD_LINE_OPT_PAUSE_DURATION "pause-duration"
#define CMD_LINE_OPT_SCALE_FREQ_MIN "scale-freq-min"
#define CMD_LINE_OPT_SCALE_FREQ_MAX "scale-freq-max"
+#define CMD_LINE_OPT_CPU_RESUME_LATENCY "cpu-resume-latency"
/* Parse the argument given in the command line of the application */
static int
@@ -1748,6 +1777,7 @@ parse_args(int argc, char **argv)
{"perf-config", 1, 0, 0},
{"high-perf-cores", 1, 0, 0},
{"no-numa", 0, 0, 0},
+ {CMD_LINE_OPT_CPU_RESUME_LATENCY, 1, 0, 0},
{CMD_LINE_OPT_MAX_PKT_LEN, 1, 0, 0},
{CMD_LINE_OPT_PARSE_PTYPE, 0, 0, 0},
{CMD_LINE_OPT_LEGACY, 0, 0, 0},
@@ -1933,6 +1963,15 @@ parse_args(int argc, char **argv)
printf("Scaling frequency maximum configured\n");
}
+ if (!strncmp(lgopts[option_index].name,
+ CMD_LINE_OPT_CPU_RESUME_LATENCY,
+ sizeof(CMD_LINE_OPT_CPU_RESUME_LATENCY))) {
+ if (parse_int(optarg, &cpu_resume_latency) != 0)
+ return -1;
+ printf("PM QoS configured\n");
+ pm_qos_en = true;
+ }
+
break;
default:
@@ -2256,6 +2295,26 @@ init_power_library(void)
return -1;
}
}
+
+ if (pm_qos_en) {
+ RTE_LCORE_FOREACH(lcore_id) {
+ /*
+ * Set the cpu resume latency of the worker lcore based
+ * on user's request. If set strict latency (0), just
+ * allow the CPU to enter the shallowest idle state to
+ * improve performance.
+ */
+ ret = rte_power_qos_set_cpu_resume_latency(lcore_id,
+ cpu_resume_latency);
+ if (ret != 0) {
+ RTE_LOG(ERR, L3FWD_POWER,
+ "Failed to set cpu resume latency on lcore-%u.\n",
+ lcore_id);
+ return ret;
+ }
+ }
+ }
+
return ret;
}
@@ -2295,6 +2354,15 @@ deinit_power_library(void)
}
}
}
+
+ if (pm_qos_en) {
+ RTE_LCORE_FOREACH(lcore_id) {
+ /* Restore the original value in kernel. */
+ rte_power_qos_set_cpu_resume_latency(lcore_id,
+ RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT);
+ }
+ }
+
return ret;
}
--
2.22.0
* RE: [PATCH v12 3/3] examples/l3fwd-power: add PM QoS configuration
2024-10-23 4:09 ` [PATCH v12 3/3] examples/l3fwd-power: add PM QoS configuration Huisong Li
@ 2024-10-24 16:44 ` Konstantin Ananyev
2024-10-25 3:35 ` lihuisong (C)
0 siblings, 1 reply; 114+ messages in thread
From: Konstantin Ananyev @ 2024-10-24 16:44 UTC (permalink / raw)
To: lihuisong (C), dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, Fengchengwen,
liuyonglong
> The '--cpu-resume-latency' option can be used to control C-state selection.
> Setting the CPU resume latency to 0 restricts the CPU to the C0 state,
> which improves performance but may also increase the power consumption of
> the platform.
>
> Signed-off-by: Huisong Li <lihuisong@huawei.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> Acked-by: Chengwen Feng <fengchengwen@huawei.com>
> ---
> .../sample_app_ug/l3_forward_power_man.rst | 5 +-
> examples/l3fwd-power/main.c | 68 +++++++++++++++++++
> 2 files changed, 72 insertions(+), 1 deletion(-)
>
> diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
> index 9c9684fea7..70fa83669a 100644
> --- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
> +++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
> @@ -67,7 +67,8 @@ based on the speculative sleep duration of the core.
> In this application, we introduce a heuristic algorithm that allows packet processing cores to sleep for a short period
> if there is no Rx packet received on recent polls.
> In this way, CPUIdle automatically forces the corresponding cores to enter deeper C-states
> -instead of always running to the C0 state waiting for packets.
> +instead of always running to the C0 state waiting for packets. However, users can set the CPU resume latency to control C-state selection.
> +Setting the CPU resume latency to 0 restricts the CPU to the C0 state to improve performance, which may increase the power consumption of the platform.
>
> .. note::
>
> @@ -105,6 +106,8 @@ where,
>
> * --config (port,queue,lcore)[,(port,queue,lcore)]: determines which queues from which ports are mapped to which cores.
>
> +* --cpu-resume-latency LATENCY: set CPU resume latency to control C-state selection, 0 : just allow to enter C0-state.
> +
> * --max-pkt-len: optional, maximum packet length in decimal (64-9600)
>
> * --no-numa: optional, disables numa awareness
> diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
> index 0ce4aa04d4..e58f4e301c 100644
> --- a/examples/l3fwd-power/main.c
> +++ b/examples/l3fwd-power/main.c
> @@ -47,6 +47,7 @@
> #include <rte_telemetry.h>
> #include <rte_power_pmd_mgmt.h>
> #include <rte_power_uncore.h>
> +#include <rte_power_qos.h>
>
> #include "perf_core.h"
> #include "main.h"
> @@ -265,6 +266,9 @@ static uint32_t pause_duration = 1;
> static uint32_t scale_freq_min;
> static uint32_t scale_freq_max;
>
> +static int cpu_resume_latency;
> +static bool pm_qos_en;
> +
> static struct rte_mempool * pktmbuf_pool[NB_SOCKETS];
>
>
> @@ -1501,6 +1505,8 @@ print_usage(const char *prgname)
> " -U: set min/max frequency for uncore to maximum value\n"
> " -i (frequency index): set min/max frequency for uncore to specified frequency index\n"
> " --config (port,queue,lcore): rx queues configuration\n"
> + " --cpu-resume-latency LATENCY: set CPU resume latency to control C-state selection,"
> + " 0 : just allow to enter C0-state\n"
> " --high-perf-cores CORELIST: list of high performance cores\n"
> " --perf-config: similar as config, cores specified as indices"
> " for bins containing high or regular performance cores\n"
> @@ -1545,6 +1551,28 @@ parse_uint32(const char *opt, uint32_t *res)
> return 0;
> }
>
> +static int
> +parse_int(const char *opt, int *res)
> +{
> + char *end = NULL;
> + signed long val;
> +
> + /* parse integer string */
> + val = strtol(opt, &end, 10);
> + if ((opt[0] == '\0') || (end == NULL) || (*end != '\0'))
> + return -1;
> +
> + if (val < INT_MIN || val > INT_MAX) {
> + RTE_LOG(ERR, L3FWD_POWER, "parameter should be range from %d to %d.\n",
> + INT_MIN, INT_MAX);
> + return -1;
> + }
> +
> + *res = val;
> +
> + return 0;
> +}
> +
> static int
> parse_uncore_options(enum uncore_choice choice, const char *argument)
> {
> @@ -1734,6 +1762,7 @@ parse_pmd_mgmt_config(const char *name)
> #define CMD_LINE_OPT_PAUSE_DURATION "pause-duration"
> #define CMD_LINE_OPT_SCALE_FREQ_MIN "scale-freq-min"
> #define CMD_LINE_OPT_SCALE_FREQ_MAX "scale-freq-max"
> +#define CMD_LINE_OPT_CPU_RESUME_LATENCY "cpu-resume-latency"
>
> /* Parse the argument given in the command line of the application */
> static int
> @@ -1748,6 +1777,7 @@ parse_args(int argc, char **argv)
> {"perf-config", 1, 0, 0},
> {"high-perf-cores", 1, 0, 0},
> {"no-numa", 0, 0, 0},
> + {CMD_LINE_OPT_CPU_RESUME_LATENCY, 1, 0, 0},
> {CMD_LINE_OPT_MAX_PKT_LEN, 1, 0, 0},
> {CMD_LINE_OPT_PARSE_PTYPE, 0, 0, 0},
> {CMD_LINE_OPT_LEGACY, 0, 0, 0},
> @@ -1933,6 +1963,15 @@ parse_args(int argc, char **argv)
> printf("Scaling frequency maximum configured\n");
> }
>
> + if (!strncmp(lgopts[option_index].name,
> + CMD_LINE_OPT_CPU_RESUME_LATENCY,
> + sizeof(CMD_LINE_OPT_CPU_RESUME_LATENCY))) {
> + if (parse_int(optarg, &cpu_resume_latency) != 0)
Do you really need to support negative values for that variable?
> + return -1;
> + printf("PM QoS configured\n");
> + pm_qos_en = true;
> + }
> +
> break;
>
> default:
> @@ -2256,6 +2295,26 @@ init_power_library(void)
> return -1;
> }
> }
> +
> + if (pm_qos_en) {
> + RTE_LCORE_FOREACH(lcore_id) {
> + /*
> + * Set the cpu resume latency of the worker lcore based
> + * on user's request. If set strict latency (0), just
> + * allow the CPU to enter the shallowest idle state to
> + * improve performance.
> + */
> + ret = rte_power_qos_set_cpu_resume_latency(lcore_id,
> + cpu_resume_latency);
> + if (ret != 0) {
> + RTE_LOG(ERR, L3FWD_POWER,
> + "Failed to set cpu resume latency on lcore-%u.\n",
> + lcore_id);
> + return ret;
> + }
> + }
> + }
> +
> return ret;
> }
>
> @@ -2295,6 +2354,15 @@ deinit_power_library(void)
> }
> }
> }
> +
> + if (pm_qos_en) {
> + RTE_LCORE_FOREACH(lcore_id) {
> + /* Restore the original value in kernel. */
> + rte_power_qos_set_cpu_resume_latency(lcore_id,
> + RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT);
If we are going to 'restore original' shouldn't we:
At startup old_value=get()
At termination: set(old_value)
?
> + }
> + }
> +
> return ret;
> }
>
> --
> 2.22.0
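To illustrate the save/restore suggestion made on deinit_power_library() above, a minimal sketch could look like the following. It is self-contained and only shows the pattern: qos_get(), qos_set(), MAX_LCORES and the fake_sysfs array are hypothetical stand-ins for rte_power_qos_get_cpu_resume_latency(), rte_power_qos_set_cpu_resume_latency() and the kernel's per-CPU sysfs value. It is not the actual l3fwd-power change.

```c
#include <stdint.h>

#define MAX_LCORES 4                 /* hypothetical, for this sketch only */
#define NO_CONSTRAINT INT32_MAX      /* mirrors the "no constraint" value */

/* Stand-in for the per-CPU value held by the kernel (assumption). */
static int fake_sysfs[MAX_LCORES] = {NO_CONSTRAINT, 50, NO_CONSTRAINT, 100};

/* Hypothetical getter: negative on failure, otherwise the current limit. */
static int
qos_get(unsigned int lcore)
{
	return lcore < MAX_LCORES ? fake_sysfs[lcore] : -1;
}

/* Hypothetical setter: 0 on success, negative on failure. */
static int
qos_set(unsigned int lcore, int latency)
{
	if (lcore >= MAX_LCORES || latency < 0)
		return -1;
	fake_sysfs[lcore] = latency;
	return 0;
}

/* Value found on each lcore at startup, restored at termination. */
static int saved_latency[MAX_LCORES];

/* At startup: old_value = get(), then apply the user's request. */
static int
qos_init(int requested)
{
	unsigned int lcore;
	int old;

	for (lcore = 0; lcore < MAX_LCORES; lcore++) {
		old = qos_get(lcore);
		if (old < 0)
			return old;
		saved_latency[lcore] = old;
		if (qos_set(lcore, requested) != 0)
			return -1;
	}
	return 0;
}

/* At termination: set(old_value) instead of forcing "no constraint". */
static void
qos_deinit(void)
{
	unsigned int lcore;

	for (lcore = 0; lcore < MAX_LCORES; lcore++)
		qos_set(lcore, saved_latency[lcore]);
}
```

With this shape, a core that already had a limit configured before the application started gets that limit back on exit rather than being reset to the kernel default.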
* Re: [PATCH v12 3/3] examples/l3fwd-power: add PM QoS configuration
2024-10-24 16:44 ` Konstantin Ananyev
@ 2024-10-25 3:35 ` lihuisong (C)
0 siblings, 0 replies; 114+ messages in thread
From: lihuisong (C) @ 2024-10-25 3:35 UTC (permalink / raw)
To: Konstantin Ananyev, dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, Fengchengwen,
liuyonglong
On 2024/10/25 0:44, Konstantin Ananyev wrote:
>
>> The '--cpu-resume-latency' can use to control C-state selection.
>> Setting the CPU resume latency to 0 can limit the CPU just to enter
>> C0-state to improve performance, which also may increase the power
>> consumption of platform.
>>
>> Signed-off-by: Huisong Li <lihuisong@huawei.com>
>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>> Acked-by: Chengwen Feng <fengchengwen@huawei.com>
>> ---
>> .../sample_app_ug/l3_forward_power_man.rst | 5 +-
>> examples/l3fwd-power/main.c | 68 +++++++++++++++++++
>> 2 files changed, 72 insertions(+), 1 deletion(-)
>>
>> diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
>> index 9c9684fea7..70fa83669a 100644
>> --- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
>> +++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
>> @@ -67,7 +67,8 @@ based on the speculative sleep duration of the core.
>> In this application, we introduce a heuristic algorithm that allows packet processing cores to sleep for a short period
>> if there is no Rx packet received on recent polls.
>> In this way, CPUIdle automatically forces the corresponding cores to enter deeper C-states
>> -instead of always running to the C0 state waiting for packets.
>> +instead of always running to the C0 state waiting for packets. But user can set the CPU resume latency to control C-state selection.
>> +Setting the CPU resume latency to 0 can limit the CPU just to enter C0-state to improve performance, which may increase power
>> consumption of platform.
>>
>> .. note::
>>
>> @@ -105,6 +106,8 @@ where,
>>
>> * --config (port,queue,lcore)[,(port,queue,lcore)]: determines which queues from which ports are mapped to which cores.
>>
>> +* --cpu-resume-latency LATENCY: set CPU resume latency to control C-state selection, 0 : just allow to enter C0-state.
>> +
>> * --max-pkt-len: optional, maximum packet length in decimal (64-9600)
>>
>> * --no-numa: optional, disables numa awareness
>> diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
>> index 0ce4aa04d4..e58f4e301c 100644
>> --- a/examples/l3fwd-power/main.c
>> +++ b/examples/l3fwd-power/main.c
>> @@ -47,6 +47,7 @@
>> #include <rte_telemetry.h>
>> #include <rte_power_pmd_mgmt.h>
>> #include <rte_power_uncore.h>
>> +#include <rte_power_qos.h>
>>
>> #include "perf_core.h"
>> #include "main.h"
>> @@ -265,6 +266,9 @@ static uint32_t pause_duration = 1;
>> static uint32_t scale_freq_min;
>> static uint32_t scale_freq_max;
>>
>> +static int cpu_resume_latency;
>> +static bool pm_qos_en;
>> +
>> static struct rte_mempool * pktmbuf_pool[NB_SOCKETS];
>>
>>
>> @@ -1501,6 +1505,8 @@ print_usage(const char *prgname)
>> " -U: set min/max frequency for uncore to maximum value\n"
>> " -i (frequency index): set min/max frequency for uncore to specified frequency index\n"
>> " --config (port,queue,lcore): rx queues configuration\n"
>> + " --cpu-resume-latency LATENCY: set CPU resume latency to control C-state selection,"
>> + " 0 : just allow to enter C0-state\n"
>> " --high-perf-cores CORELIST: list of high performance cores\n"
>> " --perf-config: similar as config, cores specified as indices"
>> " for bins containing high or regular performance cores\n"
>> @@ -1545,6 +1551,28 @@ parse_uint32(const char *opt, uint32_t *res)
>> return 0;
>> }
>>
>> +static int
>> +parse_int(const char *opt, int *res)
>> +{
>> + char *end = NULL;
>> + signed long val;
>> +
>> + /* parse integer string */
>> + val = strtol(opt, &end, 10);
>> + if ((opt[0] == '\0') || (end == NULL) || (*end != '\0'))
>> + return -1;
>> +
>> + if (val < INT_MIN || val > INT_MAX) {
>> + RTE_LOG(ERR, L3FWD_POWER, "parameter should be range from %d to %d.\n",
>> + INT_MIN, INT_MAX);
>> + return -1;
>> + }
>> +
>> + *res = val;
>> +
>> + return 0;
>> +}
>> +
>> static int
>> parse_uncore_options(enum uncore_choice choice, const char *argument)
>> {
>> @@ -1734,6 +1762,7 @@ parse_pmd_mgmt_config(const char *name)
>> #define CMD_LINE_OPT_PAUSE_DURATION "pause-duration"
>> #define CMD_LINE_OPT_SCALE_FREQ_MIN "scale-freq-min"
>> #define CMD_LINE_OPT_SCALE_FREQ_MAX "scale-freq-max"
>> +#define CMD_LINE_OPT_CPU_RESUME_LATENCY "cpu-resume-latency"
>>
>> /* Parse the argument given in the command line of the application */
>> static int
>> @@ -1748,6 +1777,7 @@ parse_args(int argc, char **argv)
>> {"perf-config", 1, 0, 0},
>> {"high-perf-cores", 1, 0, 0},
>> {"no-numa", 0, 0, 0},
>> + {CMD_LINE_OPT_CPU_RESUME_LATENCY, 1, 0, 0},
>> {CMD_LINE_OPT_MAX_PKT_LEN, 1, 0, 0},
>> {CMD_LINE_OPT_PARSE_PTYPE, 0, 0, 0},
>> {CMD_LINE_OPT_LEGACY, 0, 0, 0},
>> @@ -1933,6 +1963,15 @@ parse_args(int argc, char **argv)
>> printf("Scaling frequency maximum configured\n");
>> }
>>
>> + if (!strncmp(lgopts[option_index].name,
>> + CMD_LINE_OPT_CPU_RESUME_LATENCY,
>> + sizeof(CMD_LINE_OPT_CPU_RESUME_LATENCY))) {
>> + if (parse_int(optarg, &cpu_resume_latency) != 0)
> Do you really need to support negative values for that variable?
will fix it.
>
>> + return -1;
>> + printf("PM QoS configured\n");
>> + pm_qos_en = true;
>> + }
>> +
>> break;
>>
>> default:
>> @@ -2256,6 +2295,26 @@ init_power_library(void)
>> return -1;
>> }
>> }
>> +
>> + if (pm_qos_en) {
>> + RTE_LCORE_FOREACH(lcore_id) {
>> + /*
>> + * Set the cpu resume latency of the worker lcore based
>> + * on user's request. If set strict latency (0), just
>> + * allow the CPU to enter the shallowest idle state to
>> + * improve performance.
>> + */
>> + ret = rte_power_qos_set_cpu_resume_latency(lcore_id,
>> + cpu_resume_latency);
>> + if (ret != 0) {
>> + RTE_LOG(ERR, L3FWD_POWER,
>> + "Failed to set cpu resume latency on lcore-%u.\n",
>> + lcore_id);
>> + return ret;
>> + }
>> + }
>> + }
>> +
>> return ret;
>> }
>>
>> @@ -2295,6 +2354,15 @@ deinit_power_library(void)
>> }
>> }
>> }
>> +
>> + if (pm_qos_en) {
>> + RTE_LCORE_FOREACH(lcore_id) {
>> + /* Restore the original value in kernel. */
>> + rte_power_qos_set_cpu_resume_latency(lcore_id,
>> + RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT);
> If we are going to 'restore original' shouldn't we:
> At startup old_value=get()
> At termination: set(old_value)
> ?
Ack
>
>> + }
>> + }
>> +
>> return ret;
>> }
>>
>> --
>> 2.22.0
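For the negative-value comment above, one possible way to reject negative input at parse time is sketched below. parse_nonneg_int() is a hypothetical name used for illustration only. The next version of the patch may solve this differently, for example by reusing an existing unsigned parser.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a decimal string into a non-negative int.
 * Returns 0 on success and -1 on empty input, trailing garbage,
 * overflow, or a negative value.
 */
static int
parse_nonneg_int(const char *opt, int *res)
{
	char *end = NULL;
	long val;

	if (opt == NULL || opt[0] == '\0')
		return -1;

	errno = 0;
	val = strtol(opt, &end, 10);
	/* errno catches out-of-range input; end catches non-numeric input. */
	if (errno != 0 || end == opt || *end != '\0')
		return -1;

	if (val < 0 || val > INT_MAX)
		return -1;

	*res = (int)val;
	return 0;
}
```

Checking errno after strtol() also covers strings too large for a long, which the range comparison alone cannot detect.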
* [PATCH v13 0/3] power: introduce PM QoS interface
2024-03-20 10:55 [PATCH 0/2] introduce PM QoS interface Huisong Li
` (13 preceding siblings ...)
2024-10-23 4:09 ` [PATCH v12 0/3] power: introduce PM QoS interface Huisong Li
@ 2024-10-25 9:18 ` Huisong Li
2024-10-25 9:18 ` [PATCH v13 1/3] power: introduce PM QoS API on CPU wide Huisong Li
` (2 more replies)
2024-10-29 13:28 ` [PATCH v14 0/3] power: introduce PM QoS interface Huisong Li
15 siblings, 3 replies; 114+ messages in thread
From: Huisong Li @ 2024-10-25 9:18 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, konstantin.ananyev, david.marchand,
fengchengwen, liuyonglong, lihuisong
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay sensitive and expect a low
resume time, for example in interrupt packet receiving mode.
The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
interface is used by userspace to set and get the resume latency limit on
cpuX. Please see the description in the kernel documentation [1].
Each cpuidle governor in Linux selects which idle state to enter based on
this CPU resume latency in its idle task.
The per-CPU PM QoS API can be used to control a CPU's idle state
selection. Setting a strict resume latency (zero value) limits the CPU to
the shallowest idle state, which lowers the delay when waking up from idle.
[1] https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
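As a rough illustration of how the API values relate to the strings in the sysfs file, the sketch below follows the mapping described in the code comments of patch 1/3: "n/a" stands for a resume latency of zero, "0" stands for no constraint, and any other number is the limit in microseconds. The helper names and the QOS_* macros are hypothetical and exist only for this example. The real implementation is in lib/power/rte_power_qos.c.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Values mirroring the macros proposed in this series (assumption). */
#define QOS_STRICT_LATENCY 0
#define QOS_NO_CONSTRAINT  INT32_MAX

/* Hypothetical helper: API latency value -> string to write to sysfs. */
static int
latency_to_sysfs(int latency, char *buf, size_t len)
{
	int n;

	if (latency < 0)
		return -1;

	if (latency == QOS_STRICT_LATENCY)
		n = snprintf(buf, len, "n/a");
	else if (latency == QOS_NO_CONSTRAINT)
		n = snprintf(buf, len, "0");
	else
		n = snprintf(buf, len, "%d", latency);

	/* Treat truncation as an error. */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical helper: string read from sysfs -> API latency value. */
static int
sysfs_to_latency(const char *s)
{
	char *end = NULL;
	unsigned long val;

	/* The file content may end with a newline, so compare a prefix. */
	if (strncmp(s, "n/a", 3) == 0)
		return QOS_STRICT_LATENCY;

	errno = 0;
	val = strtoul(s, &end, 10);
	if (end == s || errno != 0 || val > INT32_MAX)
		return -1;

	return val == 0 ? QOS_NO_CONSTRAINT : (int)val;
}
```

The inversion between "0" in the file and zero in the API is the main thing to keep in mind: an application asking for the strictest limit passes 0, and the library writes "n/a".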
---
v13:
- do not allow negative values for --cpu-resume-latency.
- restore to the original value as Konstantin suggested.
v12:
- add Acked-by Chengwen and Konstantin
- fix overflow issue in l3fwd-power when parse command line
- add a command parameter to set CPU resume latency
v11:
- operate on the CPU id the lcore is mapped to, obtained by the new function
power_get_lcore_mapped_cpu_id().
v10:
- replace LINE_MAX with a custom macro and fix two typos.
v9:
- move new feature description from release_24_07.rst to release_24_11.rst.
v8:
- update the latest code to resolve CI warning
v7:
- remove the dead code rte_lcore_is_enabled in patch[2/2]
v6:
- update release_24_07.rst based on dpdk repo to resolve CI warning.
v5:
- use LINE_MAX to replace BUFSIZ, and use snprintf to replace sprintf.
v4:
- fix some comments based on Stephen's review
- add stdint.h include
- add Acked-by Morten Brørup <mb@smartsharesystems.com>
v3:
- add RTE_POWER_xxx prefix for some macro in header
- add the check for lcore_id with rte_lcore_is_enabled
v2:
- use PM QoS on CPU wide to replace the one on system wide
Huisong Li (3):
power: introduce PM QoS API on CPU wide
examples/l3fwd-power: fix data overflow when parse command line
examples/l3fwd-power: add PM QoS configuration
doc/guides/prog_guide/power_man.rst | 19 +++
doc/guides/rel_notes/release_24_11.rst | 5 +
.../sample_app_ug/l3_forward_power_man.rst | 5 +-
examples/l3fwd-power/main.c | 115 ++++++++++++++--
lib/power/meson.build | 2 +
lib/power/rte_power_qos.c | 123 ++++++++++++++++++
lib/power/rte_power_qos.h | 73 +++++++++++
lib/power/version.map | 4 +
8 files changed, 331 insertions(+), 15 deletions(-)
create mode 100644 lib/power/rte_power_qos.c
create mode 100644 lib/power/rte_power_qos.h
--
2.22.0
* [PATCH v13 1/3] power: introduce PM QoS API on CPU wide
2024-10-25 9:18 ` [PATCH v13 0/3] power: introduce PM QoS interface Huisong Li
@ 2024-10-25 9:18 ` Huisong Li
2024-10-25 12:08 ` Tummala, Sivaprasad
2024-10-25 9:18 ` [PATCH v13 2/3] examples/l3fwd-power: fix data overflow when parse command line Huisong Li
2024-10-25 9:18 ` [PATCH v13 3/3] examples/l3fwd-power: add PM QoS configuration Huisong Li
2 siblings, 1 reply; 114+ messages in thread
From: Huisong Li @ 2024-10-25 9:18 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, konstantin.ananyev, david.marchand,
fengchengwen, liuyonglong, lihuisong
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay sensitive and expect a low
resume time, for example in interrupt packet receiving mode.
The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
interface is used by userspace to set and get the resume latency limit on
cpuX. Each cpuidle governor in Linux selects which idle state to enter
based on this CPU resume latency in its idle task.
The per-CPU PM QoS API can be used to control a CPU's idle state
selection. Setting a strict resume latency (zero value) limits the CPU to
the shallowest idle state, which lowers the delay when waking up from idle.
Signed-off-by: Huisong Li <lihuisong@huawei.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
---
doc/guides/prog_guide/power_man.rst | 19 ++++
doc/guides/rel_notes/release_24_11.rst | 5 +
lib/power/meson.build | 2 +
lib/power/rte_power_qos.c | 123 +++++++++++++++++++++++++
lib/power/rte_power_qos.h | 73 +++++++++++++++
lib/power/version.map | 4 +
6 files changed, 226 insertions(+)
create mode 100644 lib/power/rte_power_qos.c
create mode 100644 lib/power/rte_power_qos.h
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index f6674efe2d..91358b04f3 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -107,6 +107,25 @@ User Cases
The power management mechanism is used to save power when performing L3 forwarding.
+PM QoS
+------
+
+The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
+interface is used by userspace to set and get the resume latency limit on
+cpuX. Each cpuidle governor in Linux selects which idle state to enter
+based on this CPU resume latency in its idle task.
+
+The deeper the idle state, the lower the power consumption, but the longer
+the resume time. Some services are latency sensitive and expect a low
+resume time, for example in interrupt packet receiving mode.
+
+Applications can set and get the CPU resume latency with
+``rte_power_qos_set_cpu_resume_latency()`` and ``rte_power_qos_get_cpu_resume_latency()``
+respectively. Applications can set a strict resume latency (zero value) with
+``rte_power_qos_set_cpu_resume_latency()`` to lower the resume latency and
+get better performance (at the cost of possibly higher platform power consumption).
+
+
Ethernet PMD Power Management API
---------------------------------
diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
index fa4822d928..d9e268274b 100644
--- a/doc/guides/rel_notes/release_24_11.rst
+++ b/doc/guides/rel_notes/release_24_11.rst
@@ -237,6 +237,11 @@ New Features
This field is used to pass an extra configuration settings such as ability
to lookup IPv4 addresses in network byte order.
+* **Introduce per-CPU PM QoS interface.**
+
+ * Add per-CPU PM QoS interface to low the resume latency when wake up from
+ idle state.
+
* **Added new API to register telemetry endpoint callbacks with private arguments.**
A new ``rte_telemetry_register_cmd_arg`` function is available to pass an opaque value to
diff --git a/lib/power/meson.build b/lib/power/meson.build
index 2f0f3d26e9..9b5d3e8315 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -23,12 +23,14 @@ sources = files(
'rte_power.c',
'rte_power_uncore.c',
'rte_power_pmd_mgmt.c',
+ 'rte_power_qos.c',
)
headers = files(
'rte_power.h',
'rte_power_guest_channel.h',
'rte_power_pmd_mgmt.h',
'rte_power_uncore.h',
+ 'rte_power_qos.h',
)
deps += ['timer', 'ethdev']
diff --git a/lib/power/rte_power_qos.c b/lib/power/rte_power_qos.c
new file mode 100644
index 0000000000..4dd0532b36
--- /dev/null
+++ b/lib/power/rte_power_qos.c
@@ -0,0 +1,123 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#include <errno.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <rte_lcore.h>
+#include <rte_log.h>
+
+#include "power_common.h"
+#include "rte_power_qos.h"
+
+#define PM_QOS_SYSFILE_RESUME_LATENCY_US \
+ "/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us"
+
+#define PM_QOS_CPU_RESUME_LATENCY_BUF_LEN 32
+
+int
+rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency)
+{
+ char buf[PM_QOS_CPU_RESUME_LATENCY_BUF_LEN];
+ uint32_t cpu_id;
+ FILE *f;
+ int ret;
+
+ if (!rte_lcore_is_enabled(lcore_id)) {
+ POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
+ return -EINVAL;
+ }
+ ret = power_get_lcore_mapped_cpu_id(lcore_id, &cpu_id);
+ if (ret != 0)
+ return ret;
+
+ if (latency < 0) {
+ POWER_LOG(ERR, "latency should be greater than or equal to 0");
+ return -EINVAL;
+ }
+
+ ret = open_core_sysfs_file(&f, "w", PM_QOS_SYSFILE_RESUME_LATENCY_US, cpu_id);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US" : %s",
+ cpu_id, strerror(errno));
+ return ret;
+ }
+
+ /*
+ * Based on the sysfs interface pm_qos_resume_latency_us under
+ * @PM_QOS_SYSFILE_RESUME_LATENCY_US directory in kernel, their meaning
+ * is as follows for different input string.
+ * 1> the resume latency is 0 if the input is "n/a".
+ * 2> the resume latency is no constraint if the input is "0".
+ * 3> the resume latency is the actual value to be set.
+ */
+ if (latency == RTE_POWER_QOS_STRICT_LATENCY_VALUE)
+ snprintf(buf, sizeof(buf), "%s", "n/a");
+ else if (latency == RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT)
+ snprintf(buf, sizeof(buf), "%u", 0);
+ else
+ snprintf(buf, sizeof(buf), "%u", latency);
+
+ ret = write_core_sysfs_s(f, buf);
+ if (ret != 0)
+ POWER_LOG(ERR, "Failed to write "PM_QOS_SYSFILE_RESUME_LATENCY_US" : %s",
+ cpu_id, strerror(errno));
+
+ fclose(f);
+
+ return ret;
+}
+
+int
+rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id)
+{
+ char buf[PM_QOS_CPU_RESUME_LATENCY_BUF_LEN];
+ int latency = -1;
+ uint32_t cpu_id;
+ FILE *f;
+ int ret;
+
+ if (!rte_lcore_is_enabled(lcore_id)) {
+ POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
+ return -EINVAL;
+ }
+ ret = power_get_lcore_mapped_cpu_id(lcore_id, &cpu_id);
+ if (ret != 0)
+ return ret;
+
+ ret = open_core_sysfs_file(&f, "r", PM_QOS_SYSFILE_RESUME_LATENCY_US, cpu_id);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US" : %s",
+ cpu_id, strerror(errno));
+ return ret;
+ }
+
+ ret = read_core_sysfs_s(f, buf, sizeof(buf));
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to read "PM_QOS_SYSFILE_RESUME_LATENCY_US" : %s",
+ cpu_id, strerror(errno));
+ goto out;
+ }
+
+ /*
+ * Based on the sysfs interface pm_qos_resume_latency_us under
+ * @PM_QOS_SYSFILE_RESUME_LATENCY_US directory in kernel, their meaning
+ * is as follows for different output string.
+ * 1> the resume latency is 0 if the output is "n/a".
+ * 2> the resume latency is no constraint if the output is "0".
+ * 3> the resume latency is the actual value in used for other string.
+ */
+ if (strcmp(buf, "n/a") == 0)
+ latency = RTE_POWER_QOS_STRICT_LATENCY_VALUE;
+ else {
+ latency = strtoul(buf, NULL, 10);
+ latency = latency == 0 ? RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT : latency;
+ }
+
+out:
+ fclose(f);
+
+ return latency != -1 ? latency : ret;
+}
diff --git a/lib/power/rte_power_qos.h b/lib/power/rte_power_qos.h
new file mode 100644
index 0000000000..7a8dab9272
--- /dev/null
+++ b/lib/power/rte_power_qos.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#ifndef RTE_POWER_QOS_H
+#define RTE_POWER_QOS_H
+
+#include <stdint.h>
+
+#include <rte_compat.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * @file rte_power_qos.h
+ *
+ * PM QoS API.
+ *
+ * The CPU-wide resume latency limit has a positive impact on this CPU's idle
+ * state selection in each cpuidle governor.
+ * Please see the PM QoS on CPU wide in the following link:
+ * https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
+ *
+ * The deeper the idle state, the lower the power consumption, but the
+ * longer the resume time. Some service are delay sensitive and very except the
+ * low resume time, like interrupt packet receiving mode.
+ *
+ * In these case, per-CPU PM QoS API can be used to control this CPU's idle
+ * state selection and limit just enter the shallowest idle state to low the
+ * delay after sleep by setting strict resume latency (zero value).
+ */
+
+#define RTE_POWER_QOS_STRICT_LATENCY_VALUE 0
+#define RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT INT32_MAX
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * @param lcore_id
+ * target logical core id
+ *
+ * @param latency
+ * The latency in microseconds. It should be greater than or equal to zero.
+ *
+ * @return
+ * 0 on success. Otherwise negative value is returned.
+ */
+__rte_experimental
+int rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Get the current resume latency of this logical core.
+ * The default value in kernel is @see RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT
+ * if don't set it.
+ *
+ * @return
+ * Negative value on failure.
+ * >= 0 means the actual resume latency limit on this core.
+ */
+__rte_experimental
+int rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* RTE_POWER_QOS_H */
diff --git a/lib/power/version.map b/lib/power/version.map
index c9a226614e..08f178a39d 100644
--- a/lib/power/version.map
+++ b/lib/power/version.map
@@ -51,4 +51,8 @@ EXPERIMENTAL {
rte_power_set_uncore_env;
rte_power_uncore_freqs;
rte_power_unset_uncore_env;
+
+ # added in 24.11
+ rte_power_qos_get_cpu_resume_latency;
+ rte_power_qos_set_cpu_resume_latency;
};
--
2.22.0
* RE: [PATCH v13 1/3] power: introduce PM QoS API on CPU wide
2024-10-25 9:18 ` [PATCH v13 1/3] power: introduce PM QoS API on CPU wide Huisong Li
@ 2024-10-25 12:08 ` Tummala, Sivaprasad
0 siblings, 0 replies; 114+ messages in thread
From: Tummala, Sivaprasad @ 2024-10-25 12:08 UTC (permalink / raw)
To: Huisong Li, dev
Cc: mb, thomas, Yigit, Ferruh, anatoly.burakov, david.hunt, stephen,
konstantin.ananyev, david.marchand, fengchengwen, liuyonglong
Hi Huisong,
LGTM! One comment to update the doxygen documentation for the new APIs.
> -----Original Message-----
> From: Huisong Li <lihuisong@huawei.com>
> Sent: Friday, October 25, 2024 2:49 PM
> To: dev@dpdk.org
> Cc: mb@smartsharesystems.com; thomas@monjalon.net; Yigit, Ferruh
> <Ferruh.Yigit@amd.com>; anatoly.burakov@intel.com; david.hunt@intel.com;
> Tummala, Sivaprasad <Sivaprasad.Tummala@amd.com>;
> stephen@networkplumber.org; konstantin.ananyev@huawei.com;
> david.marchand@redhat.com; fengchengwen@huawei.com;
> liuyonglong@huawei.com; lihuisong@huawei.com
> Subject: [PATCH v13 1/3] power: introduce PM QoS API on CPU wide
>
>
> The deeper the idle state, the lower the power consumption, but the longer the
> resume time. Some service are delay sensitive and very except the low resume
> time, like interrupt packet receiving mode.
>
> And the "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
> interface is used to set and get the resume latency limit on the cpuX for userspace.
> Each cpuidle governor in Linux select which idle state to enter based on this CPU
> resume latency in their idle task.
>
> The per-CPU PM QoS API can be used to control this CPU's idle state selection
> and limit just enter the shallowest idle state to low the delay when wake up from by
> setting strict resume latency (zero value).
>
> Signed-off-by: Huisong Li <lihuisong@huawei.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> Acked-by: Chengwen Feng <fengchengwen@huawei.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> ---
> doc/guides/prog_guide/power_man.rst | 19 ++++
> doc/guides/rel_notes/release_24_11.rst | 5 +
> lib/power/meson.build | 2 +
> lib/power/rte_power_qos.c | 123 +++++++++++++++++++++++++
> lib/power/rte_power_qos.h | 73 +++++++++++++++
> lib/power/version.map | 4 +
> 6 files changed, 226 insertions(+)
> create mode 100644 lib/power/rte_power_qos.c create mode 100644
> lib/power/rte_power_qos.h
>
> diff --git a/doc/guides/prog_guide/power_man.rst
> b/doc/guides/prog_guide/power_man.rst
> index f6674efe2d..91358b04f3 100644
> --- a/doc/guides/prog_guide/power_man.rst
> +++ b/doc/guides/prog_guide/power_man.rst
> @@ -107,6 +107,25 @@ User Cases
> The power management mechanism is used to save power when performing L3
> forwarding.
>
>
> +PM QoS
> +------
> +
> +The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
> +interface is used to set and get the resume latency limit on the cpuX
> +for userspace. Each cpuidle governor in Linux select which idle state
> +to enter based on this CPU resume latency in their idle task.
> +
> +The deeper the idle state, the lower the power consumption, but the
> +longer the resume time. Some service are latency sensitive and very
> +except the low resume time, like interrupt packet receiving mode.
> +
> +Applications can set and get the CPU resume latency by the
> +``rte_power_qos_set_cpu_resume_latency()`` and
> +``rte_power_qos_get_cpu_resume_latency()``
> +respectively. Applications can set a strict resume latency (zero value)
> +by the ``rte_power_qos_set_cpu_resume_latency()`` to low the resume
> +latency and get better performance (instead, the power consumption of platform
> may increase).
> +
> +
> Ethernet PMD Power Management API
> ---------------------------------
>
> diff --git a/doc/guides/rel_notes/release_24_11.rst
> b/doc/guides/rel_notes/release_24_11.rst
> index fa4822d928..d9e268274b 100644
> --- a/doc/guides/rel_notes/release_24_11.rst
> +++ b/doc/guides/rel_notes/release_24_11.rst
> @@ -237,6 +237,11 @@ New Features
> This field is used to pass an extra configuration settings such as ability
> to lookup IPv4 addresses in network byte order.
>
> +* **Introduce per-CPU PM QoS interface.**
> +
> + * Add per-CPU PM QoS interface to low the resume latency when wake up from
> + idle state.
> +
> * **Added new API to register telemetry endpoint callbacks with private
> arguments.**
>
> A new ``rte_telemetry_register_cmd_arg`` function is available to pass an opaque
> value to diff --git a/lib/power/meson.build b/lib/power/meson.build index
> 2f0f3d26e9..9b5d3e8315 100644
> --- a/lib/power/meson.build
> +++ b/lib/power/meson.build
> @@ -23,12 +23,14 @@ sources = files(
> 'rte_power.c',
> 'rte_power_uncore.c',
> 'rte_power_pmd_mgmt.c',
> + 'rte_power_qos.c',
> )
> headers = files(
> 'rte_power.h',
> 'rte_power_guest_channel.h',
> 'rte_power_pmd_mgmt.h',
> 'rte_power_uncore.h',
> + 'rte_power_qos.h',
> )
>
> deps += ['timer', 'ethdev']
> diff --git a/lib/power/rte_power_qos.c b/lib/power/rte_power_qos.c new file mode
> 100644 index 0000000000..4dd0532b36
> --- /dev/null
> +++ b/lib/power/rte_power_qos.c
> @@ -0,0 +1,123 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2024 HiSilicon Limited
> + */
> +
> +#include <errno.h>
> +#include <stdlib.h>
> +#include <string.h>
> +
> +#include <rte_lcore.h>
> +#include <rte_log.h>
> +
> +#include "power_common.h"
> +#include "rte_power_qos.h"
> +
> +#define PM_QOS_SYSFILE_RESUME_LATENCY_US \
> + "/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us"
> +
> +#define PM_QOS_CPU_RESUME_LATENCY_BUF_LEN 32
> +
> +int
> +rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency) {
> + char buf[PM_QOS_CPU_RESUME_LATENCY_BUF_LEN];
> + uint32_t cpu_id;
> + FILE *f;
> + int ret;
> +
> + if (!rte_lcore_is_enabled(lcore_id)) {
> + POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
> + return -EINVAL;
> + }
> + ret = power_get_lcore_mapped_cpu_id(lcore_id, &cpu_id);
> + if (ret != 0)
> + return ret;
> +
> + if (latency < 0) {
> + POWER_LOG(ERR, "latency should be greater than and equal to 0");
> + return -EINVAL;
> + }
> +
> + ret = open_core_sysfs_file(&f, "w",
> PM_QOS_SYSFILE_RESUME_LATENCY_US, cpu_id);
> + if (ret != 0) {
> + POWER_LOG(ERR, "Failed to open
> "PM_QOS_SYSFILE_RESUME_LATENCY_US" : %s",
> + cpu_id, strerror(errno));
> + return ret;
> + }
> +
> + /*
> + * Based on the sysfs interface pm_qos_resume_latency_us under
> + * @PM_QOS_SYSFILE_RESUME_LATENCY_US directory in kernel, their
> meaning
> + * is as follows for different input string.
> + * 1> the resume latency is 0 if the input is "n/a".
> + * 2> the resume latency is no constraint if the input is "0".
> + * 3> the resume latency is the actual value to be set.
> + */
> + if (latency == RTE_POWER_QOS_STRICT_LATENCY_VALUE)
> + snprintf(buf, sizeof(buf), "%s", "n/a");
> + else if (latency ==
> RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT)
> + snprintf(buf, sizeof(buf), "%u", 0);
> + else
> + snprintf(buf, sizeof(buf), "%u", latency);
> +
> + ret = write_core_sysfs_s(f, buf);
> + if (ret != 0)
> + POWER_LOG(ERR, "Failed to write "PM_QOS_SYSFILE_RESUME_LATENCY_US" : %s",
> + cpu_id, strerror(errno));
> +
> + fclose(f);
> +
> + return ret;
> +}
> +
> +int
> +rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id)
> +{
> + char buf[PM_QOS_CPU_RESUME_LATENCY_BUF_LEN];
> + int latency = -1;
> + uint32_t cpu_id;
> + FILE *f;
> + int ret;
> +
> + if (!rte_lcore_is_enabled(lcore_id)) {
> + POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
> + return -EINVAL;
> + }
> + ret = power_get_lcore_mapped_cpu_id(lcore_id, &cpu_id);
> + if (ret != 0)
> + return ret;
> +
> + ret = open_core_sysfs_file(&f, "r", PM_QOS_SYSFILE_RESUME_LATENCY_US, cpu_id);
> + if (ret != 0) {
> + POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US" : %s",
> + cpu_id, strerror(errno));
> + return ret;
> + }
> +
> + ret = read_core_sysfs_s(f, buf, sizeof(buf));
> + if (ret != 0) {
> + POWER_LOG(ERR, "Failed to read "PM_QOS_SYSFILE_RESUME_LATENCY_US" : %s",
> + cpu_id, strerror(errno));
> + goto out;
> + }
> +
> + /*
> + * Based on the sysfs interface pm_qos_resume_latency_us under
> + * @PM_QOS_SYSFILE_RESUME_LATENCY_US directory in kernel, their meaning
> + * is as follows for different output string.
> + * 1> the resume latency is 0 if the output is "n/a".
> + * 2> the resume latency is no constraint if the output is "0".
> + * 3> the resume latency is the actual value in used for other string.
> + */
> + if (strcmp(buf, "n/a") == 0)
> + latency = RTE_POWER_QOS_STRICT_LATENCY_VALUE;
> + else {
> + latency = strtoul(buf, NULL, 10);
> + latency = latency == 0 ? RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT : latency;
> + }
> +
> +out:
> + fclose(f);
> +
> + return latency != -1 ? latency : ret;
> +}
> diff --git a/lib/power/rte_power_qos.h b/lib/power/rte_power_qos.h
> new file mode 100644
> index 0000000000..7a8dab9272
> --- /dev/null
> +++ b/lib/power/rte_power_qos.h
> @@ -0,0 +1,73 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2024 HiSilicon Limited
> + */
> +
> +#ifndef RTE_POWER_QOS_H
> +#define RTE_POWER_QOS_H
> +
> +#include <stdint.h>
> +
> +#include <rte_compat.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/**
> + * @file rte_power_qos.h
> + *
> + * PM QoS API.
> + *
> + * The CPU-wide resume latency limit has a positive impact on this CPU's idle
> + * state selection in each cpuidle governor.
> + * Please see the PM QoS on CPU wide in the following link:
> + * https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
> + *
> + * The deeper the idle state, the lower the power consumption, but the
> + * longer the resume time. Some service are delay sensitive and very except the
> + * low resume time, like interrupt packet receiving mode.
> + *
> + * In these case, per-CPU PM QoS API can be used to control this CPU's idle
> + * state selection and limit just enter the shallowest idle state to low the
> + * delay after sleep by setting strict resume latency (zero value).
> + */
> +
> +#define RTE_POWER_QOS_STRICT_LATENCY_VALUE 0
> +#define RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT INT32_MAX
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * @param lcore_id
> + * target logical core id
> + *
> + * @param latency
> + * The latency should be greater than and equal to zero in microseconds unit.
> + *
> + * @return
> + * 0 on success. Otherwise negative value is returned.
> + */
> +__rte_experimental
> +int rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Get the current resume latency of this logical core.
> + * The default value in kernel is @see RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT
> + * if don't set it.
> + *
> + * @return
> + * Negative value on failure.
> + * >= 0 means the actual resume latency limit on this core.
> + */
> +__rte_experimental
> +int rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* RTE_POWER_QOS_H */
> diff --git a/lib/power/version.map b/lib/power/version.map
> index c9a226614e..08f178a39d 100644
> --- a/lib/power/version.map
> +++ b/lib/power/version.map
> @@ -51,4 +51,8 @@ EXPERIMENTAL {
> rte_power_set_uncore_env;
> rte_power_uncore_freqs;
> rte_power_unset_uncore_env;
> +
> + # added in 24.11
> + rte_power_qos_get_cpu_resume_latency;
> + rte_power_qos_set_cpu_resume_latency;
> };
> --
> 2.22.0
Acked-by: Sivaprasad Tummala <sivaprasad.tummala@amd.com>
^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH v13 2/3] examples/l3fwd-power: fix data overflow when parse command line
2024-10-25 9:18 ` [PATCH v13 0/3] power: introduce PM QoS interface Huisong Li
2024-10-25 9:18 ` [PATCH v13 1/3] power: introduce PM QoS API on CPU wide Huisong Li
@ 2024-10-25 9:18 ` Huisong Li
2024-10-25 11:27 ` fengchengwen
2024-10-29 5:48 ` Tummala, Sivaprasad
2024-10-25 9:18 ` [PATCH v13 3/3] examples/l3fwd-power: add PM QoS configuration Huisong Li
2 siblings, 2 replies; 114+ messages in thread
From: Huisong Li @ 2024-10-25 9:18 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, konstantin.ananyev, david.marchand,
fengchengwen, liuyonglong, lihuisong
Many variables, such as 'pause_duration' and 'scale_freq_min', are
'uint32_t', but they are parsed from the command line with parse_int(),
which returns 'int'. The value can therefore overflow when the function returns.
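As an illustration only, the range-checked parsing pattern can be sketched in a standalone form. The helper name parse_uint32_sketch() is hypothetical, and the extra errno/ERANGE check is an assumption added here so the sketch also behaves on platforms where 'unsigned long' is 32 bits wide; it is not part of the patch.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical standalone variant of the parsing helper: the status is
 * returned separately from the parsed value, so a large uint32_t value is
 * never squeezed through an 'int' return value.
 */
static int
parse_uint32_sketch(const char *opt, uint32_t *res)
{
	char *end = NULL;
	unsigned long val;

	/* parse decimal string */
	errno = 0;
	val = strtoul(opt, &end, 10);
	if (opt[0] == '\0' || end == NULL || *end != '\0')
		return -1;

	/* reject values that do not fit in 32 bits (assumed extra ERANGE check) */
	if (errno == ERANGE || val > UINT32_MAX)
		return -1;

	*res = (uint32_t)val;

	return 0;
}
```

A caller checks the return value first and only then uses the output parameter, which is how the parse_args() hunks in this patch use the new helper.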
Fixes: 59f2853c4cae ("examples/l3fwd_power: add configuration options")
Cc: stable@dpdk.org
Signed-off-by: Huisong Li <lihuisong@huawei.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
---
examples/l3fwd-power/main.c | 36 ++++++++++++++++--------------------
1 file changed, 16 insertions(+), 20 deletions(-)
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 2bb6b092c3..0ce4aa04d4 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -1525,7 +1525,7 @@ print_usage(const char *prgname)
}
static int
-parse_int(const char *opt)
+parse_uint32(const char *opt, uint32_t *res)
{
char *end = NULL;
unsigned long val;
@@ -1535,23 +1535,14 @@ parse_int(const char *opt)
if ((opt[0] == '\0') || (end == NULL) || (*end != '\0'))
return -1;
- return val;
-}
-
-static int parse_max_pkt_len(const char *pktlen)
-{
- char *end = NULL;
- unsigned long len;
-
- /* parse decimal string */
- len = strtoul(pktlen, &end, 10);
- if ((pktlen[0] == '\0') || (end == NULL) || (*end != '\0'))
+ if (val > UINT32_MAX) {
+ RTE_LOG(ERR, L3FWD_POWER, "parameter shouldn't exceed %u.\n", UINT32_MAX);
return -1;
+ }
- if (len == 0)
- return -1;
+ *res = val;
- return len;
+ return 0;
}
static int
@@ -1898,8 +1889,9 @@ parse_args(int argc, char **argv)
if (!strncmp(lgopts[option_index].name,
CMD_LINE_OPT_MAX_PKT_LEN,
sizeof(CMD_LINE_OPT_MAX_PKT_LEN))) {
+ if (parse_uint32(optarg, &max_pkt_len) != 0)
+ return -1;
printf("Custom frame size is configured\n");
- max_pkt_len = parse_max_pkt_len(optarg);
}
if (!strncmp(lgopts[option_index].name,
@@ -1912,29 +1904,33 @@ parse_args(int argc, char **argv)
if (!strncmp(lgopts[option_index].name,
CMD_LINE_OPT_MAX_EMPTY_POLLS,
sizeof(CMD_LINE_OPT_MAX_EMPTY_POLLS))) {
+ if (parse_uint32(optarg, &max_empty_polls) != 0)
+ return -1;
printf("Maximum empty polls configured\n");
- max_empty_polls = parse_int(optarg);
}
if (!strncmp(lgopts[option_index].name,
CMD_LINE_OPT_PAUSE_DURATION,
sizeof(CMD_LINE_OPT_PAUSE_DURATION))) {
+ if (parse_uint32(optarg, &pause_duration) != 0)
+ return -1;
printf("Pause duration configured\n");
- pause_duration = parse_int(optarg);
}
if (!strncmp(lgopts[option_index].name,
CMD_LINE_OPT_SCALE_FREQ_MIN,
sizeof(CMD_LINE_OPT_SCALE_FREQ_MIN))) {
+ if (parse_uint32(optarg, &scale_freq_min) != 0)
+ return -1;
printf("Scaling frequency minimum configured\n");
- scale_freq_min = parse_int(optarg);
}
if (!strncmp(lgopts[option_index].name,
CMD_LINE_OPT_SCALE_FREQ_MAX,
sizeof(CMD_LINE_OPT_SCALE_FREQ_MAX))) {
+ if (parse_uint32(optarg, &scale_freq_max) != 0)
+ return -1;
printf("Scaling frequency maximum configured\n");
- scale_freq_max = parse_int(optarg);
}
break;
--
2.22.0
* Re: [PATCH v13 2/3] examples/l3fwd-power: fix data overflow when parse command line
2024-10-25 9:18 ` [PATCH v13 2/3] examples/l3fwd-power: fix data overflow when parse command line Huisong Li
@ 2024-10-25 11:27 ` fengchengwen
2024-10-29 5:48 ` Tummala, Sivaprasad
1 sibling, 0 replies; 114+ messages in thread
From: fengchengwen @ 2024-10-25 11:27 UTC (permalink / raw)
To: Huisong Li, dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, konstantin.ananyev, david.marchand,
liuyonglong
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
On 2024/10/25 17:18, Huisong Li wrote:
> Many variables are 'uint32_t', like, 'pause_duration', 'scale_freq_min'
> and so on. They use parse_int() to parse it from command line.
> But overflow problem occurs when this function return.
>
> Fixes: 59f2853c4cae ("examples/l3fwd_power: add configuration options")
> Cc: stable@dpdk.org
>
> Signed-off-by: Huisong Li <lihuisong@huawei.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> ---
* RE: [PATCH v13 2/3] examples/l3fwd-power: fix data overflow when parse command line
2024-10-25 9:18 ` [PATCH v13 2/3] examples/l3fwd-power: fix data overflow when parse command line Huisong Li
2024-10-25 11:27 ` fengchengwen
@ 2024-10-29 5:48 ` Tummala, Sivaprasad
1 sibling, 0 replies; 114+ messages in thread
From: Tummala, Sivaprasad @ 2024-10-29 5:48 UTC (permalink / raw)
To: Huisong Li, dev
Cc: mb, thomas, Yigit, Ferruh, anatoly.burakov, david.hunt, stephen,
konstantin.ananyev, david.marchand, fengchengwen, liuyonglong
Hi Huisong,
LGTM!
> -----Original Message-----
> From: Huisong Li <lihuisong@huawei.com>
> Sent: Friday, October 25, 2024 2:49 PM
> To: dev@dpdk.org
> Cc: mb@smartsharesystems.com; thomas@monjalon.net; Yigit, Ferruh
> <Ferruh.Yigit@amd.com>; anatoly.burakov@intel.com; david.hunt@intel.com;
> Tummala, Sivaprasad <Sivaprasad.Tummala@amd.com>;
> stephen@networkplumber.org; konstantin.ananyev@huawei.com;
> david.marchand@redhat.com; fengchengwen@huawei.com;
> liuyonglong@huawei.com; lihuisong@huawei.com
> Subject: [PATCH v13 2/3] examples/l3fwd-power: fix data overflow when parse
> command line
>
> Many variables are 'uint32_t', like, 'pause_duration', 'scale_freq_min'
> and so on. They use parse_int() to parse it from command line.
> But overflow problem occurs when this function return.
>
> Fixes: 59f2853c4cae ("examples/l3fwd_power: add configuration options")
> Cc: stable@dpdk.org
>
> Signed-off-by: Huisong Li <lihuisong@huawei.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> ---
> examples/l3fwd-power/main.c | 36 ++++++++++++++++--------------------
> 1 file changed, 16 insertions(+), 20 deletions(-)
>
> diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
> index 2bb6b092c3..0ce4aa04d4 100644
> --- a/examples/l3fwd-power/main.c
> +++ b/examples/l3fwd-power/main.c
> @@ -1525,7 +1525,7 @@ print_usage(const char *prgname)
> }
>
> static int
> -parse_int(const char *opt)
> +parse_uint32(const char *opt, uint32_t *res)
> {
> char *end = NULL;
> unsigned long val;
> @@ -1535,23 +1535,14 @@ parse_int(const char *opt)
> if ((opt[0] == '\0') || (end == NULL) || (*end != '\0'))
> return -1;
>
> - return val;
> -}
> -
> -static int parse_max_pkt_len(const char *pktlen)
> -{
> - char *end = NULL;
> - unsigned long len;
> -
> - /* parse decimal string */
> - len = strtoul(pktlen, &end, 10);
> - if ((pktlen[0] == '\0') || (end == NULL) || (*end != '\0'))
> + if (val > UINT32_MAX) {
> + RTE_LOG(ERR, L3FWD_POWER, "parameter shouldn't exceed %u.\n", UINT32_MAX);
> return -1;
> + }
>
> - if (len == 0)
> - return -1;
> + *res = val;
>
> - return len;
> + return 0;
> }
>
> static int
> @@ -1898,8 +1889,9 @@ parse_args(int argc, char **argv)
> if (!strncmp(lgopts[option_index].name,
> CMD_LINE_OPT_MAX_PKT_LEN,
> sizeof(CMD_LINE_OPT_MAX_PKT_LEN))) {
> + if (parse_uint32(optarg, &max_pkt_len) != 0)
> + return -1;
> printf("Custom frame size is configured\n");
> - max_pkt_len = parse_max_pkt_len(optarg);
> }
>
> if (!strncmp(lgopts[option_index].name,
> @@ -1912,29 +1904,33 @@ parse_args(int argc, char **argv)
> if (!strncmp(lgopts[option_index].name,
> CMD_LINE_OPT_MAX_EMPTY_POLLS,
> sizeof(CMD_LINE_OPT_MAX_EMPTY_POLLS))) {
> + if (parse_uint32(optarg, &max_empty_polls) != 0)
> + return -1;
> printf("Maximum empty polls configured\n");
> - max_empty_polls = parse_int(optarg);
> }
>
> if (!strncmp(lgopts[option_index].name,
> CMD_LINE_OPT_PAUSE_DURATION,
> sizeof(CMD_LINE_OPT_PAUSE_DURATION))) {
> + if (parse_uint32(optarg, &pause_duration) != 0)
> + return -1;
> printf("Pause duration configured\n");
> - pause_duration = parse_int(optarg);
> }
>
> if (!strncmp(lgopts[option_index].name,
> CMD_LINE_OPT_SCALE_FREQ_MIN,
> sizeof(CMD_LINE_OPT_SCALE_FREQ_MIN))) {
> + if (parse_uint32(optarg, &scale_freq_min) != 0)
> + return -1;
> printf("Scaling frequency minimum configured\n");
> - scale_freq_min = parse_int(optarg);
> }
>
> if (!strncmp(lgopts[option_index].name,
> CMD_LINE_OPT_SCALE_FREQ_MAX,
> sizeof(CMD_LINE_OPT_SCALE_FREQ_MAX))) {
> + if (parse_uint32(optarg, &scale_freq_max) != 0)
> + return -1;
> printf("Scaling frequency maximum configured\n");
> - scale_freq_max = parse_int(optarg);
> }
>
> break;
> --
> 2.22.0
Acked-by: Sivaprasad Tummala <sivaprasad.tummala@amd.com>
* [PATCH v13 3/3] examples/l3fwd-power: add PM QoS configuration
2024-10-25 9:18 ` [PATCH v13 0/3] power: introduce PM QoS interface Huisong Li
2024-10-25 9:18 ` [PATCH v13 1/3] power: introduce PM QoS API on CPU wide Huisong Li
2024-10-25 9:18 ` [PATCH v13 2/3] examples/l3fwd-power: fix data overflow when parse command line Huisong Li
@ 2024-10-25 9:18 ` Huisong Li
2 siblings, 0 replies; 114+ messages in thread
From: Huisong Li @ 2024-10-25 9:18 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, konstantin.ananyev, david.marchand,
fengchengwen, liuyonglong, lihuisong
The '--cpu-resume-latency' option can be used to control C-state selection.
Setting the CPU resume latency to 0 restricts the CPU to entering only the
C0 state, which improves performance but may also increase the power
consumption of the platform.
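The back-up, apply and restore flow described above can be sketched in a self-contained form. Everything in this sketch is a stand-in: get_latency() and set_latency() only model rte_power_qos_get_cpu_resume_latency() and rte_power_qos_set_cpu_resume_latency() on an in-memory array, NB_CORES is a hypothetical core count, and skipping the restore for cores whose backup read failed is a design assumption of the sketch rather than something this patch does.

```c
#include <stdint.h>

#define NB_CORES 4              /* hypothetical core count for the sketch */
#define NO_CONSTRAINT INT32_MAX /* mirrors RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT */

/* In-memory stand-in for the per-CPU sysfs files. */
static int fake_latency[NB_CORES] = { NO_CONSTRAINT, 50, NO_CONSTRAINT, 20 };
static int saved_latency[NB_CORES];

/* Stand-in for rte_power_qos_get_cpu_resume_latency(). */
static int
get_latency(unsigned int core)
{
	return core < NB_CORES ? fake_latency[core] : -1;
}

/* Stand-in for rte_power_qos_set_cpu_resume_latency(). */
static int
set_latency(unsigned int core, int latency)
{
	if (core >= NB_CORES || latency < 0)
		return -1;
	fake_latency[core] = latency;
	return 0;
}

/* Back up each core's current limit, then apply the requested one. */
static int
apply_resume_latency(int requested)
{
	unsigned int core;

	for (core = 0; core < NB_CORES; core++) {
		saved_latency[core] = get_latency(core);
		if (set_latency(core, requested) != 0)
			return -1;
	}

	return 0;
}

/* Restore the saved limits; skip cores whose backup read failed. */
static void
restore_resume_latency(void)
{
	unsigned int core;

	for (core = 0; core < NB_CORES; core++) {
		if (saved_latency[core] >= 0)
			set_latency(core, saved_latency[core]);
	}
}
```

Calling apply_resume_latency(0) at start-up and restore_resume_latency() at exit corresponds to the init and deinit steps of the sample application.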
Signed-off-by: Huisong Li <lihuisong@huawei.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
---
.../sample_app_ug/l3_forward_power_man.rst | 5 +-
examples/l3fwd-power/main.c | 91 +++++++++++++++++++
2 files changed, 95 insertions(+), 1 deletion(-)
diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index 9c9684fea7..70fa83669a 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -67,7 +67,8 @@ based on the speculative sleep duration of the core.
In this application, we introduce a heuristic algorithm that allows packet processing cores to sleep for a short period
if there is no Rx packet received on recent polls.
In this way, CPUIdle automatically forces the corresponding cores to enter deeper C-states
-instead of always running to the C0 state waiting for packets.
+instead of always running to the C0 state waiting for packets. However, the user can set the CPU resume latency to control C-state selection.
+Setting the CPU resume latency to 0 restricts the CPU to entering only the C0 state, which improves performance but may increase the power consumption of the platform.
.. note::
@@ -105,6 +106,8 @@ where,
* --config (port,queue,lcore)[,(port,queue,lcore)]: determines which queues from which ports are mapped to which cores.
+* --cpu-resume-latency LATENCY: set the CPU resume latency to control C-state selection; 0 only allows the CPU to enter the C0 state.
+
* --max-pkt-len: optional, maximum packet length in decimal (64-9600)
* --no-numa: optional, disables numa awareness
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 0ce4aa04d4..647e37e7ea 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
#include <rte_telemetry.h>
#include <rte_power_pmd_mgmt.h>
#include <rte_power_uncore.h>
+#include <rte_power_qos.h>
#include "perf_core.h"
#include "main.h"
@@ -265,6 +266,9 @@ static uint32_t pause_duration = 1;
static uint32_t scale_freq_min;
static uint32_t scale_freq_max;
+static int cpu_resume_latency = -1;
+static int resume_latency_bk[RTE_MAX_LCORE];
+
static struct rte_mempool * pktmbuf_pool[NB_SOCKETS];
@@ -1501,6 +1505,8 @@ print_usage(const char *prgname)
" -U: set min/max frequency for uncore to maximum value\n"
" -i (frequency index): set min/max frequency for uncore to specified frequency index\n"
" --config (port,queue,lcore): rx queues configuration\n"
+ " --cpu-resume-latency LATENCY: set CPU resume latency to control C-state selection,"
+ " 0 : just allow to enter C0-state\n"
" --high-perf-cores CORELIST: list of high performance cores\n"
" --perf-config: similar as config, cores specified as indices"
" for bins containing high or regular performance cores\n"
@@ -1545,6 +1551,42 @@ parse_uint32(const char *opt, uint32_t *res)
return 0;
}
+static int
+parse_int(const char *opt, int *res)
+{
+ char *end = NULL;
+ signed long val;
+
+ /* parse integer string */
+ val = strtol(opt, &end, 10);
+ if ((opt[0] == '\0') || (end == NULL) || (*end != '\0'))
+ return -1;
+
+ if (val < INT_MIN || val > INT_MAX) {
+ RTE_LOG(ERR, L3FWD_POWER, "parameter should be range from %d to %d.\n",
+ INT_MIN, INT_MAX);
+ return -1;
+ }
+
+ *res = val;
+
+ return 0;
+}
+
+static int
+parse_cpu_resume_latency(const char *opt, int *val)
+{
+ if (parse_int(opt, val) != 0)
+ return -1;
+
+ if (*val < 0) {
+ RTE_LOG(ERR, L3FWD_POWER, "CPU resume latency should be >= 0.\n");
+ return -1;
+ }
+
+ return 0;
+}
+
static int
parse_uncore_options(enum uncore_choice choice, const char *argument)
{
@@ -1734,6 +1776,7 @@ parse_pmd_mgmt_config(const char *name)
#define CMD_LINE_OPT_PAUSE_DURATION "pause-duration"
#define CMD_LINE_OPT_SCALE_FREQ_MIN "scale-freq-min"
#define CMD_LINE_OPT_SCALE_FREQ_MAX "scale-freq-max"
+#define CMD_LINE_OPT_CPU_RESUME_LATENCY "cpu-resume-latency"
/* Parse the argument given in the command line of the application */
static int
@@ -1748,6 +1791,7 @@ parse_args(int argc, char **argv)
{"perf-config", 1, 0, 0},
{"high-perf-cores", 1, 0, 0},
{"no-numa", 0, 0, 0},
+ {CMD_LINE_OPT_CPU_RESUME_LATENCY, 1, 0, 0},
{CMD_LINE_OPT_MAX_PKT_LEN, 1, 0, 0},
{CMD_LINE_OPT_PARSE_PTYPE, 0, 0, 0},
{CMD_LINE_OPT_LEGACY, 0, 0, 0},
@@ -1933,6 +1977,15 @@ parse_args(int argc, char **argv)
printf("Scaling frequency maximum configured\n");
}
+ if (!strncmp(lgopts[option_index].name,
+ CMD_LINE_OPT_CPU_RESUME_LATENCY,
+ sizeof(CMD_LINE_OPT_CPU_RESUME_LATENCY))) {
+ if (parse_cpu_resume_latency(optarg,
+ &cpu_resume_latency) != 0)
+ return -1;
+ printf("PM QoS configured\n");
+ }
+
break;
default:
@@ -2256,6 +2309,35 @@ init_power_library(void)
return -1;
}
}
+
+ if (cpu_resume_latency != -1) {
+ RTE_LCORE_FOREACH(lcore_id) {
+ /* Back old CPU resume latency. */
+ ret = rte_power_qos_get_cpu_resume_latency(lcore_id);
+ if (ret < 0) {
+ RTE_LOG(ERR, L3FWD_POWER,
+ "Failed to get cpu resume latency on lcore-%u, ret=%d.\n",
+ lcore_id, ret);
+ }
+ resume_latency_bk[lcore_id] = ret;
+
+ /*
+ * Set the cpu resume latency of the worker lcore based
+ * on user's request. If set strict latency (0), just
+ * allow the CPU to enter the shallowest idle state to
+ * improve performance.
+ */
+ ret = rte_power_qos_set_cpu_resume_latency(lcore_id,
+ cpu_resume_latency);
+ if (ret != 0) {
+ RTE_LOG(ERR, L3FWD_POWER,
+ "Failed to set cpu resume latency on lcore-%u, ret=%d.\n",
+ lcore_id, ret);
+ return ret;
+ }
+ }
+ }
+
return ret;
}
@@ -2295,6 +2377,15 @@ deinit_power_library(void)
}
}
}
+
+ if (cpu_resume_latency != -1) {
+ RTE_LCORE_FOREACH(lcore_id) {
+ /* Restore the original value. */
+ rte_power_qos_set_cpu_resume_latency(lcore_id,
+ resume_latency_bk[lcore_id]);
+ }
+ }
+
return ret;
}
--
2.22.0
* [PATCH v14 0/3] power: introduce PM QoS interface
2024-03-20 10:55 [PATCH 0/2] introduce PM QoS interface Huisong Li
` (14 preceding siblings ...)
2024-10-25 9:18 ` [PATCH v13 0/3] power: introduce PM QoS interface Huisong Li
@ 2024-10-29 13:28 ` Huisong Li
2024-10-29 13:28 ` [PATCH v14 1/3] power: introduce PM QoS API on CPU wide Huisong Li
` (3 more replies)
15 siblings, 4 replies; 114+ messages in thread
From: Huisong Li @ 2024-10-29 13:28 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, konstantin.ananyev, david.marchand,
fengchengwen, liuyonglong, lihuisong
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay sensitive and expect a low
resume time, such as interrupt packet receiving mode.
The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
interface is used by userspace to set and get the resume latency limit on
cpuX. Please see the description in the kernel document[1].
Each cpuidle governor in Linux selects which idle state to enter based on
this CPU resume latency in its idle task.
The per-CPU PM QoS API can be used to control this CPU's idle state
selection. Setting a strict resume latency (zero value) limits the CPU to
the shallowest idle state, which lowers the delay when waking up from idle.
[1] https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
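One detail of this sysfs file is easy to get backwards: the string "n/a" stands for a strict (zero) resume latency, while the string "0" stands for no constraint. A minimal sketch of that mapping follows, assuming the string read back has already had its trailing newline removed. The names latency_to_sysfs(), sysfs_to_latency(), STRICT_LATENCY and NO_CONSTRAINT are hypothetical and only mirror the two special values defined in rte_power_qos.h.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Local copies of the two special values defined in rte_power_qos.h. */
#define STRICT_LATENCY 0
#define NO_CONSTRAINT  INT32_MAX

/* Convert an API latency value into the string written to sysfs. */
static void
latency_to_sysfs(int latency, char *buf, size_t len)
{
	if (latency == STRICT_LATENCY)
		snprintf(buf, len, "%s", "n/a");
	else if (latency == NO_CONSTRAINT)
		snprintf(buf, len, "%d", 0);
	else
		snprintf(buf, len, "%d", latency);
}

/* Convert a string read from sysfs (trailing newline already removed). */
static int
sysfs_to_latency(const char *buf)
{
	unsigned long val;

	if (strcmp(buf, "n/a") == 0)
		return STRICT_LATENCY;

	val = strtoul(buf, NULL, 10);

	return val == 0 ? NO_CONSTRAINT : (int)val;
}
```

Keeping the conversion in one place means callers only ever see microsecond values plus the two special constants, never the raw sysfs strings.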
---
v14:
- use parse_uint to parse --cpu-resume-latency instead of adding a new
parse_int()
v13:
- do not allow negative values for --cpu-resume-latency.
- restore to the original value as Konstantin suggested.
v12:
- add Acked-by Chengwen and Konstantin
- fix overflow issue in l3fwd-power when parse command line
- add a command parameter to set CPU resume latency
v11:
- operate on the CPU id that the lcore is mapped to, obtained with the new
function power_get_lcore_mapped_cpu_id().
v10:
- replace LINE_MAX with a custom macro and fix two typos.
v9:
- move new feature description from release_24_07.rst to release_24_11.rst.
v8:
- update the latest code to resolve CI warning
v7:
- remove the dead code rte_lcore_is_enabled in patch[2/2]
v6:
- update release_24_07.rst based on dpdk repo to resolve CI warning.
v5:
- use LINE_MAX to replace BUFSIZ, and use snprintf to replace sprintf.
v4:
- fix some comments based on Stephen
- add stdint.h include
- add Acked-by Morten Brørup <mb@smartsharesystems.com>
v3:
- add RTE_POWER_xxx prefix for some macro in header
- add the check for lcore_id with rte_lcore_is_enabled
v2:
- use PM QoS on CPU wide to replace the one on system wide
Huisong Li (3):
power: introduce PM QoS API on CPU wide
examples/l3fwd-power: fix data overflow when parse command line
examples/l3fwd-power: add PM QoS configuration
doc/guides/prog_guide/power_man.rst | 19 +++
doc/guides/rel_notes/release_24_11.rst | 5 +
.../sample_app_ug/l3_forward_power_man.rst | 5 +-
examples/l3fwd-power/main.c | 96 +++++++++++---
lib/power/meson.build | 2 +
lib/power/rte_power_qos.c | 123 ++++++++++++++++++
lib/power/rte_power_qos.h | 73 +++++++++++
lib/power/version.map | 4 +
8 files changed, 306 insertions(+), 21 deletions(-)
create mode 100644 lib/power/rte_power_qos.c
create mode 100644 lib/power/rte_power_qos.h
--
2.22.0
* [PATCH v14 1/3] power: introduce PM QoS API on CPU wide
2024-10-29 13:28 ` [PATCH v14 0/3] power: introduce PM QoS interface Huisong Li
@ 2024-10-29 13:28 ` Huisong Li
2024-10-29 13:28 ` [PATCH v14 2/3] examples/l3fwd-power: fix data overflow when parse command line Huisong Li
` (2 subsequent siblings)
3 siblings, 0 replies; 114+ messages in thread
From: Huisong Li @ 2024-10-29 13:28 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, konstantin.ananyev, david.marchand,
fengchengwen, liuyonglong, lihuisong
The deeper the idle state, the lower the power consumption, but the longer
the resume time. Some services are delay sensitive and expect a low
resume time, such as interrupt packet receiving mode.
The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
interface is used by userspace to set and get the resume latency limit on
cpuX. Each cpuidle governor in Linux selects which idle state to enter
based on this CPU resume latency in its idle task.
The per-CPU PM QoS API can be used to control this CPU's idle state
selection. Setting a strict resume latency (zero value) limits the CPU to
the shallowest idle state, which lowers the delay when waking up from idle.
Signed-off-by: Huisong Li <lihuisong@huawei.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Sivaprasad Tummala <sivaprasad.tummala@amd.com>
---
doc/guides/prog_guide/power_man.rst | 19 ++++
doc/guides/rel_notes/release_24_11.rst | 5 +
lib/power/meson.build | 2 +
lib/power/rte_power_qos.c | 123 +++++++++++++++++++++++++
lib/power/rte_power_qos.h | 73 +++++++++++++++
lib/power/version.map | 4 +
6 files changed, 226 insertions(+)
create mode 100644 lib/power/rte_power_qos.c
create mode 100644 lib/power/rte_power_qos.h
diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index f6674efe2d..91358b04f3 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -107,6 +107,25 @@ User Cases
The power management mechanism is used to save power when performing L3 forwarding.
+PM QoS
+------
+
+The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
+interface is used to set and get the resume latency limit on the cpuX for
+userspace. Each cpuidle governor in Linux selects which idle state to enter
+based on this CPU resume latency in its idle task.
+
+The deeper the idle state, the lower the power consumption, but the longer
+the resume time. Some services are latency sensitive and expect a low
+resume time, like interrupt packet receiving mode.
+
+Applications can set and get the CPU resume latency with
+``rte_power_qos_set_cpu_resume_latency()`` and ``rte_power_qos_get_cpu_resume_latency()``
+respectively. Applications can set a strict resume latency (zero value) with
+``rte_power_qos_set_cpu_resume_latency()`` to lower the resume latency and
+get better performance (at the cost of possibly higher platform power consumption).
+
+
Ethernet PMD Power Management API
---------------------------------
diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
index fa4822d928..d9e268274b 100644
--- a/doc/guides/rel_notes/release_24_11.rst
+++ b/doc/guides/rel_notes/release_24_11.rst
@@ -237,6 +237,11 @@ New Features
This field is used to pass an extra configuration settings such as ability
to lookup IPv4 addresses in network byte order.
+* **Introduce per-CPU PM QoS interface.**
+
+ * Add per-CPU PM QoS interface to lower the resume latency when waking up
+ from idle state.
+
* **Added new API to register telemetry endpoint callbacks with private arguments.**
A new ``rte_telemetry_register_cmd_arg`` function is available to pass an opaque value to
diff --git a/lib/power/meson.build b/lib/power/meson.build
index 2f0f3d26e9..9b5d3e8315 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -23,12 +23,14 @@ sources = files(
'rte_power.c',
'rte_power_uncore.c',
'rte_power_pmd_mgmt.c',
+ 'rte_power_qos.c',
)
headers = files(
'rte_power.h',
'rte_power_guest_channel.h',
'rte_power_pmd_mgmt.h',
'rte_power_uncore.h',
+ 'rte_power_qos.h',
)
deps += ['timer', 'ethdev']
diff --git a/lib/power/rte_power_qos.c b/lib/power/rte_power_qos.c
new file mode 100644
index 0000000000..4dd0532b36
--- /dev/null
+++ b/lib/power/rte_power_qos.c
@@ -0,0 +1,123 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#include <errno.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <rte_lcore.h>
+#include <rte_log.h>
+
+#include "power_common.h"
+#include "rte_power_qos.h"
+
+#define PM_QOS_SYSFILE_RESUME_LATENCY_US \
+ "/sys/devices/system/cpu/cpu%u/power/pm_qos_resume_latency_us"
+
+#define PM_QOS_CPU_RESUME_LATENCY_BUF_LEN 32
+
+int
+rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency)
+{
+ char buf[PM_QOS_CPU_RESUME_LATENCY_BUF_LEN];
+ uint32_t cpu_id;
+ FILE *f;
+ int ret;
+
+ if (!rte_lcore_is_enabled(lcore_id)) {
+ POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
+ return -EINVAL;
+ }
+ ret = power_get_lcore_mapped_cpu_id(lcore_id, &cpu_id);
+ if (ret != 0)
+ return ret;
+
+ if (latency < 0) {
+ POWER_LOG(ERR, "latency should be greater than or equal to 0");
+ return -EINVAL;
+ }
+
+ ret = open_core_sysfs_file(&f, "w", PM_QOS_SYSFILE_RESUME_LATENCY_US, cpu_id);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US" : %s",
+ cpu_id, strerror(errno));
+ return ret;
+ }
+
+ /*
+ * Based on the sysfs interface pm_qos_resume_latency_us described by
+ * @PM_QOS_SYSFILE_RESUME_LATENCY_US in the kernel, the input string
+ * is interpreted as follows.
+ * 1> "n/a" means the resume latency is 0 (strict).
+ * 2> "0" means there is no resume latency constraint.
+ * 3> any other value is the actual resume latency to be set.
+ */
+ if (latency == RTE_POWER_QOS_STRICT_LATENCY_VALUE)
+ snprintf(buf, sizeof(buf), "%s", "n/a");
+ else if (latency == RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT)
+ snprintf(buf, sizeof(buf), "%u", 0);
+ else
+ snprintf(buf, sizeof(buf), "%u", latency);
+
+ ret = write_core_sysfs_s(f, buf);
+ if (ret != 0)
+ POWER_LOG(ERR, "Failed to write "PM_QOS_SYSFILE_RESUME_LATENCY_US" : %s",
+ cpu_id, strerror(errno));
+
+ fclose(f);
+
+ return ret;
+}
+
+int
+rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id)
+{
+ char buf[PM_QOS_CPU_RESUME_LATENCY_BUF_LEN];
+ int latency = -1;
+ uint32_t cpu_id;
+ FILE *f;
+ int ret;
+
+ if (!rte_lcore_is_enabled(lcore_id)) {
+ POWER_LOG(ERR, "lcore id %u is not enabled", lcore_id);
+ return -EINVAL;
+ }
+ ret = power_get_lcore_mapped_cpu_id(lcore_id, &cpu_id);
+ if (ret != 0)
+ return ret;
+
+ ret = open_core_sysfs_file(&f, "r", PM_QOS_SYSFILE_RESUME_LATENCY_US, cpu_id);
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to open "PM_QOS_SYSFILE_RESUME_LATENCY_US" : %s",
+ cpu_id, strerror(errno));
+ return ret;
+ }
+
+ ret = read_core_sysfs_s(f, buf, sizeof(buf));
+ if (ret != 0) {
+ POWER_LOG(ERR, "Failed to read "PM_QOS_SYSFILE_RESUME_LATENCY_US" : %s",
+ cpu_id, strerror(errno));
+ goto out;
+ }
+
+ /*
+ * Based on the sysfs interface pm_qos_resume_latency_us described by
+ * @PM_QOS_SYSFILE_RESUME_LATENCY_US in the kernel, the output string
+ * is interpreted as follows.
+ * 1> "n/a" means the resume latency is 0 (strict).
+ * 2> "0" means there is no resume latency constraint.
+ * 3> any other value is the actual resume latency in use.
+ */
+ if (strcmp(buf, "n/a") == 0)
+ latency = RTE_POWER_QOS_STRICT_LATENCY_VALUE;
+ else {
+ latency = strtoul(buf, NULL, 10);
+ latency = latency == 0 ? RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT : latency;
+ }
+
+out:
+ fclose(f);
+
+ return latency != -1 ? latency : ret;
+}
diff --git a/lib/power/rte_power_qos.h b/lib/power/rte_power_qos.h
new file mode 100644
index 0000000000..7a8dab9272
--- /dev/null
+++ b/lib/power/rte_power_qos.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 HiSilicon Limited
+ */
+
+#ifndef RTE_POWER_QOS_H
+#define RTE_POWER_QOS_H
+
+#include <stdint.h>
+
+#include <rte_compat.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * @file rte_power_qos.h
+ *
+ * PM QoS API.
+ *
+ * The CPU-wide resume latency limit has a positive impact on this CPU's idle
+ * state selection in each cpuidle governor.
+ * Please see the description of CPU-wide PM QoS at the following link:
+ * https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
+ *
+ * The deeper the idle state, the lower the power consumption, but the
+ * longer the resume time. Some services are delay sensitive and expect a
+ * low resume time, like interrupt packet receiving mode.
+ *
+ * In these cases, the per-CPU PM QoS API can be used to control this CPU's
+ * idle state selection and restrict it to the shallowest idle state, lowering
+ * the delay after sleep, by setting a strict resume latency (zero value).
+ */
+
+#define RTE_POWER_QOS_STRICT_LATENCY_VALUE 0
+#define RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT INT32_MAX
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * @param lcore_id
+ * target logical core id
+ *
+ * @param latency
+ * The latency, in microseconds, must be greater than or equal to zero.
+ *
+ * @return
+ * 0 on success. Otherwise negative value is returned.
+ */
+__rte_experimental
+int rte_power_qos_set_cpu_resume_latency(uint16_t lcore_id, int latency);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Get the current resume latency of this logical core.
+ * The default value in the kernel is
+ * RTE_POWER_QOS_RESUME_LATENCY_NO_CONSTRAINT if it has not been set.
+ *
+ * @return
+ * Negative value on failure.
+ * >= 0 means the actual resume latency limit on this core.
+ */
+__rte_experimental
+int rte_power_qos_get_cpu_resume_latency(uint16_t lcore_id);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* RTE_POWER_QOS_H */
diff --git a/lib/power/version.map b/lib/power/version.map
index c9a226614e..08f178a39d 100644
--- a/lib/power/version.map
+++ b/lib/power/version.map
@@ -51,4 +51,8 @@ EXPERIMENTAL {
rte_power_set_uncore_env;
rte_power_uncore_freqs;
rte_power_unset_uncore_env;
+
+ # added in 24.11
+ rte_power_qos_get_cpu_resume_latency;
+ rte_power_qos_set_cpu_resume_latency;
};
--
2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
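The string mapping described in the comments of rte_power_qos.c above can
be sketched in isolation. This is only an illustration under stated
assumptions: latency_to_sysfs() and sysfs_to_latency() are hypothetical
helper names that do not exist in the patch, the two macros are local
copies of the values in rte_power_qos.h, and the input string is assumed
to have no trailing newline.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Local copies of the values defined in the patch's rte_power_qos.h. */
#define STRICT_LATENCY_VALUE 0
#define RESUME_LATENCY_NO_CONSTRAINT INT32_MAX

/* Hypothetical helper: format a latency the way the setter writes it. */
static void
latency_to_sysfs(int latency, char *buf, size_t len)
{
	if (latency == STRICT_LATENCY_VALUE)
		snprintf(buf, len, "%s", "n/a");   /* strict: latency of 0 */
	else if (latency == RESUME_LATENCY_NO_CONSTRAINT)
		snprintf(buf, len, "%u", 0u);      /* "0" means no constraint */
	else
		snprintf(buf, len, "%d", latency); /* actual value in microseconds */
}

/* Hypothetical helper: decode a sysfs string the way the getter does. */
static int
sysfs_to_latency(const char *buf)
{
	unsigned long val;

	if (strcmp(buf, "n/a") == 0)
		return STRICT_LATENCY_VALUE;

	val = strtoul(buf, NULL, 10);
	return val == 0 ? RESUME_LATENCY_NO_CONSTRAINT : (int)val;
}
```

The point of the sketch is the inversion between the API and sysfs: a
latency of 0 in the API is written as "n/a", while the sysfs string "0"
is reported by the API as INT32_MAX (no constraint).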
* [PATCH v14 2/3] examples/l3fwd-power: fix data overflow when parse command line
2024-10-29 13:28 ` [PATCH v14 0/3] power: introduce PM QoS interface Huisong Li
2024-10-29 13:28 ` [PATCH v14 1/3] power: introduce PM QoS API on CPU wide Huisong Li
@ 2024-10-29 13:28 ` Huisong Li
2024-10-29 13:28 ` [PATCH v14 3/3] examples/l3fwd-power: add PM QoS configuration Huisong Li
2024-11-04 9:13 ` [PATCH v14 0/3] power: introduce PM QoS interface lihuisong (C)
3 siblings, 0 replies; 114+ messages in thread
From: Huisong Li @ 2024-10-29 13:28 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, konstantin.ananyev, david.marchand,
fengchengwen, liuyonglong, lihuisong
Many variables are 'uint32_t', such as 'pause_duration' and 'scale_freq_min'.
They are parsed from the command line with parse_int(), which returns 'int',
so values above INT_MAX overflow when this function returns.
Fixes: 59f2853c4cae ("examples/l3fwd_power: add configuration options")
Cc: stable@dpdk.org
Signed-off-by: Huisong Li <lihuisong@huawei.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Sivaprasad Tummala <sivaprasad.tummala@amd.com>
---
examples/l3fwd-power/main.c | 41 +++++++++++++++++++------------------
1 file changed, 21 insertions(+), 20 deletions(-)
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 2bb6b092c3..96fac45c61 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -1524,8 +1524,12 @@ print_usage(const char *prgname)
prgname);
}
+/*
+ * Caller must give the right upper limit so as to ensure receiver variable
+ * doesn't overflow.
+ */
static int
-parse_int(const char *opt)
+parse_uint(const char *opt, uint32_t max, uint32_t *res)
{
char *end = NULL;
unsigned long val;
@@ -1535,23 +1539,15 @@ parse_int(const char *opt)
if ((opt[0] == '\0') || (end == NULL) || (*end != '\0'))
return -1;
- return val;
-}
-
-static int parse_max_pkt_len(const char *pktlen)
-{
- char *end = NULL;
- unsigned long len;
-
- /* parse decimal string */
- len = strtoul(pktlen, &end, 10);
- if ((pktlen[0] == '\0') || (end == NULL) || (*end != '\0'))
+ if (val > max) {
+ RTE_LOG(ERR, L3FWD_POWER, "%s parameter shouldn't exceed %u.\n",
+ opt, max);
return -1;
+ }
- if (len == 0)
- return -1;
+ *res = val;
- return len;
+ return 0;
}
static int
@@ -1898,8 +1894,9 @@ parse_args(int argc, char **argv)
if (!strncmp(lgopts[option_index].name,
CMD_LINE_OPT_MAX_PKT_LEN,
sizeof(CMD_LINE_OPT_MAX_PKT_LEN))) {
+ if (parse_uint(optarg, UINT32_MAX, &max_pkt_len) != 0)
+ return -1;
printf("Custom frame size is configured\n");
- max_pkt_len = parse_max_pkt_len(optarg);
}
if (!strncmp(lgopts[option_index].name,
@@ -1912,29 +1909,33 @@ parse_args(int argc, char **argv)
if (!strncmp(lgopts[option_index].name,
CMD_LINE_OPT_MAX_EMPTY_POLLS,
sizeof(CMD_LINE_OPT_MAX_EMPTY_POLLS))) {
+ if (parse_uint(optarg, UINT32_MAX, &max_empty_polls) != 0)
+ return -1;
printf("Maximum empty polls configured\n");
- max_empty_polls = parse_int(optarg);
}
if (!strncmp(lgopts[option_index].name,
CMD_LINE_OPT_PAUSE_DURATION,
sizeof(CMD_LINE_OPT_PAUSE_DURATION))) {
+ if (parse_uint(optarg, UINT32_MAX, &pause_duration) != 0)
+ return -1;
printf("Pause duration configured\n");
- pause_duration = parse_int(optarg);
}
if (!strncmp(lgopts[option_index].name,
CMD_LINE_OPT_SCALE_FREQ_MIN,
sizeof(CMD_LINE_OPT_SCALE_FREQ_MIN))) {
+ if (parse_uint(optarg, UINT32_MAX, &scale_freq_min) != 0)
+ return -1;
printf("Scaling frequency minimum configured\n");
- scale_freq_min = parse_int(optarg);
}
if (!strncmp(lgopts[option_index].name,
CMD_LINE_OPT_SCALE_FREQ_MAX,
sizeof(CMD_LINE_OPT_SCALE_FREQ_MAX))) {
+ if (parse_uint(optarg, UINT32_MAX, &scale_freq_max) != 0)
+ return -1;
printf("Scaling frequency maximum configured\n");
- scale_freq_max = parse_int(optarg);
}
break;
--
2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
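The bounded parsing introduced by this patch can be tried outside the
application with a standalone sketch. It is not the code from main.c:
parse_uint_sketch() is a hypothetical name, RTE_LOG is replaced by
fprintf(), and the errno check for ERANGE is an extra guard that the
patch itself does not contain.

```c
#include <errno.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse a decimal string into *res, rejecting empty input, trailing
 * garbage and values above max. *res is left untouched on failure.
 */
static int
parse_uint_sketch(const char *opt, uint32_t max, uint32_t *res)
{
	char *end = NULL;
	unsigned long val;

	errno = 0;
	val = strtoul(opt, &end, 10);
	if (opt[0] == '\0' || end == NULL || *end != '\0' || errno == ERANGE)
		return -1;

	/* The caller chooses max so that the receiving variable cannot overflow. */
	if (val > max) {
		fprintf(stderr, "%s parameter shouldn't exceed %" PRIu32 ".\n",
			opt, max);
		return -1;
	}

	*res = (uint32_t)val;
	return 0;
}
```

Returning the status separately from the parsed value is what removes the
overflow: the old parse_int() squeezed an 'unsigned long' into an 'int'
return value that also had to carry the -1 error code.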
* [PATCH v14 3/3] examples/l3fwd-power: add PM QoS configuration
2024-10-29 13:28 ` [PATCH v14 0/3] power: introduce PM QoS interface Huisong Li
2024-10-29 13:28 ` [PATCH v14 1/3] power: introduce PM QoS API on CPU wide Huisong Li
2024-10-29 13:28 ` [PATCH v14 2/3] examples/l3fwd-power: fix data overflow when parse command line Huisong Li
@ 2024-10-29 13:28 ` Huisong Li
2024-10-29 13:45 ` Konstantin Ananyev
2024-11-04 9:13 ` [PATCH v14 0/3] power: introduce PM QoS interface lihuisong (C)
3 siblings, 1 reply; 114+ messages in thread
From: Huisong Li @ 2024-10-29 13:28 UTC (permalink / raw)
To: dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, konstantin.ananyev, david.marchand,
fengchengwen, liuyonglong, lihuisong
The '--cpu-resume-latency' option can be used to control C-state selection.
Setting the CPU resume latency to 0 restricts the CPU to the C0 state
to improve performance, which may also increase the power consumption
of the platform.
Signed-off-by: Huisong Li <lihuisong@huawei.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
---
.../sample_app_ug/l3_forward_power_man.rst | 5 +-
examples/l3fwd-power/main.c | 55 +++++++++++++++++++
2 files changed, 59 insertions(+), 1 deletion(-)
diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index 9c9684fea7..70fa83669a 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -67,7 +67,8 @@ based on the speculative sleep duration of the core.
In this application, we introduce a heuristic algorithm that allows packet processing cores to sleep for a short period
if there is no Rx packet received on recent polls.
In this way, CPUIdle automatically forces the corresponding cores to enter deeper C-states
-instead of always running to the C0 state waiting for packets.
+instead of always running to the C0 state waiting for packets. However, the user can set the CPU resume latency to control C-state selection.
+Setting the CPU resume latency to 0 restricts the CPU to the C0 state to improve performance, which may increase the power consumption of the platform.
.. note::
@@ -105,6 +106,8 @@ where,
* --config (port,queue,lcore)[,(port,queue,lcore)]: determines which queues from which ports are mapped to which cores.
+* --cpu-resume-latency LATENCY: set the CPU resume latency to control C-state selection; 0 allows only the C0 state.
+
* --max-pkt-len: optional, maximum packet length in decimal (64-9600)
* --no-numa: optional, disables numa awareness
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 96fac45c61..7b04fd06dc 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -47,6 +47,7 @@
#include <rte_telemetry.h>
#include <rte_power_pmd_mgmt.h>
#include <rte_power_uncore.h>
+#include <rte_power_qos.h>
#include "perf_core.h"
#include "main.h"
@@ -265,6 +266,9 @@ static uint32_t pause_duration = 1;
static uint32_t scale_freq_min;
static uint32_t scale_freq_max;
+static int cpu_resume_latency = -1;
+static int resume_latency_bk[RTE_MAX_LCORE];
+
static struct rte_mempool * pktmbuf_pool[NB_SOCKETS];
@@ -1501,6 +1505,8 @@ print_usage(const char *prgname)
" -U: set min/max frequency for uncore to maximum value\n"
" -i (frequency index): set min/max frequency for uncore to specified frequency index\n"
" --config (port,queue,lcore): rx queues configuration\n"
+ " --cpu-resume-latency LATENCY: set CPU resume latency to control C-state selection,"
+ " 0 : just allow to enter C0-state\n"
" --high-perf-cores CORELIST: list of high performance cores\n"
" --perf-config: similar as config, cores specified as indices"
" for bins containing high or regular performance cores\n"
@@ -1739,6 +1745,7 @@ parse_pmd_mgmt_config(const char *name)
#define CMD_LINE_OPT_PAUSE_DURATION "pause-duration"
#define CMD_LINE_OPT_SCALE_FREQ_MIN "scale-freq-min"
#define CMD_LINE_OPT_SCALE_FREQ_MAX "scale-freq-max"
+#define CMD_LINE_OPT_CPU_RESUME_LATENCY "cpu-resume-latency"
/* Parse the argument given in the command line of the application */
static int
@@ -1753,6 +1760,7 @@ parse_args(int argc, char **argv)
{"perf-config", 1, 0, 0},
{"high-perf-cores", 1, 0, 0},
{"no-numa", 0, 0, 0},
+ {CMD_LINE_OPT_CPU_RESUME_LATENCY, 1, 0, 0},
{CMD_LINE_OPT_MAX_PKT_LEN, 1, 0, 0},
{CMD_LINE_OPT_PARSE_PTYPE, 0, 0, 0},
{CMD_LINE_OPT_LEGACY, 0, 0, 0},
@@ -1938,6 +1946,15 @@ parse_args(int argc, char **argv)
printf("Scaling frequency maximum configured\n");
}
+ if (!strncmp(lgopts[option_index].name,
+ CMD_LINE_OPT_CPU_RESUME_LATENCY,
+ sizeof(CMD_LINE_OPT_CPU_RESUME_LATENCY))) {
+ if (parse_uint(optarg, INT_MAX,
+ (uint32_t *)&cpu_resume_latency) != 0)
+ return -1;
+ printf("PM QoS configured\n");
+ }
+
break;
default:
@@ -2261,6 +2278,35 @@ init_power_library(void)
return -1;
}
}
+
+ if (cpu_resume_latency != -1) {
+ RTE_LCORE_FOREACH(lcore_id) {
+ /* Back up the old CPU resume latency. */
+ ret = rte_power_qos_get_cpu_resume_latency(lcore_id);
+ if (ret < 0) {
+ RTE_LOG(ERR, L3FWD_POWER,
+ "Failed to get cpu resume latency on lcore-%u, ret=%d.\n",
+ lcore_id, ret);
+ }
+ resume_latency_bk[lcore_id] = ret;
+
+ /*
+ * Set the cpu resume latency of the worker lcore based
+ * on the user's request. If strict latency (0) is set, the
+ * CPU is only allowed to enter the shallowest idle state,
+ * which improves performance.
+ */
+ ret = rte_power_qos_set_cpu_resume_latency(lcore_id,
+ cpu_resume_latency);
+ if (ret != 0) {
+ RTE_LOG(ERR, L3FWD_POWER,
+ "Failed to set cpu resume latency on lcore-%u, ret=%d.\n",
+ lcore_id, ret);
+ return ret;
+ }
+ }
+ }
+
return ret;
}
@@ -2300,6 +2346,15 @@ deinit_power_library(void)
}
}
}
+
+ if (cpu_resume_latency != -1) {
+ RTE_LCORE_FOREACH(lcore_id) {
+ /* Restore the original value. */
+ rte_power_qos_set_cpu_resume_latency(lcore_id,
+ resume_latency_bk[lcore_id]);
+ }
+ }
+
return ret;
}
--
2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
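The back-up, set and restore flow added to init_power_library() and
deinit_power_library() can be sketched without DPDK. Everything here is
hypothetical: fake_get() and fake_set() are in-memory stand-ins for
rte_power_qos_get_cpu_resume_latency() and
rte_power_qos_set_cpu_resume_latency(), and NB_CORES is an arbitrary
count. Skipping the restore when the back-up failed is an assumption of
this sketch, not something the patch does.

```c
#include <stdint.h>

#define NB_CORES 4 /* hypothetical core count for the sketch */

/* In-memory stand-in for the per-CPU sysfs value; not a DPDK API. */
static int fake_latency[NB_CORES] = {
	INT32_MAX, INT32_MAX, INT32_MAX, INT32_MAX
};
static int latency_bk[NB_CORES];

static int
fake_get(unsigned int core)
{
	return core < NB_CORES ? fake_latency[core] : -1;
}

static int
fake_set(unsigned int core, int latency)
{
	if (core >= NB_CORES || latency < 0)
		return -1;
	fake_latency[core] = latency;
	return 0;
}

/* Save each core's current limit, then apply the requested one. */
static int
apply_latency(int requested)
{
	unsigned int core;

	for (core = 0; core < NB_CORES; core++) {
		/* A negative saved value marks a failed back-up. */
		latency_bk[core] = fake_get(core);
		if (fake_set(core, requested) != 0)
			return -1;
	}
	return 0;
}

/* Put back the saved limits, skipping cores whose back-up failed. */
static void
restore_latency(void)
{
	unsigned int core;

	for (core = 0; core < NB_CORES; core++) {
		if (latency_bk[core] >= 0)
			(void)fake_set(core, latency_bk[core]);
	}
}
```

In the patch, the value saved for an lcore is whatever the getter
returned, so the guard in restore_latency() only shows one possible way
to avoid writing back an error code.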
* RE: [PATCH v14 3/3] examples/l3fwd-power: add PM QoS configuration
2024-10-29 13:28 ` [PATCH v14 3/3] examples/l3fwd-power: add PM QoS configuration Huisong Li
@ 2024-10-29 13:45 ` Konstantin Ananyev
0 siblings, 0 replies; 114+ messages in thread
From: Konstantin Ananyev @ 2024-10-29 13:45 UTC (permalink / raw)
To: lihuisong (C), dev
Cc: mb, thomas, ferruh.yigit, anatoly.burakov, david.hunt,
sivaprasad.tummala, stephen, david.marchand, Fengchengwen,
liuyonglong
>
> The '--cpu-resume-latency' option can be used to control C-state selection.
> Setting the CPU resume latency to 0 restricts the CPU to the C0 state
> to improve performance, which may also increase the power consumption
> of the platform.
>
> Signed-off-by: Huisong Li <lihuisong@huawei.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> Acked-by: Chengwen Feng <fengchengwen@huawei.com>
> ---
> .../sample_app_ug/l3_forward_power_man.rst | 5 +-
> examples/l3fwd-power/main.c | 55 +++++++++++++++++++
> 2 files changed, 59 insertions(+), 1 deletion(-)
>
> diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
> index 9c9684fea7..70fa83669a 100644
> --- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
> +++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
> @@ -67,7 +67,8 @@ based on the speculative sleep duration of the core.
> In this application, we introduce a heuristic algorithm that allows packet processing cores to sleep for a short period
> if there is no Rx packet received on recent polls.
> In this way, CPUIdle automatically forces the corresponding cores to enter deeper C-states
> -instead of always running to the C0 state waiting for packets.
> +instead of always running to the C0 state waiting for packets. However, the user can set the CPU resume latency to control C-state selection.
> +Setting the CPU resume latency to 0 restricts the CPU to the C0 state to improve performance, which may increase the power
> consumption of the platform.
>
> .. note::
>
> @@ -105,6 +106,8 @@ where,
>
> * --config (port,queue,lcore)[,(port,queue,lcore)]: determines which queues from which ports are mapped to which cores.
>
> +* --cpu-resume-latency LATENCY: set the CPU resume latency to control C-state selection; 0 allows only the C0 state.
> +
> * --max-pkt-len: optional, maximum packet length in decimal (64-9600)
>
> * --no-numa: optional, disables numa awareness
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> --
> 2.22.0
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v14 0/3] power: introduce PM QoS interface
2024-10-29 13:28 ` [PATCH v14 0/3] power: introduce PM QoS interface Huisong Li
` (2 preceding siblings ...)
2024-10-29 13:28 ` [PATCH v14 3/3] examples/l3fwd-power: add PM QoS configuration Huisong Li
@ 2024-11-04 9:13 ` lihuisong (C)
3 siblings, 0 replies; 114+ messages in thread
From: lihuisong (C) @ 2024-11-04 9:13 UTC (permalink / raw)
To: dev, ferruh.yigit, thomas
Cc: mb, anatoly.burakov, david.hunt, sivaprasad.tummala, stephen,
konstantin.ananyev, david.marchand, fengchengwen, liuyonglong
Hi Ferruh and Thomas,
Kindly ping for merge.
On 2024/10/29 21:28, Huisong Li wrote:
> The deeper the idle state, the lower the power consumption, but the longer
> the resume time. Some services are delay sensitive and expect a low
> resume time, like interrupt packet receiving mode.
>
> The "/sys/devices/system/cpu/cpuX/power/pm_qos_resume_latency_us" sysfs
> interface is used to set and get the resume latency limit on cpuX from
> userspace. Please see the description in the kernel document[1].
> Each cpuidle governor in Linux selects which idle state to enter based on
> this CPU resume latency in its idle task.
>
> The per-CPU PM QoS API can be used to control this CPU's idle state
> selection and restrict it to the shallowest idle state, lowering the delay
> when waking up from an idle state, by setting a strict resume latency (zero value).
>
> [1] https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html?highlight=pm_qos_resume_latency_us#abi-sys-devices-power-pm-qos-resume-latency-us
>
> ---
> v14:
> - use parse_uint to parse --cpu-resume-latency instead of adding a new
> parse_int()
> v13:
> - not allow negative value for --cpu-resume-latency.
> - restore to the original value as Konstantin suggested.
> v12:
> - add Acked-by Chengwen and Konstantin
> - fix overflow issue in l3fwd-power when parse command line
> - add a command parameter to set CPU resume latency
> v11:
> - operate the cpu id the lcore mapped by the new function
> power_get_lcore_mapped_cpu_id().
> v10:
> - replace LINE_MAX with a custom macro and fix two typos.
> v9:
> - move new feature description from release_24_07.rst to release_24_11.rst.
> v8:
> - update the latest code to resolve CI warning
> v7:
> - remove a dead code rte_lcore_is_enabled in patch[2/2]
> v6:
> - update release_24_07.rst based on dpdk repo to resolve CI warning.
> v5:
> - use LINE_MAX to replace BUFSIZ, and use snprintf to replace sprintf.
> v4:
> - fix some comments based on Stephen's review
> - add stdint.h include
> - add Acked-by Morten Brørup <mb@smartsharesystems.com>
> v3:
> - add RTE_POWER_xxx prefix for some macro in header
> - add the check for lcore_id with rte_lcore_is_enabled
> v2:
> - use PM QoS on CPU wide to replace the one on system wide
>
>
> Huisong Li (3):
> power: introduce PM QoS API on CPU wide
> examples/l3fwd-power: fix data overflow when parse command line
> examples/l3fwd-power: add PM QoS configuration
>
> doc/guides/prog_guide/power_man.rst | 19 +++
> doc/guides/rel_notes/release_24_11.rst | 5 +
> .../sample_app_ug/l3_forward_power_man.rst | 5 +-
> examples/l3fwd-power/main.c | 96 +++++++++++---
> lib/power/meson.build | 2 +
> lib/power/rte_power_qos.c | 123 ++++++++++++++++++
> lib/power/rte_power_qos.h | 73 +++++++++++
> lib/power/version.map | 4 +
> 8 files changed, 306 insertions(+), 21 deletions(-)
> create mode 100644 lib/power/rte_power_qos.c
> create mode 100644 lib/power/rte_power_qos.h
>
^ permalink raw reply [flat|nested] 114+ messages in thread