* [PATCH 1/2] latencystats: fix receive sample MP issues
[not found] <20250613003547.39239-1-stephen@networkplumber.org>
@ 2025-06-13 0:34 ` Stephen Hemminger
[not found] ` <20250616160718.49938-1-stephen@networkplumber.org>
[not found] ` <20250617150252.814215-1-stephen@networkplumber.org>
2 siblings, 0 replies; 4+ messages in thread
From: Stephen Hemminger @ 2025-06-13 0:34 UTC (permalink / raw)
To: dev
Cc: Stephen Hemminger, stable, Reshma Pattan, Harry van Haaren, Remy Horton
The receive callback was not safe with multiple queues.
If one receive queue callback decides to take a sample,
it needs to add that sample and atomically update the
previous TSC sample value. Add a new lock for that.
Also, add code to handle TSC wraparound in the comparison.
Perhaps this should move to rte_cycles.h?
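A minimal sketch of the wraparound-safe comparison (plain C,
mirroring the tsc_after() helper in the diff below; the assertion
values are illustrative):

    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* "t0 is after t1" via signed two's-complement difference;
     * correct while the two values are < 2^63 cycles apart. */
    static inline bool tsc_after(uint64_t t0, uint64_t t1)
    {
            return (int64_t)(t1 - t0) < 0;
    }

    int main(void)
    {
            assert(tsc_after(200, 100));          /* plain case */
            assert(tsc_after(5, UINT64_MAX - 5)); /* wrapped counter */
            assert(!tsc_after(100, 200));
            return 0;
    }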
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Fixes: 5cd3cac9ed22 ("latency: added new library for latency stats")
Cc: stable@dpdk.org
---
lib/latencystats/rte_latencystats.c | 47 ++++++++++++++++++-----------
1 file changed, 29 insertions(+), 18 deletions(-)
diff --git a/lib/latencystats/rte_latencystats.c b/lib/latencystats/rte_latencystats.c
index 6873a44a92..181c53dd0e 100644
--- a/lib/latencystats/rte_latencystats.c
+++ b/lib/latencystats/rte_latencystats.c
@@ -45,10 +45,19 @@ timestamp_dynfield(struct rte_mbuf *mbuf)
timestamp_dynfield_offset, rte_mbuf_timestamp_t *);
}
+/* Compare two 64-bit timer counters, dealing with wraparound correctly. */
+static inline bool tsc_after(uint64_t t0, uint64_t t1)
+{
+ return (int64_t)(t1 - t0) < 0;
+}
+
+#define tsc_before(a, b) tsc_after(b, a)
+
static const char *MZ_RTE_LATENCY_STATS = "rte_latencystats";
static int latency_stats_index;
+
+static rte_spinlock_t sample_lock = RTE_SPINLOCK_INITIALIZER;
static uint64_t samp_intvl;
-static uint64_t timer_tsc;
static uint64_t prev_tsc;
#define LATENCY_AVG_SCALE 4
@@ -147,25 +156,27 @@ add_time_stamps(uint16_t pid __rte_unused,
void *user_cb __rte_unused)
{
unsigned int i;
- uint64_t diff_tsc, now;
+ uint64_t now = rte_rdtsc();
- /*
- * For every sample interval,
- * time stamp is marked on one received packet.
- */
- now = rte_rdtsc();
- for (i = 0; i < nb_pkts; i++) {
- diff_tsc = now - prev_tsc;
- timer_tsc += diff_tsc;
-
- if ((pkts[i]->ol_flags & timestamp_dynflag) == 0
- && (timer_tsc >= samp_intvl)) {
- *timestamp_dynfield(pkts[i]) = now;
- pkts[i]->ol_flags |= timestamp_dynflag;
- timer_tsc = 0;
+ /* Check without locking */
+ if (likely(tsc_before(now, prev_tsc + samp_intvl)))
+ return nb_pkts;
+
+ /* Try to take a sample; skip if another core is already sampling. */
+ if (likely(rte_spinlock_trylock(&sample_lock))) {
+ for (i = 0; i < nb_pkts; i++) {
+ struct rte_mbuf *m = pkts[i];
+
+ /* skip if already timestamped */
+ if (unlikely(m->ol_flags & timestamp_dynflag))
+ continue;
+
+ m->ol_flags |= timestamp_dynflag;
+ *timestamp_dynfield(m) = now;
+ prev_tsc = now;
+ break;
}
- prev_tsc = now;
- now = rte_rdtsc();
+ rte_spinlock_unlock(&sample_lock);
}
return nb_pkts;
--
2.47.2
* [PATCH v2 1/2] latencystats: fix receive sample MP issues
[not found] ` <20250616160718.49938-1-stephen@networkplumber.org>
@ 2025-06-16 16:04 ` Stephen Hemminger
0 siblings, 0 replies; 4+ messages in thread
From: Stephen Hemminger @ 2025-06-16 16:04 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger, stable
The receive callback was not safe with multiple queues.
If one receive queue callback decides to take a sample,
it needs to add that sample and atomically update the
previous TSC sample value. Add a new lock for that.
Optimize the check for when to take a sample so that
the lock is only taken when a sample is likely due.
Also, add code to handle TSC wraparound in the comparison.
Perhaps this should move to rte_cycles.h?
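A self-contained sketch of the check-then-trylock pattern described
above, using C11 atomics in place of the rte_stdatomic/rte_spinlock
calls from the diff (wraparound handling omitted for brevity):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    static _Atomic uint64_t next_tsc;
    static atomic_flag sample_lock = ATOMIC_FLAG_INIT;

    /* Returns true if this caller took the sample. */
    static bool maybe_sample(uint64_t now, uint64_t interval)
    {
            /* Fast path: one relaxed load, no lock. */
            if (now < atomic_load_explicit(&next_tsc, memory_order_relaxed))
                    return false;

            /* Slow path: one core wins the lock, the rest skip. */
            if (atomic_flag_test_and_set(&sample_lock))
                    return false;

            atomic_store_explicit(&next_tsc, now + interval,
                                  memory_order_relaxed);
            atomic_flag_clear(&sample_lock);
            return true;
    }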
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Fixes: 5cd3cac9ed22 ("latency: added new library for latency stats")
Cc: stable@dpdk.org
---
lib/latencystats/rte_latencystats.c | 51 ++++++++++++++++++-----------
1 file changed, 32 insertions(+), 19 deletions(-)
diff --git a/lib/latencystats/rte_latencystats.c b/lib/latencystats/rte_latencystats.c
index 6873a44a92..2b994656fb 100644
--- a/lib/latencystats/rte_latencystats.c
+++ b/lib/latencystats/rte_latencystats.c
@@ -22,6 +22,7 @@
#include <rte_metrics.h>
#include <rte_spinlock.h>
#include <rte_string_fns.h>
+#include <rte_stdatomic.h>
#include "rte_latencystats.h"
@@ -45,11 +46,20 @@ timestamp_dynfield(struct rte_mbuf *mbuf)
timestamp_dynfield_offset, rte_mbuf_timestamp_t *);
}
+/* Compare two 64-bit timer counters, dealing with wraparound correctly. */
+static inline bool tsc_after(uint64_t t0, uint64_t t1)
+{
+ return (int64_t)(t1 - t0) < 0;
+}
+
+#define tsc_before(a, b) tsc_after(b, a)
+
static const char *MZ_RTE_LATENCY_STATS = "rte_latencystats";
static int latency_stats_index;
+
+static rte_spinlock_t sample_lock = RTE_SPINLOCK_INITIALIZER;
static uint64_t samp_intvl;
-static uint64_t timer_tsc;
-static uint64_t prev_tsc;
+static RTE_ATOMIC(uint64_t) next_tsc;
#define LATENCY_AVG_SCALE 4
#define LATENCY_JITTER_SCALE 16
@@ -147,25 +157,27 @@ add_time_stamps(uint16_t pid __rte_unused,
void *user_cb __rte_unused)
{
unsigned int i;
- uint64_t diff_tsc, now;
+ uint64_t now = rte_rdtsc();
- /*
- * For every sample interval,
- * time stamp is marked on one received packet.
- */
- now = rte_rdtsc();
- for (i = 0; i < nb_pkts; i++) {
- diff_tsc = now - prev_tsc;
- timer_tsc += diff_tsc;
-
- if ((pkts[i]->ol_flags & timestamp_dynflag) == 0
- && (timer_tsc >= samp_intvl)) {
- *timestamp_dynfield(pkts[i]) = now;
- pkts[i]->ol_flags |= timestamp_dynflag;
- timer_tsc = 0;
+ /* Check without locking */
+ if (likely(tsc_before(now, rte_atomic_load_explicit(&next_tsc, rte_memory_order_relaxed))))
+ return nb_pkts;
+
+ /* Try to take a sample; skip if another core is already sampling. */
+ if (likely(rte_spinlock_trylock(&sample_lock))) {
+ for (i = 0; i < nb_pkts; i++) {
+ struct rte_mbuf *m = pkts[i];
+
+ /* skip if already timestamped */
+ if (unlikely(m->ol_flags & timestamp_dynflag))
+ continue;
+
+ m->ol_flags |= timestamp_dynflag;
+ *timestamp_dynfield(m) = now;
+ rte_atomic_store_explicit(&next_tsc, now + samp_intvl, rte_memory_order_relaxed);
+ break;
}
- prev_tsc = now;
- now = rte_rdtsc();
+ rte_spinlock_unlock(&sample_lock);
}
return nb_pkts;
@@ -270,6 +282,7 @@ rte_latencystats_init(uint64_t app_samp_intvl,
glob_stats = mz->addr;
rte_spinlock_init(&glob_stats->lock);
samp_intvl = (uint64_t)(app_samp_intvl * cycles_per_ns);
+ next_tsc = rte_rdtsc();
/** Register latency stats with stats library */
for (i = 0; i < NUM_LATENCY_STATS; i++)
--
2.47.2
* [PATCH v3 1/2] latencystats: fix receive sample MP issues
[not found] ` <20250617150252.814215-1-stephen@networkplumber.org>
@ 2025-06-17 15:00 ` Stephen Hemminger
2025-06-25 11:31 ` Varghese, Vipin
0 siblings, 1 reply; 4+ messages in thread
From: Stephen Hemminger @ 2025-06-17 15:00 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger, stable
The receive callback was not safe with multiple queues.
If one receive queue callback decides to take a sample,
it needs to add that sample and atomically update the
previous TSC sample value. Add a new lock for that.
Optimize the check for when to take a sample so that
the lock is only taken when a sample is likely due.
Also, add code to handle TSC wraparound in the comparison.
Perhaps this should move to rte_cycles.h?
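As a sanity check on the sampling cadence this gives, a standalone
simulation (no DPDK dependencies; burst spacing and interval are
made-up numbers):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            const uint64_t samp_intvl = 1000; /* cycles between samples */
            uint64_t next_tsc = 0, samples = 0;

            /* Bursts arrive every 100 cycles for 100000 cycles. */
            for (uint64_t now = 0; now < 100000; now += 100) {
                    if ((int64_t)(now - next_tsc) >= 0) { /* sample due */
                            samples++;            /* one per interval */
                            next_tsc = now + samp_intvl;
                    }
            }
            printf("samples: %" PRIu64 "\n", samples); /* prints 100 */
            return 0;
    }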
Bugzilla ID: 1723
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Fixes: 5cd3cac9ed22 ("latency: added new library for latency stats")
Cc: stable@dpdk.org
---
lib/latencystats/rte_latencystats.c | 55 ++++++++++++++++++-----------
1 file changed, 35 insertions(+), 20 deletions(-)
diff --git a/lib/latencystats/rte_latencystats.c b/lib/latencystats/rte_latencystats.c
index 6873a44a92..72a58d78d1 100644
--- a/lib/latencystats/rte_latencystats.c
+++ b/lib/latencystats/rte_latencystats.c
@@ -22,6 +22,7 @@
#include <rte_metrics.h>
#include <rte_spinlock.h>
#include <rte_string_fns.h>
+#include <rte_stdatomic.h>
#include "rte_latencystats.h"
@@ -45,11 +46,20 @@ timestamp_dynfield(struct rte_mbuf *mbuf)
timestamp_dynfield_offset, rte_mbuf_timestamp_t *);
}
+/* Compare two 64-bit timer counters, dealing with wraparound correctly. */
+static inline bool tsc_after(uint64_t t0, uint64_t t1)
+{
+ return (int64_t)(t1 - t0) < 0;
+}
+
+#define tsc_before(a, b) tsc_after(b, a)
+
static const char *MZ_RTE_LATENCY_STATS = "rte_latencystats";
static int latency_stats_index;
+
+static rte_spinlock_t sample_lock = RTE_SPINLOCK_INITIALIZER;
static uint64_t samp_intvl;
-static uint64_t timer_tsc;
-static uint64_t prev_tsc;
+static RTE_ATOMIC(uint64_t) next_tsc;
#define LATENCY_AVG_SCALE 4
#define LATENCY_JITTER_SCALE 16
@@ -147,25 +157,29 @@ add_time_stamps(uint16_t pid __rte_unused,
void *user_cb __rte_unused)
{
unsigned int i;
- uint64_t diff_tsc, now;
-
- /*
- * For every sample interval,
- * time stamp is marked on one received packet.
- */
- now = rte_rdtsc();
- for (i = 0; i < nb_pkts; i++) {
- diff_tsc = now - prev_tsc;
- timer_tsc += diff_tsc;
-
- if ((pkts[i]->ol_flags & timestamp_dynflag) == 0
- && (timer_tsc >= samp_intvl)) {
- *timestamp_dynfield(pkts[i]) = now;
- pkts[i]->ol_flags |= timestamp_dynflag;
- timer_tsc = 0;
+ uint64_t now = rte_rdtsc();
+
+ /* Check without locking */
+ if (likely(tsc_before(now, rte_atomic_load_explicit(&next_tsc,
+ rte_memory_order_relaxed))))
+ return nb_pkts;
+
+ /* Try to take a sample; skip if another core is already sampling. */
+ if (likely(rte_spinlock_trylock(&sample_lock))) {
+ for (i = 0; i < nb_pkts; i++) {
+ struct rte_mbuf *m = pkts[i];
+
+ /* skip if already timestamped */
+ if (unlikely(m->ol_flags & timestamp_dynflag))
+ continue;
+
+ m->ol_flags |= timestamp_dynflag;
+ *timestamp_dynfield(m) = now;
+ rte_atomic_store_explicit(&next_tsc, now + samp_intvl,
+ rte_memory_order_relaxed);
+ break;
}
- prev_tsc = now;
- now = rte_rdtsc();
+ rte_spinlock_unlock(&sample_lock);
}
return nb_pkts;
@@ -270,6 +284,7 @@ rte_latencystats_init(uint64_t app_samp_intvl,
glob_stats = mz->addr;
rte_spinlock_init(&glob_stats->lock);
samp_intvl = (uint64_t)(app_samp_intvl * cycles_per_ns);
+ next_tsc = rte_rdtsc();
/** Register latency stats with stats library */
for (i = 0; i < NUM_LATENCY_STATS; i++)
--
2.47.2
* RE: [PATCH v3 1/2] latencystats: fix receive sample MP issues
2025-06-17 15:00 ` [PATCH v3 " Stephen Hemminger
@ 2025-06-25 11:31 ` Varghese, Vipin
0 siblings, 0 replies; 4+ messages in thread
From: Varghese, Vipin @ 2025-06-25 11:31 UTC (permalink / raw)
To: Stephen Hemminger, dev, David Marchand; +Cc: stable
Hi David & Stephen,
> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Tuesday, June 17, 2025 8:30 PM
> To: dev@dpdk.org
> Cc: Stephen Hemminger <stephen@networkplumber.org>; stable@dpdk.org
> Subject: [PATCH v3 1/2] latencystats: fix receive sample MP issues
>
> [quoted v3 patch body snipped; identical to the patch above]
Application: testpmd io mode with latency-stats enabled
CPU: AMD EPYC 7713 64-Core Processor (AVX2) Huge page: 1GB pages * 32
NIC: Intel E810 1CQ DA2, 1 * 100Gbps
+++++++++++++++++++++++++++++
Firmware: 3.20 0x8000d83e 1.3146.0
DDP: comms package 1.3.53
With no args, Before patch (min, max, avg, jitter)
- 1Q: 30ns, 27432ns, 94ns, 19
- 4Q: 30ns, 27722ns, 95ns, 20
With no args, After Patch (min, max, avg, jitter)
- 1Q: 40ns, 19136ns, 47ns, 5
- 4Q: 10ns, 18334ns, 194ns, 64
With args: rx_low_latency=1, Before patch (min, max, avg, jitter)
- 1Q: 30ns, 27432ns, 94ns, 19
- 4Q: 30ns, 27722ns, 95ns, 20
With args: rx_low_latency=1, After Patch (min, max, avg, jitter)
- 1Q: 40ns, 21631ns, 74ns, 12
- 4Q: 10ns, 23725ns, 116ns, 112
With Solarflare NIC:
+++++++++++++++
Throughput profile, After Patch (min, max, avg, jitter)
- 1Q: 10ns, 23115ns, 96ns, 65
- 4Q: 10ns, 2981ns, 136ns, 140
Low-latency profile, After Patch (min, max, avg, jitter)
- 1Q: 10ns, 19399ns, 367ns, 238
- 4Q: 10ns, 19970ns, 127ns, 100
Our understanding is as follows:
1. The increase in multi-queue latency is attributable to the spinlock.
2. The lower latency with the patch in multi-queue mode is because the lowest value across all queues is taken into account.
Question: will per-queue min, max, and avg stats be added in the future (a hypothetical shape is sketched below)?
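For discussion only, a hypothetical shape such an enhancement could
take; neither the struct nor the function below exists in the
library today:

    /* Invented names, for illustration only. */
    struct rte_latency_qstats {
            uint64_t min_ns, max_ns, avg_ns, jitter_ns;
    };

    /* int rte_latencystats_get_by_queue(uint16_t port_id,
     *          uint16_t queue_id,
     *          struct rte_latency_qstats *stats); */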
Tested-by: Thiyagarajan P <Thiyagarajan.P@amd.com>
Reviewed-by: Vipin Varghese <Vipin.Varghese@amd.com>