* [PATCH 0/2] Latencystat optimization and fix
@ 2025-06-13 0:34 Stephen Hemminger
2025-06-13 0:34 ` [PATCH 1/2] latencystats: fix receive sample MP issues Stephen Hemminger
` (4 more replies)
0 siblings, 5 replies; 12+ messages in thread
From: Stephen Hemminger @ 2025-06-13 0:34 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
After investigating latencystats test failures, I discovered that the
library was poorly designed to handle bursts and multiple queues.
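For context, a minimal sketch of how an application enables this
library (illustrative only; the helper name is mine, error handling is
omitted, and NULL skips the optional per-flow classification callback):

#include <rte_latencystats.h>

/* Enable latency sampling before starting ports. The interval is
 * multiplied by cycles-per-ns inside the library (see the init code
 * below), i.e. it is expressed in nanoseconds. Returns 0 on success. */
static int
enable_latencystats(void)
{
	return rte_latencystats_init(1000 * 1000 /* 1 ms */, NULL);
}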
Stephen Hemminger (2):
latencystats: fix receive sample MP issues
latencystats: optimize locking on transmit
lib/latencystats/rte_latencystats.c | 64 +++++++++++++++++++----------
1 file changed, 42 insertions(+), 22 deletions(-)
--
2.47.2
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH 1/2] latencystats: fix receive sample MP issues
2025-06-13 0:34 [PATCH 0/2] Latencystat optimization and fix Stephen Hemminger
@ 2025-06-13 0:34 ` Stephen Hemminger
2025-06-13 0:34 ` [PATCH 2/2] latencystats: optimize locking on transmit Stephen Hemminger
` (3 subsequent siblings)
4 siblings, 0 replies; 12+ messages in thread
From: Stephen Hemminger @ 2025-06-13 0:34 UTC (permalink / raw)
To: dev
Cc: Stephen Hemminger, stable, Reshma Pattan, Harry van Haaren, Remy Horton
The receive callback was not safe with multiple queues.
If one receive queue's callback decides to take a sample, it
needs to add that sample and atomically update the previous
TSC sample value. Add a new lock for that.
Also, add code to handle TSC wraparound in the comparison.
Perhaps this should move to rte_cycles.h?
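A standalone illustration of why the signed-difference comparison
below survives counter wraparound (hypothetical values, not part of
the patch):

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same idea as tsc_after() in this patch: the signed difference is
 * correct as long as the two counters are within 2^63 of each other. */
static inline bool tsc_after(uint64_t t0, uint64_t t1)
{
	return (int64_t)(t1 - t0) < 0;
}

int main(void)
{
	uint64_t before_wrap = UINT64_MAX - 5;
	uint64_t after_wrap = 10;	/* counter wrapped past zero */

	/* a naive 'after_wrap > before_wrap' would be false here */
	assert(tsc_after(after_wrap, before_wrap));
	return 0;
}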
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Fixes: 5cd3cac9ed22 ("latency: added new library for latency stats")
Cc: stable@dpdk.org
---
lib/latencystats/rte_latencystats.c | 47 ++++++++++++++++++-----------
1 file changed, 29 insertions(+), 18 deletions(-)
diff --git a/lib/latencystats/rte_latencystats.c b/lib/latencystats/rte_latencystats.c
index 6873a44a92..181c53dd0e 100644
--- a/lib/latencystats/rte_latencystats.c
+++ b/lib/latencystats/rte_latencystats.c
@@ -45,10 +45,19 @@ timestamp_dynfield(struct rte_mbuf *mbuf)
timestamp_dynfield_offset, rte_mbuf_timestamp_t *);
}
+/* Compare two 64 bit timer counter but deal with wraparound correctly. */
+static inline bool tsc_after(uint64_t t0, uint64_t t1)
+{
+ return (int64_t)(t1 - t0) < 0;
+}
+
+#define tsc_before(a, b) tsc_after(b, a)
+
static const char *MZ_RTE_LATENCY_STATS = "rte_latencystats";
static int latency_stats_index;
+
+static rte_spinlock_t sample_lock = RTE_SPINLOCK_INITIALIZER;
static uint64_t samp_intvl;
-static uint64_t timer_tsc;
static uint64_t prev_tsc;
#define LATENCY_AVG_SCALE 4
@@ -147,25 +156,27 @@ add_time_stamps(uint16_t pid __rte_unused,
void *user_cb __rte_unused)
{
unsigned int i;
- uint64_t diff_tsc, now;
+ uint64_t now = rte_rdtsc();
- /*
- * For every sample interval,
- * time stamp is marked on one received packet.
- */
- now = rte_rdtsc();
- for (i = 0; i < nb_pkts; i++) {
- diff_tsc = now - prev_tsc;
- timer_tsc += diff_tsc;
-
- if ((pkts[i]->ol_flags & timestamp_dynflag) == 0
- && (timer_tsc >= samp_intvl)) {
- *timestamp_dynfield(pkts[i]) = now;
- pkts[i]->ol_flags |= timestamp_dynflag;
- timer_tsc = 0;
+ /* Check without locking */
+ if (likely(tsc_before(now, prev_tsc + samp_intvl)))
+ return nb_pkts;
+
+ /* Try and get sample, skip if sample is being done by other core. */
+ if (likely(rte_spinlock_trylock(&sample_lock))) {
+ for (i = 0; i < nb_pkts; i++) {
+ struct rte_mbuf *m = pkts[i];
+
+ /* skip if already timestamped */
+ if (unlikely(m->ol_flags & timestamp_dynflag))
+ continue;
+
+ m->ol_flags |= timestamp_dynflag;
+ *timestamp_dynfield(m) = now;
+ prev_tsc = now;
+ break;
}
- prev_tsc = now;
- now = rte_rdtsc();
+ rte_spinlock_unlock(&sample_lock);
}
return nb_pkts;
--
2.47.2
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH 2/2] latencystats: optimize locking on transmit
2025-06-13 0:34 [PATCH 0/2] Latencystat optimization and fix Stephen Hemminger
2025-06-13 0:34 ` [PATCH 1/2] latencystats: fix receive sample MP issues Stephen Hemminger
@ 2025-06-13 0:34 ` Stephen Hemminger
2025-06-13 11:13 ` [PATCH 0/2] Latencystat optimization and fix Varghese, Vipin
` (2 subsequent siblings)
4 siblings, 0 replies; 12+ messages in thread
From: Stephen Hemminger @ 2025-06-13 0:34 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger, Reshma Pattan
If the transmit callback is called and no packets in the burst
have a timestamp set, then the expensive operations of locking
and reading the TSC can be skipped.
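The fast path amounts to a branch-free OR of the flag bit across the
burst; a standalone sketch of the idiom (it mirrors the diff below,
the helper name is mine, and timestamp_dynflag is assumed to be the
registered dynamic flag bit):

#include <stdbool.h>
#include <stdint.h>
#include <rte_mbuf.h>

/* Accumulate the flag across the whole burst without branching;
 * only if at least one packet carries a timestamp do we pay for
 * rte_rdtsc() and the stats spinlock. */
static inline bool
burst_has_timestamp(struct rte_mbuf **pkts, uint16_t nb_pkts,
		uint64_t timestamp_dynflag)
{
	uint64_t ts_flags = 0;
	uint16_t i;

	for (i = 0; i < nb_pkts; i++)
		ts_flags |= pkts[i]->ol_flags & timestamp_dynflag;

	return ts_flags != 0;
}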
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
lib/latencystats/rte_latencystats.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/lib/latencystats/rte_latencystats.c b/lib/latencystats/rte_latencystats.c
index 181c53dd0e..dc415c4682 100644
--- a/lib/latencystats/rte_latencystats.c
+++ b/lib/latencystats/rte_latencystats.c
@@ -190,17 +190,26 @@ calc_latency(uint16_t pid __rte_unused,
void *_ __rte_unused)
{
unsigned int i;
- uint64_t now, latency;
+ uint64_t ts_flags = 0;
static uint64_t prev_latency;
- now = rte_rdtsc();
+ for (i = 0; i < nb_pkts; i++)
+ ts_flags |= (pkts[i]->ol_flags & timestamp_dynflag);
+ /* no samples in this burst, skip locking */
+ if (likely(ts_flags == 0))
+ return nb_pkts;
+
+ uint64_t now = rte_rdtsc();
rte_spinlock_lock(&glob_stats->lock);
for (i = 0; i < nb_pkts; i++) {
- if (!(pkts[i]->ol_flags & timestamp_dynflag))
+ struct rte_mbuf *m = pkts[i];
+ uint64_t latency;
+
+ if (!(m->ol_flags & timestamp_dynflag))
continue;
- latency = now - *timestamp_dynfield(pkts[i]);
+ latency = now - *timestamp_dynfield(m);
if (glob_stats->samples++ == 0) {
glob_stats->min_latency = latency;
--
2.47.2
^ permalink raw reply [flat|nested] 12+ messages in thread
* RE: [PATCH 0/2] Latencystat optimization and fix
2025-06-13 0:34 [PATCH 0/2] Latencystat optimization and fix Stephen Hemminger
2025-06-13 0:34 ` [PATCH 1/2] latencystats: fix receive sample MP issues Stephen Hemminger
2025-06-13 0:34 ` [PATCH 2/2] latencystats: optimize locking on transmit Stephen Hemminger
@ 2025-06-13 11:13 ` Varghese, Vipin
2025-06-16 16:04 ` [PATCH v2 0/2] Latencystats optimizations " Stephen Hemminger
2025-06-17 15:00 ` [PATCH v3 0/2] latencystats: fix receive sample MP issues Stephen Hemminger
4 siblings, 0 replies; 12+ messages in thread
From: Varghese, Vipin @ 2025-06-13 11:13 UTC (permalink / raw)
To: Stephen Hemminger, dev
> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Friday, June 13, 2025 6:04 AM
> To: dev@dpdk.org
> Cc: Stephen Hemminger <stephen@networkplumber.org>
> Subject: [PATCH 0/2] Latencystat optimization and fix
>
> After investigating latencystats test failures, I discovered that the library was
> poorly designed to handle bursts and multiple queues.
Acked-by: Vipin Varghese <Vipin.Varghese@amd.com>
Reviewed-by: Vipin Varghese <Vipin.Varghese@amd.com>
Tested-by: Thiyagarajan P <Thiyagarajan.P@amd.com>
We tested the changes using DPDK testpmd; a summary of the results is below:
1. before vs after patch with 1Q: min: -10ns, max: -5000ns, avg: -10ns, jitter: +5ns
2. before vs after patch with 4Q: min: -10ns, max: -7000ns, avg: +5ns, jitter: +80ns
Since we see greater variation with multiple queues than with a single queue, we also collected hardware traffic generator measurements before vs. after.
The values we got are `min: -50000ns, max: -62000ns, avg: -5300ns, jitter: 6ns`.
NIC: Intel E810 2CQ-DA2 (used 1 100Gbps port), firmware: 4.7, DDP: default
Processor: AMD EPYC 8534P 64-Core Processor
Traffic Rate: 100Gbps
Application commands:
1. DPDK: ` sudo build/app/dpdk-testpmd -l 11,16,17,18,19 -a 42:00.0 --file-prefix tr01 --proc-type=primary --force-max-simd-bitwidth=512 -- --nb-cores=4 --nb-ports=1 --txq=4 --rxq=4 --txd=2048 --rxd=2048 --burst=64 --rss-udp --latencystats=16 -i -a `
2. Proc-Info: ` sudo build/app/dpdk-proc-info --file-prefix=tr01 --proc-type=secondary -a 42:00.0 -- --metrics `
>
> Stephen Hemminger (2):
> latencystats: fix receive sample MP issues
> latencystats: optimize locking on transmit
>
> lib/latencystats/rte_latencystats.c | 64 +++++++++++++++++++----------
> 1 file changed, 42 insertions(+), 22 deletions(-)
>
> --
> 2.47.2
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v2 0/2] Latencystats optimizations and fix
2025-06-13 0:34 [PATCH 0/2] Latencystat optimization and fix Stephen Hemminger
` (2 preceding siblings ...)
2025-06-13 11:13 ` [PATCH 0/2] Latencystat optimization and fix Varghese, Vipin
@ 2025-06-16 16:04 ` Stephen Hemminger
2025-06-16 16:04 ` [PATCH v2 1/2] latencystats: fix receive sample MP issues Stephen Hemminger
2025-06-16 16:04 ` [PATCH v2 2/2] latencystats: optimize locking on transmit Stephen Hemminger
2025-06-17 15:00 ` [PATCH v3 0/2] latencystats: fix receive sample MP issues Stephen Hemminger
4 siblings, 2 replies; 12+ messages in thread
From: Stephen Hemminger @ 2025-06-16 16:04 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
After investigating latencystats test failures, I discovered that the
library was poorly designed to handle bursts and multiple queues.
v2:
- use atomics so the unlocked interval check is safe on weakly-ordered platforms
Stephen Hemminger (2):
latencystats: fix receive sample MP issues
latencystats: optimize locking on transmit
lib/latencystats/rte_latencystats.c | 68 +++++++++++++++++++----------
1 file changed, 45 insertions(+), 23 deletions(-)
--
2.47.2
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v2 1/2] latencystats: fix receive sample MP issues
2025-06-16 16:04 ` [PATCH v2 0/2] Latencystats optimizations " Stephen Hemminger
@ 2025-06-16 16:04 ` Stephen Hemminger
2025-06-16 16:04 ` [PATCH v2 2/2] latencystats: optimize locking on transmit Stephen Hemminger
1 sibling, 0 replies; 12+ messages in thread
From: Stephen Hemminger @ 2025-06-16 16:04 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger, stable
The receive callback was not safe with multiple queues.
If one receive queue's callback decides to take a sample, it
needs to add that sample and atomically update the previous
TSC sample value. Add a new lock for that.
Optimize the check for when to take a sample so that the
lock is only taken when a sample is likely to be needed.
Also, add code to handle TSC wraparound in the comparison.
Perhaps this should move to rte_cycles.h?
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Fixes: 5cd3cac9ed22 ("latency: added new library for latency stats")
Cc: stable@dpdk.org
---
lib/latencystats/rte_latencystats.c | 51 ++++++++++++++++++-----------
1 file changed, 32 insertions(+), 19 deletions(-)
diff --git a/lib/latencystats/rte_latencystats.c b/lib/latencystats/rte_latencystats.c
index 6873a44a92..2b994656fb 100644
--- a/lib/latencystats/rte_latencystats.c
+++ b/lib/latencystats/rte_latencystats.c
@@ -22,6 +22,7 @@
#include <rte_metrics.h>
#include <rte_spinlock.h>
#include <rte_string_fns.h>
+#include <rte_stdatomic.h>
#include "rte_latencystats.h"
@@ -45,11 +46,20 @@ timestamp_dynfield(struct rte_mbuf *mbuf)
timestamp_dynfield_offset, rte_mbuf_timestamp_t *);
}
+/* Compare two 64 bit timer counter but deal with wraparound correctly. */
+static inline bool tsc_after(uint64_t t0, uint64_t t1)
+{
+ return (int64_t)(t1 - t0) < 0;
+}
+
+#define tsc_before(a, b) tsc_after(b, a)
+
static const char *MZ_RTE_LATENCY_STATS = "rte_latencystats";
static int latency_stats_index;
+
+static rte_spinlock_t sample_lock = RTE_SPINLOCK_INITIALIZER;
static uint64_t samp_intvl;
-static uint64_t timer_tsc;
-static uint64_t prev_tsc;
+static RTE_ATOMIC(uint64_t) next_tsc;
#define LATENCY_AVG_SCALE 4
#define LATENCY_JITTER_SCALE 16
@@ -147,25 +157,27 @@ add_time_stamps(uint16_t pid __rte_unused,
void *user_cb __rte_unused)
{
unsigned int i;
- uint64_t diff_tsc, now;
+ uint64_t now = rte_rdtsc();
- /*
- * For every sample interval,
- * time stamp is marked on one received packet.
- */
- now = rte_rdtsc();
- for (i = 0; i < nb_pkts; i++) {
- diff_tsc = now - prev_tsc;
- timer_tsc += diff_tsc;
-
- if ((pkts[i]->ol_flags & timestamp_dynflag) == 0
- && (timer_tsc >= samp_intvl)) {
- *timestamp_dynfield(pkts[i]) = now;
- pkts[i]->ol_flags |= timestamp_dynflag;
- timer_tsc = 0;
+ /* Check without locking */
+ if (likely(tsc_before(now, rte_atomic_load_explicit(&next_tsc, rte_memory_order_relaxed))))
+ return nb_pkts;
+
+ /* Try and get sample, skip if sample is being done by other core. */
+ if (likely(rte_spinlock_trylock(&sample_lock))) {
+ for (i = 0; i < nb_pkts; i++) {
+ struct rte_mbuf *m = pkts[i];
+
+ /* skip if already timestamped */
+ if (unlikely(m->ol_flags & timestamp_dynflag))
+ continue;
+
+ m->ol_flags |= timestamp_dynflag;
+ *timestamp_dynfield(m) = now;
+ rte_atomic_store_explicit(&next_tsc, now + samp_intvl, rte_memory_order_relaxed);
+ break;
}
- prev_tsc = now;
- now = rte_rdtsc();
+ rte_spinlock_unlock(&sample_lock);
}
return nb_pkts;
@@ -270,6 +282,7 @@ rte_latencystats_init(uint64_t app_samp_intvl,
glob_stats = mz->addr;
rte_spinlock_init(&glob_stats->lock);
samp_intvl = (uint64_t)(app_samp_intvl * cycles_per_ns);
+ next_tsc = rte_rdtsc();
/** Register latency stats with stats library */
for (i = 0; i < NUM_LATENCY_STATS; i++)
--
2.47.2
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v2 2/2] latencystats: optimize locking on transmit
2025-06-16 16:04 ` [PATCH v2 0/2] Latencystats optimizations " Stephen Hemminger
2025-06-16 16:04 ` [PATCH v2 1/2] latencystats: fix receive sample MP issues Stephen Hemminger
@ 2025-06-16 16:04 ` Stephen Hemminger
1 sibling, 0 replies; 12+ messages in thread
From: Stephen Hemminger @ 2025-06-16 16:04 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
If the transmit callback is called and no packets in the burst
have a timestamp set, then the expensive operations of locking
and reading the TSC can be skipped.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
lib/latencystats/rte_latencystats.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/lib/latencystats/rte_latencystats.c b/lib/latencystats/rte_latencystats.c
index 2b994656fb..cd790fd49f 100644
--- a/lib/latencystats/rte_latencystats.c
+++ b/lib/latencystats/rte_latencystats.c
@@ -191,17 +191,26 @@ calc_latency(uint16_t pid __rte_unused,
void *_ __rte_unused)
{
unsigned int i;
- uint64_t now, latency;
+ uint64_t ts_flags = 0;
static uint64_t prev_latency;
- now = rte_rdtsc();
+ for (i = 0; i < nb_pkts; i++)
+ ts_flags |= (pkts[i]->ol_flags & timestamp_dynflag);
+ /* no samples in this burst, skip locking */
+ if (likely(ts_flags == 0))
+ return nb_pkts;
+
+ uint64_t now = rte_rdtsc();
rte_spinlock_lock(&glob_stats->lock);
for (i = 0; i < nb_pkts; i++) {
- if (!(pkts[i]->ol_flags & timestamp_dynflag))
+ struct rte_mbuf *m = pkts[i];
+ uint64_t latency;
+
+ if (!(m->ol_flags & timestamp_dynflag))
continue;
- latency = now - *timestamp_dynfield(pkts[i]);
+ latency = now - *timestamp_dynfield(m);
if (glob_stats->samples++ == 0) {
glob_stats->min_latency = latency;
--
2.47.2
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v3 0/2] latencystats: fix receive sample MP issues
2025-06-13 0:34 [PATCH 0/2] Latencystat optimization and fix Stephen Hemminger
` (3 preceding siblings ...)
2025-06-16 16:04 ` [PATCH v2 0/2] Latencystats optimizations " Stephen Hemminger
@ 2025-06-17 15:00 ` Stephen Hemminger
2025-06-17 15:00 ` [PATCH v3 1/2] " Stephen Hemminger
2025-06-17 15:00 ` [PATCH v3 2/2] latencystats: optimize locking on transmit Stephen Hemminger
4 siblings, 2 replies; 12+ messages in thread
From: Stephen Hemminger @ 2025-06-17 15:00 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
After investigating latencystats test failures, I discovered that the
library was poorly designed to handle bursts and multiple queues.
v3:
- reduce the scope of the changes and break a line that was too long
Stephen Hemminger (2):
latencystats: fix receive sample MP issues
latencystats: optimize locking on transmit
lib/latencystats/rte_latencystats.c | 64 +++++++++++++++++++----------
1 file changed, 43 insertions(+), 21 deletions(-)
--
2.47.2
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v3 1/2] latencystats: fix receive sample MP issues
2025-06-17 15:00 ` [PATCH v3 0/2] latencystats: fix receive sample MP issues Stephen Hemminger
@ 2025-06-17 15:00 ` Stephen Hemminger
2025-06-25 11:31 ` Varghese, Vipin
2025-06-17 15:00 ` [PATCH v3 2/2] latencystats: optimize locking on transmit Stephen Hemminger
1 sibling, 1 reply; 12+ messages in thread
From: Stephen Hemminger @ 2025-06-17 15:00 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger, stable
The receive callback was not safe with multiple queues.
If one receive queue's callback decides to take a sample, it
needs to add that sample and atomically update the previous
TSC sample value. Add a new lock for that.
Optimize the check for when to take a sample so that the
lock is only taken when a sample is likely to be needed.
Also, add code to handle TSC wraparound in the comparison.
Perhaps this should move to rte_cycles.h?
Bugzilla ID: 1723
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Fixes: 5cd3cac9ed22 ("latency: added new library for latency stats")
Cc: stable@dpdk.org
---
lib/latencystats/rte_latencystats.c | 55 ++++++++++++++++++-----------
1 file changed, 35 insertions(+), 20 deletions(-)
diff --git a/lib/latencystats/rte_latencystats.c b/lib/latencystats/rte_latencystats.c
index 6873a44a92..72a58d78d1 100644
--- a/lib/latencystats/rte_latencystats.c
+++ b/lib/latencystats/rte_latencystats.c
@@ -22,6 +22,7 @@
#include <rte_metrics.h>
#include <rte_spinlock.h>
#include <rte_string_fns.h>
+#include <rte_stdatomic.h>
#include "rte_latencystats.h"
@@ -45,11 +46,20 @@ timestamp_dynfield(struct rte_mbuf *mbuf)
timestamp_dynfield_offset, rte_mbuf_timestamp_t *);
}
+/* Compare two 64 bit timer counter but deal with wraparound correctly. */
+static inline bool tsc_after(uint64_t t0, uint64_t t1)
+{
+ return (int64_t)(t1 - t0) < 0;
+}
+
+#define tsc_before(a, b) tsc_after(b, a)
+
static const char *MZ_RTE_LATENCY_STATS = "rte_latencystats";
static int latency_stats_index;
+
+static rte_spinlock_t sample_lock = RTE_SPINLOCK_INITIALIZER;
static uint64_t samp_intvl;
-static uint64_t timer_tsc;
-static uint64_t prev_tsc;
+static RTE_ATOMIC(uint64_t) next_tsc;
#define LATENCY_AVG_SCALE 4
#define LATENCY_JITTER_SCALE 16
@@ -147,25 +157,29 @@ add_time_stamps(uint16_t pid __rte_unused,
void *user_cb __rte_unused)
{
unsigned int i;
- uint64_t diff_tsc, now;
-
- /*
- * For every sample interval,
- * time stamp is marked on one received packet.
- */
- now = rte_rdtsc();
- for (i = 0; i < nb_pkts; i++) {
- diff_tsc = now - prev_tsc;
- timer_tsc += diff_tsc;
-
- if ((pkts[i]->ol_flags & timestamp_dynflag) == 0
- && (timer_tsc >= samp_intvl)) {
- *timestamp_dynfield(pkts[i]) = now;
- pkts[i]->ol_flags |= timestamp_dynflag;
- timer_tsc = 0;
+ uint64_t now = rte_rdtsc();
+
+ /* Check without locking */
+ if (likely(tsc_before(now, rte_atomic_load_explicit(&next_tsc,
+ rte_memory_order_relaxed))))
+ return nb_pkts;
+
+ /* Try and get sample, skip if sample is being done by other core. */
+ if (likely(rte_spinlock_trylock(&sample_lock))) {
+ for (i = 0; i < nb_pkts; i++) {
+ struct rte_mbuf *m = pkts[i];
+
+ /* skip if already timestamped */
+ if (unlikely(m->ol_flags & timestamp_dynflag))
+ continue;
+
+ m->ol_flags |= timestamp_dynflag;
+ *timestamp_dynfield(m) = now;
+ rte_atomic_store_explicit(&next_tsc, now + samp_intvl,
+ rte_memory_order_relaxed);
+ break;
}
- prev_tsc = now;
- now = rte_rdtsc();
+ rte_spinlock_unlock(&sample_lock);
}
return nb_pkts;
@@ -270,6 +284,7 @@ rte_latencystats_init(uint64_t app_samp_intvl,
glob_stats = mz->addr;
rte_spinlock_init(&glob_stats->lock);
samp_intvl = (uint64_t)(app_samp_intvl * cycles_per_ns);
+ next_tsc = rte_rdtsc();
/** Register latency stats with stats library */
for (i = 0; i < NUM_LATENCY_STATS; i++)
--
2.47.2
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v3 2/2] latencystats: optimize locking on transmit
2025-06-17 15:00 ` [PATCH v3 0/2] latencystats: fix receive sample MP issues Stephen Hemminger
2025-06-17 15:00 ` [PATCH v3 1/2] " Stephen Hemminger
@ 2025-06-17 15:00 ` Stephen Hemminger
1 sibling, 0 replies; 12+ messages in thread
From: Stephen Hemminger @ 2025-06-17 15:00 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
If the transmit callback is called and no packets in the burst
have a timestamp set, then the expensive operations of locking
and reading the TSC can be skipped.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
lib/latencystats/rte_latencystats.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/lib/latencystats/rte_latencystats.c b/lib/latencystats/rte_latencystats.c
index 72a58d78d1..f61d5a273f 100644
--- a/lib/latencystats/rte_latencystats.c
+++ b/lib/latencystats/rte_latencystats.c
@@ -194,10 +194,17 @@ calc_latency(uint16_t pid __rte_unused,
{
unsigned int i;
uint64_t now, latency;
+ uint64_t ts_flags = 0;
static uint64_t prev_latency;
- now = rte_rdtsc();
+ for (i = 0; i < nb_pkts; i++)
+ ts_flags |= (pkts[i]->ol_flags & timestamp_dynflag);
+
+ /* no samples in this burst, skip locking */
+ if (likely(ts_flags == 0))
+ return nb_pkts;
+ now = rte_rdtsc();
rte_spinlock_lock(&glob_stats->lock);
for (i = 0; i < nb_pkts; i++) {
if (!(pkts[i]->ol_flags & timestamp_dynflag))
--
2.47.2
^ permalink raw reply [flat|nested] 12+ messages in thread
* RE: [PATCH v3 1/2] latencystats: fix receive sample MP issues
2025-06-17 15:00 ` [PATCH v3 1/2] " Stephen Hemminger
@ 2025-06-25 11:31 ` Varghese, Vipin
2025-06-26 15:01 ` Stephen Hemminger
0 siblings, 1 reply; 12+ messages in thread
From: Varghese, Vipin @ 2025-06-25 11:31 UTC (permalink / raw)
To: Stephen Hemminger, dev, David Marchand; +Cc: stable
Hi David & Stephen,
> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Tuesday, June 17, 2025 8:30 PM
> To: dev@dpdk.org
> Cc: Stephen Hemminger <stephen@networkplumber.org>; stable@dpdk.org
> Subject: [PATCH v3 1/2] latencystats: fix receive sample MP issues
>
> The receive callback was not safe with multiple queues.
> If one receive queue's callback decides to take a sample, it needs to add that
> sample and atomically update the previous TSC sample value. Add a new lock for that.
>
> Optimize the check for when to take a sample so that the lock is only taken when
> a sample is likely to be needed.
>
> Also, add code to handle TSC wraparound in the comparison.
> Perhaps this should move to rte_cycles.h?
>
> Bugzilla ID: 1723
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> Fixes: 5cd3cac9ed22 ("latency: added new library for latency stats")
> Cc: stable@dpdk.org
> ---
> lib/latencystats/rte_latencystats.c | 55 ++++++++++++++++++-----------
> 1 file changed, 35 insertions(+), 20 deletions(-)
>
> diff --git a/lib/latencystats/rte_latencystats.c b/lib/latencystats/rte_latencystats.c
> index 6873a44a92..72a58d78d1 100644
> --- a/lib/latencystats/rte_latencystats.c
> +++ b/lib/latencystats/rte_latencystats.c
> @@ -22,6 +22,7 @@
> #include <rte_metrics.h>
> #include <rte_spinlock.h>
> #include <rte_string_fns.h>
> +#include <rte_stdatomic.h>
>
> #include "rte_latencystats.h"
>
> @@ -45,11 +46,20 @@ timestamp_dynfield(struct rte_mbuf *mbuf)
> timestamp_dynfield_offset, rte_mbuf_timestamp_t *);
> }
>
> +/* Compare two 64 bit timer counter but deal with wraparound correctly. */
> +static inline bool tsc_after(uint64_t t0, uint64_t t1)
> +{
> + return (int64_t)(t1 - t0) < 0;
> +}
> +
> +#define tsc_before(a, b) tsc_after(b, a)
> +
> static const char *MZ_RTE_LATENCY_STATS = "rte_latencystats";
> static int latency_stats_index;
> +
> +static rte_spinlock_t sample_lock = RTE_SPINLOCK_INITIALIZER;
> static uint64_t samp_intvl;
> -static uint64_t timer_tsc;
> -static uint64_t prev_tsc;
> +static RTE_ATOMIC(uint64_t) next_tsc;
>
> #define LATENCY_AVG_SCALE 4
> #define LATENCY_JITTER_SCALE 16
> @@ -147,25 +157,29 @@ add_time_stamps(uint16_t pid __rte_unused,
> void *user_cb __rte_unused)
> {
> unsigned int i;
> - uint64_t diff_tsc, now;
> -
> - /*
> - * For every sample interval,
> - * time stamp is marked on one received packet.
> - */
> - now = rte_rdtsc();
> - for (i = 0; i < nb_pkts; i++) {
> - diff_tsc = now - prev_tsc;
> - timer_tsc += diff_tsc;
> -
> - if ((pkts[i]->ol_flags & timestamp_dynflag) == 0
> - && (timer_tsc >= samp_intvl)) {
> - *timestamp_dynfield(pkts[i]) = now;
> - pkts[i]->ol_flags |= timestamp_dynflag;
> - timer_tsc = 0;
> + uint64_t now = rte_rdtsc();
> +
> + /* Check without locking */
> + if (likely(tsc_before(now, rte_atomic_load_explicit(&next_tsc,
> + rte_memory_order_relaxed))))
> + return nb_pkts;
> +
> + /* Try and get sample, skip if sample is being done by other core. */
> + if (likely(rte_spinlock_trylock(&sample_lock))) {
> + for (i = 0; i < nb_pkts; i++) {
> + struct rte_mbuf *m = pkts[i];
> +
> + /* skip if already timestamped */
> + if (unlikely(m->ol_flags & timestamp_dynflag))
> + continue;
> +
> + m->ol_flags |= timestamp_dynflag;
> + *timestamp_dynfield(m) = now;
> + rte_atomic_store_explicit(&next_tsc, now + samp_intvl,
> + rte_memory_order_relaxed);
> + break;
> }
> - prev_tsc = now;
> - now = rte_rdtsc();
> + rte_spinlock_unlock(&sample_lock);
> }
>
> return nb_pkts;
> @@ -270,6 +284,7 @@ rte_latencystats_init(uint64_t app_samp_intvl,
> glob_stats = mz->addr;
> rte_spinlock_init(&glob_stats->lock);
> samp_intvl = (uint64_t)(app_samp_intvl * cycles_per_ns);
> + next_tsc = rte_rdtsc();
>
> /** Register latency stats with stats library */
> for (i = 0; i < NUM_LATENCY_STATS; i++)
Application: testpmd io mode with latency-stats enabled
CPU: AMD EPYC 7713 64-Core Processor (AVX2) Huge page: 1GB pages * 32
Nic: Intel E810 1CQ DA2, 1 * 100Gbps
+++++++++++++++++++++++++++++
Firmware: 3.20 0x8000d83e 1.3146.0
DDP: comms package 1.3.53
With no args, Before patch (min, max, avg, jitter)
- 1Q: 30ns, 27432ns, 94ns, 19
- 4Q: 30ns, 27722ns, 95ns, 20
With no args, After Patch (min, max, avg, jitter)
- 1Q: 40ns, 19136ns, 47ns, 5
- 4Q: 10ns, 18334ns, 194ns, 64
With args: rx_low_latency=1, Before patch (min, max, avg, jitter)
- 1Q: 30ns, 27432ns, 94ns, 19
- 4Q: 30ns, 27722ns, 95ns, 20
With args: rx_low_latency=1, After Patch
- 1Q: 40ns, 21631ns, 74ns, 12
- 4Q: 10ns, 23725ns, 116ns, 112
With Solar flare NIC:
+++++++++++++++
throughput profile; After Patch (min, max, avg, jitter)
- 1Q: 10ns, 23115ns, 96ns, 65
- 4Q: 10ns, 2981ns, 136ns, 140
low-latency profile , After Patch
- 1Q: 10ns, 19399ns, 367ns, 238
- 4Q: 10ns, 19970ns, 127ns, 100
Following is our understanding:
1. The increase in multi-queue latency is attributable to the spinlock.
2. The lower latency with the patch in the multi-queue case is because the lowest value across all queues is taken into account.
Question: will per-queue min, max, and avg stats be added in the future?
Tested-by: Thiyagarajan P <Thiyagarajan.P@amd.com>
Reviewed-by: Vipin Varghese <Vipin.Varghese@amd.com>
> --
> 2.47.2
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v3 1/2] latencystats: fix receive sample MP issues
2025-06-25 11:31 ` Varghese, Vipin
@ 2025-06-26 15:01 ` Stephen Hemminger
0 siblings, 0 replies; 12+ messages in thread
From: Stephen Hemminger @ 2025-06-26 15:01 UTC (permalink / raw)
To: Varghese, Vipin; +Cc: dev, David Marchand, stable
On Wed, 25 Jun 2025 11:31:43 +0000
"Varghese, Vipin" <Vipin.Varghese@amd.com> wrote:
> Following is our understanding:
>
> 1. The increase in multi-queue latency is attributable to the spinlock.
> 2. The lower latency with the patch in the multi-queue case is because the lowest value across all queues is taken into account.
>
> Question: will per-queue min, max, and avg stats be added in the future?
>
> Tested-by: Thiyagarajan P <Thiyagarajan.P@amd.com>
> Reviewed-by: Vipin Varghese <Vipin.Varghese@amd.com>
It would make sense for latencystats to be per-port/per-queue and avoid the
locking, but that would be a significant API change. If you actually use this,
it would be good to make it better.
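A purely hypothetical sketch of what that could look like (not a
proposal from this thread; the struct and its fields are illustrative):
per-queue, cache-aligned counters that the datapath updates without a
lock, aggregated only when the stats are read.

#include <stdint.h>
#include <rte_common.h>

/* Hypothetical per-(port, queue) latency record. Because each queue's
 * callback owns its own cache line, no spinlock is needed on the
 * datapath; a reader would fold min/max/avg across queues on demand. */
struct latency_queue_stats {
	uint64_t min_latency;
	uint64_t max_latency;
	uint64_t avg_latency;	/* scaled, as in the existing code */
	uint64_t samples;
} __rte_cache_aligned;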
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2025-06-26 15:01 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-13 0:34 [PATCH 0/2] Latencystat optimization and fix Stephen Hemminger
2025-06-13 0:34 ` [PATCH 1/2] latencystats: fix receive sample MP issues Stephen Hemminger
2025-06-13 0:34 ` [PATCH 2/2] latencystats: optimize locking on transmit Stephen Hemminger
2025-06-13 11:13 ` [PATCH 0/2] Latencystat optimization and fix Varghese, Vipin
2025-06-16 16:04 ` [PATCH v2 0/2] Latencystats optimizations " Stephen Hemminger
2025-06-16 16:04 ` [PATCH v2 1/2] latencystats: fix receive sample MP issues Stephen Hemminger
2025-06-16 16:04 ` [PATCH v2 2/2] latencystats: optimize locking on transmit Stephen Hemminger
2025-06-17 15:00 ` [PATCH v3 0/2] latencystats: fix receive sample MP issues Stephen Hemminger
2025-06-17 15:00 ` [PATCH v3 1/2] " Stephen Hemminger
2025-06-25 11:31 ` Varghese, Vipin
2025-06-26 15:01 ` Stephen Hemminger
2025-06-17 15:00 ` [PATCH v3 2/2] latencystats: optimize locking on transmit Stephen Hemminger
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).