* [PATCH 1/2] latencystats: fix receive sample MP issues
2025-06-13 0:34 [PATCH 0/2] Latencystat optimization and fix Stephen Hemminger
@ 2025-06-13 0:34 ` Stephen Hemminger
2025-06-13 0:34 ` [PATCH 2/2] latencystats: optimize locking on transmit Stephen Hemminger
2025-06-13 11:13 ` [PATCH 0/2] Latencystat optimization and fix Varghese, Vipin
2 siblings, 0 replies; 4+ messages in thread
From: Stephen Hemminger @ 2025-06-13 0:34 UTC (permalink / raw)
To: dev
Cc: Stephen Hemminger, stable, Reshma Pattan, Harry van Haaren, Remy Horton
The receive callback was not safe with multiple queues.
If one receive queue callback decides to take a sample it
needs to add that sample and do atomic update to the previous
TSC sample value. Add a new lock for that.
Also, add code to handle TSC wraparound in comparison.
Perhaps this should move to rte_cycles.h?
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Fixes: 5cd3cac9ed22 ("latency: added new library for latency stats")
Cc: stable@dpdk.org
---
lib/latencystats/rte_latencystats.c | 47 ++++++++++++++++++-----------
1 file changed, 29 insertions(+), 18 deletions(-)
diff --git a/lib/latencystats/rte_latencystats.c b/lib/latencystats/rte_latencystats.c
index 6873a44a92..181c53dd0e 100644
--- a/lib/latencystats/rte_latencystats.c
+++ b/lib/latencystats/rte_latencystats.c
@@ -45,10 +45,19 @@ timestamp_dynfield(struct rte_mbuf *mbuf)
timestamp_dynfield_offset, rte_mbuf_timestamp_t *);
}
+/* Compare two 64 bit timer counter but deal with wraparound correctly. */
+static inline bool tsc_after(uint64_t t0, uint64_t t1)
+{
+ return (int64_t)(t1 - t0) < 0;
+}
+
+#define tsc_before(a, b) tsc_after(b, a)
+
static const char *MZ_RTE_LATENCY_STATS = "rte_latencystats";
static int latency_stats_index;
+
+static rte_spinlock_t sample_lock = RTE_SPINLOCK_INITIALIZER;
static uint64_t samp_intvl;
-static uint64_t timer_tsc;
static uint64_t prev_tsc;
#define LATENCY_AVG_SCALE 4
@@ -147,25 +156,27 @@ add_time_stamps(uint16_t pid __rte_unused,
void *user_cb __rte_unused)
{
unsigned int i;
- uint64_t diff_tsc, now;
+ uint64_t now = rte_rdtsc();
- /*
- * For every sample interval,
- * time stamp is marked on one received packet.
- */
- now = rte_rdtsc();
- for (i = 0; i < nb_pkts; i++) {
- diff_tsc = now - prev_tsc;
- timer_tsc += diff_tsc;
-
- if ((pkts[i]->ol_flags & timestamp_dynflag) == 0
- && (timer_tsc >= samp_intvl)) {
- *timestamp_dynfield(pkts[i]) = now;
- pkts[i]->ol_flags |= timestamp_dynflag;
- timer_tsc = 0;
+ /* Check without locking */
+ if (likely(tsc_before(now, prev_tsc + samp_intvl)))
+ return nb_pkts;
+
+ /* Try and get sample, skip if sample is being done by other core. */
+ if (likely(rte_spinlock_trylock(&sample_lock))) {
+ for (i = 0; i < nb_pkts; i++) {
+ struct rte_mbuf *m = pkts[i];
+
+ /* skip if already timestamped */
+ if (unlikely(m->ol_flags & timestamp_dynflag))
+ continue;
+
+ m->ol_flags |= timestamp_dynflag;
+ *timestamp_dynfield(m) = now;
+ prev_tsc = now;
+ break;
}
- prev_tsc = now;
- now = rte_rdtsc();
+ rte_spinlock_unlock(&sample_lock);
}
return nb_pkts;
--
2.47.2
^ permalink raw reply [flat|nested] 4+ messages in thread
* [PATCH 2/2] latencystats: optimize locking on transmit
2025-06-13 0:34 [PATCH 0/2] Latencystat optimization and fix Stephen Hemminger
2025-06-13 0:34 ` [PATCH 1/2] latencystats: fix receive sample MP issues Stephen Hemminger
@ 2025-06-13 0:34 ` Stephen Hemminger
2025-06-13 11:13 ` [PATCH 0/2] Latencystat optimization and fix Varghese, Vipin
2 siblings, 0 replies; 4+ messages in thread
From: Stephen Hemminger @ 2025-06-13 0:34 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger, Reshma Pattan
If transmit callback is called, and there are no packets
in the burst with timestamp set, then the expensive operations
of locking and accessing TSC can be skipped.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
lib/latencystats/rte_latencystats.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/lib/latencystats/rte_latencystats.c b/lib/latencystats/rte_latencystats.c
index 181c53dd0e..dc415c4682 100644
--- a/lib/latencystats/rte_latencystats.c
+++ b/lib/latencystats/rte_latencystats.c
@@ -190,17 +190,26 @@ calc_latency(uint16_t pid __rte_unused,
void *_ __rte_unused)
{
unsigned int i;
- uint64_t now, latency;
+ uint64_t ts_flags = 0;
static uint64_t prev_latency;
- now = rte_rdtsc();
+ for (i = 0; i < nb_pkts; i++)
+ ts_flags |= (pkts[i]->ol_flags & timestamp_dynflag);
+ /* no samples in this burst, skip locking */
+ if (likely(ts_flags == 0))
+ return nb_pkts;
+
+ uint64_t now = rte_rdtsc();
rte_spinlock_lock(&glob_stats->lock);
for (i = 0; i < nb_pkts; i++) {
- if (!(pkts[i]->ol_flags & timestamp_dynflag))
+ struct rte_mbuf *m = pkts[i];
+ uint64_t latency;
+
+ if (!(m->ol_flags & timestamp_dynflag))
continue;
- latency = now - *timestamp_dynfield(pkts[i]);
+ latency = now - *timestamp_dynfield(m);
if (glob_stats->samples++ == 0) {
glob_stats->min_latency = latency;
--
2.47.2
^ permalink raw reply [flat|nested] 4+ messages in thread
* RE: [PATCH 0/2] Latencystat optimization and fix
2025-06-13 0:34 [PATCH 0/2] Latencystat optimization and fix Stephen Hemminger
2025-06-13 0:34 ` [PATCH 1/2] latencystats: fix receive sample MP issues Stephen Hemminger
2025-06-13 0:34 ` [PATCH 2/2] latencystats: optimize locking on transmit Stephen Hemminger
@ 2025-06-13 11:13 ` Varghese, Vipin
2 siblings, 0 replies; 4+ messages in thread
From: Varghese, Vipin @ 2025-06-13 11:13 UTC (permalink / raw)
To: Stephen Hemminger, dev
[Public]
> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Friday, June 13, 2025 6:04 AM
> To: dev@dpdk.org
> Cc: Stephen Hemminger <stephen@networkplumber.org>
> Subject: [PATCH 0/2] Latencystat optimization and fix
>
> Caution: This message originated from an External Source. Use proper caution
> when opening attachments, clicking links, or responding.
>
>
> After investigating latencystat test failures, discovered it poorly designed to handle
> bursts and multiple queues.
Acked-by: Vipin.Varghese@amd.com
Reviewed-by: Vipin.Varghese@amd.com
Tested-by: Thiyagarajan.P@amd.com
We tested the changes using DPDK-TestPmd, following is the summary of the test shared below
1. before vs after patch with 1Q: min: -10ns, max: -5000ns, avg: -10ns, jitter: +5ns
2. before vs after patch with 4Q: min: -10ns, max: -7000ns, avg: +5ns, jitter: +80ns
Since we see a greater variation with multiple queues compared to single queue, we also collect HW Traffic Generator before vs after.
Values we got are ` min: -50000ns, max: -62000ns, avg: -5300ns, jitter: 6ns `.
NIC: Intel E810 2CQ-DA2 (used 1 100Gbps port), firmware: 4.7, DDP: default
Processor: AMD EPYC 8534P 64-Core Processor
Traffic Rate: 100Gbps
Application CMD:
1. DPDK: ` sudo build/app/dpdk-testpmd -l 11,16,17,18,19 -a 42:00.0 --file-prefix tr01 --proc-type=primary --force-max-simd-bitwidth=512 -- --nb-cores=4 --nb-ports=1 --txq=4 --rxq=4 --txd=2048 --rxd=2048 --burst=64 --rss-udp --latencystats=16 -i -a `
2. Proc-Info: ` sudo build/app/dpdk-proc-info --file-prefix=tr01 --proc-type=secondary -a 42:00.0 -- --metrics `
>
> Stephen Hemminger (2):
> latencystats: fix receive sample MP issues
> latencystats: optimize locking on transmit
>
> lib/latencystats/rte_latencystats.c | 64 +++++++++++++++++++----------
> 1 file changed, 42 insertions(+), 22 deletions(-)
>
> --
> 2.47.2
^ permalink raw reply [flat|nested] 4+ messages in thread