patches for DPDK stable branches
* [PATCH 1/2] latencystats: fix receive sample MP issues
       [not found] <20250613003547.39239-1-stephen@networkplumber.org>
@ 2025-06-13  0:34 ` Stephen Hemminger
       [not found] ` <20250616160718.49938-1-stephen@networkplumber.org>
       [not found] ` <20250617150252.814215-1-stephen@networkplumber.org>
  2 siblings, 0 replies; 4+ messages in thread
From: Stephen Hemminger @ 2025-06-13  0:34 UTC (permalink / raw)
  To: dev
  Cc: Stephen Hemminger, stable, Reshma Pattan, Harry van Haaren, Remy Horton

The receive callback was not safe with multiple queues.
If one receive queue callback decides to take a sample, it
needs to add that sample and atomically update the previous
TSC sample value. Add a new lock for that.

Also, add code to handle TSC wraparound in the comparison.
Perhaps this should move to rte_cycles.h?
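
For illustration, here is a minimal standalone sketch of the
wraparound-safe comparison (not part of this patch; the helper
mirrors the tsc_after() added below):

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if t0 is later than t1, even across a counter wraparound,
 * by interpreting the unsigned difference as signed. */
static inline bool tsc_after(uint64_t t0, uint64_t t1)
{
	return (int64_t)(t1 - t0) < 0;
}

int main(void)
{
	/* Plain case: 200 comes after 100. */
	assert(tsc_after(200, 100));

	/* Wraparound: a counter just past the wrap (9) is still "after"
	 * one taken just before it; (int64_t)((UINT64_MAX - 5) - 9) is -15. */
	assert(tsc_after(9, UINT64_MAX - 5));
	return 0;
}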

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Fixes: 5cd3cac9ed22 ("latency: added new library for latency stats")
Cc: stable@dpdk.org
---
 lib/latencystats/rte_latencystats.c | 47 ++++++++++++++++++-----------
 1 file changed, 29 insertions(+), 18 deletions(-)

diff --git a/lib/latencystats/rte_latencystats.c b/lib/latencystats/rte_latencystats.c
index 6873a44a92..181c53dd0e 100644
--- a/lib/latencystats/rte_latencystats.c
+++ b/lib/latencystats/rte_latencystats.c
@@ -45,10 +45,19 @@ timestamp_dynfield(struct rte_mbuf *mbuf)
 			timestamp_dynfield_offset, rte_mbuf_timestamp_t *);
 }
 
+/* Compare two 64-bit timer counters, handling wraparound correctly. */
+static inline bool tsc_after(uint64_t t0, uint64_t t1)
+{
+	return (int64_t)(t1 - t0) < 0;
+}
+
+#define tsc_before(a, b) tsc_after(b, a)
+
 static const char *MZ_RTE_LATENCY_STATS = "rte_latencystats";
 static int latency_stats_index;
+
+static rte_spinlock_t sample_lock = RTE_SPINLOCK_INITIALIZER;
 static uint64_t samp_intvl;
-static uint64_t timer_tsc;
 static uint64_t prev_tsc;
 
 #define LATENCY_AVG_SCALE     4
@@ -147,25 +156,27 @@ add_time_stamps(uint16_t pid __rte_unused,
 		void *user_cb __rte_unused)
 {
 	unsigned int i;
-	uint64_t diff_tsc, now;
+	uint64_t now = rte_rdtsc();
 
-	/*
-	 * For every sample interval,
-	 * time stamp is marked on one received packet.
-	 */
-	now = rte_rdtsc();
-	for (i = 0; i < nb_pkts; i++) {
-		diff_tsc = now - prev_tsc;
-		timer_tsc += diff_tsc;
-
-		if ((pkts[i]->ol_flags & timestamp_dynflag) == 0
-				&& (timer_tsc >= samp_intvl)) {
-			*timestamp_dynfield(pkts[i]) = now;
-			pkts[i]->ol_flags |= timestamp_dynflag;
-			timer_tsc = 0;
+	/* Check without locking */
+	if (likely(tsc_before(now, prev_tsc + samp_intvl)))
+		return nb_pkts;
+
+	/* Try and get sample, skip if sample is being done by other core. */
+	if (likely(rte_spinlock_trylock(&sample_lock))) {
+		for (i = 0; i < nb_pkts; i++) {
+			struct rte_mbuf *m = pkts[i];
+
+			/* skip if already timestamped */
+			if (unlikely(m->ol_flags & timestamp_dynflag))
+				continue;
+
+			m->ol_flags |= timestamp_dynflag;
+			*timestamp_dynfield(m) = now;
+			prev_tsc = now;
+			break;
 		}
-		prev_tsc = now;
-		now = rte_rdtsc();
+		rte_spinlock_unlock(&sample_lock);
 	}
 
 	return nb_pkts;
-- 
2.47.2



* [PATCH v2 1/2] latencystats: fix receive sample MP issues
       [not found] ` <20250616160718.49938-1-stephen@networkplumber.org>
@ 2025-06-16 16:04   ` Stephen Hemminger
  0 siblings, 0 replies; 4+ messages in thread
From: Stephen Hemminger @ 2025-06-16 16:04 UTC (permalink / raw)
  To: dev; +Cc: Stephen Hemminger, stable

The receive callback was not safe with multiple queues.
If one receive queue callback decides to take a sample, it
needs to add that sample and atomically update the previous
TSC sample value. Add a new lock for that.

Optimize the check for when to take a sample so that
the lock is only taken when a sample is likely to be needed.

Also, add code to handle TSC wraparound in the comparison.
Perhaps this should move to rte_cycles.h?
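
As a sketch of the fast-path/slow-path scheme this version uses,
with plain C11 atomics standing in for the DPDK wrappers (names
here are illustrative, not the library's):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static atomic_uint_fast64_t next_deadline;      /* stands in for next_tsc */
static atomic_flag sampling = ATOMIC_FLAG_INIT; /* stands in for sample_lock */
static uint64_t interval;                       /* stands in for samp_intvl */

static bool maybe_take_sample(uint64_t now)
{
	/* Fast path: relaxed load, no lock; taken on almost every burst. */
	if ((int64_t)(atomic_load_explicit(&next_deadline,
					   memory_order_relaxed) - now) > 0)
		return false;

	/* Slow path: only one core wins the trylock and samples;
	 * the others skip rather than spin. */
	if (atomic_flag_test_and_set_explicit(&sampling, memory_order_acquire))
		return false;

	atomic_store_explicit(&next_deadline, now + interval,
			      memory_order_relaxed);
	atomic_flag_clear_explicit(&sampling, memory_order_release);
	return true;	/* caller timestamps one packet and resamples later */
}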

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Fixes: 5cd3cac9ed22 ("latency: added new library for latency stats")
Cc: stable@dpdk.org
---
 lib/latencystats/rte_latencystats.c | 51 ++++++++++++++++++-----------
 1 file changed, 32 insertions(+), 19 deletions(-)

diff --git a/lib/latencystats/rte_latencystats.c b/lib/latencystats/rte_latencystats.c
index 6873a44a92..2b994656fb 100644
--- a/lib/latencystats/rte_latencystats.c
+++ b/lib/latencystats/rte_latencystats.c
@@ -22,6 +22,7 @@
 #include <rte_metrics.h>
 #include <rte_spinlock.h>
 #include <rte_string_fns.h>
+#include <rte_stdatomic.h>
 
 #include "rte_latencystats.h"
 
@@ -45,11 +46,20 @@ timestamp_dynfield(struct rte_mbuf *mbuf)
 			timestamp_dynfield_offset, rte_mbuf_timestamp_t *);
 }
 
+/* Compare two 64-bit timer counters, handling wraparound correctly. */
+static inline bool tsc_after(uint64_t t0, uint64_t t1)
+{
+	return (int64_t)(t1 - t0) < 0;
+}
+
+#define tsc_before(a, b) tsc_after(b, a)
+
 static const char *MZ_RTE_LATENCY_STATS = "rte_latencystats";
 static int latency_stats_index;
+
+static rte_spinlock_t sample_lock = RTE_SPINLOCK_INITIALIZER;
 static uint64_t samp_intvl;
-static uint64_t timer_tsc;
-static uint64_t prev_tsc;
+static RTE_ATOMIC(uint64_t) next_tsc;
 
 #define LATENCY_AVG_SCALE     4
 #define LATENCY_JITTER_SCALE 16
@@ -147,25 +157,27 @@ add_time_stamps(uint16_t pid __rte_unused,
 		void *user_cb __rte_unused)
 {
 	unsigned int i;
-	uint64_t diff_tsc, now;
+	uint64_t now = rte_rdtsc();
 
-	/*
-	 * For every sample interval,
-	 * time stamp is marked on one received packet.
-	 */
-	now = rte_rdtsc();
-	for (i = 0; i < nb_pkts; i++) {
-		diff_tsc = now - prev_tsc;
-		timer_tsc += diff_tsc;
-
-		if ((pkts[i]->ol_flags & timestamp_dynflag) == 0
-				&& (timer_tsc >= samp_intvl)) {
-			*timestamp_dynfield(pkts[i]) = now;
-			pkts[i]->ol_flags |= timestamp_dynflag;
-			timer_tsc = 0;
+	/* Check without locking */
+	if (likely(tsc_before(now, rte_atomic_load_explicit(&next_tsc, rte_memory_order_relaxed))))
+		return nb_pkts;
+
+	/* Try and get sample, skip if sample is being done by other core. */
+	if (likely(rte_spinlock_trylock(&sample_lock))) {
+		for (i = 0; i < nb_pkts; i++) {
+			struct rte_mbuf *m = pkts[i];
+
+			/* skip if already timestamped */
+			if (unlikely(m->ol_flags & timestamp_dynflag))
+				continue;
+
+			m->ol_flags |= timestamp_dynflag;
+			*timestamp_dynfield(m) = now;
+			rte_atomic_store_explicit(&next_tsc, now + samp_intvl, rte_memory_order_relaxed);
+			break;
 		}
-		prev_tsc = now;
-		now = rte_rdtsc();
+		rte_spinlock_unlock(&sample_lock);
 	}
 
 	return nb_pkts;
@@ -270,6 +282,7 @@ rte_latencystats_init(uint64_t app_samp_intvl,
 	glob_stats = mz->addr;
 	rte_spinlock_init(&glob_stats->lock);
 	samp_intvl = (uint64_t)(app_samp_intvl * cycles_per_ns);
+	next_tsc = rte_rdtsc();
 
 	/** Register latency stats with stats library */
 	for (i = 0; i < NUM_LATENCY_STATS; i++)
-- 
2.47.2



* [PATCH v3 1/2] latencystats: fix receive sample MP issues
       [not found] ` <20250617150252.814215-1-stephen@networkplumber.org>
@ 2025-06-17 15:00   ` Stephen Hemminger
  2025-06-25 11:31     ` Varghese, Vipin
  0 siblings, 1 reply; 4+ messages in thread
From: Stephen Hemminger @ 2025-06-17 15:00 UTC (permalink / raw)
  To: dev; +Cc: Stephen Hemminger, stable

The receive callback was not safe with multiple queues.
If one receive queue callback decides to take a sample, it
needs to add that sample and atomically update the previous
TSC sample value. Add a new lock for that.

Optimize the check for when to take a sample so that
the lock is only taken when a sample is likely to be needed.

Also, add code to handle TSC wraparound in the comparison.
Perhaps this should move to rte_cycles.h?

Bugzilla ID: 1723
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Fixes: 5cd3cac9ed22 ("latency: added new library for latency stats")
Cc: stable@dpdk.org
---
 lib/latencystats/rte_latencystats.c | 55 ++++++++++++++++++-----------
 1 file changed, 35 insertions(+), 20 deletions(-)

diff --git a/lib/latencystats/rte_latencystats.c b/lib/latencystats/rte_latencystats.c
index 6873a44a92..72a58d78d1 100644
--- a/lib/latencystats/rte_latencystats.c
+++ b/lib/latencystats/rte_latencystats.c
@@ -22,6 +22,7 @@
 #include <rte_metrics.h>
 #include <rte_spinlock.h>
 #include <rte_string_fns.h>
+#include <rte_stdatomic.h>
 
 #include "rte_latencystats.h"
 
@@ -45,11 +46,20 @@ timestamp_dynfield(struct rte_mbuf *mbuf)
 			timestamp_dynfield_offset, rte_mbuf_timestamp_t *);
 }
 
+/* Compare two 64-bit timer counters, handling wraparound correctly. */
+static inline bool tsc_after(uint64_t t0, uint64_t t1)
+{
+	return (int64_t)(t1 - t0) < 0;
+}
+
+#define tsc_before(a, b) tsc_after(b, a)
+
 static const char *MZ_RTE_LATENCY_STATS = "rte_latencystats";
 static int latency_stats_index;
+
+static rte_spinlock_t sample_lock = RTE_SPINLOCK_INITIALIZER;
 static uint64_t samp_intvl;
-static uint64_t timer_tsc;
-static uint64_t prev_tsc;
+static RTE_ATOMIC(uint64_t) next_tsc;
 
 #define LATENCY_AVG_SCALE     4
 #define LATENCY_JITTER_SCALE 16
@@ -147,25 +157,29 @@ add_time_stamps(uint16_t pid __rte_unused,
 		void *user_cb __rte_unused)
 {
 	unsigned int i;
-	uint64_t diff_tsc, now;
-
-	/*
-	 * For every sample interval,
-	 * time stamp is marked on one received packet.
-	 */
-	now = rte_rdtsc();
-	for (i = 0; i < nb_pkts; i++) {
-		diff_tsc = now - prev_tsc;
-		timer_tsc += diff_tsc;
-
-		if ((pkts[i]->ol_flags & timestamp_dynflag) == 0
-				&& (timer_tsc >= samp_intvl)) {
-			*timestamp_dynfield(pkts[i]) = now;
-			pkts[i]->ol_flags |= timestamp_dynflag;
-			timer_tsc = 0;
+	uint64_t now = rte_rdtsc();
+
+	/* Check without locking */
+	if (likely(tsc_before(now, rte_atomic_load_explicit(&next_tsc,
+							    rte_memory_order_relaxed))))
+		return nb_pkts;
+
+	/* Try and get sample, skip if sample is being done by other core. */
+	if (likely(rte_spinlock_trylock(&sample_lock))) {
+		for (i = 0; i < nb_pkts; i++) {
+			struct rte_mbuf *m = pkts[i];
+
+			/* skip if already timestamped */
+			if (unlikely(m->ol_flags & timestamp_dynflag))
+				continue;
+
+			m->ol_flags |= timestamp_dynflag;
+			*timestamp_dynfield(m) = now;
+			rte_atomic_store_explicit(&next_tsc, now + samp_intvl,
+						  rte_memory_order_relaxed);
+			break;
 		}
-		prev_tsc = now;
-		now = rte_rdtsc();
+		rte_spinlock_unlock(&sample_lock);
 	}
 
 	return nb_pkts;
@@ -270,6 +284,7 @@ rte_latencystats_init(uint64_t app_samp_intvl,
 	glob_stats = mz->addr;
 	rte_spinlock_init(&glob_stats->lock);
 	samp_intvl = (uint64_t)(app_samp_intvl * cycles_per_ns);
+	next_tsc = rte_rdtsc();
 
 	/** Register latency stats with stats library */
 	for (i = 0; i < NUM_LATENCY_STATS; i++)
-- 
2.47.2



* RE: [PATCH v3 1/2] latencystats: fix receive sample MP issues
  2025-06-17 15:00   ` [PATCH v3 " Stephen Hemminger
@ 2025-06-25 11:31     ` Varghese, Vipin
  0 siblings, 0 replies; 4+ messages in thread
From: Varghese, Vipin @ 2025-06-25 11:31 UTC (permalink / raw)
  To: Stephen Hemminger, dev, David Marchand; +Cc: stable

Hi David & Stephen,

> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Tuesday, June 17, 2025 8:30 PM
> To: dev@dpdk.org
> Cc: Stephen Hemminger <stephen@networkplumber.org>; stable@dpdk.org
> Subject: [PATCH v3 1/2] latencystats: fix receive sample MP issues
>
> [quoted v3 patch body snipped; identical to the message above]


Application: testpmd io mode with latency-stats enabled
CPU: AMD EPYC 7713 64-Core Processor (AVX2)
Huge pages: 32 * 1GB

NIC: Intel E810 1CQ DA2, 1 * 100Gbps
+++++++++++++++++++++++++++++

Firmware: 3.20 0x8000d83e 1.3146.0
DDP: comms package 1.3.53

With no args, Before patch (min, max, avg, jitter)
- 1Q: 30ns, 27432ns, 94ns, 19
- 4Q: 30ns, 27722ns, 95ns, 20

With no args, After Patch (min, max, avg, jitter)
 - 1Q: 40ns, 19136ns, 47ns, 5
 - 4Q: 10ns, 18334ns, 194ns, 64

With args: rx_low_latency=1, Before patch (min, max, avg, jitter)
- 1Q: 30ns, 27432ns, 94ns, 19
- 4Q: 30ns, 27722ns, 95ns, 20

With args: rx_low_latency=1, After Patch
 - 1Q: 40ns, 21631ns, 74ns, 12
 - 4Q: 10ns, 23725ns, 116ns, 112

With Solarflare NIC:
+++++++++++++++

throughput profile; After Patch (min, max, avg, jitter)
 - 1Q: 10ns, 23115ns, 96ns, 65
 - 4Q: 10ns, 2981ns, 136ns, 140

low-latency profile , After Patch
 - 1Q: 10ns, 19399ns, 367ns, 238
 - 4Q: 10ns, 19970ns, 127ns, 100


Our understanding is as follows:

1. The increase in multi-queue latency is attributable to the spinlock.
2. The lower minimum latency with the patch in the multi-queue case is because the lowest value across all queues is taken into account.

Question: will per-queue min, max, and avg stats be added in the future?

Tested-by: Thiyagarajan P <Thiyagarajan.P@amd.com>
Reviewed-by: Vipin Varghese <Vipin.Varghese@amd.com>




