* [PATCH 1/2] latencystats: fix receive sample MP issues [not found] <20250613003547.39239-1-stephen@networkplumber.org> @ 2025-06-13 0:34 ` Stephen Hemminger [not found] ` <20250616160718.49938-1-stephen@networkplumber.org> [not found] ` <20250617150252.814215-1-stephen@networkplumber.org> 2 siblings, 0 replies; 4+ messages in thread From: Stephen Hemminger @ 2025-06-13 0:34 UTC (permalink / raw) To: dev Cc: Stephen Hemminger, stable, Reshma Pattan, Harry van Haaren, Remy Horton The receive callback was not safe with multiple queues. If one receive queue callback decides to take a sample it needs to add that sample and do atomic update to the previous TSC sample value. Add a new lock for that. Also, add code to handle TSC wraparound in comparison. Perhaps this should move to rte_cycles.h? Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Fixes: 5cd3cac9ed22 ("latency: added new library for latency stats") Cc: stable@dpdk.org --- lib/latencystats/rte_latencystats.c | 47 ++++++++++++++++++----------- 1 file changed, 29 insertions(+), 18 deletions(-) diff --git a/lib/latencystats/rte_latencystats.c b/lib/latencystats/rte_latencystats.c index 6873a44a92..181c53dd0e 100644 --- a/lib/latencystats/rte_latencystats.c +++ b/lib/latencystats/rte_latencystats.c @@ -45,10 +45,19 @@ timestamp_dynfield(struct rte_mbuf *mbuf) timestamp_dynfield_offset, rte_mbuf_timestamp_t *); } +/* Compare two 64 bit timer counter but deal with wraparound correctly. 
*/ +static inline bool tsc_after(uint64_t t0, uint64_t t1) +{ + return (int64_t)(t1 - t0) < 0; +} + +#define tsc_before(a, b) tsc_after(b, a) + static const char *MZ_RTE_LATENCY_STATS = "rte_latencystats"; static int latency_stats_index; + +static rte_spinlock_t sample_lock = RTE_SPINLOCK_INITIALIZER; static uint64_t samp_intvl; -static uint64_t timer_tsc; static uint64_t prev_tsc; #define LATENCY_AVG_SCALE 4 @@ -147,25 +156,27 @@ add_time_stamps(uint16_t pid __rte_unused, void *user_cb __rte_unused) { unsigned int i; - uint64_t diff_tsc, now; + uint64_t now = rte_rdtsc(); - /* - * For every sample interval, - * time stamp is marked on one received packet. - */ - now = rte_rdtsc(); - for (i = 0; i < nb_pkts; i++) { - diff_tsc = now - prev_tsc; - timer_tsc += diff_tsc; - - if ((pkts[i]->ol_flags & timestamp_dynflag) == 0 - && (timer_tsc >= samp_intvl)) { - *timestamp_dynfield(pkts[i]) = now; - pkts[i]->ol_flags |= timestamp_dynflag; - timer_tsc = 0; + /* Check without locking */ + if (likely(tsc_before(now, prev_tsc + samp_intvl))) + return nb_pkts; + + /* Try and get sample, skip if sample is being done by other core. */ + if (likely(rte_spinlock_trylock(&sample_lock))) { + for (i = 0; i < nb_pkts; i++) { + struct rte_mbuf *m = pkts[i]; + + /* skip if already timestamped */ + if (unlikely(m->ol_flags & timestamp_dynflag)) + continue; + + m->ol_flags |= timestamp_dynflag; + *timestamp_dynfield(m) = now; + prev_tsc = now; + break; } - prev_tsc = now; - now = rte_rdtsc(); + rte_spinlock_unlock(&sample_lock); } return nb_pkts; -- 2.47.2 ^ permalink raw reply [flat|nested] 4+ messages in thread
* [PATCH v2 1/2] latencystats: fix receive sample MP issues [not found] ` <20250616160718.49938-1-stephen@networkplumber.org> @ 2025-06-16 16:04 ` Stephen Hemminger 0 siblings, 0 replies; 4+ messages in thread From: Stephen Hemminger @ 2025-06-16 16:04 UTC (permalink / raw) To: dev; +Cc: Stephen Hemminger, stable The receive callback was not safe with multiple queues. If one receive queue callback decides to take a sample it needs to add that sample and do atomic update to the previous TSC sample value. Add a new lock for that. Optimize the check for when to take sample so that it only needs to lock when likely to need a sample. Also, add code to handle TSC wraparound in comparison. Perhaps this should move to rte_cycles.h? Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Fixes: 5cd3cac9ed22 ("latency: added new library for latency stats") Cc: stable@dpdk.org --- lib/latencystats/rte_latencystats.c | 51 ++++++++++++++++++----------- 1 file changed, 32 insertions(+), 19 deletions(-) diff --git a/lib/latencystats/rte_latencystats.c b/lib/latencystats/rte_latencystats.c index 6873a44a92..2b994656fb 100644 --- a/lib/latencystats/rte_latencystats.c +++ b/lib/latencystats/rte_latencystats.c @@ -22,6 +22,7 @@ #include <rte_metrics.h> #include <rte_spinlock.h> #include <rte_string_fns.h> +#include <rte_stdatomic.h> #include "rte_latencystats.h" @@ -45,11 +46,20 @@ timestamp_dynfield(struct rte_mbuf *mbuf) timestamp_dynfield_offset, rte_mbuf_timestamp_t *); } +/* Compare two 64 bit timer counter but deal with wraparound correctly. 
*/ +static inline bool tsc_after(uint64_t t0, uint64_t t1) +{ + return (int64_t)(t1 - t0) < 0; +} + +#define tsc_before(a, b) tsc_after(b, a) + static const char *MZ_RTE_LATENCY_STATS = "rte_latencystats"; static int latency_stats_index; + +static rte_spinlock_t sample_lock = RTE_SPINLOCK_INITIALIZER; static uint64_t samp_intvl; -static uint64_t timer_tsc; -static uint64_t prev_tsc; +static RTE_ATOMIC(uint64_t) next_tsc; #define LATENCY_AVG_SCALE 4 #define LATENCY_JITTER_SCALE 16 @@ -147,25 +157,27 @@ add_time_stamps(uint16_t pid __rte_unused, void *user_cb __rte_unused) { unsigned int i; - uint64_t diff_tsc, now; + uint64_t now = rte_rdtsc(); - /* - * For every sample interval, - * time stamp is marked on one received packet. - */ - now = rte_rdtsc(); - for (i = 0; i < nb_pkts; i++) { - diff_tsc = now - prev_tsc; - timer_tsc += diff_tsc; - - if ((pkts[i]->ol_flags & timestamp_dynflag) == 0 - && (timer_tsc >= samp_intvl)) { - *timestamp_dynfield(pkts[i]) = now; - pkts[i]->ol_flags |= timestamp_dynflag; - timer_tsc = 0; + /* Check without locking */ + if (likely(tsc_before(now, rte_atomic_load_explicit(&next_tsc, rte_memory_order_relaxed)))) + return nb_pkts; + + /* Try and get sample, skip if sample is being done by other core. 
*/ + if (likely(rte_spinlock_trylock(&sample_lock))) { + for (i = 0; i < nb_pkts; i++) { + struct rte_mbuf *m = pkts[i]; + + /* skip if already timestamped */ + if (unlikely(m->ol_flags & timestamp_dynflag)) + continue; + + m->ol_flags |= timestamp_dynflag; + *timestamp_dynfield(m) = now; + rte_atomic_store_explicit(&next_tsc, now + samp_intvl, rte_memory_order_relaxed); + break; } - prev_tsc = now; - now = rte_rdtsc(); + rte_spinlock_unlock(&sample_lock); } return nb_pkts; @@ -270,6 +282,7 @@ rte_latencystats_init(uint64_t app_samp_intvl, glob_stats = mz->addr; rte_spinlock_init(&glob_stats->lock); samp_intvl = (uint64_t)(app_samp_intvl * cycles_per_ns); + next_tsc = rte_rdtsc(); /** Register latency stats with stats library */ for (i = 0; i < NUM_LATENCY_STATS; i++) -- 2.47.2 ^ permalink raw reply [flat|nested] 4+ messages in thread
* [PATCH v3 1/2] latencystats: fix receive sample MP issues [not found] ` <20250617150252.814215-1-stephen@networkplumber.org> @ 2025-06-17 15:00 ` Stephen Hemminger 2025-06-25 11:31 ` Varghese, Vipin 0 siblings, 1 reply; 4+ messages in thread From: Stephen Hemminger @ 2025-06-17 15:00 UTC (permalink / raw) To: dev; +Cc: Stephen Hemminger, stable The receive callback was not safe with multiple queues. If one receive queue callback decides to take a sample it needs to add that sample and do atomic update to the previous TSC sample value. Add a new lock for that. Optimize the check for when to take sample so that it only needs to lock when likely to need a sample. Also, add code to handle TSC wraparound in comparison. Perhaps this should move to rte_cycles.h? Bugzilla ID: 1723 Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Fixes: 5cd3cac9ed22 ("latency: added new library for latency stats") Cc: stable@dpdk.org --- lib/latencystats/rte_latencystats.c | 55 ++++++++++++++++++----------- 1 file changed, 35 insertions(+), 20 deletions(-) diff --git a/lib/latencystats/rte_latencystats.c b/lib/latencystats/rte_latencystats.c index 6873a44a92..72a58d78d1 100644 --- a/lib/latencystats/rte_latencystats.c +++ b/lib/latencystats/rte_latencystats.c @@ -22,6 +22,7 @@ #include <rte_metrics.h> #include <rte_spinlock.h> #include <rte_string_fns.h> +#include <rte_stdatomic.h> #include "rte_latencystats.h" @@ -45,11 +46,20 @@ timestamp_dynfield(struct rte_mbuf *mbuf) timestamp_dynfield_offset, rte_mbuf_timestamp_t *); } +/* Compare two 64 bit timer counter but deal with wraparound correctly. 
*/ +static inline bool tsc_after(uint64_t t0, uint64_t t1) +{ + return (int64_t)(t1 - t0) < 0; +} + +#define tsc_before(a, b) tsc_after(b, a) + static const char *MZ_RTE_LATENCY_STATS = "rte_latencystats"; static int latency_stats_index; + +static rte_spinlock_t sample_lock = RTE_SPINLOCK_INITIALIZER; static uint64_t samp_intvl; -static uint64_t timer_tsc; -static uint64_t prev_tsc; +static RTE_ATOMIC(uint64_t) next_tsc; #define LATENCY_AVG_SCALE 4 #define LATENCY_JITTER_SCALE 16 @@ -147,25 +157,29 @@ add_time_stamps(uint16_t pid __rte_unused, void *user_cb __rte_unused) { unsigned int i; - uint64_t diff_tsc, now; - - /* - * For every sample interval, - * time stamp is marked on one received packet. - */ - now = rte_rdtsc(); - for (i = 0; i < nb_pkts; i++) { - diff_tsc = now - prev_tsc; - timer_tsc += diff_tsc; - - if ((pkts[i]->ol_flags & timestamp_dynflag) == 0 - && (timer_tsc >= samp_intvl)) { - *timestamp_dynfield(pkts[i]) = now; - pkts[i]->ol_flags |= timestamp_dynflag; - timer_tsc = 0; + uint64_t now = rte_rdtsc(); + + /* Check without locking */ + if (likely(tsc_before(now, rte_atomic_load_explicit(&next_tsc, + rte_memory_order_relaxed)))) + return nb_pkts; + + /* Try and get sample, skip if sample is being done by other core. 
*/ + if (likely(rte_spinlock_trylock(&sample_lock))) { + for (i = 0; i < nb_pkts; i++) { + struct rte_mbuf *m = pkts[i]; + + /* skip if already timestamped */ + if (unlikely(m->ol_flags & timestamp_dynflag)) + continue; + + m->ol_flags |= timestamp_dynflag; + *timestamp_dynfield(m) = now; + rte_atomic_store_explicit(&next_tsc, now + samp_intvl, + rte_memory_order_relaxed); + break; } - prev_tsc = now; - now = rte_rdtsc(); + rte_spinlock_unlock(&sample_lock); } return nb_pkts; @@ -270,6 +284,7 @@ rte_latencystats_init(uint64_t app_samp_intvl, glob_stats = mz->addr; rte_spinlock_init(&glob_stats->lock); samp_intvl = (uint64_t)(app_samp_intvl * cycles_per_ns); + next_tsc = rte_rdtsc(); /** Register latency stats with stats library */ for (i = 0; i < NUM_LATENCY_STATS; i++) -- 2.47.2 ^ permalink raw reply [flat|nested] 4+ messages in thread
* RE: [PATCH v3 1/2] latencystats: fix receive sample MP issues 2025-06-17 15:00 ` [PATCH v3 " Stephen Hemminger @ 2025-06-25 11:31 ` Varghese, Vipin 0 siblings, 0 replies; 4+ messages in thread From: Varghese, Vipin @ 2025-06-25 11:31 UTC (permalink / raw) To: Stephen Hemminger, dev, David Marchand; +Cc: stable [AMD Official Use Only - AMD Internal Distribution Only] Hi David & Stephen, > -----Original Message----- > From: Stephen Hemminger <stephen@networkplumber.org> > Sent: Tuesday, June 17, 2025 8:30 PM > To: dev@dpdk.org > Cc: Stephen Hemminger <stephen@networkplumber.org>; stable@dpdk.org > Subject: [PATCH v3 1/2] latencystats: fix receive sample MP issues > > Caution: This message originated from an External Source. Use proper caution > when opening attachments, clicking links, or responding. > > > The receive callback was not safe with multiple queues. > If one receive queue callback decides to take a sample it needs to add that sample > and do atomic update to the previous TSC sample value. Add a new lock for that. > > Optimize the check for when to take sample so that it only needs to lock when likely > to need a sample. > > Also, add code to handle TSC wraparound in comparison. > Perhaps this should move to rte_cycles.h? 
> > Bugzilla ID: 1723 > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> > Fixes: 5cd3cac9ed22 ("latency: added new library for latency stats") > Cc: stable@dpdk.org > --- > lib/latencystats/rte_latencystats.c | 55 ++++++++++++++++++----------- > 1 file changed, 35 insertions(+), 20 deletions(-) > > diff --git a/lib/latencystats/rte_latencystats.c b/lib/latencystats/rte_latencystats.c > index 6873a44a92..72a58d78d1 100644 > --- a/lib/latencystats/rte_latencystats.c > +++ b/lib/latencystats/rte_latencystats.c > @@ -22,6 +22,7 @@ > #include <rte_metrics.h> > #include <rte_spinlock.h> > #include <rte_string_fns.h> > +#include <rte_stdatomic.h> > > #include "rte_latencystats.h" > > @@ -45,11 +46,20 @@ timestamp_dynfield(struct rte_mbuf *mbuf) > timestamp_dynfield_offset, rte_mbuf_timestamp_t *); } > > +/* Compare two 64 bit timer counter but deal with wraparound correctly. > +*/ static inline bool tsc_after(uint64_t t0, uint64_t t1) { > + return (int64_t)(t1 - t0) < 0; > +} > + > +#define tsc_before(a, b) tsc_after(b, a) > + > static const char *MZ_RTE_LATENCY_STATS = "rte_latencystats"; static int > latency_stats_index; > + > +static rte_spinlock_t sample_lock = RTE_SPINLOCK_INITIALIZER; > static uint64_t samp_intvl; > -static uint64_t timer_tsc; > -static uint64_t prev_tsc; > +static RTE_ATOMIC(uint64_t) next_tsc; > > #define LATENCY_AVG_SCALE 4 > #define LATENCY_JITTER_SCALE 16 > @@ -147,25 +157,29 @@ add_time_stamps(uint16_t pid __rte_unused, > void *user_cb __rte_unused) { > unsigned int i; > - uint64_t diff_tsc, now; > - > - /* > - * For every sample interval, > - * time stamp is marked on one received packet. 
> - */ > - now = rte_rdtsc(); > - for (i = 0; i < nb_pkts; i++) { > - diff_tsc = now - prev_tsc; > - timer_tsc += diff_tsc; > - > - if ((pkts[i]->ol_flags & timestamp_dynflag) == 0 > - && (timer_tsc >= samp_intvl)) { > - *timestamp_dynfield(pkts[i]) = now; > - pkts[i]->ol_flags |= timestamp_dynflag; > - timer_tsc = 0; > + uint64_t now = rte_rdtsc(); > + > + /* Check without locking */ > + if (likely(tsc_before(now, rte_atomic_load_explicit(&next_tsc, > + rte_memory_order_relaxed)))) > + return nb_pkts; > + > + /* Try and get sample, skip if sample is being done by other core. */ > + if (likely(rte_spinlock_trylock(&sample_lock))) { > + for (i = 0; i < nb_pkts; i++) { > + struct rte_mbuf *m = pkts[i]; > + > + /* skip if already timestamped */ > + if (unlikely(m->ol_flags & timestamp_dynflag)) > + continue; > + > + m->ol_flags |= timestamp_dynflag; > + *timestamp_dynfield(m) = now; > + rte_atomic_store_explicit(&next_tsc, now + samp_intvl, > + rte_memory_order_relaxed); > + break; > } > - prev_tsc = now; > - now = rte_rdtsc(); > + rte_spinlock_unlock(&sample_lock); > } > > return nb_pkts; > @@ -270,6 +284,7 @@ rte_latencystats_init(uint64_t app_samp_intvl, > glob_stats = mz->addr; > rte_spinlock_init(&glob_stats->lock); > samp_intvl = (uint64_t)(app_samp_intvl * cycles_per_ns); > + next_tsc = rte_rdtsc(); > > /** Register latency stats with stats library */ > for (i = 0; i < NUM_LATENCY_STATS; i++) Application: testpmd io mode with latency-stats enabled CPU: AMD EPYC 7713 64-Core Processor (AVX2) Huge page: 1GB pages * 32 Nic: Intel E810 1CQ DA2, 1 * 100Gbps +++++++++++++++++++++++++++++ Firmware: 3.20 0x8000d83e 1.3146.0 DDP: comms package 1.3.53 With no args, Before patch (min, max, avg, jitter) - 1Q: 30ns, 27432ns, 94ns, 19 - 4Q: 30ns, 27722ns, 95ns, 20 With no args, After Patch (min, max, avg, jitter) - 1Q: 40ns, 19136ns, 47ns, 5 - 4Q: 10ns, 18334ns, 194ns, 64 With args: rx_low_latency=1, Before patch (min, max, avg, jitter) - 1Q: 30ns, 27432ns, 94ns, 19 - 4Q: 
30ns, 27722ns, 95ns, 20 With args: rx_low_latency=1, After Patch - 1Q: 40ns, 21631ns, 74ns, 12 - 4Q: 10ns, 23725ns, 116ns, 112 With Solar flare NIC: +++++++++++++++ throughput profile; After Patch (min, max, avg, jitter) - 1Q: 10ns, 23115ns, 96ns, 65 - 4Q: 10ns, 2981ns, 136ns, 140 low-latency profile , After Patch - 1Q: 10ns, 19399ns, 367ns, 238 - 4Q: 10ns, 19970ns, 127ns, 100 Following are our understanding 1. increase in multi-queue latency is attributed by spinlock. 2. the lower latency with patch for multi-queue is because the lowest of all queues are taken into account. Question: will there be per queue min, max, avg stats be enhanced in future? Tested-by: Thiyagarajan P <Thiyagarajan.P@amd.com> Reviewed-by: Vipin Varghese <Vipin.Varghese@amd.com> > -- > 2.47.2 ^ permalink raw reply [flat|nested] 4+ messages in thread