DPDK patches and discussions
* [PATCH 0/4] net/mlx5: series to fix and improve tx trace capabilities
@ 2024-10-09 11:40 Viacheslav Ovsiienko
  2024-10-09 11:40 ` [PATCH 1/4] net/mlx5/tools: fix trace dump multiple burst completions Viacheslav Ovsiienko
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Viacheslav Ovsiienko @ 2024-10-09 11:40 UTC (permalink / raw)
  To: dev; +Cc: matan, rasland, orika, dsosnowski

This is a series of weakly dependent patches related to Tx tracing
in the mlx5 PMD:
 - fix an issue in the trace dump Python script
 - update the real-time clock reading routine, make it atomic,
   and move its initialization to driver startup
 - extend Tx tracing in the mlx5 PMD with a real-time parameter,
   read from the NIC, to synchronize trace records with HW descriptors
 - update the trace dump script to show raw records and incomplete bursts

Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>

Tim Martin (2):
  net/mlx5: fix real time counter reading from PCI BAR
  net/mlx5: fix Tx tracing to use single clock source

Viacheslav Ovsiienko (2):
  net/mlx5/tools: fix trace dump multiple burst completions
  net/mlx5: update dump script to show incomplete records

 .mailmap                             |  1 +
 doc/guides/nics/mlx5.rst             |  6 ++
 drivers/net/mlx5/mlx5.c              |  4 ++
 drivers/net/mlx5/mlx5_trace.h        |  9 ++-
 drivers/net/mlx5/mlx5_tx.h           | 53 ++++++++++++++--
 drivers/net/mlx5/mlx5_txpp.c         | 11 +---
 drivers/net/mlx5/tools/mlx5_trace.py | 90 ++++++++++++++++++++--------
 7 files changed, 133 insertions(+), 41 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/4] net/mlx5/tools: fix trace dump multiple burst completions
  2024-10-09 11:40 [PATCH 0/4] net/mlx5: series to fix and improve tx trace capabilities Viacheslav Ovsiienko
@ 2024-10-09 11:40 ` Viacheslav Ovsiienko
  2024-10-09 13:08   ` Dariusz Sosnowski
  2024-10-09 11:40 ` [PATCH 2/4] net/mlx5: fix real time counter reading from PCI BAR Viacheslav Ovsiienko
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Viacheslav Ovsiienko @ 2024-10-09 11:40 UTC (permalink / raw)
  To: dev; +Cc: matan, rasland, orika, dsosnowski, stable

If multiple bursts were completed in a single completion, only the
first burst was moved to the done list. This situation is not
typical: tracing was usually used to debug scheduled traffic, where
each burst had its own completion requested, so completions covering
multiple bursts did not occur.
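The one-line bug behind this fix can be reproduced with a minimal, self-contained sketch (the `wait_burst`/`done_burst` lists and the `complete_bursts` helper are hypothetical stand-ins for the script's per-queue state, not its actual code):

```python
def complete_bursts(wait_burst, done_burst, rmv):
    """Move the first rmv completed bursts from the wait list to done.

    The pre-fix script appended the same burst object rmv times (the
    last one examined by the completion loop); re-reading each entry
    by index, as below, is what the patch does.
    """
    for idx in range(rmv):
        done_burst.append(wait_burst[idx])
    return wait_burst[rmv:], done_burst


wait = ["burst0", "burst1", "burst2"]
done = []
wait, done = complete_bursts(wait, done, 2)
# wait == ["burst2"], done == ["burst0", "burst1"]
```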

Fixes: 9725191a7e14 ("net/mlx5: add Tx datapath trace analyzing script")
Cc: stable@dpdk.org

Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/net/mlx5/tools/mlx5_trace.py | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/mlx5/tools/mlx5_trace.py b/drivers/net/mlx5/tools/mlx5_trace.py
index 8c1fd0a350..67461520a9 100755
--- a/drivers/net/mlx5/tools/mlx5_trace.py
+++ b/drivers/net/mlx5/tools/mlx5_trace.py
@@ -258,13 +258,14 @@ def do_tx_complete(msg, trace):
         if burst.comp(wqe_id, wqe_ts) == 0:
             break
         rmv += 1
-    # mode completed burst to done list
+    # move completed burst(s) to done list
     if rmv != 0:
         idx = 0
         while idx < rmv:
+            burst = queue.wait_burst[idx]
             queue.done_burst.append(burst)
             idx += 1
-        del queue.wait_burst[0:rmv]
+        queue.wait_burst = queue.wait_burst[rmv:]
 
 
 def do_tx(msg, trace):
-- 
2.34.1



* [PATCH 2/4] net/mlx5: fix real time counter reading from PCI BAR
  2024-10-09 11:40 [PATCH 0/4] net/mlx5: series to fix and improve tx trace capabilities Viacheslav Ovsiienko
  2024-10-09 11:40 ` [PATCH 1/4] net/mlx5/tools: fix trace dump multiple burst completions Viacheslav Ovsiienko
@ 2024-10-09 11:40 ` Viacheslav Ovsiienko
  2024-10-09 11:40 ` [PATCH 3/4] net/mlx5: fix Tx tracing to use single clock source Viacheslav Ovsiienko
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Viacheslav Ovsiienko @ 2024-10-09 11:40 UTC (permalink / raw)
  To: dev; +Cc: matan, rasland, orika, dsosnowski, Tim Martin, stable

From: Tim Martin <timothym@nvidia.com>

The mlx5_txpp_read_clock() routine reads the 64-bit real-time
counter from the device PCI BAR. It had two issues:

  - it checked whether the PCI BAR was mapped into the process
    address space and tried to map it on demand. This is problematic
    if something goes wrong and the mapping fails: the attempt then
    happens on every read_clock API call, invoking the kernel, taking
    a long time, and causing application malfunction.

  - the 64-bit counter should be read in a single atomic
    transaction
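The torn-read hazard behind the second issue can be illustrated in a driver-independent way: if the two 32-bit halves of a running nanosecond counter are sampled non-atomically while a carry propagates between them, the assembled value jumps by seconds. This is a Python simulation of that race, not the driver's C code (which uses an atomic 64-bit load):

```python
def read_torn(counter_before, counter_after):
    """Simulate a non-atomic 64-bit read of a running counter: the
    low 32 bits are sampled before a carry into the high word, the
    high 32 bits after it."""
    low = counter_before & 0xFFFFFFFF
    high = counter_after >> 32
    return (high << 32) | low


before = 0x00000001FFFFFFFF   # just before a carry into the high word
after = before + 1            # the carry has propagated
torn = read_torn(before, after)
# torn == 0x00000002FFFFFFFF: ~4.3 seconds ahead at 1 ns resolution
```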

Fixes: 9b31fc9007f9 ("net/mlx5: fix read device clock in real time mode")
Cc: stable@dpdk.org

Signed-off-by: Tim Martin <timothym@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 .mailmap                     |  1 +
 drivers/net/mlx5/mlx5.c      |  4 +++
 drivers/net/mlx5/mlx5_tx.h   | 53 +++++++++++++++++++++++++++++++++---
 drivers/net/mlx5/mlx5_txpp.c | 11 ++------
 4 files changed, 56 insertions(+), 13 deletions(-)

diff --git a/.mailmap b/.mailmap
index dff07122f3..e36e0a4766 100644
--- a/.mailmap
+++ b/.mailmap
@@ -1494,6 +1494,7 @@ Timmons C. Player <timmons.player@spirent.com>
 Timothy McDaniel <timothy.mcdaniel@intel.com>
 Timothy Miskell <timothy.miskell@intel.com>
 Timothy Redaelli <tredaelli@redhat.com>
+Tim Martin <timothym@nvidia.com>
 Tim Shearer <tim.shearer@overturenetworks.com>
 Ting-Kai Ku <ting-kai.ku@intel.com>
 Ting Xu <ting.xu@intel.com>
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index cf34766a50..14676be484 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -2242,6 +2242,7 @@ int
 mlx5_proc_priv_init(struct rte_eth_dev *dev)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
+	struct mlx5_dev_ctx_shared *sh = priv->sh;
 	struct mlx5_proc_priv *ppriv;
 	size_t ppriv_size;
 
@@ -2262,6 +2263,9 @@ mlx5_proc_priv_init(struct rte_eth_dev *dev)
 	dev->process_private = ppriv;
 	if (rte_eal_process_type() == RTE_PROC_PRIMARY)
 		priv->sh->pppriv = ppriv;
+	/* Check and try to map HCA PCI BAR to allow reading real time. */
+	if (sh->dev_cap.rt_timestamp && mlx5_dev_is_pci(dev->device))
+		mlx5_txpp_map_hca_bar(dev);
 	return 0;
 }
 
diff --git a/drivers/net/mlx5/mlx5_tx.h b/drivers/net/mlx5/mlx5_tx.h
index 983913faa2..55568c41b1 100644
--- a/drivers/net/mlx5/mlx5_tx.h
+++ b/drivers/net/mlx5/mlx5_tx.h
@@ -372,6 +372,46 @@ mlx5_txpp_convert_tx_ts(struct mlx5_dev_ctx_shared *sh, uint64_t mts)
 	return ci;
 }
 
+/**
+ * Read real time clock counter directly from the device PCI BAR area.
+ * The PCI BAR must be mapped to the process memory space at initialization.
+ *
+ * @param dev
+ *   Device to read clock counter from
+ *
+ * @return
+ *   0 - if HCA BAR is not supported or not mapped.
+ *   !=0 - read 64-bit value of real-time in UTC format (nanoseconds)
+ */
+static __rte_always_inline uint64_t mlx5_read_pcibar_clock(struct rte_eth_dev *dev)
+{
+	struct mlx5_proc_priv *ppriv = dev->process_private;
+
+	if (ppriv && ppriv->hca_bar) {
+		struct mlx5_priv *priv = dev->data->dev_private;
+		struct mlx5_dev_ctx_shared *sh = priv->sh;
+		uint64_t *hca_ptr = (uint64_t *)(ppriv->hca_bar) +
+				  __mlx5_64_off(initial_seg, real_time);
+		uint64_t __rte_atomic *ts_addr;
+		uint64_t ts;
+
+		ts_addr = (uint64_t __rte_atomic *)hca_ptr;
+		ts = rte_atomic_load_explicit(ts_addr, rte_memory_order_seq_cst);
+		ts = rte_be_to_cpu_64(ts);
+		ts = mlx5_txpp_convert_rx_ts(sh, ts);
+		return ts;
+	}
+	return 0;
+}
+
+static __rte_always_inline uint64_t mlx5_read_pcibar_clock_from_txq(struct mlx5_txq_data *txq)
+{
+	struct mlx5_txq_ctrl *txq_ctrl = container_of(txq, struct mlx5_txq_ctrl, txq);
+	struct rte_eth_dev *dev = ETH_DEV(txq_ctrl->priv);
+
+	return mlx5_read_pcibar_clock(dev);
+}
+
 /**
  * Set Software Parser flags and offsets in Ethernet Segment of WQE.
  * Flags must be preliminary initialized to zero.
@@ -809,6 +849,7 @@ mlx5_tx_cseg_init(struct mlx5_txq_data *__rte_restrict txq,
 		  unsigned int olx)
 {
 	struct mlx5_wqe_cseg *__rte_restrict cs = &wqe->cseg;
+	uint64_t real_time;
 
 	/* For legacy MPW replace the EMPW by TSO with modifier. */
 	if (MLX5_TXOFF_CONFIG(MPW) && opcode == MLX5_OPCODE_ENHANCED_MPSW)
@@ -822,9 +863,12 @@ mlx5_tx_cseg_init(struct mlx5_txq_data *__rte_restrict txq,
 		cs->flags = RTE_BE32(MLX5_COMP_ONLY_FIRST_ERR <<
 				     MLX5_COMP_MODE_OFFSET);
 	cs->misc = RTE_BE32(0);
-	if (__rte_trace_point_fp_is_enabled() && !loc->pkts_sent)
-		rte_pmd_mlx5_trace_tx_entry(txq->port_id, txq->idx);
-	rte_pmd_mlx5_trace_tx_wqe((txq->wqe_ci << 8) | opcode);
+	if (__rte_trace_point_fp_is_enabled()) {
+		real_time = mlx5_read_pcibar_clock_from_txq(txq);
+		if (!loc->pkts_sent)
+			rte_pmd_mlx5_trace_tx_entry(real_time, txq->port_id, txq->idx);
+		rte_pmd_mlx5_trace_tx_wqe(real_time, (txq->wqe_ci << 8) | opcode);
+	}
 }
 
 /**
@@ -3786,7 +3830,8 @@ mlx5_tx_burst_tmpl(struct mlx5_txq_data *__rte_restrict txq,
 		__mlx5_tx_free_mbuf(txq, pkts, loc.mbuf_free, olx);
 	/* Trace productive bursts only. */
 	if (__rte_trace_point_fp_is_enabled() && loc.pkts_sent)
-		rte_pmd_mlx5_trace_tx_exit(loc.pkts_sent, pkts_n);
+		rte_pmd_mlx5_trace_tx_exit(mlx5_read_pcibar_clock_from_txq(txq),
+					   loc.pkts_sent, pkts_n);
 	return loc.pkts_sent;
 }
 
diff --git a/drivers/net/mlx5/mlx5_txpp.c b/drivers/net/mlx5/mlx5_txpp.c
index 4e26fa2db8..e6d3ad83e9 100644
--- a/drivers/net/mlx5/mlx5_txpp.c
+++ b/drivers/net/mlx5/mlx5_txpp.c
@@ -971,7 +971,6 @@ mlx5_txpp_read_clock(struct rte_eth_dev *dev, uint64_t *timestamp)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_dev_ctx_shared *sh = priv->sh;
-	struct mlx5_proc_priv *ppriv;
 	uint64_t ts;
 	int ret;
 
@@ -997,15 +996,9 @@ mlx5_txpp_read_clock(struct rte_eth_dev *dev, uint64_t *timestamp)
 		*timestamp = ts;
 		return 0;
 	}
-	/* Check and try to map HCA PIC BAR to allow reading real time. */
-	ppriv = dev->process_private;
-	if (ppriv && !ppriv->hca_bar &&
-	    sh->dev_cap.rt_timestamp && mlx5_dev_is_pci(dev->device))
-		mlx5_txpp_map_hca_bar(dev);
 	/* Check if we can read timestamp directly from hardware. */
-	if (ppriv && ppriv->hca_bar) {
-		ts = MLX5_GET64(initial_seg, ppriv->hca_bar, real_time);
-		ts = mlx5_txpp_convert_rx_ts(sh, ts);
+	ts = mlx5_read_pcibar_clock(dev);
+	if (ts != 0) {
 		*timestamp = ts;
 		return 0;
 	}
-- 
2.34.1



* [PATCH 3/4] net/mlx5: fix Tx tracing to use single clock source
  2024-10-09 11:40 [PATCH 0/4] net/mlx5: series to fix and improve tx trace capabilities Viacheslav Ovsiienko
  2024-10-09 11:40 ` [PATCH 1/4] net/mlx5/tools: fix trace dump multiple burst completions Viacheslav Ovsiienko
  2024-10-09 11:40 ` [PATCH 2/4] net/mlx5: fix real time counter reading from PCI BAR Viacheslav Ovsiienko
@ 2024-10-09 11:40 ` Viacheslav Ovsiienko
  2024-10-09 11:40 ` [PATCH 4/4] net/mlx5: update dump script to show incomplete records Viacheslav Ovsiienko
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Viacheslav Ovsiienko @ 2024-10-09 11:40 UTC (permalink / raw)
  To: dev; +Cc: matan, rasland, orika, dsosnowski, Tim Martin, stable

From: Tim Martin <timothym@nvidia.com>

The prior commits introduced tracing for mlx5, but used a mixture of
two unrelated clocks: the TSC for host work submission timestamps
and the NIC HW clock for CQE completion times. Timestamps must come
from a single common clock, and the NIC HW clock is the better
choice since it can be used with externally synchronized clocks.

This patch adds the NIC HW clock as an additional logged parameter
for trace_tx_entry, trace_tx_exit, and trace_tx_wqe. The included
trace analysis Python script is also updated to use the new clock
when it is available.
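The per-event fallback rule the updated script applies can be sketched as follows (`pick_timestamp` is a hypothetical helper name mirroring the patch's "use real_time unless it is zero" logic):

```python
def pick_timestamp(event_real_time, default_clock_ns):
    """Prefer the NIC real-time clock carried in the trace event;
    fall back to the trace's default clock snapshot when the event
    recorded zero (PCI BAR clock unavailable)."""
    return event_real_time if event_real_time != 0 else default_clock_ns


print(pick_timestamp(1728474024000000000, 987654321))  # NIC clock used
print(pick_timestamp(0, 987654321))                    # fallback used
```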

Fixes: a1e910f5b8d4 ("net/mlx5: introduce tracepoints")
Fixes: 9725191a7e14 ("net/mlx5: add Tx datapath trace analyzing script")
Cc: stable@dpdk.org

Signed-off-by: Tim Martin <timothym@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/net/mlx5/mlx5_trace.h        |  9 ++++++---
 drivers/net/mlx5/tools/mlx5_trace.py | 12 +++++++++---
 2 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_trace.h b/drivers/net/mlx5/mlx5_trace.h
index 888d96f60b..656dbb1a4f 100644
--- a/drivers/net/mlx5/mlx5_trace.h
+++ b/drivers/net/mlx5/mlx5_trace.h
@@ -22,21 +22,24 @@ extern "C" {
 /* TX burst subroutines trace points. */
 RTE_TRACE_POINT_FP(
 	rte_pmd_mlx5_trace_tx_entry,
-	RTE_TRACE_POINT_ARGS(uint16_t port_id, uint16_t queue_id),
+	RTE_TRACE_POINT_ARGS(uint64_t real_time, uint16_t port_id, uint16_t queue_id),
+	rte_trace_point_emit_u64(real_time);
 	rte_trace_point_emit_u16(port_id);
 	rte_trace_point_emit_u16(queue_id);
 )
 
 RTE_TRACE_POINT_FP(
 	rte_pmd_mlx5_trace_tx_exit,
-	RTE_TRACE_POINT_ARGS(uint16_t nb_sent, uint16_t nb_req),
+	RTE_TRACE_POINT_ARGS(uint64_t real_time, uint16_t nb_sent, uint16_t nb_req),
+	rte_trace_point_emit_u64(real_time);
 	rte_trace_point_emit_u16(nb_sent);
 	rte_trace_point_emit_u16(nb_req);
 )
 
 RTE_TRACE_POINT_FP(
 	rte_pmd_mlx5_trace_tx_wqe,
-	RTE_TRACE_POINT_ARGS(uint32_t opcode),
+	RTE_TRACE_POINT_ARGS(uint64_t real_time, uint32_t opcode),
+	rte_trace_point_emit_u64(real_time);
 	rte_trace_point_emit_u32(opcode);
 )
 
diff --git a/drivers/net/mlx5/tools/mlx5_trace.py b/drivers/net/mlx5/tools/mlx5_trace.py
index 67461520a9..5eb634a490 100755
--- a/drivers/net/mlx5/tools/mlx5_trace.py
+++ b/drivers/net/mlx5/tools/mlx5_trace.py
@@ -174,7 +174,9 @@ def do_tx_entry(msg, trace):
         return
     # allocate the new burst and append to the queue
     burst = MlxBurst()
-    burst.call_ts = msg.default_clock_snapshot.ns_from_origin
+    burst.call_ts = event["real_time"]
+    if burst.call_ts == 0:
+        burst.call_ts = msg.default_clock_snapshot.ns_from_origin
     trace.tx_blst[cpu_id] = burst
     pq_id = event["port_id"] << 16 | event["queue_id"]
     queue = trace.tx_qlst.get(pq_id)
@@ -194,7 +196,9 @@ def do_tx_exit(msg, trace):
     burst = trace.tx_blst.get(cpu_id)
     if burst is None:
         return
-    burst.done_ts = msg.default_clock_snapshot.ns_from_origin
+    burst.done_ts = event["real_time"]
+    if burst.done_ts == 0:
+        burst.done_ts = msg.default_clock_snapshot.ns_from_origin
     burst.req = event["nb_req"]
     burst.done = event["nb_sent"]
     trace.tx_blst.pop(cpu_id)
@@ -210,7 +214,9 @@ def do_tx_wqe(msg, trace):
     wqe = MlxWqe()
     wqe.wait_ts = trace.tx_wlst.get(cpu_id)
     if wqe.wait_ts is None:
-        wqe.wait_ts = msg.default_clock_snapshot.ns_from_origin
+        wqe.wait_ts = event["real_time"]
+        if wqe.wait_ts == 0:
+            wqe.wait_ts = msg.default_clock_snapshot.ns_from_origin
     wqe.opcode = event["opcode"]
     burst.wqes.append(wqe)
 
-- 
2.34.1



* [PATCH 4/4] net/mlx5: update dump script to show incomplete records
  2024-10-09 11:40 [PATCH 0/4] net/mlx5: series to fix and improve tx trace capabilities Viacheslav Ovsiienko
                   ` (2 preceding siblings ...)
  2024-10-09 11:40 ` [PATCH 3/4] net/mlx5: fix Tx tracing to use single clock source Viacheslav Ovsiienko
@ 2024-10-09 11:40 ` Viacheslav Ovsiienko
  2024-10-09 13:08   ` Dariusz Sosnowski
  2024-10-09 17:41 ` [PATCH 0/4] net/mlx5: series to fix and improve tx trace capabilities Stephen Hemminger
  2024-10-14  8:04 ` [PATCH v2 0/4] net/mlx5: series to fix and improve tx trace capabilitie Viacheslav Ovsiienko
  5 siblings, 1 reply; 13+ messages in thread
From: Viacheslav Ovsiienko @ 2024-10-09 11:40 UTC (permalink / raw)
  To: dev; +Cc: matan, rasland, orika, dsosnowski

If the trace dump is taken while some Tx transfers are still
incomplete - a WQE has been pushed but the hardware has not sent
the completions yet - those incomplete transfers were not dumped
by the mlx5_trace script. In some cases (for example, a stuck
queue) valuable debug information was lost.

To make the dump more complete, the following optional script
arguments are added:

 -v [level] - provides the raw dump of the object record
              of the specified level (0 - bursts, 1 - WQEs,
              2+ - mbufs)
 -a - dumps all bursts, including incomplete ones
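A sketch of the option handling this patch adds to mlx5_trace.py, using the standard argparse module (hypothetical simplification: `-a` is modeled with `store_true` here, while the actual patch uses `nargs="?"` with `const=True`):

```python
import argparse

parser = argparse.ArgumentParser(prog="mlx5_trace.py")
parser.add_argument("path", nargs=1, type=str, help="input trace folder")
parser.add_argument("-a", "--all", action="store_true",
                    help="show all bursts, including incomplete ones")
parser.add_argument("-v", "--verbose", type=int, nargs="?", default=0, const=2,
                    help="raw-dump all records of the given level and below")

# A bare "-v" (no value) selects the const level 2; "-v 1" selects level 1.
args = parser.parse_args(["/var/log/rte-trace", "-a", "-v", "1"])
# args.all == True, args.verbose == 1
```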

Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 doc/guides/nics/mlx5.rst             |  6 +++
 drivers/net/mlx5/tools/mlx5_trace.py | 73 ++++++++++++++++++++--------
 2 files changed, 59 insertions(+), 20 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 1dccdaad50..f82e2d75de 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -2360,6 +2360,12 @@ Steps to enable Tx datapath tracing:
 
    The parameter of the script is the trace data folder.
 
+   The optional parameter ``-a`` forces dumping of incomplete bursts.
+
+   The optional parameter ``-v [level]`` forces dumping of raw record data
+   for the specified level and below. Level 0 dumps bursts, level 1 dumps WQEs,
+   level 2 dumps mbufs.
+
    .. code-block:: console
 
       mlx5_trace.py /var/log/rte-2023-01-23-AM-11-52-39
diff --git a/drivers/net/mlx5/tools/mlx5_trace.py b/drivers/net/mlx5/tools/mlx5_trace.py
index 5eb634a490..96eb82082f 100755
--- a/drivers/net/mlx5/tools/mlx5_trace.py
+++ b/drivers/net/mlx5/tools/mlx5_trace.py
@@ -21,10 +21,13 @@ def __init__(self):
         self.wait_burst = []  # waiting for completion
         self.pq_id = 0
 
-    def log(self):
+    def log(self, all):
         """Log all queue bursts"""
         for txb in self.done_burst:
             txb.log()
+        if all == True:
+            for txb in self.wait_burst:
+                txb.log()
 
 
 class MlxMbuf:
@@ -147,24 +150,26 @@ def __init__(self):
         self.tx_qlst = {}  # active Tx queues per port/queue
         self.tx_wlst = {}  # wait timestamp list per CPU
 
-    def run(self, msg_it):
+    def run(self, msg_it, verbose):
         """Run over gathered tracing data and build database"""
         for msg in msg_it:
             if not isinstance(msg, bt2._EventMessageConst):
                 continue
             event = msg.event
             if event.name.startswith(PFX_TX):
-                do_tx(msg, self)
+                do_tx(msg, self, verbose)
             # Handling of other log event cathegories can be added here
+        if verbose:
+            print("*** End of raw data dump ***")
 
-    def log(self):
+    def log(self, all):
         """Log gathered trace database"""
         for pq_id in self.tx_qlst:
             queue = self.tx_qlst.get(pq_id)
-            queue.log()
+            queue.log(all)
 
 
-def do_tx_entry(msg, trace):
+def do_tx_entry(msg, trace, verbose):
     """Handle entry Tx busrt"""
     event = msg.event
     cpu_id = event["cpu_id"]
@@ -172,6 +177,10 @@ def do_tx_entry(msg, trace):
     if burst is not None:
         # continue existing burst after WAIT
         return
+    if verbose > 0:
+        print("%u:%X tx_entry(real_time=%u, port_id=%u, queue_id=%u)" %
+              (msg.default_clock_snapshot.ns_from_origin, cpu_id,
+               event["real_time"], event["port_id"], event["queue_id"]))
     # allocate the new burst and append to the queue
     burst = MlxBurst()
     burst.call_ts = event["real_time"]
@@ -189,10 +198,14 @@ def do_tx_entry(msg, trace):
     queue.wait_burst.append(burst)
 
 
-def do_tx_exit(msg, trace):
+def do_tx_exit(msg, trace, verbose):
     """Handle exit Tx busrt"""
     event = msg.event
     cpu_id = event["cpu_id"]
+    if verbose > 0:
+        print("%u:%X tx_exit(real_time=%u, nb_sent=%u, nb_req=%u)" %
+              (msg.default_clock_snapshot.ns_from_origin, cpu_id,
+               event["real_time"], event["nb_sent"], event["nb_req"]))
     burst = trace.tx_blst.get(cpu_id)
     if burst is None:
         return
@@ -204,10 +217,14 @@ def do_tx_exit(msg, trace):
     trace.tx_blst.pop(cpu_id)
 
 
-def do_tx_wqe(msg, trace):
+def do_tx_wqe(msg, trace, verbose):
     """Handle WQE record"""
     event = msg.event
     cpu_id = event["cpu_id"]
+    if verbose > 1:
+        print("%u:%X tx_wqe(real_time=%u, opcode=%08X)" %
+              (msg.default_clock_snapshot.ns_from_origin, cpu_id,
+               event["real_time"], event["opcode"]))
     burst = trace.tx_blst.get(cpu_id)
     if burst is None:
         return
@@ -221,17 +238,24 @@ def do_tx_wqe(msg, trace):
     burst.wqes.append(wqe)
 
 
-def do_tx_wait(msg, trace):
+def do_tx_wait(msg, trace, verbose):
     """Handle WAIT record"""
     event = msg.event
     cpu_id = event["cpu_id"]
+    if verbose > 1:
+        print("%u:%X tx_wait(ts=%u)" %
+              (msg.default_clock_snapshot.ns_from_origin, cpu_id, event["ts"]))
     trace.tx_wlst[cpu_id] = event["ts"]
 
 
-def do_tx_push(msg, trace):
+def do_tx_push(msg, trace, verbose):
     """Handle WQE push event"""
     event = msg.event
     cpu_id = event["cpu_id"]
+    if verbose > 2:
+        print("%u:%X tx_push(mbuf=%X, pkt_len=%u, nb_segs=%u, wqe_id=%04X)" %
+              (msg.default_clock_snapshot.ns_from_origin, cpu_id, event["mbuf"],
+               event["mbuf_pkt_len"], event["mbuf_nb_segs"], event["wqe_id"]))
     burst = trace.tx_blst.get(cpu_id)
     if burst is None:
         return
@@ -246,10 +270,15 @@ def do_tx_push(msg, trace):
     wqe.mbuf.append(mbuf)
 
 
-def do_tx_complete(msg, trace):
+def do_tx_complete(msg, trace, verbose):
     """Handle send completion event"""
     event = msg.event
     pq_id = event["port_id"] << 16 | event["queue_id"]
+    if verbose > 1:
+        cpu_id = event["cpu_id"]
+        print("%u:%X tx_complete(port_id=%u, queue_id=%u, ts=%u, wqe_id=%04X)" %
+              (msg.default_clock_snapshot.ns_from_origin, cpu_id,
+               event["port_id"], event["queue_id"], event["ts"], event["wqe_id"]))
     queue = trace.tx_qlst.get(pq_id)
     if queue is None:
         return
@@ -274,21 +303,21 @@ def do_tx_complete(msg, trace):
         queue.wait_burst = queue.wait_burst[rmv:]
 
 
-def do_tx(msg, trace):
+def do_tx(msg, trace, verbose):
     """Handle Tx related records"""
     name = msg.event.name[PFX_TX_LEN:]
     if name == "entry":
-        do_tx_entry(msg, trace)
+        do_tx_entry(msg, trace, verbose)
     elif name == "exit":
-        do_tx_exit(msg, trace)
+        do_tx_exit(msg, trace, verbose)
     elif name == "wqe":
-        do_tx_wqe(msg, trace)
+        do_tx_wqe(msg, trace, verbose)
     elif name == "wait":
-        do_tx_wait(msg, trace)
+        do_tx_wait(msg, trace, verbose)
     elif name == "push":
-        do_tx_push(msg, trace)
+        do_tx_push(msg, trace, verbose)
     elif name == "complete":
-        do_tx_complete(msg, trace)
+        do_tx_complete(msg, trace, verbose)
     else:
         print("Error: unrecognized Tx event name: %s" % msg.event.name, file=sys.stderr)
         raise ValueError()
@@ -299,12 +328,16 @@ def main() -> int:
     try:
         parser = argparse.ArgumentParser()
         parser.add_argument("path", nargs=1, type=str, help="input trace folder")
+        parser.add_argument("-a", "--all", nargs="?", default=False, const=True,
+                            help="show all the bursts, including incomplete ones")
+        parser.add_argument("-v", "--verbose", type=int, nargs="?", default=0, const=2,
+                            help="show all the records below specified level")
         args = parser.parse_args()
 
         mlx_tr = MlxTrace()
         msg_it = bt2.TraceCollectionMessageIterator(args.path)
-        mlx_tr.run(msg_it)
-        mlx_tr.log()
+        mlx_tr.run(msg_it, args.verbose)
+        mlx_tr.log(args.all)
         return 0
     except ValueError:
         return -1
-- 
2.34.1



* RE: [PATCH 1/4] net/mlx5/tools: fix trace dump multiple burst completions
  2024-10-09 11:40 ` [PATCH 1/4] net/mlx5/tools: fix trace dump multiple burst completions Viacheslav Ovsiienko
@ 2024-10-09 13:08   ` Dariusz Sosnowski
  0 siblings, 0 replies; 13+ messages in thread
From: Dariusz Sosnowski @ 2024-10-09 13:08 UTC (permalink / raw)
  To: Slava Ovsiienko, dev; +Cc: Matan Azrad, Raslan Darawsheh, Ori Kam, stable

> -----Original Message-----
> From: Slava Ovsiienko <viacheslavo@nvidia.com>
> Sent: Wednesday, October 9, 2024 13:40
> To: dev@dpdk.org
> Cc: Matan Azrad <matan@nvidia.com>; Raslan Darawsheh
> <rasland@nvidia.com>; Ori Kam <orika@nvidia.com>; Dariusz Sosnowski
> <dsosnowski@nvidia.com>; stable@dpdk.org
> Subject: [PATCH 1/4] net/mlx5/tools: fix trace dump multiple burst completions
> 
> In case if there were multiple bursts completed in the single completion the first
> only burst was moved to the done list.
> The situation is not typical, because usually tracing was used for scheduled traffic
> debugging and for this case each burst had its own completion requested, and
> there were no completions with multiple bursts.
> 
> Fixes: 9725191a7e14 ("net/mlx5: add Tx datapath trace analyzing script")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>

Acked-by: Dariusz Sosnowski <dsosnowski@nvidia.com>

Best regards,
Dariusz Sosnowski


* RE: [PATCH 4/4] net/mlx5: update dump script to show incomplete records
  2024-10-09 11:40 ` [PATCH 4/4] net/mlx5: update dump script to show incomplete records Viacheslav Ovsiienko
@ 2024-10-09 13:08   ` Dariusz Sosnowski
  0 siblings, 0 replies; 13+ messages in thread
From: Dariusz Sosnowski @ 2024-10-09 13:08 UTC (permalink / raw)
  To: Slava Ovsiienko, dev; +Cc: Matan Azrad, Raslan Darawsheh, Ori Kam

> -----Original Message-----
> From: Slava Ovsiienko <viacheslavo@nvidia.com>
> Sent: Wednesday, October 9, 2024 13:40
> To: dev@dpdk.org
> Cc: Matan Azrad <matan@nvidia.com>; Raslan Darawsheh
> <rasland@nvidia.com>; Ori Kam <orika@nvidia.com>; Dariusz Sosnowski
> <dsosnowski@nvidia.com>
> Subject: [PATCH 4/4] net/mlx5: update dump script to show incomplete records
> 
> If the trace dump is stored at the moment when there is some incomplete Tx
> transfers - WQE is pushed but hardware did not sent the completions yet - this
> incomplete was not dumped by the script mlx5_trace. For some cases (for
> example, if queue was stuck) the valuable debug information was lost.
> 
> To improve the dump fullness the following optional script arguments are added:
> 
>  -v [level] - provides the raw dump of the object record
>               of the specified level (0 - bursts, 1 - WQEs,
>               2+ - mbufs)
>  -a - dumps all bursts, including incomplete ones
> 
> Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>

Acked-by: Dariusz Sosnowski <dsosnowski@nvidia.com>

Best regards,
Dariusz Sosnowski


* Re: [PATCH 0/4] net/mlx5: series to fix and improve tx trace capabilities
  2024-10-09 11:40 [PATCH 0/4] net/mlx5: series to fix and improve tx trace capabilities Viacheslav Ovsiienko
                   ` (3 preceding siblings ...)
  2024-10-09 11:40 ` [PATCH 4/4] net/mlx5: update dump script to show incomplete records Viacheslav Ovsiienko
@ 2024-10-09 17:41 ` Stephen Hemminger
  2024-10-14  8:04 ` [PATCH v2 0/4] net/mlx5: series to fix and improve tx trace capabilitie Viacheslav Ovsiienko
  5 siblings, 0 replies; 13+ messages in thread
From: Stephen Hemminger @ 2024-10-09 17:41 UTC (permalink / raw)
  To: Viacheslav Ovsiienko; +Cc: dev, matan, rasland, orika, dsosnowski

On Wed, 9 Oct 2024 14:40:24 +0300
Viacheslav Ovsiienko <viacheslavo@nvidia.com> wrote:

> There is a series of weakly dependent patches related to the tx trace
> in mlx5 PMD:
>  - fix issue in the trace dump Python script
>  - update the real-time clock reading routine, make it atomic
>    and move its initialization to the driver startup
>  - advance the tx tracing in mlx5 PMD, provide real time parameter,
>    read from the NIC to synchronize trace records with HW descriptors
>  - update the trace dump script to see raw records and incomplete bursts
> 
> Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
> 
> Tim Martin (2):
>   net/mlx5: fix real time counter reading from PCI BAR
>   net/mlx5: fix Tx tracing to use single clock source
> 
> Viacheslav Ovsiienko (2):
>   net/mlx5/tools: fix trace dump multiple burst completions
>   net/mlx5: update dump script to show incomplete records
> 
>  .mailmap                             |  1 +
>  doc/guides/nics/mlx5.rst             |  6 ++
>  drivers/net/mlx5/mlx5.c              |  4 ++
>  drivers/net/mlx5/mlx5_trace.h        |  9 ++-
>  drivers/net/mlx5/mlx5_tx.h           | 53 ++++++++++++++--
>  drivers/net/mlx5/mlx5_txpp.c         | 11 +---
>  drivers/net/mlx5/tools/mlx5_trace.py | 90 ++++++++++++++++++++--------
>  7 files changed, 133 insertions(+), 41 deletions(-)

Series-Acked-by: Stephen Hemminger <stephen@networkplumber.org>


* [PATCH v2 0/4] net/mlx5: series to fix and improve tx trace capabilitie
  2024-10-09 11:40 [PATCH 0/4] net/mlx5: series to fix and improve tx trace capabilities Viacheslav Ovsiienko
                   ` (4 preceding siblings ...)
  2024-10-09 17:41 ` [PATCH 0/4] net/mlx5: series to fix and improve tx trace capabilities Stephen Hemminger
@ 2024-10-14  8:04 ` Viacheslav Ovsiienko
  2024-10-14  8:04   ` [PATCH v2 1/4] net/mlx5/tools: fix trace dump multiple burst completions Viacheslav Ovsiienko
                     ` (3 more replies)
  5 siblings, 4 replies; 13+ messages in thread
From: Viacheslav Ovsiienko @ 2024-10-14  8:04 UTC (permalink / raw)
  To: dev; +Cc: matan, rasland, orika, dsosnowski

Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>

---
v1: https://inbox.dpdk.org/dev/20241009114028.973284-1-viacheslavo@nvidia.com/
v2: move part of code between 2nd and 3rd patches to fix single patch compilation issue.

Tim Martin (2):
  net/mlx5: fix real time counter reading from PCI BAR
  net/mlx5: fix Tx tracing to use single clock source

Viacheslav Ovsiienko (2):
  net/mlx5/tools: fix trace dump multiple burst completions
  net/mlx5: update dump script to show incomplete records

 .mailmap                             |  1 +
 doc/guides/nics/mlx5.rst             |  6 ++
 drivers/net/mlx5/mlx5.c              |  4 ++
 drivers/net/mlx5/mlx5_trace.h        |  9 ++-
 drivers/net/mlx5/mlx5_tx.h           | 53 ++++++++++++++--
 drivers/net/mlx5/mlx5_txpp.c         | 11 +---
 drivers/net/mlx5/tools/mlx5_trace.py | 90 ++++++++++++++++++++--------
 7 files changed, 133 insertions(+), 41 deletions(-)

-- 
2.34.1



* [PATCH v2 1/4] net/mlx5/tools: fix trace dump multiple burst completions
  2024-10-14  8:04 ` [PATCH v2 0/4] net/mlx5: series to fix and improve tx trace capabilitie Viacheslav Ovsiienko
@ 2024-10-14  8:04   ` Viacheslav Ovsiienko
  2024-10-14  8:04   ` [PATCH v2 2/4] net/mlx5: fix real time counter reading from PCI BAR Viacheslav Ovsiienko
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Viacheslav Ovsiienko @ 2024-10-14  8:04 UTC (permalink / raw)
  To: dev; +Cc: matan, rasland, orika, dsosnowski, stable

If multiple bursts were completed in a single completion, only the
first burst was moved to the done list. This situation is not
typical: tracing was usually used to debug scheduled traffic, where
each burst had its own completion requested, so completions covering
multiple bursts did not occur.

Fixes: 9725191a7e14 ("net/mlx5: add Tx datapath trace analyzing script")
Cc: stable@dpdk.org

Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Acked-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
---
 drivers/net/mlx5/tools/mlx5_trace.py | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/mlx5/tools/mlx5_trace.py b/drivers/net/mlx5/tools/mlx5_trace.py
index 8c1fd0a350..67461520a9 100755
--- a/drivers/net/mlx5/tools/mlx5_trace.py
+++ b/drivers/net/mlx5/tools/mlx5_trace.py
@@ -258,13 +258,14 @@ def do_tx_complete(msg, trace):
         if burst.comp(wqe_id, wqe_ts) == 0:
             break
         rmv += 1
-    # mode completed burst to done list
+    # move completed burst(s) to done list
     if rmv != 0:
         idx = 0
         while idx < rmv:
+            burst = queue.wait_burst[idx]
             queue.done_burst.append(burst)
             idx += 1
-        del queue.wait_burst[0:rmv]
+        queue.wait_burst = queue.wait_burst[rmv:]
 
 
 def do_tx(msg, trace):
-- 
2.34.1


^ permalink raw reply	[flat|nested] 13+ messages in thread
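The burst-completion fix above can be modeled outside the PMD. A minimal sketch, with hypothetical `wait`/`done` lists standing in for the script's per-queue state (not the actual mlx5_trace.py code):

```python
def complete_bursts(wait, done, rmv):
    """Move the first rmv completed bursts from the wait list to done.

    The pre-fix code appended one stale burst reference rmv times;
    re-reading wait[idx] on every iteration moves each completed burst.
    """
    for idx in range(rmv):
        done.append(wait[idx])
    # equivalent of the patched `queue.wait_burst = queue.wait_burst[rmv:]`
    return wait[rmv:], done


wait = ["burst0", "burst1", "burst2"]
done = []
# one completion event acknowledges the first two pending bursts
wait, done = complete_bursts(wait, done, rmv=2)
```

With `rmv=2` only `"burst2"` remains waiting, matching the slicing in the patched script.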

* [PATCH v2 2/4] net/mlx5: fix real time counter reading from PCI BAR
  2024-10-14  8:04 ` [PATCH v2 0/4] net/mlx5: series to fix and improve tx trace capabilitie Viacheslav Ovsiienko
  2024-10-14  8:04   ` [PATCH v2 1/4] net/mlx5/tools: fix trace dump multiple burst completions Viacheslav Ovsiienko
@ 2024-10-14  8:04   ` Viacheslav Ovsiienko
  2024-10-14  8:04   ` [PATCH v2 3/4] net/mlx5: fix Tx tracing to use single clock source Viacheslav Ovsiienko
  2024-10-14  8:04   ` [PATCH v2 4/4] net/mlx5: update dump script to show incomplete records Viacheslav Ovsiienko
  3 siblings, 0 replies; 13+ messages in thread
From: Viacheslav Ovsiienko @ 2024-10-14  8:04 UTC (permalink / raw)
  To: dev; +Cc: matan, rasland, orika, dsosnowski, Tim Martin, stable

From: Tim Martin <timothym@nvidia.com>

The mlx5_txpp_read_clock() routine reads the 64-bit real
time counter from the device PCI BAR. It has two issues:

  - it checks whether the PCI BAR is mapped into the process
    address space and tries to map it on demand. This is
    problematic if something goes wrong and the mapping fails:
    the attempt is then repeated on every read_clock API call,
    invoking the kernel each time, taking a long time and
    causing application malfunction.

  - the 64-bit counter should be read in a single atomic
    transaction

Fixes: 9b31fc9007f9 ("net/mlx5: fix read device clock in real time mode")
Cc: stable@dpdk.org

Signed-off-by: Tim Martin <timothym@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 .mailmap                     |  1 +
 drivers/net/mlx5/mlx5.c      |  4 ++++
 drivers/net/mlx5/mlx5_tx.h   | 34 +++++++++++++++++++++++++++++++++-
 drivers/net/mlx5/mlx5_txpp.c | 11 ++---------
 4 files changed, 40 insertions(+), 10 deletions(-)

diff --git a/.mailmap b/.mailmap
index b30d993f3b..3065211c0a 100644
--- a/.mailmap
+++ b/.mailmap
@@ -1505,6 +1505,7 @@ Timmons C. Player <timmons.player@spirent.com>
 Timothy McDaniel <timothy.mcdaniel@intel.com>
 Timothy Miskell <timothy.miskell@intel.com>
 Timothy Redaelli <tredaelli@redhat.com>
+Tim Martin <timothym@nvidia.com>
 Tim Shearer <tim.shearer@overturenetworks.com>
 Ting-Kai Ku <ting-kai.ku@intel.com>
 Ting Xu <ting.xu@intel.com>
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index e36fa651a1..52b90e6ff3 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -2242,6 +2242,7 @@ int
 mlx5_proc_priv_init(struct rte_eth_dev *dev)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
+	struct mlx5_dev_ctx_shared *sh = priv->sh;
 	struct mlx5_proc_priv *ppriv;
 	size_t ppriv_size;
 
@@ -2262,6 +2263,9 @@ mlx5_proc_priv_init(struct rte_eth_dev *dev)
 	dev->process_private = ppriv;
 	if (rte_eal_process_type() == RTE_PROC_PRIMARY)
 		priv->sh->pppriv = ppriv;
+	/* Check and try to map HCA PCI BAR to allow reading real time. */
+	if (sh->dev_cap.rt_timestamp && mlx5_dev_is_pci(dev->device))
+		mlx5_txpp_map_hca_bar(dev);
 	return 0;
 }
 
diff --git a/drivers/net/mlx5/mlx5_tx.h b/drivers/net/mlx5/mlx5_tx.h
index 983913faa2..587e6a9f7d 100644
--- a/drivers/net/mlx5/mlx5_tx.h
+++ b/drivers/net/mlx5/mlx5_tx.h
@@ -372,6 +372,38 @@ mlx5_txpp_convert_tx_ts(struct mlx5_dev_ctx_shared *sh, uint64_t mts)
 	return ci;
 }
 
+/**
+ * Read real time clock counter directly from the device PCI BAR area.
+ * The PCI BAR must be mapped to the process memory space at initialization.
+ *
+ * @param dev
+ *   Device to read clock counter from
+ *
+ * @return
+ *   0 - if HCA BAR is not supported or not mapped.
+ *   !=0 - read 64-bit value of real-time in UTC format (nanoseconds)
+ */
+static __rte_always_inline uint64_t mlx5_read_pcibar_clock(struct rte_eth_dev *dev)
+{
+	struct mlx5_proc_priv *ppriv = dev->process_private;
+
+	if (ppriv && ppriv->hca_bar) {
+		struct mlx5_priv *priv = dev->data->dev_private;
+		struct mlx5_dev_ctx_shared *sh = priv->sh;
+		uint64_t *hca_ptr = (uint64_t *)(ppriv->hca_bar) +
+				  __mlx5_64_off(initial_seg, real_time);
+		uint64_t __rte_atomic *ts_addr;
+		uint64_t ts;
+
+		ts_addr = (uint64_t __rte_atomic *)hca_ptr;
+		ts = rte_atomic_load_explicit(ts_addr, rte_memory_order_seq_cst);
+		ts = rte_be_to_cpu_64(ts);
+		ts = mlx5_txpp_convert_rx_ts(sh, ts);
+		return ts;
+	}
+	return 0;
+}
+
 /**
  * Set Software Parser flags and offsets in Ethernet Segment of WQE.
  * Flags must be preliminary initialized to zero.
@@ -822,7 +854,7 @@ mlx5_tx_cseg_init(struct mlx5_txq_data *__rte_restrict txq,
 		cs->flags = RTE_BE32(MLX5_COMP_ONLY_FIRST_ERR <<
 				     MLX5_COMP_MODE_OFFSET);
 	cs->misc = RTE_BE32(0);
-	if (__rte_trace_point_fp_is_enabled() && !loc->pkts_sent)
+	if (__rte_trace_point_fp_is_enabled())
 		rte_pmd_mlx5_trace_tx_entry(txq->port_id, txq->idx);
 	rte_pmd_mlx5_trace_tx_wqe((txq->wqe_ci << 8) | opcode);
 }
diff --git a/drivers/net/mlx5/mlx5_txpp.c b/drivers/net/mlx5/mlx5_txpp.c
index 4e26fa2db8..e6d3ad83e9 100644
--- a/drivers/net/mlx5/mlx5_txpp.c
+++ b/drivers/net/mlx5/mlx5_txpp.c
@@ -971,7 +971,6 @@ mlx5_txpp_read_clock(struct rte_eth_dev *dev, uint64_t *timestamp)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
 	struct mlx5_dev_ctx_shared *sh = priv->sh;
-	struct mlx5_proc_priv *ppriv;
 	uint64_t ts;
 	int ret;
 
@@ -997,15 +996,9 @@ mlx5_txpp_read_clock(struct rte_eth_dev *dev, uint64_t *timestamp)
 		*timestamp = ts;
 		return 0;
 	}
-	/* Check and try to map HCA PIC BAR to allow reading real time. */
-	ppriv = dev->process_private;
-	if (ppriv && !ppriv->hca_bar &&
-	    sh->dev_cap.rt_timestamp && mlx5_dev_is_pci(dev->device))
-		mlx5_txpp_map_hca_bar(dev);
 	/* Check if we can read timestamp directly from hardware. */
-	if (ppriv && ppriv->hca_bar) {
-		ts = MLX5_GET64(initial_seg, ppriv->hca_bar, real_time);
-		ts = mlx5_txpp_convert_rx_ts(sh, ts);
+	ts = mlx5_read_pcibar_clock(dev);
+	if (ts != 0) {
 		*timestamp = ts;
 		return 0;
 	}
-- 
2.34.1


^ permalink raw reply	[flat|nested] 13+ messages in thread
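The second issue above - why the 64-bit counter must be read in a single atomic transaction - can be illustrated with a hypothetical non-atomic reader that samples the two 32-bit halves at different moments (a simulation, not the driver code):

```python
def read_split(low_sample, high_sample):
    """Simulate a 64-bit counter read done as two separate 32-bit loads.

    low_sample/high_sample are the full counter values at the moments
    the low and high halves are loaded; the counter may tick in between.
    """
    low = low_sample & 0xFFFFFFFF
    high = high_sample >> 32
    return (high << 32) | low


before = 0x0000_0000_FFFF_FFFF   # counter just before a carry into the high word
after = 0x0000_0001_0000_0000    # counter one tick later

atomic = after                    # single-transaction read sees a consistent value
torn = read_split(after, before)  # low half read after the carry, high half before
```

A read torn around the carry is off by 2**32 nanoseconds (about 4.3 seconds), which is why the fix loads the timestamp with a single `rte_atomic_load_explicit()`.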

* [PATCH v2 3/4] net/mlx5: fix Tx tracing to use single clock source
  2024-10-14  8:04 ` [PATCH v2 0/4] net/mlx5: series to fix and improve tx trace capabilitie Viacheslav Ovsiienko
  2024-10-14  8:04   ` [PATCH v2 1/4] net/mlx5/tools: fix trace dump multiple burst completions Viacheslav Ovsiienko
  2024-10-14  8:04   ` [PATCH v2 2/4] net/mlx5: fix real time counter reading from PCI BAR Viacheslav Ovsiienko
@ 2024-10-14  8:04   ` Viacheslav Ovsiienko
  2024-10-14  8:04   ` [PATCH v2 4/4] net/mlx5: update dump script to show incomplete records Viacheslav Ovsiienko
  3 siblings, 0 replies; 13+ messages in thread
From: Viacheslav Ovsiienko @ 2024-10-14  8:04 UTC (permalink / raw)
  To: dev; +Cc: matan, rasland, orika, dsosnowski, Tim Martin, stable

From: Tim Martin <timothym@nvidia.com>

The prior commit introduced tracing for mlx5, but it used a
mixture of two unrelated clocks: the TSC for host work
submission timestamps and the NIC HW clock for CQE completion
times. Timestamps must come from a single common clock, and
the NIC HW clock is the better choice since it can be used
with externally synchronized clocks.

This patch adds the NIC HW clock as an additional logged
parameter for trace_tx_entry, trace_tx_exit, and trace_tx_wqe.
The included trace analysis Python script is also updated to
use the new clock when it is available.

Fixes: a1e910f5b8d4 ("net/mlx5: introduce tracepoints")
Fixes: 9725191a7e14 ("net/mlx5: add Tx datapath trace analyzing script")
Cc: stable@dpdk.org

Signed-off-by: Tim Martin <timothym@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/net/mlx5/mlx5_trace.h        |  9 ++++++---
 drivers/net/mlx5/mlx5_tx.h           | 21 +++++++++++++++++----
 drivers/net/mlx5/tools/mlx5_trace.py | 12 +++++++++---
 3 files changed, 32 insertions(+), 10 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_trace.h b/drivers/net/mlx5/mlx5_trace.h
index a8f0b372c8..4fc3584acc 100644
--- a/drivers/net/mlx5/mlx5_trace.h
+++ b/drivers/net/mlx5/mlx5_trace.h
@@ -22,21 +22,24 @@ extern "C" {
 /* TX burst subroutines trace points. */
 RTE_TRACE_POINT_FP(
 	rte_pmd_mlx5_trace_tx_entry,
-	RTE_TRACE_POINT_ARGS(uint16_t port_id, uint16_t queue_id),
+	RTE_TRACE_POINT_ARGS(uint64_t real_time, uint16_t port_id, uint16_t queue_id),
+	rte_trace_point_emit_u64(real_time);
 	rte_trace_point_emit_u16(port_id);
 	rte_trace_point_emit_u16(queue_id);
 )
 
 RTE_TRACE_POINT_FP(
 	rte_pmd_mlx5_trace_tx_exit,
-	RTE_TRACE_POINT_ARGS(uint16_t nb_sent, uint16_t nb_req),
+	RTE_TRACE_POINT_ARGS(uint64_t real_time, uint16_t nb_sent, uint16_t nb_req),
+	rte_trace_point_emit_u64(real_time);
 	rte_trace_point_emit_u16(nb_sent);
 	rte_trace_point_emit_u16(nb_req);
 )
 
 RTE_TRACE_POINT_FP(
 	rte_pmd_mlx5_trace_tx_wqe,
-	RTE_TRACE_POINT_ARGS(uint32_t opcode),
+	RTE_TRACE_POINT_ARGS(uint64_t real_time, uint32_t opcode),
+	rte_trace_point_emit_u64(real_time);
 	rte_trace_point_emit_u32(opcode);
 )
 
diff --git a/drivers/net/mlx5/mlx5_tx.h b/drivers/net/mlx5/mlx5_tx.h
index 587e6a9f7d..55568c41b1 100644
--- a/drivers/net/mlx5/mlx5_tx.h
+++ b/drivers/net/mlx5/mlx5_tx.h
@@ -404,6 +404,14 @@ static __rte_always_inline uint64_t mlx5_read_pcibar_clock(struct rte_eth_dev *d
 	return 0;
 }
 
+static __rte_always_inline uint64_t mlx5_read_pcibar_clock_from_txq(struct mlx5_txq_data *txq)
+{
+	struct mlx5_txq_ctrl *txq_ctrl = container_of(txq, struct mlx5_txq_ctrl, txq);
+	struct rte_eth_dev *dev = ETH_DEV(txq_ctrl->priv);
+
+	return mlx5_read_pcibar_clock(dev);
+}
+
 /**
  * Set Software Parser flags and offsets in Ethernet Segment of WQE.
  * Flags must be preliminary initialized to zero.
@@ -841,6 +849,7 @@ mlx5_tx_cseg_init(struct mlx5_txq_data *__rte_restrict txq,
 		  unsigned int olx)
 {
 	struct mlx5_wqe_cseg *__rte_restrict cs = &wqe->cseg;
+	uint64_t real_time;
 
 	/* For legacy MPW replace the EMPW by TSO with modifier. */
 	if (MLX5_TXOFF_CONFIG(MPW) && opcode == MLX5_OPCODE_ENHANCED_MPSW)
@@ -854,9 +863,12 @@ mlx5_tx_cseg_init(struct mlx5_txq_data *__rte_restrict txq,
 		cs->flags = RTE_BE32(MLX5_COMP_ONLY_FIRST_ERR <<
 				     MLX5_COMP_MODE_OFFSET);
 	cs->misc = RTE_BE32(0);
-	if (__rte_trace_point_fp_is_enabled())
-		rte_pmd_mlx5_trace_tx_entry(txq->port_id, txq->idx);
-	rte_pmd_mlx5_trace_tx_wqe((txq->wqe_ci << 8) | opcode);
+	if (__rte_trace_point_fp_is_enabled()) {
+		real_time = mlx5_read_pcibar_clock_from_txq(txq);
+		if (!loc->pkts_sent)
+			rte_pmd_mlx5_trace_tx_entry(real_time, txq->port_id, txq->idx);
+		rte_pmd_mlx5_trace_tx_wqe(real_time, (txq->wqe_ci << 8) | opcode);
+	}
 }
 
 /**
@@ -3818,7 +3830,8 @@ mlx5_tx_burst_tmpl(struct mlx5_txq_data *__rte_restrict txq,
 		__mlx5_tx_free_mbuf(txq, pkts, loc.mbuf_free, olx);
 	/* Trace productive bursts only. */
 	if (__rte_trace_point_fp_is_enabled() && loc.pkts_sent)
-		rte_pmd_mlx5_trace_tx_exit(loc.pkts_sent, pkts_n);
+		rte_pmd_mlx5_trace_tx_exit(mlx5_read_pcibar_clock_from_txq(txq),
+					   loc.pkts_sent, pkts_n);
 	return loc.pkts_sent;
 }
 
diff --git a/drivers/net/mlx5/tools/mlx5_trace.py b/drivers/net/mlx5/tools/mlx5_trace.py
index 67461520a9..5eb634a490 100755
--- a/drivers/net/mlx5/tools/mlx5_trace.py
+++ b/drivers/net/mlx5/tools/mlx5_trace.py
@@ -174,7 +174,9 @@ def do_tx_entry(msg, trace):
         return
     # allocate the new burst and append to the queue
     burst = MlxBurst()
-    burst.call_ts = msg.default_clock_snapshot.ns_from_origin
+    burst.call_ts = event["real_time"]
+    if burst.call_ts == 0:
+        burst.call_ts = msg.default_clock_snapshot.ns_from_origin
     trace.tx_blst[cpu_id] = burst
     pq_id = event["port_id"] << 16 | event["queue_id"]
     queue = trace.tx_qlst.get(pq_id)
@@ -194,7 +196,9 @@ def do_tx_exit(msg, trace):
     burst = trace.tx_blst.get(cpu_id)
     if burst is None:
         return
-    burst.done_ts = msg.default_clock_snapshot.ns_from_origin
+    burst.done_ts = event["real_time"]
+    if burst.done_ts == 0:
+        burst.done_ts = msg.default_clock_snapshot.ns_from_origin
     burst.req = event["nb_req"]
     burst.done = event["nb_sent"]
     trace.tx_blst.pop(cpu_id)
@@ -210,7 +214,9 @@ def do_tx_wqe(msg, trace):
     wqe = MlxWqe()
     wqe.wait_ts = trace.tx_wlst.get(cpu_id)
     if wqe.wait_ts is None:
-        wqe.wait_ts = msg.default_clock_snapshot.ns_from_origin
+        wqe.wait_ts = event["real_time"]
+        if wqe.wait_ts == 0:
+            wqe.wait_ts = msg.default_clock_snapshot.ns_from_origin
     wqe.opcode = event["opcode"]
     burst.wqes.append(wqe)
 
-- 
2.34.1


^ permalink raw reply	[flat|nested] 13+ messages in thread
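The fallback logic the patch adds to the analysis script - prefer the NIC real-time value, fall back to the trace's own clock when it is zero - can be sketched standalone (a hypothetical helper, not part of mlx5_trace.py):

```python
def pick_timestamp(real_time, default_ns):
    """Prefer the NIC HW clock value from the event; a zero real_time
    means the PCI BAR clock was unavailable, so fall back to the
    trace's default clock snapshot (ns from origin)."""
    return real_time if real_time != 0 else default_ns


hw = pick_timestamp(1728900000123456789, 555)  # NIC clock present
fallback = pick_timestamp(0, 555)              # NIC clock unavailable
```

This mirrors the `if burst.call_ts == 0:` checks in do_tx_entry/do_tx_exit/do_tx_wqe above.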

* [PATCH v2 4/4] net/mlx5: update dump script to show incomplete records
  2024-10-14  8:04 ` [PATCH v2 0/4] net/mlx5: series to fix and improve tx trace capabilitie Viacheslav Ovsiienko
                     ` (2 preceding siblings ...)
  2024-10-14  8:04   ` [PATCH v2 3/4] net/mlx5: fix Tx tracing to use single clock source Viacheslav Ovsiienko
@ 2024-10-14  8:04   ` Viacheslav Ovsiienko
  3 siblings, 0 replies; 13+ messages in thread
From: Viacheslav Ovsiienko @ 2024-10-14  8:04 UTC (permalink / raw)
  To: dev; +Cc: matan, rasland, orika, dsosnowski

If the trace dump is captured while there are some
incomplete Tx transfers - WQEs are pushed but the hardware
has not sent the completions yet - these incomplete bursts
were not dumped by the mlx5_trace script. In some cases
(for example, if a queue was stuck) valuable debug
information was lost.

To make the dump more complete, the following optional
script arguments are added:

 -v [level] - provides a raw dump of the object records
              of the specified level and below (0 - bursts,
              1 - WQEs, 2+ - mbufs)
 -a - dumps all bursts, including incomplete ones

Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Acked-by: Dariusz Sosnowski <dsosnowski@nvidia.com>

---
 doc/guides/nics/mlx5.rst             |  6 +++
 drivers/net/mlx5/tools/mlx5_trace.py | 73 ++++++++++++++++++++--------
 2 files changed, 59 insertions(+), 20 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 1dccdaad50..f82e2d75de 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -2360,6 +2360,12 @@ Steps to enable Tx datapath tracing:
 
    The parameter of the script is the trace data folder.
 
+   The optional parameter ``-a`` forces dumping of incomplete bursts.
+
+   The optional parameter ``-v [level]`` forces dumping of raw record data
+   for the specified level and below. Level 0 dumps bursts, level 1 dumps WQEs,
+   level 2 dumps mbufs.
+
    .. code-block:: console
 
       mlx5_trace.py /var/log/rte-2023-01-23-AM-11-52-39
diff --git a/drivers/net/mlx5/tools/mlx5_trace.py b/drivers/net/mlx5/tools/mlx5_trace.py
index 5eb634a490..96eb82082f 100755
--- a/drivers/net/mlx5/tools/mlx5_trace.py
+++ b/drivers/net/mlx5/tools/mlx5_trace.py
@@ -21,10 +21,13 @@ def __init__(self):
         self.wait_burst = []  # waiting for completion
         self.pq_id = 0
 
-    def log(self):
+    def log(self, all):
         """Log all queue bursts"""
         for txb in self.done_burst:
             txb.log()
+        if all == True:
+            for txb in self.wait_burst:
+                txb.log()
 
 
 class MlxMbuf:
@@ -147,24 +150,26 @@ def __init__(self):
         self.tx_qlst = {}  # active Tx queues per port/queue
         self.tx_wlst = {}  # wait timestamp list per CPU
 
-    def run(self, msg_it):
+    def run(self, msg_it, verbose):
         """Run over gathered tracing data and build database"""
         for msg in msg_it:
             if not isinstance(msg, bt2._EventMessageConst):
                 continue
             event = msg.event
             if event.name.startswith(PFX_TX):
-                do_tx(msg, self)
+                do_tx(msg, self, verbose)
             # Handling of other log event cathegories can be added here
+        if verbose:
+            print("*** End of raw data dump ***")
 
-    def log(self):
+    def log(self, all):
         """Log gathered trace database"""
         for pq_id in self.tx_qlst:
             queue = self.tx_qlst.get(pq_id)
-            queue.log()
+            queue.log(all)
 
 
-def do_tx_entry(msg, trace):
+def do_tx_entry(msg, trace, verbose):
     """Handle entry Tx busrt"""
     event = msg.event
     cpu_id = event["cpu_id"]
@@ -172,6 +177,10 @@ def do_tx_entry(msg, trace):
     if burst is not None:
         # continue existing burst after WAIT
         return
+    if verbose > 0:
+        print("%u:%X tx_entry(real_time=%u, port_id=%u, queue_id=%u)" %
+              (msg.default_clock_snapshot.ns_from_origin, cpu_id,
+               event["real_time"], event["port_id"], event["queue_id"]))
     # allocate the new burst and append to the queue
     burst = MlxBurst()
     burst.call_ts = event["real_time"]
@@ -189,10 +198,14 @@ def do_tx_entry(msg, trace):
     queue.wait_burst.append(burst)
 
 
-def do_tx_exit(msg, trace):
+def do_tx_exit(msg, trace, verbose):
     """Handle exit Tx busrt"""
     event = msg.event
     cpu_id = event["cpu_id"]
+    if verbose > 0:
+        print("%u:%X tx_exit(real_time=%u, nb_sent=%u, nb_req=%u)" %
+              (msg.default_clock_snapshot.ns_from_origin, cpu_id,
+               event["real_time"], event["nb_sent"], event["nb_req"]))
     burst = trace.tx_blst.get(cpu_id)
     if burst is None:
         return
@@ -204,10 +217,14 @@ def do_tx_exit(msg, trace):
     trace.tx_blst.pop(cpu_id)
 
 
-def do_tx_wqe(msg, trace):
+def do_tx_wqe(msg, trace, verbose):
     """Handle WQE record"""
     event = msg.event
     cpu_id = event["cpu_id"]
+    if verbose > 1:
+        print("%u:%X tx_wqe(real_time=%u, opcode=%08X)" %
+              (msg.default_clock_snapshot.ns_from_origin, cpu_id,
+               event["real_time"], event["opcode"]))
     burst = trace.tx_blst.get(cpu_id)
     if burst is None:
         return
@@ -221,17 +238,24 @@ def do_tx_wqe(msg, trace):
     burst.wqes.append(wqe)
 
 
-def do_tx_wait(msg, trace):
+def do_tx_wait(msg, trace, verbose):
     """Handle WAIT record"""
     event = msg.event
     cpu_id = event["cpu_id"]
+    if verbose > 1:
+        print("%u:%X tx_wait(ts=%u)" %
+              (msg.default_clock_snapshot.ns_from_origin, cpu_id, event["ts"]))
     trace.tx_wlst[cpu_id] = event["ts"]
 
 
-def do_tx_push(msg, trace):
+def do_tx_push(msg, trace, verbose):
     """Handle WQE push event"""
     event = msg.event
     cpu_id = event["cpu_id"]
+    if verbose > 2:
+        print("%u:%X tx_push(mbuf=%X, pkt_len=%u, nb_segs=%u, wqe_id=%04X)" %
+              (msg.default_clock_snapshot.ns_from_origin, cpu_id, event["mbuf"],
+               event["mbuf_pkt_len"], event["mbuf_nb_segs"], event["wqe_id"]))
     burst = trace.tx_blst.get(cpu_id)
     if burst is None:
         return
@@ -246,10 +270,15 @@ def do_tx_push(msg, trace):
     wqe.mbuf.append(mbuf)
 
 
-def do_tx_complete(msg, trace):
+def do_tx_complete(msg, trace, verbose):
     """Handle send completion event"""
     event = msg.event
     pq_id = event["port_id"] << 16 | event["queue_id"]
+    if verbose > 1:
+        cpu_id = event["cpu_id"]
+        print("%u:%X tx_complete(port_id=%u, queue_id=%u, ts=%u, wqe_id=%04X)" %
+              (msg.default_clock_snapshot.ns_from_origin, cpu_id,
+               event["port_id"], event["queue_id"], event["ts"], event["wqe_id"]))
     queue = trace.tx_qlst.get(pq_id)
     if queue is None:
         return
@@ -274,21 +303,21 @@ def do_tx_complete(msg, trace):
         queue.wait_burst = queue.wait_burst[rmv:]
 
 
-def do_tx(msg, trace):
+def do_tx(msg, trace, verbose):
     """Handle Tx related records"""
     name = msg.event.name[PFX_TX_LEN:]
     if name == "entry":
-        do_tx_entry(msg, trace)
+        do_tx_entry(msg, trace, verbose)
     elif name == "exit":
-        do_tx_exit(msg, trace)
+        do_tx_exit(msg, trace, verbose)
     elif name == "wqe":
-        do_tx_wqe(msg, trace)
+        do_tx_wqe(msg, trace, verbose)
     elif name == "wait":
-        do_tx_wait(msg, trace)
+        do_tx_wait(msg, trace, verbose)
     elif name == "push":
-        do_tx_push(msg, trace)
+        do_tx_push(msg, trace, verbose)
     elif name == "complete":
-        do_tx_complete(msg, trace)
+        do_tx_complete(msg, trace, verbose)
     else:
         print("Error: unrecognized Tx event name: %s" % msg.event.name, file=sys.stderr)
         raise ValueError()
@@ -299,12 +328,16 @@ def main() -> int:
     try:
         parser = argparse.ArgumentParser()
         parser.add_argument("path", nargs=1, type=str, help="input trace folder")
+        parser.add_argument("-a", "--all", nargs="?", default=False, const=True,
+                            help="show all the bursts, including incomplete ones")
+        parser.add_argument("-v", "--verbose", type=int, nargs="?", default=0, const=2,
+                            help="show all the records below specified level")
         args = parser.parse_args()
 
         mlx_tr = MlxTrace()
         msg_it = bt2.TraceCollectionMessageIterator(args.path)
-        mlx_tr.run(msg_it)
-        mlx_tr.log()
+        mlx_tr.run(msg_it, args.verbose)
+        mlx_tr.log(args.all)
         return 0
     except ValueError:
         return -1
-- 
2.34.1


^ permalink raw reply	[flat|nested] 13+ messages in thread
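The option handling added to the script can be reproduced with a small standalone parser using the same flags as the patch (a behavior sketch only; `make_parser` is an illustrative wrapper, not a function in mlx5_trace.py):

```python
import argparse


def make_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("path", nargs=1, type=str, help="input trace folder")
    # -a without a value turns on dumping of incomplete (waiting) bursts
    parser.add_argument("-a", "--all", nargs="?", default=False, const=True,
                        help="show all the bursts, including incomplete ones")
    # -v alone defaults to level 2 (mbufs); an explicit level may be given
    parser.add_argument("-v", "--verbose", type=int, nargs="?", default=0, const=2,
                        help="show all the records below specified level")
    return parser


args = make_parser().parse_args(["/tmp/trace", "-a", "-v"])
args2 = make_parser().parse_args(["/tmp/trace", "-v", "1"])
```

Note one quirk of `nargs="?"` optionals: because ``-a`` may greedily consume a following non-option token as its value, it is safest to place it after the positional trace-folder path, as in the examples above.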

end of thread, other threads:[~2024-10-14  8:06 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-09 11:40 [PATCH 0/4] net/mlx5: series to fix and improve tx trace capabilities Viacheslav Ovsiienko
2024-10-09 11:40 ` [PATCH 1/4] net/mlx5/tools: fix trace dump multiple burst completions Viacheslav Ovsiienko
2024-10-09 13:08   ` Dariusz Sosnowski
2024-10-09 11:40 ` [PATCH 2/4] net/mlx5: fix real time counter reading from PCI BAR Viacheslav Ovsiienko
2024-10-09 11:40 ` [PATCH 3/4] net/mlx5: fix Tx tracing to use single clock source Viacheslav Ovsiienko
2024-10-09 11:40 ` [PATCH 4/4] net/mlx5: update dump script to show incomplete records Viacheslav Ovsiienko
2024-10-09 13:08   ` Dariusz Sosnowski
2024-10-09 17:41 ` [PATCH 0/4] net/mlx5: series to fix and improve tx trace capabilities Stephen Hemminger
2024-10-14  8:04 ` [PATCH v2 0/4] net/mlx5: series to fix and improve tx trace capabilitie Viacheslav Ovsiienko
2024-10-14  8:04   ` [PATCH v2 1/4] net/mlx5/tools: fix trace dump multiple burst completions Viacheslav Ovsiienko
2024-10-14  8:04   ` [PATCH v2 2/4] net/mlx5: fix real time counter reading from PCI BAR Viacheslav Ovsiienko
2024-10-14  8:04   ` [PATCH v2 3/4] net/mlx5: fix Tx tracing to use single clock source Viacheslav Ovsiienko
2024-10-14  8:04   ` [PATCH v2 4/4] net/mlx5: update dump script to show incomplete records Viacheslav Ovsiienko
