DPDK patches and discussions
* mbuf performance optimization
@ 2022-12-03 17:13 Morten Brørup
  2022-12-04 10:10 ` [PATCH v2] mbuf perf test, please ignore Morten Brørup
                   ` (10 more replies)
  0 siblings, 11 replies; 12+ messages in thread
From: Morten Brørup @ 2022-12-03 17:13 UTC (permalink / raw)
  To: olivier.matz, Shijith Thotton, thomas, andrew.rybchenko,
	honnappa.nagarahalli, bruce.richardson
  Cc: dev

I have been playing around with the idea of making some changes to avoid using the mbuf's 2nd cache line in many common cases, which would reduce the cache pressure significantly, and thus improve performance. I would like to discuss whether it is feasible. (And let's just assume that ABI breakage is an acceptable tradeoff.)

Move 'tx_offload' to the 1st cache line
---------------------------------------
Under all circumstances:

We would need to move the 'tx_offload' field to the 1st cache line. This field is set by the application's packet forwarding pipeline stage, and read by the PMD TX function. In most cases, these two stages directly follow each other.

This also means that we must make room for it by moving a 64-bit field from the 1st to the 2nd cache line. It could be the 'next' or the 'pool' field, as discussed below.


The 'next' field - make it conditional
--------------------------------------
Optimization for (1) non-segmented packets:

We could avoid touching the 'next' field by making its validity depend on something in the first cache line. E.g.:
- Use the 'ol_flags' field. Add a RTE_MBUF_F_MORE_SEGS flag, to be set/cleared when setting/clearing the 'next' field.
- Use the 'nb_segs' field. Set the 'nb_segs' field to a value >1 when setting the 'next' field, and set it to 1 when clearing the 'next' field.


The 'pool' field - use it less frequently
-----------------------------------------
Optimizations for (2) single-mempool TX queues and (3) single-mempool applications:

The 'pool' field seems to be used only when a PMD frees a burst of mbufs that it has finished transmitting. Please correct me if I am wrong here.

We could introduce a sibling to RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE, with the only requirement being that the mbufs come from the same mempool. When set, only the first mbuf in a burst would have its 'pool' field read, avoiding that read for the remaining mbufs in the burst.


For single-mempool applications, we could introduce a global 'mbuf_pool' variable, to be used instead of the mbuf's 'pool' field, if set.


Med venlig hilsen / Kind regards,
-Morten Brørup


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v2] mbuf perf test, please ignore
  2022-12-03 17:13 mbuf performance optimization Morten Brørup
@ 2022-12-04 10:10 ` Morten Brørup
  2022-12-04 12:00 ` [PATCH v3] " Morten Brørup
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Morten Brørup @ 2022-12-04 10:10 UTC (permalink / raw)
  To: dev; +Cc: Morten Brørup

Playing around with the mbuf structure, trying to reduce the use of the
second cache line in some common scenarios.

v2:
* Remove BUILD_BUG_ON in cnxk PMD.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 drivers/net/cnxk/cn10k_ethdev.c |  2 ++
 drivers/net/cnxk/cn9k_ethdev.c  |  2 ++
 lib/mbuf/rte_mbuf_core.h        | 38 ++++++++++++++++-----------------
 3 files changed, 23 insertions(+), 19 deletions(-)

diff --git a/drivers/net/cnxk/cn10k_ethdev.c b/drivers/net/cnxk/cn10k_ethdev.c
index 4658713591..9f6086efe6 100644
--- a/drivers/net/cnxk/cn10k_ethdev.c
+++ b/drivers/net/cnxk/cn10k_ethdev.c
@@ -72,8 +72,10 @@ nix_tx_offload_flags(struct rte_eth_dev *eth_dev)
 			 offsetof(struct rte_mbuf, buf_addr) + 24);
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
 			 offsetof(struct rte_mbuf, ol_flags) + 12);
+/*
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, tx_offload) !=
 			 offsetof(struct rte_mbuf, pool) + 2 * sizeof(void *));
+*/
 
 	if (conf & RTE_ETH_TX_OFFLOAD_VLAN_INSERT ||
 	    conf & RTE_ETH_TX_OFFLOAD_QINQ_INSERT)
diff --git a/drivers/net/cnxk/cn9k_ethdev.c b/drivers/net/cnxk/cn9k_ethdev.c
index 3b702d9696..3e9161ca79 100644
--- a/drivers/net/cnxk/cn9k_ethdev.c
+++ b/drivers/net/cnxk/cn9k_ethdev.c
@@ -72,8 +72,10 @@ nix_tx_offload_flags(struct rte_eth_dev *eth_dev)
 			 offsetof(struct rte_mbuf, buf_addr) + 24);
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
 			 offsetof(struct rte_mbuf, ol_flags) + 12);
+/*
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, tx_offload) !=
 			 offsetof(struct rte_mbuf, pool) + 2 * sizeof(void *));
+*/
 
 	if (conf & RTE_ETH_TX_OFFLOAD_VLAN_INSERT ||
 	    conf & RTE_ETH_TX_OFFLOAD_QINQ_INSERT)
diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
index a30e1e0eaf..c0c3b45024 100644
--- a/lib/mbuf/rte_mbuf_core.h
+++ b/lib/mbuf/rte_mbuf_core.h
@@ -594,25 +594,6 @@ struct rte_mbuf {
 
 	uint16_t buf_len;         /**< Length of segment buffer. */
 
-	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
-
-	/* second cache line - fields only used in slow path or on TX */
-	RTE_MARKER cacheline1 __rte_cache_min_aligned;
-
-#if RTE_IOVA_AS_PA
-	/**
-	 * Next segment of scattered packet. Must be NULL in the last
-	 * segment or in case of non-segmented packet.
-	 */
-	struct rte_mbuf *next;
-#else
-	/**
-	 * Reserved for dynamic fields
-	 * when the next pointer is in first cache line (i.e. RTE_IOVA_AS_PA is 0).
-	 */
-	uint64_t dynfield2;
-#endif
-
 	/* fields to support TX offloads */
 	RTE_STD_C11
 	union {
@@ -651,6 +632,25 @@ struct rte_mbuf {
 		};
 	};
 
+	/* second cache line - fields only used in slow path or on TX */
+	RTE_MARKER cacheline1 __rte_cache_min_aligned;
+
+#if RTE_IOVA_AS_PA
+	/**
+	 * Next segment of scattered packet. Must be NULL in the last
+	 * segment or in case of non-segmented packet.
+	 */
+	struct rte_mbuf *next;
+#else
+	/**
+	 * Reserved for dynamic fields
+	 * when the next pointer is in first cache line (i.e. RTE_IOVA_AS_PA is 0).
+	 */
+	uint64_t dynfield2;
+#endif
+
+	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
+
 	/** Shared data for external buffer attached to mbuf. See
 	 * rte_pktmbuf_attach_extbuf().
 	 */
-- 
2.17.1



* [PATCH v3] mbuf perf test, please ignore
  2022-12-03 17:13 mbuf performance optimization Morten Brørup
  2022-12-04 10:10 ` [PATCH v2] mbuf perf test, please ignore Morten Brørup
@ 2022-12-04 12:00 ` Morten Brørup
  2022-12-04 12:33 ` [PATCH v4] " Morten Brørup
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Morten Brørup @ 2022-12-04 12:00 UTC (permalink / raw)
  To: dev; +Cc: Morten Brørup

Playing around with the mbuf structure, trying to reduce the use of the
second cache line in some common scenarios.

v3:
* Make 'next' depend on 'nb_segs' > 1.
* Implement new interpretation of 'nb_segs' in i40e PMD.
v2:
* Remove BUILD_BUG_ON in cnxk PMD.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 drivers/net/cnxk/cn10k_ethdev.c          |  2 ++
 drivers/net/cnxk/cn9k_ethdev.c           |  2 ++
 drivers/net/i40e/i40e_rxtx.c             |  9 ++++--
 drivers/net/i40e/i40e_rxtx_vec_altivec.c |  4 +++
 drivers/net/i40e/i40e_rxtx_vec_common.h  |  2 ++
 drivers/net/i40e/i40e_rxtx_vec_neon.c    |  4 +++
 lib/mbuf/rte_mbuf.c                      | 39 +++++++++++++-----------
 lib/mbuf/rte_mbuf.h                      | 19 ++++++------
 lib/mbuf/rte_mbuf_core.h                 | 38 +++++++++++------------
 9 files changed, 70 insertions(+), 49 deletions(-)

diff --git a/drivers/net/cnxk/cn10k_ethdev.c b/drivers/net/cnxk/cn10k_ethdev.c
index 4658713591..9f6086efe6 100644
--- a/drivers/net/cnxk/cn10k_ethdev.c
+++ b/drivers/net/cnxk/cn10k_ethdev.c
@@ -72,8 +72,10 @@ nix_tx_offload_flags(struct rte_eth_dev *eth_dev)
 			 offsetof(struct rte_mbuf, buf_addr) + 24);
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
 			 offsetof(struct rte_mbuf, ol_flags) + 12);
+/*
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, tx_offload) !=
 			 offsetof(struct rte_mbuf, pool) + 2 * sizeof(void *));
+*/
 
 	if (conf & RTE_ETH_TX_OFFLOAD_VLAN_INSERT ||
 	    conf & RTE_ETH_TX_OFFLOAD_QINQ_INSERT)
diff --git a/drivers/net/cnxk/cn9k_ethdev.c b/drivers/net/cnxk/cn9k_ethdev.c
index 3b702d9696..3e9161ca79 100644
--- a/drivers/net/cnxk/cn9k_ethdev.c
+++ b/drivers/net/cnxk/cn9k_ethdev.c
@@ -72,8 +72,10 @@ nix_tx_offload_flags(struct rte_eth_dev *eth_dev)
 			 offsetof(struct rte_mbuf, buf_addr) + 24);
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
 			 offsetof(struct rte_mbuf, ol_flags) + 12);
+/*
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, tx_offload) !=
 			 offsetof(struct rte_mbuf, pool) + 2 * sizeof(void *));
+*/
 
 	if (conf & RTE_ETH_TX_OFFLOAD_VLAN_INSERT ||
 	    conf & RTE_ETH_TX_OFFLOAD_QINQ_INSERT)
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 788ffb51c2..a08afe0a13 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -920,8 +920,9 @@ i40e_recv_scattered_pkts(void *rx_queue,
 			first_seg->pkt_len =
 				(uint16_t)(first_seg->pkt_len +
 						rx_packet_len);
-			first_seg->nb_segs++;
+			last_seg->nb_segs = 2;
 			last_seg->next = rxm;
+			first_seg->nb_segs++;
 		}
 
 		/**
@@ -944,6 +945,7 @@ i40e_recv_scattered_pkts(void *rx_queue,
 		 *  the length of that CRC part from the data length of the
 		 *  previous mbuf.
 		 */
+		rxm->nb_segs = 1;
 		rxm->next = NULL;
 		if (unlikely(rxq->crc_len > 0)) {
 			first_seg->pkt_len -= RTE_ETHER_CRC_LEN;
@@ -953,6 +955,7 @@ i40e_recv_scattered_pkts(void *rx_queue,
 				last_seg->data_len =
 					(uint16_t)(last_seg->data_len -
 					(RTE_ETHER_CRC_LEN - rx_packet_len));
+				last_seg->nb_segs = 1;
 				last_seg->next = NULL;
 			} else
 				rxm->data_len = (uint16_t)(rx_packet_len -
@@ -1065,7 +1068,7 @@ i40e_calc_pkt_desc(struct rte_mbuf *tx_pkt)
 
 	while (txd != NULL) {
 		count += DIV_ROUND_UP(txd->data_len, I40E_MAX_DATA_PER_TXD);
-		txd = txd->next;
+		txd = (txd->nb_segs == 1) ? NULL : txd->next;
 	}
 
 	return count;
@@ -1282,7 +1285,7 @@ i40e_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 			txe->last_id = tx_last;
 			tx_id = txe->next_id;
 			txe = txn;
-			m_seg = m_seg->next;
+			m_seg = (m_seg->nb_segs == 1) ? NULL : m_seg->next;
 		} while (m_seg != NULL);
 
 		/* The last packet data descriptor needs End Of Packet (EOP) */
diff --git a/drivers/net/i40e/i40e_rxtx_vec_altivec.c b/drivers/net/i40e/i40e_rxtx_vec_altivec.c
index 2dfa04599c..ad91b5cb60 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_altivec.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_altivec.c
@@ -410,6 +410,10 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 			split_packet += RTE_I40E_DESCS_PER_LOOP;
 
 			/* zero-out next pointers */
+			rx_pkts[pos]->nb_segs = 1;
+			rx_pkts[pos + 1]->nb_segs = 1;
+			rx_pkts[pos + 2]->nb_segs = 1;
+			rx_pkts[pos + 3]->nb_segs = 1;
 			rx_pkts[pos]->next = NULL;
 			rx_pkts[pos + 1]->next = NULL;
 			rx_pkts[pos + 2]->next = NULL;
diff --git a/drivers/net/i40e/i40e_rxtx_vec_common.h b/drivers/net/i40e/i40e_rxtx_vec_common.h
index fe1a6ec75e..d5799f5242 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_common.h
+++ b/drivers/net/i40e/i40e_rxtx_vec_common.h
@@ -27,6 +27,7 @@ reassemble_packets(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_bufs,
 	for (buf_idx = 0, pkt_idx = 0; buf_idx < nb_bufs; buf_idx++) {
 		if (end != NULL) {
 			/* processing a split packet */
+			end->nb_segs = 2;
 			end->next = rx_bufs[buf_idx];
 			rx_bufs[buf_idx]->data_len += rxq->crc_len;
 
@@ -52,6 +53,7 @@ reassemble_packets(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_bufs,
 						secondlast = secondlast->next;
 					secondlast->data_len -= (rxq->crc_len -
 							end->data_len);
+					secondlast->nb_segs = 1;
 					secondlast->next = NULL;
 					rte_pktmbuf_free_seg(end);
 				}
diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c b/drivers/net/i40e/i40e_rxtx_vec_neon.c
index 12e6f1cbcb..3199b0f8cf 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
@@ -541,6 +541,10 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *__rte_restrict rxq,
 			split_packet += RTE_I40E_DESCS_PER_LOOP;
 
 			/* zero-out next pointers */
+			rx_pkts[pos]->nb_segs = 1;
+			rx_pkts[pos + 1]->nb_segs = 1;
+			rx_pkts[pos + 2]->nb_segs = 1;
+			rx_pkts[pos + 3]->nb_segs = 1;
 			rx_pkts[pos]->next = NULL;
 			rx_pkts[pos + 1]->next = NULL;
 			rx_pkts[pos + 2]->next = NULL;
diff --git a/lib/mbuf/rte_mbuf.c b/lib/mbuf/rte_mbuf.c
index cfd8062f1e..517562a5b9 100644
--- a/lib/mbuf/rte_mbuf.c
+++ b/lib/mbuf/rte_mbuf.c
@@ -123,10 +123,10 @@ rte_pktmbuf_free_pinned_extmem(void *addr, void *opaque)
 
 	rte_mbuf_ext_refcnt_set(m->shinfo, 1);
 	m->ol_flags = RTE_MBUF_F_EXTERNAL;
-	if (m->next != NULL)
-		m->next = NULL;
-	if (m->nb_segs != 1)
+	if (m->nb_segs != 1) {
 		m->nb_segs = 1;
+		m->next = NULL;
+       }
 	rte_mbuf_raw_free(m);
 }
 
@@ -427,7 +427,7 @@ int rte_mbuf_check(const struct rte_mbuf *m, int is_header,
 		}
 		nb_segs -= 1;
 		pkt_len -= m->data_len;
-	} while ((m = m->next) != NULL);
+	} while ((m = ((m->nb_segs == 1) ? NULL : m->next)) != NULL);
 
 	if (nb_segs) {
 		*reason = "bad nb_segs";
@@ -495,7 +495,7 @@ void rte_pktmbuf_free_bulk(struct rte_mbuf **mbufs, unsigned int count)
 		__rte_mbuf_sanity_check(m, 1);
 
 		do {
-			m_next = m->next;
+			m_next = (m->nb_segs == 1) ? NULL : m->next;
 			__rte_pktmbuf_free_seg_via_array(m,
 					pending, &nb_pending,
 					RTE_PKTMBUF_FREE_PENDING_SZ);
@@ -511,7 +511,7 @@ void rte_pktmbuf_free_bulk(struct rte_mbuf **mbufs, unsigned int count)
 struct rte_mbuf *
 rte_pktmbuf_clone(struct rte_mbuf *md, struct rte_mempool *mp)
 {
-	struct rte_mbuf *mc, *mi, **prev;
+	struct rte_mbuf *mc, *mi, *prev;
 	uint32_t pktlen;
 	uint16_t nseg;
 
@@ -520,19 +520,21 @@ rte_pktmbuf_clone(struct rte_mbuf *md, struct rte_mempool *mp)
 		return NULL;
 
 	mi = mc;
-	prev = &mi->next;
+	prev = mi;
 	pktlen = md->pkt_len;
 	nseg = 0;
 
 	do {
 		nseg++;
 		rte_pktmbuf_attach(mi, md);
-		*prev = mi;
-		prev = &mi->next;
-	} while ((md = md->next) != NULL &&
+		prev->nb_segs = 2;
+		prev->next = mi;
+		prev = mi;
+	} while ((md = ((md->nb_segs == 1) ? NULL : md->next)) != NULL &&
 	    (mi = rte_pktmbuf_alloc(mp)) != NULL);
 
-	*prev = NULL;
+	prev->nb_segs = 1;
+	prev->next = NULL;
 	mc->nb_segs = nseg;
 	mc->pkt_len = pktlen;
 
@@ -565,9 +567,9 @@ __rte_pktmbuf_linearize(struct rte_mbuf *mbuf)
 	mbuf->data_len = (uint16_t)(mbuf->pkt_len);
 
 	/* Append data from next segments to the first one */
-	m = mbuf->next;
+	m = (m->nb_segs == 1) ? NULL : m->next;
 	while (m != NULL) {
-		m_next = m->next;
+		m_next = (m->nb_segs == 1) ? NULL : m->next;
 
 		seg_len = rte_pktmbuf_data_len(m);
 		rte_memcpy(buffer, rte_pktmbuf_mtod(m, char *), seg_len);
@@ -589,7 +591,7 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 		 uint32_t off, uint32_t len)
 {
 	const struct rte_mbuf *seg = m;
-	struct rte_mbuf *mc, *m_last, **prev;
+	struct rte_mbuf *mc, *m_last, *prev;
 
 	/* garbage in check */
 	__rte_mbuf_sanity_check(m, 1);
@@ -611,7 +613,7 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 	/* copied mbuf is not indirect or external */
 	mc->ol_flags = m->ol_flags & ~(RTE_MBUF_F_INDIRECT|RTE_MBUF_F_EXTERNAL);
 
-	prev = &mc->next;
+	prev = mc;
 	m_last = mc;
 	while (len > 0) {
 		uint32_t copy_len;
@@ -629,9 +631,10 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 				rte_pktmbuf_free(mc);
 				return NULL;
 			}
+            prev->nb_segs = 2;
+			prev->next = m_last;
 			++mc->nb_segs;
-			*prev = m_last;
-			prev = &m_last->next;
+			prev = m_last;
 		}
 
 		/*
@@ -697,7 +700,7 @@ rte_pktmbuf_dump(FILE *f, const struct rte_mbuf *m, unsigned dump_len)
 		if (len != 0)
 			rte_hexdump(f, NULL, rte_pktmbuf_mtod(m, void *), len);
 		dump_len -= len;
-		m = m->next;
+		m = (m->nb_segs == 1) ? NULL : m->next;
 		nb_segs --;
 	}
 }
diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
index 3a82eb136d..e175641bf5 100644
--- a/lib/mbuf/rte_mbuf.h
+++ b/lib/mbuf/rte_mbuf.h
@@ -1353,10 +1353,10 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 				return NULL;
 		}
 
-		if (m->next != NULL)
-			m->next = NULL;
-		if (m->nb_segs != 1)
+		if (m->nb_segs != 1) {
 			m->nb_segs = 1;
+			m->next = NULL;
+        }
 
 		return m;
 
@@ -1370,10 +1370,10 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 				return NULL;
 		}
 
-		if (m->next != NULL)
-			m->next = NULL;
-		if (m->nb_segs != 1)
+		if (m->nb_segs != 1) {
 			m->nb_segs = 1;
+			m->next = NULL;
+        }
 		rte_mbuf_refcnt_set(m, 1);
 
 		return m;
@@ -1415,7 +1415,7 @@ static inline void rte_pktmbuf_free(struct rte_mbuf *m)
 		__rte_mbuf_sanity_check(m, 1);
 
 	while (m != NULL) {
-		m_next = m->next;
+		m_next = (m->nb_segs == 1) ? NULL : m->next;
 		rte_pktmbuf_free_seg(m);
 		m = m_next;
 	}
@@ -1497,7 +1497,7 @@ static inline void rte_pktmbuf_refcnt_update(struct rte_mbuf *m, int16_t v)
 
 	do {
 		rte_mbuf_refcnt_update(m, v);
-	} while ((m = m->next) != NULL);
+	} while ((m = ((m->nb_segs == 1) ? NULL : m->next)) != NULL);
 }
 
 /**
@@ -1540,7 +1540,7 @@ static inline uint16_t rte_pktmbuf_tailroom(const struct rte_mbuf *m)
 static inline struct rte_mbuf *rte_pktmbuf_lastseg(struct rte_mbuf *m)
 {
 	__rte_mbuf_sanity_check(m, 1);
-	while (m->next != NULL)
+	while (m->nb_segs != 1)
 		m = m->next;
 	return m;
 }
@@ -1765,6 +1765,7 @@ static inline int rte_pktmbuf_chain(struct rte_mbuf *head, struct rte_mbuf *tail
 
 	/* Chain 'tail' onto the old tail */
 	cur_tail = rte_pktmbuf_lastseg(head);
+	cur_tail->nb_segs = 2;
 	cur_tail->next = tail;
 
 	/* accumulate number of segments and total length.
diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
index a30e1e0eaf..c0c3b45024 100644
--- a/lib/mbuf/rte_mbuf_core.h
+++ b/lib/mbuf/rte_mbuf_core.h
@@ -594,25 +594,6 @@ struct rte_mbuf {
 
 	uint16_t buf_len;         /**< Length of segment buffer. */
 
-	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
-
-	/* second cache line - fields only used in slow path or on TX */
-	RTE_MARKER cacheline1 __rte_cache_min_aligned;
-
-#if RTE_IOVA_AS_PA
-	/**
-	 * Next segment of scattered packet. Must be NULL in the last
-	 * segment or in case of non-segmented packet.
-	 */
-	struct rte_mbuf *next;
-#else
-	/**
-	 * Reserved for dynamic fields
-	 * when the next pointer is in first cache line (i.e. RTE_IOVA_AS_PA is 0).
-	 */
-	uint64_t dynfield2;
-#endif
-
 	/* fields to support TX offloads */
 	RTE_STD_C11
 	union {
@@ -651,6 +632,25 @@ struct rte_mbuf {
 		};
 	};
 
+	/* second cache line - fields only used in slow path or on TX */
+	RTE_MARKER cacheline1 __rte_cache_min_aligned;
+
+#if RTE_IOVA_AS_PA
+	/**
+	 * Next segment of scattered packet. Must be NULL in the last
+	 * segment or in case of non-segmented packet.
+	 */
+	struct rte_mbuf *next;
+#else
+	/**
+	 * Reserved for dynamic fields
+	 * when the next pointer is in first cache line (i.e. RTE_IOVA_AS_PA is 0).
+	 */
+	uint64_t dynfield2;
+#endif
+
+	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
+
 	/** Shared data for external buffer attached to mbuf. See
 	 * rte_pktmbuf_attach_extbuf().
 	 */
-- 
2.17.1



* [PATCH v4] mbuf perf test, please ignore
  2022-12-03 17:13 mbuf performance optimization Morten Brørup
  2022-12-04 10:10 ` [PATCH v2] mbuf perf test, please ignore Morten Brørup
  2022-12-04 12:00 ` [PATCH v3] " Morten Brørup
@ 2022-12-04 12:33 ` Morten Brørup
  2022-12-04 12:54 ` [PATCH v5] " Morten Brørup
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Morten Brørup @ 2022-12-04 12:33 UTC (permalink / raw)
  To: dev; +Cc: Morten Brørup

Playing around with the mbuf structure, trying to reduce the use of the
second cache line in some common scenarios.

v4:
* Use tabs, not spaces.
* Fix copy-paste bug in linearize.
v3:
* Make 'next' depend on 'nb_segs' > 1.
* Implement new interpretation of 'nb_segs' in i40e PMD.
v2:
* Remove BUILD_BUG_ON in cnxk PMD.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 drivers/net/cnxk/cn10k_ethdev.c          |  2 ++
 drivers/net/cnxk/cn9k_ethdev.c           |  2 ++
 drivers/net/i40e/i40e_rxtx.c             |  9 ++++--
 drivers/net/i40e/i40e_rxtx_vec_altivec.c |  4 +++
 drivers/net/i40e/i40e_rxtx_vec_common.h  |  2 ++
 drivers/net/i40e/i40e_rxtx_vec_neon.c    |  4 +++
 lib/mbuf/rte_mbuf.c                      | 39 +++++++++++++-----------
 lib/mbuf/rte_mbuf.h                      | 19 ++++++------
 lib/mbuf/rte_mbuf_core.h                 | 38 +++++++++++------------
 9 files changed, 70 insertions(+), 49 deletions(-)

diff --git a/drivers/net/cnxk/cn10k_ethdev.c b/drivers/net/cnxk/cn10k_ethdev.c
index 4658713591..9f6086efe6 100644
--- a/drivers/net/cnxk/cn10k_ethdev.c
+++ b/drivers/net/cnxk/cn10k_ethdev.c
@@ -72,8 +72,10 @@ nix_tx_offload_flags(struct rte_eth_dev *eth_dev)
 			 offsetof(struct rte_mbuf, buf_addr) + 24);
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
 			 offsetof(struct rte_mbuf, ol_flags) + 12);
+/*
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, tx_offload) !=
 			 offsetof(struct rte_mbuf, pool) + 2 * sizeof(void *));
+*/
 
 	if (conf & RTE_ETH_TX_OFFLOAD_VLAN_INSERT ||
 	    conf & RTE_ETH_TX_OFFLOAD_QINQ_INSERT)
diff --git a/drivers/net/cnxk/cn9k_ethdev.c b/drivers/net/cnxk/cn9k_ethdev.c
index 3b702d9696..3e9161ca79 100644
--- a/drivers/net/cnxk/cn9k_ethdev.c
+++ b/drivers/net/cnxk/cn9k_ethdev.c
@@ -72,8 +72,10 @@ nix_tx_offload_flags(struct rte_eth_dev *eth_dev)
 			 offsetof(struct rte_mbuf, buf_addr) + 24);
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
 			 offsetof(struct rte_mbuf, ol_flags) + 12);
+/*
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, tx_offload) !=
 			 offsetof(struct rte_mbuf, pool) + 2 * sizeof(void *));
+*/
 
 	if (conf & RTE_ETH_TX_OFFLOAD_VLAN_INSERT ||
 	    conf & RTE_ETH_TX_OFFLOAD_QINQ_INSERT)
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 788ffb51c2..a08afe0a13 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -920,8 +920,9 @@ i40e_recv_scattered_pkts(void *rx_queue,
 			first_seg->pkt_len =
 				(uint16_t)(first_seg->pkt_len +
 						rx_packet_len);
-			first_seg->nb_segs++;
+			last_seg->nb_segs = 2;
 			last_seg->next = rxm;
+			first_seg->nb_segs++;
 		}
 
 		/**
@@ -944,6 +945,7 @@ i40e_recv_scattered_pkts(void *rx_queue,
 		 *  the length of that CRC part from the data length of the
 		 *  previous mbuf.
 		 */
+		rxm->nb_segs = 1;
 		rxm->next = NULL;
 		if (unlikely(rxq->crc_len > 0)) {
 			first_seg->pkt_len -= RTE_ETHER_CRC_LEN;
@@ -953,6 +955,7 @@ i40e_recv_scattered_pkts(void *rx_queue,
 				last_seg->data_len =
 					(uint16_t)(last_seg->data_len -
 					(RTE_ETHER_CRC_LEN - rx_packet_len));
+				last_seg->nb_segs = 1;
 				last_seg->next = NULL;
 			} else
 				rxm->data_len = (uint16_t)(rx_packet_len -
@@ -1065,7 +1068,7 @@ i40e_calc_pkt_desc(struct rte_mbuf *tx_pkt)
 
 	while (txd != NULL) {
 		count += DIV_ROUND_UP(txd->data_len, I40E_MAX_DATA_PER_TXD);
-		txd = txd->next;
+		txd = (txd->nb_segs == 1) ? NULL : txd->next;
 	}
 
 	return count;
@@ -1282,7 +1285,7 @@ i40e_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 			txe->last_id = tx_last;
 			tx_id = txe->next_id;
 			txe = txn;
-			m_seg = m_seg->next;
+			m_seg = (m_seg->nb_segs == 1) ? NULL : m_seg->next;
 		} while (m_seg != NULL);
 
 		/* The last packet data descriptor needs End Of Packet (EOP) */
diff --git a/drivers/net/i40e/i40e_rxtx_vec_altivec.c b/drivers/net/i40e/i40e_rxtx_vec_altivec.c
index 2dfa04599c..ad91b5cb60 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_altivec.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_altivec.c
@@ -410,6 +410,10 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 			split_packet += RTE_I40E_DESCS_PER_LOOP;
 
 			/* zero-out next pointers */
+			rx_pkts[pos]->nb_segs = 1;
+			rx_pkts[pos + 1]->nb_segs = 1;
+			rx_pkts[pos + 2]->nb_segs = 1;
+			rx_pkts[pos + 3]->nb_segs = 1;
 			rx_pkts[pos]->next = NULL;
 			rx_pkts[pos + 1]->next = NULL;
 			rx_pkts[pos + 2]->next = NULL;
diff --git a/drivers/net/i40e/i40e_rxtx_vec_common.h b/drivers/net/i40e/i40e_rxtx_vec_common.h
index fe1a6ec75e..d5799f5242 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_common.h
+++ b/drivers/net/i40e/i40e_rxtx_vec_common.h
@@ -27,6 +27,7 @@ reassemble_packets(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_bufs,
 	for (buf_idx = 0, pkt_idx = 0; buf_idx < nb_bufs; buf_idx++) {
 		if (end != NULL) {
 			/* processing a split packet */
+			end->nb_segs = 2;
 			end->next = rx_bufs[buf_idx];
 			rx_bufs[buf_idx]->data_len += rxq->crc_len;
 
@@ -52,6 +53,7 @@ reassemble_packets(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_bufs,
 						secondlast = secondlast->next;
 					secondlast->data_len -= (rxq->crc_len -
 							end->data_len);
+					secondlast->nb_segs = 1;
 					secondlast->next = NULL;
 					rte_pktmbuf_free_seg(end);
 				}
diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c b/drivers/net/i40e/i40e_rxtx_vec_neon.c
index 12e6f1cbcb..3199b0f8cf 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
@@ -541,6 +541,10 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *__rte_restrict rxq,
 			split_packet += RTE_I40E_DESCS_PER_LOOP;
 
 			/* zero-out next pointers */
+			rx_pkts[pos]->nb_segs = 1;
+			rx_pkts[pos + 1]->nb_segs = 1;
+			rx_pkts[pos + 2]->nb_segs = 1;
+			rx_pkts[pos + 3]->nb_segs = 1;
 			rx_pkts[pos]->next = NULL;
 			rx_pkts[pos + 1]->next = NULL;
 			rx_pkts[pos + 2]->next = NULL;
diff --git a/lib/mbuf/rte_mbuf.c b/lib/mbuf/rte_mbuf.c
index cfd8062f1e..6a8153b443 100644
--- a/lib/mbuf/rte_mbuf.c
+++ b/lib/mbuf/rte_mbuf.c
@@ -123,10 +123,10 @@ rte_pktmbuf_free_pinned_extmem(void *addr, void *opaque)
 
 	rte_mbuf_ext_refcnt_set(m->shinfo, 1);
 	m->ol_flags = RTE_MBUF_F_EXTERNAL;
-	if (m->next != NULL)
-		m->next = NULL;
-	if (m->nb_segs != 1)
+	if (m->nb_segs != 1) {
 		m->nb_segs = 1;
+		m->next = NULL;
+	}
 	rte_mbuf_raw_free(m);
 }
 
@@ -427,7 +427,7 @@ int rte_mbuf_check(const struct rte_mbuf *m, int is_header,
 		}
 		nb_segs -= 1;
 		pkt_len -= m->data_len;
-	} while ((m = m->next) != NULL);
+	} while ((m = ((m->nb_segs == 1) ? NULL : m->next)) != NULL);
 
 	if (nb_segs) {
 		*reason = "bad nb_segs";
@@ -495,7 +495,7 @@ void rte_pktmbuf_free_bulk(struct rte_mbuf **mbufs, unsigned int count)
 		__rte_mbuf_sanity_check(m, 1);
 
 		do {
-			m_next = m->next;
+			m_next = (m->nb_segs == 1) ? NULL : m->next;
 			__rte_pktmbuf_free_seg_via_array(m,
 					pending, &nb_pending,
 					RTE_PKTMBUF_FREE_PENDING_SZ);
@@ -511,7 +511,7 @@ void rte_pktmbuf_free_bulk(struct rte_mbuf **mbufs, unsigned int count)
 struct rte_mbuf *
 rte_pktmbuf_clone(struct rte_mbuf *md, struct rte_mempool *mp)
 {
-	struct rte_mbuf *mc, *mi, **prev;
+	struct rte_mbuf *mc, *mi, *prev;
 	uint32_t pktlen;
 	uint16_t nseg;
 
@@ -520,19 +520,21 @@ rte_pktmbuf_clone(struct rte_mbuf *md, struct rte_mempool *mp)
 		return NULL;
 
 	mi = mc;
-	prev = &mi->next;
+	prev = mi;
 	pktlen = md->pkt_len;
 	nseg = 0;
 
 	do {
 		nseg++;
 		rte_pktmbuf_attach(mi, md);
-		*prev = mi;
-		prev = &mi->next;
-	} while ((md = md->next) != NULL &&
+		prev->nb_segs = 2;
+		prev->next = mi;
+		prev = mi;
+	} while ((md = ((md->nb_segs == 1) ? NULL : md->next)) != NULL &&
 	    (mi = rte_pktmbuf_alloc(mp)) != NULL);
 
-	*prev = NULL;
+	prev->nb_segs = 1;
+	prev->next = NULL;
 	mc->nb_segs = nseg;
 	mc->pkt_len = pktlen;
 
@@ -565,9 +567,9 @@ __rte_pktmbuf_linearize(struct rte_mbuf *mbuf)
 	mbuf->data_len = (uint16_t)(mbuf->pkt_len);
 
 	/* Append data from next segments to the first one */
-	m = mbuf->next;
+	m = (mbuf->nb_segs == 1) ? NULL : mbuf->next;
 	while (m != NULL) {
-		m_next = m->next;
+		m_next = (m->nb_segs == 1) ? NULL : m->next;
 
 		seg_len = rte_pktmbuf_data_len(m);
 		rte_memcpy(buffer, rte_pktmbuf_mtod(m, char *), seg_len);
@@ -589,7 +591,7 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 		 uint32_t off, uint32_t len)
 {
 	const struct rte_mbuf *seg = m;
-	struct rte_mbuf *mc, *m_last, **prev;
+	struct rte_mbuf *mc, *m_last, *prev;
 
 	/* garbage in check */
 	__rte_mbuf_sanity_check(m, 1);
@@ -611,7 +613,7 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 	/* copied mbuf is not indirect or external */
 	mc->ol_flags = m->ol_flags & ~(RTE_MBUF_F_INDIRECT|RTE_MBUF_F_EXTERNAL);
 
-	prev = &mc->next;
+	prev = mc;
 	m_last = mc;
 	while (len > 0) {
 		uint32_t copy_len;
@@ -629,9 +631,10 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 				rte_pktmbuf_free(mc);
 				return NULL;
 			}
+			prev->nb_segs = 2;
+			prev->next = m_last;
 			++mc->nb_segs;
-			*prev = m_last;
-			prev = &m_last->next;
+			prev = m_last;
 		}
 
 		/*
@@ -697,7 +700,7 @@ rte_pktmbuf_dump(FILE *f, const struct rte_mbuf *m, unsigned dump_len)
 		if (len != 0)
 			rte_hexdump(f, NULL, rte_pktmbuf_mtod(m, void *), len);
 		dump_len -= len;
-		m = m->next;
+		m = (m->nb_segs == 1) ? NULL : m->next;
 		nb_segs --;
 	}
 }
diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
index 3a82eb136d..ce4c2f5f66 100644
--- a/lib/mbuf/rte_mbuf.h
+++ b/lib/mbuf/rte_mbuf.h
@@ -1353,10 +1353,10 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 				return NULL;
 		}
 
-		if (m->next != NULL)
-			m->next = NULL;
-		if (m->nb_segs != 1)
+		if (m->nb_segs != 1) {
 			m->nb_segs = 1;
+			m->next = NULL;
+		}
 
 		return m;
 
@@ -1370,10 +1370,10 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 				return NULL;
 		}
 
-		if (m->next != NULL)
-			m->next = NULL;
-		if (m->nb_segs != 1)
+		if (m->nb_segs != 1) {
 			m->nb_segs = 1;
+			m->next = NULL;
+		}
 		rte_mbuf_refcnt_set(m, 1);
 
 		return m;
@@ -1415,7 +1415,7 @@ static inline void rte_pktmbuf_free(struct rte_mbuf *m)
 		__rte_mbuf_sanity_check(m, 1);
 
 	while (m != NULL) {
-		m_next = m->next;
+		m_next = (m->nb_segs == 1) ? NULL : m->next;
 		rte_pktmbuf_free_seg(m);
 		m = m_next;
 	}
@@ -1497,7 +1497,7 @@ static inline void rte_pktmbuf_refcnt_update(struct rte_mbuf *m, int16_t v)
 
 	do {
 		rte_mbuf_refcnt_update(m, v);
-	} while ((m = m->next) != NULL);
+	} while ((m = ((m->nb_segs == 1) ? NULL : m->next)) != NULL);
 }
 
 /**
@@ -1540,7 +1540,7 @@ static inline uint16_t rte_pktmbuf_tailroom(const struct rte_mbuf *m)
 static inline struct rte_mbuf *rte_pktmbuf_lastseg(struct rte_mbuf *m)
 {
 	__rte_mbuf_sanity_check(m, 1);
-	while (m->next != NULL)
+	while (m->nb_segs != 1)
 		m = m->next;
 	return m;
 }
@@ -1765,6 +1765,7 @@ static inline int rte_pktmbuf_chain(struct rte_mbuf *head, struct rte_mbuf *tail
 
 	/* Chain 'tail' onto the old tail */
 	cur_tail = rte_pktmbuf_lastseg(head);
+	cur_tail->nb_segs = 2;
 	cur_tail->next = tail;
 
 	/* accumulate number of segments and total length.
diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
index a30e1e0eaf..c0c3b45024 100644
--- a/lib/mbuf/rte_mbuf_core.h
+++ b/lib/mbuf/rte_mbuf_core.h
@@ -594,25 +594,6 @@ struct rte_mbuf {
 
 	uint16_t buf_len;         /**< Length of segment buffer. */
 
-	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
-
-	/* second cache line - fields only used in slow path or on TX */
-	RTE_MARKER cacheline1 __rte_cache_min_aligned;
-
-#if RTE_IOVA_AS_PA
-	/**
-	 * Next segment of scattered packet. Must be NULL in the last
-	 * segment or in case of non-segmented packet.
-	 */
-	struct rte_mbuf *next;
-#else
-	/**
-	 * Reserved for dynamic fields
-	 * when the next pointer is in first cache line (i.e. RTE_IOVA_AS_PA is 0).
-	 */
-	uint64_t dynfield2;
-#endif
-
 	/* fields to support TX offloads */
 	RTE_STD_C11
 	union {
@@ -651,6 +632,25 @@ struct rte_mbuf {
 		};
 	};
 
+	/* second cache line - fields only used in slow path or on TX */
+	RTE_MARKER cacheline1 __rte_cache_min_aligned;
+
+#if RTE_IOVA_AS_PA
+	/**
+	 * Next segment of scattered packet. Must be NULL in the last
+	 * segment or in case of non-segmented packet.
+	 */
+	struct rte_mbuf *next;
+#else
+	/**
+	 * Reserved for dynamic fields
+	 * when the next pointer is in first cache line (i.e. RTE_IOVA_AS_PA is 0).
+	 */
+	uint64_t dynfield2;
+#endif
+
+	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
+
 	/** Shared data for external buffer attached to mbuf. See
 	 * rte_pktmbuf_attach_extbuf().
 	 */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v5] mbuf perf test, please ignore
  2022-12-03 17:13 mbuf performance optimization Morten Brørup
                   ` (2 preceding siblings ...)
  2022-12-04 12:33 ` [PATCH v4] " Morten Brørup
@ 2022-12-04 12:54 ` Morten Brørup
  2022-12-06  9:20 ` Morten Brørup
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Morten Brørup @ 2022-12-04 12:54 UTC (permalink / raw)
  To: dev; +Cc: Morten Brørup

Playing around with the mbuf structure, trying to reduce the use of the
second cache line in some common scenarios.

v5:
* Fix rte_pktmbuf_chain() for the case where the head is a single segment.
v4:
* Use tabs, not spaces.
* Fix copy-paste bug in linearize.
v3:
* Make 'next' depend on 'nb_segs' > 1.
* Implement new interpretation of 'nb_segs' in i40e PMD.
v2:
* Remove BUILD_BUG_ON in cnxk PMD.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 drivers/net/cnxk/cn10k_ethdev.c          |  2 ++
 drivers/net/cnxk/cn9k_ethdev.c           |  2 ++
 drivers/net/i40e/i40e_rxtx.c             |  9 ++++--
 drivers/net/i40e/i40e_rxtx_vec_altivec.c |  4 +++
 drivers/net/i40e/i40e_rxtx_vec_common.h  |  2 ++
 drivers/net/i40e/i40e_rxtx_vec_neon.c    |  4 +++
 lib/mbuf/rte_mbuf.c                      | 39 +++++++++++++-----------
 lib/mbuf/rte_mbuf.h                      | 24 ++++++++-------
 lib/mbuf/rte_mbuf_core.h                 | 38 +++++++++++------------
 9 files changed, 73 insertions(+), 51 deletions(-)

diff --git a/drivers/net/cnxk/cn10k_ethdev.c b/drivers/net/cnxk/cn10k_ethdev.c
index 4658713591..9f6086efe6 100644
--- a/drivers/net/cnxk/cn10k_ethdev.c
+++ b/drivers/net/cnxk/cn10k_ethdev.c
@@ -72,8 +72,10 @@ nix_tx_offload_flags(struct rte_eth_dev *eth_dev)
 			 offsetof(struct rte_mbuf, buf_addr) + 24);
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
 			 offsetof(struct rte_mbuf, ol_flags) + 12);
+/*
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, tx_offload) !=
 			 offsetof(struct rte_mbuf, pool) + 2 * sizeof(void *));
+*/
 
 	if (conf & RTE_ETH_TX_OFFLOAD_VLAN_INSERT ||
 	    conf & RTE_ETH_TX_OFFLOAD_QINQ_INSERT)
diff --git a/drivers/net/cnxk/cn9k_ethdev.c b/drivers/net/cnxk/cn9k_ethdev.c
index 3b702d9696..3e9161ca79 100644
--- a/drivers/net/cnxk/cn9k_ethdev.c
+++ b/drivers/net/cnxk/cn9k_ethdev.c
@@ -72,8 +72,10 @@ nix_tx_offload_flags(struct rte_eth_dev *eth_dev)
 			 offsetof(struct rte_mbuf, buf_addr) + 24);
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
 			 offsetof(struct rte_mbuf, ol_flags) + 12);
+/*
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, tx_offload) !=
 			 offsetof(struct rte_mbuf, pool) + 2 * sizeof(void *));
+*/
 
 	if (conf & RTE_ETH_TX_OFFLOAD_VLAN_INSERT ||
 	    conf & RTE_ETH_TX_OFFLOAD_QINQ_INSERT)
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 788ffb51c2..a08afe0a13 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -920,8 +920,9 @@ i40e_recv_scattered_pkts(void *rx_queue,
 			first_seg->pkt_len =
 				(uint16_t)(first_seg->pkt_len +
 						rx_packet_len);
-			first_seg->nb_segs++;
+			last_seg->nb_segs = 2;
 			last_seg->next = rxm;
+			first_seg->nb_segs++;
 		}
 
 		/**
@@ -944,6 +945,7 @@ i40e_recv_scattered_pkts(void *rx_queue,
 		 *  the length of that CRC part from the data length of the
 		 *  previous mbuf.
 		 */
+		rxm->nb_segs = 1;
 		rxm->next = NULL;
 		if (unlikely(rxq->crc_len > 0)) {
 			first_seg->pkt_len -= RTE_ETHER_CRC_LEN;
@@ -953,6 +955,7 @@ i40e_recv_scattered_pkts(void *rx_queue,
 				last_seg->data_len =
 					(uint16_t)(last_seg->data_len -
 					(RTE_ETHER_CRC_LEN - rx_packet_len));
+				last_seg->nb_segs = 1;
 				last_seg->next = NULL;
 			} else
 				rxm->data_len = (uint16_t)(rx_packet_len -
@@ -1065,7 +1068,7 @@ i40e_calc_pkt_desc(struct rte_mbuf *tx_pkt)
 
 	while (txd != NULL) {
 		count += DIV_ROUND_UP(txd->data_len, I40E_MAX_DATA_PER_TXD);
-		txd = txd->next;
+		txd = (txd->nb_segs == 1) ? NULL : txd->next;
 	}
 
 	return count;
@@ -1282,7 +1285,7 @@ i40e_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 			txe->last_id = tx_last;
 			tx_id = txe->next_id;
 			txe = txn;
-			m_seg = m_seg->next;
+			m_seg = (m_seg->nb_segs == 1) ? NULL : m_seg->next;
 		} while (m_seg != NULL);
 
 		/* The last packet data descriptor needs End Of Packet (EOP) */
diff --git a/drivers/net/i40e/i40e_rxtx_vec_altivec.c b/drivers/net/i40e/i40e_rxtx_vec_altivec.c
index 2dfa04599c..ad91b5cb60 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_altivec.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_altivec.c
@@ -410,6 +410,10 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 			split_packet += RTE_I40E_DESCS_PER_LOOP;
 
 			/* zero-out next pointers */
+			rx_pkts[pos]->nb_segs = 1;
+			rx_pkts[pos + 1]->nb_segs = 1;
+			rx_pkts[pos + 2]->nb_segs = 1;
+			rx_pkts[pos + 3]->nb_segs = 1;
 			rx_pkts[pos]->next = NULL;
 			rx_pkts[pos + 1]->next = NULL;
 			rx_pkts[pos + 2]->next = NULL;
diff --git a/drivers/net/i40e/i40e_rxtx_vec_common.h b/drivers/net/i40e/i40e_rxtx_vec_common.h
index fe1a6ec75e..d5799f5242 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_common.h
+++ b/drivers/net/i40e/i40e_rxtx_vec_common.h
@@ -27,6 +27,7 @@ reassemble_packets(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_bufs,
 	for (buf_idx = 0, pkt_idx = 0; buf_idx < nb_bufs; buf_idx++) {
 		if (end != NULL) {
 			/* processing a split packet */
+			end->nb_segs = 2;
 			end->next = rx_bufs[buf_idx];
 			rx_bufs[buf_idx]->data_len += rxq->crc_len;
 
@@ -52,6 +53,7 @@ reassemble_packets(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_bufs,
 						secondlast = secondlast->next;
 					secondlast->data_len -= (rxq->crc_len -
 							end->data_len);
+					secondlast->nb_segs = 1;
 					secondlast->next = NULL;
 					rte_pktmbuf_free_seg(end);
 				}
diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c b/drivers/net/i40e/i40e_rxtx_vec_neon.c
index 12e6f1cbcb..3199b0f8cf 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
@@ -541,6 +541,10 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *__rte_restrict rxq,
 			split_packet += RTE_I40E_DESCS_PER_LOOP;
 
 			/* zero-out next pointers */
+			rx_pkts[pos]->nb_segs = 1;
+			rx_pkts[pos + 1]->nb_segs = 1;
+			rx_pkts[pos + 2]->nb_segs = 1;
+			rx_pkts[pos + 3]->nb_segs = 1;
 			rx_pkts[pos]->next = NULL;
 			rx_pkts[pos + 1]->next = NULL;
 			rx_pkts[pos + 2]->next = NULL;
diff --git a/lib/mbuf/rte_mbuf.c b/lib/mbuf/rte_mbuf.c
index cfd8062f1e..6a8153b443 100644
--- a/lib/mbuf/rte_mbuf.c
+++ b/lib/mbuf/rte_mbuf.c
@@ -123,10 +123,10 @@ rte_pktmbuf_free_pinned_extmem(void *addr, void *opaque)
 
 	rte_mbuf_ext_refcnt_set(m->shinfo, 1);
 	m->ol_flags = RTE_MBUF_F_EXTERNAL;
-	if (m->next != NULL)
-		m->next = NULL;
-	if (m->nb_segs != 1)
+	if (m->nb_segs != 1) {
 		m->nb_segs = 1;
+		m->next = NULL;
+	}
 	rte_mbuf_raw_free(m);
 }
 
@@ -427,7 +427,7 @@ int rte_mbuf_check(const struct rte_mbuf *m, int is_header,
 		}
 		nb_segs -= 1;
 		pkt_len -= m->data_len;
-	} while ((m = m->next) != NULL);
+	} while ((m = ((m->nb_segs == 1) ? NULL : m->next)) != NULL);
 
 	if (nb_segs) {
 		*reason = "bad nb_segs";
@@ -495,7 +495,7 @@ void rte_pktmbuf_free_bulk(struct rte_mbuf **mbufs, unsigned int count)
 		__rte_mbuf_sanity_check(m, 1);
 
 		do {
-			m_next = m->next;
+			m_next = (m->nb_segs == 1) ? NULL : m->next;
 			__rte_pktmbuf_free_seg_via_array(m,
 					pending, &nb_pending,
 					RTE_PKTMBUF_FREE_PENDING_SZ);
@@ -511,7 +511,7 @@ void rte_pktmbuf_free_bulk(struct rte_mbuf **mbufs, unsigned int count)
 struct rte_mbuf *
 rte_pktmbuf_clone(struct rte_mbuf *md, struct rte_mempool *mp)
 {
-	struct rte_mbuf *mc, *mi, **prev;
+	struct rte_mbuf *mc, *mi, *prev;
 	uint32_t pktlen;
 	uint16_t nseg;
 
@@ -520,19 +520,21 @@ rte_pktmbuf_clone(struct rte_mbuf *md, struct rte_mempool *mp)
 		return NULL;
 
 	mi = mc;
-	prev = &mi->next;
+	prev = mi;
 	pktlen = md->pkt_len;
 	nseg = 0;
 
 	do {
 		nseg++;
 		rte_pktmbuf_attach(mi, md);
-		*prev = mi;
-		prev = &mi->next;
-	} while ((md = md->next) != NULL &&
+		prev->nb_segs = 2;
+		prev->next = mi;
+		prev = mi;
+	} while ((md = ((md->nb_segs == 1) ? NULL : md->next)) != NULL &&
 	    (mi = rte_pktmbuf_alloc(mp)) != NULL);
 
-	*prev = NULL;
+	prev->nb_segs = 1;
+	prev->next = NULL;
 	mc->nb_segs = nseg;
 	mc->pkt_len = pktlen;
 
@@ -565,9 +567,9 @@ __rte_pktmbuf_linearize(struct rte_mbuf *mbuf)
 	mbuf->data_len = (uint16_t)(mbuf->pkt_len);
 
 	/* Append data from next segments to the first one */
-	m = mbuf->next;
+	m = (mbuf->nb_segs == 1) ? NULL : mbuf->next;
 	while (m != NULL) {
-		m_next = m->next;
+		m_next = (m->nb_segs == 1) ? NULL : m->next;
 
 		seg_len = rte_pktmbuf_data_len(m);
 		rte_memcpy(buffer, rte_pktmbuf_mtod(m, char *), seg_len);
@@ -589,7 +591,7 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 		 uint32_t off, uint32_t len)
 {
 	const struct rte_mbuf *seg = m;
-	struct rte_mbuf *mc, *m_last, **prev;
+	struct rte_mbuf *mc, *m_last, *prev;
 
 	/* garbage in check */
 	__rte_mbuf_sanity_check(m, 1);
@@ -611,7 +613,7 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 	/* copied mbuf is not indirect or external */
 	mc->ol_flags = m->ol_flags & ~(RTE_MBUF_F_INDIRECT|RTE_MBUF_F_EXTERNAL);
 
-	prev = &mc->next;
+	prev = mc;
 	m_last = mc;
 	while (len > 0) {
 		uint32_t copy_len;
@@ -629,9 +631,10 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 				rte_pktmbuf_free(mc);
 				return NULL;
 			}
+			prev->nb_segs = 2;
+			prev->next = m_last;
 			++mc->nb_segs;
-			*prev = m_last;
-			prev = &m_last->next;
+			prev = m_last;
 		}
 
 		/*
@@ -697,7 +700,7 @@ rte_pktmbuf_dump(FILE *f, const struct rte_mbuf *m, unsigned dump_len)
 		if (len != 0)
 			rte_hexdump(f, NULL, rte_pktmbuf_mtod(m, void *), len);
 		dump_len -= len;
-		m = m->next;
+		m = (m->nb_segs == 1) ? NULL : m->next;
 		nb_segs --;
 	}
 }
diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
index 3a82eb136d..6d08c4ebfd 100644
--- a/lib/mbuf/rte_mbuf.h
+++ b/lib/mbuf/rte_mbuf.h
@@ -1353,10 +1353,10 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 				return NULL;
 		}
 
-		if (m->next != NULL)
-			m->next = NULL;
-		if (m->nb_segs != 1)
+		if (m->nb_segs != 1) {
 			m->nb_segs = 1;
+			m->next = NULL;
+		}
 
 		return m;
 
@@ -1370,10 +1370,10 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 				return NULL;
 		}
 
-		if (m->next != NULL)
-			m->next = NULL;
-		if (m->nb_segs != 1)
+		if (m->nb_segs != 1) {
 			m->nb_segs = 1;
+			m->next = NULL;
+		}
 		rte_mbuf_refcnt_set(m, 1);
 
 		return m;
@@ -1415,7 +1415,7 @@ static inline void rte_pktmbuf_free(struct rte_mbuf *m)
 		__rte_mbuf_sanity_check(m, 1);
 
 	while (m != NULL) {
-		m_next = m->next;
+		m_next = (m->nb_segs == 1) ? NULL : m->next;
 		rte_pktmbuf_free_seg(m);
 		m = m_next;
 	}
@@ -1497,7 +1497,7 @@ static inline void rte_pktmbuf_refcnt_update(struct rte_mbuf *m, int16_t v)
 
 	do {
 		rte_mbuf_refcnt_update(m, v);
-	} while ((m = m->next) != NULL);
+	} while ((m = ((m->nb_segs == 1) ? NULL : m->next)) != NULL);
 }
 
 /**
@@ -1540,7 +1540,7 @@ static inline uint16_t rte_pktmbuf_tailroom(const struct rte_mbuf *m)
 static inline struct rte_mbuf *rte_pktmbuf_lastseg(struct rte_mbuf *m)
 {
 	__rte_mbuf_sanity_check(m, 1);
-	while (m->next != NULL)
+	while (m->nb_segs != 1)
 		m = m->next;
 	return m;
 }
@@ -1758,20 +1758,22 @@ static inline const void *rte_pktmbuf_read(const struct rte_mbuf *m,
 static inline int rte_pktmbuf_chain(struct rte_mbuf *head, struct rte_mbuf *tail)
 {
 	struct rte_mbuf *cur_tail;
+	const unsigned int nb_segs = head->nb_segs + tail->nb_segs;
 
 	/* Check for number-of-segments-overflow */
-	if (head->nb_segs + tail->nb_segs > RTE_MBUF_MAX_NB_SEGS)
+	if (nb_segs > RTE_MBUF_MAX_NB_SEGS)
 		return -EOVERFLOW;
 
 	/* Chain 'tail' onto the old tail */
 	cur_tail = rte_pktmbuf_lastseg(head);
+	cur_tail->nb_segs = 2;
 	cur_tail->next = tail;
 
 	/* accumulate number of segments and total length.
 	 * NB: elaborating the addition like this instead of using
 	 *     -= allows us to ensure the result type is uint16_t
 	 *     avoiding compiler warnings on gcc 8.1 at least */
-	head->nb_segs = (uint16_t)(head->nb_segs + tail->nb_segs);
+	head->nb_segs = (uint16_t)nb_segs;
 	head->pkt_len += tail->pkt_len;
 
 	/* pkt_len is only set in the head */
diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
index a30e1e0eaf..c0c3b45024 100644
--- a/lib/mbuf/rte_mbuf_core.h
+++ b/lib/mbuf/rte_mbuf_core.h
@@ -594,25 +594,6 @@ struct rte_mbuf {
 
 	uint16_t buf_len;         /**< Length of segment buffer. */
 
-	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
-
-	/* second cache line - fields only used in slow path or on TX */
-	RTE_MARKER cacheline1 __rte_cache_min_aligned;
-
-#if RTE_IOVA_AS_PA
-	/**
-	 * Next segment of scattered packet. Must be NULL in the last
-	 * segment or in case of non-segmented packet.
-	 */
-	struct rte_mbuf *next;
-#else
-	/**
-	 * Reserved for dynamic fields
-	 * when the next pointer is in first cache line (i.e. RTE_IOVA_AS_PA is 0).
-	 */
-	uint64_t dynfield2;
-#endif
-
 	/* fields to support TX offloads */
 	RTE_STD_C11
 	union {
@@ -651,6 +632,25 @@ struct rte_mbuf {
 		};
 	};
 
+	/* second cache line - fields only used in slow path or on TX */
+	RTE_MARKER cacheline1 __rte_cache_min_aligned;
+
+#if RTE_IOVA_AS_PA
+	/**
+	 * Next segment of scattered packet. Must be NULL in the last
+	 * segment or in case of non-segmented packet.
+	 */
+	struct rte_mbuf *next;
+#else
+	/**
+	 * Reserved for dynamic fields
+	 * when the next pointer is in first cache line (i.e. RTE_IOVA_AS_PA is 0).
+	 */
+	uint64_t dynfield2;
+#endif
+
+	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
+
 	/** Shared data for external buffer attached to mbuf. See
 	 * rte_pktmbuf_attach_extbuf().
 	 */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v6] mbuf perf test, please ignore
  2022-12-03 17:13 mbuf performance optimization Morten Brørup
                   ` (4 preceding siblings ...)
  2022-12-06  9:20 ` Morten Brørup
@ 2022-12-06 11:03 ` Morten Brørup
  2022-12-06 12:12 ` [PATCH v7] " Morten Brørup
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Morten Brørup @ 2022-12-06 11:03 UTC (permalink / raw)
  To: dev; +Cc: Morten Brørup

Playing around with the mbuf structure, trying to reduce the use of the
second cache line in some common scenarios.

v6:
* Update the rte_kni_mbuf to match the rte_mbuf modifications.
v5:
* Fix rte_pktmbuf_chain() for the case where the head is a single segment.
v4:
* Use tabs, not spaces.
* Fix copy-paste bug in linearize.
v3:
* Make 'next' depend on 'nb_segs' > 1.
* Implement new interpretation of 'nb_segs' in i40e PMD.
v2:
* Remove BUILD_BUG_ON in cnxk PMD.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 drivers/net/cnxk/cn10k_ethdev.c          |  2 ++
 drivers/net/cnxk/cn9k_ethdev.c           |  2 ++
 drivers/net/i40e/i40e_rxtx.c             |  9 ++++--
 drivers/net/i40e/i40e_rxtx_vec_altivec.c |  4 +++
 drivers/net/i40e/i40e_rxtx_vec_common.h  |  2 ++
 drivers/net/i40e/i40e_rxtx_vec_neon.c    |  4 +++
 lib/kni/rte_kni_common.h                 | 12 ++++++--
 lib/mbuf/rte_mbuf.c                      | 39 +++++++++++++-----------
 lib/mbuf/rte_mbuf.h                      | 24 ++++++++-------
 lib/mbuf/rte_mbuf_core.h                 | 38 +++++++++++------------
 10 files changed, 83 insertions(+), 53 deletions(-)

diff --git a/drivers/net/cnxk/cn10k_ethdev.c b/drivers/net/cnxk/cn10k_ethdev.c
index 4658713591..9f6086efe6 100644
--- a/drivers/net/cnxk/cn10k_ethdev.c
+++ b/drivers/net/cnxk/cn10k_ethdev.c
@@ -72,8 +72,10 @@ nix_tx_offload_flags(struct rte_eth_dev *eth_dev)
 			 offsetof(struct rte_mbuf, buf_addr) + 24);
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
 			 offsetof(struct rte_mbuf, ol_flags) + 12);
+/*
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, tx_offload) !=
 			 offsetof(struct rte_mbuf, pool) + 2 * sizeof(void *));
+*/
 
 	if (conf & RTE_ETH_TX_OFFLOAD_VLAN_INSERT ||
 	    conf & RTE_ETH_TX_OFFLOAD_QINQ_INSERT)
diff --git a/drivers/net/cnxk/cn9k_ethdev.c b/drivers/net/cnxk/cn9k_ethdev.c
index 3b702d9696..3e9161ca79 100644
--- a/drivers/net/cnxk/cn9k_ethdev.c
+++ b/drivers/net/cnxk/cn9k_ethdev.c
@@ -72,8 +72,10 @@ nix_tx_offload_flags(struct rte_eth_dev *eth_dev)
 			 offsetof(struct rte_mbuf, buf_addr) + 24);
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
 			 offsetof(struct rte_mbuf, ol_flags) + 12);
+/*
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, tx_offload) !=
 			 offsetof(struct rte_mbuf, pool) + 2 * sizeof(void *));
+*/
 
 	if (conf & RTE_ETH_TX_OFFLOAD_VLAN_INSERT ||
 	    conf & RTE_ETH_TX_OFFLOAD_QINQ_INSERT)
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 788ffb51c2..a08afe0a13 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -920,8 +920,9 @@ i40e_recv_scattered_pkts(void *rx_queue,
 			first_seg->pkt_len =
 				(uint16_t)(first_seg->pkt_len +
 						rx_packet_len);
-			first_seg->nb_segs++;
+			last_seg->nb_segs = 2;
 			last_seg->next = rxm;
+			first_seg->nb_segs++;
 		}
 
 		/**
@@ -944,6 +945,7 @@ i40e_recv_scattered_pkts(void *rx_queue,
 		 *  the length of that CRC part from the data length of the
 		 *  previous mbuf.
 		 */
+		rxm->nb_segs = 1;
 		rxm->next = NULL;
 		if (unlikely(rxq->crc_len > 0)) {
 			first_seg->pkt_len -= RTE_ETHER_CRC_LEN;
@@ -953,6 +955,7 @@ i40e_recv_scattered_pkts(void *rx_queue,
 				last_seg->data_len =
 					(uint16_t)(last_seg->data_len -
 					(RTE_ETHER_CRC_LEN - rx_packet_len));
+				last_seg->nb_segs = 1;
 				last_seg->next = NULL;
 			} else
 				rxm->data_len = (uint16_t)(rx_packet_len -
@@ -1065,7 +1068,7 @@ i40e_calc_pkt_desc(struct rte_mbuf *tx_pkt)
 
 	while (txd != NULL) {
 		count += DIV_ROUND_UP(txd->data_len, I40E_MAX_DATA_PER_TXD);
-		txd = txd->next;
+		txd = (txd->nb_segs == 1) ? NULL : txd->next;
 	}
 
 	return count;
@@ -1282,7 +1285,7 @@ i40e_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 			txe->last_id = tx_last;
 			tx_id = txe->next_id;
 			txe = txn;
-			m_seg = m_seg->next;
+			m_seg = (m_seg->nb_segs == 1) ? NULL : m_seg->next;
 		} while (m_seg != NULL);
 
 		/* The last packet data descriptor needs End Of Packet (EOP) */
diff --git a/drivers/net/i40e/i40e_rxtx_vec_altivec.c b/drivers/net/i40e/i40e_rxtx_vec_altivec.c
index 2dfa04599c..ad91b5cb60 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_altivec.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_altivec.c
@@ -410,6 +410,10 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 			split_packet += RTE_I40E_DESCS_PER_LOOP;
 
 			/* zero-out next pointers */
+			rx_pkts[pos]->nb_segs = 1;
+			rx_pkts[pos + 1]->nb_segs = 1;
+			rx_pkts[pos + 2]->nb_segs = 1;
+			rx_pkts[pos + 3]->nb_segs = 1;
 			rx_pkts[pos]->next = NULL;
 			rx_pkts[pos + 1]->next = NULL;
 			rx_pkts[pos + 2]->next = NULL;
diff --git a/drivers/net/i40e/i40e_rxtx_vec_common.h b/drivers/net/i40e/i40e_rxtx_vec_common.h
index fe1a6ec75e..d5799f5242 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_common.h
+++ b/drivers/net/i40e/i40e_rxtx_vec_common.h
@@ -27,6 +27,7 @@ reassemble_packets(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_bufs,
 	for (buf_idx = 0, pkt_idx = 0; buf_idx < nb_bufs; buf_idx++) {
 		if (end != NULL) {
 			/* processing a split packet */
+			end->nb_segs = 2;
 			end->next = rx_bufs[buf_idx];
 			rx_bufs[buf_idx]->data_len += rxq->crc_len;
 
@@ -52,6 +53,7 @@ reassemble_packets(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_bufs,
 						secondlast = secondlast->next;
 					secondlast->data_len -= (rxq->crc_len -
 							end->data_len);
+					secondlast->nb_segs = 1;
 					secondlast->next = NULL;
 					rte_pktmbuf_free_seg(end);
 				}
diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c b/drivers/net/i40e/i40e_rxtx_vec_neon.c
index 12e6f1cbcb..3199b0f8cf 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
@@ -541,6 +541,10 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *__rte_restrict rxq,
 			split_packet += RTE_I40E_DESCS_PER_LOOP;
 
 			/* zero-out next pointers */
+			rx_pkts[pos]->nb_segs = 1;
+			rx_pkts[pos + 1]->nb_segs = 1;
+			rx_pkts[pos + 2]->nb_segs = 1;
+			rx_pkts[pos + 3]->nb_segs = 1;
 			rx_pkts[pos]->next = NULL;
 			rx_pkts[pos + 1]->next = NULL;
 			rx_pkts[pos + 2]->next = NULL;
diff --git a/lib/kni/rte_kni_common.h b/lib/kni/rte_kni_common.h
index 8d3ee0fa4f..01819ed6c6 100644
--- a/lib/kni/rte_kni_common.h
+++ b/lib/kni/rte_kni_common.h
@@ -80,7 +80,11 @@ struct rte_kni_fifo {
  */
 struct rte_kni_mbuf {
 	void *buf_addr __attribute__((__aligned__(RTE_CACHE_LINE_SIZE)));
+#if RTE_IOVA_AS_PA
 	uint64_t buf_iova;
+#else
+	void *next;             /**< Physical address of next mbuf in kernel. */
+#endif
 	uint16_t data_off;      /**< Start address of data in segment buffer. */
 	char pad1[2];
 	uint16_t nb_segs;       /**< Number of segments. */
@@ -89,12 +93,16 @@ struct rte_kni_mbuf {
 	char pad2[4];
 	uint32_t pkt_len;       /**< Total pkt len: sum of all segment data_len. */
 	uint16_t data_len;      /**< Amount of data in segment buffer. */
-	char pad3[14];
-	void *pool;
+	char pad3[22];
 
 	/* fields on second cache line */
 	__attribute__((__aligned__(RTE_CACHE_LINE_MIN_SIZE)))
+#if RTE_IOVA_AS_PA
 	void *next;             /**< Physical address of next mbuf in kernel. */
+#else
+	char pad5[8];
+#endif
+	void *pool;
 };
 
 /*
diff --git a/lib/mbuf/rte_mbuf.c b/lib/mbuf/rte_mbuf.c
index cfd8062f1e..6a8153b443 100644
--- a/lib/mbuf/rte_mbuf.c
+++ b/lib/mbuf/rte_mbuf.c
@@ -123,10 +123,10 @@ rte_pktmbuf_free_pinned_extmem(void *addr, void *opaque)
 
 	rte_mbuf_ext_refcnt_set(m->shinfo, 1);
 	m->ol_flags = RTE_MBUF_F_EXTERNAL;
-	if (m->next != NULL)
-		m->next = NULL;
-	if (m->nb_segs != 1)
+	if (m->nb_segs != 1) {
 		m->nb_segs = 1;
+		m->next = NULL;
+	}
 	rte_mbuf_raw_free(m);
 }
 
@@ -427,7 +427,7 @@ int rte_mbuf_check(const struct rte_mbuf *m, int is_header,
 		}
 		nb_segs -= 1;
 		pkt_len -= m->data_len;
-	} while ((m = m->next) != NULL);
+	} while ((m = ((m->nb_segs == 1) ? NULL : m->next)) != NULL);
 
 	if (nb_segs) {
 		*reason = "bad nb_segs";
@@ -495,7 +495,7 @@ void rte_pktmbuf_free_bulk(struct rte_mbuf **mbufs, unsigned int count)
 		__rte_mbuf_sanity_check(m, 1);
 
 		do {
-			m_next = m->next;
+			m_next = (m->nb_segs == 1) ? NULL : m->next;
 			__rte_pktmbuf_free_seg_via_array(m,
 					pending, &nb_pending,
 					RTE_PKTMBUF_FREE_PENDING_SZ);
@@ -511,7 +511,7 @@ void rte_pktmbuf_free_bulk(struct rte_mbuf **mbufs, unsigned int count)
 struct rte_mbuf *
 rte_pktmbuf_clone(struct rte_mbuf *md, struct rte_mempool *mp)
 {
-	struct rte_mbuf *mc, *mi, **prev;
+	struct rte_mbuf *mc, *mi, *prev;
 	uint32_t pktlen;
 	uint16_t nseg;
 
@@ -520,19 +520,21 @@ rte_pktmbuf_clone(struct rte_mbuf *md, struct rte_mempool *mp)
 		return NULL;
 
 	mi = mc;
-	prev = &mi->next;
+	prev = mi;
 	pktlen = md->pkt_len;
 	nseg = 0;
 
 	do {
 		nseg++;
 		rte_pktmbuf_attach(mi, md);
-		*prev = mi;
-		prev = &mi->next;
-	} while ((md = md->next) != NULL &&
+		prev->nb_segs = 2;
+		prev->next = mi;
+		prev = mi;
+	} while ((md = ((md->nb_segs == 1) ? NULL : md->next)) != NULL &&
 	    (mi = rte_pktmbuf_alloc(mp)) != NULL);
 
-	*prev = NULL;
+	prev->nb_segs = 1;
+	prev->next = NULL;
 	mc->nb_segs = nseg;
 	mc->pkt_len = pktlen;
 
@@ -565,9 +567,9 @@ __rte_pktmbuf_linearize(struct rte_mbuf *mbuf)
 	mbuf->data_len = (uint16_t)(mbuf->pkt_len);
 
 	/* Append data from next segments to the first one */
-	m = mbuf->next;
+	m = (mbuf->nb_segs == 1) ? NULL : mbuf->next;
 	while (m != NULL) {
-		m_next = m->next;
+		m_next = (m->nb_segs == 1) ? NULL : m->next;
 
 		seg_len = rte_pktmbuf_data_len(m);
 		rte_memcpy(buffer, rte_pktmbuf_mtod(m, char *), seg_len);
@@ -589,7 +591,7 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 		 uint32_t off, uint32_t len)
 {
 	const struct rte_mbuf *seg = m;
-	struct rte_mbuf *mc, *m_last, **prev;
+	struct rte_mbuf *mc, *m_last, *prev;
 
 	/* garbage in check */
 	__rte_mbuf_sanity_check(m, 1);
@@ -611,7 +613,7 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 	/* copied mbuf is not indirect or external */
 	mc->ol_flags = m->ol_flags & ~(RTE_MBUF_F_INDIRECT|RTE_MBUF_F_EXTERNAL);
 
-	prev = &mc->next;
+	prev = mc;
 	m_last = mc;
 	while (len > 0) {
 		uint32_t copy_len;
@@ -629,9 +631,10 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 				rte_pktmbuf_free(mc);
 				return NULL;
 			}
+			prev->nb_segs = 2;
+			prev->next = m_last;
 			++mc->nb_segs;
-			*prev = m_last;
-			prev = &m_last->next;
+			prev = m_last;
 		}
 
 		/*
@@ -697,7 +700,7 @@ rte_pktmbuf_dump(FILE *f, const struct rte_mbuf *m, unsigned dump_len)
 		if (len != 0)
 			rte_hexdump(f, NULL, rte_pktmbuf_mtod(m, void *), len);
 		dump_len -= len;
-		m = m->next;
+		m = (m->nb_segs == 1) ? NULL : m->next;
 		nb_segs --;
 	}
 }
diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
index 3a82eb136d..6d08c4ebfd 100644
--- a/lib/mbuf/rte_mbuf.h
+++ b/lib/mbuf/rte_mbuf.h
@@ -1353,10 +1353,10 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 				return NULL;
 		}
 
-		if (m->next != NULL)
-			m->next = NULL;
-		if (m->nb_segs != 1)
+		if (m->nb_segs != 1) {
 			m->nb_segs = 1;
+			m->next = NULL;
+		}
 
 		return m;
 
@@ -1370,10 +1370,10 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 				return NULL;
 		}
 
-		if (m->next != NULL)
-			m->next = NULL;
-		if (m->nb_segs != 1)
+		if (m->nb_segs != 1) {
 			m->nb_segs = 1;
+			m->next = NULL;
+		}
 		rte_mbuf_refcnt_set(m, 1);
 
 		return m;
@@ -1415,7 +1415,7 @@ static inline void rte_pktmbuf_free(struct rte_mbuf *m)
 		__rte_mbuf_sanity_check(m, 1);
 
 	while (m != NULL) {
-		m_next = m->next;
+		m_next = (m->nb_segs == 1) ? NULL : m->next;
 		rte_pktmbuf_free_seg(m);
 		m = m_next;
 	}
@@ -1497,7 +1497,7 @@ static inline void rte_pktmbuf_refcnt_update(struct rte_mbuf *m, int16_t v)
 
 	do {
 		rte_mbuf_refcnt_update(m, v);
-	} while ((m = m->next) != NULL);
+	} while ((m = ((m->nb_segs == 1) ? NULL : m->next)) != NULL);
 }
 
 /**
@@ -1540,7 +1540,7 @@ static inline uint16_t rte_pktmbuf_tailroom(const struct rte_mbuf *m)
 static inline struct rte_mbuf *rte_pktmbuf_lastseg(struct rte_mbuf *m)
 {
 	__rte_mbuf_sanity_check(m, 1);
-	while (m->next != NULL)
+	while (m->nb_segs != 1)
 		m = m->next;
 	return m;
 }
@@ -1758,20 +1758,22 @@ static inline const void *rte_pktmbuf_read(const struct rte_mbuf *m,
 static inline int rte_pktmbuf_chain(struct rte_mbuf *head, struct rte_mbuf *tail)
 {
 	struct rte_mbuf *cur_tail;
+	const unsigned int nb_segs = head->nb_segs + tail->nb_segs;
 
 	/* Check for number-of-segments-overflow */
-	if (head->nb_segs + tail->nb_segs > RTE_MBUF_MAX_NB_SEGS)
+	if (nb_segs > RTE_MBUF_MAX_NB_SEGS)
 		return -EOVERFLOW;
 
 	/* Chain 'tail' onto the old tail */
 	cur_tail = rte_pktmbuf_lastseg(head);
+	cur_tail->nb_segs = 2;
 	cur_tail->next = tail;
 
 	/* accumulate number of segments and total length.
 	 * NB: elaborating the addition like this instead of using
 	 *     += allows us to ensure the result type is uint16_t
 	 *     avoiding compiler warnings on gcc 8.1 at least */
-	head->nb_segs = (uint16_t)(head->nb_segs + tail->nb_segs);
+	head->nb_segs = (uint16_t)nb_segs;
 	head->pkt_len += tail->pkt_len;
 
 	/* pkt_len is only set in the head */
diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
index a30e1e0eaf..c0c3b45024 100644
--- a/lib/mbuf/rte_mbuf_core.h
+++ b/lib/mbuf/rte_mbuf_core.h
@@ -594,25 +594,6 @@ struct rte_mbuf {
 
 	uint16_t buf_len;         /**< Length of segment buffer. */
 
-	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
-
-	/* second cache line - fields only used in slow path or on TX */
-	RTE_MARKER cacheline1 __rte_cache_min_aligned;
-
-#if RTE_IOVA_AS_PA
-	/**
-	 * Next segment of scattered packet. Must be NULL in the last
-	 * segment or in case of non-segmented packet.
-	 */
-	struct rte_mbuf *next;
-#else
-	/**
-	 * Reserved for dynamic fields
-	 * when the next pointer is in first cache line (i.e. RTE_IOVA_AS_PA is 0).
-	 */
-	uint64_t dynfield2;
-#endif
-
 	/* fields to support TX offloads */
 	RTE_STD_C11
 	union {
@@ -651,6 +632,25 @@ struct rte_mbuf {
 		};
 	};
 
+	/* second cache line - fields only used in slow path or on TX */
+	RTE_MARKER cacheline1 __rte_cache_min_aligned;
+
+#if RTE_IOVA_AS_PA
+	/**
+	 * Next segment of scattered packet. Must be NULL in the last
+	 * segment or in case of non-segmented packet.
+	 */
+	struct rte_mbuf *next;
+#else
+	/**
+	 * Reserved for dynamic fields
+	 * when the next pointer is in first cache line (i.e. RTE_IOVA_AS_PA is 0).
+	 */
+	uint64_t dynfield2;
+#endif
+
+	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
+
 	/** Shared data for external buffer attached to mbuf. See
 	 * rte_pktmbuf_attach_extbuf().
 	 */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v7] mbuf perf test, please ignore
  2022-12-03 17:13 mbuf performance optimization Morten Brørup
                   ` (5 preceding siblings ...)
  2022-12-06 11:03 ` [PATCH v6] " Morten Brørup
@ 2022-12-06 12:12 ` Morten Brørup
  2022-12-06 12:16 ` [PATCH v8] " Morten Brørup
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Morten Brørup @ 2022-12-06 12:12 UTC (permalink / raw)
  To: dev; +Cc: Morten Brørup

Playing around with the mbuf structure, trying to reduce the use of the
second cache line in some common scenarios.

v7:
* Add some likely/unlikely.
* Implement new interpretation of 'nb_segs' in chain split in test_mbuf.
v6:
* Update the rte_kni_mbuf to match the rte_mbuf modifications.
v5:
* Fix rte_pktmbuf_chain() for the case where the head is a single segment.
v4:
* Use tabs, not spaces.
* Fix copy-paste bug in linearize.
v3:
* Make 'next' depend on 'nb_segs' > 1.
* Implement new interpretation of 'nb_segs' in i40e PMD.
v2:
* Remove BUILD_BUG_ON in cnxk PMD.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/test_mbuf.c                     |  1 +
 drivers/net/cnxk/cn10k_ethdev.c          |  2 ++
 drivers/net/cnxk/cn9k_ethdev.c           |  2 ++
 drivers/net/i40e/i40e_rxtx.c             |  9 +++--
 drivers/net/i40e/i40e_rxtx_vec_altivec.c |  4 +++
 drivers/net/i40e/i40e_rxtx_vec_common.h  |  2 ++
 drivers/net/i40e/i40e_rxtx_vec_neon.c    |  4 +++
 lib/kni/rte_kni_common.h                 | 12 +++++--
 lib/mbuf/rte_mbuf.c                      | 44 ++++++++++++++----------
 lib/mbuf/rte_mbuf.h                      | 42 +++++++++++++---------
 lib/mbuf/rte_mbuf_core.h                 | 38 ++++++++++----------
 11 files changed, 101 insertions(+), 59 deletions(-)

diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
index 53fe898a38..dffc40044b 100644
--- a/app/test/test_mbuf.c
+++ b/app/test/test_mbuf.c
@@ -2744,6 +2744,7 @@ test_nb_segs_and_next_reset(void)
 
 	/* split m0 chain in two, between m1 and m2 */
 	m0->nb_segs = 2;
+	m1->nb_segs = 1;
 	m1->next = NULL;
 	m2->nb_segs = 1;
 
diff --git a/drivers/net/cnxk/cn10k_ethdev.c b/drivers/net/cnxk/cn10k_ethdev.c
index 4658713591..9f6086efe6 100644
--- a/drivers/net/cnxk/cn10k_ethdev.c
+++ b/drivers/net/cnxk/cn10k_ethdev.c
@@ -72,8 +72,10 @@ nix_tx_offload_flags(struct rte_eth_dev *eth_dev)
 			 offsetof(struct rte_mbuf, buf_addr) + 24);
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
 			 offsetof(struct rte_mbuf, ol_flags) + 12);
+/*
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, tx_offload) !=
 			 offsetof(struct rte_mbuf, pool) + 2 * sizeof(void *));
+*/
 
 	if (conf & RTE_ETH_TX_OFFLOAD_VLAN_INSERT ||
 	    conf & RTE_ETH_TX_OFFLOAD_QINQ_INSERT)
diff --git a/drivers/net/cnxk/cn9k_ethdev.c b/drivers/net/cnxk/cn9k_ethdev.c
index 3b702d9696..3e9161ca79 100644
--- a/drivers/net/cnxk/cn9k_ethdev.c
+++ b/drivers/net/cnxk/cn9k_ethdev.c
@@ -72,8 +72,10 @@ nix_tx_offload_flags(struct rte_eth_dev *eth_dev)
 			 offsetof(struct rte_mbuf, buf_addr) + 24);
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
 			 offsetof(struct rte_mbuf, ol_flags) + 12);
+/*
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, tx_offload) !=
 			 offsetof(struct rte_mbuf, pool) + 2 * sizeof(void *));
+*/
 
 	if (conf & RTE_ETH_TX_OFFLOAD_VLAN_INSERT ||
 	    conf & RTE_ETH_TX_OFFLOAD_QINQ_INSERT)
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 788ffb51c2..a08afe0a13 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -920,8 +920,9 @@ i40e_recv_scattered_pkts(void *rx_queue,
 			first_seg->pkt_len =
 				(uint16_t)(first_seg->pkt_len +
 						rx_packet_len);
-			first_seg->nb_segs++;
+			last_seg->nb_segs = 2;
 			last_seg->next = rxm;
+			first_seg->nb_segs++;
 		}
 
 		/**
@@ -944,6 +945,7 @@ i40e_recv_scattered_pkts(void *rx_queue,
 		 *  the length of that CRC part from the data length of the
 		 *  previous mbuf.
 		 */
+		rxm->nb_segs = 1;
 		rxm->next = NULL;
 		if (unlikely(rxq->crc_len > 0)) {
 			first_seg->pkt_len -= RTE_ETHER_CRC_LEN;
@@ -953,6 +955,7 @@ i40e_recv_scattered_pkts(void *rx_queue,
 				last_seg->data_len =
 					(uint16_t)(last_seg->data_len -
 					(RTE_ETHER_CRC_LEN - rx_packet_len));
+				last_seg->nb_segs = 1;
 				last_seg->next = NULL;
 			} else
 				rxm->data_len = (uint16_t)(rx_packet_len -
@@ -1065,7 +1068,7 @@ i40e_calc_pkt_desc(struct rte_mbuf *tx_pkt)
 
 	while (txd != NULL) {
 		count += DIV_ROUND_UP(txd->data_len, I40E_MAX_DATA_PER_TXD);
-		txd = txd->next;
+		txd = (txd->nb_segs == 1) ? NULL : txd->next;
 	}
 
 	return count;
@@ -1282,7 +1285,7 @@ i40e_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 			txe->last_id = tx_last;
 			tx_id = txe->next_id;
 			txe = txn;
-			m_seg = m_seg->next;
+			m_seg = (m_seg->nb_segs == 1) ? NULL : m_seg->next;
 		} while (m_seg != NULL);
 
 		/* The last packet data descriptor needs End Of Packet (EOP) */
diff --git a/drivers/net/i40e/i40e_rxtx_vec_altivec.c b/drivers/net/i40e/i40e_rxtx_vec_altivec.c
index 2dfa04599c..ad91b5cb60 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_altivec.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_altivec.c
@@ -410,6 +410,10 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 			split_packet += RTE_I40E_DESCS_PER_LOOP;
 
 			/* zero-out next pointers */
+			rx_pkts[pos]->nb_segs = 1;
+			rx_pkts[pos + 1]->nb_segs = 1;
+			rx_pkts[pos + 2]->nb_segs = 1;
+			rx_pkts[pos + 3]->nb_segs = 1;
 			rx_pkts[pos]->next = NULL;
 			rx_pkts[pos + 1]->next = NULL;
 			rx_pkts[pos + 2]->next = NULL;
diff --git a/drivers/net/i40e/i40e_rxtx_vec_common.h b/drivers/net/i40e/i40e_rxtx_vec_common.h
index fe1a6ec75e..d5799f5242 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_common.h
+++ b/drivers/net/i40e/i40e_rxtx_vec_common.h
@@ -27,6 +27,7 @@ reassemble_packets(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_bufs,
 	for (buf_idx = 0, pkt_idx = 0; buf_idx < nb_bufs; buf_idx++) {
 		if (end != NULL) {
 			/* processing a split packet */
+			end->nb_segs = 2;
 			end->next = rx_bufs[buf_idx];
 			rx_bufs[buf_idx]->data_len += rxq->crc_len;
 
@@ -52,6 +53,7 @@ reassemble_packets(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_bufs,
 						secondlast = secondlast->next;
 					secondlast->data_len -= (rxq->crc_len -
 							end->data_len);
+					secondlast->nb_segs = 1;
 					secondlast->next = NULL;
 					rte_pktmbuf_free_seg(end);
 				}
diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c b/drivers/net/i40e/i40e_rxtx_vec_neon.c
index 12e6f1cbcb..3199b0f8cf 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
@@ -541,6 +541,10 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *__rte_restrict rxq,
 			split_packet += RTE_I40E_DESCS_PER_LOOP;
 
 			/* zero-out next pointers */
+			rx_pkts[pos]->nb_segs = 1;
+			rx_pkts[pos + 1]->nb_segs = 1;
+			rx_pkts[pos + 2]->nb_segs = 1;
+			rx_pkts[pos + 3]->nb_segs = 1;
 			rx_pkts[pos]->next = NULL;
 			rx_pkts[pos + 1]->next = NULL;
 			rx_pkts[pos + 2]->next = NULL;
diff --git a/lib/kni/rte_kni_common.h b/lib/kni/rte_kni_common.h
index 8d3ee0fa4f..01819ed6c6 100644
--- a/lib/kni/rte_kni_common.h
+++ b/lib/kni/rte_kni_common.h
@@ -80,7 +80,11 @@ struct rte_kni_fifo {
  */
 struct rte_kni_mbuf {
 	void *buf_addr __attribute__((__aligned__(RTE_CACHE_LINE_SIZE)));
+#if RTE_IOVA_AS_PA
 	uint64_t buf_iova;
+#else
+	void *next;             /**< Physical address of next mbuf in kernel. */
+#endif
 	uint16_t data_off;      /**< Start address of data in segment buffer. */
 	char pad1[2];
 	uint16_t nb_segs;       /**< Number of segments. */
@@ -89,12 +93,16 @@ struct rte_kni_mbuf {
 	char pad2[4];
 	uint32_t pkt_len;       /**< Total pkt len: sum of all segment data_len. */
 	uint16_t data_len;      /**< Amount of data in segment buffer. */
-	char pad3[14];
-	void *pool;
+	char pad3[22];
 
 	/* fields on second cache line */
 	__attribute__((__aligned__(RTE_CACHE_LINE_MIN_SIZE)))
+#if RTE_IOVA_AS_PA
 	void *next;             /**< Physical address of next mbuf in kernel. */
+#else
+	char pad5[8];
+#endif
+	void *pool;
 };
 
 /*
diff --git a/lib/mbuf/rte_mbuf.c b/lib/mbuf/rte_mbuf.c
index cfd8062f1e..9ebf65a04a 100644
--- a/lib/mbuf/rte_mbuf.c
+++ b/lib/mbuf/rte_mbuf.c
@@ -123,10 +123,10 @@ rte_pktmbuf_free_pinned_extmem(void *addr, void *opaque)
 
 	rte_mbuf_ext_refcnt_set(m->shinfo, 1);
 	m->ol_flags = RTE_MBUF_F_EXTERNAL;
-	if (m->next != NULL)
-		m->next = NULL;
-	if (m->nb_segs != 1)
+	if (unlikely(m->nb_segs != 1)) {
 		m->nb_segs = 1;
+		m->next = NULL;
+	}
 	rte_mbuf_raw_free(m);
 }
 
@@ -427,7 +427,7 @@ int rte_mbuf_check(const struct rte_mbuf *m, int is_header,
 		}
 		nb_segs -= 1;
 		pkt_len -= m->data_len;
-	} while ((m = m->next) != NULL);
+	} while ((m = ((m->nb_segs == 1) ? NULL : m->next)) != NULL);
 
 	if (nb_segs) {
 		*reason = "bad nb_segs";
@@ -494,13 +494,16 @@ void rte_pktmbuf_free_bulk(struct rte_mbuf **mbufs, unsigned int count)
 
 		__rte_mbuf_sanity_check(m, 1);
 
-		do {
+		while (unlikely(m->nb_segs != 1)) {
 			m_next = m->next;
 			__rte_pktmbuf_free_seg_via_array(m,
 					pending, &nb_pending,
 					RTE_PKTMBUF_FREE_PENDING_SZ);
 			m = m_next;
-		} while (m != NULL);
+		}
+		__rte_pktmbuf_free_seg_via_array(m,
+				pending, &nb_pending,
+				RTE_PKTMBUF_FREE_PENDING_SZ);
 	}
 
 	if (nb_pending > 0)
@@ -511,7 +514,7 @@ void rte_pktmbuf_free_bulk(struct rte_mbuf **mbufs, unsigned int count)
 struct rte_mbuf *
 rte_pktmbuf_clone(struct rte_mbuf *md, struct rte_mempool *mp)
 {
-	struct rte_mbuf *mc, *mi, **prev;
+	struct rte_mbuf *mc, *mi, *prev;
 	uint32_t pktlen;
 	uint16_t nseg;
 
@@ -520,19 +523,21 @@ rte_pktmbuf_clone(struct rte_mbuf *md, struct rte_mempool *mp)
 		return NULL;
 
 	mi = mc;
-	prev = &mi->next;
+	prev = mi;
 	pktlen = md->pkt_len;
 	nseg = 0;
 
 	do {
 		nseg++;
 		rte_pktmbuf_attach(mi, md);
-		*prev = mi;
-		prev = &mi->next;
-	} while ((md = md->next) != NULL &&
+		prev->nb_segs = 2;
+		prev->next = mi;
+		prev = mi;
+	} while ((md = ((md->nb_segs == 1) ? NULL : md->next)) != NULL &&
 	    (mi = rte_pktmbuf_alloc(mp)) != NULL);
 
-	*prev = NULL;
+	prev->nb_segs = 1;
+	prev->next = NULL;
 	mc->nb_segs = nseg;
 	mc->pkt_len = pktlen;
 
@@ -565,9 +570,9 @@ __rte_pktmbuf_linearize(struct rte_mbuf *mbuf)
 	mbuf->data_len = (uint16_t)(mbuf->pkt_len);
 
 	/* Append data from next segments to the first one */
-	m = mbuf->next;
+	m = (mbuf->nb_segs == 1) ? NULL : mbuf->next;
 	while (m != NULL) {
-		m_next = m->next;
+		m_next = (m->nb_segs == 1) ? NULL : m->next;
 
 		seg_len = rte_pktmbuf_data_len(m);
 		rte_memcpy(buffer, rte_pktmbuf_mtod(m, char *), seg_len);
@@ -589,7 +594,7 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 		 uint32_t off, uint32_t len)
 {
 	const struct rte_mbuf *seg = m;
-	struct rte_mbuf *mc, *m_last, **prev;
+	struct rte_mbuf *mc, *m_last, *prev;
 
 	/* garbage in check */
 	__rte_mbuf_sanity_check(m, 1);
@@ -611,7 +616,7 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 	/* copied mbuf is not indirect or external */
 	mc->ol_flags = m->ol_flags & ~(RTE_MBUF_F_INDIRECT|RTE_MBUF_F_EXTERNAL);
 
-	prev = &mc->next;
+	prev = mc;
 	m_last = mc;
 	while (len > 0) {
 		uint32_t copy_len;
@@ -629,9 +634,10 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 				rte_pktmbuf_free(mc);
 				return NULL;
 			}
+			prev->nb_segs = 2;
+			prev->next = m_last;
 			++mc->nb_segs;
-			*prev = m_last;
-			prev = &m_last->next;
+			prev = m_last;
 		}
 
 		/*
@@ -697,7 +703,7 @@ rte_pktmbuf_dump(FILE *f, const struct rte_mbuf *m, unsigned dump_len)
 		if (len != 0)
 			rte_hexdump(f, NULL, rte_pktmbuf_mtod(m, void *), len);
 		dump_len -= len;
-		m = m->next;
+		m = (m->nb_segs == 1) ? NULL : m->next;
 		nb_segs --;
 	}
 }
diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
index 3a82eb136d..0fd144def8 100644
--- a/lib/mbuf/rte_mbuf.h
+++ b/lib/mbuf/rte_mbuf.h
@@ -1345,7 +1345,7 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 
 	if (likely(rte_mbuf_refcnt_read(m) == 1)) {
 
-		if (!RTE_MBUF_DIRECT(m)) {
+		if (unlikely(!RTE_MBUF_DIRECT(m))) {
 			rte_pktmbuf_detach(m);
 			if (RTE_MBUF_HAS_EXTBUF(m) &&
 			    RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
@@ -1353,16 +1353,16 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 				return NULL;
 		}
 
-		if (m->next != NULL)
-			m->next = NULL;
-		if (m->nb_segs != 1)
+		if (unlikely(m->nb_segs != 1)) {
 			m->nb_segs = 1;
+			m->next = NULL;
+		}
 
 		return m;
 
 	} else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
 
-		if (!RTE_MBUF_DIRECT(m)) {
+		if (unlikely(!RTE_MBUF_DIRECT(m))) {
 			rte_pktmbuf_detach(m);
 			if (RTE_MBUF_HAS_EXTBUF(m) &&
 			    RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
@@ -1370,10 +1370,10 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 				return NULL;
 		}
 
-		if (m->next != NULL)
-			m->next = NULL;
-		if (m->nb_segs != 1)
+		if (unlikely(m->nb_segs != 1)) {
 			m->nb_segs = 1;
+			m->next = NULL;
+		}
 		rte_mbuf_refcnt_set(m, 1);
 
 		return m;
@@ -1411,14 +1411,17 @@ static inline void rte_pktmbuf_free(struct rte_mbuf *m)
 {
 	struct rte_mbuf *m_next;
 
-	if (m != NULL)
-		__rte_mbuf_sanity_check(m, 1);
+	if (m == NULL)
+		return;
 
-	while (m != NULL) {
+	__rte_mbuf_sanity_check(m, 1);
+
+	while (unlikely(m->nb_segs != 1)) {
 		m_next = m->next;
 		rte_pktmbuf_free_seg(m);
 		m = m_next;
 	}
+	rte_pktmbuf_free_seg(m);
 }
 
 /**
@@ -1493,11 +1496,16 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
  */
 static inline void rte_pktmbuf_refcnt_update(struct rte_mbuf *m, int16_t v)
 {
+	struct rte_mbuf *m_next;
+
 	__rte_mbuf_sanity_check(m, 1);
 
-	do {
+	while (unlikely(m->nb_segs != 1)) {
+		m_next = m->next;
 		rte_mbuf_refcnt_update(m, v);
-	} while ((m = m->next) != NULL);
+		m = m_next;
+	}
+	rte_mbuf_refcnt_update(m, v);
 }
 
 /**
@@ -1540,7 +1548,7 @@ static inline uint16_t rte_pktmbuf_tailroom(const struct rte_mbuf *m)
 static inline struct rte_mbuf *rte_pktmbuf_lastseg(struct rte_mbuf *m)
 {
 	__rte_mbuf_sanity_check(m, 1);
-	while (m->next != NULL)
+	while (unlikely(m->nb_segs != 1))
 		m = m->next;
 	return m;
 }
@@ -1758,20 +1766,22 @@ static inline const void *rte_pktmbuf_read(const struct rte_mbuf *m,
 static inline int rte_pktmbuf_chain(struct rte_mbuf *head, struct rte_mbuf *tail)
 {
 	struct rte_mbuf *cur_tail;
+	const unsigned int nb_segs = head->nb_segs + tail->nb_segs;
 
 	/* Check for number-of-segments-overflow */
-	if (head->nb_segs + tail->nb_segs > RTE_MBUF_MAX_NB_SEGS)
+	if (nb_segs > RTE_MBUF_MAX_NB_SEGS)
 		return -EOVERFLOW;
 
 	/* Chain 'tail' onto the old tail */
 	cur_tail = rte_pktmbuf_lastseg(head);
+	cur_tail->nb_segs = 2;
 	cur_tail->next = tail;
 
 	/* accumulate number of segments and total length.
 	 * NB: elaborating the addition like this instead of using
 	 *     += allows us to ensure the result type is uint16_t
 	 *     avoiding compiler warnings on gcc 8.1 at least */
-	head->nb_segs = (uint16_t)(head->nb_segs + tail->nb_segs);
+	head->nb_segs = (uint16_t)nb_segs;
 	head->pkt_len += tail->pkt_len;
 
 	/* pkt_len is only set in the head */
diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
index a30e1e0eaf..c0c3b45024 100644
--- a/lib/mbuf/rte_mbuf_core.h
+++ b/lib/mbuf/rte_mbuf_core.h
@@ -594,25 +594,6 @@ struct rte_mbuf {
 
 	uint16_t buf_len;         /**< Length of segment buffer. */
 
-	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
-
-	/* second cache line - fields only used in slow path or on TX */
-	RTE_MARKER cacheline1 __rte_cache_min_aligned;
-
-#if RTE_IOVA_AS_PA
-	/**
-	 * Next segment of scattered packet. Must be NULL in the last
-	 * segment or in case of non-segmented packet.
-	 */
-	struct rte_mbuf *next;
-#else
-	/**
-	 * Reserved for dynamic fields
-	 * when the next pointer is in first cache line (i.e. RTE_IOVA_AS_PA is 0).
-	 */
-	uint64_t dynfield2;
-#endif
-
 	/* fields to support TX offloads */
 	RTE_STD_C11
 	union {
@@ -651,6 +632,25 @@ struct rte_mbuf {
 		};
 	};
 
+	/* second cache line - fields only used in slow path or on TX */
+	RTE_MARKER cacheline1 __rte_cache_min_aligned;
+
+#if RTE_IOVA_AS_PA
+	/**
+	 * Next segment of scattered packet. Must be NULL in the last
+	 * segment or in case of non-segmented packet.
+	 */
+	struct rte_mbuf *next;
+#else
+	/**
+	 * Reserved for dynamic fields
+	 * when the next pointer is in first cache line (i.e. RTE_IOVA_AS_PA is 0).
+	 */
+	uint64_t dynfield2;
+#endif
+
+	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
+
 	/** Shared data for external buffer attached to mbuf. See
 	 * rte_pktmbuf_attach_extbuf().
 	 */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v8] mbuf perf test, please ignore
  2022-12-03 17:13 mbuf performance optimization Morten Brørup
                   ` (6 preceding siblings ...)
  2022-12-06 12:12 ` [PATCH v7] " Morten Brørup
@ 2022-12-06 12:16 ` Morten Brørup
  2022-12-06 12:35 ` [PATCH v9] " Morten Brørup
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Morten Brørup @ 2022-12-06 12:16 UTC (permalink / raw)
  To: dev; +Cc: Morten Brørup

Playing around with the mbuf structure, trying to reduce the use of the
second cache line in some common scenarios.

v8:
* Fix typo in comparison.
v7:
* Add some likely/unlikely.
* Implement new interpretation of 'nb_segs' in chain split in test_mbuf.
v6:
* Update the rte_kni_mbuf to match the rte_mbuf modifications.
v5:
* Fix rte_pktmbuf_chain() for the case where the head is a single segment.
v4:
* Use tabs, not spaces.
* Fix copy-paste bug in linearize.
v3:
* Make 'next' depend on 'nb_segs' > 1.
* Implement new interpretation of 'nb_segs' in i40e PMD.
v2:
* Remove BUILD_BUG_ON in cnxk PMD.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/test_mbuf.c                     |  1 +
 drivers/net/cnxk/cn10k_ethdev.c          |  2 ++
 drivers/net/cnxk/cn9k_ethdev.c           |  2 ++
 drivers/net/i40e/i40e_rxtx.c             |  9 +++--
 drivers/net/i40e/i40e_rxtx_vec_altivec.c |  4 +++
 drivers/net/i40e/i40e_rxtx_vec_common.h  |  2 ++
 drivers/net/i40e/i40e_rxtx_vec_neon.c    |  4 +++
 lib/kni/rte_kni_common.h                 | 12 +++++--
 lib/mbuf/rte_mbuf.c                      | 44 ++++++++++++++----------
 lib/mbuf/rte_mbuf.h                      | 42 +++++++++++++---------
 lib/mbuf/rte_mbuf_core.h                 | 38 ++++++++++----------
 11 files changed, 101 insertions(+), 59 deletions(-)

diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
index 53fe898a38..dffc40044b 100644
--- a/app/test/test_mbuf.c
+++ b/app/test/test_mbuf.c
@@ -2744,6 +2744,7 @@ test_nb_segs_and_next_reset(void)
 
 	/* split m0 chain in two, between m1 and m2 */
 	m0->nb_segs = 2;
+	m1->nb_segs = 1;
 	m1->next = NULL;
 	m2->nb_segs = 1;
 
diff --git a/drivers/net/cnxk/cn10k_ethdev.c b/drivers/net/cnxk/cn10k_ethdev.c
index 4658713591..9f6086efe6 100644
--- a/drivers/net/cnxk/cn10k_ethdev.c
+++ b/drivers/net/cnxk/cn10k_ethdev.c
@@ -72,8 +72,10 @@ nix_tx_offload_flags(struct rte_eth_dev *eth_dev)
 			 offsetof(struct rte_mbuf, buf_addr) + 24);
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
 			 offsetof(struct rte_mbuf, ol_flags) + 12);
+/*
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, tx_offload) !=
 			 offsetof(struct rte_mbuf, pool) + 2 * sizeof(void *));
+*/
 
 	if (conf & RTE_ETH_TX_OFFLOAD_VLAN_INSERT ||
 	    conf & RTE_ETH_TX_OFFLOAD_QINQ_INSERT)
diff --git a/drivers/net/cnxk/cn9k_ethdev.c b/drivers/net/cnxk/cn9k_ethdev.c
index 3b702d9696..3e9161ca79 100644
--- a/drivers/net/cnxk/cn9k_ethdev.c
+++ b/drivers/net/cnxk/cn9k_ethdev.c
@@ -72,8 +72,10 @@ nix_tx_offload_flags(struct rte_eth_dev *eth_dev)
 			 offsetof(struct rte_mbuf, buf_addr) + 24);
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
 			 offsetof(struct rte_mbuf, ol_flags) + 12);
+/*
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, tx_offload) !=
 			 offsetof(struct rte_mbuf, pool) + 2 * sizeof(void *));
+*/
 
 	if (conf & RTE_ETH_TX_OFFLOAD_VLAN_INSERT ||
 	    conf & RTE_ETH_TX_OFFLOAD_QINQ_INSERT)
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 788ffb51c2..a08afe0a13 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -920,8 +920,9 @@ i40e_recv_scattered_pkts(void *rx_queue,
 			first_seg->pkt_len =
 				(uint16_t)(first_seg->pkt_len +
 						rx_packet_len);
-			first_seg->nb_segs++;
+			last_seg->nb_segs = 2;
 			last_seg->next = rxm;
+			first_seg->nb_segs++;
 		}
 
 		/**
@@ -944,6 +945,7 @@ i40e_recv_scattered_pkts(void *rx_queue,
 		 *  the length of that CRC part from the data length of the
 		 *  previous mbuf.
 		 */
+		rxm->nb_segs = 1;
 		rxm->next = NULL;
 		if (unlikely(rxq->crc_len > 0)) {
 			first_seg->pkt_len -= RTE_ETHER_CRC_LEN;
@@ -953,6 +955,7 @@ i40e_recv_scattered_pkts(void *rx_queue,
 				last_seg->data_len =
 					(uint16_t)(last_seg->data_len -
 					(RTE_ETHER_CRC_LEN - rx_packet_len));
+				last_seg->nb_segs = 1;
 				last_seg->next = NULL;
 			} else
 				rxm->data_len = (uint16_t)(rx_packet_len -
@@ -1065,7 +1068,7 @@ i40e_calc_pkt_desc(struct rte_mbuf *tx_pkt)
 
 	while (txd != NULL) {
 		count += DIV_ROUND_UP(txd->data_len, I40E_MAX_DATA_PER_TXD);
-		txd = txd->next;
+		txd = (txd->nb_segs == 1) ? NULL : txd->next;
 	}
 
 	return count;
@@ -1282,7 +1285,7 @@ i40e_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 			txe->last_id = tx_last;
 			tx_id = txe->next_id;
 			txe = txn;
-			m_seg = m_seg->next;
+			m_seg = (m_seg->nb_segs == 1) ? NULL : m_seg->next;
 		} while (m_seg != NULL);
 
 		/* The last packet data descriptor needs End Of Packet (EOP) */
diff --git a/drivers/net/i40e/i40e_rxtx_vec_altivec.c b/drivers/net/i40e/i40e_rxtx_vec_altivec.c
index 2dfa04599c..ad91b5cb60 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_altivec.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_altivec.c
@@ -410,6 +410,10 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 			split_packet += RTE_I40E_DESCS_PER_LOOP;
 
 			/* zero-out next pointers */
+			rx_pkts[pos]->nb_segs = 1;
+			rx_pkts[pos + 1]->nb_segs = 1;
+			rx_pkts[pos + 2]->nb_segs = 1;
+			rx_pkts[pos + 3]->nb_segs = 1;
 			rx_pkts[pos]->next = NULL;
 			rx_pkts[pos + 1]->next = NULL;
 			rx_pkts[pos + 2]->next = NULL;
diff --git a/drivers/net/i40e/i40e_rxtx_vec_common.h b/drivers/net/i40e/i40e_rxtx_vec_common.h
index fe1a6ec75e..d5799f5242 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_common.h
+++ b/drivers/net/i40e/i40e_rxtx_vec_common.h
@@ -27,6 +27,7 @@ reassemble_packets(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_bufs,
 	for (buf_idx = 0, pkt_idx = 0; buf_idx < nb_bufs; buf_idx++) {
 		if (end != NULL) {
 			/* processing a split packet */
+			end->nb_segs = 2;
 			end->next = rx_bufs[buf_idx];
 			rx_bufs[buf_idx]->data_len += rxq->crc_len;
 
@@ -52,6 +53,7 @@ reassemble_packets(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_bufs,
 						secondlast = secondlast->next;
 					secondlast->data_len -= (rxq->crc_len -
 							end->data_len);
+					secondlast->nb_segs = 1;
 					secondlast->next = NULL;
 					rte_pktmbuf_free_seg(end);
 				}
diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c b/drivers/net/i40e/i40e_rxtx_vec_neon.c
index 12e6f1cbcb..3199b0f8cf 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
@@ -541,6 +541,10 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *__rte_restrict rxq,
 			split_packet += RTE_I40E_DESCS_PER_LOOP;
 
 			/* zero-out next pointers */
+			rx_pkts[pos]->nb_segs = 1;
+			rx_pkts[pos + 1]->nb_segs = 1;
+			rx_pkts[pos + 2]->nb_segs = 1;
+			rx_pkts[pos + 3]->nb_segs = 1;
 			rx_pkts[pos]->next = NULL;
 			rx_pkts[pos + 1]->next = NULL;
 			rx_pkts[pos + 2]->next = NULL;
diff --git a/lib/kni/rte_kni_common.h b/lib/kni/rte_kni_common.h
index 8d3ee0fa4f..01819ed6c6 100644
--- a/lib/kni/rte_kni_common.h
+++ b/lib/kni/rte_kni_common.h
@@ -80,7 +80,11 @@ struct rte_kni_fifo {
  */
 struct rte_kni_mbuf {
 	void *buf_addr __attribute__((__aligned__(RTE_CACHE_LINE_SIZE)));
+#if RTE_IOVA_AS_PA
 	uint64_t buf_iova;
+#else
+	void *next;             /**< Physical address of next mbuf in kernel. */
+#endif
 	uint16_t data_off;      /**< Start address of data in segment buffer. */
 	char pad1[2];
 	uint16_t nb_segs;       /**< Number of segments. */
@@ -89,12 +93,16 @@ struct rte_kni_mbuf {
 	char pad2[4];
 	uint32_t pkt_len;       /**< Total pkt len: sum of all segment data_len. */
 	uint16_t data_len;      /**< Amount of data in segment buffer. */
-	char pad3[14];
-	void *pool;
+	char pad3[22];
 
 	/* fields on second cache line */
 	__attribute__((__aligned__(RTE_CACHE_LINE_MIN_SIZE)))
+#if RTE_IOVA_AS_PA
 	void *next;             /**< Physical address of next mbuf in kernel. */
+#else
+	char pad5[8];
+#endif
+	void *pool;
 };
 
 /*
diff --git a/lib/mbuf/rte_mbuf.c b/lib/mbuf/rte_mbuf.c
index cfd8062f1e..9ebf65a04a 100644
--- a/lib/mbuf/rte_mbuf.c
+++ b/lib/mbuf/rte_mbuf.c
@@ -123,10 +123,10 @@ rte_pktmbuf_free_pinned_extmem(void *addr, void *opaque)
 
 	rte_mbuf_ext_refcnt_set(m->shinfo, 1);
 	m->ol_flags = RTE_MBUF_F_EXTERNAL;
-	if (m->next != NULL)
-		m->next = NULL;
-	if (m->nb_segs != 1)
+	if (unlikely(m->nb_segs != 1)) {
 		m->nb_segs = 1;
+		m->next = NULL;
+	}
 	rte_mbuf_raw_free(m);
 }
 
@@ -427,7 +427,7 @@ int rte_mbuf_check(const struct rte_mbuf *m, int is_header,
 		}
 		nb_segs -= 1;
 		pkt_len -= m->data_len;
-	} while ((m = m->next) != NULL);
+	} while ((m = ((m->nb_segs == 1) ? NULL : m->next)) != NULL);
 
 	if (nb_segs) {
 		*reason = "bad nb_segs";
@@ -494,13 +494,16 @@ void rte_pktmbuf_free_bulk(struct rte_mbuf **mbufs, unsigned int count)
 
 		__rte_mbuf_sanity_check(m, 1);
 
-		do {
+		while (unlikely(m->nb_segs != 1)) {
 			m_next = m->next;
 			__rte_pktmbuf_free_seg_via_array(m,
 					pending, &nb_pending,
 					RTE_PKTMBUF_FREE_PENDING_SZ);
 			m = m_next;
-		} while (m != NULL);
+		}
+		__rte_pktmbuf_free_seg_via_array(m,
+				pending, &nb_pending,
+				RTE_PKTMBUF_FREE_PENDING_SZ);
 	}
 
 	if (nb_pending > 0)
@@ -511,7 +514,7 @@ void rte_pktmbuf_free_bulk(struct rte_mbuf **mbufs, unsigned int count)
 struct rte_mbuf *
 rte_pktmbuf_clone(struct rte_mbuf *md, struct rte_mempool *mp)
 {
-	struct rte_mbuf *mc, *mi, **prev;
+	struct rte_mbuf *mc, *mi, *prev;
 	uint32_t pktlen;
 	uint16_t nseg;
 
@@ -520,19 +523,21 @@ rte_pktmbuf_clone(struct rte_mbuf *md, struct rte_mempool *mp)
 		return NULL;
 
 	mi = mc;
-	prev = &mi->next;
+	prev = mi;
 	pktlen = md->pkt_len;
 	nseg = 0;
 
 	do {
 		nseg++;
 		rte_pktmbuf_attach(mi, md);
-		*prev = mi;
-		prev = &mi->next;
-	} while ((md = md->next) != NULL &&
+		prev->nb_segs = 2;
+		prev->next = mi;
+		prev = mi;
+	} while ((md = ((md->nb_segs == 1) ? NULL : md->next)) != NULL &&
 	    (mi = rte_pktmbuf_alloc(mp)) != NULL);
 
-	*prev = NULL;
+	prev->nb_segs = 1;
+	prev->next = NULL;
 	mc->nb_segs = nseg;
 	mc->pkt_len = pktlen;
 
@@ -565,9 +570,9 @@ __rte_pktmbuf_linearize(struct rte_mbuf *mbuf)
 	mbuf->data_len = (uint16_t)(mbuf->pkt_len);
 
 	/* Append data from next segments to the first one */
-	m = mbuf->next;
+	m = (mbuf->nb_segs == 1) ? NULL : mbuf->next;
 	while (m != NULL) {
-		m_next = m->next;
+		m_next = (m->nb_segs == 1) ? NULL : m->next;
 
 		seg_len = rte_pktmbuf_data_len(m);
 		rte_memcpy(buffer, rte_pktmbuf_mtod(m, char *), seg_len);
@@ -589,7 +594,7 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 		 uint32_t off, uint32_t len)
 {
 	const struct rte_mbuf *seg = m;
-	struct rte_mbuf *mc, *m_last, **prev;
+	struct rte_mbuf *mc, *m_last, *prev;
 
 	/* garbage in check */
 	__rte_mbuf_sanity_check(m, 1);
@@ -611,7 +616,7 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 	/* copied mbuf is not indirect or external */
 	mc->ol_flags = m->ol_flags & ~(RTE_MBUF_F_INDIRECT|RTE_MBUF_F_EXTERNAL);
 
-	prev = &mc->next;
+	prev = mc;
 	m_last = mc;
 	while (len > 0) {
 		uint32_t copy_len;
@@ -629,9 +634,10 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 				rte_pktmbuf_free(mc);
 				return NULL;
 			}
+			prev->nb_segs = 2;
+			prev->next = m_last;
 			++mc->nb_segs;
-			*prev = m_last;
-			prev = &m_last->next;
+			prev = m_last;
 		}
 
 		/*
@@ -697,7 +703,7 @@ rte_pktmbuf_dump(FILE *f, const struct rte_mbuf *m, unsigned dump_len)
 		if (len != 0)
 			rte_hexdump(f, NULL, rte_pktmbuf_mtod(m, void *), len);
 		dump_len -= len;
-		m = m->next;
+		m = (m->nb_segs == 1) ? NULL : m->next;
 		nb_segs --;
 	}
 }
diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
index 3a82eb136d..5167052154 100644
--- a/lib/mbuf/rte_mbuf.h
+++ b/lib/mbuf/rte_mbuf.h
@@ -1345,7 +1345,7 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 
 	if (likely(rte_mbuf_refcnt_read(m) == 1)) {
 
-		if (!RTE_MBUF_DIRECT(m)) {
+		if (unlikely(!RTE_MBUF_DIRECT(m))) {
 			rte_pktmbuf_detach(m);
 			if (RTE_MBUF_HAS_EXTBUF(m) &&
 			    RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
@@ -1353,16 +1353,16 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 				return NULL;
 		}
 
-		if (m->next != NULL)
-			m->next = NULL;
-		if (m->nb_segs != 1)
+		if (unlikely(m->nb_segs != 1)) {
 			m->nb_segs = 1;
+			m->next = NULL;
+		}
 
 		return m;
 
 	} else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
 
-		if (!RTE_MBUF_DIRECT(m)) {
+		if (unlikely(!RTE_MBUF_DIRECT(m))) {
 			rte_pktmbuf_detach(m);
 			if (RTE_MBUF_HAS_EXTBUF(m) &&
 			    RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
@@ -1370,10 +1370,10 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 				return NULL;
 		}
 
-		if (m->next != NULL)
-			m->next = NULL;
-		if (m->nb_segs != 1)
+		if (unlikely(m->nb_segs != 1)) {
 			m->nb_segs = 1;
+			m->next = NULL;
+		}
 		rte_mbuf_refcnt_set(m, 1);
 
 		return m;
@@ -1411,14 +1411,17 @@ static inline void rte_pktmbuf_free(struct rte_mbuf *m)
 {
 	struct rte_mbuf *m_next;
 
-	if (m != NULL)
-		__rte_mbuf_sanity_check(m, 1);
+	if (m == NULL)
+		return;
 
-	while (m != NULL) {
+	__rte_mbuf_sanity_check(m, 1);
+
+	while (unlikely(m->nb_segs != 1)) {
 		m_next = m->next;
 		rte_pktmbuf_free_seg(m);
 		m = m_next;
 	}
+	rte_pktmbuf_free_seg(m);
 }
 
 /**
@@ -1493,11 +1496,16 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
  */
 static inline void rte_pktmbuf_refcnt_update(struct rte_mbuf *m, int16_t v)
 {
+	struct rte_mbuf *m_next;
+
 	__rte_mbuf_sanity_check(m, 1);
 
-	do {
+	while (unlikely(m->nb_segs != 1)) {
+		m_next = m->next;
 		rte_mbuf_refcnt_update(m, v);
-	} while ((m = m->next) != NULL);
+		m = m_next;
+	}
+	rte_mbuf_refcnt_update(m, v);
 }
 
 /**
@@ -1540,7 +1548,7 @@ static inline uint16_t rte_pktmbuf_tailroom(const struct rte_mbuf *m)
 static inline struct rte_mbuf *rte_pktmbuf_lastseg(struct rte_mbuf *m)
 {
 	__rte_mbuf_sanity_check(m, 1);
-	while (m->next != NULL)
+	while (unlikely(m->nb_segs != 1))
 		m = m->next;
 	return m;
 }
@@ -1758,20 +1766,22 @@ static inline const void *rte_pktmbuf_read(const struct rte_mbuf *m,
 static inline int rte_pktmbuf_chain(struct rte_mbuf *head, struct rte_mbuf *tail)
 {
 	struct rte_mbuf *cur_tail;
+	const unsigned int nb_segs = head->nb_segs + tail->nb_segs;
 
 	/* Check for number-of-segments-overflow */
-	if (head->nb_segs + tail->nb_segs > RTE_MBUF_MAX_NB_SEGS)
+	if (nb_segs > RTE_MBUF_MAX_NB_SEGS)
 		return -EOVERFLOW;
 
 	/* Chain 'tail' onto the old tail */
 	cur_tail = rte_pktmbuf_lastseg(head);
+	cur_tail->nb_segs = 2;
 	cur_tail->next = tail;
 
 	/* accumulate number of segments and total length.
 	 * NB: elaborating the addition like this instead of using
 	 *     -= allows us to ensure the result type is uint16_t
 	 *     avoiding compiler warnings on gcc 8.1 at least */
-	head->nb_segs = (uint16_t)(head->nb_segs + tail->nb_segs);
+	head->nb_segs = (uint16_t)nb_segs;
 	head->pkt_len += tail->pkt_len;
 
 	/* pkt_len is only set in the head */
diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
index a30e1e0eaf..c0c3b45024 100644
--- a/lib/mbuf/rte_mbuf_core.h
+++ b/lib/mbuf/rte_mbuf_core.h
@@ -594,25 +594,6 @@ struct rte_mbuf {
 
 	uint16_t buf_len;         /**< Length of segment buffer. */
 
-	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
-
-	/* second cache line - fields only used in slow path or on TX */
-	RTE_MARKER cacheline1 __rte_cache_min_aligned;
-
-#if RTE_IOVA_AS_PA
-	/**
-	 * Next segment of scattered packet. Must be NULL in the last
-	 * segment or in case of non-segmented packet.
-	 */
-	struct rte_mbuf *next;
-#else
-	/**
-	 * Reserved for dynamic fields
-	 * when the next pointer is in first cache line (i.e. RTE_IOVA_AS_PA is 0).
-	 */
-	uint64_t dynfield2;
-#endif
-
 	/* fields to support TX offloads */
 	RTE_STD_C11
 	union {
@@ -651,6 +632,25 @@ struct rte_mbuf {
 		};
 	};
 
+	/* second cache line - fields only used in slow path or on TX */
+	RTE_MARKER cacheline1 __rte_cache_min_aligned;
+
+#if RTE_IOVA_AS_PA
+	/**
+	 * Next segment of scattered packet. Must be NULL in the last
+	 * segment or in case of non-segmented packet.
+	 */
+	struct rte_mbuf *next;
+#else
+	/**
+	 * Reserved for dynamic fields
+	 * when the next pointer is in first cache line (i.e. RTE_IOVA_AS_PA is 0).
+	 */
+	uint64_t dynfield2;
+#endif
+
+	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
+
 	/** Shared data for external buffer attached to mbuf. See
 	 * rte_pktmbuf_attach_extbuf().
 	 */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v9] mbuf perf test, please ignore
  2022-12-03 17:13 mbuf performance optimization Morten Brørup
                   ` (7 preceding siblings ...)
  2022-12-06 12:16 ` [PATCH v8] " Morten Brørup
@ 2022-12-06 12:35 ` Morten Brørup
  2022-12-06 13:53 ` [PATCH v10] " Morten Brørup
  2022-12-07  7:35 ` [PATCH v11] " Morten Brørup
  10 siblings, 0 replies; 12+ messages in thread
From: Morten Brørup @ 2022-12-06 12:35 UTC (permalink / raw)
  To: dev; +Cc: Morten Brørup

Playing around with the mbuf structure, trying to reduce the use of the
second cache line in some common scenarios.

v9:
* Set enable_iova_as_pa meson option to false.
v8:
* Fix typo in comparison.
v7:
* Add some likely/unlikely.
* Implement new interpretation of 'nb_segs' in chain split in test_mbuf.
v6:
* Update the rte_kni_mbuf to match the rte_mbuf modifications.
v5:
* Fix rte_pktmbuf_chain() for the case where the head is a single segment.
v4:
* Use tabs, not spaces.
* Fix copy-paste bug in linearize.
v3:
* Make 'next' depend on 'nb_segs' > 1.
* Implement new interpretation of 'nb_segs' in i40e PMD.
v2:
* Remove BUILD_BUG_ON in cnxk PMD.
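
The core idea in v3 above ("Make 'next' depend on 'nb_segs' > 1") can be sketched in isolation. This is a minimal stand-in, not the real `rte_mbuf`: under the new invariant, any segment with a successor has `nb_segs != 1`, so a chain walk never dereferences `next` (which lives in the second cache line) for single-segment packets.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two mbuf fields involved. */
struct seg {
	uint16_t nb_segs;	/* != 1 on any segment that has a successor */
	struct seg *next;	/* only valid when nb_segs != 1 */
};

/* Walk a chain under the patch's invariant: 'next' is only read when
 * nb_segs != 1, so the second cache line stays untouched for the
 * common non-segmented case. */
static unsigned int
count_segs(struct seg *m)
{
	unsigned int n = 0;

	while (m != NULL) {
		n++;
		m = (m->nb_segs == 1) ? NULL : m->next;
	}
	return n;
}
```

This mirrors the loop shape used throughout the patch, e.g. in `i40e_calc_pkt_desc()` and `rte_pktmbuf_dump()`.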

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/test_mbuf.c                     |  1 +
 drivers/net/cnxk/cn10k_ethdev.c          |  2 ++
 drivers/net/cnxk/cn9k_ethdev.c           |  2 ++
 drivers/net/i40e/i40e_rxtx.c             |  9 +++--
 drivers/net/i40e/i40e_rxtx_vec_altivec.c |  4 +++
 drivers/net/i40e/i40e_rxtx_vec_common.h  |  2 ++
 drivers/net/i40e/i40e_rxtx_vec_neon.c    |  4 +++
 lib/kni/rte_kni_common.h                 | 12 +++++--
 lib/mbuf/rte_mbuf.c                      | 44 ++++++++++++++----------
 lib/mbuf/rte_mbuf.h                      | 42 +++++++++++++---------
 lib/mbuf/rte_mbuf_core.h                 | 38 ++++++++++----------
 meson_options.txt                        |  2 +-
 12 files changed, 102 insertions(+), 60 deletions(-)

diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
index 53fe898a38..dffc40044b 100644
--- a/app/test/test_mbuf.c
+++ b/app/test/test_mbuf.c
@@ -2744,6 +2744,7 @@ test_nb_segs_and_next_reset(void)
 
 	/* split m0 chain in two, between m1 and m2 */
 	m0->nb_segs = 2;
+	m1->nb_segs = 1;
 	m1->next = NULL;
 	m2->nb_segs = 1;
 
diff --git a/drivers/net/cnxk/cn10k_ethdev.c b/drivers/net/cnxk/cn10k_ethdev.c
index 4658713591..9f6086efe6 100644
--- a/drivers/net/cnxk/cn10k_ethdev.c
+++ b/drivers/net/cnxk/cn10k_ethdev.c
@@ -72,8 +72,10 @@ nix_tx_offload_flags(struct rte_eth_dev *eth_dev)
 			 offsetof(struct rte_mbuf, buf_addr) + 24);
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
 			 offsetof(struct rte_mbuf, ol_flags) + 12);
+/*
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, tx_offload) !=
 			 offsetof(struct rte_mbuf, pool) + 2 * sizeof(void *));
+*/
 
 	if (conf & RTE_ETH_TX_OFFLOAD_VLAN_INSERT ||
 	    conf & RTE_ETH_TX_OFFLOAD_QINQ_INSERT)
diff --git a/drivers/net/cnxk/cn9k_ethdev.c b/drivers/net/cnxk/cn9k_ethdev.c
index 3b702d9696..3e9161ca79 100644
--- a/drivers/net/cnxk/cn9k_ethdev.c
+++ b/drivers/net/cnxk/cn9k_ethdev.c
@@ -72,8 +72,10 @@ nix_tx_offload_flags(struct rte_eth_dev *eth_dev)
 			 offsetof(struct rte_mbuf, buf_addr) + 24);
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
 			 offsetof(struct rte_mbuf, ol_flags) + 12);
+/*
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, tx_offload) !=
 			 offsetof(struct rte_mbuf, pool) + 2 * sizeof(void *));
+*/
 
 	if (conf & RTE_ETH_TX_OFFLOAD_VLAN_INSERT ||
 	    conf & RTE_ETH_TX_OFFLOAD_QINQ_INSERT)
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 788ffb51c2..a08afe0a13 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -920,8 +920,9 @@ i40e_recv_scattered_pkts(void *rx_queue,
 			first_seg->pkt_len =
 				(uint16_t)(first_seg->pkt_len +
 						rx_packet_len);
-			first_seg->nb_segs++;
+			last_seg->nb_segs = 2;
 			last_seg->next = rxm;
+			first_seg->nb_segs++;
 		}
 
 		/**
@@ -944,6 +945,7 @@ i40e_recv_scattered_pkts(void *rx_queue,
 		 *  the length of that CRC part from the data length of the
 		 *  previous mbuf.
 		 */
+		rxm->nb_segs = 1;
 		rxm->next = NULL;
 		if (unlikely(rxq->crc_len > 0)) {
 			first_seg->pkt_len -= RTE_ETHER_CRC_LEN;
@@ -953,6 +955,7 @@ i40e_recv_scattered_pkts(void *rx_queue,
 				last_seg->data_len =
 					(uint16_t)(last_seg->data_len -
 					(RTE_ETHER_CRC_LEN - rx_packet_len));
+				last_seg->nb_segs = 1;
 				last_seg->next = NULL;
 			} else
 				rxm->data_len = (uint16_t)(rx_packet_len -
@@ -1065,7 +1068,7 @@ i40e_calc_pkt_desc(struct rte_mbuf *tx_pkt)
 
 	while (txd != NULL) {
 		count += DIV_ROUND_UP(txd->data_len, I40E_MAX_DATA_PER_TXD);
-		txd = txd->next;
+		txd = (txd->nb_segs == 1) ? NULL : txd->next;
 	}
 
 	return count;
@@ -1282,7 +1285,7 @@ i40e_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 			txe->last_id = tx_last;
 			tx_id = txe->next_id;
 			txe = txn;
-			m_seg = m_seg->next;
+			m_seg = (m_seg->nb_segs == 1) ? NULL : m_seg->next;
 		} while (m_seg != NULL);
 
 		/* The last packet data descriptor needs End Of Packet (EOP) */
diff --git a/drivers/net/i40e/i40e_rxtx_vec_altivec.c b/drivers/net/i40e/i40e_rxtx_vec_altivec.c
index 2dfa04599c..ad91b5cb60 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_altivec.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_altivec.c
@@ -410,6 +410,10 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 			split_packet += RTE_I40E_DESCS_PER_LOOP;
 
 			/* zero-out next pointers */
+			rx_pkts[pos]->nb_segs = 1;
+			rx_pkts[pos + 1]->nb_segs = 1;
+			rx_pkts[pos + 2]->nb_segs = 1;
+			rx_pkts[pos + 3]->nb_segs = 1;
 			rx_pkts[pos]->next = NULL;
 			rx_pkts[pos + 1]->next = NULL;
 			rx_pkts[pos + 2]->next = NULL;
diff --git a/drivers/net/i40e/i40e_rxtx_vec_common.h b/drivers/net/i40e/i40e_rxtx_vec_common.h
index fe1a6ec75e..d5799f5242 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_common.h
+++ b/drivers/net/i40e/i40e_rxtx_vec_common.h
@@ -27,6 +27,7 @@ reassemble_packets(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_bufs,
 	for (buf_idx = 0, pkt_idx = 0; buf_idx < nb_bufs; buf_idx++) {
 		if (end != NULL) {
 			/* processing a split packet */
+			end->nb_segs = 2;
 			end->next = rx_bufs[buf_idx];
 			rx_bufs[buf_idx]->data_len += rxq->crc_len;
 
@@ -52,6 +53,7 @@ reassemble_packets(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_bufs,
 						secondlast = secondlast->next;
 					secondlast->data_len -= (rxq->crc_len -
 							end->data_len);
+					secondlast->nb_segs = 1;
 					secondlast->next = NULL;
 					rte_pktmbuf_free_seg(end);
 				}
diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c b/drivers/net/i40e/i40e_rxtx_vec_neon.c
index 12e6f1cbcb..3199b0f8cf 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
@@ -541,6 +541,10 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *__rte_restrict rxq,
 			split_packet += RTE_I40E_DESCS_PER_LOOP;
 
 			/* zero-out next pointers */
+			rx_pkts[pos]->nb_segs = 1;
+			rx_pkts[pos + 1]->nb_segs = 1;
+			rx_pkts[pos + 2]->nb_segs = 1;
+			rx_pkts[pos + 3]->nb_segs = 1;
 			rx_pkts[pos]->next = NULL;
 			rx_pkts[pos + 1]->next = NULL;
 			rx_pkts[pos + 2]->next = NULL;
diff --git a/lib/kni/rte_kni_common.h b/lib/kni/rte_kni_common.h
index 8d3ee0fa4f..01819ed6c6 100644
--- a/lib/kni/rte_kni_common.h
+++ b/lib/kni/rte_kni_common.h
@@ -80,7 +80,11 @@ struct rte_kni_fifo {
  */
 struct rte_kni_mbuf {
 	void *buf_addr __attribute__((__aligned__(RTE_CACHE_LINE_SIZE)));
+#if RTE_IOVA_AS_PA
 	uint64_t buf_iova;
+#else
+	void *next;             /**< Physical address of next mbuf in kernel. */
+#endif
 	uint16_t data_off;      /**< Start address of data in segment buffer. */
 	char pad1[2];
 	uint16_t nb_segs;       /**< Number of segments. */
@@ -89,12 +93,16 @@ struct rte_kni_mbuf {
 	char pad2[4];
 	uint32_t pkt_len;       /**< Total pkt len: sum of all segment data_len. */
 	uint16_t data_len;      /**< Amount of data in segment buffer. */
-	char pad3[14];
-	void *pool;
+	char pad3[22];
 
 	/* fields on second cache line */
 	__attribute__((__aligned__(RTE_CACHE_LINE_MIN_SIZE)))
+#if RTE_IOVA_AS_PA
 	void *next;             /**< Physical address of next mbuf in kernel. */
+#else
+	char pad5[8];
+#endif
+	void *pool;
 };
 
 /*
diff --git a/lib/mbuf/rte_mbuf.c b/lib/mbuf/rte_mbuf.c
index cfd8062f1e..9ebf65a04a 100644
--- a/lib/mbuf/rte_mbuf.c
+++ b/lib/mbuf/rte_mbuf.c
@@ -123,10 +123,10 @@ rte_pktmbuf_free_pinned_extmem(void *addr, void *opaque)
 
 	rte_mbuf_ext_refcnt_set(m->shinfo, 1);
 	m->ol_flags = RTE_MBUF_F_EXTERNAL;
-	if (m->next != NULL)
-		m->next = NULL;
-	if (m->nb_segs != 1)
+	if (unlikely(m->nb_segs != 1)) {
 		m->nb_segs = 1;
+		m->next = NULL;
+	}
 	rte_mbuf_raw_free(m);
 }
 
@@ -427,7 +427,7 @@ int rte_mbuf_check(const struct rte_mbuf *m, int is_header,
 		}
 		nb_segs -= 1;
 		pkt_len -= m->data_len;
-	} while ((m = m->next) != NULL);
+	} while ((m = ((m->nb_segs == 1) ? NULL : m->next)) != NULL);
 
 	if (nb_segs) {
 		*reason = "bad nb_segs";
@@ -494,13 +494,16 @@ void rte_pktmbuf_free_bulk(struct rte_mbuf **mbufs, unsigned int count)
 
 		__rte_mbuf_sanity_check(m, 1);
 
-		do {
+		while (unlikely(m->nb_segs != 1)) {
 			m_next = m->next;
 			__rte_pktmbuf_free_seg_via_array(m,
 					pending, &nb_pending,
 					RTE_PKTMBUF_FREE_PENDING_SZ);
 			m = m_next;
-		} while (m != NULL);
+		}
+		__rte_pktmbuf_free_seg_via_array(m,
+				pending, &nb_pending,
+				RTE_PKTMBUF_FREE_PENDING_SZ);
 	}
 
 	if (nb_pending > 0)
@@ -511,7 +514,7 @@ void rte_pktmbuf_free_bulk(struct rte_mbuf **mbufs, unsigned int count)
 struct rte_mbuf *
 rte_pktmbuf_clone(struct rte_mbuf *md, struct rte_mempool *mp)
 {
-	struct rte_mbuf *mc, *mi, **prev;
+	struct rte_mbuf *mc, *mi, *prev;
 	uint32_t pktlen;
 	uint16_t nseg;
 
@@ -520,19 +523,21 @@ rte_pktmbuf_clone(struct rte_mbuf *md, struct rte_mempool *mp)
 		return NULL;
 
 	mi = mc;
-	prev = &mi->next;
+	prev = mi;
 	pktlen = md->pkt_len;
 	nseg = 0;
 
 	do {
 		nseg++;
 		rte_pktmbuf_attach(mi, md);
-		*prev = mi;
-		prev = &mi->next;
-	} while ((md = md->next) != NULL &&
+		prev->nb_segs = 2;
+		prev->next = mi;
+		prev = mi;
+	} while ((md = ((md->nb_segs == 1) ? NULL : md->next)) != NULL &&
 	    (mi = rte_pktmbuf_alloc(mp)) != NULL);
 
-	*prev = NULL;
+	prev->nb_segs = 1;
+	prev->next = NULL;
 	mc->nb_segs = nseg;
 	mc->pkt_len = pktlen;
 
@@ -565,9 +570,9 @@ __rte_pktmbuf_linearize(struct rte_mbuf *mbuf)
 	mbuf->data_len = (uint16_t)(mbuf->pkt_len);
 
 	/* Append data from next segments to the first one */
-	m = mbuf->next;
+	m = (mbuf->nb_segs == 1) ? NULL : mbuf->next;
 	while (m != NULL) {
-		m_next = m->next;
+		m_next = (m->nb_segs == 1) ? NULL : m->next;
 
 		seg_len = rte_pktmbuf_data_len(m);
 		rte_memcpy(buffer, rte_pktmbuf_mtod(m, char *), seg_len);
@@ -589,7 +594,7 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 		 uint32_t off, uint32_t len)
 {
 	const struct rte_mbuf *seg = m;
-	struct rte_mbuf *mc, *m_last, **prev;
+	struct rte_mbuf *mc, *m_last, *prev;
 
 	/* garbage in check */
 	__rte_mbuf_sanity_check(m, 1);
@@ -611,7 +616,7 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 	/* copied mbuf is not indirect or external */
 	mc->ol_flags = m->ol_flags & ~(RTE_MBUF_F_INDIRECT|RTE_MBUF_F_EXTERNAL);
 
-	prev = &mc->next;
+	prev = mc;
 	m_last = mc;
 	while (len > 0) {
 		uint32_t copy_len;
@@ -629,9 +634,10 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 				rte_pktmbuf_free(mc);
 				return NULL;
 			}
+			prev->nb_segs = 2;
+			prev->next = m_last;
 			++mc->nb_segs;
-			*prev = m_last;
-			prev = &m_last->next;
+			prev = m_last;
 		}
 
 		/*
@@ -697,7 +703,7 @@ rte_pktmbuf_dump(FILE *f, const struct rte_mbuf *m, unsigned dump_len)
 		if (len != 0)
 			rte_hexdump(f, NULL, rte_pktmbuf_mtod(m, void *), len);
 		dump_len -= len;
-		m = m->next;
+		m = (m->nb_segs == 1) ? NULL : m->next;
 		nb_segs --;
 	}
 }
diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
index 3a82eb136d..5167052154 100644
--- a/lib/mbuf/rte_mbuf.h
+++ b/lib/mbuf/rte_mbuf.h
@@ -1345,7 +1345,7 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 
 	if (likely(rte_mbuf_refcnt_read(m) == 1)) {
 
-		if (!RTE_MBUF_DIRECT(m)) {
+		if (unlikely(!RTE_MBUF_DIRECT(m))) {
 			rte_pktmbuf_detach(m);
 			if (RTE_MBUF_HAS_EXTBUF(m) &&
 			    RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
@@ -1353,16 +1353,16 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 				return NULL;
 		}
 
-		if (m->next != NULL)
-			m->next = NULL;
-		if (m->nb_segs != 1)
+		if (unlikely(m->nb_segs != 1)) {
 			m->nb_segs = 1;
+			m->next = NULL;
+		}
 
 		return m;
 
 	} else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
 
-		if (!RTE_MBUF_DIRECT(m)) {
+		if (unlikely(!RTE_MBUF_DIRECT(m))) {
 			rte_pktmbuf_detach(m);
 			if (RTE_MBUF_HAS_EXTBUF(m) &&
 			    RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
@@ -1370,10 +1370,10 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 				return NULL;
 		}
 
-		if (m->next != NULL)
-			m->next = NULL;
-		if (m->nb_segs != 1)
+		if (unlikely(m->nb_segs != 1)) {
 			m->nb_segs = 1;
+			m->next = NULL;
+		}
 		rte_mbuf_refcnt_set(m, 1);
 
 		return m;
@@ -1411,14 +1411,17 @@ static inline void rte_pktmbuf_free(struct rte_mbuf *m)
 {
 	struct rte_mbuf *m_next;
 
-	if (m != NULL)
-		__rte_mbuf_sanity_check(m, 1);
+	if (m == NULL)
+		return;
 
-	while (m != NULL) {
+	__rte_mbuf_sanity_check(m, 1);
+
+	while (unlikely(m->nb_segs != 1)) {
 		m_next = m->next;
 		rte_pktmbuf_free_seg(m);
 		m = m_next;
 	}
+	rte_pktmbuf_free_seg(m);
 }
 
 /**
@@ -1493,11 +1496,16 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
  */
 static inline void rte_pktmbuf_refcnt_update(struct rte_mbuf *m, int16_t v)
 {
+	struct rte_mbuf *m_next;
+
 	__rte_mbuf_sanity_check(m, 1);
 
-	do {
+	while (unlikely(m->nb_segs != 1)) {
+		m_next = m->next;
 		rte_mbuf_refcnt_update(m, v);
-	} while ((m = m->next) != NULL);
+		m = m_next;
+	}
+	rte_mbuf_refcnt_update(m, v);
 }
 
 /**
@@ -1540,7 +1548,7 @@ static inline uint16_t rte_pktmbuf_tailroom(const struct rte_mbuf *m)
 static inline struct rte_mbuf *rte_pktmbuf_lastseg(struct rte_mbuf *m)
 {
 	__rte_mbuf_sanity_check(m, 1);
-	while (m->next != NULL)
+	while (unlikely(m->nb_segs != 1))
 		m = m->next;
 	return m;
 }
@@ -1758,20 +1766,22 @@ static inline const void *rte_pktmbuf_read(const struct rte_mbuf *m,
 static inline int rte_pktmbuf_chain(struct rte_mbuf *head, struct rte_mbuf *tail)
 {
 	struct rte_mbuf *cur_tail;
+	const unsigned int nb_segs = head->nb_segs + tail->nb_segs;
 
 	/* Check for number-of-segments-overflow */
-	if (head->nb_segs + tail->nb_segs > RTE_MBUF_MAX_NB_SEGS)
+	if (nb_segs > RTE_MBUF_MAX_NB_SEGS)
 		return -EOVERFLOW;
 
 	/* Chain 'tail' onto the old tail */
 	cur_tail = rte_pktmbuf_lastseg(head);
+	cur_tail->nb_segs = 2;
 	cur_tail->next = tail;
 
 	/* accumulate number of segments and total length.
 	 * NB: elaborating the addition like this instead of using
 	 *     -= allows us to ensure the result type is uint16_t
 	 *     avoiding compiler warnings on gcc 8.1 at least */
-	head->nb_segs = (uint16_t)(head->nb_segs + tail->nb_segs);
+	head->nb_segs = (uint16_t)nb_segs;
 	head->pkt_len += tail->pkt_len;
 
 	/* pkt_len is only set in the head */
diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
index a30e1e0eaf..c0c3b45024 100644
--- a/lib/mbuf/rte_mbuf_core.h
+++ b/lib/mbuf/rte_mbuf_core.h
@@ -594,25 +594,6 @@ struct rte_mbuf {
 
 	uint16_t buf_len;         /**< Length of segment buffer. */
 
-	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
-
-	/* second cache line - fields only used in slow path or on TX */
-	RTE_MARKER cacheline1 __rte_cache_min_aligned;
-
-#if RTE_IOVA_AS_PA
-	/**
-	 * Next segment of scattered packet. Must be NULL in the last
-	 * segment or in case of non-segmented packet.
-	 */
-	struct rte_mbuf *next;
-#else
-	/**
-	 * Reserved for dynamic fields
-	 * when the next pointer is in first cache line (i.e. RTE_IOVA_AS_PA is 0).
-	 */
-	uint64_t dynfield2;
-#endif
-
 	/* fields to support TX offloads */
 	RTE_STD_C11
 	union {
@@ -651,6 +632,25 @@ struct rte_mbuf {
 		};
 	};
 
+	/* second cache line - fields only used in slow path or on TX */
+	RTE_MARKER cacheline1 __rte_cache_min_aligned;
+
+#if RTE_IOVA_AS_PA
+	/**
+	 * Next segment of scattered packet. Must be NULL in the last
+	 * segment or in case of non-segmented packet.
+	 */
+	struct rte_mbuf *next;
+#else
+	/**
+	 * Reserved for dynamic fields
+	 * when the next pointer is in first cache line (i.e. RTE_IOVA_AS_PA is 0).
+	 */
+	uint64_t dynfield2;
+#endif
+
+	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
+
 	/** Shared data for external buffer attached to mbuf. See
 	 * rte_pktmbuf_attach_extbuf().
 	 */
diff --git a/meson_options.txt b/meson_options.txt
index 08528492f7..345a3627e3 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -40,7 +40,7 @@ option('max_lcores', type: 'string', value: 'default', description:
        'Set maximum number of cores/threads supported by EAL; "default" is different per-arch, "detect" detects the number of cores on the build machine.')
 option('max_numa_nodes', type: 'string', value: 'default', description:
        'Set the highest NUMA node supported by EAL; "default" is different per-arch, "detect" detects the highest NUMA node on the build machine.')
-option('enable_iova_as_pa', type: 'boolean', value: true, description:
+option('enable_iova_as_pa', type: 'boolean', value: false, description:
        'Support for IOVA as physical address. Disabling removes the buf_iova field of mbuf.')
 option('mbuf_refcnt_atomic', type: 'boolean', value: true, description:
        'Atomically access the mbuf refcnt.')
-- 
2.17.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v10] mbuf perf test, please ignore
  2022-12-03 17:13 mbuf performance optimization Morten Brørup
                   ` (8 preceding siblings ...)
  2022-12-06 12:35 ` [PATCH v9] " Morten Brørup
@ 2022-12-06 13:53 ` Morten Brørup
  2022-12-07  7:35 ` [PATCH v11] " Morten Brørup
  10 siblings, 0 replies; 12+ messages in thread
From: Morten Brørup @ 2022-12-06 13:53 UTC (permalink / raw)
  To: dev; +Cc: Morten Brørup

Playing around with the mbuf structure, trying to reduce the use of the
second cache line in some common scenarios.

v10:
* Restore enable_iova_as_pa meson option default.
  Set RTE_IOVA_AS_PA to 0 in x86 config instead.
v9:
* Set enable_iova_as_pa meson option to false.
v8:
* Fix typo in comparison.
v7:
* Add some likely/unlikely.
* Implement new interpretation of 'nb_segs' in chain split in test_mbuf.
v6:
* Update the rte_kni_mbuf to match the rte_mbuf modifications.
v5:
* Fix rte_pktmbuf_chain() for the case where the head is a single segment.
v4:
* Use tabs, not spaces.
* Fix copy-paste bug in linearize.
v3:
* Make 'next' depend on 'nb_segs' > 1.
* Implement new interpretation of 'nb_segs' in i40e PMD.
v2:
* Remove BUILD_BUG_ON in cnxk PMD.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/test_mbuf.c                     |  1 +
 config/x86/meson.build                   |  2 ++
 drivers/net/cnxk/cn10k_ethdev.c          |  2 ++
 drivers/net/cnxk/cn9k_ethdev.c           |  2 ++
 drivers/net/i40e/i40e_rxtx.c             |  9 +++--
 drivers/net/i40e/i40e_rxtx_vec_altivec.c |  4 +++
 drivers/net/i40e/i40e_rxtx_vec_common.h  |  2 ++
 drivers/net/i40e/i40e_rxtx_vec_neon.c    |  4 +++
 lib/kni/rte_kni_common.h                 | 12 +++++--
 lib/mbuf/rte_mbuf.c                      | 44 ++++++++++++++----------
 lib/mbuf/rte_mbuf.h                      | 42 +++++++++++++---------
 lib/mbuf/rte_mbuf_core.h                 | 38 ++++++++++----------
 12 files changed, 103 insertions(+), 59 deletions(-)

diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
index 53fe898a38..dffc40044b 100644
--- a/app/test/test_mbuf.c
+++ b/app/test/test_mbuf.c
@@ -2744,6 +2744,7 @@ test_nb_segs_and_next_reset(void)
 
 	/* split m0 chain in two, between m1 and m2 */
 	m0->nb_segs = 2;
+	m1->nb_segs = 1;
 	m1->next = NULL;
 	m2->nb_segs = 1;
 
diff --git a/config/x86/meson.build b/config/x86/meson.build
index 54345c4da3..df0e3ea7ab 100644
--- a/config/x86/meson.build
+++ b/config/x86/meson.build
@@ -73,3 +73,5 @@ endif
 dpdk_conf.set('RTE_CACHE_LINE_SIZE', 64)
 dpdk_conf.set('RTE_MAX_LCORE', 128)
 dpdk_conf.set('RTE_MAX_NUMA_NODES', 32)
+
+dpdk_conf.set('RTE_IOVA_AS_PA', 0)
diff --git a/drivers/net/cnxk/cn10k_ethdev.c b/drivers/net/cnxk/cn10k_ethdev.c
index 4658713591..9f6086efe6 100644
--- a/drivers/net/cnxk/cn10k_ethdev.c
+++ b/drivers/net/cnxk/cn10k_ethdev.c
@@ -72,8 +72,10 @@ nix_tx_offload_flags(struct rte_eth_dev *eth_dev)
 			 offsetof(struct rte_mbuf, buf_addr) + 24);
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
 			 offsetof(struct rte_mbuf, ol_flags) + 12);
+/*
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, tx_offload) !=
 			 offsetof(struct rte_mbuf, pool) + 2 * sizeof(void *));
+*/
 
 	if (conf & RTE_ETH_TX_OFFLOAD_VLAN_INSERT ||
 	    conf & RTE_ETH_TX_OFFLOAD_QINQ_INSERT)
diff --git a/drivers/net/cnxk/cn9k_ethdev.c b/drivers/net/cnxk/cn9k_ethdev.c
index 3b702d9696..3e9161ca79 100644
--- a/drivers/net/cnxk/cn9k_ethdev.c
+++ b/drivers/net/cnxk/cn9k_ethdev.c
@@ -72,8 +72,10 @@ nix_tx_offload_flags(struct rte_eth_dev *eth_dev)
 			 offsetof(struct rte_mbuf, buf_addr) + 24);
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
 			 offsetof(struct rte_mbuf, ol_flags) + 12);
+/*
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, tx_offload) !=
 			 offsetof(struct rte_mbuf, pool) + 2 * sizeof(void *));
+*/
 
 	if (conf & RTE_ETH_TX_OFFLOAD_VLAN_INSERT ||
 	    conf & RTE_ETH_TX_OFFLOAD_QINQ_INSERT)
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 788ffb51c2..a08afe0a13 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -920,8 +920,9 @@ i40e_recv_scattered_pkts(void *rx_queue,
 			first_seg->pkt_len =
 				(uint16_t)(first_seg->pkt_len +
 						rx_packet_len);
-			first_seg->nb_segs++;
+			last_seg->nb_segs = 2;
 			last_seg->next = rxm;
+			first_seg->nb_segs++;
 		}
 
 		/**
@@ -944,6 +945,7 @@ i40e_recv_scattered_pkts(void *rx_queue,
 		 *  the length of that CRC part from the data length of the
 		 *  previous mbuf.
 		 */
+		rxm->nb_segs = 1;
 		rxm->next = NULL;
 		if (unlikely(rxq->crc_len > 0)) {
 			first_seg->pkt_len -= RTE_ETHER_CRC_LEN;
@@ -953,6 +955,7 @@ i40e_recv_scattered_pkts(void *rx_queue,
 				last_seg->data_len =
 					(uint16_t)(last_seg->data_len -
 					(RTE_ETHER_CRC_LEN - rx_packet_len));
+				last_seg->nb_segs = 1;
 				last_seg->next = NULL;
 			} else
 				rxm->data_len = (uint16_t)(rx_packet_len -
@@ -1065,7 +1068,7 @@ i40e_calc_pkt_desc(struct rte_mbuf *tx_pkt)
 
 	while (txd != NULL) {
 		count += DIV_ROUND_UP(txd->data_len, I40E_MAX_DATA_PER_TXD);
-		txd = txd->next;
+		txd = (txd->nb_segs == 1) ? NULL : txd->next;
 	}
 
 	return count;
@@ -1282,7 +1285,7 @@ i40e_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 			txe->last_id = tx_last;
 			tx_id = txe->next_id;
 			txe = txn;
-			m_seg = m_seg->next;
+			m_seg = (m_seg->nb_segs == 1) ? NULL : m_seg->next;
 		} while (m_seg != NULL);
 
 		/* The last packet data descriptor needs End Of Packet (EOP) */
diff --git a/drivers/net/i40e/i40e_rxtx_vec_altivec.c b/drivers/net/i40e/i40e_rxtx_vec_altivec.c
index 2dfa04599c..ad91b5cb60 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_altivec.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_altivec.c
@@ -410,6 +410,10 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 			split_packet += RTE_I40E_DESCS_PER_LOOP;
 
 			/* zero-out next pointers */
+			rx_pkts[pos]->nb_segs = 1;
+			rx_pkts[pos + 1]->nb_segs = 1;
+			rx_pkts[pos + 2]->nb_segs = 1;
+			rx_pkts[pos + 3]->nb_segs = 1;
 			rx_pkts[pos]->next = NULL;
 			rx_pkts[pos + 1]->next = NULL;
 			rx_pkts[pos + 2]->next = NULL;
diff --git a/drivers/net/i40e/i40e_rxtx_vec_common.h b/drivers/net/i40e/i40e_rxtx_vec_common.h
index fe1a6ec75e..d5799f5242 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_common.h
+++ b/drivers/net/i40e/i40e_rxtx_vec_common.h
@@ -27,6 +27,7 @@ reassemble_packets(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_bufs,
 	for (buf_idx = 0, pkt_idx = 0; buf_idx < nb_bufs; buf_idx++) {
 		if (end != NULL) {
 			/* processing a split packet */
+			end->nb_segs = 2;
 			end->next = rx_bufs[buf_idx];
 			rx_bufs[buf_idx]->data_len += rxq->crc_len;
 
@@ -52,6 +53,7 @@ reassemble_packets(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_bufs,
 						secondlast = secondlast->next;
 					secondlast->data_len -= (rxq->crc_len -
 							end->data_len);
+					secondlast->nb_segs = 1;
 					secondlast->next = NULL;
 					rte_pktmbuf_free_seg(end);
 				}
diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c b/drivers/net/i40e/i40e_rxtx_vec_neon.c
index 12e6f1cbcb..3199b0f8cf 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
@@ -541,6 +541,10 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *__rte_restrict rxq,
 			split_packet += RTE_I40E_DESCS_PER_LOOP;
 
 			/* zero-out next pointers */
+			rx_pkts[pos]->nb_segs = 1;
+			rx_pkts[pos + 1]->nb_segs = 1;
+			rx_pkts[pos + 2]->nb_segs = 1;
+			rx_pkts[pos + 3]->nb_segs = 1;
 			rx_pkts[pos]->next = NULL;
 			rx_pkts[pos + 1]->next = NULL;
 			rx_pkts[pos + 2]->next = NULL;
diff --git a/lib/kni/rte_kni_common.h b/lib/kni/rte_kni_common.h
index 8d3ee0fa4f..01819ed6c6 100644
--- a/lib/kni/rte_kni_common.h
+++ b/lib/kni/rte_kni_common.h
@@ -80,7 +80,11 @@ struct rte_kni_fifo {
  */
 struct rte_kni_mbuf {
 	void *buf_addr __attribute__((__aligned__(RTE_CACHE_LINE_SIZE)));
+#if RTE_IOVA_AS_PA
 	uint64_t buf_iova;
+#else
+	void *next;             /**< Physical address of next mbuf in kernel. */
+#endif
 	uint16_t data_off;      /**< Start address of data in segment buffer. */
 	char pad1[2];
 	uint16_t nb_segs;       /**< Number of segments. */
@@ -89,12 +93,16 @@ struct rte_kni_mbuf {
 	char pad2[4];
 	uint32_t pkt_len;       /**< Total pkt len: sum of all segment data_len. */
 	uint16_t data_len;      /**< Amount of data in segment buffer. */
-	char pad3[14];
-	void *pool;
+	char pad3[22];
 
 	/* fields on second cache line */
 	__attribute__((__aligned__(RTE_CACHE_LINE_MIN_SIZE)))
+#if RTE_IOVA_AS_PA
 	void *next;             /**< Physical address of next mbuf in kernel. */
+#else
+	char pad5[8];
+#endif
+	void *pool;
 };
 
 /*
diff --git a/lib/mbuf/rte_mbuf.c b/lib/mbuf/rte_mbuf.c
index cfd8062f1e..9ebf65a04a 100644
--- a/lib/mbuf/rte_mbuf.c
+++ b/lib/mbuf/rte_mbuf.c
@@ -123,10 +123,10 @@ rte_pktmbuf_free_pinned_extmem(void *addr, void *opaque)
 
 	rte_mbuf_ext_refcnt_set(m->shinfo, 1);
 	m->ol_flags = RTE_MBUF_F_EXTERNAL;
-	if (m->next != NULL)
-		m->next = NULL;
-	if (m->nb_segs != 1)
+	if (unlikely(m->nb_segs != 1)) {
 		m->nb_segs = 1;
+		m->next = NULL;
+	}
 	rte_mbuf_raw_free(m);
 }
 
@@ -427,7 +427,7 @@ int rte_mbuf_check(const struct rte_mbuf *m, int is_header,
 		}
 		nb_segs -= 1;
 		pkt_len -= m->data_len;
-	} while ((m = m->next) != NULL);
+	} while ((m = ((m->nb_segs == 1) ? NULL : m->next)) != NULL);
 
 	if (nb_segs) {
 		*reason = "bad nb_segs";
@@ -494,13 +494,16 @@ void rte_pktmbuf_free_bulk(struct rte_mbuf **mbufs, unsigned int count)
 
 		__rte_mbuf_sanity_check(m, 1);
 
-		do {
+		while (unlikely(m->nb_segs != 1)) {
 			m_next = m->next;
 			__rte_pktmbuf_free_seg_via_array(m,
 					pending, &nb_pending,
 					RTE_PKTMBUF_FREE_PENDING_SZ);
 			m = m_next;
-		} while (m != NULL);
+		}
+		__rte_pktmbuf_free_seg_via_array(m,
+				pending, &nb_pending,
+				RTE_PKTMBUF_FREE_PENDING_SZ);
 	}
 
 	if (nb_pending > 0)
@@ -511,7 +514,7 @@ void rte_pktmbuf_free_bulk(struct rte_mbuf **mbufs, unsigned int count)
 struct rte_mbuf *
 rte_pktmbuf_clone(struct rte_mbuf *md, struct rte_mempool *mp)
 {
-	struct rte_mbuf *mc, *mi, **prev;
+	struct rte_mbuf *mc, *mi, *prev;
 	uint32_t pktlen;
 	uint16_t nseg;
 
@@ -520,19 +523,21 @@ rte_pktmbuf_clone(struct rte_mbuf *md, struct rte_mempool *mp)
 		return NULL;
 
 	mi = mc;
-	prev = &mi->next;
+	prev = mi;
 	pktlen = md->pkt_len;
 	nseg = 0;
 
 	do {
 		nseg++;
 		rte_pktmbuf_attach(mi, md);
-		*prev = mi;
-		prev = &mi->next;
-	} while ((md = md->next) != NULL &&
+		prev->nb_segs = 2;
+		prev->next = mi;
+		prev = mi;
+	} while ((md = ((md->nb_segs == 1) ? NULL : md->next)) != NULL &&
 	    (mi = rte_pktmbuf_alloc(mp)) != NULL);
 
-	*prev = NULL;
+	prev->nb_segs = 1;
+	prev->next = NULL;
 	mc->nb_segs = nseg;
 	mc->pkt_len = pktlen;
 
@@ -565,9 +570,9 @@ __rte_pktmbuf_linearize(struct rte_mbuf *mbuf)
 	mbuf->data_len = (uint16_t)(mbuf->pkt_len);
 
 	/* Append data from next segments to the first one */
-	m = mbuf->next;
+	m = (mbuf->nb_segs == 1) ? NULL : mbuf->next;
 	while (m != NULL) {
-		m_next = m->next;
+		m_next = (m->nb_segs == 1) ? NULL : m->next;
 
 		seg_len = rte_pktmbuf_data_len(m);
 		rte_memcpy(buffer, rte_pktmbuf_mtod(m, char *), seg_len);
@@ -589,7 +594,7 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 		 uint32_t off, uint32_t len)
 {
 	const struct rte_mbuf *seg = m;
-	struct rte_mbuf *mc, *m_last, **prev;
+	struct rte_mbuf *mc, *m_last, *prev;
 
 	/* garbage in check */
 	__rte_mbuf_sanity_check(m, 1);
@@ -611,7 +616,7 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 	/* copied mbuf is not indirect or external */
 	mc->ol_flags = m->ol_flags & ~(RTE_MBUF_F_INDIRECT|RTE_MBUF_F_EXTERNAL);
 
-	prev = &mc->next;
+	prev = mc;
 	m_last = mc;
 	while (len > 0) {
 		uint32_t copy_len;
@@ -629,9 +634,10 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 				rte_pktmbuf_free(mc);
 				return NULL;
 			}
+			prev->nb_segs = 2;
+			prev->next = m_last;
 			++mc->nb_segs;
-			*prev = m_last;
-			prev = &m_last->next;
+			prev = m_last;
 		}
 
 		/*
@@ -697,7 +703,7 @@ rte_pktmbuf_dump(FILE *f, const struct rte_mbuf *m, unsigned dump_len)
 		if (len != 0)
 			rte_hexdump(f, NULL, rte_pktmbuf_mtod(m, void *), len);
 		dump_len -= len;
-		m = m->next;
+		m = (m->nb_segs == 1) ? NULL : m->next;
 		nb_segs --;
 	}
 }
diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
index 3a82eb136d..5167052154 100644
--- a/lib/mbuf/rte_mbuf.h
+++ b/lib/mbuf/rte_mbuf.h
@@ -1345,7 +1345,7 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 
 	if (likely(rte_mbuf_refcnt_read(m) == 1)) {
 
-		if (!RTE_MBUF_DIRECT(m)) {
+		if (unlikely(!RTE_MBUF_DIRECT(m))) {
 			rte_pktmbuf_detach(m);
 			if (RTE_MBUF_HAS_EXTBUF(m) &&
 			    RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
@@ -1353,16 +1353,16 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 				return NULL;
 		}
 
-		if (m->next != NULL)
-			m->next = NULL;
-		if (m->nb_segs != 1)
+		if (unlikely(m->nb_segs != 1)) {
 			m->nb_segs = 1;
+			m->next = NULL;
+		}
 
 		return m;
 
 	} else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
 
-		if (!RTE_MBUF_DIRECT(m)) {
+		if (unlikely(!RTE_MBUF_DIRECT(m))) {
 			rte_pktmbuf_detach(m);
 			if (RTE_MBUF_HAS_EXTBUF(m) &&
 			    RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
@@ -1370,10 +1370,10 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 				return NULL;
 		}
 
-		if (m->next != NULL)
-			m->next = NULL;
-		if (m->nb_segs != 1)
+		if (unlikely(m->nb_segs != 1)) {
 			m->nb_segs = 1;
+			m->next = NULL;
+		}
 		rte_mbuf_refcnt_set(m, 1);
 
 		return m;
@@ -1411,14 +1411,17 @@ static inline void rte_pktmbuf_free(struct rte_mbuf *m)
 {
 	struct rte_mbuf *m_next;
 
-	if (m != NULL)
-		__rte_mbuf_sanity_check(m, 1);
+	if (m == NULL)
+		return;
 
-	while (m != NULL) {
+	__rte_mbuf_sanity_check(m, 1);
+
+	while (unlikely(m->nb_segs != 1)) {
 		m_next = m->next;
 		rte_pktmbuf_free_seg(m);
 		m = m_next;
 	}
+	rte_pktmbuf_free_seg(m);
 }
 
 /**
@@ -1493,11 +1496,16 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
  */
 static inline void rte_pktmbuf_refcnt_update(struct rte_mbuf *m, int16_t v)
 {
+	struct rte_mbuf *m_next;
+
 	__rte_mbuf_sanity_check(m, 1);
 
-	do {
+	while (unlikely(m->nb_segs != 1)) {
+		m_next = m->next;
 		rte_mbuf_refcnt_update(m, v);
-	} while ((m = m->next) != NULL);
+		m = m_next;
+	}
+	rte_mbuf_refcnt_update(m, v);
 }
 
 /**
@@ -1540,7 +1548,7 @@ static inline uint16_t rte_pktmbuf_tailroom(const struct rte_mbuf *m)
 static inline struct rte_mbuf *rte_pktmbuf_lastseg(struct rte_mbuf *m)
 {
 	__rte_mbuf_sanity_check(m, 1);
-	while (m->next != NULL)
+	while (unlikely(m->nb_segs != 1))
 		m = m->next;
 	return m;
 }
@@ -1758,20 +1766,22 @@ static inline const void *rte_pktmbuf_read(const struct rte_mbuf *m,
 static inline int rte_pktmbuf_chain(struct rte_mbuf *head, struct rte_mbuf *tail)
 {
 	struct rte_mbuf *cur_tail;
+	const unsigned int nb_segs = head->nb_segs + tail->nb_segs;
 
 	/* Check for number-of-segments-overflow */
-	if (head->nb_segs + tail->nb_segs > RTE_MBUF_MAX_NB_SEGS)
+	if (nb_segs > RTE_MBUF_MAX_NB_SEGS)
 		return -EOVERFLOW;
 
 	/* Chain 'tail' onto the old tail */
 	cur_tail = rte_pktmbuf_lastseg(head);
+	cur_tail->nb_segs = 2;
 	cur_tail->next = tail;
 
 	/* accumulate number of segments and total length.
 	 * NB: elaborating the addition like this instead of using
 	 *     -= allows us to ensure the result type is uint16_t
 	 *     avoiding compiler warnings on gcc 8.1 at least */
-	head->nb_segs = (uint16_t)(head->nb_segs + tail->nb_segs);
+	head->nb_segs = (uint16_t)nb_segs;
 	head->pkt_len += tail->pkt_len;
 
 	/* pkt_len is only set in the head */
diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
index a30e1e0eaf..c0c3b45024 100644
--- a/lib/mbuf/rte_mbuf_core.h
+++ b/lib/mbuf/rte_mbuf_core.h
@@ -594,25 +594,6 @@ struct rte_mbuf {
 
 	uint16_t buf_len;         /**< Length of segment buffer. */
 
-	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
-
-	/* second cache line - fields only used in slow path or on TX */
-	RTE_MARKER cacheline1 __rte_cache_min_aligned;
-
-#if RTE_IOVA_AS_PA
-	/**
-	 * Next segment of scattered packet. Must be NULL in the last
-	 * segment or in case of non-segmented packet.
-	 */
-	struct rte_mbuf *next;
-#else
-	/**
-	 * Reserved for dynamic fields
-	 * when the next pointer is in first cache line (i.e. RTE_IOVA_AS_PA is 0).
-	 */
-	uint64_t dynfield2;
-#endif
-
 	/* fields to support TX offloads */
 	RTE_STD_C11
 	union {
@@ -651,6 +632,25 @@ struct rte_mbuf {
 		};
 	};
 
+	/* second cache line - fields only used in slow path or on TX */
+	RTE_MARKER cacheline1 __rte_cache_min_aligned;
+
+#if RTE_IOVA_AS_PA
+	/**
+	 * Next segment of scattered packet. Must be NULL in the last
+	 * segment or in case of non-segmented packet.
+	 */
+	struct rte_mbuf *next;
+#else
+	/**
+	 * Reserved for dynamic fields
+	 * when the next pointer is in first cache line (i.e. RTE_IOVA_AS_PA is 0).
+	 */
+	uint64_t dynfield2;
+#endif
+
+	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
+
 	/** Shared data for external buffer attached to mbuf. See
 	 * rte_pktmbuf_attach_extbuf().
 	 */
-- 
2.17.1



* [PATCH v11] mbuf perf test, please ignore
  2022-12-03 17:13 mbuf performance optimization Morten Brørup
                   ` (9 preceding siblings ...)
  2022-12-06 13:53 ` [PATCH v10] " Morten Brørup
@ 2022-12-07  7:35 ` Morten Brørup
  10 siblings, 0 replies; 12+ messages in thread
From: Morten Brørup @ 2022-12-07  7:35 UTC (permalink / raw)
  To: dev; +Cc: Morten Brørup

Playing around with the mbuf structure, trying to reduce the use of the
second cache line in some common scenarios.

v11:
* Fix idpf to not fail when RTE_IOVA_AS_PA is 0.
v10:
* Restore enable_iova_as_pa meson option default.
  Set RTE_IOVA_AS_PA to 0 in x86 config instead.
v9:
* Set enable_iova_as_pa meson option to false.
v8:
* Fix typo in comparison.
v7:
* Add some likely/unlikely.
* Implement new interpretation of 'nb_segs' in chain split in test_mbuf.
v6:
* Update the rte_kni_mbuf to match the rte_mbuf modifications.
v5:
* Fix rte_pktmbuf_chain() for the case where the head is a single segment.
v4:
* Use tabs, not spaces.
* Fix copy-paste bug in linearize.
v3:
* Make 'next' depend on 'nb_segs' > 1.
* Implement new interpretation of 'nb_segs' in i40e PMD.
v2:
* Remove BUILD_BUG_ON in cnxk PMD.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/test_mbuf.c                     |  1 +
 config/x86/meson.build                   |  2 ++
 drivers/net/cnxk/cn10k_ethdev.c          |  2 ++
 drivers/net/cnxk/cn9k_ethdev.c           |  2 ++
 drivers/net/i40e/i40e_rxtx.c             |  9 +++--
 drivers/net/i40e/i40e_rxtx_vec_altivec.c |  4 +++
 drivers/net/i40e/i40e_rxtx_vec_common.h  |  2 ++
 drivers/net/i40e/i40e_rxtx_vec_neon.c    |  4 +++
 drivers/net/idpf/meson.build             |  6 ++++
 lib/kni/rte_kni_common.h                 | 12 +++++--
 lib/mbuf/rte_mbuf.c                      | 44 ++++++++++++++----------
 lib/mbuf/rte_mbuf.h                      | 42 +++++++++++++---------
 lib/mbuf/rte_mbuf_core.h                 | 38 ++++++++++----------
 13 files changed, 109 insertions(+), 59 deletions(-)

diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
index 53fe898a38..dffc40044b 100644
--- a/app/test/test_mbuf.c
+++ b/app/test/test_mbuf.c
@@ -2744,6 +2744,7 @@ test_nb_segs_and_next_reset(void)
 
 	/* split m0 chain in two, between m1 and m2 */
 	m0->nb_segs = 2;
+	m1->nb_segs = 1;
 	m1->next = NULL;
 	m2->nb_segs = 1;
 
diff --git a/config/x86/meson.build b/config/x86/meson.build
index 54345c4da3..df0e3ea7ab 100644
--- a/config/x86/meson.build
+++ b/config/x86/meson.build
@@ -73,3 +73,5 @@ endif
 dpdk_conf.set('RTE_CACHE_LINE_SIZE', 64)
 dpdk_conf.set('RTE_MAX_LCORE', 128)
 dpdk_conf.set('RTE_MAX_NUMA_NODES', 32)
+
+dpdk_conf.set('RTE_IOVA_AS_PA', 0)
diff --git a/drivers/net/cnxk/cn10k_ethdev.c b/drivers/net/cnxk/cn10k_ethdev.c
index 4658713591..9f6086efe6 100644
--- a/drivers/net/cnxk/cn10k_ethdev.c
+++ b/drivers/net/cnxk/cn10k_ethdev.c
@@ -72,8 +72,10 @@ nix_tx_offload_flags(struct rte_eth_dev *eth_dev)
 			 offsetof(struct rte_mbuf, buf_addr) + 24);
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
 			 offsetof(struct rte_mbuf, ol_flags) + 12);
+/*
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, tx_offload) !=
 			 offsetof(struct rte_mbuf, pool) + 2 * sizeof(void *));
+*/
 
 	if (conf & RTE_ETH_TX_OFFLOAD_VLAN_INSERT ||
 	    conf & RTE_ETH_TX_OFFLOAD_QINQ_INSERT)
diff --git a/drivers/net/cnxk/cn9k_ethdev.c b/drivers/net/cnxk/cn9k_ethdev.c
index 3b702d9696..3e9161ca79 100644
--- a/drivers/net/cnxk/cn9k_ethdev.c
+++ b/drivers/net/cnxk/cn9k_ethdev.c
@@ -72,8 +72,10 @@ nix_tx_offload_flags(struct rte_eth_dev *eth_dev)
 			 offsetof(struct rte_mbuf, buf_addr) + 24);
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
 			 offsetof(struct rte_mbuf, ol_flags) + 12);
+/*
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, tx_offload) !=
 			 offsetof(struct rte_mbuf, pool) + 2 * sizeof(void *));
+*/
 
 	if (conf & RTE_ETH_TX_OFFLOAD_VLAN_INSERT ||
 	    conf & RTE_ETH_TX_OFFLOAD_QINQ_INSERT)
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 788ffb51c2..a08afe0a13 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -920,8 +920,9 @@ i40e_recv_scattered_pkts(void *rx_queue,
 			first_seg->pkt_len =
 				(uint16_t)(first_seg->pkt_len +
 						rx_packet_len);
-			first_seg->nb_segs++;
+			last_seg->nb_segs = 2;
 			last_seg->next = rxm;
+			first_seg->nb_segs++;
 		}
 
 		/**
@@ -944,6 +945,7 @@ i40e_recv_scattered_pkts(void *rx_queue,
 		 *  the length of that CRC part from the data length of the
 		 *  previous mbuf.
 		 */
+		rxm->nb_segs = 1;
 		rxm->next = NULL;
 		if (unlikely(rxq->crc_len > 0)) {
 			first_seg->pkt_len -= RTE_ETHER_CRC_LEN;
@@ -953,6 +955,7 @@ i40e_recv_scattered_pkts(void *rx_queue,
 				last_seg->data_len =
 					(uint16_t)(last_seg->data_len -
 					(RTE_ETHER_CRC_LEN - rx_packet_len));
+				last_seg->nb_segs = 1;
 				last_seg->next = NULL;
 			} else
 				rxm->data_len = (uint16_t)(rx_packet_len -
@@ -1065,7 +1068,7 @@ i40e_calc_pkt_desc(struct rte_mbuf *tx_pkt)
 
 	while (txd != NULL) {
 		count += DIV_ROUND_UP(txd->data_len, I40E_MAX_DATA_PER_TXD);
-		txd = txd->next;
+		txd = (txd->nb_segs == 1) ? NULL : txd->next;
 	}
 
 	return count;
@@ -1282,7 +1285,7 @@ i40e_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 			txe->last_id = tx_last;
 			tx_id = txe->next_id;
 			txe = txn;
-			m_seg = m_seg->next;
+			m_seg = (m_seg->nb_segs == 1) ? NULL : m_seg->next;
 		} while (m_seg != NULL);
 
 		/* The last packet data descriptor needs End Of Packet (EOP) */
diff --git a/drivers/net/i40e/i40e_rxtx_vec_altivec.c b/drivers/net/i40e/i40e_rxtx_vec_altivec.c
index 2dfa04599c..ad91b5cb60 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_altivec.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_altivec.c
@@ -410,6 +410,10 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 			split_packet += RTE_I40E_DESCS_PER_LOOP;
 
 			/* zero-out next pointers */
+			rx_pkts[pos]->nb_segs = 1;
+			rx_pkts[pos + 1]->nb_segs = 1;
+			rx_pkts[pos + 2]->nb_segs = 1;
+			rx_pkts[pos + 3]->nb_segs = 1;
 			rx_pkts[pos]->next = NULL;
 			rx_pkts[pos + 1]->next = NULL;
 			rx_pkts[pos + 2]->next = NULL;
diff --git a/drivers/net/i40e/i40e_rxtx_vec_common.h b/drivers/net/i40e/i40e_rxtx_vec_common.h
index fe1a6ec75e..d5799f5242 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_common.h
+++ b/drivers/net/i40e/i40e_rxtx_vec_common.h
@@ -27,6 +27,7 @@ reassemble_packets(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_bufs,
 	for (buf_idx = 0, pkt_idx = 0; buf_idx < nb_bufs; buf_idx++) {
 		if (end != NULL) {
 			/* processing a split packet */
+			end->nb_segs = 2;
 			end->next = rx_bufs[buf_idx];
 			rx_bufs[buf_idx]->data_len += rxq->crc_len;
 
@@ -52,6 +53,7 @@ reassemble_packets(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_bufs,
 						secondlast = secondlast->next;
 					secondlast->data_len -= (rxq->crc_len -
 							end->data_len);
+					secondlast->nb_segs = 1;
 					secondlast->next = NULL;
 					rte_pktmbuf_free_seg(end);
 				}
diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c b/drivers/net/i40e/i40e_rxtx_vec_neon.c
index 12e6f1cbcb..3199b0f8cf 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
@@ -541,6 +541,10 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *__rte_restrict rxq,
 			split_packet += RTE_I40E_DESCS_PER_LOOP;
 
 			/* zero-out next pointers */
+			rx_pkts[pos]->nb_segs = 1;
+			rx_pkts[pos + 1]->nb_segs = 1;
+			rx_pkts[pos + 2]->nb_segs = 1;
+			rx_pkts[pos + 3]->nb_segs = 1;
 			rx_pkts[pos]->next = NULL;
 			rx_pkts[pos + 1]->next = NULL;
 			rx_pkts[pos + 2]->next = NULL;
diff --git a/drivers/net/idpf/meson.build b/drivers/net/idpf/meson.build
index 998afd21fe..650dade0b9 100644
--- a/drivers/net/idpf/meson.build
+++ b/drivers/net/idpf/meson.build
@@ -7,6 +7,12 @@ if is_windows
     subdir_done()
 endif
 
+if dpdk_conf.get('RTE_IOVA_AS_PA') == 0
+    build = false
+    reason = 'driver does not support disabling IOVA as PA mode'
+    subdir_done()
+endif
+
 deps += ['common_idpf']
 
 sources = files(
diff --git a/lib/kni/rte_kni_common.h b/lib/kni/rte_kni_common.h
index 8d3ee0fa4f..01819ed6c6 100644
--- a/lib/kni/rte_kni_common.h
+++ b/lib/kni/rte_kni_common.h
@@ -80,7 +80,11 @@ struct rte_kni_fifo {
  */
 struct rte_kni_mbuf {
 	void *buf_addr __attribute__((__aligned__(RTE_CACHE_LINE_SIZE)));
+#if RTE_IOVA_AS_PA
 	uint64_t buf_iova;
+#else
+	void *next;             /**< Physical address of next mbuf in kernel. */
+#endif
 	uint16_t data_off;      /**< Start address of data in segment buffer. */
 	char pad1[2];
 	uint16_t nb_segs;       /**< Number of segments. */
@@ -89,12 +93,16 @@ struct rte_kni_mbuf {
 	char pad2[4];
 	uint32_t pkt_len;       /**< Total pkt len: sum of all segment data_len. */
 	uint16_t data_len;      /**< Amount of data in segment buffer. */
-	char pad3[14];
-	void *pool;
+	char pad3[22];
 
 	/* fields on second cache line */
 	__attribute__((__aligned__(RTE_CACHE_LINE_MIN_SIZE)))
+#if RTE_IOVA_AS_PA
 	void *next;             /**< Physical address of next mbuf in kernel. */
+#else
+	char pad5[8];
+#endif
+	void *pool;
 };
 
 /*
diff --git a/lib/mbuf/rte_mbuf.c b/lib/mbuf/rte_mbuf.c
index cfd8062f1e..9ebf65a04a 100644
--- a/lib/mbuf/rte_mbuf.c
+++ b/lib/mbuf/rte_mbuf.c
@@ -123,10 +123,10 @@ rte_pktmbuf_free_pinned_extmem(void *addr, void *opaque)
 
 	rte_mbuf_ext_refcnt_set(m->shinfo, 1);
 	m->ol_flags = RTE_MBUF_F_EXTERNAL;
-	if (m->next != NULL)
-		m->next = NULL;
-	if (m->nb_segs != 1)
+	if (unlikely(m->nb_segs != 1)) {
 		m->nb_segs = 1;
+		m->next = NULL;
+	}
 	rte_mbuf_raw_free(m);
 }
 
@@ -427,7 +427,7 @@ int rte_mbuf_check(const struct rte_mbuf *m, int is_header,
 		}
 		nb_segs -= 1;
 		pkt_len -= m->data_len;
-	} while ((m = m->next) != NULL);
+	} while ((m = ((m->nb_segs == 1) ? NULL : m->next)) != NULL);
 
 	if (nb_segs) {
 		*reason = "bad nb_segs";
@@ -494,13 +494,16 @@ void rte_pktmbuf_free_bulk(struct rte_mbuf **mbufs, unsigned int count)
 
 		__rte_mbuf_sanity_check(m, 1);
 
-		do {
+		while (unlikely(m->nb_segs != 1)) {
 			m_next = m->next;
 			__rte_pktmbuf_free_seg_via_array(m,
 					pending, &nb_pending,
 					RTE_PKTMBUF_FREE_PENDING_SZ);
 			m = m_next;
-		} while (m != NULL);
+		}
+		__rte_pktmbuf_free_seg_via_array(m,
+				pending, &nb_pending,
+				RTE_PKTMBUF_FREE_PENDING_SZ);
 	}
 
 	if (nb_pending > 0)
@@ -511,7 +514,7 @@ void rte_pktmbuf_free_bulk(struct rte_mbuf **mbufs, unsigned int count)
 struct rte_mbuf *
 rte_pktmbuf_clone(struct rte_mbuf *md, struct rte_mempool *mp)
 {
-	struct rte_mbuf *mc, *mi, **prev;
+	struct rte_mbuf *mc, *mi, *prev;
 	uint32_t pktlen;
 	uint16_t nseg;
 
@@ -520,19 +523,21 @@ rte_pktmbuf_clone(struct rte_mbuf *md, struct rte_mempool *mp)
 		return NULL;
 
 	mi = mc;
-	prev = &mi->next;
+	prev = mi;
 	pktlen = md->pkt_len;
 	nseg = 0;
 
 	do {
 		nseg++;
 		rte_pktmbuf_attach(mi, md);
-		*prev = mi;
-		prev = &mi->next;
-	} while ((md = md->next) != NULL &&
+		prev->nb_segs = 2;
+		prev->next = mi;
+		prev = mi;
+	} while ((md = ((md->nb_segs == 1) ? NULL : md->next)) != NULL &&
 	    (mi = rte_pktmbuf_alloc(mp)) != NULL);
 
-	*prev = NULL;
+	prev->nb_segs = 1;
+	prev->next = NULL;
 	mc->nb_segs = nseg;
 	mc->pkt_len = pktlen;
 
@@ -565,9 +570,9 @@ __rte_pktmbuf_linearize(struct rte_mbuf *mbuf)
 	mbuf->data_len = (uint16_t)(mbuf->pkt_len);
 
 	/* Append data from next segments to the first one */
-	m = mbuf->next;
+	m = (mbuf->nb_segs == 1) ? NULL : mbuf->next;
 	while (m != NULL) {
-		m_next = m->next;
+		m_next = (m->nb_segs == 1) ? NULL : m->next;
 
 		seg_len = rte_pktmbuf_data_len(m);
 		rte_memcpy(buffer, rte_pktmbuf_mtod(m, char *), seg_len);
@@ -589,7 +594,7 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 		 uint32_t off, uint32_t len)
 {
 	const struct rte_mbuf *seg = m;
-	struct rte_mbuf *mc, *m_last, **prev;
+	struct rte_mbuf *mc, *m_last, *prev;
 
 	/* garbage in check */
 	__rte_mbuf_sanity_check(m, 1);
@@ -611,7 +616,7 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 	/* copied mbuf is not indirect or external */
 	mc->ol_flags = m->ol_flags & ~(RTE_MBUF_F_INDIRECT|RTE_MBUF_F_EXTERNAL);
 
-	prev = &mc->next;
+	prev = mc;
 	m_last = mc;
 	while (len > 0) {
 		uint32_t copy_len;
@@ -629,9 +634,10 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 				rte_pktmbuf_free(mc);
 				return NULL;
 			}
+			prev->nb_segs = 2;
+			prev->next = m_last;
 			++mc->nb_segs;
-			*prev = m_last;
-			prev = &m_last->next;
+			prev = m_last;
 		}
 
 		/*
@@ -697,7 +703,7 @@ rte_pktmbuf_dump(FILE *f, const struct rte_mbuf *m, unsigned dump_len)
 		if (len != 0)
 			rte_hexdump(f, NULL, rte_pktmbuf_mtod(m, void *), len);
 		dump_len -= len;
-		m = m->next;
+		m = (m->nb_segs == 1) ? NULL : m->next;
 		nb_segs --;
 	}
 }
diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
index 3a82eb136d..5167052154 100644
--- a/lib/mbuf/rte_mbuf.h
+++ b/lib/mbuf/rte_mbuf.h
@@ -1345,7 +1345,7 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 
 	if (likely(rte_mbuf_refcnt_read(m) == 1)) {
 
-		if (!RTE_MBUF_DIRECT(m)) {
+		if (unlikely(!RTE_MBUF_DIRECT(m))) {
 			rte_pktmbuf_detach(m);
 			if (RTE_MBUF_HAS_EXTBUF(m) &&
 			    RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
@@ -1353,16 +1353,16 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 				return NULL;
 		}
 
-		if (m->next != NULL)
-			m->next = NULL;
-		if (m->nb_segs != 1)
+		if (unlikely(m->nb_segs != 1)) {
 			m->nb_segs = 1;
+			m->next = NULL;
+		}
 
 		return m;
 
 	} else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
 
-		if (!RTE_MBUF_DIRECT(m)) {
+		if (unlikely(!RTE_MBUF_DIRECT(m))) {
 			rte_pktmbuf_detach(m);
 			if (RTE_MBUF_HAS_EXTBUF(m) &&
 			    RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
@@ -1370,10 +1370,10 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 				return NULL;
 		}
 
-		if (m->next != NULL)
-			m->next = NULL;
-		if (m->nb_segs != 1)
+		if (unlikely(m->nb_segs != 1)) {
 			m->nb_segs = 1;
+			m->next = NULL;
+		}
 		rte_mbuf_refcnt_set(m, 1);
 
 		return m;
@@ -1411,14 +1411,17 @@ static inline void rte_pktmbuf_free(struct rte_mbuf *m)
 {
 	struct rte_mbuf *m_next;
 
-	if (m != NULL)
-		__rte_mbuf_sanity_check(m, 1);
+	if (m == NULL)
+		return;
 
-	while (m != NULL) {
+	__rte_mbuf_sanity_check(m, 1);
+
+	while (unlikely(m->nb_segs != 1)) {
 		m_next = m->next;
 		rte_pktmbuf_free_seg(m);
 		m = m_next;
 	}
+	rte_pktmbuf_free_seg(m);
 }
 
 /**
@@ -1493,11 +1496,16 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
  */
 static inline void rte_pktmbuf_refcnt_update(struct rte_mbuf *m, int16_t v)
 {
+	struct rte_mbuf *m_next;
+
 	__rte_mbuf_sanity_check(m, 1);
 
-	do {
+	while (unlikely(m->nb_segs != 1)) {
+		m_next = m->next;
 		rte_mbuf_refcnt_update(m, v);
-	} while ((m = m->next) != NULL);
+		m = m_next;
+	}
+	rte_mbuf_refcnt_update(m, v);
 }
 
 /**
@@ -1540,7 +1548,7 @@ static inline uint16_t rte_pktmbuf_tailroom(const struct rte_mbuf *m)
 static inline struct rte_mbuf *rte_pktmbuf_lastseg(struct rte_mbuf *m)
 {
 	__rte_mbuf_sanity_check(m, 1);
-	while (m->next != NULL)
+	while (unlikely(m->nb_segs != 1))
 		m = m->next;
 	return m;
 }
@@ -1758,20 +1766,22 @@ static inline const void *rte_pktmbuf_read(const struct rte_mbuf *m,
 static inline int rte_pktmbuf_chain(struct rte_mbuf *head, struct rte_mbuf *tail)
 {
 	struct rte_mbuf *cur_tail;
+	const unsigned int nb_segs = head->nb_segs + tail->nb_segs;
 
 	/* Check for number-of-segments-overflow */
-	if (head->nb_segs + tail->nb_segs > RTE_MBUF_MAX_NB_SEGS)
+	if (nb_segs > RTE_MBUF_MAX_NB_SEGS)
 		return -EOVERFLOW;
 
 	/* Chain 'tail' onto the old tail */
 	cur_tail = rte_pktmbuf_lastseg(head);
+	cur_tail->nb_segs = 2;
 	cur_tail->next = tail;
 
 	/* accumulate number of segments and total length.
 	 * NB: elaborating the addition like this instead of using
 	 *     += allows us to ensure the result type is uint16_t
 	 *     avoiding compiler warnings on gcc 8.1 at least */
-	head->nb_segs = (uint16_t)(head->nb_segs + tail->nb_segs);
+	head->nb_segs = (uint16_t)nb_segs;
 	head->pkt_len += tail->pkt_len;
 
 	/* pkt_len is only set in the head */
diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
index a30e1e0eaf..c0c3b45024 100644
--- a/lib/mbuf/rte_mbuf_core.h
+++ b/lib/mbuf/rte_mbuf_core.h
@@ -594,25 +594,6 @@ struct rte_mbuf {
 
 	uint16_t buf_len;         /**< Length of segment buffer. */
 
-	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
-
-	/* second cache line - fields only used in slow path or on TX */
-	RTE_MARKER cacheline1 __rte_cache_min_aligned;
-
-#if RTE_IOVA_AS_PA
-	/**
-	 * Next segment of scattered packet. Must be NULL in the last
-	 * segment or in case of non-segmented packet.
-	 */
-	struct rte_mbuf *next;
-#else
-	/**
-	 * Reserved for dynamic fields
-	 * when the next pointer is in first cache line (i.e. RTE_IOVA_AS_PA is 0).
-	 */
-	uint64_t dynfield2;
-#endif
-
 	/* fields to support TX offloads */
 	RTE_STD_C11
 	union {
@@ -651,6 +632,25 @@ struct rte_mbuf {
 		};
 	};
 
+	/* second cache line - fields only used in slow path or on TX */
+	RTE_MARKER cacheline1 __rte_cache_min_aligned;
+
+#if RTE_IOVA_AS_PA
+	/**
+	 * Next segment of scattered packet. Must be NULL in the last
+	 * segment or in case of non-segmented packet.
+	 */
+	struct rte_mbuf *next;
+#else
+	/**
+	 * Reserved for dynamic fields
+	 * when the next pointer is in first cache line (i.e. RTE_IOVA_AS_PA is 0).
+	 */
+	uint64_t dynfield2;
+#endif
+
+	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
+
 	/** Shared data for external buffer attached to mbuf. See
 	 * rte_pktmbuf_attach_extbuf().
 	 */
-- 
2.17.1
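As a reader's aid (not part of the patch), the nb_segs invariant the hunks above rely on can be sketched with a toy model. The names `toy_mbuf`, `toy_chain`, and `toy_lastseg` are invented for illustration; only the invariant mirrors the patch:

```c
/* Toy model (not DPDK code) of the nb_segs invariant: every non-last
 * segment has nb_segs > 1, the last segment has nb_segs == 1, and the
 * head additionally carries the total segment count. Traversal can
 * then test nb_segs (1st cache line) instead of dereferencing next
 * (2nd cache line) for single-segment packets. */
#include <stddef.h>
#include <stdint.h>

struct toy_mbuf {
	uint16_t nb_segs;      /* lives in the 1st cache line */
	struct toy_mbuf *next; /* lives in the 2nd cache line */
};

static struct toy_mbuf *
toy_lastseg(struct toy_mbuf *m)
{
	while (m->nb_segs != 1) /* no 2nd-cache-line read for 1-seg pkts */
		m = m->next;
	return m;
}

static void
toy_chain(struct toy_mbuf *head, struct toy_mbuf *tail)
{
	/* Compute the total before marking cur_tail, as the patch does. */
	const uint16_t nb_segs = (uint16_t)(head->nb_segs + tail->nb_segs);
	struct toy_mbuf *cur_tail = toy_lastseg(head);

	cur_tail->nb_segs = 2; /* mark "more segments follow" */
	cur_tail->next = tail;
	head->nb_segs = nb_segs;
}
```

Chaining three one-segment mbufs a, b, c this way leaves a.nb_segs == 3 (head: total count), b.nb_segs == 2 (intermediate: more follow), and c.nb_segs == 1 (last segment), which is exactly what the modified rte_pktmbuf_lastseg() and rte_pktmbuf_free() loops test for.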


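The rte_mbuf_core.h hunks can likewise be summarized with a layout sketch (not the real struct; field selection and padding here are illustrative only): tx_offload joins the hot fields in the 1st cache line, while next and pool move to the 2nd, so a simple forward-and-transmit path never touches cache line 1.

```c
/* Illustrative layout only: the real rte_mbuf has many more fields.
 * The point is the placement: tx_offload before the 64-byte boundary,
 * next and pool after it. */
#include <stddef.h>
#include <stdint.h>

#define TOY_CACHE_LINE 64

struct toy_mbuf_layout {
	/* 1st cache line: fields read/written on the RX/TX fast path */
	void *buf_addr;
	uint64_t ol_flags;
	uint16_t nb_segs;
	uint16_t buf_len;
	uint32_t pkt_len;
	uint64_t tx_offload; /* moved up: set by the app, read by PMD TX */
	uint8_t pad0[TOY_CACHE_LINE - 32]; /* fill up to the line boundary */

	/* 2nd cache line: slow-path fields */
	struct toy_mbuf_layout *next; /* only touched when nb_segs != 1 */
	void *pool;                   /* only touched when freeing */
};

_Static_assert(offsetof(struct toy_mbuf_layout, tx_offload) < TOY_CACHE_LINE,
	"tx_offload must stay in the 1st cache line");
_Static_assert(offsetof(struct toy_mbuf_layout, next) >= TOY_CACHE_LINE,
	"next belongs in the 2nd cache line");
```

With this split, a TX burst of non-segmented packets reads ol_flags, nb_segs and tx_offload from cache line 0 only; next is consulted solely when nb_segs != 1, and pool solely on free.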

Thread overview: 12+ messages
2022-12-03 17:13 mbuf performance optimization Morten Brørup
2022-12-04 10:10 ` [PATCH v2] mbuf perf test, please ignore Morten Brørup
2022-12-04 12:00 ` [PATCH v3] " Morten Brørup
2022-12-04 12:33 ` [PATCH v4] " Morten Brørup
2022-12-04 12:54 ` [PATCH v5] " Morten Brørup
2022-12-06  9:20 ` Morten Brørup
2022-12-06 11:03 ` [PATCH v6] " Morten Brørup
2022-12-06 12:12 ` [PATCH v7] " Morten Brørup
2022-12-06 12:16 ` [PATCH v8] " Morten Brørup
2022-12-06 12:35 ` [PATCH v9] " Morten Brørup
2022-12-06 13:53 ` [PATCH v10] " Morten Brørup
2022-12-07  7:35 ` [PATCH v11] " Morten Brørup
