DPDK patches and discussions
 help / color / mirror / Atom feed
* [dpdk-dev] [PATCH 0/5] Mbuf Structure Rework, part 3
@ 2014-09-17 10:01 Bruce Richardson
  2014-09-17 10:01 ` [dpdk-dev] [PATCH 1/5] mbuf: ensure next pointer is set to null on free Bruce Richardson
                   ` (5 more replies)
  0 siblings, 6 replies; 39+ messages in thread
From: Bruce Richardson @ 2014-09-17 10:01 UTC (permalink / raw)
  To: dev

This is the final planned set of patches to make changes to the mbuf
data structure and associated files. This patch set makes more changes to
help improve performance following the mbuf changes and adds in two new
fields into the mbuf structure.

It is planned to add other fields other than the two provided here, but
patches for adding those fields will be included in the patch sets for the
changes making use of those fields, since adding them does not affect, or
move, any other mbuf fields.

NOTE: this patch set has a dependency on mbuf patch sets 1 and 2

Bruce Richardson (5):
  mbuf: ensure next pointer is set to null on free
  ixgbe: add prefetch to improve slow-path tx perf
  testpmd: Change rxfreet default to 32
  mbuf: add userdata pointer field
  mbuf: Add in second vlan tag field to mbuf

 app/test-pmd/flowgen.c                                |  2 +-
 app/test-pmd/macfwd.c                                 |  2 +-
 app/test-pmd/macswap.c                                |  2 +-
 app/test-pmd/rxonly.c                                 |  2 +-
 app/test-pmd/testpmd.c                                |  6 +++---
 app/test-pmd/txonly.c                                 |  2 +-
 app/test/packet_burst_generator.c                     |  4 ++--
 examples/ipv4_multicast/main.c                        |  3 ++-
 examples/vhost/main.c                                 |  2 +-
 .../linuxapp/eal/include/exec-env/rte_kni_common.h    |  3 ++-
 lib/librte_mbuf/rte_mbuf.h                            | 17 ++++++++++++-----
 lib/librte_pmd_e1000/em_rxtx.c                        |  8 ++++----
 lib/librte_pmd_e1000/igb_rxtx.c                       |  6 +++---
 lib/librte_pmd_i40e/i40e_rxtx.c                       |  8 ++++----
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c                     | 19 +++++++++++--------
 lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c                 |  4 ++--
 16 files changed, 51 insertions(+), 39 deletions(-)

-- 
1.9.3

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [dpdk-dev] [PATCH 1/5] mbuf: ensure next pointer is set to null on free
  2014-09-17 10:01 [dpdk-dev] [PATCH 0/5] Mbuf Structure Rework, part 3 Bruce Richardson
@ 2014-09-17 10:01 ` Bruce Richardson
  2014-09-17 10:01 ` [dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path tx perf Bruce Richardson
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 39+ messages in thread
From: Bruce Richardson @ 2014-09-17 10:01 UTC (permalink / raw)
  To: dev

The receive functions for packets do not modify the next pointer so
the next pointer should always be cleared on mbuf free, just in case.
The slow-path TX needs to clear it, and the standard mbuf free function
also needs to clear it. Fast path TX does not handle chained mbufs so
is unaffected

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_mbuf/rte_mbuf.h        | 4 +++-
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c | 1 +
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 1c6e115..8e27d2e 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -682,8 +682,10 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 static inline void __attribute__((always_inline))
 rte_pktmbuf_free_seg(struct rte_mbuf *m)
 {
-	if (likely(NULL != (m = __rte_pktmbuf_prefree_seg(m))))
+	if (likely(NULL != (m = __rte_pktmbuf_prefree_seg(m)))) {
+		m->next = NULL;
 		__rte_mbuf_raw_free(m);
+	}
 }
 
 /**
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index a80cade..6f702b3 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -145,6 +145,7 @@ ixgbe_tx_free_bufs(struct igb_tx_queue *txq)
 	/* free buffers one at a time */
 	if ((txq->txq_flags & (uint32_t)ETH_TXQ_FLAGS_NOREFCOUNT) != 0) {
 		for (i = 0; i < txq->tx_rs_thresh; ++i, ++txep) {
+			txep->mbuf->next = NULL;
 			rte_mempool_put(txep->mbuf->pool, txep->mbuf);
 			txep->mbuf = NULL;
 		}
-- 
1.9.3

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path tx perf
  2014-09-17 10:01 [dpdk-dev] [PATCH 0/5] Mbuf Structure Rework, part 3 Bruce Richardson
  2014-09-17 10:01 ` [dpdk-dev] [PATCH 1/5] mbuf: ensure next pointer is set to null on free Bruce Richardson
@ 2014-09-17 10:01 ` Bruce Richardson
  2014-09-17 15:21   ` Neil Horman
  2014-09-17 10:01 ` [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32 Bruce Richardson
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 39+ messages in thread
From: Bruce Richardson @ 2014-09-17 10:01 UTC (permalink / raw)
  To: dev

Make a small improvement to slow path TX performance by adding in a
prefetch for the second mbuf cache line.
Also move assignment of l2/l3 length values only when needed.

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index 6f702b3..c0bb49f 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -565,25 +565,26 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		ixgbe_xmit_cleanup(txq);
 	}
 
+	rte_prefetch0(&txe->mbuf->pool);
+
 	/* TX loop */
 	for (nb_tx = 0; nb_tx < nb_pkts; nb_tx++) {
 		new_ctx = 0;
 		tx_pkt = *tx_pkts++;
 		pkt_len = tx_pkt->pkt_len;
 
-		RTE_MBUF_PREFETCH_TO_FREE(txe->mbuf);
-
 		/*
 		 * Determine how many (if any) context descriptors
 		 * are needed for offload functionality.
 		 */
 		ol_flags = tx_pkt->ol_flags;
-		vlan_macip_lens.f.vlan_tci = tx_pkt->vlan_tci;
-		vlan_macip_lens.f.l2_l3_len = tx_pkt->l2_l3_len;
 
 		/* If hardware offload required */
 		tx_ol_req = ol_flags & PKT_TX_OFFLOAD_MASK;
 		if (tx_ol_req) {
+			vlan_macip_lens.f.vlan_tci = tx_pkt->vlan_tci;
+			vlan_macip_lens.f.l2_l3_len = tx_pkt->l2_l3_len;
+
 			/* If new context need be built or reuse the exist ctx. */
 			ctx = what_advctx_update(txq, tx_ol_req,
 				vlan_macip_lens.data);
@@ -720,7 +721,7 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 				    &txr[tx_id];
 
 				txn = &sw_ring[txe->next_id];
-				RTE_MBUF_PREFETCH_TO_FREE(txn->mbuf);
+				rte_prefetch0(&txn->mbuf->pool);
 
 				if (txe->mbuf != NULL) {
 					rte_pktmbuf_free_seg(txe->mbuf);
@@ -749,6 +750,7 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		do {
 			txd = &txr[tx_id];
 			txn = &sw_ring[txe->next_id];
+			rte_prefetch0(&txn->mbuf->pool);
 
 			if (txe->mbuf != NULL)
 				rte_pktmbuf_free_seg(txe->mbuf);
-- 
1.9.3

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32
  2014-09-17 10:01 [dpdk-dev] [PATCH 0/5] Mbuf Structure Rework, part 3 Bruce Richardson
  2014-09-17 10:01 ` [dpdk-dev] [PATCH 1/5] mbuf: ensure next pointer is set to null on free Bruce Richardson
  2014-09-17 10:01 ` [dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path tx perf Bruce Richardson
@ 2014-09-17 10:01 ` Bruce Richardson
  2014-09-17 15:29   ` Neil Horman
  2014-09-17 10:01 ` [dpdk-dev] [PATCH 4/5] mbuf: add userdata pointer field Bruce Richardson
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 39+ messages in thread
From: Bruce Richardson @ 2014-09-17 10:01 UTC (permalink / raw)
  To: dev

To improve performance by using bulk alloc or vectored RX routines, we
need to set rx free threshold (rxfreet) value to 32, so make this the
testpmd default.

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 app/test-pmd/testpmd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 9f6cdc4..5751607 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -225,7 +225,7 @@ struct rte_eth_thresh tx_thresh = {
 /*
  * Configurable value of RX free threshold.
  */
-uint16_t rx_free_thresh = 0; /* Immediately free RX descriptors by default. */
+uint16_t rx_free_thresh = 32; /* Refill RX descriptors once every 32 packets */
 
 /*
  * Configurable value of RX drop enable.
-- 
1.9.3

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [dpdk-dev] [PATCH 4/5] mbuf: add userdata pointer field
  2014-09-17 10:01 [dpdk-dev] [PATCH 0/5] Mbuf Structure Rework, part 3 Bruce Richardson
                   ` (2 preceding siblings ...)
  2014-09-17 10:01 ` [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32 Bruce Richardson
@ 2014-09-17 10:01 ` Bruce Richardson
  2014-09-17 15:35   ` Neil Horman
  2014-09-17 10:01 ` [dpdk-dev] [PATCH 5/5] mbuf: Add in second vlan tag field to mbuf Bruce Richardson
  2014-09-23 11:08 ` [dpdk-dev] [PATCH v2 0/5] Mbuf Structure Rework, part 3 Bruce Richardson
  5 siblings, 1 reply; 39+ messages in thread
From: Bruce Richardson @ 2014-09-17 10:01 UTC (permalink / raw)
  To: dev

While some applications may store metadata about packets in the packet
mbuf headroom, this is not a workable solution for packet metadata which
is either:
* larger than the headroom (or headroom is needed for adding pkt headers)
* needs to be shared or copied among packets

To support these use cases in applications, we reserve a general
"userdata" pointer field inside the second cache-line of the mbuf. This
is better than having the application store the pointer to the external
metadata in the packet headroom, as it saves an additional cache-line
from being used.

Apart from storing metadata, this field also provides a general 8-byte
scratch space inside the mbuf for any other application uses that are
applicable.

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h | 3 ++-
 lib/librte_mbuf/rte_mbuf.h                                    | 3 +++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
index 25ed672..d27e891 100644
--- a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
+++ b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
@@ -117,7 +117,8 @@ struct rte_kni_mbuf {
 	uint16_t data_len;      /**< Amount of data in segment buffer. */
 	uint32_t pkt_len;       /**< Total pkt len: sum of all segment data_len. */
 	char pad3[8];
-	void *pool __attribute__((__aligned__(64)));
+	void *pad4 __attribute__((__aligned__(64)));
+	void *pool;
 	void *next;
 };
 
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 8e27d2e..b1acfc3 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -172,6 +172,9 @@ struct rte_mbuf {
 
 	/* second cache line - fields only used in slow path or on TX */
 	MARKER cacheline1 __rte_cache_aligned;
+
+	void *userdata;           /**< Can be used for external metadata */
+
 	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
 	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
 
-- 
1.9.3

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [dpdk-dev] [PATCH 5/5] mbuf: Add in second vlan tag field to mbuf
  2014-09-17 10:01 [dpdk-dev] [PATCH 0/5] Mbuf Structure Rework, part 3 Bruce Richardson
                   ` (3 preceding siblings ...)
  2014-09-17 10:01 ` [dpdk-dev] [PATCH 4/5] mbuf: add userdata pointer field Bruce Richardson
@ 2014-09-17 10:01 ` Bruce Richardson
  2014-09-17 20:46   ` Stephen Hemminger
  2014-09-23 11:08 ` [dpdk-dev] [PATCH v2 0/5] Mbuf Structure Rework, part 3 Bruce Richardson
  5 siblings, 1 reply; 39+ messages in thread
From: Bruce Richardson @ 2014-09-17 10:01 UTC (permalink / raw)
  To: dev

Packets with double vlan tags are relatively common. Applications can be
written to use these, and some NICs (such as those supported by the i40e
driver) have hardware support for double vlan tags. To support apps and
drivers using double vlan tags, we need to add in a second vlan tag
field into the mbuf. The existing vlan_tci field is renamed vlan_tci0 to
take account of the fact that there is now a vlan_tci1 field.

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 app/test-pmd/flowgen.c                |  2 +-
 app/test-pmd/macfwd.c                 |  2 +-
 app/test-pmd/macswap.c                |  2 +-
 app/test-pmd/rxonly.c                 |  2 +-
 app/test-pmd/testpmd.c                |  4 ++--
 app/test-pmd/txonly.c                 |  2 +-
 app/test/packet_burst_generator.c     |  4 ++--
 examples/ipv4_multicast/main.c        |  3 ++-
 examples/vhost/main.c                 |  2 +-
 lib/librte_mbuf/rte_mbuf.h            | 10 ++++++----
 lib/librte_pmd_e1000/em_rxtx.c        |  8 ++++----
 lib/librte_pmd_e1000/igb_rxtx.c       |  6 +++---
 lib/librte_pmd_i40e/i40e_rxtx.c       |  8 ++++----
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c     |  8 ++++----
 lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c |  4 ++--
 15 files changed, 35 insertions(+), 32 deletions(-)

diff --git a/app/test-pmd/flowgen.c b/app/test-pmd/flowgen.c
index 8b4ed9a..cd68916 100644
--- a/app/test-pmd/flowgen.c
+++ b/app/test-pmd/flowgen.c
@@ -208,7 +208,7 @@ pkt_burst_flow_gen(struct fwd_stream *fs)
 		pkt->nb_segs		= 1;
 		pkt->pkt_len		= pkt_size;
 		pkt->ol_flags		= ol_flags;
-		pkt->vlan_tci		= vlan_tci;
+		pkt->vlan_tci0		= vlan_tci;
 		pkt->l2_len		= sizeof(struct ether_hdr);
 		pkt->l3_len		= sizeof(struct ipv4_hdr);
 		pkts_burst[nb_pkt]	= pkt;
diff --git a/app/test-pmd/macfwd.c b/app/test-pmd/macfwd.c
index 38bae23..3e91050 100644
--- a/app/test-pmd/macfwd.c
+++ b/app/test-pmd/macfwd.c
@@ -118,7 +118,7 @@ pkt_burst_mac_forward(struct fwd_stream *fs)
 		mb->ol_flags = txp->tx_ol_flags;
 		mb->l2_len = sizeof(struct ether_hdr);
 		mb->l3_len = sizeof(struct ipv4_hdr);
-		mb->vlan_tci = txp->tx_vlan_id;
+		mb->vlan_tci0 = txp->tx_vlan_id;
 	}
 	nb_tx = rte_eth_tx_burst(fs->tx_port, fs->tx_queue, pkts_burst, nb_rx);
 	fs->tx_packets += nb_tx;
diff --git a/app/test-pmd/macswap.c b/app/test-pmd/macswap.c
index 1786095..02b04d7 100644
--- a/app/test-pmd/macswap.c
+++ b/app/test-pmd/macswap.c
@@ -120,7 +120,7 @@ pkt_burst_mac_swap(struct fwd_stream *fs)
 		mb->ol_flags = txp->tx_ol_flags;
 		mb->l2_len = sizeof(struct ether_hdr);
 		mb->l3_len = sizeof(struct ipv4_hdr);
-		mb->vlan_tci = txp->tx_vlan_id;
+		mb->vlan_tci0 = txp->tx_vlan_id;
 	}
 	nb_tx = rte_eth_tx_burst(fs->tx_port, fs->tx_queue, pkts_burst, nb_rx);
 	fs->tx_packets += nb_tx;
diff --git a/app/test-pmd/rxonly.c b/app/test-pmd/rxonly.c
index 98c788b..a4c8797 100644
--- a/app/test-pmd/rxonly.c
+++ b/app/test-pmd/rxonly.c
@@ -165,7 +165,7 @@ pkt_burst_receive(struct fwd_stream *fs)
 			printf(" - FDIR hash=0x%x - FDIR id=0x%x ",
 			       mb->hash.fdir.hash, mb->hash.fdir.id);
 		if (ol_flags & PKT_RX_VLAN_PKT)
-			printf(" - VLAN tci=0x%x", mb->vlan_tci);
+			printf(" - VLAN tci0=0x%x", mb->vlan_tci0);
 		printf("\n");
 		if (ol_flags != 0) {
 			int rxf;
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 5751607..fdf3296 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -406,8 +406,8 @@ testpmd_mbuf_ctor(struct rte_mempool *mp,
 	mb->ol_flags     = 0;
 	mb->data_off     = RTE_PKTMBUF_HEADROOM;
 	mb->nb_segs      = 1;
-	mb->l2_l3_len       = 0;
-	mb->vlan_tci     = 0;
+	mb->l2_l3_len    = 0;
+	mb->vlan_tci0    = 0;
 	mb->hash.rss     = 0;
 }
 
diff --git a/app/test-pmd/txonly.c b/app/test-pmd/txonly.c
index 3d08005..52b2f06 100644
--- a/app/test-pmd/txonly.c
+++ b/app/test-pmd/txonly.c
@@ -264,7 +264,7 @@ pkt_burst_transmit(struct fwd_stream *fs)
 		pkt->nb_segs = tx_pkt_nb_segs;
 		pkt->pkt_len = tx_pkt_length;
 		pkt->ol_flags = ol_flags;
-		pkt->vlan_tci  = vlan_tci;
+		pkt->vlan_tci0 = vlan_tci;
 		pkt->l2_len = sizeof(struct ether_hdr);
 		pkt->l3_len = sizeof(struct ipv4_hdr);
 		pkts_burst[nb_pkt] = pkt;
diff --git a/app/test/packet_burst_generator.c b/app/test/packet_burst_generator.c
index 9e747a4..66f2a29 100644
--- a/app/test/packet_burst_generator.c
+++ b/app/test/packet_burst_generator.c
@@ -264,7 +264,7 @@ nomore_mbuf:
 		pkt->l2_len = eth_hdr_size;
 
 		if (ipv4) {
-			pkt->vlan_tci  = ETHER_TYPE_IPv4;
+			pkt->vlan_tci0  = ETHER_TYPE_IPv4;
 			pkt->l3_len = sizeof(struct ipv4_hdr);
 
 			if (vlan_enabled)
@@ -272,7 +272,7 @@ nomore_mbuf:
 			else
 				pkt->ol_flags = PKT_RX_IPV4_HDR;
 		} else {
-			pkt->vlan_tci  = ETHER_TYPE_IPv6;
+			pkt->vlan_tci0  = ETHER_TYPE_IPv6;
 			pkt->l3_len = sizeof(struct ipv6_hdr);
 
 			if (vlan_enabled)
diff --git a/examples/ipv4_multicast/main.c b/examples/ipv4_multicast/main.c
index 35bd842..54d0782 100644
--- a/examples/ipv4_multicast/main.c
+++ b/examples/ipv4_multicast/main.c
@@ -338,7 +338,8 @@ mcast_out_pkt(struct rte_mbuf *pkt, int use_clone)
 
 	/* copy metadata from source packet*/
 	hdr->port = pkt->port;
-	hdr->vlan_tci = pkt->vlan_tci;
+	hdr->vlan_tci0 = pkt->vlan_tci0;
+	hdr->vlan_tci1 = pkt->vlan_tci1;
 	hdr->l2_l3_len = pkt->l2_l3_len;
 	hdr->hash = pkt->hash;
 
diff --git a/examples/vhost/main.c b/examples/vhost/main.c
index 85ee8b8..dc2177b 100644
--- a/examples/vhost/main.c
+++ b/examples/vhost/main.c
@@ -2693,7 +2693,7 @@ virtio_tx_route_zcp(struct virtio_net *dev, struct rte_mbuf *m,
 		mbuf->buf_addr = m->buf_addr;
 	}
 	mbuf->ol_flags = PKT_TX_VLAN_PKT;
-	mbuf->vlan_tci = vlan_tag;
+	mbuf->vlan_tci0 = vlan_tag;
 	mbuf->l2_len = sizeof(struct ether_hdr);
 	mbuf->l3_len = sizeof(struct ipv4_hdr);
 	MBUF_HEADROOM_UINT32(mbuf) = (uint32_t)desc_idx;
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index b1acfc3..0991788 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -159,8 +159,8 @@ struct rte_mbuf {
 	uint16_t reserved2;       /**< Unused field. Required for padding */
 	uint16_t data_len;        /**< Amount of data in segment buffer. */
 	uint32_t pkt_len;         /**< Total pkt len: sum of all segments. */
-	uint16_t reserved;
-	uint16_t vlan_tci;        /**< VLAN Tag Control Identifier (CPU order) */
+	uint16_t vlan_tci0;       /**< VLAN Tag Control Identifier (CPU order) */
+	uint16_t vlan_tci1;       /**< Second VLAN Tag Control Identifier */
 	union {
 		uint32_t rss;     /**< RSS hash result if RSS enabled */
 		struct {
@@ -536,7 +536,8 @@ static inline void rte_pktmbuf_reset(struct rte_mbuf *m)
 	m->next = NULL;
 	m->pkt_len = 0;
 	m->l2_l3_len = 0;
-	m->vlan_tci = 0;
+	m->vlan_tci0 = 0;
+	m->vlan_tci1 = 0;
 	m->nb_segs = 1;
 	m->port = 0xff;
 
@@ -602,7 +603,8 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *md)
 	mi->data_off = md->data_off;
 	mi->data_len = md->data_len;
 	mi->port = md->port;
-	mi->vlan_tci = md->vlan_tci;
+	mi->vlan_tci0 = md->vlan_tci0;
+	mi->vlan_tci1 = md->vlan_tci1;
 	mi->l2_l3_len = md->l2_l3_len;
 	mi->hash = md->hash;
 
diff --git a/lib/librte_pmd_e1000/em_rxtx.c b/lib/librte_pmd_e1000/em_rxtx.c
index b8423b4..3b05a03 100644
--- a/lib/librte_pmd_e1000/em_rxtx.c
+++ b/lib/librte_pmd_e1000/em_rxtx.c
@@ -440,7 +440,7 @@ eth_em_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		/* If hardware offload required */
 		tx_ol_req = (ol_flags & (PKT_TX_IP_CKSUM | PKT_TX_L4_MASK));
 		if (tx_ol_req) {
-			hdrlen.f.vlan_tci = tx_pkt->vlan_tci;
+			hdrlen.f.vlan_tci = tx_pkt->vlan_tci0;
 			hdrlen.f.l2_len = tx_pkt->l2_len;
 			hdrlen.f.l3_len = tx_pkt->l3_len;
 			/* If new context to be built or reuse the exist ctx. */
@@ -537,7 +537,7 @@ eth_em_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		/* Set VLAN Tag offload fields. */
 		if (ol_flags & PKT_TX_VLAN_PKT) {
 			cmd_type_len |= E1000_TXD_CMD_VLE;
-			popts_spec = tx_pkt->vlan_tci << E1000_TXD_VLAN_SHIFT;
+			popts_spec = tx_pkt->vlan_tci0 << E1000_TXD_VLAN_SHIFT;
 		}
 
 		if (tx_ol_req) {
@@ -803,7 +803,7 @@ eth_em_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 				rx_desc_error_to_pkt_flags(rxd.errors);
 
 		/* Only valid if PKT_RX_VLAN_PKT set in pkt_flags */
-		rxm->vlan_tci = rte_le_to_cpu_16(rxd.special);
+		rxm->vlan_tci0 = rte_le_to_cpu_16(rxd.special);
 
 		/*
 		 * Store the mbuf address into the next entry of the array
@@ -1029,7 +1029,7 @@ eth_em_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 					rx_desc_error_to_pkt_flags(rxd.errors);
 
 		/* Only valid if PKT_RX_VLAN_PKT set in pkt_flags */
-		rxm->vlan_tci = rte_le_to_cpu_16(rxd.special);
+		rxm->vlan_tci0 = rte_le_to_cpu_16(rxd.special);
 
 		/* Prefetch data of first segment, if configured to do so. */
 		rte_packet_prefetch((char *)first_seg->buf_addr +
diff --git a/lib/librte_pmd_e1000/igb_rxtx.c b/lib/librte_pmd_e1000/igb_rxtx.c
index 56d1dfc..76391d6 100644
--- a/lib/librte_pmd_e1000/igb_rxtx.c
+++ b/lib/librte_pmd_e1000/igb_rxtx.c
@@ -398,7 +398,7 @@ eth_igb_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		tx_last = (uint16_t) (tx_id + tx_pkt->nb_segs - 1);
 
 		ol_flags = tx_pkt->ol_flags;
-		vlan_macip_lens.f.vlan_tci = tx_pkt->vlan_tci;
+		vlan_macip_lens.f.vlan_tci = tx_pkt->vlan_tci0;
 		vlan_macip_lens.f.l2_l3_len = tx_pkt->l2_l3_len;
 		tx_ol_req = ol_flags & PKT_TX_OFFLOAD_MASK;
 
@@ -781,7 +781,7 @@ eth_igb_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		rxm->hash.rss = rxd.wb.lower.hi_dword.rss;
 		hlen_type_rss = rte_le_to_cpu_32(rxd.wb.lower.lo_dword.data);
 		/* Only valid if PKT_RX_VLAN_PKT set in pkt_flags */
-		rxm->vlan_tci = rte_le_to_cpu_16(rxd.wb.upper.vlan);
+		rxm->vlan_tci0 = rte_le_to_cpu_16(rxd.wb.upper.vlan);
 
 		pkt_flags = rx_desc_hlen_type_rss_to_pkt_flags(hlen_type_rss);
 		pkt_flags = pkt_flags | rx_desc_status_to_pkt_flags(staterr);
@@ -1015,7 +1015,7 @@ eth_igb_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 * The vlan_tci field is only valid when PKT_RX_VLAN_PKT is
 		 * set in the pkt_flags field.
 		 */
-		first_seg->vlan_tci = rte_le_to_cpu_16(rxd.wb.upper.vlan);
+		first_seg->vlan_tci0 = rte_le_to_cpu_16(rxd.wb.upper.vlan);
 		hlen_type_rss = rte_le_to_cpu_32(rxd.wb.lower.lo_dword.data);
 		pkt_flags = rx_desc_hlen_type_rss_to_pkt_flags(hlen_type_rss);
 		pkt_flags = pkt_flags | rx_desc_status_to_pkt_flags(staterr);
diff --git a/lib/librte_pmd_i40e/i40e_rxtx.c b/lib/librte_pmd_i40e/i40e_rxtx.c
index cd05654..e753495 100644
--- a/lib/librte_pmd_i40e/i40e_rxtx.c
+++ b/lib/librte_pmd_i40e/i40e_rxtx.c
@@ -611,7 +611,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
 				I40E_RXD_QW1_LENGTH_PBUF_SHIFT) - rxq->crc_len;
 			mb->data_len = pkt_len;
 			mb->pkt_len = pkt_len;
-			mb->vlan_tci = rx_status &
+			mb->vlan_tci0 = rx_status &
 				(1 << I40E_RX_DESC_STATUS_L2TAG1P_SHIFT) ?
 			rte_le_to_cpu_16(\
 				rxdp[j].wb.qword0.lo_dword.l2tag1) : 0;
@@ -848,7 +848,7 @@ i40e_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
 		rxm->data_len = rx_packet_len;
 		rxm->port = rxq->port_id;
 
-		rxm->vlan_tci = rx_status &
+		rxm->vlan_tci0 = rx_status &
 			(1 << I40E_RX_DESC_STATUS_L2TAG1P_SHIFT) ?
 			rte_le_to_cpu_16(rxd.wb.qword0.lo_dword.l2tag1) : 0;
 		pkt_flags = i40e_rxd_status_to_pkt_flags(qword1);
@@ -1002,7 +1002,7 @@ i40e_recv_scattered_pkts(void *rx_queue,
 		}
 
 		first_seg->port = rxq->port_id;
-		first_seg->vlan_tci = (rx_status &
+		first_seg->vlan_tci0 = (rx_status &
 			(1 << I40E_RX_DESC_STATUS_L2TAG1P_SHIFT)) ?
 			rte_le_to_cpu_16(rxd.wb.qword0.lo_dword.l2tag1) : 0;
 		pkt_flags = i40e_rxd_status_to_pkt_flags(qword1);
@@ -1142,7 +1142,7 @@ i40e_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 
 		/* Descriptor based VLAN insertion */
 		if (ol_flags & PKT_TX_VLAN_PKT) {
-			tx_flags |= tx_pkt->vlan_tci <<
+			tx_flags |= tx_pkt->vlan_tci0 <<
 					I40E_TX_FLAG_L2TAG1_SHIFT;
 			tx_flags |= I40E_TX_FLAG_INSERT_VLAN;
 			td_cmd |= I40E_TX_DESC_CMD_IL2TAG1;
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index c0bb49f..3a43cb8 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -582,7 +582,7 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		/* If hardware offload required */
 		tx_ol_req = ol_flags & PKT_TX_OFFLOAD_MASK;
 		if (tx_ol_req) {
-			vlan_macip_lens.f.vlan_tci = tx_pkt->vlan_tci;
+			vlan_macip_lens.f.vlan_tci = tx_pkt->vlan_tci0;
 			vlan_macip_lens.f.l2_l3_len = tx_pkt->l2_l3_len;
 
 			/* If new context need be built or reuse the exist ctx. */
@@ -940,7 +940,7 @@ ixgbe_rx_scan_hw_ring(struct igb_rx_queue *rxq)
 							rxq->crc_len);
 			mb->data_len = pkt_len;
 			mb->pkt_len = pkt_len;
-			mb->vlan_tci = rxdp[j].wb.upper.vlan;
+			mb->vlan_tci0 = rxdp[j].wb.upper.vlan;
 			mb->hash.rss = rxdp[j].wb.lower.hi_dword.rss;
 
 			/* convert descriptor fields to rte mbuf flags */
@@ -1258,7 +1258,7 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 
 		hlen_type_rss = rte_le_to_cpu_32(rxd.wb.lower.lo_dword.data);
 		/* Only valid if PKT_RX_VLAN_PKT set in pkt_flags */
-		rxm->vlan_tci = rte_le_to_cpu_16(rxd.wb.upper.vlan);
+		rxm->vlan_tci0 = rte_le_to_cpu_16(rxd.wb.upper.vlan);
 
 		pkt_flags = rx_desc_hlen_type_rss_to_pkt_flags(hlen_type_rss);
 		pkt_flags = pkt_flags | rx_desc_status_to_pkt_flags(staterr);
@@ -1500,7 +1500,7 @@ ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 * The vlan_tci field is only valid when PKT_RX_VLAN_PKT is
 		 * set in the pkt_flags field.
 		 */
-		first_seg->vlan_tci = rte_le_to_cpu_16(rxd.wb.upper.vlan);
+		first_seg->vlan_tci0 = rte_le_to_cpu_16(rxd.wb.upper.vlan);
 		hlen_type_rss = rte_le_to_cpu_32(rxd.wb.lower.lo_dword.data);
 		pkt_flags = rx_desc_hlen_type_rss_to_pkt_flags(hlen_type_rss);
 		pkt_flags = (uint16_t)(pkt_flags |
diff --git a/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c b/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
index 263f9ce..998c50a 100644
--- a/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
+++ b/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
@@ -550,7 +550,7 @@ vmxnet3_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
 					       rte_pktmbuf_mtod(rxm, void *));
 #endif
 				/* Copy vlan tag in packet buffer */
-				rxm->vlan_tci = rte_le_to_cpu_16(
+				rxm->vlan_tci0 = rte_le_to_cpu_16(
 						(uint16_t)rcd->tci);
 
 			} else
@@ -563,7 +563,7 @@ vmxnet3_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
 			rxm->pkt_len = (uint16_t)rcd->len;
 			rxm->data_len = (uint16_t)rcd->len;
 			rxm->port = rxq->port_id;
-			rxm->vlan_tci = 0;
+			rxm->vlan_tci0 = 0;
 			rxm->data_off = RTE_PKTMBUF_HEADROOM;
 
 			rx_pkts[nb_rx++] = rxm;
-- 
1.9.3

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path tx perf
  2014-09-17 10:01 ` [dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path tx perf Bruce Richardson
@ 2014-09-17 15:21   ` Neil Horman
  2014-09-17 15:35     ` Richardson, Bruce
  0 siblings, 1 reply; 39+ messages in thread
From: Neil Horman @ 2014-09-17 15:21 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

On Wed, Sep 17, 2014 at 11:01:39AM +0100, Bruce Richardson wrote:
> Make a small improvement to slow path TX performance by adding in a
> prefetch for the second mbuf cache line.
> Also move assignment of l2/l3 length values only when needed.
> 
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> ---
>  lib/librte_pmd_ixgbe/ixgbe_rxtx.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> index 6f702b3..c0bb49f 100644
> --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> @@ -565,25 +565,26 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
>  		ixgbe_xmit_cleanup(txq);
>  	}
>  
> +	rte_prefetch0(&txe->mbuf->pool);
> +

Can you explain what all of these prefetches are doing?  It looks to me like
they're just fetching the first caheline of the mempool structure, which it
appears amounts to the pools name.  I don't see that having any use here.

>  	/* TX loop */
>  	for (nb_tx = 0; nb_tx < nb_pkts; nb_tx++) {
>  		new_ctx = 0;
>  		tx_pkt = *tx_pkts++;
>  		pkt_len = tx_pkt->pkt_len;
>  
> -		RTE_MBUF_PREFETCH_TO_FREE(txe->mbuf);
> -
>  		/*
>  		 * Determine how many (if any) context descriptors
>  		 * are needed for offload functionality.
>  		 */
>  		ol_flags = tx_pkt->ol_flags;
> -		vlan_macip_lens.f.vlan_tci = tx_pkt->vlan_tci;
> -		vlan_macip_lens.f.l2_l3_len = tx_pkt->l2_l3_len;
>  
>  		/* If hardware offload required */
>  		tx_ol_req = ol_flags & PKT_TX_OFFLOAD_MASK;
>  		if (tx_ol_req) {
> +			vlan_macip_lens.f.vlan_tci = tx_pkt->vlan_tci;
> +			vlan_macip_lens.f.l2_l3_len = tx_pkt->l2_l3_len;
> +
>  			/* If new context need be built or reuse the exist ctx. */
>  			ctx = what_advctx_update(txq, tx_ol_req,
>  				vlan_macip_lens.data);
> @@ -720,7 +721,7 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
>  				    &txr[tx_id];
>  
>  				txn = &sw_ring[txe->next_id];
> -				RTE_MBUF_PREFETCH_TO_FREE(txn->mbuf);
> +				rte_prefetch0(&txn->mbuf->pool);
>  
>  				if (txe->mbuf != NULL) {
>  					rte_pktmbuf_free_seg(txe->mbuf);
> @@ -749,6 +750,7 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
>  		do {
>  			txd = &txr[tx_id];
>  			txn = &sw_ring[txe->next_id];
> +			rte_prefetch0(&txn->mbuf->pool);
>  
>  			if (txe->mbuf != NULL)
>  				rte_pktmbuf_free_seg(txe->mbuf);
> -- 
> 1.9.3
> 
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32
  2014-09-17 10:01 ` [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32 Bruce Richardson
@ 2014-09-17 15:29   ` Neil Horman
  2014-09-18 15:53     ` Richardson, Bruce
  0 siblings, 1 reply; 39+ messages in thread
From: Neil Horman @ 2014-09-17 15:29 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

On Wed, Sep 17, 2014 at 11:01:40AM +0100, Bruce Richardson wrote:
> To improve performance by using bulk alloc or vectored RX routines, we
> need to set rx free threshold (rxfreet) value to 32, so make this the
> testpmd default.
> 
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> ---
>  app/test-pmd/testpmd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
> index 9f6cdc4..5751607 100644
> --- a/app/test-pmd/testpmd.c
> +++ b/app/test-pmd/testpmd.c
> @@ -225,7 +225,7 @@ struct rte_eth_thresh tx_thresh = {
>  /*
>   * Configurable value of RX free threshold.
>   */
> -uint16_t rx_free_thresh = 0; /* Immediately free RX descriptors by default. */
> +uint16_t rx_free_thresh = 32; /* Refill RX descriptors once every 32 packets */
>  

Why 32?  Was that an experimentally determined value?  Does it hold true for all
PMD's?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [dpdk-dev] [PATCH 4/5] mbuf: add userdata pointer field
  2014-09-17 10:01 ` [dpdk-dev] [PATCH 4/5] mbuf: add userdata pointer field Bruce Richardson
@ 2014-09-17 15:35   ` Neil Horman
  2014-09-17 16:02     ` Richardson, Bruce
  0 siblings, 1 reply; 39+ messages in thread
From: Neil Horman @ 2014-09-17 15:35 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

On Wed, Sep 17, 2014 at 11:01:41AM +0100, Bruce Richardson wrote:
> While some applications may store metadata about packets in the packet
> mbuf headroom, this is not a workable solution for packet metadata which
> is either:
> * larger than the headroom (or headroom is needed for adding pkt headers)
> * needs to be shared or copied among packets
> 
> To support these use cases in applications, we reserve a general
> "userdata" pointer field inside the second cache-line of the mbuf. This
> is better than having the application store the pointer to the external
> metadata in the packet headroom, as it saves an additional cache-line
> from being used.
> 
> Apart from storing metadata, this field also provides a general 8-byte
> scratch space inside the mbuf for any other application uses that are
> applicable.
> 
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> ---
>  lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h | 3 ++-
>  lib/librte_mbuf/rte_mbuf.h                                    | 3 +++
>  2 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
> index 25ed672..d27e891 100644
> --- a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
> +++ b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
> @@ -117,7 +117,8 @@ struct rte_kni_mbuf {
>  	uint16_t data_len;      /**< Amount of data in segment buffer. */
>  	uint32_t pkt_len;       /**< Total pkt len: sum of all segment data_len. */
>  	char pad3[8];
> -	void *pool __attribute__((__aligned__(64)));
> +	void *pad4 __attribute__((__aligned__(64)));
> +	void *pool;
I don't see a comment about this in the changelog, only about the userdata
pointer being added below.


>  	void *next;
>  };
>  
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index 8e27d2e..b1acfc3 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -172,6 +172,9 @@ struct rte_mbuf {
>  
>  	/* second cache line - fields only used in slow path or on TX */
>  	MARKER cacheline1 __rte_cache_aligned;
> +
> +	void *userdata;           /**< Can be used for external metadata */
> +
Do you want to make this a void* or a char[8]?  I ask because if people are
going to use is as a scratch space (rather than a pointer), they get a suprise
when they build this on 32 bit systems, and their 8 byte scratch space is
reduced to 4 bytes.

Neil

>  	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
>  	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
>  
> -- 
> 1.9.3
> 
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path tx perf
  2014-09-17 15:21   ` Neil Horman
@ 2014-09-17 15:35     ` Richardson, Bruce
  2014-09-17 17:59       ` Neil Horman
  0 siblings, 1 reply; 39+ messages in thread
From: Richardson, Bruce @ 2014-09-17 15:35 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev


> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Wednesday, September 17, 2014 4:21 PM
> To: Richardson, Bruce
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path tx
> perf
> 
> On Wed, Sep 17, 2014 at 11:01:39AM +0100, Bruce Richardson wrote:
> > Make a small improvement to slow path TX performance by adding in a
> > prefetch for the second mbuf cache line.
> > Also move assignment of l2/l3 length values only when needed.
> >
> > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > ---
> >  lib/librte_pmd_ixgbe/ixgbe_rxtx.c | 12 +++++++-----
> >  1 file changed, 7 insertions(+), 5 deletions(-)
> >
> > diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > index 6f702b3..c0bb49f 100644
> > --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > @@ -565,25 +565,26 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf
> **tx_pkts,
> >  		ixgbe_xmit_cleanup(txq);
> >  	}
> >
> > +	rte_prefetch0(&txe->mbuf->pool);
> > +
> 
> Can you explain what all of these prefetches are doing?  It looks to me like
> they're just fetching the first caheline of the mempool structure, which it
> appears amounts to the pools name.  I don't see that having any use here.
> 
This does make a decent enough performance difference in my tests (the amount varies depending on the RX path being used by testpmd). 

What I've done with the prefetches is two-fold:
1) changed it from prefetching the mbuf (first cache line) to prefetching the mbuf pool pointer (second cache line) so that when we go to access the pool pointer to free transmitted mbufs we don't get a cache miss. When clearing the ring and freeing mbufs, the pool pointer is the only mbuf field used, so we don't need that first cache line.
2) changed the code to prefetch earlier - in effect to prefetch one mbuf ahead. The original code prefetched the mbuf to be freed as soon as it started processing the mbuf to replace it. Instead now, every time we calculate what the next mbuf position is going to be we prefetch the mbuf in that position (i.e. the mbuf pool pointer we are going to free the mbuf to), even while we are still updating the previous mbuf slot on the ring. This gives the prefetch much more time to resolve and get the data we need in the cache before we need it.

Hope this clarifies things.

/Bruce

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [dpdk-dev] [PATCH 4/5] mbuf: add userdata pointer field
  2014-09-17 15:35   ` Neil Horman
@ 2014-09-17 16:02     ` Richardson, Bruce
  2014-09-17 18:29       ` Neil Horman
  0 siblings, 1 reply; 39+ messages in thread
From: Richardson, Bruce @ 2014-09-17 16:02 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Wednesday, September 17, 2014 4:35 PM
> To: Richardson, Bruce
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 4/5] mbuf: add userdata pointer field
> 
> On Wed, Sep 17, 2014 at 11:01:41AM +0100, Bruce Richardson wrote:
> > While some applications may store metadata about packets in the packet
> > mbuf headroom, this is not a workable solution for packet metadata which
> > is either:
> > * larger than the headroom (or headroom is needed for adding pkt headers)
> > * needs to be shared or copied among packets
> >
> > To support these use cases in applications, we reserve a general
> > "userdata" pointer field inside the second cache-line of the mbuf. This
> > is better than having the application store the pointer to the external
> > metadata in the packet headroom, as it saves an additional cache-line
> > from being used.
> >
> > Apart from storing metadata, this field also provides a general 8-byte
> > scratch space inside the mbuf for any other application uses that are
> > applicable.
> >
> > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > ---
> >  lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h | 3 ++-
> >  lib/librte_mbuf/rte_mbuf.h                                    | 3 +++
> >  2 files changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
> b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
> > index 25ed672..d27e891 100644
> > --- a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
> > +++ b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
> > @@ -117,7 +117,8 @@ struct rte_kni_mbuf {
> >  	uint16_t data_len;      /**< Amount of data in segment buffer. */
> >  	uint32_t pkt_len;       /**< Total pkt len: sum of all segment data_len. */
> >  	char pad3[8];
> > -	void *pool __attribute__((__aligned__(64)));
> > +	void *pad4 __attribute__((__aligned__(64)));
> > +	void *pool;
> I don't see a comment about this in the changelog, only about the userdata
> pointer being added below.

Yes, this is the userdata pointer - just added as padding here, since it's not actually needed by the kernel-side KNI module.

> 
> 
> >  	void *next;
> >  };
> >
> > diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> > index 8e27d2e..b1acfc3 100644
> > --- a/lib/librte_mbuf/rte_mbuf.h
> > +++ b/lib/librte_mbuf/rte_mbuf.h
> > @@ -172,6 +172,9 @@ struct rte_mbuf {
> >
> >  	/* second cache line - fields only used in slow path or on TX */
> >  	MARKER cacheline1 __rte_cache_aligned;
> > +
> > +	void *userdata;           /**< Can be used for external metadata */
> > +
> Do you want to make this a void* or a char[8]?  I ask because if people are
> going to use is as a scratch space (rather than a pointer), they get a suprise
> when they build this on 32 bit systems, and their 8 byte scratch space is
> reduced to 4 bytes.

I think this is better as a pointer, as that is how it is likely to be used. As for 32-bit, I'm torn between wanting to just update the comment and feeling the need to update the code to actually make the thing 8-byte on 32-bit! Changing the comment to be more accurate is easier, unions are ugly looking in the structure IMHO, so maybe I'll just mark the following field (pool) as always 8-byte aligned....

/Bruce

> 
> Neil
> 
> >  	struct rte_mempool *pool; /**< Pool from which mbuf was allocated.
> */
> >  	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
> >
> > --
> > 1.9.3
> >
> >

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path tx perf
  2014-09-17 15:35     ` Richardson, Bruce
@ 2014-09-17 17:59       ` Neil Horman
  2014-09-18 13:36         ` Bruce Richardson
  0 siblings, 1 reply; 39+ messages in thread
From: Neil Horman @ 2014-09-17 17:59 UTC (permalink / raw)
  To: Richardson, Bruce; +Cc: dev

On Wed, Sep 17, 2014 at 03:35:19PM +0000, Richardson, Bruce wrote:
> 
> > -----Original Message-----
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Wednesday, September 17, 2014 4:21 PM
> > To: Richardson, Bruce
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path tx
> > perf
> > 
> > On Wed, Sep 17, 2014 at 11:01:39AM +0100, Bruce Richardson wrote:
> > > Make a small improvement to slow path TX performance by adding in a
> > > prefetch for the second mbuf cache line.
> > > Also move assignment of l2/l3 length values only when needed.
> > >
> > > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > > ---
> > >  lib/librte_pmd_ixgbe/ixgbe_rxtx.c | 12 +++++++-----
> > >  1 file changed, 7 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > index 6f702b3..c0bb49f 100644
> > > --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > @@ -565,25 +565,26 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf
> > **tx_pkts,
> > >  		ixgbe_xmit_cleanup(txq);
> > >  	}
> > >
> > > +	rte_prefetch0(&txe->mbuf->pool);
> > > +
> > 
> > Can you explain what all of these prefetches are doing?  It looks to me like
> > they're just fetching the first caheline of the mempool structure, which it
> > appears amounts to the pools name.  I don't see that having any use here.
> > 
> This does make a decent enough performance difference in my tests (the amount varies depending on the RX path being used by testpmd). 
> 
> What I've done with the prefetches is two-fold:
> 1) changed it from prefetching the mbuf (first cache line) to prefetching the mbuf pool pointer (second cache line) so that when we go to access the pool pointer to free transmitted mbufs we don't get a cache miss. When clearing the ring and freeing mbufs, the pool pointer is the only mbuf field used, so we don't need that first cache line.
ok, this makes some sense, but you're not guaranteed to either have that
prefetch be needed, nor are you certain it will still be in cache by the time
you get to the free call.  Seems like it might be preferable to prefecth the
data pointed to by tx_pkt, as you're sure to use that every loop iteration.

> 2) changed the code to prefetch earlier - in effect to prefetch one mbuf ahead. The original code prefetched the mbuf to be freed as soon as it started processing the mbuf to replace it. Instead now, every time we calculate what the next mbuf position is going to be we prefetch the mbuf in that position (i.e. the mbuf pool pointer we are going to free the mbuf to), even while we are still updating the previous mbuf slot on the ring. This gives the prefetch much more time to resolve and get the data we need in the cache before we need it.
> 
Again, early isn't necessecarily better, as it just means more time for the data
in cache to get victimized. It seems like it would be better to prefetch the
tx_pkts data a few cache lines ahead.

Neil

> Hope this clarifies things.
> 
> /Bruce
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [dpdk-dev] [PATCH 4/5] mbuf: add userdata pointer field
  2014-09-17 16:02     ` Richardson, Bruce
@ 2014-09-17 18:29       ` Neil Horman
  0 siblings, 0 replies; 39+ messages in thread
From: Neil Horman @ 2014-09-17 18:29 UTC (permalink / raw)
  To: Richardson, Bruce; +Cc: dev

On Wed, Sep 17, 2014 at 04:02:07PM +0000, Richardson, Bruce wrote:
> > -----Original Message-----
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Wednesday, September 17, 2014 4:35 PM
> > To: Richardson, Bruce
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH 4/5] mbuf: add userdata pointer field
> > 
> > On Wed, Sep 17, 2014 at 11:01:41AM +0100, Bruce Richardson wrote:
> > > While some applications may store metadata about packets in the packet
> > > mbuf headroom, this is not a workable solution for packet metadata which
> > > is either:
> > > * larger than the headroom (or headroom is needed for adding pkt headers)
> > > * needs to be shared or copied among packets
> > >
> > > To support these use cases in applications, we reserve a general
> > > "userdata" pointer field inside the second cache-line of the mbuf. This
> > > is better than having the application store the pointer to the external
> > > metadata in the packet headroom, as it saves an additional cache-line
> > > from being used.
> > >
> > > Apart from storing metadata, this field also provides a general 8-byte
> > > scratch space inside the mbuf for any other application uses that are
> > > applicable.
> > >
> > > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > > ---
> > >  lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h | 3 ++-
> > >  lib/librte_mbuf/rte_mbuf.h                                    | 3 +++
> > >  2 files changed, 5 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
> > b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
> > > index 25ed672..d27e891 100644
> > > --- a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
> > > +++ b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
> > > @@ -117,7 +117,8 @@ struct rte_kni_mbuf {
> > >  	uint16_t data_len;      /**< Amount of data in segment buffer. */
> > >  	uint32_t pkt_len;       /**< Total pkt len: sum of all segment data_len. */
> > >  	char pad3[8];
> > > -	void *pool __attribute__((__aligned__(64)));
> > > +	void *pad4 __attribute__((__aligned__(64)));
> > > +	void *pool;
> > I don't see a comment about this in the changelog, only about the userdata
> > pointer being added below.
> 
> Yes, this is the userdata pointer - just added as padding here, since it's not actually needed by the kernel-side KNI module.
> 
Ah, then maybe merge it with the pad3 pointer, to make it look a bit better?

> > 
> > 
> > >  	void *next;
> > >  };
> > >
> > > diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> > > index 8e27d2e..b1acfc3 100644
> > > --- a/lib/librte_mbuf/rte_mbuf.h
> > > +++ b/lib/librte_mbuf/rte_mbuf.h
> > > @@ -172,6 +172,9 @@ struct rte_mbuf {
> > >
> > >  	/* second cache line - fields only used in slow path or on TX */
> > >  	MARKER cacheline1 __rte_cache_aligned;
> > > +
> > > +	void *userdata;           /**< Can be used for external metadata */
> > > +
> > Do you want to make this a void* or a char[8]?  I ask because if people are
> > going to use is as a scratch space (rather than a pointer), they get a suprise
> > when they build this on 32 bit systems, and their 8 byte scratch space is
> > reduced to 4 bytes.
> 
> I think this is better as a pointer, as that is how it is likely to be used. As for 32-bit, I'm torn between wanting to just update the comment and feeling the need to update the code to actually make the thing 8-byte on 32-bit! Changing the comment to be more accurate is easier, unions are ugly looking in the structure IMHO, so maybe I'll just mark the following field (pool) as always 8-byte aligned....
> 
Ok, however you want to bring it into alignment, as long as its clear that its
either 8 bytes always, or a variable size based on the arch.

Best
Neil

> /Bruce
> 
> > 
> > Neil
> > 
> > >  	struct rte_mempool *pool; /**< Pool from which mbuf was allocated.
> > */
> > >  	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
> > >
> > > --
> > > 1.9.3
> > >
> > >
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [dpdk-dev] [PATCH 5/5] mbuf: Add in second vlan tag field to mbuf
  2014-09-17 10:01 ` [dpdk-dev] [PATCH 5/5] mbuf: Add in second vlan tag field to mbuf Bruce Richardson
@ 2014-09-17 20:46   ` Stephen Hemminger
  0 siblings, 0 replies; 39+ messages in thread
From: Stephen Hemminger @ 2014-09-17 20:46 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

On Wed, 17 Sep 2014 11:01:42 +0100
Bruce Richardson <bruce.richardson@intel.com> wrote:

> Packets with double vlan tags are relatively common. Applications can be
> written to use these, and some NICs (such as those supported by the i40e
> driver) have hardware support for double vlan tags. To support apps and
> drivers using double vlan tags, we need to add in a second vlan tag
> field into the mbuf. The existing vlan_tci field is renamed vlan_tci0 to
> take account of the fact that there is now a vlan_tci1 field.
> 
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> ---
>  app/test-pmd/flowgen.c                |  2 +-
>  app/test-pmd/macfwd.c                 |  2 +-
>  app/test-pmd/macswap.c                |  2 +-
>  app/test-pmd/rxonly.c                 |  2 +-
>  app/test-pmd/testpmd.c                |  4 ++--
>  app/test-pmd/txonly.c                 |  2 +-
>  app/test/packet_burst_generator.c     |  4 ++--
>  examples/ipv4_multicast/main.c        |  3 ++-
>  examples/vhost/main.c                 |  2 +-
>  lib/librte_mbuf/rte_mbuf.h            | 10 ++++++----
>  lib/librte_pmd_e1000/em_rxtx.c        |  8 ++++----
>  lib/librte_pmd_e1000/igb_rxtx.c       |  6 +++---
>  lib/librte_pmd_i40e/i40e_rxtx.c       |  8 ++++----
>  lib/librte_pmd_ixgbe/ixgbe_rxtx.c     |  8 ++++----
>  lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c |  4 ++--
>  15 files changed, 35 insertions(+), 32 deletions(-)
> 
> diff --git a/app/test-pmd/flowgen.c b/app/test-pmd/flowgen.c
> index 8b4ed9a..cd68916 100644
> --- a/app/test-pmd/flowgen.c
> +++ b/app/test-pmd/flowgen.c
> @@ -208,7 +208,7 @@ pkt_burst_flow_gen(struct fwd_stream *fs)
>  		pkt->nb_segs		= 1;
>  		pkt->pkt_len		= pkt_size;
>  		pkt->ol_flags		= ol_flags;
> -		pkt->vlan_tci		= vlan_tci;
> +		pkt->vlan_tci0		= vlan_tci;
>  		pkt->l2_len		= sizeof(struct ether_hdr);
>  		pkt->l3_len		= sizeof(struct ipv4_hdr);
>  		pkts_burst[nb_pkt]	= pkt;
> diff --git a/app/test-pmd/macfwd.c b/app/test-pmd/macfwd.c
> index 38bae23..3e91050 100644
> --- a/app/test-pmd/macfwd.c
> +++ b/app/test-pmd/macfwd.c
> @@ -118,7 +118,7 @@ pkt_burst_mac_forward(struct fwd_stream *fs)
>  		mb->ol_flags = txp->tx_ol_flags;
>  		mb->l2_len = sizeof(struct ether_hdr);
>  		mb->l3_len = sizeof(struct ipv4_hdr);
> -		mb->vlan_tci = txp->tx_vlan_id;
> +		mb->vlan_tci0 = txp->tx_vlan_id;
>  	}
>  	nb_tx = rte_eth_tx_burst(fs->tx_port, fs->tx_queue, pkts_burst, nb_rx);
>  	fs->tx_packets += nb_tx;
> diff --git a/app/test-pmd/macswap.c b/app/test-pmd/macswap.c
> index 1786095..02b04d7 100644
> --- a/app/test-pmd/macswap.c
> +++ b/app/test-pmd/macswap.c
> @@ -120,7 +120,7 @@ pkt_burst_mac_swap(struct fwd_stream *fs)
>  		mb->ol_flags = txp->tx_ol_flags;
>  		mb->l2_len = sizeof(struct ether_hdr);
>  		mb->l3_len = sizeof(struct ipv4_hdr);
> -		mb->vlan_tci = txp->tx_vlan_id;
> +		mb->vlan_tci0 = txp->tx_vlan_id;
>  	}
>  	nb_tx = rte_eth_tx_burst(fs->tx_port, fs->tx_queue, pkts_burst, nb_rx);
>  	fs->tx_packets += nb_tx;
> diff --git a/app/test-pmd/rxonly.c b/app/test-pmd/rxonly.c
> index 98c788b..a4c8797 100644
> --- a/app/test-pmd/rxonly.c
> +++ b/app/test-pmd/rxonly.c
> @@ -165,7 +165,7 @@ pkt_burst_receive(struct fwd_stream *fs)
>  			printf(" - FDIR hash=0x%x - FDIR id=0x%x ",
>  			       mb->hash.fdir.hash, mb->hash.fdir.id);
>  		if (ol_flags & PKT_RX_VLAN_PKT)
> -			printf(" - VLAN tci=0x%x", mb->vlan_tci);
> +			printf(" - VLAN tci0=0x%x", mb->vlan_tci0);
>  		printf("\n");
>  		if (ol_flags != 0) {
>  			int rxf;
> diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
> index 5751607..fdf3296 100644
> --- a/app/test-pmd/testpmd.c
> +++ b/app/test-pmd/testpmd.c
> @@ -406,8 +406,8 @@ testpmd_mbuf_ctor(struct rte_mempool *mp,
>  	mb->ol_flags     = 0;
>  	mb->data_off     = RTE_PKTMBUF_HEADROOM;
>  	mb->nb_segs      = 1;
> -	mb->l2_l3_len       = 0;
> -	mb->vlan_tci     = 0;
> +	mb->l2_l3_len    = 0;
> +	mb->vlan_tci0    = 0;
>  	mb->hash.rss     = 0;
>  }
>  
> diff --git a/app/test-pmd/txonly.c b/app/test-pmd/txonly.c
> index 3d08005..52b2f06 100644
> --- a/app/test-pmd/txonly.c
> +++ b/app/test-pmd/txonly.c
> @@ -264,7 +264,7 @@ pkt_burst_transmit(struct fwd_stream *fs)
>  		pkt->nb_segs = tx_pkt_nb_segs;
>  		pkt->pkt_len = tx_pkt_length;
>  		pkt->ol_flags = ol_flags;
> -		pkt->vlan_tci  = vlan_tci;
> +		pkt->vlan_tci0 = vlan_tci;
>  		pkt->l2_len = sizeof(struct ether_hdr);
>  		pkt->l3_len = sizeof(struct ipv4_hdr);
>  		pkts_burst[nb_pkt] = pkt;
> diff --git a/app/test/packet_burst_generator.c b/app/test/packet_burst_generator.c
> index 9e747a4..66f2a29 100644
> --- a/app/test/packet_burst_generator.c
> +++ b/app/test/packet_burst_generator.c
> @@ -264,7 +264,7 @@ nomore_mbuf:
>  		pkt->l2_len = eth_hdr_size;
>  
>  		if (ipv4) {
> -			pkt->vlan_tci  = ETHER_TYPE_IPv4;
> +			pkt->vlan_tci0  = ETHER_TYPE_IPv4;
>  			pkt->l3_len = sizeof(struct ipv4_hdr);
>  
>  			if (vlan_enabled)
> @@ -272,7 +272,7 @@ nomore_mbuf:
>  			else
>  				pkt->ol_flags = PKT_RX_IPV4_HDR;
>  		} else {
> -			pkt->vlan_tci  = ETHER_TYPE_IPv6;
> +			pkt->vlan_tci0  = ETHER_TYPE_IPv6;
>  			pkt->l3_len = sizeof(struct ipv6_hdr);
>  
>  			if (vlan_enabled)
> diff --git a/examples/ipv4_multicast/main.c b/examples/ipv4_multicast/main.c
> index 35bd842..54d0782 100644
> --- a/examples/ipv4_multicast/main.c
> +++ b/examples/ipv4_multicast/main.c
> @@ -338,7 +338,8 @@ mcast_out_pkt(struct rte_mbuf *pkt, int use_clone)
>  
>  	/* copy metadata from source packet*/
>  	hdr->port = pkt->port;
> -	hdr->vlan_tci = pkt->vlan_tci;
> +	hdr->vlan_tci0 = pkt->vlan_tci0;
> +	hdr->vlan_tci1 = pkt->vlan_tci1;
>  	hdr->l2_l3_len = pkt->l2_l3_len;
>  	hdr->hash = pkt->hash;
>  
> diff --git a/examples/vhost/main.c b/examples/vhost/main.c
> index 85ee8b8..dc2177b 100644
> --- a/examples/vhost/main.c
> +++ b/examples/vhost/main.c
> @@ -2693,7 +2693,7 @@ virtio_tx_route_zcp(struct virtio_net *dev, struct rte_mbuf *m,
>  		mbuf->buf_addr = m->buf_addr;
>  	}
>  	mbuf->ol_flags = PKT_TX_VLAN_PKT;
> -	mbuf->vlan_tci = vlan_tag;
> +	mbuf->vlan_tci0 = vlan_tag;
>  	mbuf->l2_len = sizeof(struct ether_hdr);
>  	mbuf->l3_len = sizeof(struct ipv4_hdr);
>  	MBUF_HEADROOM_UINT32(mbuf) = (uint32_t)desc_idx;
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index b1acfc3..0991788 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -159,8 +159,8 @@ struct rte_mbuf {
>  	uint16_t reserved2;       /**< Unused field. Required for padding */
>  	uint16_t data_len;        /**< Amount of data in segment buffer. */
>  	uint32_t pkt_len;         /**< Total pkt len: sum of all segments. */
> -	uint16_t reserved;
> -	uint16_t vlan_tci;        /**< VLAN Tag Control Identifier (CPU order) */
> +	uint16_t vlan_tci0;       /**< VLAN Tag Control Identifier (CPU order) */
> +	uint16_t vlan_tci1;       /**< Second VLAN Tag Control Identifier */
>  	union {
>  		uint32_t rss;     /**< RSS hash result if RSS enabled */
>  		struct {
> @@ -536,7 +536,8 @@ static inline void rte_pktmbuf_reset(struct rte_mbuf *m)
>  	m->next = NULL;
>  	m->pkt_len = 0;
>  	m->l2_l3_len = 0;
> -	m->vlan_tci = 0;
> +	m->vlan_tci0 = 0;
> +	m->vlan_tci1 = 0;
>  	m->nb_segs = 1;
>  	m->port = 0xff;
>  
> @@ -602,7 +603,8 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *md)
>  	mi->data_off = md->data_off;
>  	mi->data_len = md->data_len;
>  	mi->port = md->port;
> -	mi->vlan_tci = md->vlan_tci;
> +	mi->vlan_tci0 = md->vlan_tci0;
> +	mi->vlan_tci1 = md->vlan_tci1;
>  	mi->l2_l3_len = md->l2_l3_len;
>  	mi->hash = md->hash;
>  
> diff --git a/lib/librte_pmd_e1000/em_rxtx.c b/lib/librte_pmd_e1000/em_rxtx.c
> index b8423b4..3b05a03 100644
> --- a/lib/librte_pmd_e1000/em_rxtx.c
> +++ b/lib/librte_pmd_e1000/em_rxtx.c
> @@ -440,7 +440,7 @@ eth_em_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
>  		/* If hardware offload required */
>  		tx_ol_req = (ol_flags & (PKT_TX_IP_CKSUM | PKT_TX_L4_MASK));
>  		if (tx_ol_req) {
> -			hdrlen.f.vlan_tci = tx_pkt->vlan_tci;
> +			hdrlen.f.vlan_tci = tx_pkt->vlan_tci0;
>  			hdrlen.f.l2_len = tx_pkt->l2_len;
>  			hdrlen.f.l3_len = tx_pkt->l3_len;
>  			/* If new context to be built or reuse the exist ctx. */
> @@ -537,7 +537,7 @@ eth_em_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
>  		/* Set VLAN Tag offload fields. */
>  		if (ol_flags & PKT_TX_VLAN_PKT) {
>  			cmd_type_len |= E1000_TXD_CMD_VLE;
> -			popts_spec = tx_pkt->vlan_tci << E1000_TXD_VLAN_SHIFT;
> +			popts_spec = tx_pkt->vlan_tci0 << E1000_TXD_VLAN_SHIFT;
>  		}
>  
>  		if (tx_ol_req) {
> @@ -803,7 +803,7 @@ eth_em_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
>  				rx_desc_error_to_pkt_flags(rxd.errors);
>  
>  		/* Only valid if PKT_RX_VLAN_PKT set in pkt_flags */
> -		rxm->vlan_tci = rte_le_to_cpu_16(rxd.special);
> +		rxm->vlan_tci0 = rte_le_to_cpu_16(rxd.special);
>  
>  		/*
>  		 * Store the mbuf address into the next entry of the array
> @@ -1029,7 +1029,7 @@ eth_em_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
>  					rx_desc_error_to_pkt_flags(rxd.errors);
>  
>  		/* Only valid if PKT_RX_VLAN_PKT set in pkt_flags */
> -		rxm->vlan_tci = rte_le_to_cpu_16(rxd.special);
> +		rxm->vlan_tci0 = rte_le_to_cpu_16(rxd.special);
>  
>  		/* Prefetch data of first segment, if configured to do so. */
>  		rte_packet_prefetch((char *)first_seg->buf_addr +
> diff --git a/lib/librte_pmd_e1000/igb_rxtx.c b/lib/librte_pmd_e1000/igb_rxtx.c
> index 56d1dfc..76391d6 100644
> --- a/lib/librte_pmd_e1000/igb_rxtx.c
> +++ b/lib/librte_pmd_e1000/igb_rxtx.c
> @@ -398,7 +398,7 @@ eth_igb_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
>  		tx_last = (uint16_t) (tx_id + tx_pkt->nb_segs - 1);
>  
>  		ol_flags = tx_pkt->ol_flags;
> -		vlan_macip_lens.f.vlan_tci = tx_pkt->vlan_tci;
> +		vlan_macip_lens.f.vlan_tci = tx_pkt->vlan_tci0;
>  		vlan_macip_lens.f.l2_l3_len = tx_pkt->l2_l3_len;
>  		tx_ol_req = ol_flags & PKT_TX_OFFLOAD_MASK;
>  
> @@ -781,7 +781,7 @@ eth_igb_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
>  		rxm->hash.rss = rxd.wb.lower.hi_dword.rss;
>  		hlen_type_rss = rte_le_to_cpu_32(rxd.wb.lower.lo_dword.data);
>  		/* Only valid if PKT_RX_VLAN_PKT set in pkt_flags */
> -		rxm->vlan_tci = rte_le_to_cpu_16(rxd.wb.upper.vlan);
> +		rxm->vlan_tci0 = rte_le_to_cpu_16(rxd.wb.upper.vlan);
>  
>  		pkt_flags = rx_desc_hlen_type_rss_to_pkt_flags(hlen_type_rss);
>  		pkt_flags = pkt_flags | rx_desc_status_to_pkt_flags(staterr);
> @@ -1015,7 +1015,7 @@ eth_igb_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
>  		 * The vlan_tci field is only valid when PKT_RX_VLAN_PKT is
>  		 * set in the pkt_flags field.
>  		 */
> -		first_seg->vlan_tci = rte_le_to_cpu_16(rxd.wb.upper.vlan);
> +		first_seg->vlan_tci0 = rte_le_to_cpu_16(rxd.wb.upper.vlan);
>  		hlen_type_rss = rte_le_to_cpu_32(rxd.wb.lower.lo_dword.data);
>  		pkt_flags = rx_desc_hlen_type_rss_to_pkt_flags(hlen_type_rss);
>  		pkt_flags = pkt_flags | rx_desc_status_to_pkt_flags(staterr);
> diff --git a/lib/librte_pmd_i40e/i40e_rxtx.c b/lib/librte_pmd_i40e/i40e_rxtx.c
> index cd05654..e753495 100644
> --- a/lib/librte_pmd_i40e/i40e_rxtx.c
> +++ b/lib/librte_pmd_i40e/i40e_rxtx.c
> @@ -611,7 +611,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
>  				I40E_RXD_QW1_LENGTH_PBUF_SHIFT) - rxq->crc_len;
>  			mb->data_len = pkt_len;
>  			mb->pkt_len = pkt_len;
> -			mb->vlan_tci = rx_status &
> +			mb->vlan_tci0 = rx_status &
>  				(1 << I40E_RX_DESC_STATUS_L2TAG1P_SHIFT) ?
>  			rte_le_to_cpu_16(\
>  				rxdp[j].wb.qword0.lo_dword.l2tag1) : 0;
> @@ -848,7 +848,7 @@ i40e_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
>  		rxm->data_len = rx_packet_len;
>  		rxm->port = rxq->port_id;
>  
> -		rxm->vlan_tci = rx_status &
> +		rxm->vlan_tci0 = rx_status &
>  			(1 << I40E_RX_DESC_STATUS_L2TAG1P_SHIFT) ?
>  			rte_le_to_cpu_16(rxd.wb.qword0.lo_dword.l2tag1) : 0;
>  		pkt_flags = i40e_rxd_status_to_pkt_flags(qword1);
> @@ -1002,7 +1002,7 @@ i40e_recv_scattered_pkts(void *rx_queue,
>  		}
>  
>  		first_seg->port = rxq->port_id;
> -		first_seg->vlan_tci = (rx_status &
> +		first_seg->vlan_tci0 = (rx_status &
>  			(1 << I40E_RX_DESC_STATUS_L2TAG1P_SHIFT)) ?
>  			rte_le_to_cpu_16(rxd.wb.qword0.lo_dword.l2tag1) : 0;
>  		pkt_flags = i40e_rxd_status_to_pkt_flags(qword1);
> @@ -1142,7 +1142,7 @@ i40e_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
>  
>  		/* Descriptor based VLAN insertion */
>  		if (ol_flags & PKT_TX_VLAN_PKT) {
> -			tx_flags |= tx_pkt->vlan_tci <<
> +			tx_flags |= tx_pkt->vlan_tci0 <<
>  					I40E_TX_FLAG_L2TAG1_SHIFT;
>  			tx_flags |= I40E_TX_FLAG_INSERT_VLAN;
>  			td_cmd |= I40E_TX_DESC_CMD_IL2TAG1;
> diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> index c0bb49f..3a43cb8 100644
> --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> @@ -582,7 +582,7 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
>  		/* If hardware offload required */
>  		tx_ol_req = ol_flags & PKT_TX_OFFLOAD_MASK;
>  		if (tx_ol_req) {
> -			vlan_macip_lens.f.vlan_tci = tx_pkt->vlan_tci;
> +			vlan_macip_lens.f.vlan_tci = tx_pkt->vlan_tci0;
>  			vlan_macip_lens.f.l2_l3_len = tx_pkt->l2_l3_len;
>  
>  			/* If new context need be built or reuse the exist ctx. */
> @@ -940,7 +940,7 @@ ixgbe_rx_scan_hw_ring(struct igb_rx_queue *rxq)
>  							rxq->crc_len);
>  			mb->data_len = pkt_len;
>  			mb->pkt_len = pkt_len;
> -			mb->vlan_tci = rxdp[j].wb.upper.vlan;
> +			mb->vlan_tci0 = rxdp[j].wb.upper.vlan;
>  			mb->hash.rss = rxdp[j].wb.lower.hi_dword.rss;
>  
>  			/* convert descriptor fields to rte mbuf flags */
> @@ -1258,7 +1258,7 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
>  
>  		hlen_type_rss = rte_le_to_cpu_32(rxd.wb.lower.lo_dword.data);
>  		/* Only valid if PKT_RX_VLAN_PKT set in pkt_flags */
> -		rxm->vlan_tci = rte_le_to_cpu_16(rxd.wb.upper.vlan);
> +		rxm->vlan_tci0 = rte_le_to_cpu_16(rxd.wb.upper.vlan);
>  
>  		pkt_flags = rx_desc_hlen_type_rss_to_pkt_flags(hlen_type_rss);
>  		pkt_flags = pkt_flags | rx_desc_status_to_pkt_flags(staterr);
> @@ -1500,7 +1500,7 @@ ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
>  		 * The vlan_tci field is only valid when PKT_RX_VLAN_PKT is
>  		 * set in the pkt_flags field.
>  		 */
> -		first_seg->vlan_tci = rte_le_to_cpu_16(rxd.wb.upper.vlan);
> +		first_seg->vlan_tci0 = rte_le_to_cpu_16(rxd.wb.upper.vlan);
>  		hlen_type_rss = rte_le_to_cpu_32(rxd.wb.lower.lo_dword.data);
>  		pkt_flags = rx_desc_hlen_type_rss_to_pkt_flags(hlen_type_rss);
>  		pkt_flags = (uint16_t)(pkt_flags |
> diff --git a/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c b/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
> index 263f9ce..998c50a 100644
> --- a/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
> +++ b/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
> @@ -550,7 +550,7 @@ vmxnet3_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
>  					       rte_pktmbuf_mtod(rxm, void *));
>  #endif
>  				/* Copy vlan tag in packet buffer */
> -				rxm->vlan_tci = rte_le_to_cpu_16(
> +				rxm->vlan_tci0 = rte_le_to_cpu_16(
>  						(uint16_t)rcd->tci);
>  
>  			} else
> @@ -563,7 +563,7 @@ vmxnet3_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
>  			rxm->pkt_len = (uint16_t)rcd->len;
>  			rxm->data_len = (uint16_t)rcd->len;
>  			rxm->port = rxq->port_id;
> -			rxm->vlan_tci = 0;
> +			rxm->vlan_tci0 = 0;
>  			rxm->data_off = RTE_PKTMBUF_HEADROOM;
>  
>  			rx_pkts[nb_rx++] = rxm;

You need to add feature flags for when driver is doing offload, and have a software
routine to add/delete tag?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path tx perf
  2014-09-17 17:59       ` Neil Horman
@ 2014-09-18 13:36         ` Bruce Richardson
  2014-09-18 15:29           ` Neil Horman
  0 siblings, 1 reply; 39+ messages in thread
From: Bruce Richardson @ 2014-09-18 13:36 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

On Wed, Sep 17, 2014 at 01:59:36PM -0400, Neil Horman wrote:
> On Wed, Sep 17, 2014 at 03:35:19PM +0000, Richardson, Bruce wrote:
> > 
> > > -----Original Message-----
> > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > Sent: Wednesday, September 17, 2014 4:21 PM
> > > To: Richardson, Bruce
> > > Cc: dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path tx
> > > perf
> > > 
> > > On Wed, Sep 17, 2014 at 11:01:39AM +0100, Bruce Richardson wrote:
> > > > Make a small improvement to slow path TX performance by adding in a
> > > > prefetch for the second mbuf cache line.
> > > > Also move assignment of l2/l3 length values only when needed.
> > > >
> > > > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > > > ---
> > > >  lib/librte_pmd_ixgbe/ixgbe_rxtx.c | 12 +++++++-----
> > > >  1 file changed, 7 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > index 6f702b3..c0bb49f 100644
> > > > --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > @@ -565,25 +565,26 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf
> > > **tx_pkts,
> > > >  		ixgbe_xmit_cleanup(txq);
> > > >  	}
> > > >
> > > > +	rte_prefetch0(&txe->mbuf->pool);
> > > > +
> > > 
> > > Can you explain what all of these prefetches are doing?  It looks to me like
> > > they're just fetching the first caheline of the mempool structure, which it
> > > appears amounts to the pools name.  I don't see that having any use here.
> > > 
> > This does make a decent enough performance difference in my tests (the amount varies depending on the RX path being used by testpmd). 
> > 
> > What I've done with the prefetches is two-fold:
> > 1) changed it from prefetching the mbuf (first cache line) to prefetching the mbuf pool pointer (second cache line) so that when we go to access the pool pointer to free transmitted mbufs we don't get a cache miss. When clearing the ring and freeing mbufs, the pool pointer is the only mbuf field used, so we don't need that first cache line.
> ok, this makes some sense, but you're not guaranteed to either have that
> prefetch be needed, nor are you certain it will still be in cache by the time
> you get to the free call.  Seems like it might be preferable to prefecth the
> data pointed to by tx_pkt, as you're sure to use that every loop iteration.

The vast majority of the times the prefetch is necessary, and it does help 
performance doing things this way. If the prefetch is not necessary, it's 
just one extra instruction, while, if it is needed, having the prefetch 
occur 20 cycles before access (picking an arbitrary value) means that we 
have cut down the time it takes to pull the data from cache when it is 
needed by 20 cycles. As for the value pointed to by tx_pkt, since this is a 
packet the app has just been working on, it's almost certainly already in 
l1/l2 cache.  

> 
> > 2) changed the code to prefetch earlier - in effect to prefetch one mbuf ahead. The original code prefetched the mbuf to be freed as soon as it started processing the mbuf to replace it. Instead now, every time we calculate what the next mbuf position is going to be we prefetch the mbuf in that position (i.e. the mbuf pool pointer we are going to free the mbuf to), even while we are still updating the previous mbuf slot on the ring. This gives the prefetch much more time to resolve and get the data we need in the cache before we need it.
> > 
> Again, early isn't necessecarily better, as it just means more time for the data
> in cache to get victimized. It seems like it would be better to prefetch the
> tx_pkts data a few cache lines ahead.
> 
> Neil

Basically it all comes down to measured performance - working with 
prefetches is not an exactly science, sadly. I've just re-run a quick sanity 
test on this patch in the sequence. Running with testpmd on a single core, 
40G of small packet input, I see considerable performance increases. What 
I've run is:
* testpmd with a single forwarding core, defaults - which means slow path RX 
+ slow path TX (i.e. this code): Performance with this patch increases by 
almost 8%
* testpmd with a single forwarding core, defaults + rxfreet=32 - which means 
vector RX path +  slow path TX (again, this code path): Performance 
increases by over 18%.

Given these numbers, the prefetching seems better this way.  Perhaps you 
could run some tests yourself and see if you see a similar performance delta 
(or perhaps there are other scenarios I'm missing here)?

Regards,
/Bruce

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path tx perf
  2014-09-18 13:36         ` Bruce Richardson
@ 2014-09-18 15:29           ` Neil Horman
  2014-09-18 15:42             ` Bruce Richardson
  0 siblings, 1 reply; 39+ messages in thread
From: Neil Horman @ 2014-09-18 15:29 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

On Thu, Sep 18, 2014 at 02:36:13PM +0100, Bruce Richardson wrote:
> On Wed, Sep 17, 2014 at 01:59:36PM -0400, Neil Horman wrote:
> > On Wed, Sep 17, 2014 at 03:35:19PM +0000, Richardson, Bruce wrote:
> > > 
> > > > -----Original Message-----
> > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > Sent: Wednesday, September 17, 2014 4:21 PM
> > > > To: Richardson, Bruce
> > > > Cc: dev@dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path tx
> > > > perf
> > > > 
> > > > On Wed, Sep 17, 2014 at 11:01:39AM +0100, Bruce Richardson wrote:
> > > > > Make a small improvement to slow path TX performance by adding in a
> > > > > prefetch for the second mbuf cache line.
> > > > > Also move assignment of l2/l3 length values only when needed.
> > > > >
> > > > > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > > > > ---
> > > > >  lib/librte_pmd_ixgbe/ixgbe_rxtx.c | 12 +++++++-----
> > > > >  1 file changed, 7 insertions(+), 5 deletions(-)
> > > > >
> > > > > diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > > index 6f702b3..c0bb49f 100644
> > > > > --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > > +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > > @@ -565,25 +565,26 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf
> > > > **tx_pkts,
> > > > >  		ixgbe_xmit_cleanup(txq);
> > > > >  	}
> > > > >
> > > > > +	rte_prefetch0(&txe->mbuf->pool);
> > > > > +
> > > > 
> > > > Can you explain what all of these prefetches are doing?  It looks to me like
> > > > they're just fetching the first caheline of the mempool structure, which it
> > > > appears amounts to the pools name.  I don't see that having any use here.
> > > > 
> > > This does make a decent enough performance difference in my tests (the amount varies depending on the RX path being used by testpmd). 
> > > 
> > > What I've done with the prefetches is two-fold:
> > > 1) changed it from prefetching the mbuf (first cache line) to prefetching the mbuf pool pointer (second cache line) so that when we go to access the pool pointer to free transmitted mbufs we don't get a cache miss. When clearing the ring and freeing mbufs, the pool pointer is the only mbuf field used, so we don't need that first cache line.
> > ok, this makes some sense, but you're not guaranteed to either have that
> > prefetch be needed, nor are you certain it will still be in cache by the time
> > you get to the free call.  Seems like it might be preferable to prefecth the
> > data pointed to by tx_pkt, as you're sure to use that every loop iteration.
> 
> The vast majority of the times the prefetch is necessary, and it does help 
> performance doing things this way. If the prefetch is not necessary, it's 
> just one extra instruction, while, if it is needed, having the prefetch 
> occur 20 cycles before access (picking an arbitrary value) means that we 
> have cut down the time it takes to pull the data from cache when it is 
> needed by 20 cycles.
I understand how prefetch works. What I'm concerned about is its overuse, and
its tendency to frequently need re-calibration (though I admit I missed the &
operator in the patch, and thought you were prefetching the contents of the
struct, not the pointer value itself).  As you say, if the pool pointer is
almost certain to be used, then it may well make sense to prefetch the data, but
in doing so, you potentially evict something that you were about to use, so
you're not doing yourself any favors.  I understand that you've validated this
experimentally, and so it works, right now.  I just like to be very careful
about how prefetch happens, as it can easily (and sliently) start hurting far
more than it helps.

> As for the value pointed to by tx_pkt, since this is a 
> packet the app has just been working on, it's almost certainly already in 
> l1/l2 cache.  
> 
Not sure I follow you here.  tx_pkts is an array of mbufs passed to the pmd from
rte_eth_tx_burts, which in turn is called by the application.  I don't see any
reasonable guarantee that any of those packets have been touch in sufficiently
recent history that they are likely to be in cache.  It seems like, if you do
want to do prefetching, interrotagting nb_tx and doing a prefetch of an
approriate stride to fill multiple cachelines with successive mbuf headers might
provide superior performance.
Neil

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path tx perf
  2014-09-18 15:29           ` Neil Horman
@ 2014-09-18 15:42             ` Bruce Richardson
  2014-09-18 17:56               ` Neil Horman
  0 siblings, 1 reply; 39+ messages in thread
From: Bruce Richardson @ 2014-09-18 15:42 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

On Thu, Sep 18, 2014 at 11:29:30AM -0400, Neil Horman wrote:
> On Thu, Sep 18, 2014 at 02:36:13PM +0100, Bruce Richardson wrote:
> > On Wed, Sep 17, 2014 at 01:59:36PM -0400, Neil Horman wrote:
> > > On Wed, Sep 17, 2014 at 03:35:19PM +0000, Richardson, Bruce wrote:
> > > > 
> > > > > -----Original Message-----
> > > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > > Sent: Wednesday, September 17, 2014 4:21 PM
> > > > > To: Richardson, Bruce
> > > > > Cc: dev@dpdk.org
> > > > > Subject: Re: [dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path tx
> > > > > perf
> > > > > 
> > > > > On Wed, Sep 17, 2014 at 11:01:39AM +0100, Bruce Richardson wrote:
> > > > > > Make a small improvement to slow path TX performance by adding in a
> > > > > > prefetch for the second mbuf cache line.
> > > > > > Also move assignment of l2/l3 length values only when needed.
> > > > > >
> > > > > > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > > > > > ---
> > > > > >  lib/librte_pmd_ixgbe/ixgbe_rxtx.c | 12 +++++++-----
> > > > > >  1 file changed, 7 insertions(+), 5 deletions(-)
> > > > > >
> > > > > > diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > > b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > > > index 6f702b3..c0bb49f 100644
> > > > > > --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > > > +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > > > @@ -565,25 +565,26 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf
> > > > > **tx_pkts,
> > > > > >  		ixgbe_xmit_cleanup(txq);
> > > > > >  	}
> > > > > >
> > > > > > +	rte_prefetch0(&txe->mbuf->pool);
> > > > > > +
> > > > > 
> > > > > Can you explain what all of these prefetches are doing?  It looks to me like
> > > > > they're just fetching the first caheline of the mempool structure, which it
> > > > > appears amounts to the pools name.  I don't see that having any use here.
> > > > > 
> > > > This does make a decent enough performance difference in my tests (the amount varies depending on the RX path being used by testpmd). 
> > > > 
> > > > What I've done with the prefetches is two-fold:
> > > > 1) changed it from prefetching the mbuf (first cache line) to prefetching the mbuf pool pointer (second cache line) so that when we go to access the pool pointer to free transmitted mbufs we don't get a cache miss. When clearing the ring and freeing mbufs, the pool pointer is the only mbuf field used, so we don't need that first cache line.
> > > ok, this makes some sense, but you're not guaranteed to either have that
> > > prefetch be needed, nor are you certain it will still be in cache by the time
> > > you get to the free call.  Seems like it might be preferable to prefecth the
> > > data pointed to by tx_pkt, as you're sure to use that every loop iteration.
> > 
> > The vast majority of the times the prefetch is necessary, and it does help 
> > performance doing things this way. If the prefetch is not necessary, it's 
> > just one extra instruction, while, if it is needed, having the prefetch 
> > occur 20 cycles before access (picking an arbitrary value) means that we 
> > have cut down the time it takes to pull the data from cache when it is 
> > needed by 20 cycles.
> I understand how prefetch works. What I'm concerned about is its overuse, and
> its tendency to frequently need re-calibration (though I admit I missed the &
> operator in the patch, and thought you were prefetching the contents of the
> struct, not the pointer value itself).  As you say, if the pool pointer is
> almost certain to be used, then it may well make sense to prefetch the data, but
> in doing so, you potentially evict something that you were about to use, so
> you're not doing yourself any favors.  I understand that you've validated this
> experimentally, and so it works, right now.  I just like to be very careful
> about how prefetch happens, as it can easily (and sliently) start hurting far
> more than it helps.
> 
> > As for the value pointed to by tx_pkt, since this is a 
> > packet the app has just been working on, it's almost certainly already in 
> > l1/l2 cache.  
> > 
> Not sure I follow you here.  tx_pkts is an array of mbufs passed to the pmd from
> rte_eth_tx_burts, which in turn is called by the application.  I don't see any
> reasonable guarantee that any of those packets have been touch in sufficiently
> recent history that they are likely to be in cache.  It seems like, if you do
> want to do prefetching, interrotagting nb_tx and doing a prefetch of an
> approriate stride to fill multiple cachelines with successive mbuf headers might
> provide superior performance.
> Neil
>
Prefetching the mbuf is probably best left to the application. For all our 
sample applications used for benchmarking, and almost certainly the vast 
majority of all our example applications, the packet being transmitted is 
already in cache on the core itself. Adding a prefetch to the tx function I 
would expect to see a performance decrease in both testpmd and l3fwd apps.  
I would be useful for apps where the packets are passed from one core to 
another core which does no processing of them before transmitting them - but 
in that case, it's better to have the TX thread of the app do the prefetch 
rather than forcing it in the driver and reduce the performance of those 
apps that have the packets already in cache.

The prefetch added by the patch under discussion doesn't suffer from this 
issue as the data being prefetched is for the mbuf that was previously 
transmitted some time previously, and the tx function has fully looped back 
around the TX ring to get to it again.

/Bruce

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32
  2014-09-17 15:29   ` Neil Horman
@ 2014-09-18 15:53     ` Richardson, Bruce
  2014-09-18 17:13       ` Thomas Monjalon
  2014-09-18 18:03       ` Neil Horman
  0 siblings, 2 replies; 39+ messages in thread
From: Richardson, Bruce @ 2014-09-18 15:53 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Wednesday, September 17, 2014 4:30 PM
> To: Richardson, Bruce
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32
> 
> On Wed, Sep 17, 2014 at 11:01:40AM +0100, Bruce Richardson wrote:
> > To improve performance by using bulk alloc or vectored RX routines, we
> > need to set rx free threshold (rxfreet) value to 32, so make this the
> > testpmd default.
> >
> > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > ---
> >  app/test-pmd/testpmd.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
> > index 9f6cdc4..5751607 100644
> > --- a/app/test-pmd/testpmd.c
> > +++ b/app/test-pmd/testpmd.c
> > @@ -225,7 +225,7 @@ struct rte_eth_thresh tx_thresh = {
> >  /*
> >   * Configurable value of RX free threshold.
> >   */
> > -uint16_t rx_free_thresh = 0; /* Immediately free RX descriptors by default. */
> > +uint16_t rx_free_thresh = 32; /* Refill RX descriptors once every 32 packets
> */
> >
> 
> Why 32?  Was that an experimentally determined value?  Does it hold true for all
> PMD's?

This is primarily for the ixgbe PMD, which is right now the most highly tuned driver, but it works fine for all other ones too, as far as I'm aware. Basically, this is the minimum setting needed to enable either the bulk alloc or vector RX routines inside the ixgbe driver, so it's best made the default for that reason. Please see " check_rx_burst_bulk_alloc_preconditions()" in ixgbe_rxtx.c, and RX function assignment logic in "ixgbe_dev_rx_queue_setup()" in the same file. 

/Bruce

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32
  2014-09-18 15:53     ` Richardson, Bruce
@ 2014-09-18 17:13       ` Thomas Monjalon
  2014-09-18 18:08         ` Neil Horman
  2014-09-18 18:03       ` Neil Horman
  1 sibling, 1 reply; 39+ messages in thread
From: Thomas Monjalon @ 2014-09-18 17:13 UTC (permalink / raw)
  To: Richardson, Bruce; +Cc: dev

2014-09-18 15:53, Richardson, Bruce:
> > > --- a/app/test-pmd/testpmd.c
> > > +++ b/app/test-pmd/testpmd.c
> > > @@ -225,7 +225,7 @@ struct rte_eth_thresh tx_thresh = {
> > >  /*
> > >   * Configurable value of RX free threshold.
> > >   */
> > > -uint16_t rx_free_thresh = 0; /* Immediately free RX descriptors by default. */
> > > +uint16_t rx_free_thresh = 32; /* Refill RX descriptors once every 32 packets
> > */
> > >
> > 
> > Why 32?  Was that an experimentally determined value?
> > Does it hold true for all PMD's?
> 
> This is primarily for the ixgbe PMD, which is right now the most
> highly tuned driver, but it works fine for all other ones too,
> as far as I'm aware.

Yes, you are changing this value for all PMDs but you're targetting
only one.
These thresholds are dependent of the PMD implementation. There's
something wrong here.

> Basically, this is the minimum setting needed to enable either the
> bulk alloc or vector RX routines inside the ixgbe driver, so it's
> best made the default for that reason. Please see
> "check_rx_burst_bulk_alloc_preconditions()" in ixgbe_rxtx.c, and
> RX function assignment logic in "ixgbe_dev_rx_queue_setup()" in
> the same file.

Since this parameter is so important, it could be a default value somewhere.

I think we should split generic tuning parameters and tuning parameters
related to driver implementation or specific hardware.
Then we should provide some good default values for each of them.
At last, if needed, applications should be able to easily tune the
pmd-specific parameters.

Thoughts?

-- 
Thomas

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path tx perf
  2014-09-18 15:42             ` Bruce Richardson
@ 2014-09-18 17:56               ` Neil Horman
  0 siblings, 0 replies; 39+ messages in thread
From: Neil Horman @ 2014-09-18 17:56 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

On Thu, Sep 18, 2014 at 04:42:36PM +0100, Bruce Richardson wrote:
> On Thu, Sep 18, 2014 at 11:29:30AM -0400, Neil Horman wrote:
> > On Thu, Sep 18, 2014 at 02:36:13PM +0100, Bruce Richardson wrote:
> > > On Wed, Sep 17, 2014 at 01:59:36PM -0400, Neil Horman wrote:
> > > > On Wed, Sep 17, 2014 at 03:35:19PM +0000, Richardson, Bruce wrote:
> > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > > > Sent: Wednesday, September 17, 2014 4:21 PM
> > > > > > To: Richardson, Bruce
> > > > > > Cc: dev@dpdk.org
> > > > > > Subject: Re: [dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path tx
> > > > > > perf
> > > > > > 
> > > > > > On Wed, Sep 17, 2014 at 11:01:39AM +0100, Bruce Richardson wrote:
> > > > > > > Make a small improvement to slow path TX performance by adding in a
> > > > > > > prefetch for the second mbuf cache line.
> > > > > > > Also move assignment of l2/l3 length values only when needed.
> > > > > > >
> > > > > > > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > > > > > > ---
> > > > > > >  lib/librte_pmd_ixgbe/ixgbe_rxtx.c | 12 +++++++-----
> > > > > > >  1 file changed, 7 insertions(+), 5 deletions(-)
> > > > > > >
> > > > > > > diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > > > b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > > > > index 6f702b3..c0bb49f 100644
> > > > > > > --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > > > > +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > > > > > > @@ -565,25 +565,26 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf
> > > > > > **tx_pkts,
> > > > > > >  		ixgbe_xmit_cleanup(txq);
> > > > > > >  	}
> > > > > > >
> > > > > > > +	rte_prefetch0(&txe->mbuf->pool);
> > > > > > > +
> > > > > > 
> > > > > > Can you explain what all of these prefetches are doing?  It looks to me like
> > > > > > they're just fetching the first caheline of the mempool structure, which it
> > > > > > appears amounts to the pools name.  I don't see that having any use here.
> > > > > > 
> > > > > This does make a decent enough performance difference in my tests (the amount varies depending on the RX path being used by testpmd). 
> > > > > 
> > > > > What I've done with the prefetches is two-fold:
> > > > > 1) changed it from prefetching the mbuf (first cache line) to prefetching the mbuf pool pointer (second cache line) so that when we go to access the pool pointer to free transmitted mbufs we don't get a cache miss. When clearing the ring and freeing mbufs, the pool pointer is the only mbuf field used, so we don't need that first cache line.
> > > > ok, this makes some sense, but you're not guaranteed to either have that
> > > > prefetch be needed, nor are you certain it will still be in cache by the time
> > > > you get to the free call.  Seems like it might be preferable to prefecth the
> > > > data pointed to by tx_pkt, as you're sure to use that every loop iteration.
> > > 
> > > The vast majority of the times the prefetch is necessary, and it does help 
> > > performance doing things this way. If the prefetch is not necessary, it's 
> > > just one extra instruction, while, if it is needed, having the prefetch 
> > > occur 20 cycles before access (picking an arbitrary value) means that we 
> > > have cut down the time it takes to pull the data from cache when it is 
> > > needed by 20 cycles.
> > I understand how prefetch works. What I'm concerned about is its overuse, and
> > its tendency to frequently need re-calibration (though I admit I missed the &
> > operator in the patch, and thought you were prefetching the contents of the
> > struct, not the pointer value itself).  As you say, if the pool pointer is
> > almost certain to be used, then it may well make sense to prefetch the data, but
> > in doing so, you potentially evict something that you were about to use, so
> > you're not doing yourself any favors.  I understand that you've validated this
> > experimentally, and so it works, right now.  I just like to be very careful
> > about how prefetch happens, as it can easily (and sliently) start hurting far
> > more than it helps.
> > 
> > > As for the value pointed to by tx_pkt, since this is a 
> > > packet the app has just been working on, it's almost certainly already in 
> > > l1/l2 cache.  
> > > 
> > Not sure I follow you here.  tx_pkts is an array of mbufs passed to the pmd from
> > rte_eth_tx_burts, which in turn is called by the application.  I don't see any
> > reasonable guarantee that any of those packets have been touch in sufficiently
> > recent history that they are likely to be in cache.  It seems like, if you do
> > want to do prefetching, interrotagting nb_tx and doing a prefetch of an
> > approriate stride to fill multiple cachelines with successive mbuf headers might
> > provide superior performance.
> > Neil
> >
> Prefetching the mbuf is probably best left to the application. For all our 
> sample applications used for benchmarking, and almost certainly the vast 
> majority of all our example applications, the packet being transmitted is 
> already in cache on the core itself. Adding a prefetch to the tx function I 
> would expect to see a performance decrease in both testpmd and l3fwd apps.  
> I would be useful for apps where the packets are passed from one core to 
> another core which does no processing of them before transmitting them - but 
> in that case, it's better to have the TX thread of the app do the prefetch 
> rather than forcing it in the driver and reduce the performance of those 
> apps that have the packets already in cache.
> 

Regarding the performance decrease, I think you're trying to have it both ways
here.  Above you indicate that if the prefetch of the pool pointer isn't needed
its just an extra instruction, which I think is true.  But now you are saying
that if the tx buffers are in cache, the extra instructions will have an impact.
Granted its potentially nb_tx prefetches, not one, but none of them stall the
cpu pipeline as far as Im aware, so I can't imagine 1 prefetch vs several will
have a significant impact on performance.

Regarding where to do prefecth.  Leaving prefetch in the hands of the application is a
bad idea, because the application has no visibility into the code path once you
enter the DPDK. It doesn't know if the buffers are going to be accessed in 20
cycles or 20,000 cycles, which will be all the difference between a useful and
harmful prefetch.  Sure you can calibrate your application to correspond to a
given version of the dpdk and optimize such a prefetch, but that will be
completely obsoleted the first time the dpdk transmit path changes.

As for the use of prefetching tx buffers at all, I think theres several cases
where you might find that those buffers are vicimized in cache.  consider the
situation where a receive interrupt triggers on a cpu right before rte_eth_trans
is called.  For a heavily loaded system, the receive buffers may frequently push
the soon-to-be-transmitted buffers out of cache.

> The prefetch added by the patch under discussion doesn't suffer from this 
> issue as the data being prefetched is for the mbuf that was previously 
> transmitted some time previously, and the tx function has fully looped back 
> around the TX ring to get to it again.
> 

I get what you're saying here, that after the first prefetch the data stays hot
in cache because it is continually re-accessed.  Thats fine.  But that would
happen after the first fetch anyway, without the prefetch.

You know what would put this argument to rest?  If you could run whatever
benchmark you were running under the perf utility so we could see the L1 cache
misses from the baseline dpdk, the variant where you prefetch the pool pointer,
and a variant in which you prefetch the next tx buf at the top of the loop.

Neil



> /Bruce
> 
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32
  2014-09-18 15:53     ` Richardson, Bruce
  2014-09-18 17:13       ` Thomas Monjalon
@ 2014-09-18 18:03       ` Neil Horman
  1 sibling, 0 replies; 39+ messages in thread
From: Neil Horman @ 2014-09-18 18:03 UTC (permalink / raw)
  To: Richardson, Bruce; +Cc: dev

On Thu, Sep 18, 2014 at 03:53:33PM +0000, Richardson, Bruce wrote:
> > -----Original Message-----
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Wednesday, September 17, 2014 4:30 PM
> > To: Richardson, Bruce
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32
> > 
> > On Wed, Sep 17, 2014 at 11:01:40AM +0100, Bruce Richardson wrote:
> > > To improve performance by using bulk alloc or vectored RX routines, we
> > > need to set rx free threshold (rxfreet) value to 32, so make this the
> > > testpmd default.
> > >
> > > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > > ---
> > >  app/test-pmd/testpmd.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
> > > index 9f6cdc4..5751607 100644
> > > --- a/app/test-pmd/testpmd.c
> > > +++ b/app/test-pmd/testpmd.c
> > > @@ -225,7 +225,7 @@ struct rte_eth_thresh tx_thresh = {
> > >  /*
> > >   * Configurable value of RX free threshold.
> > >   */
> > > -uint16_t rx_free_thresh = 0; /* Immediately free RX descriptors by default. */
> > > +uint16_t rx_free_thresh = 32; /* Refill RX descriptors once every 32 packets
> > */
> > >
> > 
> > Why 32?  Was that an experimentally determined value?  Does it hold true for all
> > PMD's?
> 
> This is primarily for the ixgbe PMD, which is right now the most highly tuned driver, but it works fine for all other ones too, as far as I'm aware. Basically, this is the minimum setting needed to enable either the bulk alloc or vector RX routines inside the ixgbe driver, so it's best made the default for that reason. Please see " check_rx_burst_bulk_alloc_preconditions()" in ixgbe_rxtx.c, and RX function assignment logic in "ixgbe_dev_rx_queue_setup()" in the same file. 
> 

Maybe codify that information as a macro, or with some documentation right above
the setting, so people not using ixgbe don't have to wonder where that value
came from? :)

Neil

> /Bruce
> 
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32
  2014-09-18 17:13       ` Thomas Monjalon
@ 2014-09-18 18:08         ` Neil Horman
  2014-09-19  9:18           ` Richardson, Bruce
  0 siblings, 1 reply; 39+ messages in thread
From: Neil Horman @ 2014-09-18 18:08 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev

On Thu, Sep 18, 2014 at 07:13:52PM +0200, Thomas Monjalon wrote:
> 2014-09-18 15:53, Richardson, Bruce:
> > > > --- a/app/test-pmd/testpmd.c
> > > > +++ b/app/test-pmd/testpmd.c
> > > > @@ -225,7 +225,7 @@ struct rte_eth_thresh tx_thresh = {
> > > >  /*
> > > >   * Configurable value of RX free threshold.
> > > >   */
> > > > -uint16_t rx_free_thresh = 0; /* Immediately free RX descriptors by default. */
> > > > +uint16_t rx_free_thresh = 32; /* Refill RX descriptors once every 32 packets
> > > */
> > > >
> > > 
> > > Why 32?  Was that an experimentally determined value?
> > > Does it hold true for all PMD's?
> > 
> > This is primarily for the ixgbe PMD, which is right now the most
> > highly tuned driver, but it works fine for all other ones too,
> > as far as I'm aware.
> 
> Yes, you are changing this value for all PMDs but you're targetting
> only one.
> These thresholds are dependent of the PMD implementation. There's
> something wrong here.
> 
I agree. Its fine to do this, but it does seem like the sample application
should document why it does this and make note of the fact that other PMDs may
have a separate optimal value.

> > Basically, this is the minimum setting needed to enable either the
> > bulk alloc or vector RX routines inside the ixgbe driver, so it's
> > best made the default for that reason. Please see
> > "check_rx_burst_bulk_alloc_preconditions()" in ixgbe_rxtx.c, and
> > RX function assignment logic in "ixgbe_dev_rx_queue_setup()" in
> > the same file.
> 
> Since this parameter is so important, it could be a default value somewhere.
> 
> I think we should split generic tuning parameters and tuning parameters
> related to driver implementation or specific hardware.
> Then we should provide some good default values for each of them.
> At last, if needed, applications should be able to easily tune the
> pmd-specific parameters.
> 
I like this idea.  I've not got an idea of how much work it is to do so, but in
principle it makes sense.

Perhaps for the immediate need, since rte_eth_rx_queue_setup allows the config
struct to get passed directly to PMDs, we can create a reserved value
RTE_ETH_RX_FREE_THRESH_OPTIMAL, that instructs the pmd to select whatever
threshold is optimal for its own hardware?

Neil

> Thoughts?
> 
> -- 
> Thomas
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32
  2014-09-18 18:08         ` Neil Horman
@ 2014-09-19  9:18           ` Richardson, Bruce
  2014-09-19 10:24             ` Neil Horman
  0 siblings, 1 reply; 39+ messages in thread
From: Richardson, Bruce @ 2014-09-19  9:18 UTC (permalink / raw)
  To: Neil Horman, Thomas Monjalon; +Cc: dev

> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Thursday, September 18, 2014 7:09 PM
> To: Thomas Monjalon
> Cc: Richardson, Bruce; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32
> 
> On Thu, Sep 18, 2014 at 07:13:52PM +0200, Thomas Monjalon wrote:
> > 2014-09-18 15:53, Richardson, Bruce:
> > > > > --- a/app/test-pmd/testpmd.c
> > > > > +++ b/app/test-pmd/testpmd.c
> > > > > @@ -225,7 +225,7 @@ struct rte_eth_thresh tx_thresh = {
> > > > >  /*
> > > > >   * Configurable value of RX free threshold.
> > > > >   */
> > > > > -uint16_t rx_free_thresh = 0; /* Immediately free RX descriptors by
> default. */
> > > > > +uint16_t rx_free_thresh = 32; /* Refill RX descriptors once every 32
> packets
> > > > */
> > > > >
> > > >
> > > > Why 32?  Was that an experimentally determined value?
> > > > Does it hold true for all PMD's?
> > >
> > > This is primarily for the ixgbe PMD, which is right now the most
> > > highly tuned driver, but it works fine for all other ones too,
> > > as far as I'm aware.
> >
> > Yes, you are changing this value for all PMDs but you're targetting
> > only one.
> > These thresholds are dependent of the PMD implementation. There's
> > something wrong here.
> >
> I agree. Its fine to do this, but it does seem like the sample application
> should document why it does this and make note of the fact that other PMDs
> may
> have a separate optimal value.
> 
> > > Basically, this is the minimum setting needed to enable either the
> > > bulk alloc or vector RX routines inside the ixgbe driver, so it's
> > > best made the default for that reason. Please see
> > > "check_rx_burst_bulk_alloc_preconditions()" in ixgbe_rxtx.c, and
> > > RX function assignment logic in "ixgbe_dev_rx_queue_setup()" in
> > > the same file.
> >
> > Since this parameter is so important, it could be a default value somewhere.
> >
> > I think we should split generic tuning parameters and tuning parameters
> > related to driver implementation or specific hardware.
> > Then we should provide some good default values for each of them.
> > At last, if needed, applications should be able to easily tune the
> > pmd-specific parameters.
> >
> I like this idea.  I've not got an idea of how much work it is to do so, but in
> principle it makes sense.
> 
> Perhaps for the immediate need, since rte_eth_rx_queue_setup allows the
> config
> struct to get passed directly to PMDs, we can create a reserved value
> RTE_ETH_RX_FREE_THRESH_OPTIMAL, that instructs the pmd to select
> whatever
> threshold is optimal for its own hardware?
> 
> Neil
> 
Actually, looking at the code, I would suggest a couple of options, some of which may be used together.
	1) we make NULL a valid value for the rxconf structure parameter to rte_eth_rx_queue_setup. There is little information in it that should really need to be passed in by applications to the drivers, and that would allow the drivers to be completely free to select the best options for their own operation. 
	2) As a companion to that (or as an alternative), we could also allow each driver to provide its own functions for rte_eth_get_rxconf_default, and rte_eth_get_txconf_default, to be used by applications that want to use known-good values for thresholds but also want to tweak one of the other values e.g. for rx, set the drop_en bit, and for tx set the txqflags to disable offloads.
	3) Lastly, we could also consider removing the threshold and other not-generally-used values from the rxconf and txconf structures and make those removed fields completely driver-set values. Optionally, we could provide an alternate API to tune them, but I don't really see this being useful in most cases, and I'd probably omit it unless someone can prove a need for such APIs.

Regards,
/Bruce

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32
  2014-09-19  9:18           ` Richardson, Bruce
@ 2014-09-19 10:24             ` Neil Horman
  2014-09-19 10:28               ` Richardson, Bruce
  0 siblings, 1 reply; 39+ messages in thread
From: Neil Horman @ 2014-09-19 10:24 UTC (permalink / raw)
  To: Richardson, Bruce; +Cc: dev

On Fri, Sep 19, 2014 at 09:18:26AM +0000, Richardson, Bruce wrote:
> > -----Original Message-----
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Thursday, September 18, 2014 7:09 PM
> > To: Thomas Monjalon
> > Cc: Richardson, Bruce; dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32
> > 
> > On Thu, Sep 18, 2014 at 07:13:52PM +0200, Thomas Monjalon wrote:
> > > 2014-09-18 15:53, Richardson, Bruce:
> > > > > > --- a/app/test-pmd/testpmd.c
> > > > > > +++ b/app/test-pmd/testpmd.c
> > > > > > @@ -225,7 +225,7 @@ struct rte_eth_thresh tx_thresh = {
> > > > > >  /*
> > > > > >   * Configurable value of RX free threshold.
> > > > > >   */
> > > > > > -uint16_t rx_free_thresh = 0; /* Immediately free RX descriptors by
> > default. */
> > > > > > +uint16_t rx_free_thresh = 32; /* Refill RX descriptors once every 32
> > packets
> > > > > */
> > > > > >
> > > > >
> > > > > Why 32?  Was that an experimentally determined value?
> > > > > Does it hold true for all PMD's?
> > > >
> > > > This is primarily for the ixgbe PMD, which is right now the most
> > > > highly tuned driver, but it works fine for all other ones too,
> > > > as far as I'm aware.
> > >
> > > Yes, you are changing this value for all PMDs but you're targetting
> > > only one.
> > > These thresholds are dependent of the PMD implementation. There's
> > > something wrong here.
> > >
> > I agree. Its fine to do this, but it does seem like the sample application
> > should document why it does this and make note of the fact that other PMDs
> > may
> > have a separate optimal value.
> > 
> > > > Basically, this is the minimum setting needed to enable either the
> > > > bulk alloc or vector RX routines inside the ixgbe driver, so it's
> > > > best made the default for that reason. Please see
> > > > "check_rx_burst_bulk_alloc_preconditions()" in ixgbe_rxtx.c, and
> > > > RX function assignment logic in "ixgbe_dev_rx_queue_setup()" in
> > > > the same file.
> > >
> > > Since this parameter is so important, it could be a default value somewhere.
> > >
> > > I think we should split generic tuning parameters and tuning parameters
> > > related to driver implementation or specific hardware.
> > > Then we should provide some good default values for each of them.
> > > At last, if needed, applications should be able to easily tune the
> > > pmd-specific parameters.
> > >
> > I like this idea.  I've not got an idea of how much work it is to do so, but in
> > principle it makes sense.
> > 
> > Perhaps for the immediate need, since rte_eth_rx_queue_setup allows the
> > config
> > struct to get passed directly to PMDs, we can create a reserved value
> > RTE_ETH_RX_FREE_THRESH_OPTIMAL, that instructs the pmd to select
> > whatever
> > threshold is optimal for its own hardware?
> > 
> > Neil
> > 
> Actually, looking at the code, I would suggest a couple of options, some of which may be used together.
> 	1) we make NULL a valid value for the rxconf structure parameter to rte_eth_rx_queue_setup. There is little information in it that should really need to be passed in by applications to the drivers, and that would allow the drivers to be completely free to select the best options for their own operation. 
> 	2) As a companion to that (or as an alternative), we could also allow each driver to provide its own functions for rte_eth_get_rxconf_default, and rte_eth_get_txconf_default, to be used by applications that want to use known-good values for thresholds but also want to tweak one of the other values e.g. for rx, set the drop_en bit, and for tx set the txqflags to disable offloads.
> 	3) Lastly, we could also consider removing the threshold and other not-generally-used values from the rxconf and txconf structures and make those removed fields completely driver-set values. Optionally, we could provide an alternate API to tune them, but I don't really see this being useful in most cases, and I'd probably omit it unless someone can prove a need for such APIs.
> 
These all sound fairly reasonable to me.
Neil

> Regards,
> /Bruce
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32
  2014-09-19 10:24             ` Neil Horman
@ 2014-09-19 10:28               ` Richardson, Bruce
  2014-09-19 15:18                 ` Neil Horman
  0 siblings, 1 reply; 39+ messages in thread
From: Richardson, Bruce @ 2014-09-19 10:28 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev



> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Friday, September 19, 2014 11:25 AM
> To: Richardson, Bruce
> Cc: Thomas Monjalon; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32
> 
> On Fri, Sep 19, 2014 at 09:18:26AM +0000, Richardson, Bruce wrote:
> > > -----Original Message-----
> > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > Sent: Thursday, September 18, 2014 7:09 PM
> > > To: Thomas Monjalon
> > > Cc: Richardson, Bruce; dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32
> > >
> > > On Thu, Sep 18, 2014 at 07:13:52PM +0200, Thomas Monjalon wrote:
> > > > 2014-09-18 15:53, Richardson, Bruce:
> > > > > > > --- a/app/test-pmd/testpmd.c
> > > > > > > +++ b/app/test-pmd/testpmd.c
> > > > > > > @@ -225,7 +225,7 @@ struct rte_eth_thresh tx_thresh = {
> > > > > > >  /*
> > > > > > >   * Configurable value of RX free threshold.
> > > > > > >   */
> > > > > > > -uint16_t rx_free_thresh = 0; /* Immediately free RX descriptors by
> > > default. */
> > > > > > > +uint16_t rx_free_thresh = 32; /* Refill RX descriptors once every 32
> > > packets
> > > > > > */
> > > > > > >
> > > > > >
> > > > > > Why 32?  Was that an experimentally determined value?
> > > > > > Does it hold true for all PMD's?
> > > > >
> > > > > This is primarily for the ixgbe PMD, which is right now the most
> > > > > highly tuned driver, but it works fine for all other ones too,
> > > > > as far as I'm aware.
> > > >
> > > > Yes, you are changing this value for all PMDs but you're targetting
> > > > only one.
> > > > These thresholds are dependent of the PMD implementation. There's
> > > > something wrong here.
> > > >
> > > I agree. Its fine to do this, but it does seem like the sample application
> > > should document why it does this and make note of the fact that other PMDs
> > > may
> > > have a separate optimal value.
> > >
> > > > > Basically, this is the minimum setting needed to enable either the
> > > > > bulk alloc or vector RX routines inside the ixgbe driver, so it's
> > > > > best made the default for that reason. Please see
> > > > > "check_rx_burst_bulk_alloc_preconditions()" in ixgbe_rxtx.c, and
> > > > > RX function assignment logic in "ixgbe_dev_rx_queue_setup()" in
> > > > > the same file.
> > > >
> > > > Since this parameter is so important, it could be a default value
> somewhere.
> > > >
> > > > I think we should split generic tuning parameters and tuning parameters
> > > > related to driver implementation or specific hardware.
> > > > Then we should provide some good default values for each of them.
> > > > At last, if needed, applications should be able to easily tune the
> > > > pmd-specific parameters.
> > > >
> > > I like this idea.  I've not got an idea of how much work it is to do so, but in
> > > principle it makes sense.
> > >
> > > Perhaps for the immediate need, since rte_eth_rx_queue_setup allows the
> > > config
> > > struct to get passed directly to PMDs, we can create a reserved value
> > > RTE_ETH_RX_FREE_THRESH_OPTIMAL, that instructs the pmd to select
> > > whatever
> > > threshold is optimal for its own hardware?
> > >
> > > Neil
> > >
> > Actually, looking at the code, I would suggest a couple of options, some of
> which may be used together.
> > 	1) we make NULL a valid value for the rxconf structure parameter to
> rte_eth_rx_queue_setup. There is little information in it that should really need
> to be passed in by applications to the drivers, and that would allow the drivers to
> be completely free to select the best options for their own operation.
> > 	2) As a companion to that (or as an alternative), we could also allow
> each driver to provide its own functions for rte_eth_get_rxconf_default, and
> rte_eth_get_txconf_default, to be used by applications that want to use known-
> good values for thresholds but also want to tweak one of the other values e.g.
> for rx, set the drop_en bit, and for tx set the txqflags to disable offloads.
> > 	3) Lastly, we could also consider removing the threshold and other not-
> generally-used values from the rxconf and txconf structures and make those
> removed fields completely driver-set values. Optionally, we could provide an
> alternate API to tune them, but I don't really see this being useful in most cases,
> and I'd probably omit it unless someone can prove a need for such APIs.
> >
> These all sound fairly reasonable to me.
> Neil

Further thinking seems to me like 1 doesn't really go very far, so it falls between 2 and 3. Any preference between them?

/Bruce

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32
  2014-09-19 10:28               ` Richardson, Bruce
@ 2014-09-19 15:18                 ` Neil Horman
  0 siblings, 0 replies; 39+ messages in thread
From: Neil Horman @ 2014-09-19 15:18 UTC (permalink / raw)
  To: Richardson, Bruce; +Cc: dev

On Fri, Sep 19, 2014 at 10:28:31AM +0000, Richardson, Bruce wrote:
> 
> 
> > -----Original Message-----
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Friday, September 19, 2014 11:25 AM
> > To: Richardson, Bruce
> > Cc: Thomas Monjalon; dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32
> > 
> > On Fri, Sep 19, 2014 at 09:18:26AM +0000, Richardson, Bruce wrote:
> > > > -----Original Message-----
> > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > Sent: Thursday, September 18, 2014 7:09 PM
> > > > To: Thomas Monjalon
> > > > Cc: Richardson, Bruce; dev@dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32
> > > >
> > > > On Thu, Sep 18, 2014 at 07:13:52PM +0200, Thomas Monjalon wrote:
> > > > > 2014-09-18 15:53, Richardson, Bruce:
> > > > > > > > --- a/app/test-pmd/testpmd.c
> > > > > > > > +++ b/app/test-pmd/testpmd.c
> > > > > > > > @@ -225,7 +225,7 @@ struct rte_eth_thresh tx_thresh = {
> > > > > > > >  /*
> > > > > > > >   * Configurable value of RX free threshold.
> > > > > > > >   */
> > > > > > > > -uint16_t rx_free_thresh = 0; /* Immediately free RX descriptors by
> > > > default. */
> > > > > > > > +uint16_t rx_free_thresh = 32; /* Refill RX descriptors once every 32
> > > > packets
> > > > > > > */
> > > > > > > >
> > > > > > >
> > > > > > > Why 32?  Was that an experimentally determined value?
> > > > > > > Does it hold true for all PMD's?
> > > > > >
> > > > > > This is primarily for the ixgbe PMD, which is right now the most
> > > > > > highly tuned driver, but it works fine for all other ones too,
> > > > > > as far as I'm aware.
> > > > >
> > > > > Yes, you are changing this value for all PMDs but you're targetting
> > > > > only one.
> > > > > These thresholds are dependent of the PMD implementation. There's
> > > > > something wrong here.
> > > > >
> > > > I agree. Its fine to do this, but it does seem like the sample application
> > > > should document why it does this and make note of the fact that other PMDs
> > > > may
> > > > have a separate optimal value.
> > > >
> > > > > > Basically, this is the minimum setting needed to enable either the
> > > > > > bulk alloc or vector RX routines inside the ixgbe driver, so it's
> > > > > > best made the default for that reason. Please see
> > > > > > "check_rx_burst_bulk_alloc_preconditions()" in ixgbe_rxtx.c, and
> > > > > > RX function assignment logic in "ixgbe_dev_rx_queue_setup()" in
> > > > > > the same file.
> > > > >
> > > > > Since this parameter is so important, it could be a default value
> > somewhere.
> > > > >
> > > > > I think we should split generic tuning parameters and tuning parameters
> > > > > related to driver implementation or specific hardware.
> > > > > Then we should provide some good default values for each of them.
> > > > > At last, if needed, applications should be able to easily tune the
> > > > > pmd-specific parameters.
> > > > >
> > > > I like this idea.  I've not got an idea of how much work it is to do so, but in
> > > > principle it makes sense.
> > > >
> > > > Perhaps for the immediate need, since rte_eth_rx_queue_setup allows the
> > > > config
> > > > struct to get passed directly to PMDs, we can create a reserved value
> > > > RTE_ETH_RX_FREE_THRESH_OPTIMAL, that instructs the pmd to select
> > > > whatever
> > > > threshold is optimal for its own hardware?
> > > >
> > > > Neil
> > > >
> > > Actually, looking at the code, I would suggest a couple of options, some of
> > which may be used together.
> > > 	1) we make NULL a valid value for the rxconf structure parameter to
> > rte_eth_rx_queue_setup. There is little information in it that should really need
> > to be passed in by applications to the drivers, and that would allow the drivers to
> > be completely free to select the best options for their own operation.
> > > 	2) As a companion to that (or as an alternative), we could also allow
> > each driver to provide its own functions for rte_eth_get_rxconf_default, and
> > rte_eth_get_txconf_default, to be used by applications that want to use known-
> > good values for thresholds but also want to tweak one of the other values e.g.
> > for rx, set the drop_en bit, and for tx set the txqflags to disable offloads.
> > > 	3) Lastly, we could also consider removing the threshold and other not-
> > generally-used values from the rxconf and txconf structures and make those
> > removed fields completely driver-set values. Optionally, we could provide an
> > alternate API to tune them, but I don't really see this being useful in most cases,
> > and I'd probably omit it unless someone can prove a need for such APIs.
> > >
> > These all sound fairly reasonable to me.
> > Neil
> 
> Further thinking seems to me like 1 doesn't really go very far, so it falls between 2 and 3. Any preference between them?
> 
Not to answer a question with a question, but, because I'm not really sure, how
much does an application really need to know or set in regards to hardware queue
parameters.  I ask because I'm inclined to just go with option 3 (since it
prevents expansion of the application visible API), but I'm not sure if theres
important functionality you loose in doing so.

Neil

> /Bruce
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [dpdk-dev] [PATCH v2 0/5] Mbuf Structure Rework, part 3
  2014-09-17 10:01 [dpdk-dev] [PATCH 0/5] Mbuf Structure Rework, part 3 Bruce Richardson
                   ` (4 preceding siblings ...)
  2014-09-17 10:01 ` [dpdk-dev] [PATCH 5/5] mbuf: Add in second vlan tag field to mbuf Bruce Richardson
@ 2014-09-23 11:08 ` Bruce Richardson
  2014-09-23 11:08   ` [dpdk-dev] [PATCH v2 1/5] mbuf: ensure next pointer is set to null on free Bruce Richardson
                     ` (5 more replies)
  5 siblings, 6 replies; 39+ messages in thread
From: Bruce Richardson @ 2014-09-23 11:08 UTC (permalink / raw)
  To: dev

This is the final planned set of patches to make changes to the mbuf
data structure and associated files. This patch set makes more changes to
help improve performance following the mbuf changes and adds in two new
fields into the mbuf structure.

It is planned to add other fields other than the two provided here, but
patches for adding those fields will be included in the patch sets for the
changes making use of those fields, since adding them does not affect, or
move, any other mbuf fields.

Changes in V2:
* Updated userdata pointer in mbuf to always be 8 bytes big
* Updated a number of commit messages to have more details about the performance benefits of the changes proposed in the patches
* Removed old patch 5 which added the second vlan tag, and replaced it with a new, smaller patch which just moves the existing vlan_tci field above the 16-bit reserved space.

Bruce Richardson (5):
  mbuf: ensure next pointer is set to null on free
  ixgbe: add prefetch to improve slow-path tx perf
  testpmd: Change rxfreet default to 32
  mbuf: add userdata pointer field
  mbuf: switch vlan_tci and reserved2 fields

 app/test-pmd/testpmd.c                                      |  4 +++-
 .../linuxapp/eal/include/exec-env/rte_kni_common.h          |  6 ++++--
 lib/librte_mbuf/rte_mbuf.h                                  | 12 ++++++++++--
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c                           | 13 ++++++++-----
 4 files changed, 25 insertions(+), 10 deletions(-)

-- 
1.9.3

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [dpdk-dev] [PATCH v2 1/5] mbuf: ensure next pointer is set to null on free
  2014-09-23 11:08 ` [dpdk-dev] [PATCH v2 0/5] Mbuf Structure Rework, part 3 Bruce Richardson
@ 2014-09-23 11:08   ` Bruce Richardson
  2014-09-23 11:08   ` [dpdk-dev] [PATCH v2 2/5] ixgbe: add prefetch to improve slow-path tx perf Bruce Richardson
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 39+ messages in thread
From: Bruce Richardson @ 2014-09-23 11:08 UTC (permalink / raw)
  To: dev

The receive functions for packets do not modify the next pointer so
the next pointer should always be cleared on mbuf free, just in case.
The slow-path TX needs to clear it, and the standard mbuf free function
also needs to clear it. Fast path TX does not handle chained mbufs so
is unaffected

Changes in V2:
* None

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_mbuf/rte_mbuf.h        | 4 +++-
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c | 1 +
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 1c6e115..8e27d2e 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -682,8 +682,10 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 static inline void __attribute__((always_inline))
 rte_pktmbuf_free_seg(struct rte_mbuf *m)
 {
-	if (likely(NULL != (m = __rte_pktmbuf_prefree_seg(m))))
+	if (likely(NULL != (m = __rte_pktmbuf_prefree_seg(m)))) {
+		m->next = NULL;
 		__rte_mbuf_raw_free(m);
+	}
 }
 
 /**
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index a80cade..6f702b3 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -145,6 +145,7 @@ ixgbe_tx_free_bufs(struct igb_tx_queue *txq)
 	/* free buffers one at a time */
 	if ((txq->txq_flags & (uint32_t)ETH_TXQ_FLAGS_NOREFCOUNT) != 0) {
 		for (i = 0; i < txq->tx_rs_thresh; ++i, ++txep) {
+			txep->mbuf->next = NULL;
 			rte_mempool_put(txep->mbuf->pool, txep->mbuf);
 			txep->mbuf = NULL;
 		}
-- 
1.9.3

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [dpdk-dev] [PATCH v2 2/5] ixgbe: add prefetch to improve slow-path tx perf
  2014-09-23 11:08 ` [dpdk-dev] [PATCH v2 0/5] Mbuf Structure Rework, part 3 Bruce Richardson
  2014-09-23 11:08   ` [dpdk-dev] [PATCH v2 1/5] mbuf: ensure next pointer is set to null on free Bruce Richardson
@ 2014-09-23 11:08   ` Bruce Richardson
  2014-09-23 11:08   ` [dpdk-dev] [PATCH v2 3/5] testpmd: Change rxfreet default to 32 Bruce Richardson
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 39+ messages in thread
From: Bruce Richardson @ 2014-09-23 11:08 UTC (permalink / raw)
  To: dev

Make a small improvement to slow path TX performance by adding in a
prefetch for the second mbuf cache line.
Also move assignment of l2/l3 length values only when needed.

What I've done with the prefetches is two-fold:
1) changed it from prefetching the mbuf (first cache line) to prefetching
the mbuf pool pointer (second cache line) so that when we go to access
the pool pointer to free transmitted mbufs we don't get a cache miss. When
clearing the ring and freeing mbufs, the pool pointer is the only mbuf
field used, so we don't need that first cache line.
2) changed the code to prefetch earlier - in effect to prefetch one mbuf
ahead. The original code prefetched the mbuf to be freed as soon as it
started processing the mbuf to replace it. Instead now, every time we
calculate what the next mbuf position is going to be we prefetch the mbuf
in that position (i.e. the mbuf pool pointer we are going to free the mbuf
to), even while we are still updating the previous mbuf slot on the ring.
This gives the prefetch much more time to resolve and get the data we need
in the cache before we need it.

In terms of performance difference, a quick sanity test using testpmd
on a Xeon (Sandy Bridge uarch) platform showed performance increases
between approx 8-18%, depending on the particular RX path used in
conjuntion with this TX path code.

Changes in V2:
* Expanded commit message with extra details of change.

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index 6f702b3..c0bb49f 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -565,25 +565,26 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		ixgbe_xmit_cleanup(txq);
 	}
 
+	rte_prefetch0(&txe->mbuf->pool);
+
 	/* TX loop */
 	for (nb_tx = 0; nb_tx < nb_pkts; nb_tx++) {
 		new_ctx = 0;
 		tx_pkt = *tx_pkts++;
 		pkt_len = tx_pkt->pkt_len;
 
-		RTE_MBUF_PREFETCH_TO_FREE(txe->mbuf);
-
 		/*
 		 * Determine how many (if any) context descriptors
 		 * are needed for offload functionality.
 		 */
 		ol_flags = tx_pkt->ol_flags;
-		vlan_macip_lens.f.vlan_tci = tx_pkt->vlan_tci;
-		vlan_macip_lens.f.l2_l3_len = tx_pkt->l2_l3_len;
 
 		/* If hardware offload required */
 		tx_ol_req = ol_flags & PKT_TX_OFFLOAD_MASK;
 		if (tx_ol_req) {
+			vlan_macip_lens.f.vlan_tci = tx_pkt->vlan_tci;
+			vlan_macip_lens.f.l2_l3_len = tx_pkt->l2_l3_len;
+
 			/* If new context need be built or reuse the exist ctx. */
 			ctx = what_advctx_update(txq, tx_ol_req,
 				vlan_macip_lens.data);
@@ -720,7 +721,7 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 				    &txr[tx_id];
 
 				txn = &sw_ring[txe->next_id];
-				RTE_MBUF_PREFETCH_TO_FREE(txn->mbuf);
+				rte_prefetch0(&txn->mbuf->pool);
 
 				if (txe->mbuf != NULL) {
 					rte_pktmbuf_free_seg(txe->mbuf);
@@ -749,6 +750,7 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		do {
 			txd = &txr[tx_id];
 			txn = &sw_ring[txe->next_id];
+			rte_prefetch0(&txn->mbuf->pool);
 
 			if (txe->mbuf != NULL)
 				rte_pktmbuf_free_seg(txe->mbuf);
-- 
1.9.3

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [dpdk-dev] [PATCH v2 3/5] testpmd: Change rxfreet default to 32
  2014-09-23 11:08 ` [dpdk-dev] [PATCH v2 0/5] Mbuf Structure Rework, part 3 Bruce Richardson
  2014-09-23 11:08   ` [dpdk-dev] [PATCH v2 1/5] mbuf: ensure next pointer is set to null on free Bruce Richardson
  2014-09-23 11:08   ` [dpdk-dev] [PATCH v2 2/5] ixgbe: add prefetch to improve slow-path tx perf Bruce Richardson
@ 2014-09-23 11:08   ` Bruce Richardson
  2014-09-23 17:02     ` Neil Horman
  2014-09-23 11:08   ` [dpdk-dev] [PATCH v2 4/5] mbuf: add userdata pointer field Bruce Richardson
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 39+ messages in thread
From: Bruce Richardson @ 2014-09-23 11:08 UTC (permalink / raw)
  To: dev

To improve performance by using bulk alloc or vectored RX routines, we
need to set rx free threshold (rxfreet) value to 32, so make this the
testpmd default.

Thirty-two is the minimum setting needed to enable either the
bulk alloc or vector RX routines inside the ixgbe driver, so it's
best made the default for that reason. Please see
"check_rx_burst_bulk_alloc_preconditions()" in ixgbe_rxtx.c, and
RX function assignment logic in "ixgbe_dev_rx_queue_setup()" in
the same file.

The difference in IO performance for testpmd when called without any
optional parameters, and using 10G NICs using the ixgbe driver, can be
significant - approx 25% or more.

Updates in V2:
* Updated commit message with additional details

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 app/test-pmd/testpmd.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 9f6cdc4..f76406f 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -225,7 +225,9 @@ struct rte_eth_thresh tx_thresh = {
 /*
  * Configurable value of RX free threshold.
  */
-uint16_t rx_free_thresh = 0; /* Immediately free RX descriptors by default. */
+uint16_t rx_free_thresh = 32; /* Refill RX descriptors once every 32 packets,
+		This setting is needed for ixgbe to enable bulk alloc or vector
+		receive functionality. */
 
 /*
  * Configurable value of RX drop enable.
-- 
1.9.3

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [dpdk-dev] [PATCH v2 4/5] mbuf: add userdata pointer field
  2014-09-23 11:08 ` [dpdk-dev] [PATCH v2 0/5] Mbuf Structure Rework, part 3 Bruce Richardson
                     ` (2 preceding siblings ...)
  2014-09-23 11:08   ` [dpdk-dev] [PATCH v2 3/5] testpmd: Change rxfreet default to 32 Bruce Richardson
@ 2014-09-23 11:08   ` Bruce Richardson
  2014-09-23 11:08   ` [dpdk-dev] [PATCH v2 5/5] mbuf: switch vlan_tci and reserved2 fields Bruce Richardson
  2014-09-29 15:58   ` [dpdk-dev] [PATCH v2 0/5] Mbuf Structure Rework, part 3 De Lara Guarch, Pablo
  5 siblings, 0 replies; 39+ messages in thread
From: Bruce Richardson @ 2014-09-23 11:08 UTC (permalink / raw)
  To: dev

While some applications may store metadata about packets in the packet
mbuf headroom, this is not a workable solution for packet metadata which
is either:
* larger than the headroom (or headroom is needed for adding pkt headers)
* needs to be shared or copied among packets

To support these use cases in applications, we reserve a general
"userdata" pointer field inside the second cache-line of the mbuf. This
is better than having the application store the pointer to the external
metadata in the packet headroom, as it saves an additional cache-line
from being used.

Apart from storing metadata, this field also provides a general 8-byte
scratch space inside the mbuf for any other application uses that are
applicable.

Changes in V2:
* made the userdata field always have 8-bytes available, even on 32-bit

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h | 6 ++++--
 lib/librte_mbuf/rte_mbuf.h                                    | 6 ++++++
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
index 25ed672..e548161 100644
--- a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
+++ b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
@@ -116,8 +116,10 @@ struct rte_kni_mbuf {
 	char pad2[2];
 	uint16_t data_len;      /**< Amount of data in segment buffer. */
 	uint32_t pkt_len;       /**< Total pkt len: sum of all segment data_len. */
-	char pad3[8];
-	void *pool __attribute__((__aligned__(64)));
+
+	/* fields on second cache line */
+	char pad3[8] __attribute__((__aligned__(64)));
+	void *pool;
 	void *next;
 };
 
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 8e27d2e..9e70d3b 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -172,6 +172,12 @@ struct rte_mbuf {
 
 	/* second cache line - fields only used in slow path or on TX */
 	MARKER cacheline1 __rte_cache_aligned;
+
+	union {
+		void *userdata;   /**< Can be used for external metadata */
+		uint64_t udata64; /**< Allow 8-byte userdata on 32-bit */
+	};
+
 	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
 	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
 
-- 
1.9.3

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [dpdk-dev] [PATCH v2 5/5] mbuf: switch vlan_tci and reserved2 fields
  2014-09-23 11:08 ` [dpdk-dev] [PATCH v2 0/5] Mbuf Structure Rework, part 3 Bruce Richardson
                     ` (3 preceding siblings ...)
  2014-09-23 11:08   ` [dpdk-dev] [PATCH v2 4/5] mbuf: add userdata pointer field Bruce Richardson
@ 2014-09-23 11:08   ` Bruce Richardson
  2014-09-29 15:58   ` [dpdk-dev] [PATCH v2 0/5] Mbuf Structure Rework, part 3 De Lara Guarch, Pablo
  5 siblings, 0 replies; 39+ messages in thread
From: Bruce Richardson @ 2014-09-23 11:08 UTC (permalink / raw)
  To: dev

Move the vlan_tci field up by two bytes in the mbuf data structure. This
has two effects:
* Ensures the the ixgbe vector driver places the vlan tag in the correct
  place in the mbuf.
* Allows a second vlan tag field, if one is added in the future, to be
  placed after the existing vlan field, rather than before.

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_mbuf/rte_mbuf.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 9e70d3b..68304cc 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -159,8 +159,8 @@ struct rte_mbuf {
 	uint16_t reserved2;       /**< Unused field. Required for padding */
 	uint16_t data_len;        /**< Amount of data in segment buffer. */
 	uint32_t pkt_len;         /**< Total pkt len: sum of all segments. */
-	uint16_t reserved;
 	uint16_t vlan_tci;        /**< VLAN Tag Control Identifier (CPU order) */
+	uint16_t reserved;
 	union {
 		uint32_t rss;     /**< RSS hash result if RSS enabled */
 		struct {
-- 
1.9.3

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [dpdk-dev] [PATCH v2 3/5] testpmd: Change rxfreet default to 32
  2014-09-23 11:08   ` [dpdk-dev] [PATCH v2 3/5] testpmd: Change rxfreet default to 32 Bruce Richardson
@ 2014-09-23 17:02     ` Neil Horman
  2014-09-24  9:03       ` Richardson, Bruce
  0 siblings, 1 reply; 39+ messages in thread
From: Neil Horman @ 2014-09-23 17:02 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

On Tue, Sep 23, 2014 at 12:08:15PM +0100, Bruce Richardson wrote:
> To improve performance by using bulk alloc or vectored RX routines, we
> need to set rx free threshold (rxfreet) value to 32, so make this the
> testpmd default.
> 
> Thirty-two is the minimum setting needed to enable either the
> bulk alloc or vector RX routines inside the ixgbe driver, so it's
> best made the default for that reason. Please see
> "check_rx_burst_bulk_alloc_preconditions()" in ixgbe_rxtx.c, and
> RX function assignment logic in "ixgbe_dev_rx_queue_setup()" in
> the same file.
> 
> The difference in IO performance for testpmd when called without any
> optional parameters, and using 10G NICs using the ixgbe driver, can be
> significant - approx 25% or more.
> 
> Updates in V2:
> * Updated commit message with additional details
> 
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> ---
>  app/test-pmd/testpmd.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
> index 9f6cdc4..f76406f 100644
> --- a/app/test-pmd/testpmd.c
> +++ b/app/test-pmd/testpmd.c
> @@ -225,7 +225,9 @@ struct rte_eth_thresh tx_thresh = {
>  /*
>   * Configurable value of RX free threshold.
>   */
> -uint16_t rx_free_thresh = 0; /* Immediately free RX descriptors by default. */
> +uint16_t rx_free_thresh = 32; /* Refill RX descriptors once every 32 packets,
> +		This setting is needed for ixgbe to enable bulk alloc or vector
> +		receive functionality. */

I thought we were talking about making this a pmd private selectable item, or
allowing a reserved "let the pmd decide" setting.  Or are we saving that for a
later time?

Neil

>  
>  /*
>   * Configurable value of RX drop enable.
> -- 
> 1.9.3
> 
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [dpdk-dev] [PATCH v2 3/5] testpmd: Change rxfreet default to 32
  2014-09-23 17:02     ` Neil Horman
@ 2014-09-24  9:03       ` Richardson, Bruce
  2014-09-24 10:05         ` Neil Horman
  2014-11-07 12:30         ` Thomas Monjalon
  0 siblings, 2 replies; 39+ messages in thread
From: Richardson, Bruce @ 2014-09-24  9:03 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Tuesday, September 23, 2014 6:03 PM
> To: Richardson, Bruce
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2 3/5] testpmd: Change rxfreet default to 32
> 
> On Tue, Sep 23, 2014 at 12:08:15PM +0100, Bruce Richardson wrote:
> > To improve performance by using bulk alloc or vectored RX routines, we
> > need to set rx free threshold (rxfreet) value to 32, so make this the
> > testpmd default.
> >
> > Thirty-two is the minimum setting needed to enable either the
> > bulk alloc or vector RX routines inside the ixgbe driver, so it's
> > best made the default for that reason. Please see
> > "check_rx_burst_bulk_alloc_preconditions()" in ixgbe_rxtx.c, and
> > RX function assignment logic in "ixgbe_dev_rx_queue_setup()" in
> > the same file.
> >
> > The difference in IO performance for testpmd when called without any
> > optional parameters, and using 10G NICs using the ixgbe driver, can be
> > significant - approx 25% or more.
> >
> > Updates in V2:
> > * Updated commit message with additional details
> >
> > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > ---
> >  app/test-pmd/testpmd.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
> > index 9f6cdc4..f76406f 100644
> > --- a/app/test-pmd/testpmd.c
> > +++ b/app/test-pmd/testpmd.c
> > @@ -225,7 +225,9 @@ struct rte_eth_thresh tx_thresh = {
> >  /*
> >   * Configurable value of RX free threshold.
> >   */
> > -uint16_t rx_free_thresh = 0; /* Immediately free RX descriptors by default. */
> > +uint16_t rx_free_thresh = 32; /* Refill RX descriptors once every 32 packets,
> > +		This setting is needed for ixgbe to enable bulk alloc or vector
> > +		receive functionality. */
> 
> I thought we were talking about making this a pmd private selectable item, or
> allowing a reserved "let the pmd decide" setting.  Or are we saving that for a
> later time?
> 

Yes, we are looking at that - and hopefully we can also get a patch for that in for our next release. However, I've left this patch in just in case that doesn't actually happen, as the performance improvements for 10G are just too good to leave aside for the sake of a 1-line change. Ideally, I'd like this go to in, and then be replaced by a "proper" fix.

/Bruce

> Neil
> 
> >
> >  /*
> >   * Configurable value of RX drop enable.
> > --
> > 1.9.3
> >
> >

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [dpdk-dev] [PATCH v2 3/5] testpmd: Change rxfreet default to 32
  2014-09-24  9:03       ` Richardson, Bruce
@ 2014-09-24 10:05         ` Neil Horman
  2014-11-07 12:30         ` Thomas Monjalon
  1 sibling, 0 replies; 39+ messages in thread
From: Neil Horman @ 2014-09-24 10:05 UTC (permalink / raw)
  To: Richardson, Bruce; +Cc: dev

On Wed, Sep 24, 2014 at 09:03:20AM +0000, Richardson, Bruce wrote:
> > -----Original Message-----
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Tuesday, September 23, 2014 6:03 PM
> > To: Richardson, Bruce
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v2 3/5] testpmd: Change rxfreet default to 32
> > 
> > On Tue, Sep 23, 2014 at 12:08:15PM +0100, Bruce Richardson wrote:
> > > To improve performance by using bulk alloc or vectored RX routines, we
> > > need to set rx free threshold (rxfreet) value to 32, so make this the
> > > testpmd default.
> > >
> > > Thirty-two is the minimum setting needed to enable either the
> > > bulk alloc or vector RX routines inside the ixgbe driver, so it's
> > > best made the default for that reason. Please see
> > > "check_rx_burst_bulk_alloc_preconditions()" in ixgbe_rxtx.c, and
> > > RX function assignment logic in "ixgbe_dev_rx_queue_setup()" in
> > > the same file.
> > >
> > > The difference in IO performance for testpmd when called without any
> > > optional parameters, and using 10G NICs using the ixgbe driver, can be
> > > significant - approx 25% or more.
> > >
> > > Updates in V2:
> > > * Updated commit message with additional details
> > >
> > > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > > ---
> > >  app/test-pmd/testpmd.c | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
> > > index 9f6cdc4..f76406f 100644
> > > --- a/app/test-pmd/testpmd.c
> > > +++ b/app/test-pmd/testpmd.c
> > > @@ -225,7 +225,9 @@ struct rte_eth_thresh tx_thresh = {
> > >  /*
> > >   * Configurable value of RX free threshold.
> > >   */
> > > -uint16_t rx_free_thresh = 0; /* Immediately free RX descriptors by default. */
> > > +uint16_t rx_free_thresh = 32; /* Refill RX descriptors once every 32 packets,
> > > +		This setting is needed for ixgbe to enable bulk alloc or vector
> > > +		receive functionality. */
> > 
> > I thought we were talking about making this a pmd private selectable item, or
> > allowing a reserved "let the pmd decide" setting.  Or are we saving that for a
> > later time?
> > 
> 
> Yes, we are looking at that - and hopefully we can also get a patch for that in for our next release. However, I've left this patch in just in case that doesn't actually happen, as the performance improvements for 10G are just too good to leave aside for the sake of a 1-line change. Ideally, I'd like this go to in, and then be replaced by a "proper" fix.
> 
Ok, thanks
Neil

> /Bruce
> 
> > Neil
> > 
> > >
> > >  /*
> > >   * Configurable value of RX drop enable.
> > > --
> > > 1.9.3
> > >
> > >
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [dpdk-dev] [PATCH v2 0/5] Mbuf Structure Rework, part 3
  2014-09-23 11:08 ` [dpdk-dev] [PATCH v2 0/5] Mbuf Structure Rework, part 3 Bruce Richardson
                     ` (4 preceding siblings ...)
  2014-09-23 11:08   ` [dpdk-dev] [PATCH v2 5/5] mbuf: switch vlan_tci and reserved2 fields Bruce Richardson
@ 2014-09-29 15:58   ` De Lara Guarch, Pablo
  2014-10-08 12:31     ` Thomas Monjalon
  5 siblings, 1 reply; 39+ messages in thread
From: De Lara Guarch, Pablo @ 2014-09-29 15:58 UTC (permalink / raw)
  To: Richardson, Bruce, dev



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Richardson, Bruce
> Sent: Tuesday, September 23, 2014 12:08 PM
> To: dev@dpdk.org
> Subject: [dpdk-dev] [PATCH v2 0/5] Mbuf Structure Rework, part 3
> 
> This is the final planned set of patches to make changes to the mbuf
> data structure and associated files. This patch set makes more changes to
> help improve performance following the mbuf changes and adds in two new
> fields into the mbuf structure.
> 
> It is planned to add other fields other than the two provided here, but
> patches for adding those fields will be included in the patch sets for the
> changes making use of those fields, since adding them does not affect, or
> move, any other mbuf fields.
> 
> Changes in V2:
> * Updated userdata pointer in mbuf to always be 8 bytes big
> * Updated a number of commit messages to have more details about the
> performance benefits of the changes proposed in the patches
> * Removed old patch 5 which added the second vlan tag, and replaced it with
> a new, smaller patch which just moves the existing vlan_tci field above the
> 16-bit reserved space.
> 
> Bruce Richardson (5):
>   mbuf: ensure next pointer is set to null on free
>   ixgbe: add prefetch to improve slow-path tx perf
>   testpmd: Change rxfreet default to 32
>   mbuf: add userdata pointer field
>   mbuf: switch vlan_tci and reserved2 fields
> 
>  app/test-pmd/testpmd.c                                      |  4 +++-
>  .../linuxapp/eal/include/exec-env/rte_kni_common.h          |  6 ++++--
>  lib/librte_mbuf/rte_mbuf.h                                  | 12 ++++++++++--
>  lib/librte_pmd_ixgbe/ixgbe_rxtx.c                           | 13 ++++++++-----
>  4 files changed, 25 insertions(+), 10 deletions(-)
> 
> --
> 1.9.3

Acked-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [dpdk-dev] [PATCH v2 0/5] Mbuf Structure Rework, part 3
  2014-09-29 15:58   ` [dpdk-dev] [PATCH v2 0/5] Mbuf Structure Rework, part 3 De Lara Guarch, Pablo
@ 2014-10-08 12:31     ` Thomas Monjalon
  0 siblings, 0 replies; 39+ messages in thread
From: Thomas Monjalon @ 2014-10-08 12:31 UTC (permalink / raw)
  To: Richardson, Bruce; +Cc: dev

2014-09-29 15:58, De Lara Guarch, Pablo:
> > -----Original Message-----
> > This is the final planned set of patches to make changes to the mbuf
> > data structure and associated files. This patch set makes more changes to
> > help improve performance following the mbuf changes and adds in two new
> > fields into the mbuf structure.
> > 
> > It is planned to add other fields other than the two provided here, but
> > patches for adding those fields will be included in the patch sets for the
> > changes making use of those fields, since adding them does not affect, or
> > move, any other mbuf fields.
> > 
> > Changes in V2:
> > * Updated userdata pointer in mbuf to always be 8 bytes big
> > * Updated a number of commit messages to have more details about the
> > performance benefits of the changes proposed in the patches
> > * Removed old patch 5 which added the second vlan tag, and replaced it with
> > a new, smaller patch which just moves the existing vlan_tci field above the
> > 16-bit reserved space.
> > 
> > Bruce Richardson (5):
> >   mbuf: ensure next pointer is set to null on free
> >   ixgbe: add prefetch to improve slow-path tx perf
> >   testpmd: Change rxfreet default to 32
> >   mbuf: add userdata pointer field
> >   mbuf: switch vlan_tci and reserved2 fields
> 
> Acked-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>

Applied

Thanks
-- 
Thomas

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [dpdk-dev] [PATCH v2 3/5] testpmd: Change rxfreet default to 32
  2014-09-24  9:03       ` Richardson, Bruce
  2014-09-24 10:05         ` Neil Horman
@ 2014-11-07 12:30         ` Thomas Monjalon
  2014-11-07 13:49           ` Bruce Richardson
  1 sibling, 1 reply; 39+ messages in thread
From: Thomas Monjalon @ 2014-11-07 12:30 UTC (permalink / raw)
  To: Richardson, Bruce; +Cc: dev

Hi Bruce,

2014-09-24 09:03, Richardson, Bruce:
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > On Tue, Sep 23, 2014 at 12:08:15PM +0100, Bruce Richardson wrote:
> > > --- a/app/test-pmd/testpmd.c
> > > +++ b/app/test-pmd/testpmd.c
> > > @@ -225,7 +225,9 @@ struct rte_eth_thresh tx_thresh = {
> > >  /*
> > >   * Configurable value of RX free threshold.
> > >   */
> > > -uint16_t rx_free_thresh = 0; /* Immediately free RX descriptors by default. */
> > > +uint16_t rx_free_thresh = 32; /* Refill RX descriptors once every 32 packets,
> > > +		This setting is needed for ixgbe to enable bulk alloc or vector
> > > +		receive functionality. */
> > 
> > I thought we were talking about making this a pmd private selectable item,
> > or allowing a reserved "let the pmd decide" setting.  Or are we saving
> > that for a later time?
> 
> Yes, we are looking at that - and hopefully we can also get a patch for that
> in for our next release. However, I've left this patch in just in case that
> doesn't actually happen, as the performance improvements for 10G are just
> too good to leave aside for the sake of a 1-line change. Ideally, I'd like
> this go to in, and then be replaced by a "proper" fix.

Now the patch for PMD defaults is integrated:
	http://dpdk.org/ml/archives/dev/2014-October/006511.html
Are you working on getting these defaults in testpmd?

Thanks
-- 
Thomas

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [dpdk-dev] [PATCH v2 3/5] testpmd: Change rxfreet default to 32
  2014-11-07 12:30         ` Thomas Monjalon
@ 2014-11-07 13:49           ` Bruce Richardson
  0 siblings, 0 replies; 39+ messages in thread
From: Bruce Richardson @ 2014-11-07 13:49 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev

On Fri, Nov 07, 2014 at 01:30:53PM +0100, Thomas Monjalon wrote:
> Hi Bruce,
> 
> 2014-09-24 09:03, Richardson, Bruce:
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > On Tue, Sep 23, 2014 at 12:08:15PM +0100, Bruce Richardson wrote:
> > > > --- a/app/test-pmd/testpmd.c
> > > > +++ b/app/test-pmd/testpmd.c
> > > > @@ -225,7 +225,9 @@ struct rte_eth_thresh tx_thresh = {
> > > >  /*
> > > >   * Configurable value of RX free threshold.
> > > >   */
> > > > -uint16_t rx_free_thresh = 0; /* Immediately free RX descriptors by default. */
> > > > +uint16_t rx_free_thresh = 32; /* Refill RX descriptors once every 32 packets,
> > > > +		This setting is needed for ixgbe to enable bulk alloc or vector
> > > > +		receive functionality. */
> > > 
> > > I thought we were talking about making this a pmd private selectable item,
> > > or allowing a reserved "let the pmd decide" setting.  Or are we saving
> > > that for a later time?
> > 
> > Yes, we are looking at that - and hopefully we can also get a patch for that
> > in for our next release. However, I've left this patch in just in case that
> > doesn't actually happen, as the performance improvements for 10G are just
> > too good to leave aside for the sake of a 1-line change. Ideally, I'd like
> > this go to in, and then be replaced by a "proper" fix.
> 
> Now the patch for PMD defaults is integrated:
> 	http://dpdk.org/ml/archives/dev/2014-October/006511.html
> Are you working on getting these defaults in testpmd?
> 

Not at the minute, as I'm busy on other things. If I find time I can look at it,
though, if nobody else volunteers to do so first.

/Bruce


> Thanks
> -- 
> Thomas

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2014-11-07 13:40 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-17 10:01 [dpdk-dev] [PATCH 0/5] Mbuf Structure Rework, part 3 Bruce Richardson
2014-09-17 10:01 ` [dpdk-dev] [PATCH 1/5] mbuf: ensure next pointer is set to null on free Bruce Richardson
2014-09-17 10:01 ` [dpdk-dev] [PATCH 2/5] ixgbe: add prefetch to improve slow-path tx perf Bruce Richardson
2014-09-17 15:21   ` Neil Horman
2014-09-17 15:35     ` Richardson, Bruce
2014-09-17 17:59       ` Neil Horman
2014-09-18 13:36         ` Bruce Richardson
2014-09-18 15:29           ` Neil Horman
2014-09-18 15:42             ` Bruce Richardson
2014-09-18 17:56               ` Neil Horman
2014-09-17 10:01 ` [dpdk-dev] [PATCH 3/5] testpmd: Change rxfreet default to 32 Bruce Richardson
2014-09-17 15:29   ` Neil Horman
2014-09-18 15:53     ` Richardson, Bruce
2014-09-18 17:13       ` Thomas Monjalon
2014-09-18 18:08         ` Neil Horman
2014-09-19  9:18           ` Richardson, Bruce
2014-09-19 10:24             ` Neil Horman
2014-09-19 10:28               ` Richardson, Bruce
2014-09-19 15:18                 ` Neil Horman
2014-09-18 18:03       ` Neil Horman
2014-09-17 10:01 ` [dpdk-dev] [PATCH 4/5] mbuf: add userdata pointer field Bruce Richardson
2014-09-17 15:35   ` Neil Horman
2014-09-17 16:02     ` Richardson, Bruce
2014-09-17 18:29       ` Neil Horman
2014-09-17 10:01 ` [dpdk-dev] [PATCH 5/5] mbuf: Add in second vlan tag field to mbuf Bruce Richardson
2014-09-17 20:46   ` Stephen Hemminger
2014-09-23 11:08 ` [dpdk-dev] [PATCH v2 0/5] Mbuf Structure Rework, part 3 Bruce Richardson
2014-09-23 11:08   ` [dpdk-dev] [PATCH v2 1/5] mbuf: ensure next pointer is set to null on free Bruce Richardson
2014-09-23 11:08   ` [dpdk-dev] [PATCH v2 2/5] ixgbe: add prefetch to improve slow-path tx perf Bruce Richardson
2014-09-23 11:08   ` [dpdk-dev] [PATCH v2 3/5] testpmd: Change rxfreet default to 32 Bruce Richardson
2014-09-23 17:02     ` Neil Horman
2014-09-24  9:03       ` Richardson, Bruce
2014-09-24 10:05         ` Neil Horman
2014-11-07 12:30         ` Thomas Monjalon
2014-11-07 13:49           ` Bruce Richardson
2014-09-23 11:08   ` [dpdk-dev] [PATCH v2 4/5] mbuf: add userdata pointer field Bruce Richardson
2014-09-23 11:08   ` [dpdk-dev] [PATCH v2 5/5] mbuf: switch vlan_tci and reserved2 fields Bruce Richardson
2014-09-29 15:58   ` [dpdk-dev] [PATCH v2 0/5] Mbuf Structure Rework, part 3 De Lara Guarch, Pablo
2014-10-08 12:31     ` Thomas Monjalon

DPDK patches and discussions

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://inbox.dpdk.org/dev/0 dev/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 dev dev/ https://inbox.dpdk.org/dev \
		dev@dpdk.org
	public-inbox-index dev

Example config snippet for mirrors.
Newsgroup available over NNTP:
	nntp://inbox.dpdk.org/inbox.dpdk.dev


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git