DPDK patches and discussions
* [dpdk-dev] Making space in mbuf data-structure
@ 2014-07-03 23:38 Richardson, Bruce
  2014-07-07 10:19 ` Olivier MATZ
  2014-07-15  9:31 ` Ananyev, Konstantin
  0 siblings, 2 replies; 9+ messages in thread
From: Richardson, Bruce @ 2014-07-03 23:38 UTC (permalink / raw)
  To: dev

Hi all,

At this stage it's been well recognised that we need to make more space in the mbuf data structure for new fields. We in Intel have had some discussion on this, and this email outlines our current thinking and approach on the issue, and looks for additional suggestions and feedback.

Firstly, we believe that there is no possible way that we can ever fit all the fields we need into a 64-byte mbuf, and so we need to start looking at a 128-byte mbuf instead. Examples of new fields that need to fit in include: 32-64 bits for additional offload flags; 32 more bits of filter information to support the new filters in the i40e driver; an additional 2-4 bytes for storing info on a second vlan tag; 4 bytes for storing a sequence number to enable packet reordering in future; as well as potentially a number of other fields, or the splitting out of fields that are superimposed over each other right now, e.g. for the qos scheduler. We also want to allow space for use by other non-Intel NIC drivers that may be open-sourced to dpdk.org in the future, where they support fields and offloads that our hardware doesn't.
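
As a rough, illustrative tally of the new fields listed above (taking the upper end of each range; the struct and all names are hypothetical, purely to show the arithmetic):

#include <stdint.h>

struct mbuf_new_fields_estimate {	/* illustrative only */
	uint64_t ol_flags_ext;		/* 32-64 bits of extra offload flags */
	uint32_t filter_info;		/* filter result info (e.g. i40e) */
	uint32_t seqn;			/* sequence number for reordering */
	uint16_t vlan_tci_outer;	/* second (outer) vlan tag */
};	/* ~18 bytes before any splitting of superimposed fields - far more
	 * than can be carved out of the already-full 64-byte mbuf */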

If we accept the fact of a 2-cache-line mbuf, then the issue becomes how to rework things so that we spread our fields over the two cache lines while causing the lowest slow-down possible. The general approach that we are looking to take is to focus the first cache line on fields that are updated on RX, so that receive only deals with one cache line. The second cache line can be used for application data and information that will only be used on the TX leg. This would allow us to work on the first cache line in RX as now, and have the second cache line being prefetched in the background so that it is available when necessary. Hardware prefetches should help us out here. We also may move rarely used or slow-path RX fields, e.g. those for chained mbufs with jumbo frames, to the second cache line, depending upon the performance impact and byte savings achieved.
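
To illustrate the prefetching idea, here is a minimal sketch of an RX loop that warms the second cache line in the background (assuming a 128-byte mbuf; the helper function is illustrative, not proposed API):

#include <rte_ethdev.h>
#include <rte_memory.h>
#include <rte_prefetch.h>

#define BURST 32

static uint16_t
rx_burst_with_prefetch(uint8_t port, uint16_t queue, struct rte_mbuf **pkts)
{
	uint16_t i;
	uint16_t nb_rx = rte_eth_rx_burst(port, queue, pkts, BURST);

	/* start fetching cache line 1 of each mbuf now ... */
	for (i = 0; i < nb_rx; i++)
		rte_prefetch0((char *)pkts[i] + CACHE_LINE_SIZE);

	/* ... and work on the first-cache-line (RX) fields here; by the time
	 * the packets reach the TX leg, line 1 (pool, etc.) should be warm */
	return nb_rx;
}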

With this approach, we still need to make space in the first cache line for information for the new or expanded receive offloads. First off the blocks is to look at moving the mempool pointer into the second cache line. This will free up 8 bytes in cache line 0, with a field that is only used when cleaning up after TX. A prototype patch for this change is given below, for your review and comment. Initial quick tests with testpmd (testpmd -c 600 -n 4 -- --mbcache=250 --txqflags=0xf01 --burst=32 --txrst=32 --txfreet=32 --rxfreet=32 --total-num-mbufs=65536), and l3fwd (l3fwd -c 400 -n 4 -- -p F -P --config="(0,0,10),(1,0,10),(2,0,10),(3,0,10)") showed only a slight performance decrease with testpmd and equal or slightly better performance with l3fwd. These would both have been using the vector PMD - I haven't looked to change the fallback TX implementation yet for this change, since it's not as performance-optimized and hence cycle-count sensitive.

Beyond this change, I'm also investigating potentially moving the "next" pointer to the second cache line, but it's looking harder to move without serious impact, so we'll have to see there what can be done. Other fields I'll examine in due course too, and send on any findings I may have. 

Once we have freed up space, then we can start looking at what fields get to use that space and what way we shuffle the existing fields about, but that's a discussion for another day!

Please see patch below for moving pool pointer to second cache line of mbuf. All feedback welcome, naturally.

Regards,
/Bruce

diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
index 2b87521..e0cf30c 100644
--- a/app/test/test_mbuf.c
+++ b/app/test/test_mbuf.c
@@ -832,7 +832,7 @@ test_failing_mbuf_sanity_check(void)
 int
 test_mbuf(void)
 {
-	RTE_BUILD_BUG_ON(sizeof(struct rte_mbuf) != 64);
+	RTE_BUILD_BUG_ON(sizeof(struct rte_mbuf) != CACHE_LINE_SIZE*2);
 
 	/* create pktmbuf pool if it does not exist */
 	if (pktmbuf_pool == NULL) {
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 2735f37..aa3ec32 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -181,7 +181,8 @@ enum rte_mbuf_type {
  * The generic rte_mbuf, containing a packet mbuf or a control mbuf.
  */
 struct rte_mbuf {
-	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
+	uint64_t this_space_for_rent; // 8-bytes free for reuse.
+
 	void *buf_addr;           /**< Virtual address of segment buffer. */
 	phys_addr_t buf_physaddr; /**< Physical address of segment buffer. */
 	uint16_t buf_len;         /**< Length of segment buffer. */
@@ -210,12 +211,16 @@ struct rte_mbuf {
 		struct rte_pktmbuf pkt;
 	};
 
+// second cache line starts here
+	struct rte_mempool *pool __rte_cache_aligned;
+			/**< Pool from which mbuf was allocated. */
+
 	union {
 		uint8_t metadata[0];
 		uint16_t metadata16[0];
 		uint32_t metadata32[0];
 		uint64_t metadata64[0];
-	};
+	} __rte_cache_aligned;
 } __rte_cache_aligned;
 
 #define RTE_MBUF_METADATA_UINT8(mbuf, offset)              \
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
index 64c0695..7374e0d 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
@@ -171,10 +171,6 @@ struct igb_tx_queue {
 	volatile union ixgbe_adv_tx_desc *tx_ring;
 	uint64_t            tx_ring_phys_addr; /**< TX ring DMA address. */
 	struct igb_tx_entry *sw_ring;      /**< virtual address of SW ring. */
-#ifdef RTE_IXGBE_INC_VECTOR
-	/** continuous tx entry sequence within the same mempool */
-	struct igb_tx_entry_seq *sw_ring_seq;
-#endif
 	volatile uint32_t   *tdt_reg_addr; /**< Address of TDT register. */
 	uint16_t            nb_tx_desc;    /**< number of TX descriptors. */
 	uint16_t            tx_tail;       /**< current value of TDT reg. */
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
index 09e19a3..e9fd739 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
@@ -401,9 +401,8 @@ static inline int __attribute__((always_inline))
 ixgbe_tx_free_bufs(struct igb_tx_queue *txq)
 {
 	struct igb_tx_entry_v *txep;
-	struct igb_tx_entry_seq *txsp;
 	uint32_t status;
-	uint32_t n, k;
+	uint32_t n;
 #ifdef RTE_MBUF_SCATTER_GATHER
 	uint32_t i;
 	int nb_free = 0;
@@ -423,25 +422,36 @@ ixgbe_tx_free_bufs(struct igb_tx_queue *txq)
 	 */
 	txep = &((struct igb_tx_entry_v *)txq->sw_ring)[txq->tx_next_dd -
 			(n - 1)];
-	txsp = &txq->sw_ring_seq[txq->tx_next_dd - (n - 1)];
 
-	while (n > 0) {
-		k = RTE_MIN(n, txsp[n-1].same_pool);
 #ifdef RTE_MBUF_SCATTER_GATHER
-		for (i = 0; i < k; i++) {
-			m = __rte_pktmbuf_prefree_seg((txep+n-k+i)->mbuf);
+	m = __rte_pktmbuf_prefree_seg(txep[0].mbuf);
+	if (likely(m != NULL)) {
+		free[0] = m;
+		nb_free = 1;
+		for (i = 1; i < n; i++) {
+			m = __rte_pktmbuf_prefree_seg(txep[i].mbuf);
+			if (likely(m != NULL)) {
+				if (likely(m->pool == free[0]->pool))
+					free[nb_free++] = m;
+				else
+					rte_mempool_put(m->pool, m);
+			}
+		}
+		rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
+	}
+	else {
+		for (i = 1; i < n; i++) {
+			m = __rte_pktmbuf_prefree_seg(txep[i].mbuf);
 			if (m != NULL)
-				free[nb_free++] = m;
+				rte_mempool_put(m->pool, m);
 		}
-		rte_mempool_put_bulk((void *)txsp[n-1].pool,
-				(void **)free, nb_free);
-#else
-		rte_mempool_put_bulk((void *)txsp[n-1].pool,
-				(void **)(txep+n-k), k);
-#endif
-		n -= k;
 	}
 
+#else /* no scatter_gather */
+	{
+		uint32_t i;	/* i and m are only declared under the
+				 * SCATTER_GATHER ifdef, so declare i here */
+		for (i = 0; i < n; i++)
+			rte_mempool_put(txep[i].mbuf->pool, txep[i].mbuf);
+	}
+#endif /* scatter_gather */
+
 	/* buffers were freed, update counters */
 	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
 	txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
@@ -453,19 +463,11 @@ ixgbe_tx_free_bufs(struct igb_tx_queue *txq)
 
 static inline void __attribute__((always_inline))
 tx_backlog_entry(struct igb_tx_entry_v *txep,
-		 struct igb_tx_entry_seq *txsp,
 		 struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 {
 	int i;
-	for (i = 0; i < (int)nb_pkts; ++i) {
+	for (i = 0; i < (int)nb_pkts; ++i)
 		txep[i].mbuf = tx_pkts[i];
-		/* check and update sequence number */
-		txsp[i].pool = tx_pkts[i]->pool;
-		if (txsp[i-1].pool == tx_pkts[i]->pool)
-			txsp[i].same_pool = txsp[i-1].same_pool + 1;
-		else
-			txsp[i].same_pool = 1;
-	}
 }
 
 uint16_t
@@ -475,7 +477,6 @@ ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 	struct igb_tx_queue *txq = (struct igb_tx_queue *)tx_queue;
 	volatile union ixgbe_adv_tx_desc *txdp;
 	struct igb_tx_entry_v *txep;
-	struct igb_tx_entry_seq *txsp;
 	uint16_t n, nb_commit, tx_id;
 	__m128i flags = _mm_set_epi32(DCMD_DTYP_FLAGS, 0, 0, 0);
 	__m128i rs = _mm_set_epi32(IXGBE_ADVTXD_DCMD_RS|DCMD_DTYP_FLAGS,
@@ -495,14 +496,13 @@ ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 	tx_id = txq->tx_tail;
 	txdp = &txq->tx_ring[tx_id];
 	txep = &((struct igb_tx_entry_v *)txq->sw_ring)[tx_id];
-	txsp = &txq->sw_ring_seq[tx_id];
 
 	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
 
 	n = (uint16_t)(txq->nb_tx_desc - tx_id);
 	if (nb_commit >= n) {
 
-		tx_backlog_entry(txep, txsp, tx_pkts, n);
+		tx_backlog_entry(txep, tx_pkts, n);
 
 		for (i = 0; i < n - 1; ++i, ++tx_pkts, ++txdp)
 			vtx1(txdp, *tx_pkts, flags);
@@ -517,10 +517,9 @@ ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 		/* avoid reach the end of ring */
 		txdp = &(txq->tx_ring[tx_id]);
 		txep = &(((struct igb_tx_entry_v *)txq->sw_ring)[tx_id]);
-		txsp = &(txq->sw_ring_seq[tx_id]);
 	}
 
-	tx_backlog_entry(txep, txsp, tx_pkts, nb_commit);
+	tx_backlog_entry(txep, tx_pkts, nb_commit);
 
 	vtx(txdp, tx_pkts, nb_commit, flags);
 
@@ -544,7 +543,6 @@ ixgbe_tx_queue_release_mbufs(struct igb_tx_queue *txq)
 {
 	unsigned i;
 	struct igb_tx_entry_v *txe;
-	struct igb_tx_entry_seq *txs;
 	uint16_t nb_free, max_desc;
 
 	if (txq->sw_ring != NULL) {
@@ -562,10 +560,6 @@ ixgbe_tx_queue_release_mbufs(struct igb_tx_queue *txq)
 		for (i = 0; i < txq->nb_tx_desc; i++) {
 			txe = (struct igb_tx_entry_v *)&txq->sw_ring[i];
 			txe->mbuf = NULL;
-
-			txs = &txq->sw_ring_seq[i];
-			txs->pool = NULL;
-			txs->same_pool = 0;
 		}
 	}
 }
@@ -580,11 +574,6 @@ ixgbe_tx_free_swring(struct igb_tx_queue *txq)
 		rte_free((struct igb_rx_entry *)txq->sw_ring - 1);
 		txq->sw_ring = NULL;
 	}
-
-	if (txq->sw_ring_seq != NULL) {
-		rte_free(txq->sw_ring_seq - 1);
-		txq->sw_ring_seq = NULL;
-	}
 }
 
 static void
@@ -593,7 +582,6 @@ ixgbe_reset_tx_queue(struct igb_tx_queue *txq)
 	static const union ixgbe_adv_tx_desc zeroed_desc = { .read = {
 			.buffer_addr = 0} };
 	struct igb_tx_entry_v *txe = (struct igb_tx_entry_v *)txq->sw_ring;
-	struct igb_tx_entry_seq *txs = txq->sw_ring_seq;
 	uint16_t i;
 
 	/* Zero out HW ring memory */
@@ -605,8 +593,6 @@ ixgbe_reset_tx_queue(struct igb_tx_queue *txq)
 		volatile union ixgbe_adv_tx_desc *txd = &txq->tx_ring[i];
 		txd->wb.status = IXGBE_TXD_STAT_DD;
 		txe[i].mbuf = NULL;
-		txs[i].pool = NULL;
-		txs[i].same_pool = 0;
 	}
 
 	txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
@@ -632,27 +618,14 @@ static struct ixgbe_txq_ops vec_txq_ops = {
 };
 
 int ixgbe_txq_vec_setup(struct igb_tx_queue *txq,
-			unsigned int socket_id)
+			__rte_unused unsigned int socket_id)
 {
-	uint16_t nb_desc;
-
 	if (txq->sw_ring == NULL)
 		return -1;
 
-	/* request addtional one entry for continous sequence check */
-	nb_desc = (uint16_t)(txq->nb_tx_desc + 1);
-
-	txq->sw_ring_seq = rte_zmalloc_socket("txq->sw_ring_seq",
-				sizeof(struct igb_tx_entry_seq) * nb_desc,
-				CACHE_LINE_SIZE, socket_id);
-	if (txq->sw_ring_seq == NULL)
-		return -1;
-
-
 	/* leave the first one for overflow */
 	txq->sw_ring = (struct igb_tx_entry *)
 		((struct igb_tx_entry_v *)txq->sw_ring + 1);
-	txq->sw_ring_seq += 1;
 	txq->ops = &vec_txq_ops;
 
 	return 0;


* Re: [dpdk-dev] Making space in mbuf data-structure
  2014-07-03 23:38 [dpdk-dev] Making space in mbuf data-structure Richardson, Bruce
@ 2014-07-07 10:19 ` Olivier MATZ
  2014-07-08  7:04   ` Zhang, Helin
  2014-07-15  9:07   ` Ananyev, Konstantin
  2014-07-15  9:31 ` Ananyev, Konstantin
  1 sibling, 2 replies; 9+ messages in thread
From: Olivier MATZ @ 2014-07-07 10:19 UTC (permalink / raw)
  To: Richardson, Bruce, dev

Hello Bruce,

Thank you for reviving this discussion now that 1.7 is released.

First, I would like to reference my previous patch series that first
reworks the mbuf to gain 9 bytes [1]. The v1 of the patch was discussed
at [2].

Now, let's list what I would find useful to have in this mbuf rework:

- larger size for ol_flags: this is at least needed for TSO, but as it
   is completely full today, I expect other features to need this too.
- add other offload fields: l4_len and mss, required for TSO
- remove ctrl_mbuf: they could be replaced by packet mbufs. This will
   simplify the mbuf structure. Moreover, it would allow us to save room
   in the mbuf.
- a new vlan tag: I suppose this could be useful in some use-cases
   where vlans are stacked.
- splitting out fields that are superimposed: so that 2 features can be
   used at the same time

On the other hand, I'm not convinced by this:

- new filters in the i40e driver: I don't think the mbuf is the
   right place for driver-specific flags. If a feature is brought
   by a new driver requiring a flag in mbuf, we should take care that
   the flag is not bound to this particular driver and would match
   the same feature in another driver.
- sequence number: I'm not sure I understand the use-case; maybe this
   could stay in the mbuf metadata in the reordering module (see the
   sketch after this list).
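
For instance, a minimal sketch of keeping the sequence number in the
application metadata area rather than in a dedicated mbuf field (the
offset constant is hypothetical; this assumes the RTE_MBUF_METADATA_*
accessors visible in the patch above):

#include <rte_mbuf.h>

#define REORDER_SEQN_OFS 0	/* hypothetical offset into metadata area */

static inline void
reorder_set_seqn(struct rte_mbuf *m, uint32_t seqn)
{
	RTE_MBUF_METADATA_UINT32(m, REORDER_SEQN_OFS) = seqn;
}

static inline uint32_t
reorder_get_seqn(struct rte_mbuf *m)
{
	return RTE_MBUF_METADATA_UINT32(m, REORDER_SEQN_OFS);
}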

> Firstly, we believe that there is no possible way that we can ever fit
> all the fields we need to fit into a 64-byte mbuf, and so we need to
> start looking at a 128-byte mbuf instead.

The TSO patches show that it is possible to keep a 64-byte mbuf (of
course, it depends on what we want to add in the mbuf). I'm not
fundamentally against having a 128-byte mbuf. But:

- it should not be a reason for just adding things and not reworking
   things that could be enhanced
- it should not be a reason for not optimizing the current mbuf
   structure
- if we can do the same with a 64-byte mbuf, we need to carefully
   compare the solutions, as fetching a second cache line is not
   costless in all situations. The 64-byte solution I'm proposing
   in [1] may cost a bit more in CPU cycles but avoids an additional
   cache prefetch (or miss). In some situations (I'm thinking about
   use-cases where we are memory-bound, e.g. an application processing
   a lot of data), it is better to lose a few CPU cycles.

> First off the blocks is to look at moving the mempool pointer into
> the second cache line
> [...]
> Beyond this change, I'm also investigating potentially moving the "next"
> pointer to the second cache line, but it's looking harder to move
> without serious impact

I think we can easily find DPDK applications that would use the "next"
field of the mbuf on the rx side, as it is the standard way of chaining
packets. For instance: IP reassembly, TCP/UDP socket queues, or any
other protocol that needs a reassembly queue. This is at least what
we do in the 6WINDGate fast path stack, and I suppose other network
stack implementations would do something similar, so we should probably
avoid moving this field to the 2nd cache line.
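
To make the rx-side use of "next" concrete, here is a sketch of the
kind of chaining a reassembly queue does (using the pkt.next field of
this tree; the function itself is illustrative only):

static void
reass_queue_append(struct rte_mbuf **head, struct rte_mbuf *frag)
{
	struct rte_mbuf *m = *head;

	if (m == NULL) {
		*head = frag;
		return;
	}
	while (m->pkt.next != NULL)	/* walk to the end of the chain */
		m = m->pkt.next;
	m->pkt.next = frag;		/* "next" is written on the rx path */
}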

One more issue I do foresee: with slower CPUs like Atom, having 2
cache lines will add more cost than on Xeon. I'm wondering if it
makes sense to have a compile-time option to select either limited
features with one cache line, or full features with two cache lines.
I don't know if it's a good idea because it would make the code more
complex, but we could consider it. I think we don't target binary
compatibility today?

From a functional point of view, we could check that my TSO
patch can be adapted to your proposal, so we can challenge and merge
both approaches.

As this change would impact the core of DPDK, I think it would be
interesting to list some representative use-cases in order to evaluate
the cost of each solution. This will also help for future modifications,
and could be included in a sort of non-regression test?

Regards,
Olivier

[1] http://dpdk.org/ml/archives/dev/2014-May/002537.html
[2] http://dpdk.org/ml/archives/dev/2014-May/002322.html


* Re: [dpdk-dev] Making space in mbuf data-structure
  2014-07-07 10:19 ` Olivier MATZ
@ 2014-07-08  7:04   ` Zhang, Helin
  2014-07-08  7:16     ` Ivan Boule
  2014-07-15  9:07   ` Ananyev, Konstantin
  1 sibling, 1 reply; 9+ messages in thread
From: Zhang, Helin @ 2014-07-08  7:04 UTC (permalink / raw)
  To: Olivier MATZ, Richardson, Bruce; +Cc: dev



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Olivier MATZ
> Sent: Monday, July 7, 2014 6:19 PM
> To: Richardson, Bruce; dev@dpdk.org
> Subject: Re: [dpdk-dev] Making space in mbuf data-structure
> 
> [...]

Hi Olivier

I am trying to convince you about the new field for "filter status".
It holds the matched Flow Director filter ID, and might be reused for the HASH signature if the packet matches a hash filter, or for other results.
It is quite useful for Flow Director, and it is not a flag. I guess similar features exist even in non-Intel NICs.

Regards,
Helin


* Re: [dpdk-dev] Making space in mbuf data-structure
  2014-07-08  7:04   ` Zhang, Helin
@ 2014-07-08  7:16     ` Ivan Boule
  2014-07-08  7:37       ` Zhang, Helin
  0 siblings, 1 reply; 9+ messages in thread
From: Ivan Boule @ 2014-07-08  7:16 UTC (permalink / raw)
  To: Zhang, Helin, Olivier MATZ, Richardson, Bruce; +Cc: dev

On 07/08/2014 09:04 AM, Zhang, Helin wrote:
>
>> [...]
>
> Hi Olivier
>
> I am trying to convince you about the new field for "filter status".
> It holds the matched Flow Director filter ID, and might be reused for the HASH signature if the packet matches a hash filter, or for other results.
> It is quite useful for Flow Director, and it is not a flag. I guess similar features exist even in non-Intel NICs.
>
By construction, since a packet cannot match more than 1 filter with an
associated identifier, this is typically the kind of field that should
be put in a union with the standard 32-bit RSS id.
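
In other words, something along these lines (a sketch only; the names
are illustrative, not a proposed API):

union rx_filter_result {	/* one 32-bit slot, shared by design */
	uint32_t rss;		/* RSS hash, when RSS is used */
	uint32_t fdir_id;	/* matched flow director filter ID */
};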

Regards,
Ivan

> Regards,
> Helin
>


-- 
Ivan Boule
6WIND Development Engineer


* Re: [dpdk-dev] Making space in mbuf data-structure
  2014-07-08  7:16     ` Ivan Boule
@ 2014-07-08  7:37       ` Zhang, Helin
  0 siblings, 0 replies; 9+ messages in thread
From: Zhang, Helin @ 2014-07-08  7:37 UTC (permalink / raw)
  To: Ivan Boule, Olivier MATZ, Richardson, Bruce; +Cc: dev



> -----Original Message-----
> From: Ivan Boule [mailto:ivan.boule@6wind.com]
> Sent: Tuesday, July 8, 2014 3:17 PM
> To: Zhang, Helin; Olivier MATZ; Richardson, Bruce
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] Making space in mbuf data-structure
> 
> On 07/08/2014 09:04 AM, Zhang, Helin wrote:
> >
> >> [...]
> >
> > Hi Olivier
> >
> > I am trying to convince you about the new field for "filter status".
> > It holds the matched Flow Director filter ID, and might be reused for the
> > HASH signature if the packet matches a hash filter, or for other results.
> > It is quite useful for Flow Director, and it is not a flag. I guess similar
> > features exist even in non-Intel NICs.
> >
> By construction, since a packet cannot match more than 1 filter with an
> associated identifier, this is typically the kind of field that should be put in
> a union with the standard 32-bit RSS id.
> 
> Regards,
> Ivan
> 
> > Regards,
> > Helin
> >
> 
> 
> --
> Ivan Boule
> 6WIND Development Engineer

Hi Ivan

Yes, you are right. The 32-bit 'filter status' can be unioned with the RSS ID.

But it still needs another 32 bits. Why?
i40e can report 64 bits of 'flexible bytes', or report both 32 bits of 'flexible bytes' and a 32-bit 'matched FD filter ID' at the same time.
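
So the union would need to grow to cover both report formats, roughly like this (names and layout purely illustrative):

union rx_filter_result {
	uint32_t rss;			/* RSS hash result */
	uint64_t flex_bytes;		/* 64-bit flexible bytes report */
	struct {
		uint32_t lo;		/* 32-bit flexible bytes ... */
		uint32_t id;		/* ... plus 32-bit matched FD filter ID */
	} fdir;
};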

Regards,
Helin


* Re: [dpdk-dev] Making space in mbuf data-structure
  2014-07-07 10:19 ` Olivier MATZ
  2014-07-08  7:04   ` Zhang, Helin
@ 2014-07-15  9:07   ` Ananyev, Konstantin
  2014-07-16 12:32     ` Olivier MATZ
  1 sibling, 1 reply; 9+ messages in thread
From: Ananyev, Konstantin @ 2014-07-15  9:07 UTC (permalink / raw)
  To: Olivier MATZ, Richardson, Bruce, dev


Hi Olivier,

> 
> As this change would impact the core of DPDK, I think it would be
> interesting to list some representative use-cases in order to evaluate
> the cost of each solution. This will also help for future modifications,
> and could be included in a sort of non-regression test?
> 

I think this is a very good idea.
Let's create a list of such test-cases.   
Two obvious apps from our side would probably be:
- test-pmd iofwd mode
- l3fwd
What else should we add here?
Konstantin


* Re: [dpdk-dev] Making space in mbuf data-structure
  2014-07-03 23:38 [dpdk-dev] Making space in mbuf data-structure Richardson, Bruce
  2014-07-07 10:19 ` Olivier MATZ
@ 2014-07-15  9:31 ` Ananyev, Konstantin
  2014-07-15 18:07   ` Richardson, Bruce
  1 sibling, 1 reply; 9+ messages in thread
From: Ananyev, Konstantin @ 2014-07-15  9:31 UTC (permalink / raw)
  To: Richardson, Bruce, dev

Hi Bruce,

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Richardson, Bruce
> Sent: Friday, July 04, 2014 12:39 AM
> To: dev@dpdk.org
> Subject: [dpdk-dev] Making space in mbuf data-structure
> 
> Hi all,
> 
> At this stage it's been well recognised that we need to make more space in the mbuf data structure for new fields. We in Intel have
> had some discussion on this and this email is to outline what our current thinking and approach on this issue is, and look for additional
> suggestions and feedback.
> 
> Firstly, we believe that there is no possible way that we can ever fit all the fields we need to fit into a 64-byte mbuf, and so we need to
> start looking at a 128-byte mbuf instead. Examples of new fields that need to fit in, include - 32-64 bits for additional offload flags, 32-
> bits more for filter information for support for the new filters in the i40e driver, an additional 2-4 bytes for storing info on a second
> vlan tag, 4-bytes for storing a sequence number to enable packet reordering in future, as well as potentially a number of other fields
> or splitting out fields that are superimposed over each other right now, e.g. for the qos scheduler. We also want to allow space for use
> by other non-Intel NIC drivers that may be open-sourced to dpdk.org in the future too, where they support fields and offloads that
> our hardware doesn't.
> 
> If we accept the fact of a 2-cache-line mbuf, then the issue becomes how to rework things so that we spread our fields over the two
> cache lines while causing the lowest slow-down possible. The general approach that we are looking to take is to focus the first cache
> line on fields that are updated on RX, so that receive only deals with one
> data and information that will only be used on the TX leg. This would allow us to work on the first cache line in RX as now, and have the
> second cache line being prefetched in the background so that it is available when necessary. Hardware prefetches should help us out
> here. We also may move rarely used, or slow-path RX fields e.g. such as those for chained mbufs with jumbo frames, to the second
> cache line, depending upon the performance impact and bytes savings achieved.
> 
> With this approach, we still need to make space in the first cache line for information for the new or expanded receive offloads. First
> off the blocks is to look at moving the mempool pointer into the second cache line. This will free up 8 bytes in cache line 0, with a field
> that is only used when cleaning up after TX. A prototype patch for this change is given below, for your review and comment. Initial
> quick tests with testpmd (testpmd -c 600 -n 4 -- --mbcache=250 --txqflags=0xf01 --burst=32 --txrst=32 --txfreet=32 --rxfreet=32 --
> total-num-mbufs=65536), and l3fwd (l3fwd -c 400 -n 4 -- -p F -P --config="(0,0,10),(1,0,10),(2,0,10),(3,0,10)") showed only a slight
> performance decrease with testpmd and equal or slightly better performance with l3fwd. These would both have been using the
> vector PMD - I haven't looked to change the fallback TX implementation yet for this change, since it's not as performance optimized
> and hence cycle-count sensitive.

Regarding the code changes themselves:
If I understand your code changes correctly:
ixgbe_tx_free_bufs() will use rte_mempool_put_bulk() *only* if all mbufs in the bulk belong to the same mempool.
If we have, let's say, 2 groups of mbufs from 2 different mempools - it wouldn't be able to use put_bulk.
While, as I remember, the current implementation is able to use put_bulk() in such a case.
I think it is possible to change your code to do a better grouping.
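
One possible shape for that grouping - flush each contiguous run of mbufs from one pool with a single bulk put, and start a new run whenever the pool changes. A rough, untested sketch, meant to slot into ixgbe_tx_free_bufs in place of the loop in the prototype (n, txep and free[] as there):

	struct rte_mbuf *m;
	uint32_t i, nb_free = 0;

	for (i = 0; i < n; i++) {
		m = __rte_pktmbuf_prefree_seg(txep[i].mbuf);
		if (m == NULL)
			continue;
		if (nb_free > 0 && m->pool != free[0]->pool) {
			/* pool changed: flush the current run in one bulk put */
			rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
			nb_free = 0;
		}
		free[nb_free++] = m;
	}
	if (nb_free > 0)
		rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);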

> 
> Beyond this change, I'm also investigating potentially moving the "next" pointer to the second cache line, but it's looking harder to
> move without serious impact, so we'll have to see there what can be done. Other fields I'll examine in due course too, and send on
> any findings I may have.

Could you explain it a bit?
I always had an impression that moving the next pointer to the 2nd cache line would have less impact than moving the mempool pointer.
My thinking was: next is used only in the scatter versions of RX/TX, which are slower than the optimised RX/TX anyway.
So a few extra cycles wouldn't be that noticeable.
Though I admit I never did any measurements for that case.

About space savings for the first cache line:
- I still think that Olivier's suggestion of replacing the data pointer (8 bytes) with a data offset (2 bytes) makes a lot of sense (see the sketch below).
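
For reference, the data-offset idea in miniature - the 8-byte data pointer becomes a 2-byte offset from buf_addr, recomputed on use (the helper and its data_off parameter are illustrative; in a real mbuf the offset would be a new 2-byte field):

static inline void *
mbuf_data_sketch(struct rte_mbuf *m, uint16_t data_off)
{
	/* 2 bytes stored instead of an 8-byte pointer; the address is
	 * reconstructed from the buffer start on each access */
	return (char *)m->buf_addr + data_off;
}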

>
> > [...]


* Re: [dpdk-dev] Making space in mbuf data-structure
  2014-07-15  9:31 ` Ananyev, Konstantin
@ 2014-07-15 18:07   ` Richardson, Bruce
  0 siblings, 0 replies; 9+ messages in thread
From: Richardson, Bruce @ 2014-07-15 18:07 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev, Olivier MATZ (olivier.matz@6wind.com)

> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Tuesday, July 15, 2014 2:31 AM
> To: Richardson, Bruce; dev@dpdk.org
> Subject: RE: Making space in mbuf data-structure
> 
> Hi Bruce,
> 
> > [...]
> 
> Regarding the code changes themselves:
> If I understand your code changes correctly:
> ixgbe_tx_free_bufs() will use rte_mempool_put_bulk() *only* if all mbufs in the
> bulk belong to the same mempool.
> If we have, let's say, 2 groups of mbufs from 2 different mempools - it wouldn't
> be able to use put_bulk.
> While, as I remember, the current implementation is able to use put_bulk() in
> such a case.
> I think it is possible to change your code to do a better grouping.

Given two sets of mbufs from two separate pools, this proposed change will use put_bulk for the first group encountered, but not the second. We can look at enhancing it to work with different groups, but this version works well where all or most mbufs in a TX queue come from a common mempool. In the case of having every second or every third packet coming from a different mempool (for the two- or three-mempool case), this version should also perform better, as it doesn't rely on having long runs of packets from the same mempool.
However, this is very much WIP, so we'll see if we can come up with a better solution if we think one is necessary. (Most apps seem to use just one mempool per numa socket at most, though, so I don't think we should over-optimise for multiple mempools.)

> 
> >
> > Beyond this change, I'm also investigating potentially moving the "next"
> pointer to the second cache line, but it's looking harder to
> > move without serious impact, so we'll have to see there what can be done.
> Other fields I'll examine in due course too, and send on
> > any findings I may have.
> 
> Could you explain it a bit?
> I always had an impression that moving the next pointer to the 2nd cache line
> would have less impact than moving the mempool pointer.
> My thinking was: next is used only in the scatter versions of RX/TX, which are
> slower than the optimised RX/TX anyway.
> So a few extra cycles wouldn't be that noticeable.
> Though I admit I never did any measurements for that case.

I was busy with a number of things last week, so work on moving the next pointer slowed somewhat, but I think I have something reasonable now. The trouble with moving the next pointer was that it was being explicitly set to NULL in all our RX code paths, while the TX code path would read but not clear it. Once I had examined all our TX and RX code paths - of which there are more than necessary, I feel - I was able to change things so that the NULL pointer is set on free rather than on allocation (see the sketch below). The final issue was then to ensure that the rx_scattered_pkts function didn't suffer undue slowdown due to the second cache line, and after some experimentation, an extra prefetch seems to have helped here.
At this point, I think I have a prototype with both pool and next pointers on a second cache line, where both the vector rx/tx and scalar (full-featured) code paths are performing well, with sub-5% slowdown. However, more testing is needed with a wider range of apps and packet types to confirm this.
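
The "clear next on free, not on alloc" idea in sketch form - a simplified variant of __rte_pktmbuf_prefree_seg with the refcount fast path abbreviated, so treat it as illustrative only:

#include <rte_mbuf.h>
#include <rte_branch_prediction.h>

static inline struct rte_mbuf *
prefree_seg_sketch(struct rte_mbuf *m)
{
	if (likely(rte_mbuf_refcnt_update(m, -1) == 0)) {
		m->pkt.next = NULL;	/* cleared here, at free time ... */
		return m;
	}
	return NULL;
}
/* ... so RX allocation no longer needs to touch the 2nd cache line just to
 * write next = NULL */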

> 
> About space savings for the first cache line:
> - I still think that Olivier's suggestion of replacing the data pointer (8 bytes)
> with a data offset (2 bytes) makes a lot of sense.

Yes, I would tend to agree. I'm going to take this and run additional testing on this change and see how it fits in with everything else. Some concerns were expressed about potential performance impacts of such a change, so I'm going to try and follow those up so that we can get a data-driven answer to this. If no performance impact can be seen (and none has been observed so far, to the best of my knowledge), I'm all in favour of the change.
Also Olivier's change to merge pktmbuf into the main mbuf itself (and eliminate ctrlmbuf) is something that I think we want to keep. It allows us much greater freedom to move fields about to fit things in well, with good alignment.

Regards,
/Bruce


* Re: [dpdk-dev] Making space in mbuf data-structure
  2014-07-15  9:07   ` Ananyev, Konstantin
@ 2014-07-16 12:32     ` Olivier MATZ
  0 siblings, 0 replies; 9+ messages in thread
From: Olivier MATZ @ 2014-07-16 12:32 UTC (permalink / raw)
  To: Ananyev, Konstantin, Richardson, Bruce, dev

Hi Konstantin,

On 07/15/2014 11:07 AM, Ananyev, Konstantin wrote:
>> As this change would impact the core of DPDK, I think it would be
>> interesting to list some representative use-cases in order to evaluate
>> the cost of each solution. This will also help for future modifications,
>> and could be included in a sort of non-regression test?
>>
>
> I think this is a very good idea.
> Let's create a list of such test-cases.
> Two obvious apps from our side would probably be:
> - test-pmd iofwd mode
> - l3fwd
> What else should we add here?

I agree that test-pmd iofwd is a good reference to track regressions
or analyze performance issues, but I'm not sure it's a real use-case.
To me, it's more important to ensure that l3fwd performance stays
stable compared to iofwd.

I think l3fwd is the most representative of a DPDK application.

Other use-cases could include:
- virtualization, maybe using ovdk?
- ip fragmentation / reassembly
- an application that encapsulates or decapsulates in a tunnel

Regards,
Olivier

