From: "Ananyev, Konstantin"
To: "Richardson, Bruce", "dev@dpdk.org"
Date: Tue, 15 Jul 2014 09:31:20 +0000
Subject: Re: [dpdk-dev] Making space in mbuf data-structure

Hi Bruce,

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Richardson, Bruce
> Sent: Friday, July 04, 2014 12:39 AM
> To: dev@dpdk.org
> Subject: [dpdk-dev] Making space in mbuf data-structure
>
> Hi all,
>
> At this stage it's been well recognised that we need to make more space in the mbuf data structure for new fields. We in Intel have had some discussion on this, and this email is to outline our current thinking and approach on the issue, and to look for additional suggestions and feedback.
>
> Firstly, we believe that there is no possible way we can ever fit all the fields we need into a 64-byte mbuf, so we need to start looking at a 128-byte mbuf instead. Examples of new fields that need to fit in include: 32-64 bits for additional offload flags; 32 more bits of filter information to support the new filters in the i40e driver; an additional 2-4 bytes for storing info on a second vlan tag; 4 bytes for storing a sequence number to enable packet reordering in future; as well as potentially a number of other fields, or splitting out fields that are superimposed over each other right now, e.g. for the qos scheduler. We also want to allow space for use by other non-Intel NIC drivers that may be open-sourced to dpdk.org in the future, where they support fields and offloads that our hardware doesn't.
>
> If we accept the fact of a 2-cache-line mbuf, then the issue becomes how to rework things so that we spread our fields over the two cache lines while causing the lowest slow-down possible. The general approach that we are looking to take is to focus the first cache line on fields that are updated on RX, so that receive only deals with one cache line. The second cache line can be used for application data and information that will only be used on the TX leg. This would allow us to work on the first cache line in RX as now, and have the second cache line prefetched in the background so that it is available when necessary. Hardware prefetches should help us out here. We also may move rarely used or slow-path RX fields, e.g. those for chained mbufs with jumbo frames, to the second cache line, depending upon the performance impact and byte savings achieved.
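On the prefetch point: a trivial software prefetch early in the RX path should hide most of the second line's latency. Just to illustrate what I have in mind (a rough sketch only, using the existing rte_prefetch0() helper from rte_prefetch.h; the function name is made up, and CACHE_LINE_SIZE == 64 is assumed):

/*
 * Sketch only: warm up the mbuf's second cache line as soon as the
 * mbuf is touched, so that TX/slow-path fields are already in cache
 * by the time they are needed. Assumes a 2-cache-line rte_mbuf.
 */
static inline void
mbuf_prefetch_part2(struct rte_mbuf *m)
{
	rte_prefetch0((char *)m + CACHE_LINE_SIZE);
}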
> With this approach, we still need to make space in the first cache line for information for the new or expanded receive offloads. First off the blocks is to look at moving the mempool pointer into the second cache line. This will free up 8 bytes in cache line 0, with a field that is only used when cleaning up after TX. A prototype patch for this change is given below, for your review and comment. Initial quick tests with testpmd (testpmd -c 600 -n 4 -- --mbcache=250 --txqflags=0xf01 --burst=32 --txrst=32 --txfreet=32 --rxfreet=32 --total-num-mbufs=65536), and l3fwd (l3fwd -c 400 -n 4 -- -p F -P --config="(0,0,10),(1,0,10),(2,0,10),(3,0,10)") showed only a slight performance decrease with testpmd, and equal or slightly better performance with l3fwd. These would both have been using the vector PMD - I haven't looked at changing the fallback TX implementation yet for this change, since it's not as performance-optimized and hence not as cycle-count sensitive.

Regarding the code changes themselves: if I understand them correctly, ixgbe_tx_free_bufs() will use rte_mempool_put_bulk() *only* if all mbufs in the bulk belong to the same mempool. If we have, let's say, 2 groups of mbufs from 2 different mempools, it wouldn't be able to use put_bulk(), while, as I remember, the current implementation is able to use put_bulk() in such a case. I think it is possible to change your code to do better grouping, along the lines sketched below.
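Rough, untested sketch only - the helper name and the array size are illustrative, and it reuses the driver's existing types plus the __rte_pktmbuf_prefree_seg() and rte_mempool_put_bulk() helpers. Instead of bulk-freeing only when everything matches txep[0]'s pool, flush each contiguous run of mbufs that shares a mempool with a single put_bulk():

/*
 * Sketch: free n mbufs from txep[], bulk-putting each contiguous run
 * of mbufs that belongs to the same mempool.
 * Assumes n <= 64 (i.e. n <= tx_rs_thresh in practice).
 */
static inline void
tx_free_grouped(struct igb_tx_entry_v *txep, uint32_t n)
{
	void *free[64];
	struct rte_mempool *pool = NULL;
	struct rte_mbuf *m;
	uint32_t i, nb_free = 0;

	for (i = 0; i < n; i++) {
		m = __rte_pktmbuf_prefree_seg(txep[i].mbuf);
		if (m == NULL)
			continue;
		if (nb_free != 0 && m->pool != pool) {
			/* pool changed: flush the run collected so far */
			rte_mempool_put_bulk(pool, free, nb_free);
			nb_free = 0;
		}
		pool = m->pool;
		free[nb_free++] = m;
	}
	if (nb_free != 0)
		rte_mempool_put_bulk(pool, free, nb_free);
}

That way 2 groups from 2 pools still end up as 2 bulk puts, rather than n individual ones.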
> Beyond this change, I'm also investigating potentially moving the "next" pointer to the second cache line, but it's looking harder to move without serious impact, so we'll have to see there what can be done. Other fields I'll examine in due course too, and send on any findings I may have.

Could you explain that a bit? I always had the impression that moving the next pointer to the 2nd cache line would have less impact than moving the mempool pointer. My thinking was: next is used only in the scatter versions of RX/TX, which are slower than the optimised RX/TX anyway, so a few extra cycles wouldn't be that noticeable. Though I admit I never did any measurements for that case.

About space savings for the first cache line:
- I still think that Olivier's suggestion of replacing the data pointer (8 bytes) with a data offset (2 bytes) makes a lot of sense, as sketched below.
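Just to illustrate why (a sketch only - the struct and names below are made up for the example, not the actual rte_mbuf layout): the payload address can be recomputed from buf_addr, which we read anyway, so nothing is lost on the fast path and 6 bytes are freed in cache line 0:

/*
 * Sketch only: a 16-bit offset from buf_addr replaces the 8-byte
 * data pointer; field and function names are illustrative.
 */
#include <stdint.h>

struct mbuf_sketch {
	void     *buf_addr;  /* start of segment buffer, as today */
	uint16_t  data_off;  /* replaces the 8-byte data pointer */
};

static inline char *
mbuf_data(const struct mbuf_sketch *m)
{
	return (char *)m->buf_addr + m->data_off;
}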
> Once we have freed up space, then we can start looking at what fields get to use that space and what way we shuffle the existing fields about, but that's a discussion for another day!
>
> Please see the patch below for moving the pool pointer to the second cache line of the mbuf. All feedback welcome, naturally.
>
> Regards,
> /Bruce
>
> diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
> index 2b87521..e0cf30c 100644
> --- a/app/test/test_mbuf.c
> +++ b/app/test/test_mbuf.c
> @@ -832,7 +832,7 @@ test_failing_mbuf_sanity_check(void)
>  int
>  test_mbuf(void)
>  {
> -	RTE_BUILD_BUG_ON(sizeof(struct rte_mbuf) != 64);
> +	RTE_BUILD_BUG_ON(sizeof(struct rte_mbuf) != CACHE_LINE_SIZE*2);
>
>  	/* create pktmbuf pool if it does not exist */
>  	if (pktmbuf_pool == NULL) {
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index 2735f37..aa3ec32 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -181,7 +181,8 @@ enum rte_mbuf_type {
>   * The generic rte_mbuf, containing a packet mbuf or a control mbuf.
>   */
>  struct rte_mbuf {
> -	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
> +	uint64_t this_space_for_rent; // 8-bytes free for reuse.
> +
>  	void *buf_addr; /**< Virtual address of segment buffer. */
>  	phys_addr_t buf_physaddr; /**< Physical address of segment buffer. */
>  	uint16_t buf_len; /**< Length of segment buffer. */
> @@ -210,12 +211,16 @@ struct rte_mbuf {
>  		struct rte_pktmbuf pkt;
>  	};
>
> +	// second cache line starts here
> +	struct rte_mempool *pool __rte_cache_aligned;
> +	/**< Pool from which mbuf was allocated. */
> +
>  	union {
>  		uint8_t metadata[0];
>  		uint16_t metadata16[0];
>  		uint32_t metadata32[0];
>  		uint64_t metadata64[0];
> -	};
> +	} __rte_cache_aligned;
>  } __rte_cache_aligned;
>
>  #define RTE_MBUF_METADATA_UINT8(mbuf, offset) \
> diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
> index 64c0695..7374e0d 100644
> --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
> +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
> @@ -171,10 +171,6 @@ struct igb_tx_queue {
>  	volatile union ixgbe_adv_tx_desc *tx_ring;
>  	uint64_t            tx_ring_phys_addr; /**< TX ring DMA address. */
>  	struct igb_tx_entry *sw_ring; /**< virtual address of SW ring. */
> -#ifdef RTE_IXGBE_INC_VECTOR
> -	/** continuous tx entry sequence within the same mempool */
> -	struct igb_tx_entry_seq *sw_ring_seq;
> -#endif
>  	volatile uint32_t   *tdt_reg_addr; /**< Address of TDT register. */
>  	uint16_t            nb_tx_desc; /**< number of TX descriptors. */
>  	uint16_t            tx_tail; /**< current value of TDT reg. */
> diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
> index 09e19a3..e9fd739 100644
> --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
> +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
> @@ -401,9 +401,8 @@ static inline int __attribute__((always_inline))
>  ixgbe_tx_free_bufs(struct igb_tx_queue *txq)
>  {
>  	struct igb_tx_entry_v *txep;
> -	struct igb_tx_entry_seq *txsp;
>  	uint32_t status;
> -	uint32_t n, k;
> +	uint32_t n;
>  #ifdef RTE_MBUF_SCATTER_GATHER
>  	uint32_t i;
>  	int nb_free = 0;
> @@ -423,25 +422,36 @@ ixgbe_tx_free_bufs(struct igb_tx_queue *txq)
>  	 */
>  	txep = &((struct igb_tx_entry_v *)txq->sw_ring)[txq->tx_next_dd -
>  			(n - 1)];
> -	txsp = &txq->sw_ring_seq[txq->tx_next_dd - (n - 1)];
>
> -	while (n > 0) {
> -		k = RTE_MIN(n, txsp[n-1].same_pool);
>  #ifdef RTE_MBUF_SCATTER_GATHER
> -		for (i = 0; i < k; i++) {
> -			m = __rte_pktmbuf_prefree_seg((txep+n-k+i)->mbuf);
> +	m = __rte_pktmbuf_prefree_seg(txep[0].mbuf);
> +	if (likely(m != NULL)) {
> +		free[0] = m;
> +		nb_free = 1;
> +		for (i = 1; i < n; i++) {
> +			m = __rte_pktmbuf_prefree_seg(txep[i].mbuf);
> +			if (likely(m != NULL)) {
> +				if (likely(m->pool == free[0]->pool))
> +					free[nb_free++] = m;
> +				else
> +					rte_mempool_put(m->pool, m);
> +			}
> +		}
> +		rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
> +	}
> +	else {
> +		for (i = 1; i < n; i++) {
> +			m = __rte_pktmbuf_prefree_seg(txep[i].mbuf);
>  			if (m != NULL)
> -				free[nb_free++] = m;
> +				rte_mempool_put(m->pool, m);
>  		}
> -		rte_mempool_put_bulk((void *)txsp[n-1].pool,
> -				(void **)free, nb_free);
> -#else
> -		rte_mempool_put_bulk((void *)txsp[n-1].pool,
> -				(void **)(txep+n-k), k);
> -#endif
> -		n -= k;
>  	}
>
> +#else /* no scatter_gather */
> +	for (i = 0; i < n; i++)
> +		rte_mempool_put(txep[i].mbuf->pool, txep[i].mbuf);
> +#endif /* scatter_gather */
> +
>  	/* buffers were freed, update counters */
>  	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
>  	txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
> @@ -453,19 +463,11 @@ ixgbe_tx_free_bufs(struct igb_tx_queue *txq)
>
>  static inline void __attribute__((always_inline))
>  tx_backlog_entry(struct igb_tx_entry_v *txep,
> -		struct igb_tx_entry_seq *txsp,
>  		struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
>  {
>  	int i;
> -	for (i = 0; i < (int)nb_pkts; ++i) {
> +	for (i = 0; i < (int)nb_pkts; ++i)
>  		txep[i].mbuf = tx_pkts[i];
> -		/* check and update sequence number */
> -		txsp[i].pool = tx_pkts[i]->pool;
> -		if (txsp[i-1].pool == tx_pkts[i]->pool)
> -			txsp[i].same_pool = txsp[i-1].same_pool + 1;
> -		else
> -			txsp[i].same_pool = 1;
> -	}
>  }
>
>  uint16_t
> @@ -475,7 +477,6 @@ ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
>  	struct igb_tx_queue *txq = (struct igb_tx_queue *)tx_queue;
>  	volatile union ixgbe_adv_tx_desc *txdp;
>  	struct igb_tx_entry_v *txep;
> -	struct igb_tx_entry_seq *txsp;
>  	uint16_t n, nb_commit, tx_id;
>  	__m128i flags = _mm_set_epi32(DCMD_DTYP_FLAGS, 0, 0, 0);
>  	__m128i rs = _mm_set_epi32(IXGBE_ADVTXD_DCMD_RS|DCMD_DTYP_FLAGS,
> @@ -495,14 +496,13 @@ ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
>  	tx_id = txq->tx_tail;
>  	txdp = &txq->tx_ring[tx_id];
>  	txep = &((struct igb_tx_entry_v *)txq->sw_ring)[tx_id];
> -	txsp = &txq->sw_ring_seq[tx_id];
>
>  	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
>
>  	n = (uint16_t)(txq->nb_tx_desc - tx_id);
>  	if (nb_commit >= n) {
>
> -		tx_backlog_entry(txep, txsp, tx_pkts, n);
> +		tx_backlog_entry(txep, tx_pkts, n);
>
>  		for (i = 0; i < n - 1; ++i, ++tx_pkts, ++txdp)
>  			vtx1(txdp, *tx_pkts, flags);
> @@ -517,10 +517,9 @@ ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
>  		/* avoid reach the end of ring */
>  		txdp = &(txq->tx_ring[tx_id]);
>  		txep = &(((struct igb_tx_entry_v *)txq->sw_ring)[tx_id]);
> -		txsp = &(txq->sw_ring_seq[tx_id]);
>  	}
>
> -	tx_backlog_entry(txep, txsp, tx_pkts, nb_commit);
> +	tx_backlog_entry(txep, tx_pkts, nb_commit);
>
>  	vtx(txdp, tx_pkts, nb_commit, flags);
>
> @@ -544,7 +543,6 @@ ixgbe_tx_queue_release_mbufs(struct igb_tx_queue *txq)
>  {
>  	unsigned i;
>  	struct igb_tx_entry_v *txe;
> -	struct igb_tx_entry_seq *txs;
>  	uint16_t nb_free, max_desc;
>
>  	if (txq->sw_ring != NULL) {
> @@ -562,10 +560,6 @@ ixgbe_tx_queue_release_mbufs(struct igb_tx_queue *txq)
>  		for (i = 0; i < txq->nb_tx_desc; i++) {
>  			txe = (struct igb_tx_entry_v *)&txq->sw_ring[i];
>  			txe->mbuf = NULL;
> -
> -			txs = &txq->sw_ring_seq[i];
> -			txs->pool = NULL;
> -			txs->same_pool = 0;
>  		}
>  	}
>  }
> @@ -580,11 +574,6 @@ ixgbe_tx_free_swring(struct igb_tx_queue *txq)
>  		rte_free((struct igb_rx_entry *)txq->sw_ring - 1);
>  		txq->sw_ring = NULL;
>  	}
> -
> -	if (txq->sw_ring_seq != NULL) {
> -		rte_free(txq->sw_ring_seq - 1);
> -		txq->sw_ring_seq = NULL;
> -	}
>  }
>
>  static void
> @@ -593,7 +582,6 @@ ixgbe_reset_tx_queue(struct igb_tx_queue *txq)
>  	static const union ixgbe_adv_tx_desc zeroed_desc = { .read = {
>  			.buffer_addr = 0} };
>  	struct igb_tx_entry_v *txe = (struct igb_tx_entry_v *)txq->sw_ring;
> -	struct igb_tx_entry_seq *txs = txq->sw_ring_seq;
>  	uint16_t i;
>
>  	/* Zero out HW ring memory */
> @@ -605,8 +593,6 @@ ixgbe_reset_tx_queue(struct igb_tx_queue *txq)
>  		volatile union ixgbe_adv_tx_desc *txd = &txq->tx_ring[i];
>  		txd->wb.status = IXGBE_TXD_STAT_DD;
>  		txe[i].mbuf = NULL;
> -		txs[i].pool = NULL;
> -		txs[i].same_pool = 0;
>  	}
>
>  	txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
> @@ -632,27 +618,14 @@ static struct ixgbe_txq_ops vec_txq_ops = {
>  };
>
>  int ixgbe_txq_vec_setup(struct igb_tx_queue *txq,
> -		unsigned int socket_id)
> +		__rte_unused unsigned int socket_id)
>  {
> -	uint16_t nb_desc;
> -
>  	if (txq->sw_ring == NULL)
>  		return -1;
>
> -	/* request addtional one entry for continous sequence check */
> -	nb_desc = (uint16_t)(txq->nb_tx_desc + 1);
> -
> -	txq->sw_ring_seq = rte_zmalloc_socket("txq->sw_ring_seq",
> -			sizeof(struct igb_tx_entry_seq) * nb_desc,
> -			CACHE_LINE_SIZE, socket_id);
> -	if (txq->sw_ring_seq == NULL)
> -		return -1;
> -
> -
>  	/* leave the first one for overflow */
>  	txq->sw_ring = (struct igb_tx_entry *)
>  		((struct igb_tx_entry_v *)txq->sw_ring + 1);
> -	txq->sw_ring_seq += 1;
>  	txq->ops = &vec_txq_ops;
>
>  	return 0;