From: "Ananyev, Konstantin"
To: "Richardson, Bruce", "dev@dpdk.org"
Date: Tue, 15 Jul 2014 09:31:20 +0000
Subject: Re: [dpdk-dev] Making space in mbuf data-structure

Hi Bruce,

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Richardson, Bruce
> Sent: Friday, July 04, 2014 12:39 AM
> To: dev@dpdk.org
> Subject: [dpdk-dev] Making space in mbuf data-structure
>
> Hi all,
>
> At this stage it's been well recognised that we need to make more space in the mbuf data structure for new fields. We in Intel have had some discussion on this, and this email is to outline our current thinking and approach on the issue, and to look for additional suggestions and feedback.
>
> Firstly, we believe that there is no possible way we can ever fit all the fields we need into a 64-byte mbuf, so we need to start looking at a 128-byte mbuf instead. Examples of new fields that need to fit in include: 32-64 bits for additional offload flags; 32 more bits of filter information to support the new filters in the i40e driver; an additional 2-4 bytes for storing info on a second vlan tag; 4 bytes for storing a sequence number to enable packet reordering in future; as well as potentially a number of other fields, or splitting out fields that are superimposed over each other right now, e.g. for the qos scheduler. We also want to allow space for use by other non-Intel NIC drivers that may be open-sourced to dpdk.org in the future, where they support fields and offloads that our hardware doesn't.
>
> If we accept the fact of a 2-cache-line mbuf, then the issue becomes how to rework things so that we spread our fields over the two cache lines while causing the lowest slow-down possible. The general approach that we are looking to take is to focus the first cache line on fields that are updated on RX, so that receive only deals with one cache line. The second cache line can be used for application data and information that will only be used on the TX leg. This would allow us to work on the first cache line in RX as now, and have the second cache line prefetched in the background so that it is available when necessary. Hardware prefetches should help us out here. We also may move rarely used or slow-path RX fields, e.g. those for chained mbufs with jumbo frames, to the second cache line, depending upon the performance impact and byte savings achieved.
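On the prefetch point: a trivial software prefetch early in the RX path should hide most of the second line's latency. Just to illustrate what I have in mind (a rough sketch only, using the existing rte_prefetch0() helper from rte_prefetch.h; the function name is made up, and CACHE_LINE_SIZE == 64 is assumed):

/*
 * Sketch only: warm up the mbuf's second cache line as soon as the
 * mbuf is touched, so that TX/slow-path fields are already in cache
 * by the time they are needed. Assumes a 2-cache-line rte_mbuf.
 */
static inline void
mbuf_prefetch_part2(struct rte_mbuf *m)
{
	rte_prefetch0((char *)m + CACHE_LINE_SIZE);
}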
> With this approach, we still need to make space in the first cache line for information for the new or expanded receive offloads. First off the blocks is to look at moving the mempool pointer into the second cache line. This will free up 8 bytes in cache line 0, with a field that is only used when cleaning up after TX. A prototype patch for this change is given below, for your review and comment. Initial quick tests with testpmd (testpmd -c 600 -n 4 -- --mbcache=250 --txqflags=0xf01 --burst=32 --txrst=32 --txfreet=32 --rxfreet=32 --total-num-mbufs=65536), and l3fwd (l3fwd -c 400 -n 4 -- -p F -P --config="(0,0,10),(1,0,10),(2,0,10),(3,0,10)") showed only a slight performance decrease with testpmd, and equal or slightly better performance with l3fwd. These would both have been using the vector PMD - I haven't looked at changing the fallback TX implementation yet for this change, since it's not as performance-optimized and hence not as cycle-count sensitive.

Regarding the code changes themselves: if I understand them correctly, ixgbe_tx_free_bufs() will use rte_mempool_put_bulk() *only* if all mbufs in the bulk belong to the same mempool. If we have, let's say, 2 groups of mbufs from 2 different mempools, it wouldn't be able to use put_bulk(), while, as I remember, the current implementation is able to use put_bulk() in such a case. I think it is possible to change your code to do better grouping, along the lines sketched below.
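Rough, untested sketch only - the helper name and the array size are illustrative, and it reuses the driver's existing types plus the __rte_pktmbuf_prefree_seg() and rte_mempool_put_bulk() helpers. Instead of bulk-freeing only when everything matches txep[0]'s pool, flush each contiguous run of mbufs that shares a mempool with a single put_bulk():

/*
 * Sketch: free n mbufs from txep[], bulk-putting each contiguous run
 * of mbufs that belongs to the same mempool.
 * Assumes n <= 64 (i.e. n <= tx_rs_thresh in practice).
 */
static inline void
tx_free_grouped(struct igb_tx_entry_v *txep, uint32_t n)
{
	void *free[64];
	struct rte_mempool *pool = NULL;
	struct rte_mbuf *m;
	uint32_t i, nb_free = 0;

	for (i = 0; i < n; i++) {
		m = __rte_pktmbuf_prefree_seg(txep[i].mbuf);
		if (m == NULL)
			continue;
		if (nb_free != 0 && m->pool != pool) {
			/* pool changed: flush the run collected so far */
			rte_mempool_put_bulk(pool, free, nb_free);
			nb_free = 0;
		}
		pool = m->pool;
		free[nb_free++] = m;
	}
	if (nb_free != 0)
		rte_mempool_put_bulk(pool, free, nb_free);
}

That way 2 groups from 2 pools still end up as 2 bulk puts, rather than n individual ones.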
> Beyond this change, I'm also investigating potentially moving the "next" pointer to the second cache line, but it's looking harder to move without serious impact, so we'll have to see there what can be done. Other fields I'll examine in due course too, and send on any findings I may have.

Could you explain that a bit? I always had the impression that moving the next pointer to the 2nd cache line would have less impact than moving the mempool pointer. My thinking was: next is used only in the scatter versions of RX/TX, which are slower than the optimised RX/TX anyway, so a few extra cycles wouldn't be that noticeable. Though I admit I never did any measurements for that case.

About space savings for the first cache line:
- I still think that Olivier's suggestion of replacing the data pointer (8 bytes) with a data offset (2 bytes) makes a lot of sense, as sketched below.
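Just to illustrate why (a sketch only - the struct and names below are made up for the example, not the actual rte_mbuf layout): the payload address can be recomputed from buf_addr, which we read anyway, so nothing is lost on the fast path and 6 bytes are freed in cache line 0:

/*
 * Sketch only: a 16-bit offset from buf_addr replaces the 8-byte
 * data pointer; field and function names are illustrative.
 */
#include <stdint.h>

struct mbuf_sketch {
	void     *buf_addr;  /* start of segment buffer, as today */
	uint16_t  data_off;  /* replaces the 8-byte data pointer */
};

static inline char *
mbuf_data(const struct mbuf_sketch *m)
{
	return (char *)m->buf_addr + m->data_off;
}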
> Once we have freed up space, then we can start looking at what fields get to use that space and what way we shuffle the existing fields about, but that's a discussion for another day!
>
> Please see the patch below for moving the pool pointer to the second cache line of the mbuf. All feedback welcome, naturally.
>
> Regards,
> /Bruce
>
> diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
> index 2b87521..e0cf30c 100644
> --- a/app/test/test_mbuf.c
> +++ b/app/test/test_mbuf.c
> @@ -832,7 +832,7 @@ test_failing_mbuf_sanity_check(void)
>  int
>  test_mbuf(void)
>  {
> -	RTE_BUILD_BUG_ON(sizeof(struct rte_mbuf) != 64);
> +	RTE_BUILD_BUG_ON(sizeof(struct rte_mbuf) != CACHE_LINE_SIZE*2);
>
>  	/* create pktmbuf pool if it does not exist */
>  	if (pktmbuf_pool == NULL) {
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index 2735f37..aa3ec32 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -181,7 +181,8 @@ enum rte_mbuf_type {
>   * The generic rte_mbuf, containing a packet mbuf or a control mbuf.
>   */
>  struct rte_mbuf {
> -	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
> +	uint64_t this_space_for_rent; // 8-bytes free for reuse.
> +
>  	void *buf_addr; /**< Virtual address of segment buffer. */
>  	phys_addr_t buf_physaddr; /**< Physical address of segment buffer. */
>  	uint16_t buf_len; /**< Length of segment buffer. */
> @@ -210,12 +211,16 @@ struct rte_mbuf {
>  		struct rte_pktmbuf pkt;
>  	};
>
> +	// second cache line starts here
> +	struct rte_mempool *pool __rte_cache_aligned;
> +	/**< Pool from which mbuf was allocated. */
> +
>  	union {
>  		uint8_t metadata[0];
>  		uint16_t metadata16[0];
>  		uint32_t metadata32[0];
>  		uint64_t metadata64[0];
> -	};
> +	} __rte_cache_aligned;
>  } __rte_cache_aligned;
>
>  #define RTE_MBUF_METADATA_UINT8(mbuf, offset) \
> diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
> index 64c0695..7374e0d 100644
> --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
> +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
> @@ -171,10 +171,6 @@ struct igb_tx_queue {
>  	volatile union ixgbe_adv_tx_desc *tx_ring;
>  	uint64_t            tx_ring_phys_addr; /**< TX ring DMA address. */
>  	struct igb_tx_entry *sw_ring; /**< virtual address of SW ring. */
> -#ifdef RTE_IXGBE_INC_VECTOR
> -	/** continuous tx entry sequence within the same mempool */
> -	struct igb_tx_entry_seq *sw_ring_seq;
> -#endif
>  	volatile uint32_t   *tdt_reg_addr; /**< Address of TDT register. */
>  	uint16_t            nb_tx_desc; /**< number of TX descriptors. */
>  	uint16_t            tx_tail; /**< current value of TDT reg. */
> diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
> index 09e19a3..e9fd739 100644
> --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
> +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
> @@ -401,9 +401,8 @@ static inline int __attribute__((always_inline))
>  ixgbe_tx_free_bufs(struct igb_tx_queue *txq)
>  {
>  	struct igb_tx_entry_v *txep;
> -	struct igb_tx_entry_seq *txsp;
>  	uint32_t status;
> -	uint32_t n, k;
> +	uint32_t n;
>  #ifdef RTE_MBUF_SCATTER_GATHER
>  	uint32_t i;
>  	int nb_free = 0;
> @@ -423,25 +422,36 @@ ixgbe_tx_free_bufs(struct igb_tx_queue *txq)
>  	 */
>  	txep = &((struct igb_tx_entry_v *)txq->sw_ring)[txq->tx_next_dd -
>  			(n - 1)];
> -	txsp = &txq->sw_ring_seq[txq->tx_next_dd - (n - 1)];
>
> -	while (n > 0) {
> -		k = RTE_MIN(n, txsp[n-1].same_pool);
>  #ifdef RTE_MBUF_SCATTER_GATHER
> -		for (i = 0; i < k; i++) {
> -			m = __rte_pktmbuf_prefree_seg((txep+n-k+i)->mbuf);
> +	m = __rte_pktmbuf_prefree_seg(txep[0].mbuf);
> +	if (likely(m != NULL)) {
> +		free[0] = m;
> +		nb_free = 1;
> +		for (i = 1; i < n; i++) {
> +			m = __rte_pktmbuf_prefree_seg(txep[i].mbuf);
> +			if (likely(m != NULL)) {
> +				if (likely(m->pool == free[0]->pool))
> +					free[nb_free++] = m;
> +				else
> +					rte_mempool_put(m->pool, m);
> +			}
> +		}
> +		rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
> +	}
> +	else {
> +		for (i = 1; i < n; i++) {
> +			m = __rte_pktmbuf_prefree_seg(txep[i].mbuf);
>  			if (m != NULL)
> -				free[nb_free++] = m;
> +				rte_mempool_put(m->pool, m);
>  		}
> -		rte_mempool_put_bulk((void *)txsp[n-1].pool,
> -				(void **)free, nb_free);
> -#else
> -		rte_mempool_put_bulk((void *)txsp[n-1].pool,
> -				(void **)(txep+n-k), k);
> -#endif
> -		n -= k;
>  	}
>
> +#else /* no scatter_gather */
> +	for (i = 0; i < n; i++)
> +		rte_mempool_put(txep[i].mbuf->pool, txep[i].mbuf);
> +#endif /* scatter_gather */
> +
>  	/* buffers were freed, update counters */
>  	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
>  	txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
> @@ -453,19 +463,11 @@ ixgbe_tx_free_bufs(struct igb_tx_queue *txq)
>
>  static inline void __attribute__((always_inline))
>  tx_backlog_entry(struct igb_tx_entry_v *txep,
> -		struct igb_tx_entry_seq *txsp,
>  		struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
>  {
>  	int i;
> -	for (i = 0; i < (int)nb_pkts; ++i) {
> +	for (i = 0; i < (int)nb_pkts; ++i)
>  		txep[i].mbuf = tx_pkts[i];
> -		/* check and update sequence number */
> -		txsp[i].pool = tx_pkts[i]->pool;
> -		if (txsp[i-1].pool == tx_pkts[i]->pool)
> -			txsp[i].same_pool = txsp[i-1].same_pool + 1;
> -		else
> -			txsp[i].same_pool = 1;
> -	}
>  }
>
>  uint16_t
> @@ -475,7 +477,6 @@ ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
>  	struct igb_tx_queue *txq = (struct igb_tx_queue *)tx_queue;
>  	volatile union ixgbe_adv_tx_desc *txdp;
>  	struct igb_tx_entry_v *txep;
> -	struct igb_tx_entry_seq *txsp;
>  	uint16_t n, nb_commit, tx_id;
>  	__m128i flags = _mm_set_epi32(DCMD_DTYP_FLAGS, 0, 0, 0);
>  	__m128i rs = _mm_set_epi32(IXGBE_ADVTXD_DCMD_RS|DCMD_DTYP_FLAGS,
> @@ -495,14 +496,13 @@ ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
>  	tx_id = txq->tx_tail;
>  	txdp = &txq->tx_ring[tx_id];
>  	txep = &((struct igb_tx_entry_v *)txq->sw_ring)[tx_id];
> -	txsp = &txq->sw_ring_seq[tx_id];
>
>  	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
>
>  	n = (uint16_t)(txq->nb_tx_desc - tx_id);
>  	if (nb_commit >= n) {
>
> -		tx_backlog_entry(txep, txsp, tx_pkts, n);
> +		tx_backlog_entry(txep, tx_pkts, n);
>
>  		for (i = 0; i < n - 1; ++i, ++tx_pkts, ++txdp)
>  			vtx1(txdp, *tx_pkts, flags);
> @@ -517,10 +517,9 @@ ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
>  		/* avoid reach the end of ring */
>  		txdp = &(txq->tx_ring[tx_id]);
>  		txep = &(((struct igb_tx_entry_v *)txq->sw_ring)[tx_id]);
> -		txsp = &(txq->sw_ring_seq[tx_id]);
>  	}
>
> -	tx_backlog_entry(txep, txsp, tx_pkts, nb_commit);
> +	tx_backlog_entry(txep, tx_pkts, nb_commit);
>
>  	vtx(txdp, tx_pkts, nb_commit, flags);
>
> @@ -544,7 +543,6 @@ ixgbe_tx_queue_release_mbufs(struct igb_tx_queue *txq)
>  {
>  	unsigned i;
>  	struct igb_tx_entry_v *txe;
> -	struct igb_tx_entry_seq *txs;
>  	uint16_t nb_free, max_desc;
>
>  	if (txq->sw_ring != NULL) {
> @@ -562,10 +560,6 @@ ixgbe_tx_queue_release_mbufs(struct igb_tx_queue *txq)
>  		for (i = 0; i < txq->nb_tx_desc; i++) {
>  			txe = (struct igb_tx_entry_v *)&txq->sw_ring[i];
>  			txe->mbuf = NULL;
> -
> -			txs = &txq->sw_ring_seq[i];
> -			txs->pool = NULL;
> -			txs->same_pool = 0;
>  		}
>  	}
>  }
> @@ -580,11 +574,6 @@ ixgbe_tx_free_swring(struct igb_tx_queue *txq)
>  		rte_free((struct igb_rx_entry *)txq->sw_ring - 1);
>  		txq->sw_ring = NULL;
>  	}
> -
> -	if (txq->sw_ring_seq != NULL) {
> -		rte_free(txq->sw_ring_seq - 1);
> -		txq->sw_ring_seq = NULL;
> -	}
>  }
>
>  static void
> @@ -593,7 +582,6 @@ ixgbe_reset_tx_queue(struct igb_tx_queue *txq)
>  	static const union ixgbe_adv_tx_desc zeroed_desc = { .read = {
>  			.buffer_addr = 0} };
>  	struct igb_tx_entry_v *txe = (struct igb_tx_entry_v *)txq->sw_ring;
> -	struct igb_tx_entry_seq *txs = txq->sw_ring_seq;
>  	uint16_t i;
>
>  	/* Zero out HW ring memory */
> @@ -605,8 +593,6 @@ ixgbe_reset_tx_queue(struct igb_tx_queue *txq)
>  		volatile union ixgbe_adv_tx_desc *txd = &txq->tx_ring[i];
>  		txd->wb.status = IXGBE_TXD_STAT_DD;
>  		txe[i].mbuf = NULL;
> -		txs[i].pool = NULL;
> -		txs[i].same_pool = 0;
>  	}
>
>  	txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
> @@ -632,27 +618,14 @@ static struct ixgbe_txq_ops vec_txq_ops = {
>  };
>
>  int ixgbe_txq_vec_setup(struct igb_tx_queue *txq,
> -		unsigned int socket_id)
> +		__rte_unused unsigned int socket_id)
>  {
> -	uint16_t nb_desc;
> -
>  	if (txq->sw_ring == NULL)
>  		return -1;
>
> -	/* request addtional one entry for continous sequence check */
> -	nb_desc = (uint16_t)(txq->nb_tx_desc + 1);
> -
> -	txq->sw_ring_seq = rte_zmalloc_socket("txq->sw_ring_seq",
> -			sizeof(struct igb_tx_entry_seq) * nb_desc,
> -			CACHE_LINE_SIZE, socket_id);
> -	if (txq->sw_ring_seq == NULL)
> -		return -1;
> -
> -
>  	/* leave the first one for overflow */
>  	txq->sw_ring = (struct igb_tx_entry *)
>  		((struct igb_tx_entry_v *)txq->sw_ring + 1);
> -	txq->sw_ring_seq += 1;
>  	txq->ops = &vec_txq_ops;
>
>  	return 0;