From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <huawei.xie@intel.com>
Received: from mga04.intel.com (mga04.intel.com [192.55.52.120])
 by dpdk.org (Postfix) with ESMTP id C22C32BAD
 for <dev@dpdk.org>; Thu,  3 Mar 2016 17:21:28 +0100 (CET)
Received: from fmsmga004.fm.intel.com ([10.253.24.48])
 by fmsmga104.fm.intel.com with ESMTP; 03 Mar 2016 08:21:27 -0800
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.22,532,1449561600"; d="scan'208";a="59221213"
Received: from fmsmsx104.amr.corp.intel.com ([10.18.124.202])
 by fmsmga004.fm.intel.com with ESMTP; 03 Mar 2016 08:21:27 -0800
Received: from FMSMSX109.amr.corp.intel.com (10.18.116.9) by
 fmsmsx104.amr.corp.intel.com (10.18.124.202) with Microsoft SMTP Server (TLS)
 id 14.3.248.2; Thu, 3 Mar 2016 08:21:27 -0800
Received: from shsmsx103.ccr.corp.intel.com (10.239.4.69) by
 fmsmsx109.amr.corp.intel.com (10.18.116.9) with Microsoft SMTP Server (TLS)
 id 14.3.248.2; Thu, 3 Mar 2016 08:21:27 -0800
Received: from shsmsx101.ccr.corp.intel.com ([169.254.1.136]) by
 SHSMSX103.ccr.corp.intel.com ([169.254.4.24]) with mapi id 14.03.0248.002;
 Fri, 4 Mar 2016 00:21:19 +0800
From: "Xie, Huawei" <huawei.xie@intel.com>
To: Yuanhan Liu <yuanhan.liu@linux.intel.com>, "dev@dpdk.org" <dev@dpdk.org>
Thread-Topic: [PATCH v2 1/7] vhost: refactor rte_vhost_dequeue_burst
Thread-Index: AdF1aLd5clXhimYJTx2/yKoyh8rOiw==
Date: Thu, 3 Mar 2016 16:21:19 +0000
Message-ID: <C37D651A908B024F974696C65296B57B4C626D94@SHSMSX101.ccr.corp.intel.com>
References: <1449122773-25510-1-git-send-email-yuanhan.liu@linux.intel.com>
 <1455803352-5518-1-git-send-email-yuanhan.liu@linux.intel.com>
 <1455803352-5518-2-git-send-email-yuanhan.liu@linux.intel.com>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-originating-ip: [10.239.127.40]
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Cc: Victor Kaplansky <vkaplans@redhat.com>,
 "Michael S. Tsirkin" <mst@redhat.com>
Subject: Re: [dpdk-dev] [PATCH v2 1/7] vhost: refactor
	rte_vhost_dequeue_burst
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: patches and discussions about DPDK <dev.dpdk.org>
List-Unsubscribe: <http://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <http://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Thu, 03 Mar 2016 16:21:29 -0000

On 2/18/2016 9:48 PM, Yuanhan Liu wrote:
> The current rte_vhost_dequeue_burst() implementation is a bit messy
> and its logic twisted, with repeated code here and there: it
> invokes rte_pktmbuf_alloc() three times at three different places!
>
> However, rte_vhost_dequeue_burst() actually does a simple job: copy
> the packet data from vring desc to mbuf. What's tricky here is:
>
> - a desc buff could be chained (by the desc->next field), so you need
>   to fetch the next one if the current one is wholly drained.
>
> - one mbuf might not be big enough to hold all of the desc buff data,
>   hence you need to chain the mbufs as well, by the mbuf->next field.
>
> Even so, the logic can be kept simple. Here is the pseudo code:
>
> 	while (this_desc_is_not_drained_totally || has_next_desc) {
> 		if (this_desc_has_drained_totally) {
> 			this_desc = next_desc();
> 		}
>
> 		if (mbuf_has_no_room) {
> 			mbuf = allocate_a_new_mbuf();
> 		}
>
> 		COPY(mbuf, desc);
> 	}
>
> And this is how I refactored rte_vhost_dequeue_burst.
>
> Note that the old code did special handling to skip the virtio
> header. However, that can simply be done by adjusting the desc_avail
> and desc_offset vars:
>
> 	desc_avail  = desc->len - vq->vhost_hlen;
> 	desc_offset = vq->vhost_hlen;
>
> This refactoring makes the code much more readable (IMO), yet it
> reduces the binary code size (by nearly 2K).
>
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> ---
>
> v2: - fix potential NULL dereference bug of vars "prev" and "head"
> ---
>  lib/librte_vhost/vhost_rxtx.c | 297 +++++++++++++++++-------------------------
>  1 file changed, 116 insertions(+), 181 deletions(-)
>
> diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
> index 5e7e5b1..d5cd0fa 100644
> --- a/lib/librte_vhost/vhost_rxtx.c
> +++ b/lib/librte_vhost/vhost_rxtx.c
> @@ -702,21 +702,104 @@ vhost_dequeue_offload(struct virtio_net_hdr *hdr, struct rte_mbuf *m)
>  	}
>  }
>  
> +static inline struct rte_mbuf *
> +copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
> +		  uint16_t desc_idx, struct rte_mempool *mbuf_pool)
> +{
> +	struct vring_desc *desc;
> +	uint64_t desc_addr;
> +	uint32_t desc_avail, desc_offset;
> +	uint32_t mbuf_avail, mbuf_offset;
> +	uint32_t cpy_len;
> +	struct rte_mbuf *head = NULL;
> +	struct rte_mbuf *cur = NULL, *prev = NULL;
> +	struct virtio_net_hdr *hdr;
> +
> +	desc = &vq->desc[desc_idx];
> +	desc_addr = gpa_to_vva(dev, desc->addr);
> +	rte_prefetch0((void *)(uintptr_t)desc_addr);
> +
> +	/* Retrieve virtio net header */
> +	hdr = (struct virtio_net_hdr *)((uintptr_t)desc_addr);
> +	desc_avail  = desc->len - vq->vhost_hlen;

There is a serious bug here: desc->len - vq->vhost_hlen can wrap around
(unsigned underflow) whenever the guest supplies a descriptor with
desc->len < vq->vhost_hlen, leaving desc_avail huge. A VM could easily
create this case. Let us fix it here.

> +	desc_offset = vq->vhost_hlen;
> +
> +	mbuf_avail  = 0;
> +	mbuf_offset = 0;
> +	while (desc_avail || (desc->flags & VRING_DESC_F_NEXT) != 0) {
> +		/* This desc reachs to its end, get the next one */
> +		if (desc_avail == 0) {
> +			desc = &vq->desc[desc->next];
> +
> +			desc_addr = gpa_to_vva(dev, desc->addr);
> +			rte_prefetch0((void *)(uintptr_t)desc_addr);
> +
> +			desc_offset = 0;
> +			desc_avail  = desc->len;
> +
> +			PRINT_PACKET(dev, (uintptr_t)desc_addr, desc->len, 0);
> +		}
> +
> +		/*
> +		 * This mbuf reachs to its end, get a new one
> +		 * to hold more data.
> +		 */
> +		if (mbuf_avail == 0) {
> +			cur = rte_pktmbuf_alloc(mbuf_pool);
> +			if (unlikely(!cur)) {
> +				RTE_LOG(ERR, VHOST_DATA, "Failed to "
> +					"allocate memory for mbuf.\n");
> +				if (head)
> +					rte_pktmbuf_free(head);
> +				return NULL;
> +			}
> +			if (!head) {
> +				head = cur;
> +			} else {
> +				prev->next = cur;
> +				prev->data_len = mbuf_offset;
> +				head->nb_segs += 1;
> +			}
> +			head->pkt_len += mbuf_offset;
> +			prev = cur;
> +
> +			mbuf_offset = 0;
> +			mbuf_avail  = cur->buf_len - RTE_PKTMBUF_HEADROOM;
> +		}
> +
> +		cpy_len = RTE_MIN(desc_avail, mbuf_avail);
> +		rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, mbuf_offset),
> +			(void *)((uintptr_t)(desc_addr + desc_offset)),
> +			cpy_len);
> +
> +		mbuf_avail  -= cpy_len;
> +		mbuf_offset += cpy_len;
> +		desc_avail  -= cpy_len;
> +		desc_offset += cpy_len;
> +	}
> +
> +	if (prev) {
> +		prev->data_len = mbuf_offset;
> +		head->pkt_len += mbuf_offset;
> +
> +		if (hdr->flags != 0 || hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE)
> +			vhost_dequeue_offload(hdr, head);
> +	}
> +
> +	return head;
> +}
> +
>  uint16_t
>  rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
>  	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
>  {
> -	struct rte_mbuf *m, *prev;
>  	struct vhost_virtqueue *vq;
> -	struct vring_desc *desc;
> -	uint64_t vb_addr = 0;
> -	uint64_t vb_net_hdr_addr = 0;
> -	uint32_t head[MAX_PKT_BURST];
> +	uint32_t desc_indexes[MAX_PKT_BURST];
>  	uint32_t used_idx;
>  	uint32_t i;
> -	uint16_t free_entries, entry_success = 0;
> +	uint16_t free_entries;
>  	uint16_t avail_idx;
> -	struct virtio_net_hdr *hdr = NULL;
> +	struct rte_mbuf *m;
>  
>  	if (unlikely(!is_valid_virt_queue_idx(queue_id, 1, dev->virt_qp_nb))) {
>  		RTE_LOG(ERR, VHOST_DATA,
> @@ -730,197 +813,49 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
>  		return 0;
>  
>  	avail_idx =  *((volatile uint16_t *)&vq->avail->idx);
> -
> -	/* If there are no available buffers then return. */
> -	if (vq->last_used_idx == avail_idx)
> +	free_entries = avail_idx - vq->last_used_idx;
> +	if (free_entries == 0)
>  		return 0;
>  
> -	LOG_DEBUG(VHOST_DATA, "%s (%"PRIu64")\n", __func__,
> -		dev->device_fh);
> +	LOG_DEBUG(VHOST_DATA, "%s (%"PRIu64")\n", __func__, dev->device_fh);
>  
> -	/* Prefetch available ring to retrieve head indexes. */
> -	rte_prefetch0(&vq->avail->ring[vq->last_used_idx & (vq->size - 1)]);
> +	used_idx = vq->last_used_idx & (vq->size -1);
>  
> -	/*get the number of free entries in the ring*/
> -	free_entries = (avail_idx - vq->last_used_idx);
> +	/* Prefetch available ring to retrieve head indexes. */
> +	rte_prefetch0(&vq->avail->ring[used_idx]);
>  
> -	free_entries = RTE_MIN(free_entries, count);
> -	/* Limit to MAX_PKT_BURST. */
> -	free_entries = RTE_MIN(free_entries, MAX_PKT_BURST);
> +	count = RTE_MIN(count, MAX_PKT_BURST);
> +	count = RTE_MIN(count, free_entries);
> +	LOG_DEBUG(VHOST_DATA, "(%"PRIu64") about to dequeue %u buffers\n",
> +			dev->device_fh, count);
>  
> -	LOG_DEBUG(VHOST_DATA, "(%"PRIu64") Buffers available %d\n",
> -			dev->device_fh, free_entries);
>  	/* Retrieve all of the head indexes first to avoid caching issues. */
> -	for (i = 0; i < free_entries; i++)
> -		head[i] = vq->avail->ring[(vq->last_used_idx + i) & (vq->size - 1)];
> +	for (i = 0; i < count; i++) {
> +		desc_indexes[i] = vq->avail->ring[(vq->last_used_idx + i) &
> +					(vq->size - 1)];
> +	}
>  
>  	/* Prefetch descriptor index. */
> -	rte_prefetch0(&vq->desc[head[entry_success]]);
> +	rte_prefetch0(&vq->desc[desc_indexes[0]]);
>  	rte_prefetch0(&vq->used->ring[vq->last_used_idx & (vq->size - 1)]);
>  
> -	while (entry_success < free_entries) {
> -		uint32_t vb_avail, vb_offset;
> -		uint32_t seg_avail, seg_offset;
> -		uint32_t cpy_len;
> -		uint32_t seg_num = 0;
> -		struct rte_mbuf *cur;
> -		uint8_t alloc_err = 0;
> -
> -		desc = &vq->desc[head[entry_success]];
> -
> -		vb_net_hdr_addr = gpa_to_vva(dev, desc->addr);
> -		hdr = (struct virtio_net_hdr *)((uintptr_t)vb_net_hdr_addr);
> -
> -		/* Discard first buffer as it is the virtio header */
> -		if (desc->flags & VRING_DESC_F_NEXT) {
> -			desc = &vq->desc[desc->next];
> -			vb_offset = 0;
> -			vb_avail = desc->len;
> -		} else {
> -			vb_offset = vq->vhost_hlen;
> -			vb_avail = desc->len - vb_offset;
> -		}
> -
> -		/* Buffer address translation. */
> -		vb_addr = gpa_to_vva(dev, desc->addr);
> -		/* Prefetch buffer address. */
> -		rte_prefetch0((void *)(uintptr_t)vb_addr);
> -
> -		used_idx = vq->last_used_idx & (vq->size - 1);
> -
> -		if (entry_success < (free_entries - 1)) {
> -			/* Prefetch descriptor index. */
> -			rte_prefetch0(&vq->desc[head[entry_success+1]]);
> -			rte_prefetch0(&vq->used->ring[(used_idx + 1) & (vq->size - 1)]);
> -		}
> -
> -		/* Update used index buffer information. */
> -		vq->used->ring[used_idx].id = head[entry_success];
> -		vq->used->ring[used_idx].len = 0;
> -
> -		/* Allocate an mbuf and populate the structure. */
> -		m = rte_pktmbuf_alloc(mbuf_pool);
> -		if (unlikely(m == NULL)) {
> -			RTE_LOG(ERR, VHOST_DATA,
> -				"Failed to allocate memory for mbuf.\n");
> -			break;
> -		}
> -		seg_offset = 0;
> -		seg_avail = m->buf_len - RTE_PKTMBUF_HEADROOM;
> -		cpy_len = RTE_MIN(vb_avail, seg_avail);
> -
> -		PRINT_PACKET(dev, (uintptr_t)vb_addr, desc->len, 0);
> -
> -		seg_num++;
> -		cur = m;
> -		prev = m;
> -		while (cpy_len != 0) {
> -			rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, seg_offset),
> -				(void *)((uintptr_t)(vb_addr + vb_offset)),
> -				cpy_len);
> -
> -			seg_offset += cpy_len;
> -			vb_offset += cpy_len;
> -			vb_avail -= cpy_len;
> -			seg_avail -= cpy_len;
> -
> -			if (vb_avail != 0) {
> -				/*
> -				 * The segment reachs to its end,
> -				 * while the virtio buffer in TX vring has
> -				 * more data to be copied.
> -				 */
> -				cur->data_len = seg_offset;
> -				m->pkt_len += seg_offset;
> -				/* Allocate mbuf and populate the structure. */
> -				cur = rte_pktmbuf_alloc(mbuf_pool);
> -				if (unlikely(cur == NULL)) {
> -					RTE_LOG(ERR, VHOST_DATA, "Failed to "
> -						"allocate memory for mbuf.\n");
> -					rte_pktmbuf_free(m);
> -					alloc_err = 1;
> -					break;
> -				}
> -
> -				seg_num++;
> -				prev->next = cur;
> -				prev = cur;
> -				seg_offset = 0;
> -				seg_avail = cur->buf_len - RTE_PKTMBUF_HEADROOM;
> -			} else {
> -				if (desc->flags & VRING_DESC_F_NEXT) {
> -					/*
> -					 * There are more virtio buffers in
> -					 * same vring entry need to be copied.
> -					 */
> -					if (seg_avail == 0) {
> -						/*
> -						 * The current segment hasn't
> -						 * room to accomodate more
> -						 * data.
> -						 */
> -						cur->data_len = seg_offset;
> -						m->pkt_len += seg_offset;
> -						/*
> -						 * Allocate an mbuf and
> -						 * populate the structure.
> -						 */
> -						cur = rte_pktmbuf_alloc(mbuf_pool);
> -						if (unlikely(cur == NULL)) {
> -							RTE_LOG(ERR,
> -								VHOST_DATA,
> -								"Failed to "
> -								"allocate memory "
> -								"for mbuf\n");
> -							rte_pktmbuf_free(m);
> -							alloc_err = 1;
> -							break;
> -						}
> -						seg_num++;
> -						prev->next = cur;
> -						prev = cur;
> -						seg_offset = 0;
> -						seg_avail = cur->buf_len - RTE_PKTMBUF_HEADROOM;
> -					}
> -
> -					desc = &vq->desc[desc->next];
> -
> -					/* Buffer address translation. */
> -					vb_addr = gpa_to_vva(dev, desc->addr);
> -					/* Prefetch buffer address. */
> -					rte_prefetch0((void *)(uintptr_t)vb_addr);
> -					vb_offset = 0;
> -					vb_avail = desc->len;
> -
> -					PRINT_PACKET(dev, (uintptr_t)vb_addr,
> -						desc->len, 0);
> -				} else {
> -					/* The whole packet completes. */
> -					cur->data_len = seg_offset;
> -					m->pkt_len += seg_offset;
> -					vb_avail = 0;
> -				}
> -			}
> -
> -			cpy_len = RTE_MIN(vb_avail, seg_avail);
> -		}
> -
> -		if (unlikely(alloc_err == 1))
> +	for (i = 0; i < count; i++) {
> +		m = copy_desc_to_mbuf(dev, vq, desc_indexes[i], mbuf_pool);
> +		if (m == NULL)
>  			break;
> +		pkts[i] = m;
>  
> -		m->nb_segs = seg_num;
> -		if ((hdr->flags != 0) || (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE))
> -			vhost_dequeue_offload(hdr, m);
> -
> -		pkts[entry_success] = m;
> -		vq->last_used_idx++;
> -		entry_success++;
> +		used_idx = vq->last_used_idx++ & (vq->size - 1);
> +		vq->used->ring[used_idx].id  = desc_indexes[i];
> +		vq->used->ring[used_idx].len = 0;
>  	}
>  
>  	rte_compiler_barrier();
> -	vq->used->idx += entry_success;
> +	vq->used->idx += i;
> +
>  	/* Kick guest if required. */
>  	if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT))
>  		eventfd_write(vq->callfd, (eventfd_t)1);
> -	return entry_success;
> +
> +	return i;
>  }