From: "Xie, Huawei"
To: Yuanhan Liu, "dev@dpdk.org"
Cc: "Michael S. Tsirkin"
Subject: Re: [dpdk-dev] [PATCH 1/3] vhost: pre update used ring for Tx and Rx
Date: Wed, 1 Jun 2016 06:40:41 +0000

On 5/3/2016 8:42 AM, Yuanhan Liu wrote:
> Pre update and update used ring in batch for Tx and Rx at the stage
> while fetching all avail desc idx. This would reduce some cache misses
> and hence, increase the performance a bit.
>
> Pre update would be feasible as guest driver will not start processing
> those entries as far as we don't update "used->idx". (I'm not 100%
> certain I don't miss anything, though).
>
> Cc: Michael S. Tsirkin
> Signed-off-by: Yuanhan Liu
> ---
>  lib/librte_vhost/vhost_rxtx.c | 58 +++++++++++++++++++++----------------------
>  1 file changed, 28 insertions(+), 30 deletions(-)
>
> diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
> index c9cd1c5..2c3b810 100644
> --- a/lib/librte_vhost/vhost_rxtx.c
> +++ b/lib/librte_vhost/vhost_rxtx.c
> @@ -137,7 +137,7 @@ copy_virtio_net_hdr(struct virtio_net *dev, uint64_t desc_addr,
>
>  static inline int __attribute__((always_inline))
>  copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
> -		  struct rte_mbuf *m, uint16_t desc_idx, uint32_t *copied)
> +		  struct rte_mbuf *m, uint16_t desc_idx)
>  {
>  	uint32_t desc_avail, desc_offset;
>  	uint32_t mbuf_avail, mbuf_offset;
> @@ -161,7 +161,6 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  	desc_offset = dev->vhost_hlen;
>  	desc_avail  = desc->len - dev->vhost_hlen;
>
> -	*copied = rte_pktmbuf_pkt_len(m);
>  	mbuf_avail  = rte_pktmbuf_data_len(m);
>  	mbuf_offset = 0;
>  	while (mbuf_avail != 0 || m->next != NULL) {
> @@ -262,6 +261,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
>  	struct vhost_virtqueue *vq;
>  	uint16_t res_start_idx, res_end_idx;
>  	uint16_t desc_indexes[MAX_PKT_BURST];
> +	uint16_t used_idx;
>  	uint32_t i;
>
>  	LOG_DEBUG(VHOST_DATA, "(%d) %s\n", dev->vid, __func__);
> @@ -285,27 +285,29 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
>  	/* Retrieve all of the desc indexes first to avoid caching issues. */
>  	rte_prefetch0(&vq->avail->ring[res_start_idx & (vq->size - 1)]);
>  	for (i = 0; i < count; i++) {
> -		desc_indexes[i] = vq->avail->ring[(res_start_idx + i) &
> -						  (vq->size - 1)];
> +		used_idx = (res_start_idx + i) & (vq->size - 1);
> +		desc_indexes[i] = vq->avail->ring[used_idx];
> +		vq->used->ring[used_idx].id = desc_indexes[i];
> +		vq->used->ring[used_idx].len = pkts[i]->pkt_len +
> +					       dev->vhost_hlen;
> +		vhost_log_used_vring(dev, vq,
> +			offsetof(struct vring_used, ring[used_idx]),
> +			sizeof(vq->used->ring[used_idx]));
>  	}
>
>  	rte_prefetch0(&vq->desc[desc_indexes[0]]);
>  	for (i = 0; i < count; i++) {
>  		uint16_t desc_idx = desc_indexes[i];
> -		uint16_t used_idx = (res_start_idx + i) & (vq->size - 1);
> -		uint32_t copied;
>  		int err;
>
> -		err = copy_mbuf_to_desc(dev, vq, pkts[i], desc_idx, &copied);
> -
> -		vq->used->ring[used_idx].id = desc_idx;
> -		if (unlikely(err))
> +		err = copy_mbuf_to_desc(dev, vq, pkts[i], desc_idx);
> +		if (unlikely(err)) {
> +			used_idx = (res_start_idx + i) & (vq->size - 1);
>  			vq->used->ring[used_idx].len = dev->vhost_hlen;
> -		else
> -			vq->used->ring[used_idx].len = copied + dev->vhost_hlen;
> -		vhost_log_used_vring(dev, vq,
> -			offsetof(struct vring_used, ring[used_idx]),
> -			sizeof(vq->used->ring[used_idx]));
> +			vhost_log_used_vring(dev, vq,
> +				offsetof(struct vring_used, ring[used_idx]),
> +				sizeof(vq->used->ring[used_idx]));
> +		}
>
>  		if (i + 1 < count)
>  			rte_prefetch0(&vq->desc[desc_indexes[i+1]]);
> @@ -879,6 +881,7 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
>  	/* Prefetch available ring to retrieve head indexes. */
>  	used_idx = vq->last_used_idx & (vq->size - 1);
>  	rte_prefetch0(&vq->avail->ring[used_idx]);
> +	rte_prefetch0(&vq->used->ring[used_idx]);
>
>  	count = RTE_MIN(count, MAX_PKT_BURST);
>  	count = RTE_MIN(count, free_entries);
> @@ -887,22 +890,23 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
>
>  	/* Retrieve all of the head indexes first to avoid caching issues. */
>  	for (i = 0; i < count; i++) {
> -		desc_indexes[i] = vq->avail->ring[(vq->last_used_idx + i) &
> -					(vq->size - 1)];
> +		used_idx = (vq->last_used_idx + i) & (vq->size - 1);
> +		desc_indexes[i] = vq->avail->ring[used_idx];
> +
> +		vq->used->ring[used_idx].id = desc_indexes[i];
> +		vq->used->ring[used_idx].len = 0;
> +		vhost_log_used_vring(dev, vq,
> +				offsetof(struct vring_used, ring[used_idx]),
> +				sizeof(vq->used->ring[used_idx]));
>  	}
>
>  	/* Prefetch descriptor index. */
>  	rte_prefetch0(&vq->desc[desc_indexes[0]]);
> -	rte_prefetch0(&vq->used->ring[vq->last_used_idx & (vq->size - 1)]);
> -
>  	for (i = 0; i < count; i++) {
>  		int err;
>
> -		if (likely(i + 1 < count)) {
> +		if (likely(i + 1 < count))
>  			rte_prefetch0(&vq->desc[desc_indexes[i + 1]]);
> -			rte_prefetch0(&vq->used->ring[(used_idx + 1) &
> -					(vq->size - 1)]);
> -		}
>
>  		pkts[i] = rte_pktmbuf_alloc(mbuf_pool);
>  		if (unlikely(pkts[i] == NULL)) {
> @@ -916,18 +920,12 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
>  			rte_pktmbuf_free(pkts[i]);
>  			break;
>  		}
> -
> -		used_idx = vq->last_used_idx++ & (vq->size - 1);
> -		vq->used->ring[used_idx].id = desc_indexes[i];
> -		vq->used->ring[used_idx].len = 0;
> -		vhost_log_used_vring(dev, vq,
> -				offsetof(struct vring_used, ring[used_idx]),
> -				sizeof(vq->used->ring[used_idx]));
>  	}

I had also tried post-updating the used ring in batch, but I forget the
perf change now.

One further optimization would be on vhost_log_used_vring.
I have two ideas:
a) On the QEMU side, we always assume the used ring could be changed, so
that we don't need to log the used ring in vhost at all.

Michael: is that feasible in QEMU? Any comments on this?

b) We could always mark the whole used ring as modified, rather than
logging it entry by entry.

>
>  	rte_smp_wmb();
>  	rte_smp_rmb();
>  	vq->used->idx += i;
> +	vq->last_used_idx += i;
>  	vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
>  			sizeof(vq->used->idx));
>
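For readers of the archive: the reason the pre-update is safe is in the
hunk just above. Here is a condensed paraphrase of the ordering the patch
relies on (my annotation, not new code; it mirrors the dequeue path, with
count being the number of descriptors consumed in the burst):

    /*
     * Step 1: fill in the used ring entries. The guest does not look
     * at them yet, because used->idx has not moved past them.
     */
    for (i = 0; i < count; i++) {
            used_idx = (vq->last_used_idx + i) & (vq->size - 1);
            vq->used->ring[used_idx].id  = desc_indexes[i];
            vq->used->ring[used_idx].len = 0;
    }

    /* Step 2: order the entry stores before the index store. */
    rte_smp_wmb();

    /* Step 3: publish; only now may the guest consume the entries. */
    vq->used->idx += count;
    vq->last_used_idx += count;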
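To make idea b) concrete, a rough sketch of what I have in mind is below.
It is meant to drop into lib/librte_vhost/vhost_rxtx.c, where struct
virtio_net, struct vhost_virtqueue and vhost_log_used_vring() already
exist; the helper name log_used_ring_batch() is made up for illustration
and the wrap-around split is untested:

    /*
     * Mark the used ring slots touched by one burst as dirty with a
     * single vhost_log_used_vring() call (two if the burst wraps),
     * instead of one call per entry inside the copy loop.
     */
    static inline void
    log_used_ring_batch(struct virtio_net *dev, struct vhost_virtqueue *vq,
                        uint16_t start, uint16_t count)
    {
            uint16_t idx = start & (vq->size - 1);

            if (idx + count <= vq->size) {
                    /* Contiguous range: one log call covers everything. */
                    vhost_log_used_vring(dev, vq,
                            offsetof(struct vring_used, ring[idx]),
                            count * sizeof(vq->used->ring[0]));
            } else {
                    /* Burst wraps past the ring end: log both pieces. */
                    uint16_t n = vq->size - idx;

                    vhost_log_used_vring(dev, vq,
                            offsetof(struct vring_used, ring[idx]),
                            n * sizeof(vq->used->ring[0]));
                    vhost_log_used_vring(dev, vq,
                            offsetof(struct vring_used, ring[0]),
                            (count - n) * sizeof(vq->used->ring[0]));
            }
    }

Callers would then invoke it once per burst after the first loop, e.g.
log_used_ring_batch(dev, vq, res_start_idx, count) in virtio_dev_rx(),
and drop the per-entry vhost_log_used_vring() calls. Since QEMU tracks
dirtying at page granularity anyway, logging the range in one go should
be no worse than per-entry logging.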