DPDK patches and discussions
* [dpdk-dev] [PATCH 0/3] [RFC] vhost: micro vhost optimization
@ 2016-05-03  0:46 Yuanhan Liu
  2016-05-03  0:46 ` [dpdk-dev] [PATCH 1/3] vhost: pre update used ring for Tx and Rx Yuanhan Liu
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Yuanhan Liu @ 2016-05-03  0:46 UTC (permalink / raw)
  To: dev; +Cc: huawei.xie, Yuanhan Liu

Here is a small patch set that does some micro optimizations, which bring
about a 10% performance boost in my 64B packet testing, with the following topo:

    pkt generator <----> NIC <-----> Virtio NIC

Patch 1 pre-updates the used ring, updating it in batch. It should be
feasible from my understanding: there will be no issue, as the guest driver
will not start processing those entries as long as we haven't updated
"used->idx" yet. I could have missed something, though.

Patch 2 saves one check for small packets (those that can be held in one
desc buf and one mbuf).

Patch 3 moves several frequently used fields into one cache line, for
better cache sharing. 

Note that this patch set is based on my latest vhost ABI refactoring patchset.


---
Yuanhan Liu (3):
  vhost: pre update used ring for Tx and Rx
  vhost: optimize dequeue for small packets
  vhost: arrange virtio_net fields for better cache sharing

 lib/librte_vhost/vhost-net.h  |   8 +--
 lib/librte_vhost/vhost_rxtx.c | 110 ++++++++++++++++++++++++------------------
 2 files changed, 68 insertions(+), 50 deletions(-)

-- 
1.9.0

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [dpdk-dev] [PATCH 1/3] vhost: pre update used ring for Tx and Rx
  2016-05-03  0:46 [dpdk-dev] [PATCH 0/3] [RFC] vhost: micro vhost optimization Yuanhan Liu
@ 2016-05-03  0:46 ` Yuanhan Liu
  2016-06-01  6:40   ` Xie, Huawei
  2016-05-03  0:46 ` [dpdk-dev] [PATCH 2/3] vhost: optimize dequeue for small packets Yuanhan Liu
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Yuanhan Liu @ 2016-05-03  0:46 UTC (permalink / raw)
  To: dev; +Cc: huawei.xie, Yuanhan Liu, Michael S. Tsirkin

Pre-update the used ring, updating it in batch, for Tx and Rx at the stage
where all avail desc indexes are fetched. This reduces some cache misses
and hence increases the performance a bit.

Pre-updating is feasible because the guest driver will not start processing
those entries as long as we don't update "used->idx". (I'm not 100%
certain I haven't missed anything, though).
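
For reference, the ordering the guest side is expected to follow when
consuming the used ring looks roughly like this (a schematic sketch, not
actual driver code; last_seen is the guest's private copy of the used
index, and process_used() is a hypothetical placeholder):

	while (last_seen != vq->used->idx) {
		/* The entry loads are ordered after the idx load, so
		 * pre-written entries stay invisible to the guest until
		 * our final "used->idx" store. */
		rte_smp_rmb();
		elem = &vq->used->ring[last_seen & (vq->size - 1)];
		process_used(elem->id, elem->len);
		last_seen++;
	}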

Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 lib/librte_vhost/vhost_rxtx.c | 58 +++++++++++++++++++++----------------------
 1 file changed, 28 insertions(+), 30 deletions(-)

diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index c9cd1c5..2c3b810 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -137,7 +137,7 @@ copy_virtio_net_hdr(struct virtio_net *dev, uint64_t desc_addr,
 
 static inline int __attribute__((always_inline))
 copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
-		  struct rte_mbuf *m, uint16_t desc_idx, uint32_t *copied)
+		  struct rte_mbuf *m, uint16_t desc_idx)
 {
 	uint32_t desc_avail, desc_offset;
 	uint32_t mbuf_avail, mbuf_offset;
@@ -161,7 +161,6 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	desc_offset = dev->vhost_hlen;
 	desc_avail  = desc->len - dev->vhost_hlen;
 
-	*copied = rte_pktmbuf_pkt_len(m);
 	mbuf_avail  = rte_pktmbuf_data_len(m);
 	mbuf_offset = 0;
 	while (mbuf_avail != 0 || m->next != NULL) {
@@ -262,6 +261,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 	struct vhost_virtqueue *vq;
 	uint16_t res_start_idx, res_end_idx;
 	uint16_t desc_indexes[MAX_PKT_BURST];
+	uint16_t used_idx;
 	uint32_t i;
 
 	LOG_DEBUG(VHOST_DATA, "(%d) %s\n", dev->vid, __func__);
@@ -285,27 +285,29 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 	/* Retrieve all of the desc indexes first to avoid caching issues. */
 	rte_prefetch0(&vq->avail->ring[res_start_idx & (vq->size - 1)]);
 	for (i = 0; i < count; i++) {
-		desc_indexes[i] = vq->avail->ring[(res_start_idx + i) &
-						  (vq->size - 1)];
+		used_idx = (res_start_idx + i) & (vq->size - 1);
+		desc_indexes[i] = vq->avail->ring[used_idx];
+		vq->used->ring[used_idx].id = desc_indexes[i];
+		vq->used->ring[used_idx].len = pkts[i]->pkt_len +
+					       dev->vhost_hlen;
+		vhost_log_used_vring(dev, vq,
+			offsetof(struct vring_used, ring[used_idx]),
+			sizeof(vq->used->ring[used_idx]));
 	}
 
 	rte_prefetch0(&vq->desc[desc_indexes[0]]);
 	for (i = 0; i < count; i++) {
 		uint16_t desc_idx = desc_indexes[i];
-		uint16_t used_idx = (res_start_idx + i) & (vq->size - 1);
-		uint32_t copied;
 		int err;
 
-		err = copy_mbuf_to_desc(dev, vq, pkts[i], desc_idx, &copied);
-
-		vq->used->ring[used_idx].id = desc_idx;
-		if (unlikely(err))
+		err = copy_mbuf_to_desc(dev, vq, pkts[i], desc_idx);
+		if (unlikely(err)) {
+			used_idx = (res_start_idx + i) & (vq->size - 1);
 			vq->used->ring[used_idx].len = dev->vhost_hlen;
-		else
-			vq->used->ring[used_idx].len = copied + dev->vhost_hlen;
-		vhost_log_used_vring(dev, vq,
-			offsetof(struct vring_used, ring[used_idx]),
-			sizeof(vq->used->ring[used_idx]));
+			vhost_log_used_vring(dev, vq,
+				offsetof(struct vring_used, ring[used_idx]),
+				sizeof(vq->used->ring[used_idx]));
+		}
 
 		if (i + 1 < count)
 			rte_prefetch0(&vq->desc[desc_indexes[i+1]]);
@@ -879,6 +881,7 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 	/* Prefetch available ring to retrieve head indexes. */
 	used_idx = vq->last_used_idx & (vq->size - 1);
 	rte_prefetch0(&vq->avail->ring[used_idx]);
+	rte_prefetch0(&vq->used->ring[used_idx]);
 
 	count = RTE_MIN(count, MAX_PKT_BURST);
 	count = RTE_MIN(count, free_entries);
@@ -887,22 +890,23 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 
 	/* Retrieve all of the head indexes first to avoid caching issues. */
 	for (i = 0; i < count; i++) {
-		desc_indexes[i] = vq->avail->ring[(vq->last_used_idx + i) &
-					(vq->size - 1)];
+		used_idx = (vq->last_used_idx + i) & (vq->size - 1);
+		desc_indexes[i] = vq->avail->ring[used_idx];
+
+		vq->used->ring[used_idx].id  = desc_indexes[i];
+		vq->used->ring[used_idx].len = 0;
+		vhost_log_used_vring(dev, vq,
+				offsetof(struct vring_used, ring[used_idx]),
+				sizeof(vq->used->ring[used_idx]));
 	}
 
 	/* Prefetch descriptor index. */
 	rte_prefetch0(&vq->desc[desc_indexes[0]]);
-	rte_prefetch0(&vq->used->ring[vq->last_used_idx & (vq->size - 1)]);
-
 	for (i = 0; i < count; i++) {
 		int err;
 
-		if (likely(i + 1 < count)) {
+		if (likely(i + 1 < count))
 			rte_prefetch0(&vq->desc[desc_indexes[i + 1]]);
-			rte_prefetch0(&vq->used->ring[(used_idx + 1) &
-						      (vq->size - 1)]);
-		}
 
 		pkts[i] = rte_pktmbuf_alloc(mbuf_pool);
 		if (unlikely(pkts[i] == NULL)) {
@@ -916,18 +920,12 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 			rte_pktmbuf_free(pkts[i]);
 			break;
 		}
-
-		used_idx = vq->last_used_idx++ & (vq->size - 1);
-		vq->used->ring[used_idx].id  = desc_indexes[i];
-		vq->used->ring[used_idx].len = 0;
-		vhost_log_used_vring(dev, vq,
-				offsetof(struct vring_used, ring[used_idx]),
-				sizeof(vq->used->ring[used_idx]));
 	}
 
 	rte_smp_wmb();
 	rte_smp_rmb();
 	vq->used->idx += i;
+	vq->last_used_idx += i;
 	vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
 			sizeof(vq->used->idx));
 
-- 
1.9.0

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [dpdk-dev] [PATCH 2/3] vhost: optimize dequeue for small packets
  2016-05-03  0:46 [dpdk-dev] [PATCH 0/3] [RFC] vhost: micro vhost optimization Yuanhan Liu
  2016-05-03  0:46 ` [dpdk-dev] [PATCH 1/3] vhost: pre update used ring for Tx and Rx Yuanhan Liu
@ 2016-05-03  0:46 ` Yuanhan Liu
  2016-06-01  6:24   ` Xie, Huawei
  2016-06-03  7:43   ` Xie, Huawei
  2016-05-03  0:46 ` [dpdk-dev] [PATCH 3/3] vhost: arrange virtio_net fields for better cache sharing Yuanhan Liu
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 16+ messages in thread
From: Yuanhan Liu @ 2016-05-03  0:46 UTC (permalink / raw)
  To: dev; +Cc: huawei.xie, Yuanhan Liu

Both the current kernel virtio driver and the DPDK virtio driver use at
least 2 desc buffers for Tx: the first for storing the header, and the
others for storing the data.

Therefore, we can fetch the first data desc buf before the main loop,
and do the copy first, before the "are we done yet?" check. This saves
one check for small packets, which have just one data desc buffer and
need one mbuf to store it.
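
A rough sketch of the resulting control flow (for illustration only, not
part of the diff below; desc_drained()/chain_ended()/mbuf_full() stand in
for the inline checks in the real code):

	if (likely(desc->len == dev->vhost_hlen)) {
		/* header takes a whole desc: jump to the first data desc */
		desc = &vq->desc[desc->next];
		desc_offset = 0;
		desc_avail  = desc->len;
	} else {
		/* header and data share the first desc */
		desc_offset = dev->vhost_hlen;
		desc_avail  = desc->len - dev->vhost_hlen;
	}

	while (1) {
		copy_chunk();			/* copy first ... */
		if (desc_drained()) {		/* ... check afterwards */
			if (chain_ended())
				break;		/* small packets exit here */
			desc = &vq->desc[desc->next];
		}
		if (mbuf_full())		/* bypassed by small packets */
			cur = link_new_mbuf();
	}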

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 lib/librte_vhost/vhost_rxtx.c | 52 ++++++++++++++++++++++++++++++-------------
 1 file changed, 36 insertions(+), 16 deletions(-)

diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index 2c3b810..34d6ed1 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -753,18 +753,48 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 		return -1;
 
 	desc_addr = gpa_to_vva(dev, desc->addr);
-	rte_prefetch0((void *)(uintptr_t)desc_addr);
-
-	/* Retrieve virtio net header */
 	hdr = (struct virtio_net_hdr *)((uintptr_t)desc_addr);
-	desc_avail  = desc->len - dev->vhost_hlen;
-	desc_offset = dev->vhost_hlen;
+	rte_prefetch0(hdr);
+
+	/*
+	 * Both the current kernel virtio driver and the DPDK virtio driver
+	 * use at least 2 desc buffers for Tx: the first for storing
+	 * the header, and the others for storing the data.
+	 */
+	if (likely(desc->len == dev->vhost_hlen)) {
+		desc = &vq->desc[desc->next];
+
+		desc_addr = gpa_to_vva(dev, desc->addr);
+		rte_prefetch0((void *)(uintptr_t)desc_addr);
+
+		desc_offset = 0;
+		desc_avail  = desc->len;
+		nr_desc    += 1;
+
+		PRINT_PACKET(dev, (uintptr_t)desc_addr, desc->len, 0);
+	} else {
+		desc_avail  = desc->len - dev->vhost_hlen;
+		desc_offset = dev->vhost_hlen;
+	}
 
 	mbuf_offset = 0;
 	mbuf_avail  = m->buf_len - RTE_PKTMBUF_HEADROOM;
-	while (desc_avail != 0 || (desc->flags & VRING_DESC_F_NEXT) != 0) {
+	while (1) {
+		cpy_len = RTE_MIN(desc_avail, mbuf_avail);
+		rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, mbuf_offset),
+			(void *)((uintptr_t)(desc_addr + desc_offset)),
+			cpy_len);
+
+		mbuf_avail  -= cpy_len;
+		mbuf_offset += cpy_len;
+		desc_avail  -= cpy_len;
+		desc_offset += cpy_len;
+
 		/* This desc reaches to its end, get the next one */
 		if (desc_avail == 0) {
+			if ((desc->flags & VRING_DESC_F_NEXT) == 0)
+				break;
+
 			if (unlikely(desc->next >= vq->size ||
 				     ++nr_desc >= vq->size))
 				return -1;
@@ -800,16 +830,6 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			mbuf_offset = 0;
 			mbuf_avail  = cur->buf_len - RTE_PKTMBUF_HEADROOM;
 		}
-
-		cpy_len = RTE_MIN(desc_avail, mbuf_avail);
-		rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, mbuf_offset),
-			(void *)((uintptr_t)(desc_addr + desc_offset)),
-			cpy_len);
-
-		mbuf_avail  -= cpy_len;
-		mbuf_offset += cpy_len;
-		desc_avail  -= cpy_len;
-		desc_offset += cpy_len;
 	}
 
 	prev->data_len = mbuf_offset;
-- 
1.9.0

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [dpdk-dev] [PATCH 3/3] vhost: arrange virtio_net fields for better cache sharing
  2016-05-03  0:46 [dpdk-dev] [PATCH 0/3] [RFC] vhost: micro vhost optimization Yuanhan Liu
  2016-05-03  0:46 ` [dpdk-dev] [PATCH 1/3] vhost: pre update used ring for Tx and Rx Yuanhan Liu
  2016-05-03  0:46 ` [dpdk-dev] [PATCH 2/3] vhost: optimize dequeue for small packets Yuanhan Liu
@ 2016-05-03  0:46 ` Yuanhan Liu
  2016-06-01  6:42   ` Xie, Huawei
  2016-05-10 21:49 ` [dpdk-dev] [PATCH 0/3] [RFC] vhost: micro vhost optimization Rich Lane
  2016-06-14 12:42 ` Yuanhan Liu
  4 siblings, 1 reply; 16+ messages in thread
From: Yuanhan Liu @ 2016-05-03  0:46 UTC (permalink / raw)
  To: dev; +Cc: huawei.xie, Yuanhan Liu

The ifname[] field takes so much space that it separates some frequently
used fields into different cache lines, say, features and broadcast_rarp.

This patch moves all those fields that are accessed frequently in Rx/Tx
together (before the ifname[] field) to let them share one cache line.
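
For what it's worth, a build-time check along these lines could guard the
layout (a sketch, assuming the field order from this patch; RTE_BUILD_BUG_ON
and RTE_CACHE_LINE_SIZE come from rte_common.h and rte_memory.h):

	/* The hot scalar fields (features, vid, flags, vhost_hlen,
	 * broadcast_rarp, virt_qp_nb) all sit before virtqueue[], so
	 * they must fit within the first cache line of the
	 * cache-aligned struct virtio_net. */
	RTE_BUILD_BUG_ON(offsetof(struct virtio_net, virtqueue) >
			 RTE_CACHE_LINE_SIZE);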

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 lib/librte_vhost/vhost-net.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/lib/librte_vhost/vhost-net.h b/lib/librte_vhost/vhost-net.h
index 9dec83c..3b0ffe7 100644
--- a/lib/librte_vhost/vhost-net.h
+++ b/lib/librte_vhost/vhost-net.h
@@ -123,16 +123,16 @@ struct virtio_net {
 	int			vid;
 	uint32_t		flags;
 	uint16_t		vhost_hlen;
+	/* to tell if we need broadcast rarp packet */
+	rte_atomic16_t		broadcast_rarp;
+	uint32_t		virt_qp_nb;
+	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
 #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
 	char			ifname[IF_NAME_SZ];
-	uint32_t		virt_qp_nb;
 	uint64_t		log_size;
 	uint64_t		log_base;
 	struct ether_addr	mac;
 
-	/* to tell if we need broadcast rarp packet */
-	rte_atomic16_t		broadcast_rarp;
-	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
 } __rte_cache_aligned;
 
 /**
-- 
1.9.0

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [dpdk-dev] [PATCH 0/3] [RFC] vhost: micro vhost optimization
  2016-05-03  0:46 [dpdk-dev] [PATCH 0/3] [RFC] vhost: micro vhost optimization Yuanhan Liu
                   ` (2 preceding siblings ...)
  2016-05-03  0:46 ` [dpdk-dev] [PATCH 3/3] vhost: arrange virtio_net fields for better cache sharing Yuanhan Liu
@ 2016-05-10 21:49 ` Rich Lane
  2016-05-10 22:08   ` Yuanhan Liu
  2016-06-14 12:42 ` Yuanhan Liu
  4 siblings, 1 reply; 16+ messages in thread
From: Rich Lane @ 2016-05-10 21:49 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, huawei.xie

I see a significant performance improvement with these patches, around 5%
at 64 bytes.

The one patch that didn't give any performance boost for me was "vhost:
arrange virtio_net fields for better cache sharing".

Tested-by: Rich Lane <rich.lane@bigswitch.com>

On Mon, May 2, 2016 at 5:46 PM, Yuanhan Liu <yuanhan.liu@linux.intel.com>
wrote:

> Here is a small patch set that does some micro optimizations, which bring
> about a 10% performance boost in my 64B packet testing, with the following topo:
>
>     pkt generator <----> NIC <-----> Virtio NIC
>
> Patch 1 pre-updates the used ring, updating it in batch. It should be
> feasible from my understanding: there will be no issue, as the guest driver
> will not start processing those entries as long as we haven't updated
> "used->idx" yet. I could have missed something, though.
>
> Patch 2 saves one check for small packets (those that can be held in one
> desc buf and one mbuf).
>
> Patch 3 moves several frequently used fields into one cache line, for
> better cache sharing.
>
> Note that this patch set is based on my latest vhost ABI refactoring
> patchset.
>
>
> ---
> Yuanhan Liu (3):
>   vhost: pre update used ring for Tx and Rx
>   vhost: optimize dequeue for small packets
>   vhost: arrange virtio_net fields for better cache sharing
>
>  lib/librte_vhost/vhost-net.h  |   8 +--
>  lib/librte_vhost/vhost_rxtx.c | 110
> ++++++++++++++++++++++++------------------
>  2 files changed, 68 insertions(+), 50 deletions(-)
>
> --
> 1.9.0
>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [dpdk-dev] [PATCH 0/3] [RFC] vhost: micro vhost optimization
  2016-05-10 21:49 ` [dpdk-dev] [PATCH 0/3] [RFC] vhost: micro vhost optimization Rich Lane
@ 2016-05-10 22:08   ` Yuanhan Liu
  0 siblings, 0 replies; 16+ messages in thread
From: Yuanhan Liu @ 2016-05-10 22:08 UTC (permalink / raw)
  To: Rich Lane; +Cc: dev, huawei.xie

On Tue, May 10, 2016 at 02:49:43PM -0700, Rich Lane wrote:
> I see a significant performance improvement with these patches, around 5% at 64
> bytes.

Thanks for testing.

> 
> The one patch that didn't give any performance boost for me was "vhost: arrange
> virtio_net fields for better cache sharing".

Yeah, same here from my test. I mean, in theory it should give us a
tiny boost; it doesn't in real life, though. And since it (should) do
no harm, I would still include this patch in this set. Maybe I should
have noted up front that there is no real perf gain from the 3rd patch.

	--yliu

> 
> Tested-by: Rich Lane <rich.lane@bigswitch.com>
> 
> On Mon, May 2, 2016 at 5:46 PM, Yuanhan Liu <yuanhan.liu@linux.intel.com>
> wrote:
> 
>     Here is a small patch set that does some micro optimizations, which bring
>     about a 10% performance boost in my 64B packet testing, with the following topo:
> 
>         pkt generator <----> NIC <-----> Virtio NIC
> 
>     Patch 1 pre-updates the used ring, updating it in batch. It should be
>     feasible from my understanding: there will be no issue, as the guest driver
>     will not start processing those entries as long as we haven't updated
>     "used->idx" yet. I could have missed something, though.
> 
>     Patch 2 saves one check for small packets (those that can be held in one
>     desc buf and one mbuf).
> 
>     Patch 3 moves several frequently used fields into one cache line, for
>     better cache sharing.
> 
>     Note that this patch set is based on my latest vhost ABI refactoring
>     patchset.
> 
> 
>     ---
>     Yuanhan Liu (3):
>       vhost: pre update used ring for Tx and Rx
>       vhost: optimize dequeue for small packets
>       vhost: arrange virtio_net fields for better cache sharing
> 
>      lib/librte_vhost/vhost-net.h  |   8 +--
>      lib/librte_vhost/vhost_rxtx.c | 110
>     ++++++++++++++++++++++++------------------
>      2 files changed, 68 insertions(+), 50 deletions(-)
>    
>     --
>     1.9.0
> 
> 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [dpdk-dev] [PATCH 2/3] vhost: optimize dequeue for small packets
  2016-05-03  0:46 ` [dpdk-dev] [PATCH 2/3] vhost: optimize dequeue for small packets Yuanhan Liu
@ 2016-06-01  6:24   ` Xie, Huawei
  2016-06-01  6:44     ` Yuanhan Liu
  2016-06-03  7:43   ` Xie, Huawei
  1 sibling, 1 reply; 16+ messages in thread
From: Xie, Huawei @ 2016-06-01  6:24 UTC (permalink / raw)
  To: Yuanhan Liu, dev

On 5/3/2016 8:42 AM, Yuanhan Liu wrote:
> Both the current kernel virtio driver and the DPDK virtio driver use at
> least 2 desc buffers for Tx: the first for storing the header, and the
> others for storing the data.

Tx could prepend some space for the virtio net header whenever possible, so
that it could use only one descriptor.
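
(For instance, on the guest Tx side, roughly the following; a hedged
sketch of the idea, not code from this patch:

	struct virtio_net_hdr *hdr = (struct virtio_net_hdr *)
		rte_pktmbuf_prepend(m, sizeof(*hdr));
	/* the header now lives in the mbuf headroom, so header and
	 * data can be posted as a single descriptor */

assuming the headroom is large enough and the device accepts such a
layout.)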

Another thing is that this doesn't reduce the checks, because you also add a check.


>
> Therefore, we can fetch the first data desc buf before the main loop,
> and do the copy first, before the "are we done yet?" check. This saves
> one check for small packets, which have just one data desc buffer and
> need one mbuf to store it.
>
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> ---
>  lib/librte_vhost/vhost_rxtx.c | 52 ++++++++++++++++++++++++++++++-------------
>  1 file changed, 36 insertions(+), 16 deletions(-)
>
> diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
> index 2c3b810..34d6ed1 100644
> --- a/lib/librte_vhost/vhost_rxtx.c
> +++ b/lib/librte_vhost/vhost_rxtx.c
> @@ -753,18 +753,48 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  		return -1;
>  
>  	desc_addr = gpa_to_vva(dev, desc->addr);
> -	rte_prefetch0((void *)(uintptr_t)desc_addr);
> -
> -	/* Retrieve virtio net header */
>  	hdr = (struct virtio_net_hdr *)((uintptr_t)desc_addr);
> -	desc_avail  = desc->len - dev->vhost_hlen;
> -	desc_offset = dev->vhost_hlen;
> +	rte_prefetch0(hdr);
> +
> +	/*
> +	 * Both the current kernel virtio driver and the DPDK virtio driver
> +	 * use at least 2 desc buffers for Tx: the first for storing
> +	 * the header, and the others for storing the data.
> +	 */
> +	if (likely(desc->len == dev->vhost_hlen)) {
> +		desc = &vq->desc[desc->next];
> +
> +		desc_addr = gpa_to_vva(dev, desc->addr);
> +		rte_prefetch0((void *)(uintptr_t)desc_addr);
> +
> +		desc_offset = 0;
> +		desc_avail  = desc->len;
> +		nr_desc    += 1;
> +
> +		PRINT_PACKET(dev, (uintptr_t)desc_addr, desc->len, 0);
> +	} else {
> +		desc_avail  = desc->len - dev->vhost_hlen;
> +		desc_offset = dev->vhost_hlen;
> +	}
>  
>  	mbuf_offset = 0;
>  	mbuf_avail  = m->buf_len - RTE_PKTMBUF_HEADROOM;
> -	while (desc_avail != 0 || (desc->flags & VRING_DESC_F_NEXT) != 0) {
> +	while (1) {
> +		cpy_len = RTE_MIN(desc_avail, mbuf_avail);
> +		rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, mbuf_offset),
> +			(void *)((uintptr_t)(desc_addr + desc_offset)),
> +			cpy_len);
> +
> +		mbuf_avail  -= cpy_len;
> +		mbuf_offset += cpy_len;
> +		desc_avail  -= cpy_len;
> +		desc_offset += cpy_len;
> +
>  		/* This desc reaches to its end, get the next one */
>  		if (desc_avail == 0) {
> +			if ((desc->flags & VRING_DESC_F_NEXT) == 0)
> +				break;
> +
>  			if (unlikely(desc->next >= vq->size ||
>  				     ++nr_desc >= vq->size))
>  				return -1;
> @@ -800,16 +830,6 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  			mbuf_offset = 0;
>  			mbuf_avail  = cur->buf_len - RTE_PKTMBUF_HEADROOM;
>  		}
> -
> -		cpy_len = RTE_MIN(desc_avail, mbuf_avail);
> -		rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, mbuf_offset),
> -			(void *)((uintptr_t)(desc_addr + desc_offset)),
> -			cpy_len);
> -
> -		mbuf_avail  -= cpy_len;
> -		mbuf_offset += cpy_len;
> -		desc_avail  -= cpy_len;
> -		desc_offset += cpy_len;
>  	}
>  
>  	prev->data_len = mbuf_offset;


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [dpdk-dev] [PATCH 1/3] vhost: pre update used ring for Tx and Rx
  2016-05-03  0:46 ` [dpdk-dev] [PATCH 1/3] vhost: pre update used ring for Tx and Rx Yuanhan Liu
@ 2016-06-01  6:40   ` Xie, Huawei
  2016-06-01  6:55     ` Yuanhan Liu
  2016-06-01 13:05     ` Michael S. Tsirkin
  0 siblings, 2 replies; 16+ messages in thread
From: Xie, Huawei @ 2016-06-01  6:40 UTC (permalink / raw)
  To: Yuanhan Liu, dev; +Cc: Michael S. Tsirkin

On 5/3/2016 8:42 AM, Yuanhan Liu wrote:
> Pre-update the used ring, updating it in batch, for Tx and Rx at the stage
> where all avail desc indexes are fetched. This reduces some cache misses
> and hence increases the performance a bit.
>
> Pre-updating is feasible because the guest driver will not start processing
> those entries as long as we don't update "used->idx". (I'm not 100%
> certain I haven't missed anything, though).
>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> ---
>  lib/librte_vhost/vhost_rxtx.c | 58 +++++++++++++++++++++----------------------
>  1 file changed, 28 insertions(+), 30 deletions(-)
>
> diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
> index c9cd1c5..2c3b810 100644
> --- a/lib/librte_vhost/vhost_rxtx.c
> +++ b/lib/librte_vhost/vhost_rxtx.c
> @@ -137,7 +137,7 @@ copy_virtio_net_hdr(struct virtio_net *dev, uint64_t desc_addr,
>  
>  static inline int __attribute__((always_inline))
>  copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
> -		  struct rte_mbuf *m, uint16_t desc_idx, uint32_t *copied)
> +		  struct rte_mbuf *m, uint16_t desc_idx)
>  {
>  	uint32_t desc_avail, desc_offset;
>  	uint32_t mbuf_avail, mbuf_offset;
> @@ -161,7 +161,6 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  	desc_offset = dev->vhost_hlen;
>  	desc_avail  = desc->len - dev->vhost_hlen;
>  
> -	*copied = rte_pktmbuf_pkt_len(m);
>  	mbuf_avail  = rte_pktmbuf_data_len(m);
>  	mbuf_offset = 0;
>  	while (mbuf_avail != 0 || m->next != NULL) {
> @@ -262,6 +261,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
>  	struct vhost_virtqueue *vq;
>  	uint16_t res_start_idx, res_end_idx;
>  	uint16_t desc_indexes[MAX_PKT_BURST];
> +	uint16_t used_idx;
>  	uint32_t i;
>  
>  	LOG_DEBUG(VHOST_DATA, "(%d) %s\n", dev->vid, __func__);
> @@ -285,27 +285,29 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
>  	/* Retrieve all of the desc indexes first to avoid caching issues. */
>  	rte_prefetch0(&vq->avail->ring[res_start_idx & (vq->size - 1)]);
>  	for (i = 0; i < count; i++) {
> -		desc_indexes[i] = vq->avail->ring[(res_start_idx + i) &
> -						  (vq->size - 1)];
> +		used_idx = (res_start_idx + i) & (vq->size - 1);
> +		desc_indexes[i] = vq->avail->ring[used_idx];
> +		vq->used->ring[used_idx].id = desc_indexes[i];
> +		vq->used->ring[used_idx].len = pkts[i]->pkt_len +
> +					       dev->vhost_hlen;
> +		vhost_log_used_vring(dev, vq,
> +			offsetof(struct vring_used, ring[used_idx]),
> +			sizeof(vq->used->ring[used_idx]));
>  	}
>  
>  	rte_prefetch0(&vq->desc[desc_indexes[0]]);
>  	for (i = 0; i < count; i++) {
>  		uint16_t desc_idx = desc_indexes[i];
> -		uint16_t used_idx = (res_start_idx + i) & (vq->size - 1);
> -		uint32_t copied;
>  		int err;
>  
> -		err = copy_mbuf_to_desc(dev, vq, pkts[i], desc_idx, &copied);
> -
> -		vq->used->ring[used_idx].id = desc_idx;
> -		if (unlikely(err))
> +		err = copy_mbuf_to_desc(dev, vq, pkts[i], desc_idx);
> +		if (unlikely(err)) {
> +			used_idx = (res_start_idx + i) & (vq->size - 1);
>  			vq->used->ring[used_idx].len = dev->vhost_hlen;
> -		else
> -			vq->used->ring[used_idx].len = copied + dev->vhost_hlen;
> -		vhost_log_used_vring(dev, vq,
> -			offsetof(struct vring_used, ring[used_idx]),
> -			sizeof(vq->used->ring[used_idx]));
> +			vhost_log_used_vring(dev, vq,
> +				offsetof(struct vring_used, ring[used_idx]),
> +				sizeof(vq->used->ring[used_idx]));
> +		}
>  
>  		if (i + 1 < count)
>  			rte_prefetch0(&vq->desc[desc_indexes[i+1]]);
> @@ -879,6 +881,7 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
>  	/* Prefetch available ring to retrieve head indexes. */
>  	used_idx = vq->last_used_idx & (vq->size - 1);
>  	rte_prefetch0(&vq->avail->ring[used_idx]);
> +	rte_prefetch0(&vq->used->ring[used_idx]);
>  
>  	count = RTE_MIN(count, MAX_PKT_BURST);
>  	count = RTE_MIN(count, free_entries);
> @@ -887,22 +890,23 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
>  
>  	/* Retrieve all of the head indexes first to avoid caching issues. */
>  	for (i = 0; i < count; i++) {
> -		desc_indexes[i] = vq->avail->ring[(vq->last_used_idx + i) &
> -					(vq->size - 1)];
> +		used_idx = (vq->last_used_idx + i) & (vq->size - 1);
> +		desc_indexes[i] = vq->avail->ring[used_idx];
> +
> +		vq->used->ring[used_idx].id  = desc_indexes[i];
> +		vq->used->ring[used_idx].len = 0;
> +		vhost_log_used_vring(dev, vq,
> +				offsetof(struct vring_used, ring[used_idx]),
> +				sizeof(vq->used->ring[used_idx]));
>  	}
>  
>  	/* Prefetch descriptor index. */
>  	rte_prefetch0(&vq->desc[desc_indexes[0]]);
> -	rte_prefetch0(&vq->used->ring[vq->last_used_idx & (vq->size - 1)]);
> -
>  	for (i = 0; i < count; i++) {
>  		int err;
>  
> -		if (likely(i + 1 < count)) {
> +		if (likely(i + 1 < count))
>  			rte_prefetch0(&vq->desc[desc_indexes[i + 1]]);
> -			rte_prefetch0(&vq->used->ring[(used_idx + 1) &
> -						      (vq->size - 1)]);
> -		}
>  
>  		pkts[i] = rte_pktmbuf_alloc(mbuf_pool);
>  		if (unlikely(pkts[i] == NULL)) {
> @@ -916,18 +920,12 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
>  			rte_pktmbuf_free(pkts[i]);
>  			break;
>  		}
> -
> -		used_idx = vq->last_used_idx++ & (vq->size - 1);
> -		vq->used->ring[used_idx].id  = desc_indexes[i];
> -		vq->used->ring[used_idx].len = 0;
> -		vhost_log_used_vring(dev, vq,
> -				offsetof(struct vring_used, ring[used_idx]),
> -				sizeof(vq->used->ring[used_idx]));
>  	}

Had tried post-updating the used ring in batch, but forgot the perf change.

One optimization would be on vhost_log_used_vring.
I have two ideas:
a) On the QEMU side, we always assume the used ring will be changed, so
that we don't need to log the used ring in vhost.

Michael: is this feasible in QEMU? Comments on this?

b) We could always mark the whole used ring as modified, rather than
entry by entry.
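
For (b), a rough sketch of what that could look like (untested; one log
call covering every entry per burst, instead of one call per entry):

	vhost_log_used_vring(dev, vq,
			offsetof(struct vring_used, ring[0]),
			vq->size * sizeof(vq->used->ring[0]));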


>  
>  	rte_smp_wmb();
>  	rte_smp_rmb();
>  	vq->used->idx += i;
> +	vq->last_used_idx += i;
>  	vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
>  			sizeof(vq->used->idx));
>  


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [dpdk-dev] [PATCH 3/3] vhost: arrange virtio_net fields for better cache sharing
  2016-05-03  0:46 ` [dpdk-dev] [PATCH 3/3] vhost: arrange virtio_net fields for better cache sharing Yuanhan Liu
@ 2016-06-01  6:42   ` Xie, Huawei
  0 siblings, 0 replies; 16+ messages in thread
From: Xie, Huawei @ 2016-06-01  6:42 UTC (permalink / raw)
  To: Yuanhan Liu, dev

On 5/3/2016 8:42 AM, Yuanhan Liu wrote:
> The ifname[] field takes so much space that it separates some frequently
> used fields into different cache lines, say, features and broadcast_rarp.
>
> This patch moves all those fields that are accessed frequently in Rx/Tx
> together (before the ifname[] field) to let them share one cache line.
>
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>

Acked-by: Huawei Xie <huawei.xie@intel.com>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [dpdk-dev] [PATCH 2/3] vhost: optimize dequeue for small packets
  2016-06-01  6:24   ` Xie, Huawei
@ 2016-06-01  6:44     ` Yuanhan Liu
  2016-06-03  7:42       ` Xie, Huawei
  0 siblings, 1 reply; 16+ messages in thread
From: Yuanhan Liu @ 2016-06-01  6:44 UTC (permalink / raw)
  To: Xie, Huawei; +Cc: dev

On Wed, Jun 01, 2016 at 06:24:18AM +0000, Xie, Huawei wrote:
> On 5/3/2016 8:42 AM, Yuanhan Liu wrote:
> > Both the current kernel virtio driver and the DPDK virtio driver use at
> > least 2 desc buffers for Tx: the first for storing the header, and the
> > others for storing the data.
> 
> Tx could prepend some space for the virtio net header whenever possible, so
> that it could use only one descriptor.

In such a case, it will work as well: it will go to the "else" code path
then.

> Another thing is that this doesn't reduce the checks, because you also add a check.

Actually, yes, it does save checks. Before this patch, we have:

	while (not done yet) {
 	       if (desc_is_drain)
 	               ...;

 	       if (mbuf_is_full)
 	               ...;

 	       COPY();
	}

Note that the "while" check will be done twice; therefore, it's 4
checks in total. After this patch, it would be:

	if (virtio_net_hdr_takes_one_desc)
		...

	while (1) {
		COPY();

		if (desc_is_drain) {
			break if done;
			...;
		}

		if (mbuf_is_full) {
			/* small packets will bypass this check */
			....;
		}
	}

So, for small packets, it takes 2 checks only, which actually saves
2 checks.

	--yliu

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [dpdk-dev] [PATCH 1/3] vhost: pre update used ring for Tx and Rx
  2016-06-01  6:40   ` Xie, Huawei
@ 2016-06-01  6:55     ` Yuanhan Liu
  2016-06-03  8:18       ` Xie, Huawei
  2016-06-01 13:05     ` Michael S. Tsirkin
  1 sibling, 1 reply; 16+ messages in thread
From: Yuanhan Liu @ 2016-06-01  6:55 UTC (permalink / raw)
  To: Xie, Huawei; +Cc: dev, Michael S. Tsirkin

On Wed, Jun 01, 2016 at 06:40:41AM +0000, Xie, Huawei wrote:
> >  	/* Retrieve all of the head indexes first to avoid caching issues. */
> >  	for (i = 0; i < count; i++) {
> > -		desc_indexes[i] = vq->avail->ring[(vq->last_used_idx + i) &
> > -					(vq->size - 1)];
> > +		used_idx = (vq->last_used_idx + i) & (vq->size - 1);
> > +		desc_indexes[i] = vq->avail->ring[used_idx];
> > +
> > +		vq->used->ring[used_idx].id  = desc_indexes[i];
> > +		vq->used->ring[used_idx].len = 0;
> > +		vhost_log_used_vring(dev, vq,
> > +				offsetof(struct vring_used, ring[used_idx]),
> > +				sizeof(vq->used->ring[used_idx]));
> >  	}
> >  
> >  	/* Prefetch descriptor index. */
> >  	rte_prefetch0(&vq->desc[desc_indexes[0]]);
> > -	rte_prefetch0(&vq->used->ring[vq->last_used_idx & (vq->size - 1)]);
> > -
> >  	for (i = 0; i < count; i++) {
> >  		int err;
> >  
> > -		if (likely(i + 1 < count)) {
> > +		if (likely(i + 1 < count))
> >  			rte_prefetch0(&vq->desc[desc_indexes[i + 1]]);
> > -			rte_prefetch0(&vq->used->ring[(used_idx + 1) &
> > -						      (vq->size - 1)]);
> > -		}
> >  
> >  		pkts[i] = rte_pktmbuf_alloc(mbuf_pool);
> >  		if (unlikely(pkts[i] == NULL)) {
> > @@ -916,18 +920,12 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
> >  			rte_pktmbuf_free(pkts[i]);
> >  			break;
> >  		}
> > -
> > -		used_idx = vq->last_used_idx++ & (vq->size - 1);
> > -		vq->used->ring[used_idx].id  = desc_indexes[i];
> > -		vq->used->ring[used_idx].len = 0;
> > -		vhost_log_used_vring(dev, vq,
> > -				offsetof(struct vring_used, ring[used_idx]),
> > -				sizeof(vq->used->ring[used_idx]));
> >  	}
> 
> Had tried post-updating the used ring in batch, but forgot the perf change.

I would assume pre-updating gives a better performance gain, as we are
fiddling with the avail and used rings together, which would be more
cache friendly.

> One optimization would be on vhost_log_used_vring.
> I have two ideas:
> a) On the QEMU side, we always assume the used ring will be changed, so
> that we don't need to log the used ring in vhost.
> 
> Michael: is this feasible in QEMU? Comments on this?
> 
> b) We could always mark the whole used ring as modified, rather than
> entry by entry.

I doubt it's worthwhile. One fact is that vhost_log_used_vring is a
no-op most of the time: it takes action only during the short window of
live migration.

And FYI, I even tried with all the vhost_log_xxx calls removed; it
showed no performance boost at all. Therefore, it's not a factor that
impacts performance.

	--yliu

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [dpdk-dev] [PATCH 1/3] vhost: pre update used ring for Tx and Rx
  2016-06-01  6:40   ` Xie, Huawei
  2016-06-01  6:55     ` Yuanhan Liu
@ 2016-06-01 13:05     ` Michael S. Tsirkin
  1 sibling, 0 replies; 16+ messages in thread
From: Michael S. Tsirkin @ 2016-06-01 13:05 UTC (permalink / raw)
  To: Xie, Huawei; +Cc: Yuanhan Liu, dev

On Wed, Jun 01, 2016 at 06:40:41AM +0000, Xie, Huawei wrote:
> On 5/3/2016 8:42 AM, Yuanhan Liu wrote:
> > Pre-update the used ring, updating it in batch, for Tx and Rx at the stage
> > where all avail desc indexes are fetched. This reduces some cache misses
> > and hence increases the performance a bit.
> >
> > Pre-updating is feasible because the guest driver will not start processing
> > those entries as long as we don't update "used->idx". (I'm not 100%
> > certain I haven't missed anything, though).
> >
> > Cc: Michael S. Tsirkin <mst@redhat.com>
> > Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> > ---
> >  lib/librte_vhost/vhost_rxtx.c | 58 +++++++++++++++++++++----------------------
> >  1 file changed, 28 insertions(+), 30 deletions(-)
> >
> > diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
> > index c9cd1c5..2c3b810 100644
> > --- a/lib/librte_vhost/vhost_rxtx.c
> > +++ b/lib/librte_vhost/vhost_rxtx.c
> > @@ -137,7 +137,7 @@ copy_virtio_net_hdr(struct virtio_net *dev, uint64_t desc_addr,
> >  
> >  static inline int __attribute__((always_inline))
> >  copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > -		  struct rte_mbuf *m, uint16_t desc_idx, uint32_t *copied)
> > +		  struct rte_mbuf *m, uint16_t desc_idx)
> >  {
> >  	uint32_t desc_avail, desc_offset;
> >  	uint32_t mbuf_avail, mbuf_offset;
> > @@ -161,7 +161,6 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
> >  	desc_offset = dev->vhost_hlen;
> >  	desc_avail  = desc->len - dev->vhost_hlen;
> >  
> > -	*copied = rte_pktmbuf_pkt_len(m);
> >  	mbuf_avail  = rte_pktmbuf_data_len(m);
> >  	mbuf_offset = 0;
> >  	while (mbuf_avail != 0 || m->next != NULL) {
> > @@ -262,6 +261,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
> >  	struct vhost_virtqueue *vq;
> >  	uint16_t res_start_idx, res_end_idx;
> >  	uint16_t desc_indexes[MAX_PKT_BURST];
> > +	uint16_t used_idx;
> >  	uint32_t i;
> >  
> >  	LOG_DEBUG(VHOST_DATA, "(%d) %s\n", dev->vid, __func__);
> > @@ -285,27 +285,29 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
> >  	/* Retrieve all of the desc indexes first to avoid caching issues. */
> >  	rte_prefetch0(&vq->avail->ring[res_start_idx & (vq->size - 1)]);
> >  	for (i = 0; i < count; i++) {
> > -		desc_indexes[i] = vq->avail->ring[(res_start_idx + i) &
> > -						  (vq->size - 1)];
> > +		used_idx = (res_start_idx + i) & (vq->size - 1);
> > +		desc_indexes[i] = vq->avail->ring[used_idx];
> > +		vq->used->ring[used_idx].id = desc_indexes[i];
> > +		vq->used->ring[used_idx].len = pkts[i]->pkt_len +
> > +					       dev->vhost_hlen;
> > +		vhost_log_used_vring(dev, vq,
> > +			offsetof(struct vring_used, ring[used_idx]),
> > +			sizeof(vq->used->ring[used_idx]));
> >  	}
> >  
> >  	rte_prefetch0(&vq->desc[desc_indexes[0]]);
> >  	for (i = 0; i < count; i++) {
> >  		uint16_t desc_idx = desc_indexes[i];
> > -		uint16_t used_idx = (res_start_idx + i) & (vq->size - 1);
> > -		uint32_t copied;
> >  		int err;
> >  
> > -		err = copy_mbuf_to_desc(dev, vq, pkts[i], desc_idx, &copied);
> > -
> > -		vq->used->ring[used_idx].id = desc_idx;
> > -		if (unlikely(err))
> > +		err = copy_mbuf_to_desc(dev, vq, pkts[i], desc_idx);
> > +		if (unlikely(err)) {
> > +			used_idx = (res_start_idx + i) & (vq->size - 1);
> >  			vq->used->ring[used_idx].len = dev->vhost_hlen;
> > -		else
> > -			vq->used->ring[used_idx].len = copied + dev->vhost_hlen;
> > -		vhost_log_used_vring(dev, vq,
> > -			offsetof(struct vring_used, ring[used_idx]),
> > -			sizeof(vq->used->ring[used_idx]));
> > +			vhost_log_used_vring(dev, vq,
> > +				offsetof(struct vring_used, ring[used_idx]),
> > +				sizeof(vq->used->ring[used_idx]));
> > +		}
> >  
> >  		if (i + 1 < count)
> >  			rte_prefetch0(&vq->desc[desc_indexes[i+1]]);
> > @@ -879,6 +881,7 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
> >  	/* Prefetch available ring to retrieve head indexes. */
> >  	used_idx = vq->last_used_idx & (vq->size - 1);
> >  	rte_prefetch0(&vq->avail->ring[used_idx]);
> > +	rte_prefetch0(&vq->used->ring[used_idx]);
> >  
> >  	count = RTE_MIN(count, MAX_PKT_BURST);
> >  	count = RTE_MIN(count, free_entries);
> > @@ -887,22 +890,23 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
> >  
> >  	/* Retrieve all of the head indexes first to avoid caching issues. */
> >  	for (i = 0; i < count; i++) {
> > -		desc_indexes[i] = vq->avail->ring[(vq->last_used_idx + i) &
> > -					(vq->size - 1)];
> > +		used_idx = (vq->last_used_idx + i) & (vq->size - 1);
> > +		desc_indexes[i] = vq->avail->ring[used_idx];
> > +
> > +		vq->used->ring[used_idx].id  = desc_indexes[i];
> > +		vq->used->ring[used_idx].len = 0;
> > +		vhost_log_used_vring(dev, vq,
> > +				offsetof(struct vring_used, ring[used_idx]),
> > +				sizeof(vq->used->ring[used_idx]));
> >  	}
> >  
> >  	/* Prefetch descriptor index. */
> >  	rte_prefetch0(&vq->desc[desc_indexes[0]]);
> > -	rte_prefetch0(&vq->used->ring[vq->last_used_idx & (vq->size - 1)]);
> > -
> >  	for (i = 0; i < count; i++) {
> >  		int err;
> >  
> > -		if (likely(i + 1 < count)) {
> > +		if (likely(i + 1 < count))
> >  			rte_prefetch0(&vq->desc[desc_indexes[i + 1]]);
> > -			rte_prefetch0(&vq->used->ring[(used_idx + 1) &
> > -						      (vq->size - 1)]);
> > -		}
> >  
> >  		pkts[i] = rte_pktmbuf_alloc(mbuf_pool);
> >  		if (unlikely(pkts[i] == NULL)) {
> > @@ -916,18 +920,12 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
> >  			rte_pktmbuf_free(pkts[i]);
> >  			break;
> >  		}
> > -
> > -		used_idx = vq->last_used_idx++ & (vq->size - 1);
> > -		vq->used->ring[used_idx].id  = desc_indexes[i];
> > -		vq->used->ring[used_idx].len = 0;
> > -		vhost_log_used_vring(dev, vq,
> > -				offsetof(struct vring_used, ring[used_idx]),
> > -				sizeof(vq->used->ring[used_idx]));
> >  	}
> 
> Had tried post-updating the used ring in batch, but forgot the perf change.
> 
> One optimization would be on vhost_log_used_vring.
> I have two ideas:
> a) On the QEMU side, we always assume the used ring will be changed, so
> that we don't need to log the used ring in vhost.
> 
> Michael: is this feasible in QEMU? Comments on this?

To avoid breaking old QEMU, we'll need a new protocol feature bit,
but generally this sounds reasonable.

> b) We could always mark the whole used ring as modified, rather than
> entry by entry.
> 
> >  
> >  	rte_smp_wmb();
> >  	rte_smp_rmb();
> >  	vq->used->idx += i;
> > +	vq->last_used_idx += i;
> >  	vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
> >  			sizeof(vq->used->idx));
> >  
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [dpdk-dev] [PATCH 2/3] vhost: optimize dequeue for small packets
  2016-06-01  6:44     ` Yuanhan Liu
@ 2016-06-03  7:42       ` Xie, Huawei
  0 siblings, 0 replies; 16+ messages in thread
From: Xie, Huawei @ 2016-06-03  7:42 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev

On 6/1/2016 2:41 PM, Yuanhan Liu wrote:
> On Wed, Jun 01, 2016 at 06:24:18AM +0000, Xie, Huawei wrote:
>> On 5/3/2016 8:42 AM, Yuanhan Liu wrote:
>>> Both the current kernel virtio driver and the DPDK virtio driver use at
>>> least 2 desc buffers for Tx: the first for storing the header, and the
>>> others for storing the data.
>> Tx could prepend some space for the virtio net header whenever possible, so
>> that it could use only one descriptor.
> In such a case, it will work as well: it will go to the "else" code path
> then.

It works, but then the problem is the statement in the commit log.

>> Another thing is that this doesn't reduce the checks, because you also add a check.
> Actually, yes, it does save checks. Before this patch, we have:
>
> 	while (not done yet) {
>  	       if (desc_is_drain)
>  	               ...;
>
>  	       if (mbuf_is_full)
>  	               ...;
>
>  	       COPY();
> 	}
>
> Note that the "while" check will be done twice; therefore, it's 4
> checks in total. After this patch, it would be:
>
> 	if (virtio_net_hdr_takes_one_desc)
> 		...
>
> 	while (1) {
> 		COPY();
>
> 		if (desc_is_drain) {
> 			break if done;
> 			...;
> 		}
>
> 		if (mbuf_is_full) {
> 			/* small packets will bypass this check */
> 			....;
> 		}
> 	}
>
> So, for small packets, it takes 2 checks only, which actually saves
> 2 checks.
>
> 	--yliu
>

For small packets, we have three checks in total. It is worth using
space to save checks, so I would ack this patch, but note that there is
one assumption: that the rte_memcpy API can handle a zero-length copy.
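
(The zero-length case arises because the copy is now hoisted above the
"desc drained?" check; for instance, with a first data desc of len 0:

	cpy_len = RTE_MIN(desc_avail /* == 0 */, mbuf_avail);
	rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, mbuf_offset),
		(void *)(uintptr_t)(desc_addr + desc_offset), cpy_len);

so rte_memcpy() is called with a length of 0 and must tolerate it.)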




^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [dpdk-dev] [PATCH 2/3] vhost: optimize dequeue for small packets
  2016-05-03  0:46 ` [dpdk-dev] [PATCH 2/3] vhost: optimize dequeue for small packets Yuanhan Liu
  2016-06-01  6:24   ` Xie, Huawei
@ 2016-06-03  7:43   ` Xie, Huawei
  1 sibling, 0 replies; 16+ messages in thread
From: Xie, Huawei @ 2016-06-03  7:43 UTC (permalink / raw)
  To: Yuanhan Liu, dev

On 5/3/2016 8:42 AM, Yuanhan Liu wrote:
> Both the current kernel virtio driver and the DPDK virtio driver use at
> least 2 desc buffers for Tx: the first for storing the header, and the
> others for storing the data.
>
> Therefore, we can fetch the first data desc buf before the main loop,
> and do the copy first, before the "are we done yet?" check. This saves
> one check for small packets, which have just one data desc buffer and
> need one mbuf to store it.
>
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>

Acked-by: Huawei Xie <huawei.xie@intel.com>





^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [dpdk-dev] [PATCH 1/3] vhost: pre update used ring for Tx and Rx
  2016-06-01  6:55     ` Yuanhan Liu
@ 2016-06-03  8:18       ` Xie, Huawei
  0 siblings, 0 replies; 16+ messages in thread
From: Xie, Huawei @ 2016-06-03  8:18 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Michael S. Tsirkin

On 6/1/2016 2:53 PM, Yuanhan Liu wrote:
> On Wed, Jun 01, 2016 at 06:40:41AM +0000, Xie, Huawei wrote:
>>>  	/* Retrieve all of the head indexes first to avoid caching issues. */
>>>  	for (i = 0; i < count; i++) {
>>> -		desc_indexes[i] = vq->avail->ring[(vq->last_used_idx + i) &
>>> -					(vq->size - 1)];
>>> +		used_idx = (vq->last_used_idx + i) & (vq->size - 1);
>>> +		desc_indexes[i] = vq->avail->ring[used_idx];
>>> +
>>> +		vq->used->ring[used_idx].id  = desc_indexes[i];
>>> +		vq->used->ring[used_idx].len = 0;
>>> +		vhost_log_used_vring(dev, vq,
>>> +				offsetof(struct vring_used, ring[used_idx]),
>>> +				sizeof(vq->used->ring[used_idx]));
>>>  	}
>>>  
>>>  	/* Prefetch descriptor index. */
>>>  	rte_prefetch0(&vq->desc[desc_indexes[0]]);
>>> -	rte_prefetch0(&vq->used->ring[vq->last_used_idx & (vq->size - 1)]);
>>> -
>>>  	for (i = 0; i < count; i++) {
>>>  		int err;
>>>  
>>> -		if (likely(i + 1 < count)) {
>>> +		if (likely(i + 1 < count))
>>>  			rte_prefetch0(&vq->desc[desc_indexes[i + 1]]);
>>> -			rte_prefetch0(&vq->used->ring[(used_idx + 1) &
>>> -						      (vq->size - 1)]);
>>> -		}
>>>  
>>>  		pkts[i] = rte_pktmbuf_alloc(mbuf_pool);
>>>  		if (unlikely(pkts[i] == NULL)) {
>>> @@ -916,18 +920,12 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
>>>  			rte_pktmbuf_free(pkts[i]);
>>>  			break;
>>>  		}
>>> -
>>> -		used_idx = vq->last_used_idx++ & (vq->size - 1);
>>> -		vq->used->ring[used_idx].id  = desc_indexes[i];
>>> -		vq->used->ring[used_idx].len = 0;
>>> -		vhost_log_used_vring(dev, vq,
>>> -				offsetof(struct vring_used, ring[used_idx]),
>>> -				sizeof(vq->used->ring[used_idx]));
>>>  	}
>> Had tried post-updating the used ring in batch, but forgot the perf change.
> I would assume pre-updating gives a better performance gain, as we are
> fiddling with the avail and used rings together, which would be more
> cache friendly.
 
The distance between the entry for the avail ring and the entry for the
used ring is at least 8 cache lines.
The benefit comes from batch updates, if applicable.
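
(Rough arithmetic, assuming a 256-entry ring and 64-byte cache lines: the
avail ring holds 2-byte entries, so it alone spans 8 cache lines, and the
used ring with its 8-byte entries starts only after it; hence
avail->ring[i] and used->ring[i] can never land on the same line.)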

>
>> One optimization would be on vhost_log_used_vring.
>> I have two ideas:
>> a) On the QEMU side, we always assume the used ring will be changed, so
>> that we don't need to log the used ring in vhost.
>>
>> Michael: is this feasible in QEMU? Comments on this?
>>
>> b) We could always mark the whole used ring as modified, rather than
>> entry by entry.
> I doubt it's worthwhile. One fact is that vhost_log_used_vring is a
> no-op most of the time: it takes action only during the short window of
> live migration.
>
> And FYI, I even tried with all the vhost_log_xxx calls removed; it
> showed no performance boost at all. Therefore, it's not a factor that
> impacts performance.

I knew this.

> 	--yliu
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [dpdk-dev] [PATCH 0/3] [RFC] vhost: micro vhost optimization
  2016-05-03  0:46 [dpdk-dev] [PATCH 0/3] [RFC] vhost: micro vhost optimization Yuanhan Liu
                   ` (3 preceding siblings ...)
  2016-05-10 21:49 ` [dpdk-dev] [PATCH 0/3] [RFC] vhost: micro vhost optimization Rich Lane
@ 2016-06-14 12:42 ` Yuanhan Liu
  4 siblings, 0 replies; 16+ messages in thread
From: Yuanhan Liu @ 2016-06-14 12:42 UTC (permalink / raw)
  To: dev; +Cc: huawei.xie

Applied to dpdk-next-virtio.

	--yliu

On Mon, May 02, 2016 at 05:46:15PM -0700, Yuanhan Liu wrote:
> Here is a small patch set that does some micro optimizations, which bring
> about a 10% performance boost in my 64B packet testing, with the following topo:
> 
>     pkt generator <----> NIC <-----> Virtio NIC
> 
> Patch 1 pre-updates the used ring, updating it in batch. It should be
> feasible from my understanding: there will be no issue, as the guest driver
> will not start processing those entries as long as we haven't updated
> "used->idx" yet. I could have missed something, though.
> 
> Patch 2 saves one check for small packets (those that can be held in one
> desc buf and one mbuf).
> 
> Patch 3 moves several frequently used fields into one cache line, for
> better cache sharing. 
> 
> Note that this patch set is based on my latest vhost ABI refactoring patchset.
> 
> 
> ---
> Yuanhan Liu (3):
>   vhost: pre update used ring for Tx and Rx
>   vhost: optimize dequeue for small packets
>   vhost: arrange virtio_net fields for better cache sharing
> 
>  lib/librte_vhost/vhost-net.h  |   8 +--
>  lib/librte_vhost/vhost_rxtx.c | 110 ++++++++++++++++++++++++------------------
>  2 files changed, 68 insertions(+), 50 deletions(-)
> 
> -- 
> 1.9.0

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread

Thread overview: 16+ messages
2016-05-03  0:46 [dpdk-dev] [PATCH 0/3] [RFC] vhost: micro vhost optimization Yuanhan Liu
2016-05-03  0:46 ` [dpdk-dev] [PATCH 1/3] vhost: pre update used ring for Tx and Rx Yuanhan Liu
2016-06-01  6:40   ` Xie, Huawei
2016-06-01  6:55     ` Yuanhan Liu
2016-06-03  8:18       ` Xie, Huawei
2016-06-01 13:05     ` Michael S. Tsirkin
2016-05-03  0:46 ` [dpdk-dev] [PATCH 2/3] vhost: optimize dequeue for small packets Yuanhan Liu
2016-06-01  6:24   ` Xie, Huawei
2016-06-01  6:44     ` Yuanhan Liu
2016-06-03  7:42       ` Xie, Huawei
2016-06-03  7:43   ` Xie, Huawei
2016-05-03  0:46 ` [dpdk-dev] [PATCH 3/3] vhost: arrange virtio_net fields for better cache sharing Yuanhan Liu
2016-06-01  6:42   ` Xie, Huawei
2016-05-10 21:49 ` [dpdk-dev] [PATCH 0/3] [RFC] vhost: micro vhost optimization Rich Lane
2016-05-10 22:08   ` Yuanhan Liu
2016-06-14 12:42 ` Yuanhan Liu
