* [dpdk-dev] [PATCH 0/3] [RFC] vhost: micro vhost optimization @ 2016-05-03 0:46 Yuanhan Liu 2016-05-03 0:46 ` [dpdk-dev] [PATCH 1/3] vhost: pre update used ring for Tx and Rx Yuanhan Liu ` (4 more replies) 0 siblings, 5 replies; 16+ messages in thread From: Yuanhan Liu @ 2016-05-03 0:46 UTC (permalink / raw) To: dev; +Cc: huawei.xie, Yuanhan Liu Here is a small patch set that does some micro optimizations, which bring roughly a 10% performance boost in my 64B packet testing, with the following topo: pkt generator <----> NIC <-----> Virtio NIC Patch 1 pre-updates the used ring entries, writing them in batch. It should be feasible from my understanding: there will be no issue, since the guest driver will not start processing those entries as long as we haven't updated the "used->idx" yet. I could be missing something, though. Patch 2 saves one check for small packets (that can be held in one desc buf and one mbuf). Patch 3 moves several frequently used fields into one cache line, for better cache sharing. Note that this patch set is based on my latest vhost ABI refactoring patchset. --- Yuanhan Liu (3): vhost: pre update used ring for Tx and Rx vhost: optimize dequeue for small packets vhost: arrange virtio_net fields for better cache sharing lib/librte_vhost/vhost-net.h | 8 +-- lib/librte_vhost/vhost_rxtx.c | 110 ++++++++++++++++++++++++------------------ 2 files changed, 68 insertions(+), 50 deletions(-) -- 1.9.0 ^ permalink raw reply [flat|nested] 16+ messages in thread
* [dpdk-dev] [PATCH 1/3] vhost: pre update used ring for Tx and Rx 2016-05-03 0:46 [dpdk-dev] [PATCH 0/3] [RFC] vhost: micro vhost optimization Yuanhan Liu @ 2016-05-03 0:46 ` Yuanhan Liu 2016-06-01 6:40 ` Xie, Huawei 2016-05-03 0:46 ` [dpdk-dev] [PATCH 2/3] vhost: optimize dequeue for small packets Yuanhan Liu ` (3 subsequent siblings) 4 siblings, 1 reply; 16+ messages in thread From: Yuanhan Liu @ 2016-05-03 0:46 UTC (permalink / raw) To: dev; +Cc: huawei.xie, Yuanhan Liu, Michael S. Tsirkin Pre update and update used ring in batch for Tx and Rx at the stage while fetching all avail desc idx. This would reduce some cache misses and hence, increase the performance a bit. Pre update would be feasible as guest driver will not start processing those entries as far as we don't update "used->idx". (I'm not 100% certain I don't miss anything, though). Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> --- lib/librte_vhost/vhost_rxtx.c | 58 +++++++++++++++++++++---------------------- 1 file changed, 28 insertions(+), 30 deletions(-) diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c index c9cd1c5..2c3b810 100644 --- a/lib/librte_vhost/vhost_rxtx.c +++ b/lib/librte_vhost/vhost_rxtx.c @@ -137,7 +137,7 @@ copy_virtio_net_hdr(struct virtio_net *dev, uint64_t desc_addr, static inline int __attribute__((always_inline)) copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq, - struct rte_mbuf *m, uint16_t desc_idx, uint32_t *copied) + struct rte_mbuf *m, uint16_t desc_idx) { uint32_t desc_avail, desc_offset; uint32_t mbuf_avail, mbuf_offset; @@ -161,7 +161,6 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq, desc_offset = dev->vhost_hlen; desc_avail = desc->len - dev->vhost_hlen; - *copied = rte_pktmbuf_pkt_len(m); mbuf_avail = rte_pktmbuf_data_len(m); mbuf_offset = 0; while (mbuf_avail != 0 || m->next != NULL) { @@ -262,6 +261,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id, struct vhost_virtqueue *vq; uint16_t res_start_idx, res_end_idx; uint16_t desc_indexes[MAX_PKT_BURST]; + uint16_t used_idx; uint32_t i; LOG_DEBUG(VHOST_DATA, "(%d) %s\n", dev->vid, __func__); @@ -285,27 +285,29 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id, /* Retrieve all of the desc indexes first to avoid caching issues. 
*/ rte_prefetch0(&vq->avail->ring[res_start_idx & (vq->size - 1)]); for (i = 0; i < count; i++) { - desc_indexes[i] = vq->avail->ring[(res_start_idx + i) & - (vq->size - 1)]; + used_idx = (res_start_idx + i) & (vq->size - 1); + desc_indexes[i] = vq->avail->ring[used_idx]; + vq->used->ring[used_idx].id = desc_indexes[i]; + vq->used->ring[used_idx].len = pkts[i]->pkt_len + + dev->vhost_hlen; + vhost_log_used_vring(dev, vq, + offsetof(struct vring_used, ring[used_idx]), + sizeof(vq->used->ring[used_idx])); } rte_prefetch0(&vq->desc[desc_indexes[0]]); for (i = 0; i < count; i++) { uint16_t desc_idx = desc_indexes[i]; - uint16_t used_idx = (res_start_idx + i) & (vq->size - 1); - uint32_t copied; int err; - err = copy_mbuf_to_desc(dev, vq, pkts[i], desc_idx, &copied); - - vq->used->ring[used_idx].id = desc_idx; - if (unlikely(err)) + err = copy_mbuf_to_desc(dev, vq, pkts[i], desc_idx); + if (unlikely(err)) { + used_idx = (res_start_idx + i) & (vq->size - 1); vq->used->ring[used_idx].len = dev->vhost_hlen; - else - vq->used->ring[used_idx].len = copied + dev->vhost_hlen; - vhost_log_used_vring(dev, vq, - offsetof(struct vring_used, ring[used_idx]), - sizeof(vq->used->ring[used_idx])); + vhost_log_used_vring(dev, vq, + offsetof(struct vring_used, ring[used_idx]), + sizeof(vq->used->ring[used_idx])); + } if (i + 1 < count) rte_prefetch0(&vq->desc[desc_indexes[i+1]]); @@ -879,6 +881,7 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id, /* Prefetch available ring to retrieve head indexes. */ used_idx = vq->last_used_idx & (vq->size - 1); rte_prefetch0(&vq->avail->ring[used_idx]); + rte_prefetch0(&vq->used->ring[used_idx]); count = RTE_MIN(count, MAX_PKT_BURST); count = RTE_MIN(count, free_entries); @@ -887,22 +890,23 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id, /* Retrieve all of the head indexes first to avoid caching issues. */ for (i = 0; i < count; i++) { - desc_indexes[i] = vq->avail->ring[(vq->last_used_idx + i) & - (vq->size - 1)]; + used_idx = (vq->last_used_idx + i) & (vq->size - 1); + desc_indexes[i] = vq->avail->ring[used_idx]; + + vq->used->ring[used_idx].id = desc_indexes[i]; + vq->used->ring[used_idx].len = 0; + vhost_log_used_vring(dev, vq, + offsetof(struct vring_used, ring[used_idx]), + sizeof(vq->used->ring[used_idx])); } /* Prefetch descriptor index. */ rte_prefetch0(&vq->desc[desc_indexes[0]]); - rte_prefetch0(&vq->used->ring[vq->last_used_idx & (vq->size - 1)]); - for (i = 0; i < count; i++) { int err; - if (likely(i + 1 < count)) { + if (likely(i + 1 < count)) rte_prefetch0(&vq->desc[desc_indexes[i + 1]]); - rte_prefetch0(&vq->used->ring[(used_idx + 1) & - (vq->size - 1)]); - } pkts[i] = rte_pktmbuf_alloc(mbuf_pool); if (unlikely(pkts[i] == NULL)) { @@ -916,18 +920,12 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id, rte_pktmbuf_free(pkts[i]); break; } - - used_idx = vq->last_used_idx++ & (vq->size - 1); - vq->used->ring[used_idx].id = desc_indexes[i]; - vq->used->ring[used_idx].len = 0; - vhost_log_used_vring(dev, vq, - offsetof(struct vring_used, ring[used_idx]), - sizeof(vq->used->ring[used_idx])); } rte_smp_wmb(); rte_smp_rmb(); vq->used->idx += i; + vq->last_used_idx += i; vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx), sizeof(vq->used->idx)); -- 1.9.0 ^ permalink raw reply [flat|nested] 16+ messages in thread
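The ordering argument in the commit message above can be made concrete with a minimal sketch (not part of the patch) of the pattern the change relies on: used-ring slots may be filled early, as long as a write barrier separates those stores from the store that publishes them via "used->idx". Field and helper names are the ones used in vhost_rxtx.c; the guest-notification (kick) step is omitted.

/*
 * Sketch only: batched used-ring update. Assumes vq->size is a power
 * of two and that 'count' buffers starting at start_idx have already
 * been consumed from the avail ring.
 */
static void
publish_used_entries(struct vhost_virtqueue *vq, uint16_t start_idx,
		     const uint16_t *desc_ids, const uint32_t *lens,
		     uint16_t count)
{
	uint16_t i;

	/* Fill the used-ring slots; the guest ignores them for now,
	 * because it only consumes entries below used->idx. */
	for (i = 0; i < count; i++) {
		uint16_t slot = (start_idx + i) & (vq->size - 1);

		vq->used->ring[slot].id  = desc_ids[i];
		vq->used->ring[slot].len = lens[i];
	}

	/* Make sure the slot writes are visible ... */
	rte_smp_wmb();

	/* ... before the guest is told about them. */
	vq->used->idx += count;
}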
* Re: [dpdk-dev] [PATCH 1/3] vhost: pre update used ring for Tx and Rx 2016-05-03 0:46 ` [dpdk-dev] [PATCH 1/3] vhost: pre update used ring for Tx and Rx Yuanhan Liu @ 2016-06-01 6:40 ` Xie, Huawei 2016-06-01 6:55 ` Yuanhan Liu 2016-06-01 13:05 ` Michael S. Tsirkin 0 siblings, 2 replies; 16+ messages in thread From: Xie, Huawei @ 2016-06-01 6:40 UTC (permalink / raw) To: Yuanhan Liu, dev; +Cc: Michael S. Tsirkin On 5/3/2016 8:42 AM, Yuanhan Liu wrote: > Pre update and update used ring in batch for Tx and Rx at the stage > while fetching all avail desc idx. This would reduce some cache misses > and hence, increase the performance a bit. > > Pre update would be feasible as guest driver will not start processing > those entries as far as we don't update "used->idx". (I'm not 100% > certain I don't miss anything, though). > > Cc: Michael S. Tsirkin <mst@redhat.com> > Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> > --- > lib/librte_vhost/vhost_rxtx.c | 58 +++++++++++++++++++++---------------------- > 1 file changed, 28 insertions(+), 30 deletions(-) > > diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c > index c9cd1c5..2c3b810 100644 > --- a/lib/librte_vhost/vhost_rxtx.c > +++ b/lib/librte_vhost/vhost_rxtx.c > @@ -137,7 +137,7 @@ copy_virtio_net_hdr(struct virtio_net *dev, uint64_t desc_addr, > > static inline int __attribute__((always_inline)) > copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq, > - struct rte_mbuf *m, uint16_t desc_idx, uint32_t *copied) > + struct rte_mbuf *m, uint16_t desc_idx) > { > uint32_t desc_avail, desc_offset; > uint32_t mbuf_avail, mbuf_offset; > @@ -161,7 +161,6 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq, > desc_offset = dev->vhost_hlen; > desc_avail = desc->len - dev->vhost_hlen; > > - *copied = rte_pktmbuf_pkt_len(m); > mbuf_avail = rte_pktmbuf_data_len(m); > mbuf_offset = 0; > while (mbuf_avail != 0 || m->next != NULL) { > @@ -262,6 +261,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id, > struct vhost_virtqueue *vq; > uint16_t res_start_idx, res_end_idx; > uint16_t desc_indexes[MAX_PKT_BURST]; > + uint16_t used_idx; > uint32_t i; > > LOG_DEBUG(VHOST_DATA, "(%d) %s\n", dev->vid, __func__); > @@ -285,27 +285,29 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id, > /* Retrieve all of the desc indexes first to avoid caching issues. 
*/ > rte_prefetch0(&vq->avail->ring[res_start_idx & (vq->size - 1)]); > for (i = 0; i < count; i++) { > - desc_indexes[i] = vq->avail->ring[(res_start_idx + i) & > - (vq->size - 1)]; > + used_idx = (res_start_idx + i) & (vq->size - 1); > + desc_indexes[i] = vq->avail->ring[used_idx]; > + vq->used->ring[used_idx].id = desc_indexes[i]; > + vq->used->ring[used_idx].len = pkts[i]->pkt_len + > + dev->vhost_hlen; > + vhost_log_used_vring(dev, vq, > + offsetof(struct vring_used, ring[used_idx]), > + sizeof(vq->used->ring[used_idx])); > } > > rte_prefetch0(&vq->desc[desc_indexes[0]]); > for (i = 0; i < count; i++) { > uint16_t desc_idx = desc_indexes[i]; > - uint16_t used_idx = (res_start_idx + i) & (vq->size - 1); > - uint32_t copied; > int err; > > - err = copy_mbuf_to_desc(dev, vq, pkts[i], desc_idx, &copied); > - > - vq->used->ring[used_idx].id = desc_idx; > - if (unlikely(err)) > + err = copy_mbuf_to_desc(dev, vq, pkts[i], desc_idx); > + if (unlikely(err)) { > + used_idx = (res_start_idx + i) & (vq->size - 1); > vq->used->ring[used_idx].len = dev->vhost_hlen; > - else > - vq->used->ring[used_idx].len = copied + dev->vhost_hlen; > - vhost_log_used_vring(dev, vq, > - offsetof(struct vring_used, ring[used_idx]), > - sizeof(vq->used->ring[used_idx])); > + vhost_log_used_vring(dev, vq, > + offsetof(struct vring_used, ring[used_idx]), > + sizeof(vq->used->ring[used_idx])); > + } > > if (i + 1 < count) > rte_prefetch0(&vq->desc[desc_indexes[i+1]]); > @@ -879,6 +881,7 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id, > /* Prefetch available ring to retrieve head indexes. */ > used_idx = vq->last_used_idx & (vq->size - 1); > rte_prefetch0(&vq->avail->ring[used_idx]); > + rte_prefetch0(&vq->used->ring[used_idx]); > > count = RTE_MIN(count, MAX_PKT_BURST); > count = RTE_MIN(count, free_entries); > @@ -887,22 +890,23 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id, > > /* Retrieve all of the head indexes first to avoid caching issues. */ > for (i = 0; i < count; i++) { > - desc_indexes[i] = vq->avail->ring[(vq->last_used_idx + i) & > - (vq->size - 1)]; > + used_idx = (vq->last_used_idx + i) & (vq->size - 1); > + desc_indexes[i] = vq->avail->ring[used_idx]; > + > + vq->used->ring[used_idx].id = desc_indexes[i]; > + vq->used->ring[used_idx].len = 0; > + vhost_log_used_vring(dev, vq, > + offsetof(struct vring_used, ring[used_idx]), > + sizeof(vq->used->ring[used_idx])); > } > > /* Prefetch descriptor index. */ > rte_prefetch0(&vq->desc[desc_indexes[0]]); > - rte_prefetch0(&vq->used->ring[vq->last_used_idx & (vq->size - 1)]); > - > for (i = 0; i < count; i++) { > int err; > > - if (likely(i + 1 < count)) { > + if (likely(i + 1 < count)) > rte_prefetch0(&vq->desc[desc_indexes[i + 1]]); > - rte_prefetch0(&vq->used->ring[(used_idx + 1) & > - (vq->size - 1)]); > - } > > pkts[i] = rte_pktmbuf_alloc(mbuf_pool); > if (unlikely(pkts[i] == NULL)) { > @@ -916,18 +920,12 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id, > rte_pktmbuf_free(pkts[i]); > break; > } > - > - used_idx = vq->last_used_idx++ & (vq->size - 1); > - vq->used->ring[used_idx].id = desc_indexes[i]; > - vq->used->ring[used_idx].len = 0; > - vhost_log_used_vring(dev, vq, > - offsetof(struct vring_used, ring[used_idx]), > - sizeof(vq->used->ring[used_idx])); > } Had tried post-updating used ring in batch, but forget the perf change. One optimization would be on vhost_log_used_ring. I have two ideas, a) In QEMU side, we always assume use ring will be changed. so that we don't need to log used ring in VHOST. Michael: feasible in QEMU? 
Comments on this? b) We could always mark the whole used ring as modified, rather than logging it entry by entry. > > rte_smp_wmb(); > rte_smp_rmb(); > vq->used->idx += i; > + vq->last_used_idx += i; > vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx), > sizeof(vq->used->idx)); > ^ permalink raw reply [flat|nested] 16+ messages in thread
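For idea (b), a minimal sketch of what "mark the whole used ring as modified" could look like, reusing the vhost_log_used_vring(dev, vq, offset, len) helper already used in vhost_rxtx.c; this is only an illustration of the idea, not tested code.

/* Sketch: dirty-log the whole used ring once per burst instead of
 * logging each ring[] entry individually. */
static void
log_whole_used_ring(struct virtio_net *dev, struct vhost_virtqueue *vq)
{
	/* flags + idx + all ring entries (the trailing avail_event
	 * field, if present, is ignored for simplicity) */
	uint64_t len = offsetof(struct vring_used, ring[vq->size]);

	vhost_log_used_vring(dev, vq, 0, len);
}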
* Re: [dpdk-dev] [PATCH 1/3] vhost: pre update used ring for Tx and Rx 2016-06-01 6:40 ` Xie, Huawei @ 2016-06-01 6:55 ` Yuanhan Liu 2016-06-03 8:18 ` Xie, Huawei 2016-06-01 13:05 ` Michael S. Tsirkin 1 sibling, 1 reply; 16+ messages in thread From: Yuanhan Liu @ 2016-06-01 6:55 UTC (permalink / raw) To: Xie, Huawei; +Cc: dev, Michael S. Tsirkin On Wed, Jun 01, 2016 at 06:40:41AM +0000, Xie, Huawei wrote: > > /* Retrieve all of the head indexes first to avoid caching issues. */ > > for (i = 0; i < count; i++) { > > - desc_indexes[i] = vq->avail->ring[(vq->last_used_idx + i) & > > - (vq->size - 1)]; > > + used_idx = (vq->last_used_idx + i) & (vq->size - 1); > > + desc_indexes[i] = vq->avail->ring[used_idx]; > > + > > + vq->used->ring[used_idx].id = desc_indexes[i]; > > + vq->used->ring[used_idx].len = 0; > > + vhost_log_used_vring(dev, vq, > > + offsetof(struct vring_used, ring[used_idx]), > > + sizeof(vq->used->ring[used_idx])); > > } > > > > /* Prefetch descriptor index. */ > > rte_prefetch0(&vq->desc[desc_indexes[0]]); > > - rte_prefetch0(&vq->used->ring[vq->last_used_idx & (vq->size - 1)]); > > - > > for (i = 0; i < count; i++) { > > int err; > > > > - if (likely(i + 1 < count)) { > > + if (likely(i + 1 < count)) > > rte_prefetch0(&vq->desc[desc_indexes[i + 1]]); > > - rte_prefetch0(&vq->used->ring[(used_idx + 1) & > > - (vq->size - 1)]); > > - } > > > > pkts[i] = rte_pktmbuf_alloc(mbuf_pool); > > if (unlikely(pkts[i] == NULL)) { > > @@ -916,18 +920,12 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id, > > rte_pktmbuf_free(pkts[i]); > > break; > > } > > - > > - used_idx = vq->last_used_idx++ & (vq->size - 1); > > - vq->used->ring[used_idx].id = desc_indexes[i]; > > - vq->used->ring[used_idx].len = 0; > > - vhost_log_used_vring(dev, vq, > > - offsetof(struct vring_used, ring[used_idx]), > > - sizeof(vq->used->ring[used_idx])); > > } > > Had tried post-updating used ring in batch, but forget the perf change. I would assume pre-updating gives better performance gain, as we are fiddling with avail and used ring together, which would be more cache friendly. > One optimization would be on vhost_log_used_ring. > I have two ideas, > a) In QEMU side, we always assume use ring will be changed. so that we > don't need to log used ring in VHOST. > > Michael: feasible in QEMU? comments on this? > > b) We could always mark the total used ring modified rather than entry > by entry. I doubt it's worthwhile. One fact is that vhost_log_used_ring is a non operation in most time: it will take action only in the short gap of during live migration. And FYI, I even tried with all vhost_log_xxx being removed, it showed no performance boost at all. Therefore, it's not a factor that will impact performance. --yliu ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [dpdk-dev] [PATCH 1/3] vhost: pre update used ring for Tx and Rx 2016-06-01 6:55 ` Yuanhan Liu @ 2016-06-03 8:18 ` Xie, Huawei 0 siblings, 0 replies; 16+ messages in thread From: Xie, Huawei @ 2016-06-03 8:18 UTC (permalink / raw) To: Yuanhan Liu; +Cc: dev, Michael S. Tsirkin On 6/1/2016 2:53 PM, Yuanhan Liu wrote: > On Wed, Jun 01, 2016 at 06:40:41AM +0000, Xie, Huawei wrote: >>> /* Retrieve all of the head indexes first to avoid caching issues. */ >>> for (i = 0; i < count; i++) { >>> - desc_indexes[i] = vq->avail->ring[(vq->last_used_idx + i) & >>> - (vq->size - 1)]; >>> + used_idx = (vq->last_used_idx + i) & (vq->size - 1); >>> + desc_indexes[i] = vq->avail->ring[used_idx]; >>> + >>> + vq->used->ring[used_idx].id = desc_indexes[i]; >>> + vq->used->ring[used_idx].len = 0; >>> + vhost_log_used_vring(dev, vq, >>> + offsetof(struct vring_used, ring[used_idx]), >>> + sizeof(vq->used->ring[used_idx])); >>> } >>> >>> /* Prefetch descriptor index. */ >>> rte_prefetch0(&vq->desc[desc_indexes[0]]); >>> - rte_prefetch0(&vq->used->ring[vq->last_used_idx & (vq->size - 1)]); >>> - >>> for (i = 0; i < count; i++) { >>> int err; >>> >>> - if (likely(i + 1 < count)) { >>> + if (likely(i + 1 < count)) >>> rte_prefetch0(&vq->desc[desc_indexes[i + 1]]); >>> - rte_prefetch0(&vq->used->ring[(used_idx + 1) & >>> - (vq->size - 1)]); >>> - } >>> >>> pkts[i] = rte_pktmbuf_alloc(mbuf_pool); >>> if (unlikely(pkts[i] == NULL)) { >>> @@ -916,18 +920,12 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id, >>> rte_pktmbuf_free(pkts[i]); >>> break; >>> } >>> - >>> - used_idx = vq->last_used_idx++ & (vq->size - 1); >>> - vq->used->ring[used_idx].id = desc_indexes[i]; >>> - vq->used->ring[used_idx].len = 0; >>> - vhost_log_used_vring(dev, vq, >>> - offsetof(struct vring_used, ring[used_idx]), >>> - sizeof(vq->used->ring[used_idx])); >>> } >> Had tried post-updating used ring in batch, but forget the perf change. > I would assume pre-updating gives better performance gain, as we are > fiddling with avail and used ring together, which would be more cache > friendly. The distance between entry for avail ring and used ring are at least 8 cache lines. The benefit comes from batch updates, if applicable. > >> One optimization would be on vhost_log_used_ring. >> I have two ideas, >> a) In QEMU side, we always assume use ring will be changed. so that we >> don't need to log used ring in VHOST. >> >> Michael: feasible in QEMU? comments on this? >> >> b) We could always mark the total used ring modified rather than entry >> by entry. > I doubt it's worthwhile. One fact is that vhost_log_used_ring is > a non operation in most time: it will take action only in the short > gap of during live migration. > > And FYI, I even tried with all vhost_log_xxx being removed, it showed > no performance boost at all. Therefore, it's not a factor that will > impact performance. I knew this. > --yliu > ^ permalink raw reply [flat|nested] 16+ messages in thread
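To put a rough number on the "at least 8 cache lines" remark above, here is a back-of-the-envelope sketch assuming the legacy contiguous split-ring layout (descriptor table, then avail ring, then the 4K-aligned used ring) and a 256-entry queue; the exact distance depends on ring size and alignment, so treat it as an illustration only.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	const uint32_t num = 256;	/* typical queue size */
	const uint32_t align = 4096;	/* legacy vring alignment */

	/* Offsets from the start of the vring, legacy contiguous layout:
	 * 16B descriptors, 2B avail entries, 8B used entries. */
	uint32_t avail_0 = num * 16 + 4;		/* &avail->ring[0] */
	uint32_t avail_end = num * 16 + 4 + 2 * num + 2;
	uint32_t used_0 = (avail_end + align - 1) / align * align + 4;

	printf("avail[0] .. used[0]: %u bytes (%u cache lines)\n",
	       used_0 - avail_0, (used_0 - avail_0) / 64);
	return 0;
}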
* Re: [dpdk-dev] [PATCH 1/3] vhost: pre update used ring for Tx and Rx 2016-06-01 6:40 ` Xie, Huawei 2016-06-01 6:55 ` Yuanhan Liu @ 2016-06-01 13:05 ` Michael S. Tsirkin 1 sibling, 0 replies; 16+ messages in thread From: Michael S. Tsirkin @ 2016-06-01 13:05 UTC (permalink / raw) To: Xie, Huawei; +Cc: Yuanhan Liu, dev On Wed, Jun 01, 2016 at 06:40:41AM +0000, Xie, Huawei wrote: > On 5/3/2016 8:42 AM, Yuanhan Liu wrote: > > Pre update and update used ring in batch for Tx and Rx at the stage > > while fetching all avail desc idx. This would reduce some cache misses > > and hence, increase the performance a bit. > > > > Pre update would be feasible as guest driver will not start processing > > those entries as far as we don't update "used->idx". (I'm not 100% > > certain I don't miss anything, though). > > > > Cc: Michael S. Tsirkin <mst@redhat.com> > > Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> > > --- > > lib/librte_vhost/vhost_rxtx.c | 58 +++++++++++++++++++++---------------------- > > 1 file changed, 28 insertions(+), 30 deletions(-) > > > > diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c > > index c9cd1c5..2c3b810 100644 > > --- a/lib/librte_vhost/vhost_rxtx.c > > +++ b/lib/librte_vhost/vhost_rxtx.c > > @@ -137,7 +137,7 @@ copy_virtio_net_hdr(struct virtio_net *dev, uint64_t desc_addr, > > > > static inline int __attribute__((always_inline)) > > copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq, > > - struct rte_mbuf *m, uint16_t desc_idx, uint32_t *copied) > > + struct rte_mbuf *m, uint16_t desc_idx) > > { > > uint32_t desc_avail, desc_offset; > > uint32_t mbuf_avail, mbuf_offset; > > @@ -161,7 +161,6 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq, > > desc_offset = dev->vhost_hlen; > > desc_avail = desc->len - dev->vhost_hlen; > > > > - *copied = rte_pktmbuf_pkt_len(m); > > mbuf_avail = rte_pktmbuf_data_len(m); > > mbuf_offset = 0; > > while (mbuf_avail != 0 || m->next != NULL) { > > @@ -262,6 +261,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id, > > struct vhost_virtqueue *vq; > > uint16_t res_start_idx, res_end_idx; > > uint16_t desc_indexes[MAX_PKT_BURST]; > > + uint16_t used_idx; > > uint32_t i; > > > > LOG_DEBUG(VHOST_DATA, "(%d) %s\n", dev->vid, __func__); > > @@ -285,27 +285,29 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id, > > /* Retrieve all of the desc indexes first to avoid caching issues. 
*/ > > rte_prefetch0(&vq->avail->ring[res_start_idx & (vq->size - 1)]); > > for (i = 0; i < count; i++) { > > - desc_indexes[i] = vq->avail->ring[(res_start_idx + i) & > > - (vq->size - 1)]; > > + used_idx = (res_start_idx + i) & (vq->size - 1); > > + desc_indexes[i] = vq->avail->ring[used_idx]; > > + vq->used->ring[used_idx].id = desc_indexes[i]; > > + vq->used->ring[used_idx].len = pkts[i]->pkt_len + > > + dev->vhost_hlen; > > + vhost_log_used_vring(dev, vq, > > + offsetof(struct vring_used, ring[used_idx]), > > + sizeof(vq->used->ring[used_idx])); > > } > > > > rte_prefetch0(&vq->desc[desc_indexes[0]]); > > for (i = 0; i < count; i++) { > > uint16_t desc_idx = desc_indexes[i]; > > - uint16_t used_idx = (res_start_idx + i) & (vq->size - 1); > > - uint32_t copied; > > int err; > > > > - err = copy_mbuf_to_desc(dev, vq, pkts[i], desc_idx, &copied); > > - > > - vq->used->ring[used_idx].id = desc_idx; > > - if (unlikely(err)) > > + err = copy_mbuf_to_desc(dev, vq, pkts[i], desc_idx); > > + if (unlikely(err)) { > > + used_idx = (res_start_idx + i) & (vq->size - 1); > > vq->used->ring[used_idx].len = dev->vhost_hlen; > > - else > > - vq->used->ring[used_idx].len = copied + dev->vhost_hlen; > > - vhost_log_used_vring(dev, vq, > > - offsetof(struct vring_used, ring[used_idx]), > > - sizeof(vq->used->ring[used_idx])); > > + vhost_log_used_vring(dev, vq, > > + offsetof(struct vring_used, ring[used_idx]), > > + sizeof(vq->used->ring[used_idx])); > > + } > > > > if (i + 1 < count) > > rte_prefetch0(&vq->desc[desc_indexes[i+1]]); > > @@ -879,6 +881,7 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id, > > /* Prefetch available ring to retrieve head indexes. */ > > used_idx = vq->last_used_idx & (vq->size - 1); > > rte_prefetch0(&vq->avail->ring[used_idx]); > > + rte_prefetch0(&vq->used->ring[used_idx]); > > > > count = RTE_MIN(count, MAX_PKT_BURST); > > count = RTE_MIN(count, free_entries); > > @@ -887,22 +890,23 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id, > > > > /* Retrieve all of the head indexes first to avoid caching issues. */ > > for (i = 0; i < count; i++) { > > - desc_indexes[i] = vq->avail->ring[(vq->last_used_idx + i) & > > - (vq->size - 1)]; > > + used_idx = (vq->last_used_idx + i) & (vq->size - 1); > > + desc_indexes[i] = vq->avail->ring[used_idx]; > > + > > + vq->used->ring[used_idx].id = desc_indexes[i]; > > + vq->used->ring[used_idx].len = 0; > > + vhost_log_used_vring(dev, vq, > > + offsetof(struct vring_used, ring[used_idx]), > > + sizeof(vq->used->ring[used_idx])); > > } > > > > /* Prefetch descriptor index. */ > > rte_prefetch0(&vq->desc[desc_indexes[0]]); > > - rte_prefetch0(&vq->used->ring[vq->last_used_idx & (vq->size - 1)]); > > - > > for (i = 0; i < count; i++) { > > int err; > > > > - if (likely(i + 1 < count)) { > > + if (likely(i + 1 < count)) > > rte_prefetch0(&vq->desc[desc_indexes[i + 1]]); > > - rte_prefetch0(&vq->used->ring[(used_idx + 1) & > > - (vq->size - 1)]); > > - } > > > > pkts[i] = rte_pktmbuf_alloc(mbuf_pool); > > if (unlikely(pkts[i] == NULL)) { > > @@ -916,18 +920,12 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id, > > rte_pktmbuf_free(pkts[i]); > > break; > > } > > - > > - used_idx = vq->last_used_idx++ & (vq->size - 1); > > - vq->used->ring[used_idx].id = desc_indexes[i]; > > - vq->used->ring[used_idx].len = 0; > > - vhost_log_used_vring(dev, vq, > > - offsetof(struct vring_used, ring[used_idx]), > > - sizeof(vq->used->ring[used_idx])); > > } > > Had tried post-updating used ring in batch, but forget the perf change. 
> > One optimization would be on vhost_log_used_ring. > I have two ideas, > a) In QEMU side, we always assume use ring will be changed. so that we > don't need to log used ring in VHOST. > > Michael: feasible in QEMU? comments on this? To avoid breaking old QEMU, we'll need a new protocol feature bit, but generally this sounds reasonable. > b) We could always mark the total used ring modified rather than entry > by entry. > > > > > rte_smp_wmb(); > > rte_smp_rmb(); > > vq->used->idx += i; > > + vq->last_used_idx += i; > > vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx), > > sizeof(vq->used->idx)); > > > ^ permalink raw reply [flat|nested] 16+ messages in thread
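For idea (a), a rough sketch of how the vhost side could use such a protocol feature bit; the bit name, its value, and the way the negotiated protocol features are passed in are all hypothetical placeholders, not an existing part of the vhost-user spec or the DPDK code.

/* HYPOTHETICAL protocol feature bit: "the front-end always treats the
 * used ring as dirty during migration". The real bit would have to be
 * assigned in the vhost-user specification and implemented in QEMU. */
#define VHOST_USER_PROTOCOL_F_LOG_ALL_USED	10

static inline void
log_used_entry(struct virtio_net *dev, struct vhost_virtqueue *vq,
	       uint16_t used_idx, uint64_t protocol_features)
{
	/* Skip per-entry logging when the front-end handles it itself. */
	if (protocol_features & (1ULL << VHOST_USER_PROTOCOL_F_LOG_ALL_USED))
		return;

	vhost_log_used_vring(dev, vq,
			offsetof(struct vring_used, ring[used_idx]),
			sizeof(vq->used->ring[used_idx]));
}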
* [dpdk-dev] [PATCH 2/3] vhost: optimize dequeue for small packets 2016-05-03 0:46 [dpdk-dev] [PATCH 0/3] [RFC] vhost: micro vhost optimization Yuanhan Liu 2016-05-03 0:46 ` [dpdk-dev] [PATCH 1/3] vhost: pre update used ring for Tx and Rx Yuanhan Liu @ 2016-05-03 0:46 ` Yuanhan Liu 2016-06-01 6:24 ` Xie, Huawei 2016-06-03 7:43 ` Xie, Huawei 2016-05-03 0:46 ` [dpdk-dev] [PATCH 3/3] vhost: arrange virtio_net fields for better cache sharing Yuanhan Liu ` (2 subsequent siblings) 4 siblings, 2 replies; 16+ messages in thread From: Yuanhan Liu @ 2016-05-03 0:46 UTC (permalink / raw) To: dev; +Cc: huawei.xie, Yuanhan Liu Both current kernel virtio driver and DPDK virtio driver use at least 2 desc buffer for Tx: the first for storing the header, and the others for storing the data. Therefore, we could fetch the first data desc buf before the main loop, and do the copy first before the check of "are we done yet?". This could save one check for small packets, that just have one data desc buffer and need one mbuf to store it. Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> --- lib/librte_vhost/vhost_rxtx.c | 52 ++++++++++++++++++++++++++++++------------- 1 file changed, 36 insertions(+), 16 deletions(-) diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c index 2c3b810..34d6ed1 100644 --- a/lib/librte_vhost/vhost_rxtx.c +++ b/lib/librte_vhost/vhost_rxtx.c @@ -753,18 +753,48 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq, return -1; desc_addr = gpa_to_vva(dev, desc->addr); - rte_prefetch0((void *)(uintptr_t)desc_addr); - - /* Retrieve virtio net header */ hdr = (struct virtio_net_hdr *)((uintptr_t)desc_addr); - desc_avail = desc->len - dev->vhost_hlen; - desc_offset = dev->vhost_hlen; + rte_prefetch0(hdr); + + /* + * Both current kernel virio driver and DPDK virtio driver + * use at least 2 desc bufferr for Tx: the first for storing + * the header, and others for storing the data. + */ + if (likely(desc->len == dev->vhost_hlen)) { + desc = &vq->desc[desc->next]; + + desc_addr = gpa_to_vva(dev, desc->addr); + rte_prefetch0((void *)(uintptr_t)desc_addr); + + desc_offset = 0; + desc_avail = desc->len; + nr_desc += 1; + + PRINT_PACKET(dev, (uintptr_t)desc_addr, desc->len, 0); + } else { + desc_avail = desc->len - dev->vhost_hlen; + desc_offset = dev->vhost_hlen; + } mbuf_offset = 0; mbuf_avail = m->buf_len - RTE_PKTMBUF_HEADROOM; - while (desc_avail != 0 || (desc->flags & VRING_DESC_F_NEXT) != 0) { + while (1) { + cpy_len = RTE_MIN(desc_avail, mbuf_avail); + rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, mbuf_offset), + (void *)((uintptr_t)(desc_addr + desc_offset)), + cpy_len); + + mbuf_avail -= cpy_len; + mbuf_offset += cpy_len; + desc_avail -= cpy_len; + desc_offset += cpy_len; + /* This desc reaches to its end, get the next one */ if (desc_avail == 0) { + if ((desc->flags & VRING_DESC_F_NEXT) == 0) + break; + if (unlikely(desc->next >= vq->size || ++nr_desc >= vq->size)) return -1; @@ -800,16 +830,6 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq, mbuf_offset = 0; mbuf_avail = cur->buf_len - RTE_PKTMBUF_HEADROOM; } - - cpy_len = RTE_MIN(desc_avail, mbuf_avail); - rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, mbuf_offset), - (void *)((uintptr_t)(desc_addr + desc_offset)), - cpy_len); - - mbuf_avail -= cpy_len; - mbuf_offset += cpy_len; - desc_avail -= cpy_len; - desc_offset += cpy_len; } prev->data_len = mbuf_offset; -- 1.9.0 ^ permalink raw reply [flat|nested] 16+ messages in thread
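As a reference for the review discussion that follows, a small sketch (not from the patch) of the two Tx descriptor layouts involved; the test mirrors the desc->len == dev->vhost_hlen comparison in the diff above.

/*
 * Layout A (kernel virtio-net and the DPDK virtio PMD at the time):
 *   desc[0]: virtio_net_hdr only, len == vhost_hlen, VRING_DESC_F_NEXT set
 *   desc[1]: packet data
 *
 * Layout B (header prepended in the same buffer as the data):
 *   desc[0]: virtio_net_hdr + packet data, len > vhost_hlen
 *
 * The fast path added by this patch targets layout A; layout B still
 * goes through the generic branch.
 */
static inline int
tx_header_has_own_desc(const struct vring_desc *desc, uint16_t vhost_hlen)
{
	return desc->len == vhost_hlen &&
	       (desc->flags & VRING_DESC_F_NEXT) != 0;
}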
* Re: [dpdk-dev] [PATCH 2/3] vhost: optimize dequeue for small packets 2016-05-03 0:46 ` [dpdk-dev] [PATCH 2/3] vhost: optimize dequeue for small packets Yuanhan Liu @ 2016-06-01 6:24 ` Xie, Huawei 2016-06-01 6:44 ` Yuanhan Liu 2016-06-03 7:43 ` Xie, Huawei 1 sibling, 1 reply; 16+ messages in thread From: Xie, Huawei @ 2016-06-01 6:24 UTC (permalink / raw) To: Yuanhan Liu, dev On 5/3/2016 8:42 AM, Yuanhan Liu wrote: > Both current kernel virtio driver and DPDK virtio driver use at least > 2 desc buffer for Tx: the first for storing the header, and the others > for storing the data. Tx could prepend some space for virtio net header whenever possible, so that it could use only one descriptor. Another thing is this doesn't reduce the check because you also add a check. > > Therefore, we could fetch the first data desc buf before the main loop, > and do the copy first before the check of "are we done yet?". This > could save one check for small packets, that just have one data desc > buffer and need one mbuf to store it. > > Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> > --- > lib/librte_vhost/vhost_rxtx.c | 52 ++++++++++++++++++++++++++++++------------- > 1 file changed, 36 insertions(+), 16 deletions(-) > > diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c > index 2c3b810..34d6ed1 100644 > --- a/lib/librte_vhost/vhost_rxtx.c > +++ b/lib/librte_vhost/vhost_rxtx.c > @@ -753,18 +753,48 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq, > return -1; > > desc_addr = gpa_to_vva(dev, desc->addr); > - rte_prefetch0((void *)(uintptr_t)desc_addr); > - > - /* Retrieve virtio net header */ > hdr = (struct virtio_net_hdr *)((uintptr_t)desc_addr); > - desc_avail = desc->len - dev->vhost_hlen; > - desc_offset = dev->vhost_hlen; > + rte_prefetch0(hdr); > + > + /* > + * Both current kernel virio driver and DPDK virtio driver > + * use at least 2 desc bufferr for Tx: the first for storing > + * the header, and others for storing the data. 
> + */ > + if (likely(desc->len == dev->vhost_hlen)) { > + desc = &vq->desc[desc->next]; > + > + desc_addr = gpa_to_vva(dev, desc->addr); > + rte_prefetch0((void *)(uintptr_t)desc_addr); > + > + desc_offset = 0; > + desc_avail = desc->len; > + nr_desc += 1; > + > + PRINT_PACKET(dev, (uintptr_t)desc_addr, desc->len, 0); > + } else { > + desc_avail = desc->len - dev->vhost_hlen; > + desc_offset = dev->vhost_hlen; > + } > > mbuf_offset = 0; > mbuf_avail = m->buf_len - RTE_PKTMBUF_HEADROOM; > - while (desc_avail != 0 || (desc->flags & VRING_DESC_F_NEXT) != 0) { > + while (1) { > + cpy_len = RTE_MIN(desc_avail, mbuf_avail); > + rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, mbuf_offset), > + (void *)((uintptr_t)(desc_addr + desc_offset)), > + cpy_len); > + > + mbuf_avail -= cpy_len; > + mbuf_offset += cpy_len; > + desc_avail -= cpy_len; > + desc_offset += cpy_len; > + > /* This desc reaches to its end, get the next one */ > if (desc_avail == 0) { > + if ((desc->flags & VRING_DESC_F_NEXT) == 0) > + break; > + > if (unlikely(desc->next >= vq->size || > ++nr_desc >= vq->size)) > return -1; > @@ -800,16 +830,6 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq, > mbuf_offset = 0; > mbuf_avail = cur->buf_len - RTE_PKTMBUF_HEADROOM; > } > - > - cpy_len = RTE_MIN(desc_avail, mbuf_avail); > - rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, mbuf_offset), > - (void *)((uintptr_t)(desc_addr + desc_offset)), > - cpy_len); > - > - mbuf_avail -= cpy_len; > - mbuf_offset += cpy_len; > - desc_avail -= cpy_len; > - desc_offset += cpy_len; > } > > prev->data_len = mbuf_offset; ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [dpdk-dev] [PATCH 2/3] vhost: optimize dequeue for small packets 2016-06-01 6:24 ` Xie, Huawei @ 2016-06-01 6:44 ` Yuanhan Liu 2016-06-03 7:42 ` Xie, Huawei 0 siblings, 1 reply; 16+ messages in thread From: Yuanhan Liu @ 2016-06-01 6:44 UTC (permalink / raw) To: Xie, Huawei; +Cc: dev On Wed, Jun 01, 2016 at 06:24:18AM +0000, Xie, Huawei wrote: > On 5/3/2016 8:42 AM, Yuanhan Liu wrote: > > Both current kernel virtio driver and DPDK virtio driver use at least > > 2 desc buffer for Tx: the first for storing the header, and the others > > for storing the data. > > Tx could prepend some space for virtio net header whenever possible, so > that it could use only one descriptor. In that case, it will work as well: it will go to the "else" code path then. > Another thing is this doesn't reduce the check because you also add a check. Actually, yes, it does save checks. Before this patch, we have: while (not done yet) { if (desc_is_drain) ...; if (mbuf_is_full) ...; COPY(); } Note that the "while" check will be done twice, therefore, it's 4 checks in total. After this patch, it would be: if (virtio_net_hdr_takes_one_desc) ... while (1) { COPY(); if (desc_is_drain) { break if done; ...; } if (mbuf_is_full) { /* small packets will bypass this check */ ....; } } So, for small packets, it takes 2 checks only, which actually saves 2 checks. --yliu ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [dpdk-dev] [PATCH 2/3] vhost: optimize dequeue for small packets 2016-06-01 6:44 ` Yuanhan Liu @ 2016-06-03 7:42 ` Xie, Huawei 0 siblings, 0 replies; 16+ messages in thread From: Xie, Huawei @ 2016-06-03 7:42 UTC (permalink / raw) To: Yuanhan Liu; +Cc: dev On 6/1/2016 2:41 PM, Yuanhan Liu wrote: > On Wed, Jun 01, 2016 at 06:24:18AM +0000, Xie, Huawei wrote: >> On 5/3/2016 8:42 AM, Yuanhan Liu wrote: >>> Both current kernel virtio driver and DPDK virtio driver use at least >>> 2 desc buffer for Tx: the first for storing the header, and the others >>> for storing the data. >> Tx could prepend some space for virtio net header whenever possible, so >> that it could use only one descriptor. > In such case, it will work as well: it will goto the "else" code path > then. It works, but the problem is the statement. >> Another thing is this doesn't reduce the check because you also add a check. > Actually, yes, it does save check. Before this patch, we have: > > while (not done yet) { > if (desc_is_drain) > ...; > > if (mbuf_is_full) > ...; > > COPY(); > } > > Note that the "while" check will be done twice, therefore, it's 4 > checks in total. After this patch, it would be: > > if (virtio_net_hdr_takes_one_desc) > ... > > while (1) { > COPY(); > > if (desc_is_drain) { > break if done; > ...; > } > > if (mbuf_is_full { > /* small packets will bypass this check */ > ....; > } > } > > So, for small packets, it takes 2 checks only, which actually saves > 2 checks. > > --yliu > For small packets, we have three checks in total. It is worth using space to save checks, so I would ack this patch, but note that there is one assumption: that the rte_memcpy API can handle a zero-length copy. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [dpdk-dev] [PATCH 2/3] vhost: optimize dequeue for small packets 2016-05-03 0:46 ` [dpdk-dev] [PATCH 2/3] vhost: optimize dequeue for small packets Yuanhan Liu 2016-06-01 6:24 ` Xie, Huawei @ 2016-06-03 7:43 ` Xie, Huawei 1 sibling, 0 replies; 16+ messages in thread From: Xie, Huawei @ 2016-06-03 7:43 UTC (permalink / raw) To: Yuanhan Liu, dev On 5/3/2016 8:42 AM, Yuanhan Liu wrote: > Both current kernel virtio driver and DPDK virtio driver use at least > 2 desc buffer for Tx: the first for storing the header, and the others > for storing the data. > > Therefore, we could fetch the first data desc buf before the main loop, > and do the copy first before the check of "are we done yet?". This > could save one check for small packets, that just have one data desc > buffer and need one mbuf to store it. > > Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Acked-by: Huawei Xie <huawei.xie@intel.com> ^ permalink raw reply [flat|nested] 16+ messages in thread
* [dpdk-dev] [PATCH 3/3] vhost: arrange virtio_net fields for better cache sharing 2016-05-03 0:46 [dpdk-dev] [PATCH 0/3] [RFC] vhost: micro vhost optimization Yuanhan Liu 2016-05-03 0:46 ` [dpdk-dev] [PATCH 1/3] vhost: pre update used ring for Tx and Rx Yuanhan Liu 2016-05-03 0:46 ` [dpdk-dev] [PATCH 2/3] vhost: optimize dequeue for small packets Yuanhan Liu @ 2016-05-03 0:46 ` Yuanhan Liu 2016-06-01 6:42 ` Xie, Huawei 2016-05-10 21:49 ` [dpdk-dev] [PATCH 0/3] [RFC] vhost: micro vhost optimization Rich Lane 2016-06-14 12:42 ` Yuanhan Liu 4 siblings, 1 reply; 16+ messages in thread From: Yuanhan Liu @ 2016-05-03 0:46 UTC (permalink / raw) To: dev; +Cc: huawei.xie, Yuanhan Liu The ifname[] field takes so much space that it separates some frequently used fields, say, features and broadcast_rarp, into different cache lines. This patch moves all the fields that are accessed frequently in Rx/Tx together (before the ifname[] field) to let them share one cache line. Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> --- lib/librte_vhost/vhost-net.h | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/lib/librte_vhost/vhost-net.h b/lib/librte_vhost/vhost-net.h index 9dec83c..3b0ffe7 100644 --- a/lib/librte_vhost/vhost-net.h +++ b/lib/librte_vhost/vhost-net.h @@ -123,16 +123,16 @@ struct virtio_net { int vid; uint32_t flags; uint16_t vhost_hlen; + /* to tell if we need broadcast rarp packet */ + rte_atomic16_t broadcast_rarp; + uint32_t virt_qp_nb; + struct vhost_virtqueue *virtqueue[VHOST_MAX_QUEUE_PAIRS * 2]; #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ) char ifname[IF_NAME_SZ]; - uint32_t virt_qp_nb; uint64_t log_size; uint64_t log_base; struct ether_addr mac; - /* to tell if we need broadcast rarp packet */ - rte_atomic16_t broadcast_rarp; - struct vhost_virtqueue *virtqueue[VHOST_MAX_QUEUE_PAIRS * 2]; } __rte_cache_aligned; /** -- 1.9.0 ^ permalink raw reply [flat|nested] 16+ messages in thread
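A small sketch of how the intended layout could be sanity-checked at build time, using the existing RTE_BUILD_BUG_ON and RTE_CACHE_LINE_SIZE macros; which fields count as "hot" here is just an assumption for illustration (the ones the patch moves up).

/* Sketch: fail the build if the fields touched in the Rx/Tx hot path
 * no longer fit in the first cache line of struct virtio_net. */
static inline void
virtio_net_layout_check(void)
{
	RTE_BUILD_BUG_ON(offsetof(struct virtio_net, broadcast_rarp) >=
			 RTE_CACHE_LINE_SIZE);
	RTE_BUILD_BUG_ON(offsetof(struct virtio_net, virt_qp_nb) >=
			 RTE_CACHE_LINE_SIZE);
	RTE_BUILD_BUG_ON(offsetof(struct virtio_net, virtqueue) >=
			 RTE_CACHE_LINE_SIZE);
}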
* Re: [dpdk-dev] [PATCH 3/3] vhost: arrange virtio_net fields for better cache sharing 2016-05-03 0:46 ` [dpdk-dev] [PATCH 3/3] vhost: arrange virtio_net fields for better cache sharing Yuanhan Liu @ 2016-06-01 6:42 ` Xie, Huawei 0 siblings, 0 replies; 16+ messages in thread From: Xie, Huawei @ 2016-06-01 6:42 UTC (permalink / raw) To: Yuanhan Liu, dev On 5/3/2016 8:42 AM, Yuanhan Liu wrote: > the ifname[] field takes so much space, that it seperate some frequently > used fields into different caches, say, features and broadcast_rarp. > > This patch move all those fields that will be accessed frequently in Rx/Tx > together (before the ifname[] field) to let them share one cache line. > > Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Acked-by: Huawei Xie <huawei.xie@intel.com> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [dpdk-dev] [PATCH 0/3] [RFC] vhost: micro vhost optimization 2016-05-03 0:46 [dpdk-dev] [PATCH 0/3] [RFC] vhost: micro vhost optimization Yuanhan Liu ` (2 preceding siblings ...) 2016-05-03 0:46 ` [dpdk-dev] [PATCH 3/3] vhost: arrange virtio_net fields for better cache sharing Yuanhan Liu @ 2016-05-10 21:49 ` Rich Lane 2016-05-10 22:08 ` Yuanhan Liu 2016-06-14 12:42 ` Yuanhan Liu 4 siblings, 1 reply; 16+ messages in thread From: Rich Lane @ 2016-05-10 21:49 UTC (permalink / raw) To: Yuanhan Liu; +Cc: dev, huawei.xie I see a significant performance improvement with these patches, around 5% at 64 bytes. The one patch that didn't give any performance boost for me was "vhost: arrange virtio_net fields for better cache sharing". Tested-by: Rich Lane <rich.lane@bigswitch.com> On Mon, May 2, 2016 at 5:46 PM, Yuanhan Liu <yuanhan.liu@linux.intel.com> wrote: > Here is a small patch set does the micro optimization, which brings about > 10% performance boost in my 64B packet testing, with the following topo: > > pkt generator <----> NIC <-----> Virtio NIC > > Patch 1 pre updates the used ring and update them in batch. It should be > feasible from my understanding: there will be no issue, guest driver will > not start processing them as far as we haven't updated the "used->idx" > yet. I could miss something though. > > Patch 2 saves one check for small packets (that can be hold in one desc > buf and mbuf). > > Patch 3 moves several frequently used fields into one cache line, for > better cache sharing. > > Note that this patch set is based on my latest vhost ABI refactoring > patchset. > > > --- > Yuanhan Liu (3): > vhost: pre update used ring for Tx and Rx > vhost: optimize dequeue for small packets > vhost: arrange virtio_net fields for better cache sharing > > lib/librte_vhost/vhost-net.h | 8 +-- > lib/librte_vhost/vhost_rxtx.c | 110 > ++++++++++++++++++++++++------------------ > 2 files changed, 68 insertions(+), 50 deletions(-) > > -- > 1.9.0 > > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [dpdk-dev] [PATCH 0/3] [RFC] vhost: micro vhost optimization 2016-05-10 21:49 ` [dpdk-dev] [PATCH 0/3] [RFC] vhost: micro vhost optimization Rich Lane @ 2016-05-10 22:08 ` Yuanhan Liu 0 siblings, 0 replies; 16+ messages in thread From: Yuanhan Liu @ 2016-05-10 22:08 UTC (permalink / raw) To: Rich Lane; +Cc: dev, huawei.xie On Tue, May 10, 2016 at 02:49:43PM -0700, Rich Lane wrote: > I see a significant performance improvement with these patches, around 5% at 64 > bytes. Thanks for testing. > > The one patch that didn't give any performance boost for me was "vhost: arrange > virtio_net fields for better cache sharing". Yeah, same here from my test. I mean, in theory, it should give us a tiny boost, it doesn't in real life though. And since it (should) do no harm, I would still include this patch in this set. Maybe I should have noted at first that no real perf gain from the 3rd patch. --yliu > > Tested-by: Rich Lane <rich.lane@bigswitch.com> > > On Mon, May 2, 2016 at 5:46 PM, Yuanhan Liu <yuanhan.liu@linux.intel.com> > wrote: > > Here is a small patch set does the micro optimization, which brings about > 10% performance boost in my 64B packet testing, with the following topo: > > pkt generator <----> NIC <-----> Virtio NIC > > Patch 1 pre updates the used ring and update them in batch. It should be > feasible from my understanding: there will be no issue, guest driver will > not start processing them as far as we haven't updated the "used->idx" > yet. I could miss something though. > > Patch 2 saves one check for small packets (that can be hold in one desc > buf and mbuf). > > Patch 3 moves several frequently used fields into one cache line, for > better cache sharing. > > Note that this patch set is based on my latest vhost ABI refactoring > patchset. > > > --- > Yuanhan Liu (3): > vhost: pre update used ring for Tx and Rx > vhost: optimize dequeue for small packets > vhost: arrange virtio_net fields for better cache sharing > > lib/librte_vhost/vhost-net.h | 8 +-- > lib/librte_vhost/vhost_rxtx.c | 110 > ++++++++++++++++++++++++------------------ > 2 files changed, 68 insertions(+), 50 deletions(-) > > -- > 1.9.0 > > > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [dpdk-dev] [PATCH 0/3] [RFC] vhost: micro vhost optimization 2016-05-03 0:46 [dpdk-dev] [PATCH 0/3] [RFC] vhost: micro vhost optimization Yuanhan Liu ` (3 preceding siblings ...) 2016-05-10 21:49 ` [dpdk-dev] [PATCH 0/3] [RFC] vhost: micro vhost optimization Rich Lane @ 2016-06-14 12:42 ` Yuanhan Liu 4 siblings, 0 replies; 16+ messages in thread From: Yuanhan Liu @ 2016-06-14 12:42 UTC (permalink / raw) To: dev; +Cc: huawei.xie Applied to dpdk-next-virtio. --yliu On Mon, May 02, 2016 at 05:46:15PM -0700, Yuanhan Liu wrote: > Here is a small patch set does the micro optimization, which brings about > 10% performance boost in my 64B packet testing, with the following topo: > > pkt generator <----> NIC <-----> Virtio NIC > > Patch 1 pre updates the used ring and update them in batch. It should be > feasible from my understanding: there will be no issue, guest driver will > not start processing them as far as we haven't updated the "used->idx" > yet. I could miss something though. > > Patch 2 saves one check for small packets (that can be hold in one desc > buf and mbuf). > > Patch 3 moves several frequently used fields into one cache line, for > better cache sharing. > > Note that this patch set is based on my latest vhost ABI refactoring patchset. > > > --- > Yuanhan Liu (3): > vhost: pre update used ring for Tx and Rx > vhost: optimize dequeue for small packets > vhost: arrange virtio_net fields for better cache sharing > > lib/librte_vhost/vhost-net.h | 8 +-- > lib/librte_vhost/vhost_rxtx.c | 110 ++++++++++++++++++++++++------------------ > 2 files changed, 68 insertions(+), 50 deletions(-) > > -- > 1.9.0 ^ permalink raw reply [flat|nested] 16+ messages in thread