* [dpdk-dev] [PATCH] vhost: improve dirty pages logging performance
@ 2018-04-30 15:59 Maxime Coquelin
2018-05-03 11:56 ` Tiwei Bie
0 siblings, 1 reply; 9+ messages in thread
From: Maxime Coquelin @ 2018-04-30 15:59 UTC (permalink / raw)
To: dev, jianfeng.tan, tiwei.bie, mst; +Cc: stable, Maxime Coquelin
This patch caches all dirty pages logging until the used ring index
is updated. These dirty pages won't be accessed by the guest as
long as the host doesn't give them back to it by updating the
index.
The goal of this optimization is to fix a performance regression
introduced when the vhost library started to use atomic operations
to set bits in the shared dirty log map. While the fix was valid
as previous implementation wasn't safe against concurent accesses,
contention was induced.
With this patch, during migration, we have:
1. Less atomic operations as only a single atomic OR operation
per 32 pages.
2. Less atomic operations as during a burst, the same page will
be marked dirty only once.
3. Less write memory barriers.
Fixes: 897f13a1f726 ("vhost: make page logging atomic")
Cc: stable@dpdk.org
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
Hi,
This series was tested with migrating a guest while running PVP
benchmark at 1Mpps with both ovs-dpdk and testpmd as vswitch.
With this patch we recover the packet drops regressions seen since
the use of atomic operations to log dirty pages.
Some numbers:
A) PVP Live migration using testpmd (Single queue pair)
-------------------------------------------------------
Without patch:
=========================Stream Rate: 1Mpps==============
No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
0 1Mpps 134 18966 16 11628963.0
1 1Mpps 125 18790 16 8436300.0
2 1Mpps 122 19171 15 13881342.0
3 1Mpps 132 18913 15 12079492.0
<------------------------Summary------------------------>
Max 1Mpps 134 19171 16 13881342
Min 1Mpps 122 18790 15 8436300
Mean 1Mpps 128 18960 15 11506524
Median 1Mpps 128 18939 15 11854227
Stdev 0 5.68 158.81 0.58 2266371.52
=========================================================
With patch:
=========================Stream Rate: 1Mpps==============
No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
0 1Mpps 119 13521 15 478174.0
1 1Mpps 116 14084 14 452018.0
2 1Mpps 122 13908 14 464486.0
3 1Mpps 129 13787 16 478234.0
<------------------------Summary------------------------>
Max 1Mpps 129 14084 16 478234
Min 1Mpps 116 13521 14 452018
Mean 1Mpps 121 13825 14 468228
Median 1Mpps 120 13847 14 471330
Stdev 0 5.57 236.52 0.96 12593.78
=========================================================
B) OVS-DPDK migration with 2 queue pairs
----------------------------------------
Without patch:
=======================Stream Rate: 1Mpps================
No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
0 1Mpps 146 20270 116 15937394.0
1 1Mpps 150 25617 16 11120370.0
2 1Mpps 138 40983 115 24779971.0
3 1Mpps 161 20435 17 15363519.0
<------------------------Summary------------------------>
Max 1Mpps 161 40983 116 24779971
Min 1Mpps 138 20270 16 11120370
Mean 1Mpps 148 26826 66 16800313
Median 1Mpps 148 23026 66 15650456
Stdev 0 9.57 9758.9 57.16 5737179.93
=========================================================
With patch:
=======================Stream Rate: 1Mpps================
No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
0 1Mpps 155 18915 17 481330.0
1 1Mpps 156 21097 18 370556.0
2 1Mpps 156 49610 15 369291.0
3 1Mpps 144 31124 15 361914.0
<------------------------Summary------------------------>
Max 1Mpps 156 49610 18 481330
Min 1Mpps 144 18915 15 361914
Mean 1Mpps 152 30186 16 395772
Median 1Mpps 155 26110 16 369923
Stdev 0 5.85 13997.82 1.5 57165.33
=========================================================
C) OVS-DPDK migration with single queue pair
--------------------------------------------
Without patch:
=======================Stream Rate: 1Mpps================
No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
0 1Mpps 129 17411 15 11105414.0
1 1Mpps 130 16544 15 8028438.0
2 1Mpps 132 15202 15 7491584.0
3 1Mpps 133 18100 15 8385047.0
<------------------------Summary------------------------>
Max 1Mpps 133 18100 15 11105414
Min 1Mpps 129 15202 15 7491584
Mean 1Mpps 131 16814 15 8752620
Median 1Mpps 131 16977 15 8206742
Stdev 0 1.83 1249.22 0.0 1610941.84
=========================================================
With patch:
=======================Stream Rate: 1Mpps================
No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
0 1Mpps 117 12966 14 275873.0
1 1Mpps 127 13398 14 460323.0
2 1Mpps 119 13216 14 280831.0
3 1Mpps 131 13544 15 465951.0
<------------------------Summary------------------------>
Max 1Mpps 131 13544 15 465951
Min 1Mpps 117 12966 14 275873
Mean 1Mpps 123 13281 14 370744
Median 1Mpps 123 13307 14 370577
Stdev 0 6.61 249.2 0.5 106729.6
=========================================================
lib/librte_vhost/vhost.h | 96 +++++++++++++++++++++++++++++++++++++++++++
lib/librte_vhost/virtio_net.c | 29 ++++++++-----
2 files changed, 115 insertions(+), 10 deletions(-)
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 891978131..0d9129504 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -36,6 +36,8 @@
#define BUF_VECTOR_MAX 256
+#define VHOST_LOG_CACHE_NR 32
+
/**
* Structure contains buffer address, length and descriptor index
* from vring to do scatter RX.
@@ -69,6 +71,14 @@ struct batch_copy_elem {
uint64_t log_addr;
};
+/*
+ * Structure that contains the info for batched dirty logging.
+ */
+struct log_cache_entry {
+ uint32_t offset;
+ uint32_t val;
+};
+
/**
* Structure contains variables relevant to RX/TX virtqueues.
*/
@@ -112,6 +122,9 @@ struct vhost_virtqueue {
struct batch_copy_elem *batch_copy_elems;
uint16_t batch_copy_nb_elems;
+ struct log_cache_entry log_cache[VHOST_LOG_CACHE_NR];
+ uint16_t log_cache_nb_elem;
+
rte_rwlock_t iotlb_lock;
rte_rwlock_t iotlb_pending_lock;
struct rte_mempool *iotlb_pool;
@@ -340,6 +353,89 @@ vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
}
}
+static __rte_always_inline void
+vhost_log_cache_sync(struct virtio_net *dev, struct vhost_virtqueue *vq)
+{
+ uint32_t *log_base;
+ int i;
+
+ if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
+ !dev->log_base))
+ return;
+
+ log_base = (uint32_t *)(uintptr_t)dev->log_base;
+
+ /* To make sure guest memory updates are committed before logging */
+ rte_smp_wmb();
+
+ for (i = 0; i < vq->log_cache_nb_elem; i++) {
+ struct log_cache_entry *elem = vq->log_cache + i;
+
+ __sync_fetch_and_or(log_base + elem->offset, elem->val);
+ }
+
+ vq->log_cache_nb_elem = 0;
+}
+
+static __rte_always_inline void
+vhost_log_cache_page(struct virtio_net *dev, struct vhost_virtqueue *vq,
+ uint64_t page)
+{
+ uint32_t bit_nr = page % 32;
+ uint32_t offset = page / 32;
+ int i;
+
+ for (i = 0; i < vq->log_cache_nb_elem; i++) {
+ struct log_cache_entry *elem = vq->log_cache + i;
+
+ if (elem->offset == offset) {
+ elem->val |= (1U << bit_nr);
+ return;
+ }
+ }
+
+ if (unlikely(i >= VHOST_LOG_CACHE_NR)) {
+ /*
+ * No more room for a new log cache entry,
+ * so write the dirty log map directly.
+ */
+ rte_smp_wmb();
+ vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
+
+ return;
+ }
+
+ vq->log_cache[i].offset = offset;
+ vq->log_cache[i].val = (1U << bit_nr);
+}
+
+static __rte_always_inline void
+vhost_log_cache_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
+ uint64_t addr, uint64_t len)
+{
+ uint64_t page;
+
+ if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
+ !dev->log_base || !len))
+ return;
+
+ if (unlikely(dev->log_size <= ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
+ return;
+
+ page = addr / VHOST_LOG_PAGE;
+ while (page * VHOST_LOG_PAGE < addr + len) {
+ vhost_log_cache_page(dev, vq, page);
+ page += 1;
+ }
+}
+
+static __rte_always_inline void
+vhost_log_cache_used_vring(struct virtio_net *dev, struct vhost_virtqueue *vq,
+ uint64_t offset, uint64_t len)
+{
+ vhost_log_cache_write(dev, vq, vq->log_guest_addr + offset, len);
+}
+
static __rte_always_inline void
vhost_log_used_vring(struct virtio_net *dev, struct vhost_virtqueue *vq,
uint64_t offset, uint64_t len)
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 5fdd4172b..ac42cdbd0 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -78,7 +78,7 @@ do_flush_shadow_used_ring(struct virtio_net *dev, struct vhost_virtqueue *vq,
rte_memcpy(&vq->used->ring[to],
&vq->shadow_used_ring[from],
size * sizeof(struct vring_used_elem));
- vhost_log_used_vring(dev, vq,
+ vhost_log_cache_used_vring(dev, vq,
offsetof(struct vring_used, ring[to]),
size * sizeof(struct vring_used_elem));
}
@@ -106,6 +106,8 @@ flush_shadow_used_ring(struct virtio_net *dev, struct vhost_virtqueue *vq)
rte_smp_wmb();
+ vhost_log_cache_sync(dev, vq);
+
*(volatile uint16_t *)&vq->used->idx += vq->shadow_used_idx;
vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
sizeof(vq->used->idx));
@@ -130,7 +132,7 @@ do_data_copy_enqueue(struct virtio_net *dev, struct vhost_virtqueue *vq)
for (i = 0; i < count; i++) {
rte_memcpy(elem[i].dst, elem[i].src, elem[i].len);
- vhost_log_write(dev, elem[i].log_addr, elem[i].len);
+ vhost_log_cache_write(dev, vq, elem[i].log_addr, elem[i].len);
PRINT_PACKET(dev, (uintptr_t)elem[i].dst, elem[i].len, 0);
}
}
@@ -251,7 +253,7 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
virtio_enqueue_offload(m,
(struct virtio_net_hdr *)(uintptr_t)desc_addr);
PRINT_PACKET(dev, (uintptr_t)desc_addr, dev->vhost_hlen, 0);
- vhost_log_write(dev, desc_gaddr, dev->vhost_hlen);
+ vhost_log_cache_write(dev, vq, desc_gaddr, dev->vhost_hlen);
} else {
struct virtio_net_hdr vnet_hdr;
uint64_t remain = dev->vhost_hlen;
@@ -274,7 +276,7 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
(void *)(uintptr_t)src, len);
PRINT_PACKET(dev, (uintptr_t)dst, (uint32_t)len, 0);
- vhost_log_write(dev, guest_addr, len);
+ vhost_log_cache_write(dev, vq, guest_addr, len);
remain -= len;
guest_addr += len;
dst += len;
@@ -355,7 +357,8 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
desc_offset)),
rte_pktmbuf_mtod_offset(m, void *, mbuf_offset),
cpy_len);
- vhost_log_write(dev, desc_gaddr + desc_offset, cpy_len);
+ vhost_log_cache_write(dev, vq, desc_gaddr + desc_offset,
+ cpy_len);
PRINT_PACKET(dev, (uintptr_t)(desc_addr + desc_offset),
cpy_len, 0);
} else {
@@ -444,7 +447,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
vq->used->ring[used_idx].id = desc_indexes[i];
vq->used->ring[used_idx].len = pkts[i]->pkt_len +
dev->vhost_hlen;
- vhost_log_used_vring(dev, vq,
+ vhost_log_cache_used_vring(dev, vq,
offsetof(struct vring_used, ring[used_idx]),
sizeof(vq->used->ring[used_idx]));
}
@@ -504,6 +507,8 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
rte_smp_wmb();
+ vhost_log_cache_sync(dev, vq);
+
*(volatile uint16_t *)&vq->used->idx += count;
vq->last_used_idx += count;
vhost_log_used_vring(dev, vq,
@@ -767,7 +772,8 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct vhost_virtqueue *vq,
PRINT_PACKET(dev, (uintptr_t)dst,
(uint32_t)len, 0);
- vhost_log_write(dev, guest_addr, len);
+ vhost_log_cache_write(dev, vq,
+ guest_addr, len);
remain -= len;
guest_addr += len;
@@ -776,7 +782,7 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct vhost_virtqueue *vq,
} else {
PRINT_PACKET(dev, (uintptr_t)hdr_addr,
dev->vhost_hlen, 0);
- vhost_log_write(dev, hdr_phys_addr,
+ vhost_log_cache_write(dev, vq, hdr_phys_addr,
dev->vhost_hlen);
}
@@ -790,7 +796,8 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct vhost_virtqueue *vq,
desc_offset)),
rte_pktmbuf_mtod_offset(m, void *, mbuf_offset),
cpy_len);
- vhost_log_write(dev, desc_gaddr + desc_offset, cpy_len);
+ vhost_log_cache_write(dev, vq, desc_gaddr + desc_offset,
+ cpy_len);
PRINT_PACKET(dev, (uintptr_t)(desc_addr + desc_offset),
cpy_len, 0);
} else {
@@ -1320,7 +1327,7 @@ update_used_ring(struct virtio_net *dev, struct vhost_virtqueue *vq,
{
vq->used->ring[used_idx].id = desc_idx;
vq->used->ring[used_idx].len = 0;
- vhost_log_used_vring(dev, vq,
+ vhost_log_cache_used_vring(dev, vq,
offsetof(struct vring_used, ring[used_idx]),
sizeof(vq->used->ring[used_idx]));
}
@@ -1335,6 +1342,8 @@ update_used_idx(struct virtio_net *dev, struct vhost_virtqueue *vq,
rte_smp_wmb();
rte_smp_rmb();
+ vhost_log_cache_sync(dev, vq);
+
vq->used->idx += count;
vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
sizeof(vq->used->idx));
--
2.14.3
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [dpdk-dev] [PATCH] vhost: improve dirty pages logging performance
2018-04-30 15:59 [dpdk-dev] [PATCH] vhost: improve dirty pages logging performance Maxime Coquelin
@ 2018-05-03 11:56 ` Tiwei Bie
2018-05-04 15:48 ` Maxime Coquelin
2018-05-15 13:50 ` Maxime Coquelin
0 siblings, 2 replies; 9+ messages in thread
From: Tiwei Bie @ 2018-05-03 11:56 UTC (permalink / raw)
To: Maxime Coquelin; +Cc: dev, jianfeng.tan, mst, stable
On Mon, Apr 30, 2018 at 05:59:54PM +0200, Maxime Coquelin wrote:
> This patch caches all dirty pages logging until the used ring index
> is updated. These dirty pages won't be accessed by the guest as
> long as the host doesn't give them back to it by updating the
> index.
Below sentence in above commit message isn't the reason why
we can cache the dirty page logging. Right?
"""
These dirty pages won't be accessed by the guest as
long as the host doesn't give them back to it by updating the
index.
"""
>
> The goal of this optimization is to fix a performance regression
> introduced when the vhost library started to use atomic operations
> to set bits in the shared dirty log map. While the fix was valid
> as previous implementation wasn't safe against concurent accesses,
> contention was induced.
>
> With this patch, during migration, we have:
> 1. Less atomic operations as only a single atomic OR operation
> per 32 pages.
Why not do it per 64 pages?
> 2. Less atomic operations as during a burst, the same page will
> be marked dirty only once.
> 3. Less write memory barriers.
>
> Fixes: 897f13a1f726 ("vhost: make page logging atomic")
>
> Cc: stable@dpdk.org
>
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> ---
>
> Hi,
>
> This series was tested with migrating a guest while running PVP
> benchmark at 1Mpps with both ovs-dpdk and testpmd as vswitch.
If the throughput is higher (e.g. by adding more cores
and queues), will the live migration fail due to the
higher dirty page generating speed?
>
> With this patch we recover the packet drops regressions seen since
> the use of atomic operations to log dirty pages.
[...]
>
> +static __rte_always_inline void
> +vhost_log_cache_sync(struct virtio_net *dev, struct vhost_virtqueue *vq)
> +{
> + uint32_t *log_base;
> + int i;
> +
> + if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> + !dev->log_base))
> + return;
> +
> + log_base = (uint32_t *)(uintptr_t)dev->log_base;
> +
> + /* To make sure guest memory updates are committed before logging */
> + rte_smp_wmb();
It seems that __sync_fetch_and_or() can be considered a full
barrier [1]. So do we really need this rte_smp_wmb()?
Besides, based on the same doc [1], it seems that the __sync_
version is deprecated in favor of the __atomic_ one.
[1] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html
> +
> + for (i = 0; i < vq->log_cache_nb_elem; i++) {
> + struct log_cache_entry *elem = vq->log_cache + i;
> +
> + __sync_fetch_and_or(log_base + elem->offset, elem->val);
> + }
> +
> + vq->log_cache_nb_elem = 0;
> +}
> +
[...]
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [dpdk-dev] [PATCH] vhost: improve dirty pages logging performance
2018-05-03 11:56 ` Tiwei Bie
@ 2018-05-04 15:48 ` Maxime Coquelin
2018-05-04 18:54 ` Michael S. Tsirkin
2018-05-07 3:49 ` Tiwei Bie
2018-05-15 13:50 ` Maxime Coquelin
1 sibling, 2 replies; 9+ messages in thread
From: Maxime Coquelin @ 2018-05-04 15:48 UTC (permalink / raw)
To: Tiwei Bie; +Cc: dev, jianfeng.tan, mst, stable
Hi Tiwei,
On 05/03/2018 01:56 PM, Tiwei Bie wrote:
> On Mon, Apr 30, 2018 at 05:59:54PM +0200, Maxime Coquelin wrote:
>> This patch caches all dirty pages logging until the used ring index
>> is updated. These dirty pages won't be accessed by the guest as
>> long as the host doesn't give them back to it by updating the
>> index.
>
> Below sentence in above commit message isn't the reason why
> we can cache the dirty page logging. Right?
>
> """
> These dirty pages won't be accessed by the guest as
> long as the host doesn't give them back to it by updating the
> index.
> """
>
>>
>> The goal of this optimization is to fix a performance regression
>> introduced when the vhost library started to use atomic operations
>> to set bits in the shared dirty log map. While the fix was valid
>> as previous implementation wasn't safe against concurent accesses,
>> contention was induced.
>>
>> With this patch, during migration, we have:
>> 1. Less atomic operations as only a single atomic OR operation
>> per 32 pages.
>
> Why not do it per 64 pages?
I wasn't sure it would be supported on 32bits CPU, but [0] seems do
indicate that only 128bits atomic operations aren't supported on all
architectures.
I will change to do it per 64 pages.
[0]:
https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html#g_t_005f_005fatomic-Builtins
>> 2. Less atomic operations as during a burst, the same page will
>> be marked dirty only once.
>> 3. Less write memory barriers.
>>
>> Fixes: 897f13a1f726 ("vhost: make page logging atomic")
>>
>> Cc: stable@dpdk.org
>>
>> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
>> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
>> ---
>>
>> Hi,
>>
>> This series was tested with migrating a guest while running PVP
>> benchmark at 1Mpps with both ovs-dpdk and testpmd as vswitch.
>
> If the throughput is higher (e.g. by adding more cores
> and queues), will the live migration fail due to the
> higher dirty page generating speed?
I haven't done the check, but I agree that the higher is the throughput,
the longer will be the migration duration and the higher the risk it
never converges.
In this case of scenario, postcopy live-migration way be a better fit,
as the hot pages will be copied only once. Postcopy support for
vhost-user have been added to QEMU in v2.12, and I have a prototype for
DPDK that I plan to propose for next release.
>
>>
>> With this patch we recover the packet drops regressions seen since
>> the use of atomic operations to log dirty pages.
> [...]
>>
>> +static __rte_always_inline void
>> +vhost_log_cache_sync(struct virtio_net *dev, struct vhost_virtqueue *vq)
>> +{
>> + uint32_t *log_base;
>> + int i;
>> +
>> + if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
>> + !dev->log_base))
>> + return;
>> +
>> + log_base = (uint32_t *)(uintptr_t)dev->log_base;
>> +
>> + /* To make sure guest memory updates are committed before logging */
>> + rte_smp_wmb();
>
> It seems that __sync_fetch_and_or() can be considered a full
> barrier [1]. So do we really need this rte_smp_wmb()?
That's a good point, thanks for the pointer.
> Besides, based on the same doc [1], it seems that the __sync_
> version is deprecated in favor of the __atomic_ one.
I will change to __atomic_. For the memory model, do you agree I should
use __ATOMIC_SEQ_CST?
> [1] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html
>
>> +
>> + for (i = 0; i < vq->log_cache_nb_elem; i++) {
>> + struct log_cache_entry *elem = vq->log_cache + i;
>> +
>> + __sync_fetch_and_or(log_base + elem->offset, elem->val);
>> + }
>> +
>> + vq->log_cache_nb_elem = 0;
>> +}
>> +
> [...]
>
Thanks,
Maxime
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [dpdk-dev] [PATCH] vhost: improve dirty pages logging performance
2018-05-04 15:48 ` Maxime Coquelin
@ 2018-05-04 18:54 ` Michael S. Tsirkin
2018-05-07 3:49 ` Tiwei Bie
1 sibling, 0 replies; 9+ messages in thread
From: Michael S. Tsirkin @ 2018-05-04 18:54 UTC (permalink / raw)
To: Maxime Coquelin; +Cc: Tiwei Bie, dev, jianfeng.tan, stable
On Fri, May 04, 2018 at 05:48:05PM +0200, Maxime Coquelin wrote:
> Hi Tiwei,
>
> On 05/03/2018 01:56 PM, Tiwei Bie wrote:
> > On Mon, Apr 30, 2018 at 05:59:54PM +0200, Maxime Coquelin wrote:
> > > This patch caches all dirty pages logging until the used ring index
> > > is updated. These dirty pages won't be accessed by the guest as
> > > long as the host doesn't give them back to it by updating the
> > > index.
> >
> > Below sentence in above commit message isn't the reason why
> > we can cache the dirty page logging. Right?
> >
> > """
> > These dirty pages won't be accessed by the guest as
> > long as the host doesn't give them back to it by updating the
> > index.
> > """
> >
> > >
> > > The goal of this optimization is to fix a performance regression
> > > introduced when the vhost library started to use atomic operations
> > > to set bits in the shared dirty log map. While the fix was valid
> > > as previous implementation wasn't safe against concurent accesses,
> > > contention was induced.
> > >
> > > With this patch, during migration, we have:
> > > 1. Less atomic operations as only a single atomic OR operation
> > > per 32 pages.
> >
> > Why not do it per 64 pages?
>
> I wasn't sure it would be supported on 32bits CPU, but [0] seems do
> indicate that only 128bits atomic operations aren't supported on all
> architectures.
>
> I will change to do it per 64 pages.
""
The four non-arithmetic functions (load, store, exchange, and
compare_exchange) all have a generic version as well. This generic
version works on any data type. It uses the lock-free built-in function
if the specific data type size makes that possible; otherwise, an
external call is left to be resolved at run time. This external call is
the same format with the addition of a ‘size_t’ parameter inserted as
the first parameter indicating the size of the object being pointed to.
All objects must be the same size.
""
So I suspect it isn't well supported on all 32 bit CPUs, but
can't you do an atomic using long? That will do the right
thing depending on the CPU.
If not it's pretty easy to something along the lines of
#if BIT_PER_LONG == 64
> [0]: https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html#g_t_005f_005fatomic-Builtins
> > > 2. Less atomic operations as during a burst, the same page will
> > > be marked dirty only once.
> > > 3. Less write memory barriers.
> > >
> > > Fixes: 897f13a1f726 ("vhost: make page logging atomic")
> > >
> > > Cc: stable@dpdk.org
> > >
> > > Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> > > Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> > > ---
> > >
> > > Hi,
> > >
> > > This series was tested with migrating a guest while running PVP
> > > benchmark at 1Mpps with both ovs-dpdk and testpmd as vswitch.
> >
> > If the throughput is higher (e.g. by adding more cores
> > and queues), will the live migration fail due to the
> > higher dirty page generating speed?
>
> I haven't done the check, but I agree that the higher is the throughput,
> the longer will be the migration duration and the higher the risk it
> never converges.
Right. It's not a good reason to just randomly slow down networking :)
In fact if data ends up on the same page, there is a chance that instead
of dirtying the same page multiple times we will dirty it once, which
will help migration converge.
> In this case of scenario, postcopy live-migration way be a better fit,
> as the hot pages will be copied only once. Postcopy support for
> vhost-user have been added to QEMU in v2.12, and I have a prototype for
> DPDK that I plan to propose for next release.
>
> >
> > >
> > > With this patch we recover the packet drops regressions seen since
> > > the use of atomic operations to log dirty pages.
> > [...]
> > > +static __rte_always_inline void
> > > +vhost_log_cache_sync(struct virtio_net *dev, struct vhost_virtqueue *vq)
> > > +{
> > > + uint32_t *log_base;
> > > + int i;
> > > +
> > > + if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> > > + !dev->log_base))
> > > + return;
> > > +
> > > + log_base = (uint32_t *)(uintptr_t)dev->log_base;
> > > +
> > > + /* To make sure guest memory updates are committed before logging */
> > > + rte_smp_wmb();
> >
> > It seems that __sync_fetch_and_or() can be considered a full
> > barrier [1]. So do we really need this rte_smp_wmb()?
>
> That's a good point, thanks for the pointer.
>
> > Besides, based on the same doc [1], it seems that the __sync_
> > version is deprecated in favor of the __atomic_ one.
>
> I will change to __atomic_. For the memory model, do you agree I should
> use __ATOMIC_SEQ_CST?
>
> > [1] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html
> >
> > > +
> > > + for (i = 0; i < vq->log_cache_nb_elem; i++) {
> > > + struct log_cache_entry *elem = vq->log_cache + i;
> > > +
> > > + __sync_fetch_and_or(log_base + elem->offset, elem->val);
> > > + }
> > > +
> > > + vq->log_cache_nb_elem = 0;
> > > +}
> > > +
> > [...]
> >
>
> Thanks,
> Maxime
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [dpdk-dev] [PATCH] vhost: improve dirty pages logging performance
2018-05-04 15:48 ` Maxime Coquelin
2018-05-04 18:54 ` Michael S. Tsirkin
@ 2018-05-07 3:49 ` Tiwei Bie
2018-05-07 3:58 ` Michael S. Tsirkin
1 sibling, 1 reply; 9+ messages in thread
From: Tiwei Bie @ 2018-05-07 3:49 UTC (permalink / raw)
To: Maxime Coquelin; +Cc: dev, jianfeng.tan, mst, stable
On Fri, May 04, 2018 at 05:48:05PM +0200, Maxime Coquelin wrote:
> Hi Tiwei,
>
> On 05/03/2018 01:56 PM, Tiwei Bie wrote:
> > On Mon, Apr 30, 2018 at 05:59:54PM +0200, Maxime Coquelin wrote:
[...]
> > > +static __rte_always_inline void
> > > +vhost_log_cache_sync(struct virtio_net *dev, struct vhost_virtqueue *vq)
> > > +{
> > > + uint32_t *log_base;
> > > + int i;
> > > +
> > > + if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> > > + !dev->log_base))
> > > + return;
> > > +
> > > + log_base = (uint32_t *)(uintptr_t)dev->log_base;
> > > +
> > > + /* To make sure guest memory updates are committed before logging */
> > > + rte_smp_wmb();
> >
> > It seems that __sync_fetch_and_or() can be considered a full
> > barrier [1]. So do we really need this rte_smp_wmb()?
>
> That's a good point, thanks for the pointer.
>
> > Besides, based on the same doc [1], it seems that the __sync_
> > version is deprecated in favor of the __atomic_ one.
>
> I will change to __atomic_. For the memory model, do you agree I should
> use __ATOMIC_SEQ_CST?
Maybe we can use __ATOMIC_RELAXED and keep rte_smp_wmb().
Best regards,
Tiwei Bie
>
> > [1] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html
> >
> > > +
> > > + for (i = 0; i < vq->log_cache_nb_elem; i++) {
> > > + struct log_cache_entry *elem = vq->log_cache + i;
> > > +
> > > + __sync_fetch_and_or(log_base + elem->offset, elem->val);
> > > + }
> > > +
> > > + vq->log_cache_nb_elem = 0;
> > > +}
> > > +
> > [...]
> >
>
> Thanks,
> Maxime
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [dpdk-dev] [PATCH] vhost: improve dirty pages logging performance
2018-05-07 3:49 ` Tiwei Bie
@ 2018-05-07 3:58 ` Michael S. Tsirkin
0 siblings, 0 replies; 9+ messages in thread
From: Michael S. Tsirkin @ 2018-05-07 3:58 UTC (permalink / raw)
To: Tiwei Bie; +Cc: Maxime Coquelin, dev, jianfeng.tan, stable
On Mon, May 07, 2018 at 11:49:49AM +0800, Tiwei Bie wrote:
> On Fri, May 04, 2018 at 05:48:05PM +0200, Maxime Coquelin wrote:
> > Hi Tiwei,
> >
> > On 05/03/2018 01:56 PM, Tiwei Bie wrote:
> > > On Mon, Apr 30, 2018 at 05:59:54PM +0200, Maxime Coquelin wrote:
> [...]
> > > > +static __rte_always_inline void
> > > > +vhost_log_cache_sync(struct virtio_net *dev, struct vhost_virtqueue *vq)
> > > > +{
> > > > + uint32_t *log_base;
> > > > + int i;
> > > > +
> > > > + if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> > > > + !dev->log_base))
> > > > + return;
> > > > +
> > > > + log_base = (uint32_t *)(uintptr_t)dev->log_base;
> > > > +
> > > > + /* To make sure guest memory updates are committed before logging */
> > > > + rte_smp_wmb();
> > >
> > > It seems that __sync_fetch_and_or() can be considered a full
> > > barrier [1]. So do we really need this rte_smp_wmb()?
> >
> > That's a good point, thanks for the pointer.
> >
> > > Besides, based on the same doc [1], it seems that the __sync_
> > > version is deprecated in favor of the __atomic_ one.
> >
> > I will change to __atomic_. For the memory model, do you agree I should
> > use __ATOMIC_SEQ_CST?
>
> Maybe we can use __ATOMIC_RELAXED and keep rte_smp_wmb().
>
> Best regards,
> Tiwei Bie
Yes I think if you do your own barriers, you can use __ATOMIC_RELAXED
in theory.
One issue is that as one lwn article noted
"there will be some seriously suboptimal code production before
gcc-7.1". So if you are using these new APIs, you should also check
that a relatively new compiler is used.
> >
> > > [1] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html
> > >
> > > > +
> > > > + for (i = 0; i < vq->log_cache_nb_elem; i++) {
> > > > + struct log_cache_entry *elem = vq->log_cache + i;
> > > > +
> > > > + __sync_fetch_and_or(log_base + elem->offset, elem->val);
> > > > + }
> > > > +
> > > > + vq->log_cache_nb_elem = 0;
> > > > +}
> > > > +
> > > [...]
> > >
> >
> > Thanks,
> > Maxime
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [dpdk-dev] [PATCH] vhost: improve dirty pages logging performance
2018-05-03 11:56 ` Tiwei Bie
2018-05-04 15:48 ` Maxime Coquelin
@ 2018-05-15 13:50 ` Maxime Coquelin
2018-05-16 6:10 ` Tiwei Bie
1 sibling, 1 reply; 9+ messages in thread
From: Maxime Coquelin @ 2018-05-15 13:50 UTC (permalink / raw)
To: Tiwei Bie; +Cc: dev, jianfeng.tan, mst, stable
Hi Tiwei,
I just see I missed to reply to your comment on my commit message:
On 05/03/2018 01:56 PM, Tiwei Bie wrote:
> On Mon, Apr 30, 2018 at 05:59:54PM +0200, Maxime Coquelin wrote:
>> This patch caches all dirty pages logging until the used ring index
>> is updated. These dirty pages won't be accessed by the guest as
>> long as the host doesn't give them back to it by updating the
>> index.
> Below sentence in above commit message isn't the reason why
> we can cache the dirty page logging. Right?
>
> """
> These dirty pages won't be accessed by the guest as
> long as the host doesn't give them back to it by updating the
> index.
That's my understanding.
As long as the used index is not updated, the guest will not process
the descs.
If the migration converges between the time the descs are written,
and the time the used index is updated on source side. Then the guest
running on destination will not see the descriptors as used but as
available, and so will be overwritten by the vhost backend on
destination.
Regards,
Maxime
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [dpdk-dev] [PATCH] vhost: improve dirty pages logging performance
2018-05-15 13:50 ` Maxime Coquelin
@ 2018-05-16 6:10 ` Tiwei Bie
2018-05-16 15:00 ` Maxime Coquelin
0 siblings, 1 reply; 9+ messages in thread
From: Tiwei Bie @ 2018-05-16 6:10 UTC (permalink / raw)
To: Maxime Coquelin; +Cc: dev, jianfeng.tan, mst, stable
On Tue, May 15, 2018 at 03:50:54PM +0200, Maxime Coquelin wrote:
> Hi Tiwei,
>
> I just see I missed to reply to your comment on my commit message:
>
> On 05/03/2018 01:56 PM, Tiwei Bie wrote:
> > On Mon, Apr 30, 2018 at 05:59:54PM +0200, Maxime Coquelin wrote:
> > > This patch caches all dirty pages logging until the used ring index
> > > is updated. These dirty pages won't be accessed by the guest as
> > > long as the host doesn't give them back to it by updating the
> > > index.
> > Below sentence in above commit message isn't the reason why
> > we can cache the dirty page logging. Right?
> >
> > """
> > These dirty pages won't be accessed by the guest as
> > long as the host doesn't give them back to it by updating the
> > index.
>
> That's my understanding.
> As long as the used index is not updated, the guest will not process
> the descs.
> If the migration converges between the time the descs are written,
> and the time the used index is updated on source side. Then the guest
> running on destination will not see the descriptors as used but as
> available, and so will be overwritten by the vhost backend on
> destination.
If my understanding is correct, theoretically the vhost
backend can cache all the dirty page loggings before it
responds to the GET_VRING_BASE messages. Below are the
steps how QEMU live migration works (w/o postcopy):
1. Syncing dirty pages between src and dst;
2. The dirty page sync converges;
3. The src QEMU sends GET_VRING_BASE to vhost backend;
4. Vhost backend still has a chance to log some dirty
pages before responding the GET_VRING_BASE messages;
5. The src QEMU receives GET_VRING_BASE response (which
means the device has stopped);
6. QEMU sync the remaining dirty pages;
7. QEMU on the dst starts running.
(The steps 3~6 are the downtime which we want to minimize)
So I think the words in commit log isn't really related
to why we can cache the dirty page loggings.
Best regards,
Tiwei Bie
>
> Regards,
> Maxime
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [dpdk-dev] [PATCH] vhost: improve dirty pages logging performance
2018-05-16 6:10 ` Tiwei Bie
@ 2018-05-16 15:00 ` Maxime Coquelin
0 siblings, 0 replies; 9+ messages in thread
From: Maxime Coquelin @ 2018-05-16 15:00 UTC (permalink / raw)
To: Tiwei Bie; +Cc: dev, jianfeng.tan, mst, stable
On 05/16/2018 08:10 AM, Tiwei Bie wrote:
> On Tue, May 15, 2018 at 03:50:54PM +0200, Maxime Coquelin wrote:
>> Hi Tiwei,
>>
>> I just see I missed to reply to your comment on my commit message:
>>
>> On 05/03/2018 01:56 PM, Tiwei Bie wrote:
>>> On Mon, Apr 30, 2018 at 05:59:54PM +0200, Maxime Coquelin wrote:
>>>> This patch caches all dirty pages logging until the used ring index
>>>> is updated. These dirty pages won't be accessed by the guest as
>>>> long as the host doesn't give them back to it by updating the
>>>> index.
>>> Below sentence in above commit message isn't the reason why
>>> we can cache the dirty page logging. Right?
>>>
>>> """
>>> These dirty pages won't be accessed by the guest as
>>> long as the host doesn't give them back to it by updating the
>>> index.
>>
>> That's my understanding.
>> As long as the used index is not updated, the guest will not process
>> the descs.
>> If the migration converges between the time the descs are written,
>> and the time the used index is updated on source side. Then the guest
>> running on destination will not see the descriptors as used but as
>> available, and so will be overwritten by the vhost backend on
>> destination.
>
> If my understanding is correct, theoretically the vhost
> backend can cache all the dirty page loggings before it
> responds to the GET_VRING_BASE messages. Below are the
> steps how QEMU live migration works (w/o postcopy):
>
> 1. Syncing dirty pages between src and dst;
> 2. The dirty page sync converges;
> 3. The src QEMU sends GET_VRING_BASE to vhost backend;
> 4. Vhost backend still has a chance to log some dirty
> pages before responding the GET_VRING_BASE messages;
> 5. The src QEMU receives GET_VRING_BASE response (which
> means the device has stopped);
> 6. QEMU sync the remaining dirty pages;
> 7. QEMU on the dst starts running.
Thanks for the clarification.
> (The steps 3~6 are the downtime which we want to minimize)
Right, we want to minimize the downtime, but in the same time,
be able to converge.
> So I think the words in commit log isn't really related
> to why we can cache the dirty page loggings.
I'll try to improve it in v3.
Thanks,
Maxime
> Best regards,
> Tiwei Bie
>
>>
>> Regards,
>> Maxime
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2018-05-16 15:00 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-30 15:59 [dpdk-dev] [PATCH] vhost: improve dirty pages logging performance Maxime Coquelin
2018-05-03 11:56 ` Tiwei Bie
2018-05-04 15:48 ` Maxime Coquelin
2018-05-04 18:54 ` Michael S. Tsirkin
2018-05-07 3:49 ` Tiwei Bie
2018-05-07 3:58 ` Michael S. Tsirkin
2018-05-15 13:50 ` Maxime Coquelin
2018-05-16 6:10 ` Tiwei Bie
2018-05-16 15:00 ` Maxime Coquelin
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).