From: Maxime Coquelin <maxime.coquelin@redhat.com>
To: Tiwei Bie
Cc: dev@dpdk.org, mst@redhat.com, stable@dpdk.org
Date: Wed, 16 May 2018 09:49:45 +0200
Subject: Re: [dpdk-dev] [PATCH v2] vhost: improve dirty pages logging performance

On 05/16/2018 08:21 AM, Tiwei Bie wrote:
> On Tue, May 15, 2018 at 07:30:21PM +0200, Maxime Coquelin wrote:
>> This patch caches all dirty pages logging until the used ring index
>> is updated. These dirty pages won't be accessed by the guest as
>> long as the host doesn't give them back to it by updating the
>> index.
>>
>> The goal of this optimization is to fix a performance regression
>> introduced when the vhost library started to use atomic operations
>> to set bits in the shared dirty log map. While the fix was valid
>> as previous implementation wasn't safe against concurent accesses,
>
> Typo: concurent

Right.

>> contention was induced.
>>
>> With this patch, during migration, we have:
>> 1. Less atomic operations as only a single atomic OR operation
>>    per 32 or 64 (depending on CPU) pages.
>> 2. Less atomic operations as during a burst, the same page will
>>    be marked dirty only once.
>> 3. Less write memory barriers.
>>
>> Fixes: 897f13a1f726 ("vhost: make page logging atomic")
>>
>> Cc: stable@dpdk.org
>>
>> Cc: Tiwei Bie
>> Suggested-by: Michael S. Tsirkin
>> Signed-off-by: Maxime Coquelin
>
> This is a nice approach! Thanks for the work!

Thanks for the review, it is really appreciated.
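Since the diff below is fairly long, here is the idea in a nutshell
(simplified sketch, not the exact patch code):

        #define BITS_PER_LONG (sizeof(unsigned long) * 8)

        /* One entry per touched word of the shared dirty log bitmap. */
        struct log_cache_entry {
                uint32_t offset;        /* word index in the bitmap */
                unsigned long val;      /* dirty bits to merge into it */
        };

        /* Marking a page dirty is a plain OR in the vq-private cache... */
        static void log_cache_page(struct log_cache_entry *e, uint64_t page)
        {
                e->offset = page / BITS_PER_LONG;
                e->val |= 1UL << (page % BITS_PER_LONG);
        }

        /* ...and one atomic OR per touched word happens at sync time,
         * just before the used ring index update is exposed to the guest. */
        static void log_cache_sync(unsigned long *log_base,
                        struct log_cache_entry *cache, uint16_t nb)
        {
                uint16_t i;

                for (i = 0; i < nb; i++)
                        __atomic_fetch_or(log_base + cache[i].offset,
                                        cache[i].val, __ATOMIC_RELAXED);
        }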
>> ---
>>  lib/librte_vhost/vhost.h      | 113 ++++++++++++++++++++++++++++++++++++++++++
>>  lib/librte_vhost/virtio_net.c |  29 +++++++----
>>  2 files changed, 132 insertions(+), 10 deletions(-)
>>
>> diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
>> index 891978131..8f6a41d7e 100644
>> --- a/lib/librte_vhost/vhost.h
>> +++ b/lib/librte_vhost/vhost.h
>> @@ -36,6 +36,8 @@
>>
>>  #define BUF_VECTOR_MAX 256
>>
>> +#define VHOST_LOG_CACHE_NR 32
>> +
>>  /**
>>   * Structure contains buffer address, length and descriptor index
>>   * from vring to do scatter RX.
>> @@ -69,6 +71,14 @@ struct batch_copy_elem {
>>          uint64_t log_addr;
>>  };
>>
>> +/*
>> + * Structure that contains the info for batched dirty logging.
>> + */
>> +struct log_cache_entry {
>> +        uint32_t offset;
>> +        unsigned long val;
>> +};
>> +
>>  /**
>>   * Structure contains variables relevant to RX/TX virtqueues.
>>   */
>> @@ -112,6 +122,9 @@ struct vhost_virtqueue {
>>          struct batch_copy_elem  *batch_copy_elems;
>>          uint16_t                batch_copy_nb_elems;
>>
>> +        struct log_cache_entry log_cache[VHOST_LOG_CACHE_NR];
>> +        uint16_t log_cache_nb_elem;
>> +
>>          rte_rwlock_t    iotlb_lock;
>>          rte_rwlock_t    iotlb_pending_lock;
>>          struct rte_mempool *iotlb_pool;
>> @@ -309,7 +322,15 @@ struct virtio_net {
>>  static __rte_always_inline void
>>  vhost_set_bit(unsigned int nr, volatile uint8_t *addr)
>>  {
>> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION < 70100)
>
> Just curious, is there any reference about why
> this version was chosen? Thanks!

I googled Michael's reference to the LWN article [0]; it mentions
GCC 7.1.

I haven't checked myself, though, whether the generated code actually
differs in GCC >= 7.1.

[0]: https://lwn.net/Articles/691128/

>
>> +                /*
>> +                 * __sync_ built-ins are deprecated, but __atomic_ ones
>> +                 * are sub-optimized in older GCC versions.
>> +                 */
>
> The indent isn't right (just need one tab here).

Right, will fix.

>
>>          __sync_fetch_and_or_8(addr, (1U << nr));

This is unrelated to this patch set, but from the GCC doc [1], shouldn't
we use __sync_fetch_and_or_1, as the size is in bytes?

[1]: https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html
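i.e. something along these lines (untested):

        static __rte_always_inline void
        vhost_set_bit(unsigned int nr, volatile uint8_t *addr)
        {
                /*
                 * addr points to a single byte of the dirty log map,
                 * so the 1-byte variant matches the accessed size.
                 */
                __sync_fetch_and_or_1(addr, (1U << nr));
        }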
>> +#else
>> +        __atomic_fetch_or(addr, (1U << nr), __ATOMIC_RELAXED);
>> +#endif
>>  }
>>
>>  static __rte_always_inline void
>> @@ -340,6 +361,98 @@ vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
>>          }
>>  }
>>
>> +static __rte_always_inline void
>> +vhost_log_cache_sync(struct virtio_net *dev, struct vhost_virtqueue *vq)
>> +{
>> +        unsigned long *log_base;
>> +        int i;
>> +
>> +        if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
>> +                        !dev->log_base))
>> +                return;
>> +
>> +        log_base = (unsigned long *)(uintptr_t)dev->log_base;
>> +
>> +        /* To make sure guest memory updates are committed before logging */
>> +        rte_smp_wmb();
>> +
>> +        for (i = 0; i < vq->log_cache_nb_elem; i++) {
>> +                struct log_cache_entry *elem = vq->log_cache + i;
>> +
>> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION < 70100)
>> +                /*
>> +                 * '__sync' builtins are deprecated, but '__atomic' ones
>> +                 * are sub-optimized in older GCC versions.
>> +                 */
>> +                __sync_fetch_and_or(log_base + elem->offset, elem->val);
>> +#else
>> +                __atomic_fetch_or(log_base + elem->offset, elem->val,
>> +                                __ATOMIC_RELAXED);
>> +#endif
>> +        }
>> +
>> +        vq->log_cache_nb_elem = 0;
>> +}
>> +
>> +static __rte_always_inline void
>> +vhost_log_cache_page(struct virtio_net *dev, struct vhost_virtqueue *vq,
>> +                uint64_t page)
>> +{
>> +        uint32_t bit_nr = page % (sizeof(unsigned long) << 3);
>> +        uint32_t offset = page / (sizeof(unsigned long) << 3);
>> +        int i;
>> +
>> +        for (i = 0; i < vq->log_cache_nb_elem; i++) {
>> +                struct log_cache_entry *elem = vq->log_cache + i;
>> +
>> +                if (elem->offset == offset) {
>> +                        elem->val |= (1U << bit_nr);
>
> The val is unsigned long now, we need to use 1UL.

Good catch!

>> +                        return;
>> +                }
>> +        }
>> +
>> +        if (unlikely(i >= VHOST_LOG_CACHE_NR)) {
>> +                /*
>> +                 * No more room for a new log cache entry,
>> +                 * so write the dirty log map directly.
>> +                 */
>> +                rte_smp_wmb();
>> +                vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
>> +
>> +                return;
>> +        }
>> +
>> +        vq->log_cache[i].offset = offset;
>> +        vq->log_cache[i].val = (1U << bit_nr);
>
> Ditto.
>
>> +}
>> +
>> +static __rte_always_inline void
>> +vhost_log_cache_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
>> +        uint64_t addr, uint64_t len)
>
> The 8 spaces width tabs are more widely used in DPDK.
> And in below coding style document,
>
> https://github.com/DPDK/dpdk/blob/master/doc/guides/contributing/coding_style.rst
>
> The width of each level indent in most examples is 8
> spaces. So maybe it's better to keep using 8 spaces
> width tabs.

Right, will fix this in v3.

>> +{
>> +        uint64_t page;
>> +
>> +        if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
>> +                        !dev->log_base || !len))
>> +                return;
>> +
>> +        if (unlikely(dev->log_size <= ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
>> +                return;
>> +
>> +        page = addr / VHOST_LOG_PAGE;
>> +        while (page * VHOST_LOG_PAGE < addr + len) {
>> +                vhost_log_cache_page(dev, vq, page);
>> +                page += 1;
>> +        }
>> +}
>> +
>> +static __rte_always_inline void
>> +vhost_log_cache_used_vring(struct virtio_net *dev, struct vhost_virtqueue *vq,
>> +                uint64_t offset, uint64_t len)
>> +{
>> +        vhost_log_cache_write(dev, vq, vq->log_guest_addr + offset, len);
>> +}
>> +
>>  static __rte_always_inline void
>>  vhost_log_used_vring(struct virtio_net *dev, struct vhost_virtqueue *vq,
>>                  uint64_t offset, uint64_t len)
>> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
>> index eed6b0227..76ec5f089 100644
>> --- a/lib/librte_vhost/virtio_net.c
>> +++ b/lib/librte_vhost/virtio_net.c
>> @@ -78,7 +78,7 @@ do_flush_shadow_used_ring(struct virtio_net *dev, struct vhost_virtqueue *vq,
>>          rte_memcpy(&vq->used->ring[to],
>>                          &vq->shadow_used_ring[from],
>>                          size * sizeof(struct vring_used_elem));
>> -        vhost_log_used_vring(dev, vq,
>> +        vhost_log_cache_used_vring(dev, vq,
>>                          offsetof(struct vring_used, ring[to]),
>>                          size * sizeof(struct vring_used_elem));
>>  }
>> @@ -106,6 +106,8 @@ flush_shadow_used_ring(struct virtio_net *dev, struct vhost_virtqueue *vq)
>>
>>          rte_smp_wmb();
>>
>> +        vhost_log_cache_sync(dev, vq);
>
> Each time we call vhost_log_cache_sync(), there
> is already a rte_smp_wmb() which is to protect
> the used->idx update. So maybe there is no need
> to call rte_smp_wmb() in vhost_log_cache_sync().

Right, I can remove it from vhost_log_cache_sync(), and add a comment
there stating that a write barrier is expected before calling the
function.

Thanks,
Maxime

> Best regards,
> Tiwei Bie
>
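PS: for v3, the barrier removal in vhost_log_cache_sync() would look
something like this (sketch of the __atomic path only, GCC < 7.1 ifdef
omitted for brevity):

        static __rte_always_inline void
        vhost_log_cache_sync(struct virtio_net *dev, struct vhost_virtqueue *vq)
        {
                unsigned long *log_base;
                int i;

                if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
                                !dev->log_base))
                        return;

                log_base = (unsigned long *)(uintptr_t)dev->log_base;

                /*
                 * No write barrier here: the caller is expected to have
                 * called rte_smp_wmb() to commit guest memory updates
                 * before syncing the log cache.
                 */

                for (i = 0; i < vq->log_cache_nb_elem; i++) {
                        struct log_cache_entry *elem = vq->log_cache + i;

                        __atomic_fetch_or(log_base + elem->offset, elem->val,
                                        __ATOMIC_RELAXED);
                }

                vq->log_cache_nb_elem = 0;
        }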