From: "Michael S. Tsirkin" <mst@redhat.com>
To: Maxime Coquelin
Cc: Tiwei Bie, dev@dpdk.org, jianfeng.tan@intel.com, stable@dpdk.org
Date: Fri, 4 May 2018 21:54:17 +0300
Message-ID: <20180504214703-mutt-send-email-mst@kernel.org>
In-Reply-To: <6125f044-d557-666a-8228-4930ead16089@redhat.com>
Subject: Re: [dpdk-stable] [PATCH] vhost: improve dirty pages logging performance

On Fri, May 04, 2018 at 05:48:05PM +0200,
Maxime Coquelin wrote:
> Hi Tiwei,
>
> On 05/03/2018 01:56 PM, Tiwei Bie wrote:
> > On Mon, Apr 30, 2018 at 05:59:54PM +0200, Maxime Coquelin wrote:
> > > This patch caches all dirty pages logging until the used ring index
> > > is updated. These dirty pages won't be accessed by the guest as
> > > long as the host doesn't give them back to it by updating the
> > > index.
> >
> > The sentence below from the commit message isn't the reason why
> > we can cache the dirty page logging, right?
> >
> > """
> > These dirty pages won't be accessed by the guest as
> > long as the host doesn't give them back to it by updating the
> > index.
> > """
> >
> > > The goal of this optimization is to fix a performance regression
> > > introduced when the vhost library started to use atomic operations
> > > to set bits in the shared dirty log map. While the fix was valid,
> > > as the previous implementation wasn't safe against concurrent
> > > accesses, it induced contention.
> > >
> > > With this patch, during migration, we have:
> > > 1. Fewer atomic operations, as there is only a single atomic OR
> > >    operation per 32 pages.
> >
> > Why not do it per 64 pages?
>
> I wasn't sure it would be supported on 32-bit CPUs, but [0] seems to
> indicate that only 128-bit atomic operations aren't supported on all
> architectures.
>
> I will change to do it per 64 pages.

""
The four non-arithmetic functions (load, store, exchange, and
compare_exchange) all have a generic version as well. This generic
version works on any data type. It uses the lock-free built-in function
if the specific data type size makes that possible; otherwise, an
external call is left to be resolved at run time. This external call is
the same format with the addition of a ‘size_t’ parameter inserted as
the first parameter indicating the size of the object being pointed to.
All objects must be the same size.
""

So I suspect it isn't well supported on all 32-bit CPUs, but can't you
do an atomic using long?
That will do the right thing depending on the CPU. If not, it's pretty
easy to do something along the lines of #if BITS_PER_LONG == 64.

> [0]: https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
>
> > > 2. Fewer atomic operations, as during a burst the same page will
> > >    be marked dirty only once.
> > > 3. Fewer write memory barriers.
> > >
> > > Fixes: 897f13a1f726 ("vhost: make page logging atomic")
> > >
> > > Cc: stable@dpdk.org
> > >
> > > Suggested-by: Michael S. Tsirkin
> > > Signed-off-by: Maxime Coquelin
> > > ---
> > >
> > > Hi,
> > >
> > > This series was tested with migrating a guest while running PVP
> > > benchmark at 1Mpps with both ovs-dpdk and testpmd as vswitch.
> >
> > If the throughput is higher (e.g. by adding more cores
> > and queues), will the live migration fail due to the
> > higher dirty page generating speed?
>
> I haven't done the check, but I agree that the higher the throughput,
> the longer the migration duration, and the higher the risk that it
> never converges.

Right. It's not a good reason to just randomly slow down networking :)
In fact, if data ends up on the same page, there is a chance that
instead of dirtying the same page multiple times we will dirty it once,
which will help migration converge.

> In this kind of scenario, postcopy live migration may be a better fit,
> as the hot pages will be copied only once. Postcopy support for
> vhost-user has been added to QEMU in v2.12, and I have a prototype for
> DPDK that I plan to propose for the next release.
>
> > > With this patch we recover the packet drop regressions seen since
> > > the use of atomic operations to log dirty pages.

[...]
> > > +static __rte_always_inline void
> > > +vhost_log_cache_sync(struct virtio_net *dev, struct vhost_virtqueue *vq)
> > > +{
> > > +	uint32_t *log_base;
> > > +	int i;
> > > +
> > > +	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> > > +				!dev->log_base))
> > > +		return;
> > > +
> > > +	log_base = (uint32_t *)(uintptr_t)dev->log_base;
> > > +
> > > +	/* To make sure guest memory updates are committed before logging */
> > > +	rte_smp_wmb();
> >
> > It seems that __sync_fetch_and_or() can be considered a full
> > barrier [1]. So do we really need this rte_smp_wmb()?
>
> That's a good point, thanks for the pointer.
>
> > Besides, based on the same doc [1], it seems that the __sync_
> > version is deprecated in favor of the __atomic_ one.
>
> I will change to __atomic_. For the memory model, do you agree I should
> use __ATOMIC_SEQ_CST?
>
> > [1] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html
>
> > > +
> > > +	for (i = 0; i < vq->log_cache_nb_elem; i++) {
> > > +		struct log_cache_entry *elem = vq->log_cache + i;
> > > +
> > > +		__sync_fetch_and_or(log_base + elem->offset, elem->val);
> > > +	}
> > > +
> > > +	vq->log_cache_nb_elem = 0;
> > > +}
> > > +
>
> [...]
>
> Thanks,
> Maxime