From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mga12.intel.com (mga12.intel.com [192.55.52.136]) by dpdk.org (Postfix) with ESMTP id CBF632C4F; Fri, 24 Aug 2018 06:49:57 +0200 (CEST) X-Amp-Result: UNKNOWN X-Amp-Original-Verdict: FILE UNKNOWN X-Amp-File-Uploaded: False Received: from fmsmga004.fm.intel.com ([10.253.24.48]) by fmsmga106.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 23 Aug 2018 21:49:56 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.53,281,1531810800"; d="scan'208";a="83013957" Received: from debian.sh.intel.com (HELO debian) ([10.67.104.194]) by fmsmga004.fm.intel.com with ESMTP; 23 Aug 2018 21:49:55 -0700 Date: Fri, 24 Aug 2018 12:49:09 +0800 From: Tiwei Bie To: "Burakov, Anatoly" Cc: Sean Harte , maxime.coquelin@redhat.com, zhihong.wang@intel.com, dev@dpdk.org, stable@dpdk.org Message-ID: <20180824044909.GA19323@debian> References: <20180823025721.18300-1-tiwei.bie@intel.com> <992313c9-cd03-03db-888b-f5dd8c072cf0@intel.com> <34c163f6-b50f-06c0-d099-7d0ae5fbc5b0@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <34c163f6-b50f-06c0-d099-7d0ae5fbc5b0@intel.com> User-Agent: Mutt/1.10.1 (2018-07-13) Subject: Re: [dpdk-stable] [PATCH] net/virtio-user: fix memory hotplug support X-BeenThere: stable@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: patches for DPDK stable branches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 24 Aug 2018 04:49:58 -0000 On Thu, Aug 23, 2018 at 03:01:30PM +0100, Burakov, Anatoly wrote: > On 23-Aug-18 12:19 PM, Sean Harte wrote: > > On Thu, 23 Aug 2018 at 10:05, Burakov, Anatoly > > wrote: > > > > > > On 23-Aug-18 3:57 AM, Tiwei Bie wrote: > > > > Deadlock can occur when allocating memory if a vhost-kernel > > > > based virtio-user device is in use. Besides, it's possible > > > > to have much more than 64 non-contiguous hugepage backed > > > > memory regions due to the memory hotplug, which may cause > > > > problems when handling VHOST_SET_MEM_TABLE request. A better > > > > solution is to have the virtio-user pass all the VA ranges > > > > reserved by DPDK to vhost-kernel. > > > > > > > > Bugzilla ID: 81 > > > > Fixes: 12ecb2f63b12 ("net/virtio-user: support memory hotplug") > > > > Cc: stable@dpdk.org > > > > > > > > Reported-by: Seán Harte > > > > Signed-off-by: Tiwei Bie > > > > --- > > > > drivers/net/virtio/virtio_user/vhost_kernel.c | 64 ++++++++----------- > > > > 1 file changed, 27 insertions(+), 37 deletions(-) > > > > > > > > diff --git a/drivers/net/virtio/virtio_user/vhost_kernel.c b/drivers/net/virtio/virtio_user/vhost_kernel.c > > > > index b2444096c..49bd1b821 100644 > > > > --- a/drivers/net/virtio/virtio_user/vhost_kernel.c > > > > +++ b/drivers/net/virtio/virtio_user/vhost_kernel.c > > > > @@ -70,41 +70,12 @@ static uint64_t vhost_req_user_to_kernel[] = { > > > > [VHOST_USER_SET_MEM_TABLE] = VHOST_SET_MEM_TABLE, > > > > }; > > > > > > > > -struct walk_arg { > > > > - struct vhost_memory_kernel *vm; > > > > - uint32_t region_nr; > > > > -}; > > > > -static int > > > > -add_memory_region(const struct rte_memseg_list *msl __rte_unused, > > > > - const struct rte_memseg *ms, size_t len, void *arg) > > > > -{ > > > > - struct walk_arg *wa = arg; > > > > - struct vhost_memory_region *mr; > > > > - void *start_addr; > > > > - > > > > - if (wa->region_nr >= max_regions) > > > > - return -1; > > > > - > > > > - mr = &wa->vm->regions[wa->region_nr++]; > > > > - start_addr = ms->addr; > > > > - > > > > - mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr; > > > > - mr->userspace_addr = (uint64_t)(uintptr_t)start_addr; > > > > - mr->memory_size = len; > > > > - mr->mmap_offset = 0; > > > > - > > > > - return 0; > > > > -} > > > > - > > > > -/* By default, vhost kernel module allows 64 regions, but DPDK allows > > > > - * 256 segments. As a relief, below function merges those virtually > > > > - * adjacent memsegs into one region. > > > > - */ > > > > static struct vhost_memory_kernel * > > > > prepare_vhost_memory_kernel(void) > > > > { > > > > + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; > > > > struct vhost_memory_kernel *vm; > > > > - struct walk_arg wa; > > > > + uint32_t region_nr = 0, i; > > > > > > > > vm = malloc(sizeof(struct vhost_memory_kernel) + > > > > max_regions * > > > > @@ -112,15 +83,34 @@ prepare_vhost_memory_kernel(void) > > > > if (!vm) > > > > return NULL; > > > > > > > > - wa.region_nr = 0; > > > > - wa.vm = vm; > > > > + for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { > > > > + struct rte_memseg_list *msl = &mcfg->memsegs[i]; > > > > + struct vhost_memory_region *mr; > > > > + void *start_addr; > > > > + uint64_t len; > > > > > > There is a rte_memseg_list_walk() - please do not iterate over memseg > > > lists manually. > > > > > > > rte_memseg_list_walk() can't be used here because > > prepare_vhost_memory_kernel() is sometimes called from a memory > > callback. It will then hang trying to get a read lock on > > memory_hotplug_lock. > > OK, so use rte_memseg_list_walk_thread_unsafe(). > > > I don't think the rte_memseg_list_walk_thread_unsafe() function is > > appropriate because prepare_vhost_memory_kernel() may not always be > > called from a memory callback. > > And how is this different? What you're doing here is identical to calling > rte_memseg_list_walk_thread_unsafe() (that's precisely what it does > internally - check the code!), except that you're doing it manually and not > using DPDK API, which makes your code dependent on internals of DPDK's > memory implementation. > > So, this function may or may not be called from a callback, but you're using > thread-unsafe walk anyway. I think you should call either thread-safe or > thread-unsafe version depending on whether you're being called from a > callback or not. Hmm, the real case is a bit more tricky. Even if this function isn't called from memory event callbacks, the "thread-safe" version list_walk() still can't be used. Because deadlock may happen. In virtio-user device start, it needs to do SET_MEM_TABLE for the vhost-backend. And to make sure that preparing and setting the memory table is atomic (and also to protect the device state), it needs a lock. So if it calls "thread-safe" version list_walk(), there will be two locks taken in below order: - the virtio-user device lock (taken by virtio_user_start_device()); - the memory hotplug lock (taken by rte_memseg_list_walk()); And above locks will be released in below order: - the memory hotplug lock (released by rte_memseg_list_walk()); - the virtio-user device lock (released by virtio_user_start_device()); And in virtio-user's memory event callback, it also needs to take the virtio-user device lock to make sure preparing and setting the memory table is atomic (and also to protect the device state), so the same device lock is needed here. And before virtio-user's memory event callback is called, the memory hotplug lock has already been taken by memory subsystem. So the locks are taken in below order: - the memory hotplug lock (It has been taken by memory subsystem before virtio-user's memory event callback being called); - the virtio-user device lock (taken by virtio_user_mem_event_cb()); So, if the virtio-user device start and memory callback events happen at the same time, deadlock may happen. And we can't always use rte_memseg_list_walk_thread_unsafe(), because by its definition, it's only expected to be called in memory callbacks: /** * ...... * @note This function does not perform any locking, and is only safe to call * from within memory-related callback functions. * ...... */ int __rte_experimental rte_memseg_list_walk_thread_unsafe(rte_memseg_list_walk_t func, void *arg); So both of rte_memseg_list_walk_thread_unsafe() and rte_memseg_list_walk() are not really suitable for this case. If we really want to use these helpers, we need to allow rte_memseg_*_walk_thread_unsafe() to be called when the callers have taken mcfg->memory_hotplug_lock, or add some extra memory APIs to allow callers to take the lock for rte_memseg_*_walk_thread_unsafe(). And we can take this lock in virtio_user_start_device() before taking the virtio-user device lock (so locks can be taken in the correct order). Thoughts? Thanks > > > > > > > > > > > - if (rte_memseg_contig_walk(add_memory_region, &wa) < 0) { > > > > - free(vm); > > > > - return NULL; > > > > + start_addr = msl->base_va; > > > > + len = msl->page_sz * msl->memseg_arr.len; > > > > + > > > > + if (start_addr == NULL || len == 0) > > > > + continue; > > > > + > > > > + if (region_nr >= max_regions) { > > > > + free(vm); > > > > + return NULL; > > > > + } > > > > + > > > > + mr = &vm->regions[region_nr++]; > > > > + mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr; > > > > + mr->userspace_addr = (uint64_t)(uintptr_t)start_addr; > > > > + mr->memory_size = len; > > > > + mr->mmap_offset = 0; /* flags_padding */ > > > > + > > > > + PMD_DRV_LOG(DEBUG, "index=%u, addr=%p len=%" PRIu64, > > > > + i, start_addr, len); > > > > } > > > > > > > > - vm->nregions = wa.region_nr; > > > > + vm->nregions = region_nr; > > > > vm->padding = 0; > > > > return vm; > > > > } > > > > > > > > > > > > > -- > > > Thanks, > > > Anatoly > > > > > -- > Thanks, > Anatoly