From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f66.google.com (mail-oi0-f66.google.com [209.85.218.66]) by dpdk.org (Postfix) with ESMTP id 3B7F669D4; Fri, 24 Aug 2018 17:52:18 +0200 (CEST) Received: by mail-oi0-f66.google.com with SMTP id z185-v6so15657840oia.5; Fri, 24 Aug 2018 08:52:18 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=GQxd/d+URlj/SexrSj+MGdouK2MRDFdNc/AFZoAtOy4=; b=V+/n6xQeo0zPu+kmZR+/oXaxRBc8plERiHtmTREe2oinEUxVHT8ZIMdnDtNMHbawSO FGoICa4aIChv6Gf2QL++BgngXC6V0I24E8YoM1k75mn1SLsxVHNs+HG3736E4lqGViAE 4wE6TxDOBZ+eyUcClogRWH69DTnkncbiwmYJRXMK8CnBAY6v2b8DUUK/yLjKJpdUVHyo OIqjRVITvP9ilcKZ+s8elykXuFvPVv3FYDrz/OyjS3TpcwqvBrp5bl4rYVGke8HWTiOu PrnjXzy3fOjAa0OTARfuP7WKd/g92SzjEMFjTqGSlVeXvNQZdcdTXBXiucYN+O5w+DD9 647A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=GQxd/d+URlj/SexrSj+MGdouK2MRDFdNc/AFZoAtOy4=; b=SPXBn/NZ8XOmA2gobUI7scCedHqb6fw21nbgqRYKZNZLEAjIRH5tdOu6wiXxxQN8tf l2suJVaHFAaflrGLQff32YSCsIVlo30HxYpVKbM7HhqkBv5InNAro0hG4zCpNqrBM0qP Rr0ZPAJBPrFkhmY2ztRMZOzhJ7EtsTnvUIoJ+qK9d0x6aWKQ/Yc+XuP5NdvJIwb4mf9P gzQe3oJKACqAF6QxY508s0om3+Drb6vZ2N3N1moxCMvLm9umGMmwewFib6JT6MXPUodu ujWnHj+WQXyrNVebvDUFmfEeopLm+dkZrgIzmXeUPGC3fozoRROod92irenw/pXjbJQI x0bg== X-Gm-Message-State: APzg51AImgSZSroIYq4lkLYSQEjrZOdA7Mf+3YOmA5i15fBXeSui5Ahh DMlSnxj59iYrztlMTq9qI6G588dsDKD0cJC2gtQ= X-Google-Smtp-Source: ANB0VdYfFBWRGdfgF3ttUE9k/JiA/Cb5AlCU8maQHpHg1GQOJCNGZaiJJ+D9EUQTezfQ/xzFkEDB/wrCiKJjXCZjeFQ= X-Received: by 2002:aca:c0c1:: with SMTP id q184-v6mr1964271oif.61.1535125937375; Fri, 24 Aug 2018 08:52:17 -0700 (PDT) MIME-Version: 1.0 References: <20180823025721.18300-1-tiwei.bie@intel.com> <992313c9-cd03-03db-888b-f5dd8c072cf0@intel.com> <34c163f6-b50f-06c0-d099-7d0ae5fbc5b0@intel.com> <20180824044909.GA19323@debian> <9b95f13c-24a7-7a3c-1db6-46da10c926e4@intel.com> <20180824093556.GA89425@debian.sh.intel.com> <661bbefd-cb23-78a7-8c93-633a5289c090@intel.com> In-Reply-To: <661bbefd-cb23-78a7-8c93-633a5289c090@intel.com> From: Sean Harte Date: Fri, 24 Aug 2018 16:51:40 +0100 Message-ID: To: anatoly.burakov@intel.com Cc: tiwei.bie@intel.com, maxime.coquelin@redhat.com, zhihong.wang@intel.com, dev@dpdk.org, stable@dpdk.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Subject: Re: [dpdk-dev] [PATCH] net/virtio-user: fix memory hotplug support X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 24 Aug 2018 15:52:18 -0000 On Fri, 24 Aug 2018 at 16:19, Burakov, Anatoly wrote: > > On 24-Aug-18 11:41 AM, Burakov, Anatoly wrote: > > On 24-Aug-18 10:35 AM, Tiwei Bie wrote: > >> On Fri, Aug 24, 2018 at 09:59:42AM +0100, Burakov, Anatoly wrote: > >>> On 24-Aug-18 5:49 AM, Tiwei Bie wrote: > >>>> On Thu, Aug 23, 2018 at 03:01:30PM +0100, Burakov, Anatoly wrote: > >>>>> On 23-Aug-18 12:19 PM, Sean Harte wrote: > >>>>>> On Thu, 23 Aug 2018 at 10:05, Burakov, Anatoly > >>>>>> wrote: > >>>>>>> > >>>>>>> On 23-Aug-18 3:57 AM, Tiwei Bie wrote: > >>>>>>>> Deadlock can occur when allocating memory if a vhost-kernel > >>>>>>>> based virtio-user device is in use. Besides, it's possible > >>>>>>>> to have much more than 64 non-contiguous hugepage backed > >>>>>>>> memory regions due to the memory hotplug, which may cause > >>>>>>>> problems when handling VHOST_SET_MEM_TABLE request. A better > >>>>>>>> solution is to have the virtio-user pass all the VA ranges > >>>>>>>> reserved by DPDK to vhost-kernel. > >>>>>>>> > >>>>>>>> Bugzilla ID: 81 > >>>>>>>> Fixes: 12ecb2f63b12 ("net/virtio-user: support memory hotplug") > >>>>>>>> Cc: stable@dpdk.org > >>>>>>>> > >>>>>>>> Reported-by: Se=C3=A1n Harte > >>>>>>>> Signed-off-by: Tiwei Bie > >>>>>>>> --- > >>>>>>>> drivers/net/virtio/virtio_user/vhost_kernel.c | 64 > >>>>>>>> ++++++++----------- > >>>>>>>> 1 file changed, 27 insertions(+), 37 deletions(-) > >>>>>>>> > >>>>>>>> diff --git a/drivers/net/virtio/virtio_user/vhost_kernel.c > >>>>>>>> b/drivers/net/virtio/virtio_user/vhost_kernel.c > >>>>>>>> index b2444096c..49bd1b821 100644 > >>>>>>>> --- a/drivers/net/virtio/virtio_user/vhost_kernel.c > >>>>>>>> +++ b/drivers/net/virtio/virtio_user/vhost_kernel.c > >>>>>>>> @@ -70,41 +70,12 @@ static uint64_t vhost_req_user_to_kernel[] = =3D { > >>>>>>>> [VHOST_USER_SET_MEM_TABLE] =3D VHOST_SET_MEM_TABLE, > >>>>>>>> }; > >>>>>>>> > >>>>>>>> -struct walk_arg { > >>>>>>>> - struct vhost_memory_kernel *vm; > >>>>>>>> - uint32_t region_nr; > >>>>>>>> -}; > >>>>>>>> -static int > >>>>>>>> -add_memory_region(const struct rte_memseg_list *msl __rte_unuse= d, > >>>>>>>> - const struct rte_memseg *ms, size_t len, void *arg= ) > >>>>>>>> -{ > >>>>>>>> - struct walk_arg *wa =3D arg; > >>>>>>>> - struct vhost_memory_region *mr; > >>>>>>>> - void *start_addr; > >>>>>>>> - > >>>>>>>> - if (wa->region_nr >=3D max_regions) > >>>>>>>> - return -1; > >>>>>>>> - > >>>>>>>> - mr =3D &wa->vm->regions[wa->region_nr++]; > >>>>>>>> - start_addr =3D ms->addr; > >>>>>>>> - > >>>>>>>> - mr->guest_phys_addr =3D (uint64_t)(uintptr_t)start_addr; > >>>>>>>> - mr->userspace_addr =3D (uint64_t)(uintptr_t)start_addr; > >>>>>>>> - mr->memory_size =3D len; > >>>>>>>> - mr->mmap_offset =3D 0; > >>>>>>>> - > >>>>>>>> - return 0; > >>>>>>>> -} > >>>>>>>> - > >>>>>>>> -/* By default, vhost kernel module allows 64 regions, but DPDK > >>>>>>>> allows > >>>>>>>> - * 256 segments. As a relief, below function merges those > >>>>>>>> virtually > >>>>>>>> - * adjacent memsegs into one region. > >>>>>>>> - */ > >>>>>>>> static struct vhost_memory_kernel * > >>>>>>>> prepare_vhost_memory_kernel(void) > >>>>>>>> { > >>>>>>>> + struct rte_mem_config *mcfg =3D > >>>>>>>> rte_eal_get_configuration()->mem_config; > >>>>>>>> struct vhost_memory_kernel *vm; > >>>>>>>> - struct walk_arg wa; > >>>>>>>> + uint32_t region_nr =3D 0, i; > >>>>>>>> > >>>>>>>> vm =3D malloc(sizeof(struct vhost_memory_kernel) + > >>>>>>>> max_regions * > >>>>>>>> @@ -112,15 +83,34 @@ prepare_vhost_memory_kernel(void) > >>>>>>>> if (!vm) > >>>>>>>> return NULL; > >>>>>>>> > >>>>>>>> - wa.region_nr =3D 0; > >>>>>>>> - wa.vm =3D vm; > >>>>>>>> + for (i =3D 0; i < RTE_MAX_MEMSEG_LISTS; i++) { > >>>>>>>> + struct rte_memseg_list *msl =3D &mcfg->memsegs[i]; > >>>>>>>> + struct vhost_memory_region *mr; > >>>>>>>> + void *start_addr; > >>>>>>>> + uint64_t len; > >>>>>>> > >>>>>>> There is a rte_memseg_list_walk() - please do not iterate over > >>>>>>> memseg > >>>>>>> lists manually. > >>>>>>> > >>>>>> > >>>>>> rte_memseg_list_walk() can't be used here because > >>>>>> prepare_vhost_memory_kernel() is sometimes called from a memory > >>>>>> callback. It will then hang trying to get a read lock on > >>>>>> memory_hotplug_lock. > >>>>> > >>>>> OK, so use rte_memseg_list_walk_thread_unsafe(). > >>>>> > >>>>>> I don't think the rte_memseg_list_walk_thread_unsafe() function is > >>>>>> appropriate because prepare_vhost_memory_kernel() may not always b= e > >>>>>> called from a memory callback. > >>>>> > >>>>> And how is this different? What you're doing here is identical to > >>>>> calling > >>>>> rte_memseg_list_walk_thread_unsafe() (that's precisely what it does > >>>>> internally - check the code!), except that you're doing it manually > >>>>> and not > >>>>> using DPDK API, which makes your code dependent on internals of DPD= K's > >>>>> memory implementation. > >>>>> > >>>>> So, this function may or may not be called from a callback, but > >>>>> you're using > >>>>> thread-unsafe walk anyway. I think you should call either > >>>>> thread-safe or > >>>>> thread-unsafe version depending on whether you're being called from= a > >>>>> callback or not. > >>>> > >>>> Hmm, the real case is a bit more tricky. Even if this > >>>> function isn't called from memory event callbacks, the > >>>> "thread-safe" version list_walk() still can't be used. > >>>> Because deadlock may happen. > >>>> > >>>> In virtio-user device start, it needs to do SET_MEM_TABLE > >>>> for the vhost-backend. And to make sure that preparing and > >>>> setting the memory table is atomic (and also to protect the > >>>> device state), it needs a lock. So if it calls "thread-safe" > >>>> version list_walk(), there will be two locks taken in > >>>> below order: > >>>> > >>>> - the virtio-user device lock (taken by virtio_user_start_device()); > >>>> - the memory hotplug lock (taken by rte_memseg_list_walk()); > >>>> > >>>> And above locks will be released in below order: > >>>> > >>>> - the memory hotplug lock (released by rte_memseg_list_walk()); > >>>> - the virtio-user device lock (released by virtio_user_start_device(= )); > >>>> > >>>> And in virtio-user's memory event callback, it also needs > >>>> to take the virtio-user device lock to make sure preparing > >>>> and setting the memory table is atomic (and also to protect > >>>> the device state), so the same device lock is needed here. > >>>> And before virtio-user's memory event callback is called, > >>>> the memory hotplug lock has already been taken by memory > >>>> subsystem. So the locks are taken in below order: > >>>> > >>>> - the memory hotplug lock (It has been taken by memory subsystem > >>>> before virtio-user's memory event callback being called); > >>>> - the virtio-user device lock (taken by virtio_user_mem_event_cb()); > >>>> > >>>> So, if the virtio-user device start and memory callback > >>>> events happen at the same time, deadlock may happen. > >>>> > >>>> And we can't always use rte_memseg_list_walk_thread_unsafe(), > >>>> because by its definition, it's only expected to be called > >>>> in memory callbacks: > >>>> > >>>> /** > >>>> * ...... > >>>> * @note This function does not perform any locking, and is only > >>>> safe to call > >>>> * from within memory-related callback functions. > >>>> * ...... > >>>> */ > >>>> int __rte_experimental > >>>> rte_memseg_list_walk_thread_unsafe(rte_memseg_list_walk_t func, void > >>>> *arg); > >>>> > >>>> So both of rte_memseg_list_walk_thread_unsafe() and > >>>> rte_memseg_list_walk() are not really suitable for this > >>>> case. If we really want to use these helpers, we need > >>>> to allow rte_memseg_*_walk_thread_unsafe() to be called > >>>> when the callers have taken mcfg->memory_hotplug_lock, > >>>> or add some extra memory APIs to allow callers to take > >>>> the lock for rte_memseg_*_walk_thread_unsafe(). And we > >>>> can take this lock in virtio_user_start_device() before > >>>> taking the virtio-user device lock (so locks can be taken > >>>> in the correct order). Thoughts? > >>> > >>> You can know if this function is called from memory callback. You can > >>> probably also know if you're in the process of starting the device. T= he > >>> solution seems clear to me - check both? :) > >> > >> Hmm.. You didn't get my point. :( > >> > >> I mean a lock within virtio driver is needed by virtio-user > >> to avoid the race condition between the virtio-user device > >> start and the virtio-user memory event callback (e.g. about > >> the device state change). And to iterate the memseg lists, > >> a lock within memory subsystem will be taken. So in both of > >> the virtio device start and memory event handling, there are > >> two locks will be taken -- And we need to take these locks > >> in correct order to avoid deadlock, and it requires us to > >> have a way to take the lock for rte_memseg_*_thread_unsafe() > >> in callers. > > > > I'm afraid i'm still not getting your point :( > > > > You know that you can either get called from memory callback, or not > > from memory callback. Both of these times, the virtio device is locked. > > So, where does the race come from? You take out your device lock, and b= y > > the time you need to iterate through memsegs, you know that you either > > were or weren't called from memory callback, which means you can pick > > either thread-safe or thread-unsafe version. > > > > Can you please draw up a step-by-step example where race can happen tha= t > > cannot be solved the way i suggested above? > > > > I.e. > > > > thread 1 thread 2 > > do this > > do that > > For the benefit of public discussion, the following is result of our > internal discussion on this topic: the deadlock may happen because we > take virtio lock and hotplug lock in different order. > > So, thread 1 may do device start, which will take out virtio device > lock, then attempt to iterate memory regions thereby taking memory > hotplug lock. At the same time, thread 2 might trigger an allocation, > and virtio will receive a callback, which will lock hotplug, and then > attempt to lock virtio device. > > In other words, > > Thread 1 Thread 2 > lock virtio device > lock hotplug > (waits for hotplug unlock) > (waits for virtio device unlock) > > The solution to this is not trivial, and we haven't come up with > anything to fix this that doesn't involve pretty serious changes to the > memory subsystem. Ideas welcome :) > As part of device start, can the virtio-user driver take the hotplug lock before the virtio lock (in Thread 1 in your example). It can access the lock through rte_eal_get_configuration()->mem_config, as the Mellanox driver currently does. If that would work, an API to lock and unlock do it should probably be provided by the memory subsystem. Although, it seems a bit error-prone and a future change could easily break things. > -- > Thanks, > Anatoly