* [dpdk-dev] [PATCH] net/virtio-user: fix memory hotplug support @ 2018-08-23 2:57 Tiwei Bie 2018-08-23 9:05 ` Burakov, Anatoly 0 siblings, 1 reply; 14+ messages in thread From: Tiwei Bie @ 2018-08-23 2:57 UTC (permalink / raw) To: maxime.coquelin, zhihong.wang, dev; +Cc: seanbh, anatoly.burakov, stable Deadlock can occur when allocating memory if a vhost-kernel based virtio-user device is in use. Besides, it's possible to have much more than 64 non-contiguous hugepage backed memory regions due to the memory hotplug, which may cause problems when handling VHOST_SET_MEM_TABLE request. A better solution is to have the virtio-user pass all the VA ranges reserved by DPDK to vhost-kernel. Bugzilla ID: 81 Fixes: 12ecb2f63b12 ("net/virtio-user: support memory hotplug") Cc: stable@dpdk.org Reported-by: Seán Harte <seanbh@gmail.com> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com> --- drivers/net/virtio/virtio_user/vhost_kernel.c | 64 ++++++++----------- 1 file changed, 27 insertions(+), 37 deletions(-) diff --git a/drivers/net/virtio/virtio_user/vhost_kernel.c b/drivers/net/virtio/virtio_user/vhost_kernel.c index b2444096c..49bd1b821 100644 --- a/drivers/net/virtio/virtio_user/vhost_kernel.c +++ b/drivers/net/virtio/virtio_user/vhost_kernel.c @@ -70,41 +70,12 @@ static uint64_t vhost_req_user_to_kernel[] = { [VHOST_USER_SET_MEM_TABLE] = VHOST_SET_MEM_TABLE, }; -struct walk_arg { - struct vhost_memory_kernel *vm; - uint32_t region_nr; -}; -static int -add_memory_region(const struct rte_memseg_list *msl __rte_unused, - const struct rte_memseg *ms, size_t len, void *arg) -{ - struct walk_arg *wa = arg; - struct vhost_memory_region *mr; - void *start_addr; - - if (wa->region_nr >= max_regions) - return -1; - - mr = &wa->vm->regions[wa->region_nr++]; - start_addr = ms->addr; - - mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr; - mr->userspace_addr = (uint64_t)(uintptr_t)start_addr; - mr->memory_size = len; - mr->mmap_offset = 0; - - return 0; -} - -/* By default, vhost kernel module allows 64 regions, but DPDK allows - * 256 segments. As a relief, below function merges those virtually - * adjacent memsegs into one region. - */ static struct vhost_memory_kernel * prepare_vhost_memory_kernel(void) { + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; struct vhost_memory_kernel *vm; - struct walk_arg wa; + uint32_t region_nr = 0, i; vm = malloc(sizeof(struct vhost_memory_kernel) + max_regions * @@ -112,15 +83,34 @@ prepare_vhost_memory_kernel(void) if (!vm) return NULL; - wa.region_nr = 0; - wa.vm = vm; + for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { + struct rte_memseg_list *msl = &mcfg->memsegs[i]; + struct vhost_memory_region *mr; + void *start_addr; + uint64_t len; - if (rte_memseg_contig_walk(add_memory_region, &wa) < 0) { - free(vm); - return NULL; + start_addr = msl->base_va; + len = msl->page_sz * msl->memseg_arr.len; + + if (start_addr == NULL || len == 0) + continue; + + if (region_nr >= max_regions) { + free(vm); + return NULL; + } + + mr = &vm->regions[region_nr++]; + mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr; + mr->userspace_addr = (uint64_t)(uintptr_t)start_addr; + mr->memory_size = len; + mr->mmap_offset = 0; /* flags_padding */ + + PMD_DRV_LOG(DEBUG, "index=%u, addr=%p len=%" PRIu64, + i, start_addr, len); } - vm->nregions = wa.region_nr; + vm->nregions = region_nr; vm->padding = 0; return vm; } -- 2.18.0 ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-dev] [PATCH] net/virtio-user: fix memory hotplug support 2018-08-23 2:57 [dpdk-dev] [PATCH] net/virtio-user: fix memory hotplug support Tiwei Bie @ 2018-08-23 9:05 ` Burakov, Anatoly 2018-08-23 11:19 ` Sean Harte 0 siblings, 1 reply; 14+ messages in thread From: Burakov, Anatoly @ 2018-08-23 9:05 UTC (permalink / raw) To: Tiwei Bie, maxime.coquelin, zhihong.wang, dev; +Cc: seanbh, stable On 23-Aug-18 3:57 AM, Tiwei Bie wrote: > Deadlock can occur when allocating memory if a vhost-kernel > based virtio-user device is in use. Besides, it's possible > to have much more than 64 non-contiguous hugepage backed > memory regions due to the memory hotplug, which may cause > problems when handling VHOST_SET_MEM_TABLE request. A better > solution is to have the virtio-user pass all the VA ranges > reserved by DPDK to vhost-kernel. > > Bugzilla ID: 81 > Fixes: 12ecb2f63b12 ("net/virtio-user: support memory hotplug") > Cc: stable@dpdk.org > > Reported-by: Seán Harte <seanbh@gmail.com> > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com> > --- > drivers/net/virtio/virtio_user/vhost_kernel.c | 64 ++++++++----------- > 1 file changed, 27 insertions(+), 37 deletions(-) > > diff --git a/drivers/net/virtio/virtio_user/vhost_kernel.c b/drivers/net/virtio/virtio_user/vhost_kernel.c > index b2444096c..49bd1b821 100644 > --- a/drivers/net/virtio/virtio_user/vhost_kernel.c > +++ b/drivers/net/virtio/virtio_user/vhost_kernel.c > @@ -70,41 +70,12 @@ static uint64_t vhost_req_user_to_kernel[] = { > [VHOST_USER_SET_MEM_TABLE] = VHOST_SET_MEM_TABLE, > }; > > -struct walk_arg { > - struct vhost_memory_kernel *vm; > - uint32_t region_nr; > -}; > -static int > -add_memory_region(const struct rte_memseg_list *msl __rte_unused, > - const struct rte_memseg *ms, size_t len, void *arg) > -{ > - struct walk_arg *wa = arg; > - struct vhost_memory_region *mr; > - void *start_addr; > - > - if (wa->region_nr >= max_regions) > - return -1; > - > - mr = &wa->vm->regions[wa->region_nr++]; > - start_addr = ms->addr; > - > - mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr; > - mr->userspace_addr = (uint64_t)(uintptr_t)start_addr; > - mr->memory_size = len; > - mr->mmap_offset = 0; > - > - return 0; > -} > - > -/* By default, vhost kernel module allows 64 regions, but DPDK allows > - * 256 segments. As a relief, below function merges those virtually > - * adjacent memsegs into one region. > - */ > static struct vhost_memory_kernel * > prepare_vhost_memory_kernel(void) > { > + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; > struct vhost_memory_kernel *vm; > - struct walk_arg wa; > + uint32_t region_nr = 0, i; > > vm = malloc(sizeof(struct vhost_memory_kernel) + > max_regions * > @@ -112,15 +83,34 @@ prepare_vhost_memory_kernel(void) > if (!vm) > return NULL; > > - wa.region_nr = 0; > - wa.vm = vm; > + for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { > + struct rte_memseg_list *msl = &mcfg->memsegs[i]; > + struct vhost_memory_region *mr; > + void *start_addr; > + uint64_t len; There is a rte_memseg_list_walk() - please do not iterate over memseg lists manually. > > - if (rte_memseg_contig_walk(add_memory_region, &wa) < 0) { > - free(vm); > - return NULL; > + start_addr = msl->base_va; > + len = msl->page_sz * msl->memseg_arr.len; > + > + if (start_addr == NULL || len == 0) > + continue; > + > + if (region_nr >= max_regions) { > + free(vm); > + return NULL; > + } > + > + mr = &vm->regions[region_nr++]; > + mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr; > + mr->userspace_addr = (uint64_t)(uintptr_t)start_addr; > + mr->memory_size = len; > + mr->mmap_offset = 0; /* flags_padding */ > + > + PMD_DRV_LOG(DEBUG, "index=%u, addr=%p len=%" PRIu64, > + i, start_addr, len); > } > > - vm->nregions = wa.region_nr; > + vm->nregions = region_nr; > vm->padding = 0; > return vm; > } > -- Thanks, Anatoly ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-dev] [PATCH] net/virtio-user: fix memory hotplug support 2018-08-23 9:05 ` Burakov, Anatoly @ 2018-08-23 11:19 ` Sean Harte 2018-08-23 14:01 ` Burakov, Anatoly 0 siblings, 1 reply; 14+ messages in thread From: Sean Harte @ 2018-08-23 11:19 UTC (permalink / raw) To: anatoly.burakov; +Cc: tiwei.bie, maxime.coquelin, zhihong.wang, dev, stable On Thu, 23 Aug 2018 at 10:05, Burakov, Anatoly <anatoly.burakov@intel.com> wrote: > > On 23-Aug-18 3:57 AM, Tiwei Bie wrote: > > Deadlock can occur when allocating memory if a vhost-kernel > > based virtio-user device is in use. Besides, it's possible > > to have much more than 64 non-contiguous hugepage backed > > memory regions due to the memory hotplug, which may cause > > problems when handling VHOST_SET_MEM_TABLE request. A better > > solution is to have the virtio-user pass all the VA ranges > > reserved by DPDK to vhost-kernel. > > > > Bugzilla ID: 81 > > Fixes: 12ecb2f63b12 ("net/virtio-user: support memory hotplug") > > Cc: stable@dpdk.org > > > > Reported-by: Seán Harte <seanbh@gmail.com> > > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com> > > --- > > drivers/net/virtio/virtio_user/vhost_kernel.c | 64 ++++++++----------- > > 1 file changed, 27 insertions(+), 37 deletions(-) > > > > diff --git a/drivers/net/virtio/virtio_user/vhost_kernel.c b/drivers/net/virtio/virtio_user/vhost_kernel.c > > index b2444096c..49bd1b821 100644 > > --- a/drivers/net/virtio/virtio_user/vhost_kernel.c > > +++ b/drivers/net/virtio/virtio_user/vhost_kernel.c > > @@ -70,41 +70,12 @@ static uint64_t vhost_req_user_to_kernel[] = { > > [VHOST_USER_SET_MEM_TABLE] = VHOST_SET_MEM_TABLE, > > }; > > > > -struct walk_arg { > > - struct vhost_memory_kernel *vm; > > - uint32_t region_nr; > > -}; > > -static int > > -add_memory_region(const struct rte_memseg_list *msl __rte_unused, > > - const struct rte_memseg *ms, size_t len, void *arg) > > -{ > > - struct walk_arg *wa = arg; > > - struct vhost_memory_region *mr; > > - void *start_addr; > > - > > - if (wa->region_nr >= max_regions) > > - return -1; > > - > > - mr = &wa->vm->regions[wa->region_nr++]; > > - start_addr = ms->addr; > > - > > - mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr; > > - mr->userspace_addr = (uint64_t)(uintptr_t)start_addr; > > - mr->memory_size = len; > > - mr->mmap_offset = 0; > > - > > - return 0; > > -} > > - > > -/* By default, vhost kernel module allows 64 regions, but DPDK allows > > - * 256 segments. As a relief, below function merges those virtually > > - * adjacent memsegs into one region. > > - */ > > static struct vhost_memory_kernel * > > prepare_vhost_memory_kernel(void) > > { > > + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; > > struct vhost_memory_kernel *vm; > > - struct walk_arg wa; > > + uint32_t region_nr = 0, i; > > > > vm = malloc(sizeof(struct vhost_memory_kernel) + > > max_regions * > > @@ -112,15 +83,34 @@ prepare_vhost_memory_kernel(void) > > if (!vm) > > return NULL; > > > > - wa.region_nr = 0; > > - wa.vm = vm; > > + for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { > > + struct rte_memseg_list *msl = &mcfg->memsegs[i]; > > + struct vhost_memory_region *mr; > > + void *start_addr; > > + uint64_t len; > > There is a rte_memseg_list_walk() - please do not iterate over memseg > lists manually. > rte_memseg_list_walk() can't be used here because prepare_vhost_memory_kernel() is sometimes called from a memory callback. It will then hang trying to get a read lock on memory_hotplug_lock. I don't think the rte_memseg_list_walk_thread_unsafe() function is appropriate because prepare_vhost_memory_kernel() may not always be called from a memory callback. > > > > - if (rte_memseg_contig_walk(add_memory_region, &wa) < 0) { > > - free(vm); > > - return NULL; > > + start_addr = msl->base_va; > > + len = msl->page_sz * msl->memseg_arr.len; > > + > > + if (start_addr == NULL || len == 0) > > + continue; > > + > > + if (region_nr >= max_regions) { > > + free(vm); > > + return NULL; > > + } > > + > > + mr = &vm->regions[region_nr++]; > > + mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr; > > + mr->userspace_addr = (uint64_t)(uintptr_t)start_addr; > > + mr->memory_size = len; > > + mr->mmap_offset = 0; /* flags_padding */ > > + > > + PMD_DRV_LOG(DEBUG, "index=%u, addr=%p len=%" PRIu64, > > + i, start_addr, len); > > } > > > > - vm->nregions = wa.region_nr; > > + vm->nregions = region_nr; > > vm->padding = 0; > > return vm; > > } > > > > > -- > Thanks, > Anatoly ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-dev] [PATCH] net/virtio-user: fix memory hotplug support 2018-08-23 11:19 ` Sean Harte @ 2018-08-23 14:01 ` Burakov, Anatoly 2018-08-23 15:33 ` Sean Harte 2018-08-24 4:49 ` Tiwei Bie 0 siblings, 2 replies; 14+ messages in thread From: Burakov, Anatoly @ 2018-08-23 14:01 UTC (permalink / raw) To: Sean Harte; +Cc: tiwei.bie, maxime.coquelin, zhihong.wang, dev, stable On 23-Aug-18 12:19 PM, Sean Harte wrote: > On Thu, 23 Aug 2018 at 10:05, Burakov, Anatoly > <anatoly.burakov@intel.com> wrote: >> >> On 23-Aug-18 3:57 AM, Tiwei Bie wrote: >>> Deadlock can occur when allocating memory if a vhost-kernel >>> based virtio-user device is in use. Besides, it's possible >>> to have much more than 64 non-contiguous hugepage backed >>> memory regions due to the memory hotplug, which may cause >>> problems when handling VHOST_SET_MEM_TABLE request. A better >>> solution is to have the virtio-user pass all the VA ranges >>> reserved by DPDK to vhost-kernel. >>> >>> Bugzilla ID: 81 >>> Fixes: 12ecb2f63b12 ("net/virtio-user: support memory hotplug") >>> Cc: stable@dpdk.org >>> >>> Reported-by: Seán Harte <seanbh@gmail.com> >>> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com> >>> --- >>> drivers/net/virtio/virtio_user/vhost_kernel.c | 64 ++++++++----------- >>> 1 file changed, 27 insertions(+), 37 deletions(-) >>> >>> diff --git a/drivers/net/virtio/virtio_user/vhost_kernel.c b/drivers/net/virtio/virtio_user/vhost_kernel.c >>> index b2444096c..49bd1b821 100644 >>> --- a/drivers/net/virtio/virtio_user/vhost_kernel.c >>> +++ b/drivers/net/virtio/virtio_user/vhost_kernel.c >>> @@ -70,41 +70,12 @@ static uint64_t vhost_req_user_to_kernel[] = { >>> [VHOST_USER_SET_MEM_TABLE] = VHOST_SET_MEM_TABLE, >>> }; >>> >>> -struct walk_arg { >>> - struct vhost_memory_kernel *vm; >>> - uint32_t region_nr; >>> -}; >>> -static int >>> -add_memory_region(const struct rte_memseg_list *msl __rte_unused, >>> - const struct rte_memseg *ms, size_t len, void *arg) >>> -{ >>> - struct walk_arg *wa = arg; >>> - struct vhost_memory_region *mr; >>> - void *start_addr; >>> - >>> - if (wa->region_nr >= max_regions) >>> - return -1; >>> - >>> - mr = &wa->vm->regions[wa->region_nr++]; >>> - start_addr = ms->addr; >>> - >>> - mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr; >>> - mr->userspace_addr = (uint64_t)(uintptr_t)start_addr; >>> - mr->memory_size = len; >>> - mr->mmap_offset = 0; >>> - >>> - return 0; >>> -} >>> - >>> -/* By default, vhost kernel module allows 64 regions, but DPDK allows >>> - * 256 segments. As a relief, below function merges those virtually >>> - * adjacent memsegs into one region. >>> - */ >>> static struct vhost_memory_kernel * >>> prepare_vhost_memory_kernel(void) >>> { >>> + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; >>> struct vhost_memory_kernel *vm; >>> - struct walk_arg wa; >>> + uint32_t region_nr = 0, i; >>> >>> vm = malloc(sizeof(struct vhost_memory_kernel) + >>> max_regions * >>> @@ -112,15 +83,34 @@ prepare_vhost_memory_kernel(void) >>> if (!vm) >>> return NULL; >>> >>> - wa.region_nr = 0; >>> - wa.vm = vm; >>> + for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { >>> + struct rte_memseg_list *msl = &mcfg->memsegs[i]; >>> + struct vhost_memory_region *mr; >>> + void *start_addr; >>> + uint64_t len; >> >> There is a rte_memseg_list_walk() - please do not iterate over memseg >> lists manually. >> > > rte_memseg_list_walk() can't be used here because > prepare_vhost_memory_kernel() is sometimes called from a memory > callback. It will then hang trying to get a read lock on > memory_hotplug_lock. OK, so use rte_memseg_list_walk_thread_unsafe(). > I don't think the rte_memseg_list_walk_thread_unsafe() function is > appropriate because prepare_vhost_memory_kernel() may not always be > called from a memory callback. And how is this different? What you're doing here is identical to calling rte_memseg_list_walk_thread_unsafe() (that's precisely what it does internally - check the code!), except that you're doing it manually and not using DPDK API, which makes your code dependent on internals of DPDK's memory implementation. So, this function may or may not be called from a callback, but you're using thread-unsafe walk anyway. I think you should call either thread-safe or thread-unsafe version depending on whether you're being called from a callback or not. > >>> >>> - if (rte_memseg_contig_walk(add_memory_region, &wa) < 0) { >>> - free(vm); >>> - return NULL; >>> + start_addr = msl->base_va; >>> + len = msl->page_sz * msl->memseg_arr.len; >>> + >>> + if (start_addr == NULL || len == 0) >>> + continue; >>> + >>> + if (region_nr >= max_regions) { >>> + free(vm); >>> + return NULL; >>> + } >>> + >>> + mr = &vm->regions[region_nr++]; >>> + mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr; >>> + mr->userspace_addr = (uint64_t)(uintptr_t)start_addr; >>> + mr->memory_size = len; >>> + mr->mmap_offset = 0; /* flags_padding */ >>> + >>> + PMD_DRV_LOG(DEBUG, "index=%u, addr=%p len=%" PRIu64, >>> + i, start_addr, len); >>> } >>> >>> - vm->nregions = wa.region_nr; >>> + vm->nregions = region_nr; >>> vm->padding = 0; >>> return vm; >>> } >>> >> >> >> -- >> Thanks, >> Anatoly > -- Thanks, Anatoly ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-dev] [PATCH] net/virtio-user: fix memory hotplug support 2018-08-23 14:01 ` Burakov, Anatoly @ 2018-08-23 15:33 ` Sean Harte 2018-08-24 4:49 ` Tiwei Bie 1 sibling, 0 replies; 14+ messages in thread From: Sean Harte @ 2018-08-23 15:33 UTC (permalink / raw) To: anatoly.burakov; +Cc: tiwei.bie, maxime.coquelin, zhihong.wang, dev, stable On Thu, 23 Aug 2018 at 15:01, Burakov, Anatoly <anatoly.burakov@intel.com> wrote: > > On 23-Aug-18 12:19 PM, Sean Harte wrote: > > On Thu, 23 Aug 2018 at 10:05, Burakov, Anatoly > > <anatoly.burakov@intel.com> wrote: > >> > >> On 23-Aug-18 3:57 AM, Tiwei Bie wrote: > >>> Deadlock can occur when allocating memory if a vhost-kernel > >>> based virtio-user device is in use. Besides, it's possible > >>> to have much more than 64 non-contiguous hugepage backed > >>> memory regions due to the memory hotplug, which may cause > >>> problems when handling VHOST_SET_MEM_TABLE request. A better > >>> solution is to have the virtio-user pass all the VA ranges > >>> reserved by DPDK to vhost-kernel. > >>> > >>> Bugzilla ID: 81 > >>> Fixes: 12ecb2f63b12 ("net/virtio-user: support memory hotplug") > >>> Cc: stable@dpdk.org > >>> > >>> Reported-by: Seán Harte <seanbh@gmail.com> > >>> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com> > >>> --- > >>> drivers/net/virtio/virtio_user/vhost_kernel.c | 64 ++++++++----------- > >>> 1 file changed, 27 insertions(+), 37 deletions(-) > >>> > >>> diff --git a/drivers/net/virtio/virtio_user/vhost_kernel.c b/drivers/net/virtio/virtio_user/vhost_kernel.c > >>> index b2444096c..49bd1b821 100644 > >>> --- a/drivers/net/virtio/virtio_user/vhost_kernel.c > >>> +++ b/drivers/net/virtio/virtio_user/vhost_kernel.c > >>> @@ -70,41 +70,12 @@ static uint64_t vhost_req_user_to_kernel[] = { > >>> [VHOST_USER_SET_MEM_TABLE] = VHOST_SET_MEM_TABLE, > >>> }; > >>> > >>> -struct walk_arg { > >>> - struct vhost_memory_kernel *vm; > >>> - uint32_t region_nr; > >>> -}; > >>> -static int > >>> -add_memory_region(const struct rte_memseg_list *msl __rte_unused, > >>> - const struct rte_memseg *ms, size_t len, void *arg) > >>> -{ > >>> - struct walk_arg *wa = arg; > >>> - struct vhost_memory_region *mr; > >>> - void *start_addr; > >>> - > >>> - if (wa->region_nr >= max_regions) > >>> - return -1; > >>> - > >>> - mr = &wa->vm->regions[wa->region_nr++]; > >>> - start_addr = ms->addr; > >>> - > >>> - mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr; > >>> - mr->userspace_addr = (uint64_t)(uintptr_t)start_addr; > >>> - mr->memory_size = len; > >>> - mr->mmap_offset = 0; > >>> - > >>> - return 0; > >>> -} > >>> - > >>> -/* By default, vhost kernel module allows 64 regions, but DPDK allows > >>> - * 256 segments. As a relief, below function merges those virtually > >>> - * adjacent memsegs into one region. > >>> - */ > >>> static struct vhost_memory_kernel * > >>> prepare_vhost_memory_kernel(void) > >>> { > >>> + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; > >>> struct vhost_memory_kernel *vm; > >>> - struct walk_arg wa; > >>> + uint32_t region_nr = 0, i; > >>> > >>> vm = malloc(sizeof(struct vhost_memory_kernel) + > >>> max_regions * > >>> @@ -112,15 +83,34 @@ prepare_vhost_memory_kernel(void) > >>> if (!vm) > >>> return NULL; > >>> > >>> - wa.region_nr = 0; > >>> - wa.vm = vm; > >>> + for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { > >>> + struct rte_memseg_list *msl = &mcfg->memsegs[i]; > >>> + struct vhost_memory_region *mr; > >>> + void *start_addr; > >>> + uint64_t len; > >> > >> There is a rte_memseg_list_walk() - please do not iterate over memseg > >> lists manually. > >> > > > > rte_memseg_list_walk() can't be used here because > > prepare_vhost_memory_kernel() is sometimes called from a memory > > callback. It will then hang trying to get a read lock on > > memory_hotplug_lock. > > OK, so use rte_memseg_list_walk_thread_unsafe(). > > > I don't think the rte_memseg_list_walk_thread_unsafe() function is > > appropriate because prepare_vhost_memory_kernel() may not always be > > called from a memory callback. > > And how is this different? What you're doing here is identical to > calling rte_memseg_list_walk_thread_unsafe() (that's precisely what it > does internally - check the code!), except that you're doing it manually > and not using DPDK API, which makes your code dependent on internals of > DPDK's memory implementation. > > So, this function may or may not be called from a callback, but you're > using thread-unsafe walk anyway. I think you should call either > thread-safe or thread-unsafe version depending on whether you're being > called from a callback or not. > I'm not the patch author :-) Switching to the _thread_unsafe() function might convert the easy-to-debug deadlock issue into a tricky-to-debug race condition. Picking between the two implementations depending on the calling context would work. Another approach might be a different type of locking system that allows reading when the same thread already holds the write lock. > > > >>> > >>> - if (rte_memseg_contig_walk(add_memory_region, &wa) < 0) { > >>> - free(vm); > >>> - return NULL; > >>> + start_addr = msl->base_va; > >>> + len = msl->page_sz * msl->memseg_arr.len; > >>> + > >>> + if (start_addr == NULL || len == 0) > >>> + continue; > >>> + > >>> + if (region_nr >= max_regions) { > >>> + free(vm); > >>> + return NULL; > >>> + } > >>> + > >>> + mr = &vm->regions[region_nr++]; > >>> + mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr; > >>> + mr->userspace_addr = (uint64_t)(uintptr_t)start_addr; > >>> + mr->memory_size = len; > >>> + mr->mmap_offset = 0; /* flags_padding */ > >>> + > >>> + PMD_DRV_LOG(DEBUG, "index=%u, addr=%p len=%" PRIu64, > >>> + i, start_addr, len); > >>> } > >>> > >>> - vm->nregions = wa.region_nr; > >>> + vm->nregions = region_nr; > >>> vm->padding = 0; > >>> return vm; > >>> } > >>> > >> > >> > >> -- > >> Thanks, > >> Anatoly > > > > > -- > Thanks, > Anatoly ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-dev] [PATCH] net/virtio-user: fix memory hotplug support 2018-08-23 14:01 ` Burakov, Anatoly 2018-08-23 15:33 ` Sean Harte @ 2018-08-24 4:49 ` Tiwei Bie 2018-08-24 8:59 ` Burakov, Anatoly 1 sibling, 1 reply; 14+ messages in thread From: Tiwei Bie @ 2018-08-24 4:49 UTC (permalink / raw) To: Burakov, Anatoly; +Cc: Sean Harte, maxime.coquelin, zhihong.wang, dev, stable On Thu, Aug 23, 2018 at 03:01:30PM +0100, Burakov, Anatoly wrote: > On 23-Aug-18 12:19 PM, Sean Harte wrote: > > On Thu, 23 Aug 2018 at 10:05, Burakov, Anatoly > > <anatoly.burakov@intel.com> wrote: > > > > > > On 23-Aug-18 3:57 AM, Tiwei Bie wrote: > > > > Deadlock can occur when allocating memory if a vhost-kernel > > > > based virtio-user device is in use. Besides, it's possible > > > > to have much more than 64 non-contiguous hugepage backed > > > > memory regions due to the memory hotplug, which may cause > > > > problems when handling VHOST_SET_MEM_TABLE request. A better > > > > solution is to have the virtio-user pass all the VA ranges > > > > reserved by DPDK to vhost-kernel. > > > > > > > > Bugzilla ID: 81 > > > > Fixes: 12ecb2f63b12 ("net/virtio-user: support memory hotplug") > > > > Cc: stable@dpdk.org > > > > > > > > Reported-by: Seán Harte <seanbh@gmail.com> > > > > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com> > > > > --- > > > > drivers/net/virtio/virtio_user/vhost_kernel.c | 64 ++++++++----------- > > > > 1 file changed, 27 insertions(+), 37 deletions(-) > > > > > > > > diff --git a/drivers/net/virtio/virtio_user/vhost_kernel.c b/drivers/net/virtio/virtio_user/vhost_kernel.c > > > > index b2444096c..49bd1b821 100644 > > > > --- a/drivers/net/virtio/virtio_user/vhost_kernel.c > > > > +++ b/drivers/net/virtio/virtio_user/vhost_kernel.c > > > > @@ -70,41 +70,12 @@ static uint64_t vhost_req_user_to_kernel[] = { > > > > [VHOST_USER_SET_MEM_TABLE] = VHOST_SET_MEM_TABLE, > > > > }; > > > > > > > > -struct walk_arg { > > > > - struct vhost_memory_kernel *vm; > > > > - uint32_t region_nr; > > > > -}; > > > > -static int > > > > -add_memory_region(const struct rte_memseg_list *msl __rte_unused, > > > > - const struct rte_memseg *ms, size_t len, void *arg) > > > > -{ > > > > - struct walk_arg *wa = arg; > > > > - struct vhost_memory_region *mr; > > > > - void *start_addr; > > > > - > > > > - if (wa->region_nr >= max_regions) > > > > - return -1; > > > > - > > > > - mr = &wa->vm->regions[wa->region_nr++]; > > > > - start_addr = ms->addr; > > > > - > > > > - mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr; > > > > - mr->userspace_addr = (uint64_t)(uintptr_t)start_addr; > > > > - mr->memory_size = len; > > > > - mr->mmap_offset = 0; > > > > - > > > > - return 0; > > > > -} > > > > - > > > > -/* By default, vhost kernel module allows 64 regions, but DPDK allows > > > > - * 256 segments. As a relief, below function merges those virtually > > > > - * adjacent memsegs into one region. > > > > - */ > > > > static struct vhost_memory_kernel * > > > > prepare_vhost_memory_kernel(void) > > > > { > > > > + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; > > > > struct vhost_memory_kernel *vm; > > > > - struct walk_arg wa; > > > > + uint32_t region_nr = 0, i; > > > > > > > > vm = malloc(sizeof(struct vhost_memory_kernel) + > > > > max_regions * > > > > @@ -112,15 +83,34 @@ prepare_vhost_memory_kernel(void) > > > > if (!vm) > > > > return NULL; > > > > > > > > - wa.region_nr = 0; > > > > - wa.vm = vm; > > > > + for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { > > > > + struct rte_memseg_list *msl = &mcfg->memsegs[i]; > > > > + struct vhost_memory_region *mr; > > > > + void *start_addr; > > > > + uint64_t len; > > > > > > There is a rte_memseg_list_walk() - please do not iterate over memseg > > > lists manually. > > > > > > > rte_memseg_list_walk() can't be used here because > > prepare_vhost_memory_kernel() is sometimes called from a memory > > callback. It will then hang trying to get a read lock on > > memory_hotplug_lock. > > OK, so use rte_memseg_list_walk_thread_unsafe(). > > > I don't think the rte_memseg_list_walk_thread_unsafe() function is > > appropriate because prepare_vhost_memory_kernel() may not always be > > called from a memory callback. > > And how is this different? What you're doing here is identical to calling > rte_memseg_list_walk_thread_unsafe() (that's precisely what it does > internally - check the code!), except that you're doing it manually and not > using DPDK API, which makes your code dependent on internals of DPDK's > memory implementation. > > So, this function may or may not be called from a callback, but you're using > thread-unsafe walk anyway. I think you should call either thread-safe or > thread-unsafe version depending on whether you're being called from a > callback or not. Hmm, the real case is a bit more tricky. Even if this function isn't called from memory event callbacks, the "thread-safe" version list_walk() still can't be used. Because deadlock may happen. In virtio-user device start, it needs to do SET_MEM_TABLE for the vhost-backend. And to make sure that preparing and setting the memory table is atomic (and also to protect the device state), it needs a lock. So if it calls "thread-safe" version list_walk(), there will be two locks taken in below order: - the virtio-user device lock (taken by virtio_user_start_device()); - the memory hotplug lock (taken by rte_memseg_list_walk()); And above locks will be released in below order: - the memory hotplug lock (released by rte_memseg_list_walk()); - the virtio-user device lock (released by virtio_user_start_device()); And in virtio-user's memory event callback, it also needs to take the virtio-user device lock to make sure preparing and setting the memory table is atomic (and also to protect the device state), so the same device lock is needed here. And before virtio-user's memory event callback is called, the memory hotplug lock has already been taken by memory subsystem. So the locks are taken in below order: - the memory hotplug lock (It has been taken by memory subsystem before virtio-user's memory event callback being called); - the virtio-user device lock (taken by virtio_user_mem_event_cb()); So, if the virtio-user device start and memory callback events happen at the same time, deadlock may happen. And we can't always use rte_memseg_list_walk_thread_unsafe(), because by its definition, it's only expected to be called in memory callbacks: /** * ...... * @note This function does not perform any locking, and is only safe to call * from within memory-related callback functions. * ...... */ int __rte_experimental rte_memseg_list_walk_thread_unsafe(rte_memseg_list_walk_t func, void *arg); So both of rte_memseg_list_walk_thread_unsafe() and rte_memseg_list_walk() are not really suitable for this case. If we really want to use these helpers, we need to allow rte_memseg_*_walk_thread_unsafe() to be called when the callers have taken mcfg->memory_hotplug_lock, or add some extra memory APIs to allow callers to take the lock for rte_memseg_*_walk_thread_unsafe(). And we can take this lock in virtio_user_start_device() before taking the virtio-user device lock (so locks can be taken in the correct order). Thoughts? Thanks > > > > > > > > > > > - if (rte_memseg_contig_walk(add_memory_region, &wa) < 0) { > > > > - free(vm); > > > > - return NULL; > > > > + start_addr = msl->base_va; > > > > + len = msl->page_sz * msl->memseg_arr.len; > > > > + > > > > + if (start_addr == NULL || len == 0) > > > > + continue; > > > > + > > > > + if (region_nr >= max_regions) { > > > > + free(vm); > > > > + return NULL; > > > > + } > > > > + > > > > + mr = &vm->regions[region_nr++]; > > > > + mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr; > > > > + mr->userspace_addr = (uint64_t)(uintptr_t)start_addr; > > > > + mr->memory_size = len; > > > > + mr->mmap_offset = 0; /* flags_padding */ > > > > + > > > > + PMD_DRV_LOG(DEBUG, "index=%u, addr=%p len=%" PRIu64, > > > > + i, start_addr, len); > > > > } > > > > > > > > - vm->nregions = wa.region_nr; > > > > + vm->nregions = region_nr; > > > > vm->padding = 0; > > > > return vm; > > > > } > > > > > > > > > > > > > -- > > > Thanks, > > > Anatoly > > > > > -- > Thanks, > Anatoly ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-dev] [PATCH] net/virtio-user: fix memory hotplug support 2018-08-24 4:49 ` Tiwei Bie @ 2018-08-24 8:59 ` Burakov, Anatoly 2018-08-24 9:35 ` Tiwei Bie 0 siblings, 1 reply; 14+ messages in thread From: Burakov, Anatoly @ 2018-08-24 8:59 UTC (permalink / raw) To: Tiwei Bie; +Cc: Sean Harte, maxime.coquelin, zhihong.wang, dev, stable On 24-Aug-18 5:49 AM, Tiwei Bie wrote: > On Thu, Aug 23, 2018 at 03:01:30PM +0100, Burakov, Anatoly wrote: >> On 23-Aug-18 12:19 PM, Sean Harte wrote: >>> On Thu, 23 Aug 2018 at 10:05, Burakov, Anatoly >>> <anatoly.burakov@intel.com> wrote: >>>> >>>> On 23-Aug-18 3:57 AM, Tiwei Bie wrote: >>>>> Deadlock can occur when allocating memory if a vhost-kernel >>>>> based virtio-user device is in use. Besides, it's possible >>>>> to have much more than 64 non-contiguous hugepage backed >>>>> memory regions due to the memory hotplug, which may cause >>>>> problems when handling VHOST_SET_MEM_TABLE request. A better >>>>> solution is to have the virtio-user pass all the VA ranges >>>>> reserved by DPDK to vhost-kernel. >>>>> >>>>> Bugzilla ID: 81 >>>>> Fixes: 12ecb2f63b12 ("net/virtio-user: support memory hotplug") >>>>> Cc: stable@dpdk.org >>>>> >>>>> Reported-by: Seán Harte <seanbh@gmail.com> >>>>> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com> >>>>> --- >>>>> drivers/net/virtio/virtio_user/vhost_kernel.c | 64 ++++++++----------- >>>>> 1 file changed, 27 insertions(+), 37 deletions(-) >>>>> >>>>> diff --git a/drivers/net/virtio/virtio_user/vhost_kernel.c b/drivers/net/virtio/virtio_user/vhost_kernel.c >>>>> index b2444096c..49bd1b821 100644 >>>>> --- a/drivers/net/virtio/virtio_user/vhost_kernel.c >>>>> +++ b/drivers/net/virtio/virtio_user/vhost_kernel.c >>>>> @@ -70,41 +70,12 @@ static uint64_t vhost_req_user_to_kernel[] = { >>>>> [VHOST_USER_SET_MEM_TABLE] = VHOST_SET_MEM_TABLE, >>>>> }; >>>>> >>>>> -struct walk_arg { >>>>> - struct vhost_memory_kernel *vm; >>>>> - uint32_t region_nr; >>>>> -}; >>>>> -static int >>>>> -add_memory_region(const struct rte_memseg_list *msl __rte_unused, >>>>> - const struct rte_memseg *ms, size_t len, void *arg) >>>>> -{ >>>>> - struct walk_arg *wa = arg; >>>>> - struct vhost_memory_region *mr; >>>>> - void *start_addr; >>>>> - >>>>> - if (wa->region_nr >= max_regions) >>>>> - return -1; >>>>> - >>>>> - mr = &wa->vm->regions[wa->region_nr++]; >>>>> - start_addr = ms->addr; >>>>> - >>>>> - mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr; >>>>> - mr->userspace_addr = (uint64_t)(uintptr_t)start_addr; >>>>> - mr->memory_size = len; >>>>> - mr->mmap_offset = 0; >>>>> - >>>>> - return 0; >>>>> -} >>>>> - >>>>> -/* By default, vhost kernel module allows 64 regions, but DPDK allows >>>>> - * 256 segments. As a relief, below function merges those virtually >>>>> - * adjacent memsegs into one region. >>>>> - */ >>>>> static struct vhost_memory_kernel * >>>>> prepare_vhost_memory_kernel(void) >>>>> { >>>>> + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; >>>>> struct vhost_memory_kernel *vm; >>>>> - struct walk_arg wa; >>>>> + uint32_t region_nr = 0, i; >>>>> >>>>> vm = malloc(sizeof(struct vhost_memory_kernel) + >>>>> max_regions * >>>>> @@ -112,15 +83,34 @@ prepare_vhost_memory_kernel(void) >>>>> if (!vm) >>>>> return NULL; >>>>> >>>>> - wa.region_nr = 0; >>>>> - wa.vm = vm; >>>>> + for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { >>>>> + struct rte_memseg_list *msl = &mcfg->memsegs[i]; >>>>> + struct vhost_memory_region *mr; >>>>> + void *start_addr; >>>>> + uint64_t len; >>>> >>>> There is a rte_memseg_list_walk() - please do not iterate over memseg >>>> lists manually. >>>> >>> >>> rte_memseg_list_walk() can't be used here because >>> prepare_vhost_memory_kernel() is sometimes called from a memory >>> callback. It will then hang trying to get a read lock on >>> memory_hotplug_lock. >> >> OK, so use rte_memseg_list_walk_thread_unsafe(). >> >>> I don't think the rte_memseg_list_walk_thread_unsafe() function is >>> appropriate because prepare_vhost_memory_kernel() may not always be >>> called from a memory callback. >> >> And how is this different? What you're doing here is identical to calling >> rte_memseg_list_walk_thread_unsafe() (that's precisely what it does >> internally - check the code!), except that you're doing it manually and not >> using DPDK API, which makes your code dependent on internals of DPDK's >> memory implementation. >> >> So, this function may or may not be called from a callback, but you're using >> thread-unsafe walk anyway. I think you should call either thread-safe or >> thread-unsafe version depending on whether you're being called from a >> callback or not. > > Hmm, the real case is a bit more tricky. Even if this > function isn't called from memory event callbacks, the > "thread-safe" version list_walk() still can't be used. > Because deadlock may happen. > > In virtio-user device start, it needs to do SET_MEM_TABLE > for the vhost-backend. And to make sure that preparing and > setting the memory table is atomic (and also to protect the > device state), it needs a lock. So if it calls "thread-safe" > version list_walk(), there will be two locks taken in > below order: > > - the virtio-user device lock (taken by virtio_user_start_device()); > - the memory hotplug lock (taken by rte_memseg_list_walk()); > > And above locks will be released in below order: > > - the memory hotplug lock (released by rte_memseg_list_walk()); > - the virtio-user device lock (released by virtio_user_start_device()); > > And in virtio-user's memory event callback, it also needs > to take the virtio-user device lock to make sure preparing > and setting the memory table is atomic (and also to protect > the device state), so the same device lock is needed here. > And before virtio-user's memory event callback is called, > the memory hotplug lock has already been taken by memory > subsystem. So the locks are taken in below order: > > - the memory hotplug lock (It has been taken by memory subsystem > before virtio-user's memory event callback being called); > - the virtio-user device lock (taken by virtio_user_mem_event_cb()); > > So, if the virtio-user device start and memory callback > events happen at the same time, deadlock may happen. > > And we can't always use rte_memseg_list_walk_thread_unsafe(), > because by its definition, it's only expected to be called > in memory callbacks: > > /** > * ...... > * @note This function does not perform any locking, and is only safe to call > * from within memory-related callback functions. > * ...... > */ > int __rte_experimental > rte_memseg_list_walk_thread_unsafe(rte_memseg_list_walk_t func, void *arg); > > So both of rte_memseg_list_walk_thread_unsafe() and > rte_memseg_list_walk() are not really suitable for this > case. If we really want to use these helpers, we need > to allow rte_memseg_*_walk_thread_unsafe() to be called > when the callers have taken mcfg->memory_hotplug_lock, > or add some extra memory APIs to allow callers to take > the lock for rte_memseg_*_walk_thread_unsafe(). And we > can take this lock in virtio_user_start_device() before > taking the virtio-user device lock (so locks can be taken > in the correct order). Thoughts? You can know if this function is called from memory callback. You can probably also know if you're in the process of starting the device. The solution seems clear to me - check both? :) > > Thanks > >> >>> >>>>> >>>>> - if (rte_memseg_contig_walk(add_memory_region, &wa) < 0) { >>>>> - free(vm); >>>>> - return NULL; >>>>> + start_addr = msl->base_va; >>>>> + len = msl->page_sz * msl->memseg_arr.len; >>>>> + >>>>> + if (start_addr == NULL || len == 0) >>>>> + continue; >>>>> + >>>>> + if (region_nr >= max_regions) { >>>>> + free(vm); >>>>> + return NULL; >>>>> + } >>>>> + >>>>> + mr = &vm->regions[region_nr++]; >>>>> + mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr; >>>>> + mr->userspace_addr = (uint64_t)(uintptr_t)start_addr; >>>>> + mr->memory_size = len; >>>>> + mr->mmap_offset = 0; /* flags_padding */ >>>>> + >>>>> + PMD_DRV_LOG(DEBUG, "index=%u, addr=%p len=%" PRIu64, >>>>> + i, start_addr, len); >>>>> } >>>>> >>>>> - vm->nregions = wa.region_nr; >>>>> + vm->nregions = region_nr; >>>>> vm->padding = 0; >>>>> return vm; >>>>> } >>>>> >>>> >>>> >>>> -- >>>> Thanks, >>>> Anatoly >>> >> >> >> -- >> Thanks, >> Anatoly > -- Thanks, Anatoly ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-dev] [PATCH] net/virtio-user: fix memory hotplug support 2018-08-24 8:59 ` Burakov, Anatoly @ 2018-08-24 9:35 ` Tiwei Bie 2018-08-24 10:41 ` Burakov, Anatoly 0 siblings, 1 reply; 14+ messages in thread From: Tiwei Bie @ 2018-08-24 9:35 UTC (permalink / raw) To: Burakov, Anatoly; +Cc: Sean Harte, maxime.coquelin, zhihong.wang, dev, stable On Fri, Aug 24, 2018 at 09:59:42AM +0100, Burakov, Anatoly wrote: > On 24-Aug-18 5:49 AM, Tiwei Bie wrote: > > On Thu, Aug 23, 2018 at 03:01:30PM +0100, Burakov, Anatoly wrote: > >> On 23-Aug-18 12:19 PM, Sean Harte wrote: > >>> On Thu, 23 Aug 2018 at 10:05, Burakov, Anatoly > >>> <anatoly.burakov@intel.com> wrote: > >>>> > >>>> On 23-Aug-18 3:57 AM, Tiwei Bie wrote: > >>>>> Deadlock can occur when allocating memory if a vhost-kernel > >>>>> based virtio-user device is in use. Besides, it's possible > >>>>> to have much more than 64 non-contiguous hugepage backed > >>>>> memory regions due to the memory hotplug, which may cause > >>>>> problems when handling VHOST_SET_MEM_TABLE request. A better > >>>>> solution is to have the virtio-user pass all the VA ranges > >>>>> reserved by DPDK to vhost-kernel. > >>>>> > >>>>> Bugzilla ID: 81 > >>>>> Fixes: 12ecb2f63b12 ("net/virtio-user: support memory hotplug") > >>>>> Cc: stable@dpdk.org > >>>>> > >>>>> Reported-by: Seán Harte <seanbh@gmail.com> > >>>>> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com> > >>>>> --- > >>>>> drivers/net/virtio/virtio_user/vhost_kernel.c | 64 ++++++++----------- > >>>>> 1 file changed, 27 insertions(+), 37 deletions(-) > >>>>> > >>>>> diff --git a/drivers/net/virtio/virtio_user/vhost_kernel.c b/drivers/net/virtio/virtio_user/vhost_kernel.c > >>>>> index b2444096c..49bd1b821 100644 > >>>>> --- a/drivers/net/virtio/virtio_user/vhost_kernel.c > >>>>> +++ b/drivers/net/virtio/virtio_user/vhost_kernel.c > >>>>> @@ -70,41 +70,12 @@ static uint64_t vhost_req_user_to_kernel[] = { > >>>>> [VHOST_USER_SET_MEM_TABLE] = VHOST_SET_MEM_TABLE, > >>>>> }; > >>>>> > >>>>> -struct walk_arg { > >>>>> - struct vhost_memory_kernel *vm; > >>>>> - uint32_t region_nr; > >>>>> -}; > >>>>> -static int > >>>>> -add_memory_region(const struct rte_memseg_list *msl __rte_unused, > >>>>> - const struct rte_memseg *ms, size_t len, void *arg) > >>>>> -{ > >>>>> - struct walk_arg *wa = arg; > >>>>> - struct vhost_memory_region *mr; > >>>>> - void *start_addr; > >>>>> - > >>>>> - if (wa->region_nr >= max_regions) > >>>>> - return -1; > >>>>> - > >>>>> - mr = &wa->vm->regions[wa->region_nr++]; > >>>>> - start_addr = ms->addr; > >>>>> - > >>>>> - mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr; > >>>>> - mr->userspace_addr = (uint64_t)(uintptr_t)start_addr; > >>>>> - mr->memory_size = len; > >>>>> - mr->mmap_offset = 0; > >>>>> - > >>>>> - return 0; > >>>>> -} > >>>>> - > >>>>> -/* By default, vhost kernel module allows 64 regions, but DPDK allows > >>>>> - * 256 segments. As a relief, below function merges those virtually > >>>>> - * adjacent memsegs into one region. > >>>>> - */ > >>>>> static struct vhost_memory_kernel * > >>>>> prepare_vhost_memory_kernel(void) > >>>>> { > >>>>> + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; > >>>>> struct vhost_memory_kernel *vm; > >>>>> - struct walk_arg wa; > >>>>> + uint32_t region_nr = 0, i; > >>>>> > >>>>> vm = malloc(sizeof(struct vhost_memory_kernel) + > >>>>> max_regions * > >>>>> @@ -112,15 +83,34 @@ prepare_vhost_memory_kernel(void) > >>>>> if (!vm) > >>>>> return NULL; > >>>>> > >>>>> - wa.region_nr = 0; > >>>>> - wa.vm = vm; > >>>>> + for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { > >>>>> + struct rte_memseg_list *msl = &mcfg->memsegs[i]; > >>>>> + struct vhost_memory_region *mr; > >>>>> + void *start_addr; > >>>>> + uint64_t len; > >>>> > >>>> There is a rte_memseg_list_walk() - please do not iterate over memseg > >>>> lists manually. > >>>> > >>> > >>> rte_memseg_list_walk() can't be used here because > >>> prepare_vhost_memory_kernel() is sometimes called from a memory > >>> callback. It will then hang trying to get a read lock on > >>> memory_hotplug_lock. > >> > >> OK, so use rte_memseg_list_walk_thread_unsafe(). > >> > >>> I don't think the rte_memseg_list_walk_thread_unsafe() function is > >>> appropriate because prepare_vhost_memory_kernel() may not always be > >>> called from a memory callback. > >> > >> And how is this different? What you're doing here is identical to calling > >> rte_memseg_list_walk_thread_unsafe() (that's precisely what it does > >> internally - check the code!), except that you're doing it manually and not > >> using DPDK API, which makes your code dependent on internals of DPDK's > >> memory implementation. > >> > >> So, this function may or may not be called from a callback, but you're using > >> thread-unsafe walk anyway. I think you should call either thread-safe or > >> thread-unsafe version depending on whether you're being called from a > >> callback or not. > > > > Hmm, the real case is a bit more tricky. Even if this > > function isn't called from memory event callbacks, the > > "thread-safe" version list_walk() still can't be used. > > Because deadlock may happen. > > > > In virtio-user device start, it needs to do SET_MEM_TABLE > > for the vhost-backend. And to make sure that preparing and > > setting the memory table is atomic (and also to protect the > > device state), it needs a lock. So if it calls "thread-safe" > > version list_walk(), there will be two locks taken in > > below order: > > > > - the virtio-user device lock (taken by virtio_user_start_device()); > > - the memory hotplug lock (taken by rte_memseg_list_walk()); > > > > And above locks will be released in below order: > > > > - the memory hotplug lock (released by rte_memseg_list_walk()); > > - the virtio-user device lock (released by virtio_user_start_device()); > > > > And in virtio-user's memory event callback, it also needs > > to take the virtio-user device lock to make sure preparing > > and setting the memory table is atomic (and also to protect > > the device state), so the same device lock is needed here. > > And before virtio-user's memory event callback is called, > > the memory hotplug lock has already been taken by memory > > subsystem. So the locks are taken in below order: > > > > - the memory hotplug lock (It has been taken by memory subsystem > > before virtio-user's memory event callback being called); > > - the virtio-user device lock (taken by virtio_user_mem_event_cb()); > > > > So, if the virtio-user device start and memory callback > > events happen at the same time, deadlock may happen. > > > > And we can't always use rte_memseg_list_walk_thread_unsafe(), > > because by its definition, it's only expected to be called > > in memory callbacks: > > > > /** > > * ...... > > * @note This function does not perform any locking, and is only safe to call > > * from within memory-related callback functions. > > * ...... > > */ > > int __rte_experimental > > rte_memseg_list_walk_thread_unsafe(rte_memseg_list_walk_t func, void *arg); > > > > So both of rte_memseg_list_walk_thread_unsafe() and > > rte_memseg_list_walk() are not really suitable for this > > case. If we really want to use these helpers, we need > > to allow rte_memseg_*_walk_thread_unsafe() to be called > > when the callers have taken mcfg->memory_hotplug_lock, > > or add some extra memory APIs to allow callers to take > > the lock for rte_memseg_*_walk_thread_unsafe(). And we > > can take this lock in virtio_user_start_device() before > > taking the virtio-user device lock (so locks can be taken > > in the correct order). Thoughts? > > You can know if this function is called from memory callback. You can > probably also know if you're in the process of starting the device. The > solution seems clear to me - check both? :) Hmm.. You didn't get my point. :( I mean a lock within virtio driver is needed by virtio-user to avoid the race condition between the virtio-user device start and the virtio-user memory event callback (e.g. about the device state change). And to iterate the memseg lists, a lock within memory subsystem will be taken. So in both of the virtio device start and memory event handling, there are two locks will be taken -- And we need to take these locks in correct order to avoid deadlock, and it requires us to have a way to take the lock for rte_memseg_*_thread_unsafe() in callers. > > > > > Thanks > > > >> > >>> > >>>>> > >>>>> - if (rte_memseg_contig_walk(add_memory_region, &wa) < 0) { > >>>>> - free(vm); > >>>>> - return NULL; > >>>>> + start_addr = msl->base_va; > >>>>> + len = msl->page_sz * msl->memseg_arr.len; > >>>>> + > >>>>> + if (start_addr == NULL || len == 0) > >>>>> + continue; > >>>>> + > >>>>> + if (region_nr >= max_regions) { > >>>>> + free(vm); > >>>>> + return NULL; > >>>>> + } > >>>>> + > >>>>> + mr = &vm->regions[region_nr++]; > >>>>> + mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr; > >>>>> + mr->userspace_addr = (uint64_t)(uintptr_t)start_addr; > >>>>> + mr->memory_size = len; > >>>>> + mr->mmap_offset = 0; /* flags_padding */ > >>>>> + > >>>>> + PMD_DRV_LOG(DEBUG, "index=%u, addr=%p len=%" PRIu64, > >>>>> + i, start_addr, len); > >>>>> } > >>>>> > >>>>> - vm->nregions = wa.region_nr; > >>>>> + vm->nregions = region_nr; > >>>>> vm->padding = 0; > >>>>> return vm; > >>>>> } > >>>>> > >>>> > >>>> > >>>> -- > >>>> Thanks, > >>>> Anatoly > >>> > >> > >> > >> -- > >> Thanks, > >> Anatoly > > > > > -- > Thanks, > Anatoly ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-dev] [PATCH] net/virtio-user: fix memory hotplug support 2018-08-24 9:35 ` Tiwei Bie @ 2018-08-24 10:41 ` Burakov, Anatoly 2018-08-24 15:19 ` Burakov, Anatoly 0 siblings, 1 reply; 14+ messages in thread From: Burakov, Anatoly @ 2018-08-24 10:41 UTC (permalink / raw) To: Tiwei Bie; +Cc: Sean Harte, maxime.coquelin, zhihong.wang, dev, stable On 24-Aug-18 10:35 AM, Tiwei Bie wrote: > On Fri, Aug 24, 2018 at 09:59:42AM +0100, Burakov, Anatoly wrote: >> On 24-Aug-18 5:49 AM, Tiwei Bie wrote: >>> On Thu, Aug 23, 2018 at 03:01:30PM +0100, Burakov, Anatoly wrote: >>>> On 23-Aug-18 12:19 PM, Sean Harte wrote: >>>>> On Thu, 23 Aug 2018 at 10:05, Burakov, Anatoly >>>>> <anatoly.burakov@intel.com> wrote: >>>>>> >>>>>> On 23-Aug-18 3:57 AM, Tiwei Bie wrote: >>>>>>> Deadlock can occur when allocating memory if a vhost-kernel >>>>>>> based virtio-user device is in use. Besides, it's possible >>>>>>> to have much more than 64 non-contiguous hugepage backed >>>>>>> memory regions due to the memory hotplug, which may cause >>>>>>> problems when handling VHOST_SET_MEM_TABLE request. A better >>>>>>> solution is to have the virtio-user pass all the VA ranges >>>>>>> reserved by DPDK to vhost-kernel. >>>>>>> >>>>>>> Bugzilla ID: 81 >>>>>>> Fixes: 12ecb2f63b12 ("net/virtio-user: support memory hotplug") >>>>>>> Cc: stable@dpdk.org >>>>>>> >>>>>>> Reported-by: Seán Harte <seanbh@gmail.com> >>>>>>> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com> >>>>>>> --- >>>>>>> drivers/net/virtio/virtio_user/vhost_kernel.c | 64 ++++++++----------- >>>>>>> 1 file changed, 27 insertions(+), 37 deletions(-) >>>>>>> >>>>>>> diff --git a/drivers/net/virtio/virtio_user/vhost_kernel.c b/drivers/net/virtio/virtio_user/vhost_kernel.c >>>>>>> index b2444096c..49bd1b821 100644 >>>>>>> --- a/drivers/net/virtio/virtio_user/vhost_kernel.c >>>>>>> +++ b/drivers/net/virtio/virtio_user/vhost_kernel.c >>>>>>> @@ -70,41 +70,12 @@ static uint64_t vhost_req_user_to_kernel[] = { >>>>>>> [VHOST_USER_SET_MEM_TABLE] = VHOST_SET_MEM_TABLE, >>>>>>> }; >>>>>>> >>>>>>> -struct walk_arg { >>>>>>> - struct vhost_memory_kernel *vm; >>>>>>> - uint32_t region_nr; >>>>>>> -}; >>>>>>> -static int >>>>>>> -add_memory_region(const struct rte_memseg_list *msl __rte_unused, >>>>>>> - const struct rte_memseg *ms, size_t len, void *arg) >>>>>>> -{ >>>>>>> - struct walk_arg *wa = arg; >>>>>>> - struct vhost_memory_region *mr; >>>>>>> - void *start_addr; >>>>>>> - >>>>>>> - if (wa->region_nr >= max_regions) >>>>>>> - return -1; >>>>>>> - >>>>>>> - mr = &wa->vm->regions[wa->region_nr++]; >>>>>>> - start_addr = ms->addr; >>>>>>> - >>>>>>> - mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr; >>>>>>> - mr->userspace_addr = (uint64_t)(uintptr_t)start_addr; >>>>>>> - mr->memory_size = len; >>>>>>> - mr->mmap_offset = 0; >>>>>>> - >>>>>>> - return 0; >>>>>>> -} >>>>>>> - >>>>>>> -/* By default, vhost kernel module allows 64 regions, but DPDK allows >>>>>>> - * 256 segments. As a relief, below function merges those virtually >>>>>>> - * adjacent memsegs into one region. >>>>>>> - */ >>>>>>> static struct vhost_memory_kernel * >>>>>>> prepare_vhost_memory_kernel(void) >>>>>>> { >>>>>>> + struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config; >>>>>>> struct vhost_memory_kernel *vm; >>>>>>> - struct walk_arg wa; >>>>>>> + uint32_t region_nr = 0, i; >>>>>>> >>>>>>> vm = malloc(sizeof(struct vhost_memory_kernel) + >>>>>>> max_regions * >>>>>>> @@ -112,15 +83,34 @@ prepare_vhost_memory_kernel(void) >>>>>>> if (!vm) >>>>>>> return NULL; >>>>>>> >>>>>>> - wa.region_nr = 0; >>>>>>> - wa.vm = vm; >>>>>>> + for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { >>>>>>> + struct rte_memseg_list *msl = &mcfg->memsegs[i]; >>>>>>> + struct vhost_memory_region *mr; >>>>>>> + void *start_addr; >>>>>>> + uint64_t len; >>>>>> >>>>>> There is a rte_memseg_list_walk() - please do not iterate over memseg >>>>>> lists manually. >>>>>> >>>>> >>>>> rte_memseg_list_walk() can't be used here because >>>>> prepare_vhost_memory_kernel() is sometimes called from a memory >>>>> callback. It will then hang trying to get a read lock on >>>>> memory_hotplug_lock. >>>> >>>> OK, so use rte_memseg_list_walk_thread_unsafe(). >>>> >>>>> I don't think the rte_memseg_list_walk_thread_unsafe() function is >>>>> appropriate because prepare_vhost_memory_kernel() may not always be >>>>> called from a memory callback. >>>> >>>> And how is this different? What you're doing here is identical to calling >>>> rte_memseg_list_walk_thread_unsafe() (that's precisely what it does >>>> internally - check the code!), except that you're doing it manually and not >>>> using DPDK API, which makes your code dependent on internals of DPDK's >>>> memory implementation. >>>> >>>> So, this function may or may not be called from a callback, but you're using >>>> thread-unsafe walk anyway. I think you should call either thread-safe or >>>> thread-unsafe version depending on whether you're being called from a >>>> callback or not. >>> >>> Hmm, the real case is a bit more tricky. Even if this >>> function isn't called from memory event callbacks, the >>> "thread-safe" version list_walk() still can't be used. >>> Because deadlock may happen. >>> >>> In virtio-user device start, it needs to do SET_MEM_TABLE >>> for the vhost-backend. And to make sure that preparing and >>> setting the memory table is atomic (and also to protect the >>> device state), it needs a lock. So if it calls "thread-safe" >>> version list_walk(), there will be two locks taken in >>> below order: >>> >>> - the virtio-user device lock (taken by virtio_user_start_device()); >>> - the memory hotplug lock (taken by rte_memseg_list_walk()); >>> >>> And above locks will be released in below order: >>> >>> - the memory hotplug lock (released by rte_memseg_list_walk()); >>> - the virtio-user device lock (released by virtio_user_start_device()); >>> >>> And in virtio-user's memory event callback, it also needs >>> to take the virtio-user device lock to make sure preparing >>> and setting the memory table is atomic (and also to protect >>> the device state), so the same device lock is needed here. >>> And before virtio-user's memory event callback is called, >>> the memory hotplug lock has already been taken by memory >>> subsystem. So the locks are taken in below order: >>> >>> - the memory hotplug lock (It has been taken by memory subsystem >>> before virtio-user's memory event callback being called); >>> - the virtio-user device lock (taken by virtio_user_mem_event_cb()); >>> >>> So, if the virtio-user device start and memory callback >>> events happen at the same time, deadlock may happen. >>> >>> And we can't always use rte_memseg_list_walk_thread_unsafe(), >>> because by its definition, it's only expected to be called >>> in memory callbacks: >>> >>> /** >>> * ...... >>> * @note This function does not perform any locking, and is only safe to call >>> * from within memory-related callback functions. >>> * ...... >>> */ >>> int __rte_experimental >>> rte_memseg_list_walk_thread_unsafe(rte_memseg_list_walk_t func, void *arg); >>> >>> So both of rte_memseg_list_walk_thread_unsafe() and >>> rte_memseg_list_walk() are not really suitable for this >>> case. If we really want to use these helpers, we need >>> to allow rte_memseg_*_walk_thread_unsafe() to be called >>> when the callers have taken mcfg->memory_hotplug_lock, >>> or add some extra memory APIs to allow callers to take >>> the lock for rte_memseg_*_walk_thread_unsafe(). And we >>> can take this lock in virtio_user_start_device() before >>> taking the virtio-user device lock (so locks can be taken >>> in the correct order). Thoughts? >> >> You can know if this function is called from memory callback. You can >> probably also know if you're in the process of starting the device. The >> solution seems clear to me - check both? :) > > Hmm.. You didn't get my point. :( > > I mean a lock within virtio driver is needed by virtio-user > to avoid the race condition between the virtio-user device > start and the virtio-user memory event callback (e.g. about > the device state change). And to iterate the memseg lists, > a lock within memory subsystem will be taken. So in both of > the virtio device start and memory event handling, there are > two locks will be taken -- And we need to take these locks > in correct order to avoid deadlock, and it requires us to > have a way to take the lock for rte_memseg_*_thread_unsafe() > in callers. I'm afraid i'm still not getting your point :( You know that you can either get called from memory callback, or not from memory callback. Both of these times, the virtio device is locked. So, where does the race come from? You take out your device lock, and by the time you need to iterate through memsegs, you know that you either were or weren't called from memory callback, which means you can pick either thread-safe or thread-unsafe version. Can you please draw up a step-by-step example where race can happen that cannot be solved the way i suggested above? I.e. thread 1 thread 2 do this do that > > >> >>> >>> Thanks >>> >>>> >>>>> >>>>>>> >>>>>>> - if (rte_memseg_contig_walk(add_memory_region, &wa) < 0) { >>>>>>> - free(vm); >>>>>>> - return NULL; >>>>>>> + start_addr = msl->base_va; >>>>>>> + len = msl->page_sz * msl->memseg_arr.len; >>>>>>> + >>>>>>> + if (start_addr == NULL || len == 0) >>>>>>> + continue; >>>>>>> + >>>>>>> + if (region_nr >= max_regions) { >>>>>>> + free(vm); >>>>>>> + return NULL; >>>>>>> + } >>>>>>> + >>>>>>> + mr = &vm->regions[region_nr++]; >>>>>>> + mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr; >>>>>>> + mr->userspace_addr = (uint64_t)(uintptr_t)start_addr; >>>>>>> + mr->memory_size = len; >>>>>>> + mr->mmap_offset = 0; /* flags_padding */ >>>>>>> + >>>>>>> + PMD_DRV_LOG(DEBUG, "index=%u, addr=%p len=%" PRIu64, >>>>>>> + i, start_addr, len); >>>>>>> } >>>>>>> >>>>>>> - vm->nregions = wa.region_nr; >>>>>>> + vm->nregions = region_nr; >>>>>>> vm->padding = 0; >>>>>>> return vm; >>>>>>> } >>>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Thanks, >>>>>> Anatoly >>>>> >>>> >>>> >>>> -- >>>> Thanks, >>>> Anatoly >>> >> >> >> -- >> Thanks, >> Anatoly > -- Thanks, Anatoly ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-dev] [PATCH] net/virtio-user: fix memory hotplug support 2018-08-24 10:41 ` Burakov, Anatoly @ 2018-08-24 15:19 ` Burakov, Anatoly 2018-08-24 15:51 ` Sean Harte 0 siblings, 1 reply; 14+ messages in thread From: Burakov, Anatoly @ 2018-08-24 15:19 UTC (permalink / raw) To: Tiwei Bie; +Cc: Sean Harte, maxime.coquelin, zhihong.wang, dev, stable On 24-Aug-18 11:41 AM, Burakov, Anatoly wrote: > On 24-Aug-18 10:35 AM, Tiwei Bie wrote: >> On Fri, Aug 24, 2018 at 09:59:42AM +0100, Burakov, Anatoly wrote: >>> On 24-Aug-18 5:49 AM, Tiwei Bie wrote: >>>> On Thu, Aug 23, 2018 at 03:01:30PM +0100, Burakov, Anatoly wrote: >>>>> On 23-Aug-18 12:19 PM, Sean Harte wrote: >>>>>> On Thu, 23 Aug 2018 at 10:05, Burakov, Anatoly >>>>>> <anatoly.burakov@intel.com> wrote: >>>>>>> >>>>>>> On 23-Aug-18 3:57 AM, Tiwei Bie wrote: >>>>>>>> Deadlock can occur when allocating memory if a vhost-kernel >>>>>>>> based virtio-user device is in use. Besides, it's possible >>>>>>>> to have much more than 64 non-contiguous hugepage backed >>>>>>>> memory regions due to the memory hotplug, which may cause >>>>>>>> problems when handling VHOST_SET_MEM_TABLE request. A better >>>>>>>> solution is to have the virtio-user pass all the VA ranges >>>>>>>> reserved by DPDK to vhost-kernel. >>>>>>>> >>>>>>>> Bugzilla ID: 81 >>>>>>>> Fixes: 12ecb2f63b12 ("net/virtio-user: support memory hotplug") >>>>>>>> Cc: stable@dpdk.org >>>>>>>> >>>>>>>> Reported-by: Seán Harte <seanbh@gmail.com> >>>>>>>> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com> >>>>>>>> --- >>>>>>>> drivers/net/virtio/virtio_user/vhost_kernel.c | 64 >>>>>>>> ++++++++----------- >>>>>>>> 1 file changed, 27 insertions(+), 37 deletions(-) >>>>>>>> >>>>>>>> diff --git a/drivers/net/virtio/virtio_user/vhost_kernel.c >>>>>>>> b/drivers/net/virtio/virtio_user/vhost_kernel.c >>>>>>>> index b2444096c..49bd1b821 100644 >>>>>>>> --- a/drivers/net/virtio/virtio_user/vhost_kernel.c >>>>>>>> +++ b/drivers/net/virtio/virtio_user/vhost_kernel.c >>>>>>>> @@ -70,41 +70,12 @@ static uint64_t vhost_req_user_to_kernel[] = { >>>>>>>> [VHOST_USER_SET_MEM_TABLE] = VHOST_SET_MEM_TABLE, >>>>>>>> }; >>>>>>>> >>>>>>>> -struct walk_arg { >>>>>>>> - struct vhost_memory_kernel *vm; >>>>>>>> - uint32_t region_nr; >>>>>>>> -}; >>>>>>>> -static int >>>>>>>> -add_memory_region(const struct rte_memseg_list *msl __rte_unused, >>>>>>>> - const struct rte_memseg *ms, size_t len, void *arg) >>>>>>>> -{ >>>>>>>> - struct walk_arg *wa = arg; >>>>>>>> - struct vhost_memory_region *mr; >>>>>>>> - void *start_addr; >>>>>>>> - >>>>>>>> - if (wa->region_nr >= max_regions) >>>>>>>> - return -1; >>>>>>>> - >>>>>>>> - mr = &wa->vm->regions[wa->region_nr++]; >>>>>>>> - start_addr = ms->addr; >>>>>>>> - >>>>>>>> - mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr; >>>>>>>> - mr->userspace_addr = (uint64_t)(uintptr_t)start_addr; >>>>>>>> - mr->memory_size = len; >>>>>>>> - mr->mmap_offset = 0; >>>>>>>> - >>>>>>>> - return 0; >>>>>>>> -} >>>>>>>> - >>>>>>>> -/* By default, vhost kernel module allows 64 regions, but DPDK >>>>>>>> allows >>>>>>>> - * 256 segments. As a relief, below function merges those >>>>>>>> virtually >>>>>>>> - * adjacent memsegs into one region. >>>>>>>> - */ >>>>>>>> static struct vhost_memory_kernel * >>>>>>>> prepare_vhost_memory_kernel(void) >>>>>>>> { >>>>>>>> + struct rte_mem_config *mcfg = >>>>>>>> rte_eal_get_configuration()->mem_config; >>>>>>>> struct vhost_memory_kernel *vm; >>>>>>>> - struct walk_arg wa; >>>>>>>> + uint32_t region_nr = 0, i; >>>>>>>> >>>>>>>> vm = malloc(sizeof(struct vhost_memory_kernel) + >>>>>>>> max_regions * >>>>>>>> @@ -112,15 +83,34 @@ prepare_vhost_memory_kernel(void) >>>>>>>> if (!vm) >>>>>>>> return NULL; >>>>>>>> >>>>>>>> - wa.region_nr = 0; >>>>>>>> - wa.vm = vm; >>>>>>>> + for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { >>>>>>>> + struct rte_memseg_list *msl = &mcfg->memsegs[i]; >>>>>>>> + struct vhost_memory_region *mr; >>>>>>>> + void *start_addr; >>>>>>>> + uint64_t len; >>>>>>> >>>>>>> There is a rte_memseg_list_walk() - please do not iterate over >>>>>>> memseg >>>>>>> lists manually. >>>>>>> >>>>>> >>>>>> rte_memseg_list_walk() can't be used here because >>>>>> prepare_vhost_memory_kernel() is sometimes called from a memory >>>>>> callback. It will then hang trying to get a read lock on >>>>>> memory_hotplug_lock. >>>>> >>>>> OK, so use rte_memseg_list_walk_thread_unsafe(). >>>>> >>>>>> I don't think the rte_memseg_list_walk_thread_unsafe() function is >>>>>> appropriate because prepare_vhost_memory_kernel() may not always be >>>>>> called from a memory callback. >>>>> >>>>> And how is this different? What you're doing here is identical to >>>>> calling >>>>> rte_memseg_list_walk_thread_unsafe() (that's precisely what it does >>>>> internally - check the code!), except that you're doing it manually >>>>> and not >>>>> using DPDK API, which makes your code dependent on internals of DPDK's >>>>> memory implementation. >>>>> >>>>> So, this function may or may not be called from a callback, but >>>>> you're using >>>>> thread-unsafe walk anyway. I think you should call either >>>>> thread-safe or >>>>> thread-unsafe version depending on whether you're being called from a >>>>> callback or not. >>>> >>>> Hmm, the real case is a bit more tricky. Even if this >>>> function isn't called from memory event callbacks, the >>>> "thread-safe" version list_walk() still can't be used. >>>> Because deadlock may happen. >>>> >>>> In virtio-user device start, it needs to do SET_MEM_TABLE >>>> for the vhost-backend. And to make sure that preparing and >>>> setting the memory table is atomic (and also to protect the >>>> device state), it needs a lock. So if it calls "thread-safe" >>>> version list_walk(), there will be two locks taken in >>>> below order: >>>> >>>> - the virtio-user device lock (taken by virtio_user_start_device()); >>>> - the memory hotplug lock (taken by rte_memseg_list_walk()); >>>> >>>> And above locks will be released in below order: >>>> >>>> - the memory hotplug lock (released by rte_memseg_list_walk()); >>>> - the virtio-user device lock (released by virtio_user_start_device()); >>>> >>>> And in virtio-user's memory event callback, it also needs >>>> to take the virtio-user device lock to make sure preparing >>>> and setting the memory table is atomic (and also to protect >>>> the device state), so the same device lock is needed here. >>>> And before virtio-user's memory event callback is called, >>>> the memory hotplug lock has already been taken by memory >>>> subsystem. So the locks are taken in below order: >>>> >>>> - the memory hotplug lock (It has been taken by memory subsystem >>>> before virtio-user's memory event callback being called); >>>> - the virtio-user device lock (taken by virtio_user_mem_event_cb()); >>>> >>>> So, if the virtio-user device start and memory callback >>>> events happen at the same time, deadlock may happen. >>>> >>>> And we can't always use rte_memseg_list_walk_thread_unsafe(), >>>> because by its definition, it's only expected to be called >>>> in memory callbacks: >>>> >>>> /** >>>> * ...... >>>> * @note This function does not perform any locking, and is only >>>> safe to call >>>> * from within memory-related callback functions. >>>> * ...... >>>> */ >>>> int __rte_experimental >>>> rte_memseg_list_walk_thread_unsafe(rte_memseg_list_walk_t func, void >>>> *arg); >>>> >>>> So both of rte_memseg_list_walk_thread_unsafe() and >>>> rte_memseg_list_walk() are not really suitable for this >>>> case. If we really want to use these helpers, we need >>>> to allow rte_memseg_*_walk_thread_unsafe() to be called >>>> when the callers have taken mcfg->memory_hotplug_lock, >>>> or add some extra memory APIs to allow callers to take >>>> the lock for rte_memseg_*_walk_thread_unsafe(). And we >>>> can take this lock in virtio_user_start_device() before >>>> taking the virtio-user device lock (so locks can be taken >>>> in the correct order). Thoughts? >>> >>> You can know if this function is called from memory callback. You can >>> probably also know if you're in the process of starting the device. The >>> solution seems clear to me - check both? :) >> >> Hmm.. You didn't get my point. :( >> >> I mean a lock within virtio driver is needed by virtio-user >> to avoid the race condition between the virtio-user device >> start and the virtio-user memory event callback (e.g. about >> the device state change). And to iterate the memseg lists, >> a lock within memory subsystem will be taken. So in both of >> the virtio device start and memory event handling, there are >> two locks will be taken -- And we need to take these locks >> in correct order to avoid deadlock, and it requires us to >> have a way to take the lock for rte_memseg_*_thread_unsafe() >> in callers. > > I'm afraid i'm still not getting your point :( > > You know that you can either get called from memory callback, or not > from memory callback. Both of these times, the virtio device is locked. > So, where does the race come from? You take out your device lock, and by > the time you need to iterate through memsegs, you know that you either > were or weren't called from memory callback, which means you can pick > either thread-safe or thread-unsafe version. > > Can you please draw up a step-by-step example where race can happen that > cannot be solved the way i suggested above? > > I.e. > > thread 1 thread 2 > do this > do that For the benefit of public discussion, the following is result of our internal discussion on this topic: the deadlock may happen because we take virtio lock and hotplug lock in different order. So, thread 1 may do device start, which will take out virtio device lock, then attempt to iterate memory regions thereby taking memory hotplug lock. At the same time, thread 2 might trigger an allocation, and virtio will receive a callback, which will lock hotplug, and then attempt to lock virtio device. In other words, Thread 1 Thread 2 lock virtio device lock hotplug (waits for hotplug unlock) (waits for virtio device unlock) The solution to this is not trivial, and we haven't come up with anything to fix this that doesn't involve pretty serious changes to the memory subsystem. Ideas welcome :) -- Thanks, Anatoly ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-dev] [PATCH] net/virtio-user: fix memory hotplug support 2018-08-24 15:19 ` Burakov, Anatoly @ 2018-08-24 15:51 ` Sean Harte 2018-08-27 9:15 ` Burakov, Anatoly 0 siblings, 1 reply; 14+ messages in thread From: Sean Harte @ 2018-08-24 15:51 UTC (permalink / raw) To: anatoly.burakov; +Cc: tiwei.bie, maxime.coquelin, zhihong.wang, dev, stable On Fri, 24 Aug 2018 at 16:19, Burakov, Anatoly <anatoly.burakov@intel.com> wrote: > > On 24-Aug-18 11:41 AM, Burakov, Anatoly wrote: > > On 24-Aug-18 10:35 AM, Tiwei Bie wrote: > >> On Fri, Aug 24, 2018 at 09:59:42AM +0100, Burakov, Anatoly wrote: > >>> On 24-Aug-18 5:49 AM, Tiwei Bie wrote: > >>>> On Thu, Aug 23, 2018 at 03:01:30PM +0100, Burakov, Anatoly wrote: > >>>>> On 23-Aug-18 12:19 PM, Sean Harte wrote: > >>>>>> On Thu, 23 Aug 2018 at 10:05, Burakov, Anatoly > >>>>>> <anatoly.burakov@intel.com> wrote: > >>>>>>> > >>>>>>> On 23-Aug-18 3:57 AM, Tiwei Bie wrote: > >>>>>>>> Deadlock can occur when allocating memory if a vhost-kernel > >>>>>>>> based virtio-user device is in use. Besides, it's possible > >>>>>>>> to have much more than 64 non-contiguous hugepage backed > >>>>>>>> memory regions due to the memory hotplug, which may cause > >>>>>>>> problems when handling VHOST_SET_MEM_TABLE request. A better > >>>>>>>> solution is to have the virtio-user pass all the VA ranges > >>>>>>>> reserved by DPDK to vhost-kernel. > >>>>>>>> > >>>>>>>> Bugzilla ID: 81 > >>>>>>>> Fixes: 12ecb2f63b12 ("net/virtio-user: support memory hotplug") > >>>>>>>> Cc: stable@dpdk.org > >>>>>>>> > >>>>>>>> Reported-by: Seán Harte <seanbh@gmail.com> > >>>>>>>> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com> > >>>>>>>> --- > >>>>>>>> drivers/net/virtio/virtio_user/vhost_kernel.c | 64 > >>>>>>>> ++++++++----------- > >>>>>>>> 1 file changed, 27 insertions(+), 37 deletions(-) > >>>>>>>> > >>>>>>>> diff --git a/drivers/net/virtio/virtio_user/vhost_kernel.c > >>>>>>>> b/drivers/net/virtio/virtio_user/vhost_kernel.c > >>>>>>>> index b2444096c..49bd1b821 100644 > >>>>>>>> --- a/drivers/net/virtio/virtio_user/vhost_kernel.c > >>>>>>>> +++ b/drivers/net/virtio/virtio_user/vhost_kernel.c > >>>>>>>> @@ -70,41 +70,12 @@ static uint64_t vhost_req_user_to_kernel[] = { > >>>>>>>> [VHOST_USER_SET_MEM_TABLE] = VHOST_SET_MEM_TABLE, > >>>>>>>> }; > >>>>>>>> > >>>>>>>> -struct walk_arg { > >>>>>>>> - struct vhost_memory_kernel *vm; > >>>>>>>> - uint32_t region_nr; > >>>>>>>> -}; > >>>>>>>> -static int > >>>>>>>> -add_memory_region(const struct rte_memseg_list *msl __rte_unused, > >>>>>>>> - const struct rte_memseg *ms, size_t len, void *arg) > >>>>>>>> -{ > >>>>>>>> - struct walk_arg *wa = arg; > >>>>>>>> - struct vhost_memory_region *mr; > >>>>>>>> - void *start_addr; > >>>>>>>> - > >>>>>>>> - if (wa->region_nr >= max_regions) > >>>>>>>> - return -1; > >>>>>>>> - > >>>>>>>> - mr = &wa->vm->regions[wa->region_nr++]; > >>>>>>>> - start_addr = ms->addr; > >>>>>>>> - > >>>>>>>> - mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr; > >>>>>>>> - mr->userspace_addr = (uint64_t)(uintptr_t)start_addr; > >>>>>>>> - mr->memory_size = len; > >>>>>>>> - mr->mmap_offset = 0; > >>>>>>>> - > >>>>>>>> - return 0; > >>>>>>>> -} > >>>>>>>> - > >>>>>>>> -/* By default, vhost kernel module allows 64 regions, but DPDK > >>>>>>>> allows > >>>>>>>> - * 256 segments. As a relief, below function merges those > >>>>>>>> virtually > >>>>>>>> - * adjacent memsegs into one region. > >>>>>>>> - */ > >>>>>>>> static struct vhost_memory_kernel * > >>>>>>>> prepare_vhost_memory_kernel(void) > >>>>>>>> { > >>>>>>>> + struct rte_mem_config *mcfg = > >>>>>>>> rte_eal_get_configuration()->mem_config; > >>>>>>>> struct vhost_memory_kernel *vm; > >>>>>>>> - struct walk_arg wa; > >>>>>>>> + uint32_t region_nr = 0, i; > >>>>>>>> > >>>>>>>> vm = malloc(sizeof(struct vhost_memory_kernel) + > >>>>>>>> max_regions * > >>>>>>>> @@ -112,15 +83,34 @@ prepare_vhost_memory_kernel(void) > >>>>>>>> if (!vm) > >>>>>>>> return NULL; > >>>>>>>> > >>>>>>>> - wa.region_nr = 0; > >>>>>>>> - wa.vm = vm; > >>>>>>>> + for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { > >>>>>>>> + struct rte_memseg_list *msl = &mcfg->memsegs[i]; > >>>>>>>> + struct vhost_memory_region *mr; > >>>>>>>> + void *start_addr; > >>>>>>>> + uint64_t len; > >>>>>>> > >>>>>>> There is a rte_memseg_list_walk() - please do not iterate over > >>>>>>> memseg > >>>>>>> lists manually. > >>>>>>> > >>>>>> > >>>>>> rte_memseg_list_walk() can't be used here because > >>>>>> prepare_vhost_memory_kernel() is sometimes called from a memory > >>>>>> callback. It will then hang trying to get a read lock on > >>>>>> memory_hotplug_lock. > >>>>> > >>>>> OK, so use rte_memseg_list_walk_thread_unsafe(). > >>>>> > >>>>>> I don't think the rte_memseg_list_walk_thread_unsafe() function is > >>>>>> appropriate because prepare_vhost_memory_kernel() may not always be > >>>>>> called from a memory callback. > >>>>> > >>>>> And how is this different? What you're doing here is identical to > >>>>> calling > >>>>> rte_memseg_list_walk_thread_unsafe() (that's precisely what it does > >>>>> internally - check the code!), except that you're doing it manually > >>>>> and not > >>>>> using DPDK API, which makes your code dependent on internals of DPDK's > >>>>> memory implementation. > >>>>> > >>>>> So, this function may or may not be called from a callback, but > >>>>> you're using > >>>>> thread-unsafe walk anyway. I think you should call either > >>>>> thread-safe or > >>>>> thread-unsafe version depending on whether you're being called from a > >>>>> callback or not. > >>>> > >>>> Hmm, the real case is a bit more tricky. Even if this > >>>> function isn't called from memory event callbacks, the > >>>> "thread-safe" version list_walk() still can't be used. > >>>> Because deadlock may happen. > >>>> > >>>> In virtio-user device start, it needs to do SET_MEM_TABLE > >>>> for the vhost-backend. And to make sure that preparing and > >>>> setting the memory table is atomic (and also to protect the > >>>> device state), it needs a lock. So if it calls "thread-safe" > >>>> version list_walk(), there will be two locks taken in > >>>> below order: > >>>> > >>>> - the virtio-user device lock (taken by virtio_user_start_device()); > >>>> - the memory hotplug lock (taken by rte_memseg_list_walk()); > >>>> > >>>> And above locks will be released in below order: > >>>> > >>>> - the memory hotplug lock (released by rte_memseg_list_walk()); > >>>> - the virtio-user device lock (released by virtio_user_start_device()); > >>>> > >>>> And in virtio-user's memory event callback, it also needs > >>>> to take the virtio-user device lock to make sure preparing > >>>> and setting the memory table is atomic (and also to protect > >>>> the device state), so the same device lock is needed here. > >>>> And before virtio-user's memory event callback is called, > >>>> the memory hotplug lock has already been taken by memory > >>>> subsystem. So the locks are taken in below order: > >>>> > >>>> - the memory hotplug lock (It has been taken by memory subsystem > >>>> before virtio-user's memory event callback being called); > >>>> - the virtio-user device lock (taken by virtio_user_mem_event_cb()); > >>>> > >>>> So, if the virtio-user device start and memory callback > >>>> events happen at the same time, deadlock may happen. > >>>> > >>>> And we can't always use rte_memseg_list_walk_thread_unsafe(), > >>>> because by its definition, it's only expected to be called > >>>> in memory callbacks: > >>>> > >>>> /** > >>>> * ...... > >>>> * @note This function does not perform any locking, and is only > >>>> safe to call > >>>> * from within memory-related callback functions. > >>>> * ...... > >>>> */ > >>>> int __rte_experimental > >>>> rte_memseg_list_walk_thread_unsafe(rte_memseg_list_walk_t func, void > >>>> *arg); > >>>> > >>>> So both of rte_memseg_list_walk_thread_unsafe() and > >>>> rte_memseg_list_walk() are not really suitable for this > >>>> case. If we really want to use these helpers, we need > >>>> to allow rte_memseg_*_walk_thread_unsafe() to be called > >>>> when the callers have taken mcfg->memory_hotplug_lock, > >>>> or add some extra memory APIs to allow callers to take > >>>> the lock for rte_memseg_*_walk_thread_unsafe(). And we > >>>> can take this lock in virtio_user_start_device() before > >>>> taking the virtio-user device lock (so locks can be taken > >>>> in the correct order). Thoughts? > >>> > >>> You can know if this function is called from memory callback. You can > >>> probably also know if you're in the process of starting the device. The > >>> solution seems clear to me - check both? :) > >> > >> Hmm.. You didn't get my point. :( > >> > >> I mean a lock within virtio driver is needed by virtio-user > >> to avoid the race condition between the virtio-user device > >> start and the virtio-user memory event callback (e.g. about > >> the device state change). And to iterate the memseg lists, > >> a lock within memory subsystem will be taken. So in both of > >> the virtio device start and memory event handling, there are > >> two locks will be taken -- And we need to take these locks > >> in correct order to avoid deadlock, and it requires us to > >> have a way to take the lock for rte_memseg_*_thread_unsafe() > >> in callers. > > > > I'm afraid i'm still not getting your point :( > > > > You know that you can either get called from memory callback, or not > > from memory callback. Both of these times, the virtio device is locked. > > So, where does the race come from? You take out your device lock, and by > > the time you need to iterate through memsegs, you know that you either > > were or weren't called from memory callback, which means you can pick > > either thread-safe or thread-unsafe version. > > > > Can you please draw up a step-by-step example where race can happen that > > cannot be solved the way i suggested above? > > > > I.e. > > > > thread 1 thread 2 > > do this > > do that > > For the benefit of public discussion, the following is result of our > internal discussion on this topic: the deadlock may happen because we > take virtio lock and hotplug lock in different order. > > So, thread 1 may do device start, which will take out virtio device > lock, then attempt to iterate memory regions thereby taking memory > hotplug lock. At the same time, thread 2 might trigger an allocation, > and virtio will receive a callback, which will lock hotplug, and then > attempt to lock virtio device. > > In other words, > > Thread 1 Thread 2 > lock virtio device > lock hotplug > (waits for hotplug unlock) > (waits for virtio device unlock) > > The solution to this is not trivial, and we haven't come up with > anything to fix this that doesn't involve pretty serious changes to the > memory subsystem. Ideas welcome :) > As part of device start, can the virtio-user driver take the hotplug lock before the virtio lock (in Thread 1 in your example). It can access the lock through rte_eal_get_configuration()->mem_config, as the Mellanox driver currently does. If that would work, an API to lock and unlock do it should probably be provided by the memory subsystem. Although, it seems a bit error-prone and a future change could easily break things. > -- > Thanks, > Anatoly ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-dev] [PATCH] net/virtio-user: fix memory hotplug support 2018-08-24 15:51 ` Sean Harte @ 2018-08-27 9:15 ` Burakov, Anatoly 2018-08-28 13:27 ` Sean Harte 0 siblings, 1 reply; 14+ messages in thread From: Burakov, Anatoly @ 2018-08-27 9:15 UTC (permalink / raw) To: Sean Harte; +Cc: tiwei.bie, maxime.coquelin, zhihong.wang, dev, stable On 24-Aug-18 4:51 PM, Sean Harte wrote: > On Fri, 24 Aug 2018 at 16:19, Burakov, Anatoly > <anatoly.burakov@intel.com> wrote: >> >> On 24-Aug-18 11:41 AM, Burakov, Anatoly wrote: >>> On 24-Aug-18 10:35 AM, Tiwei Bie wrote: >>>> On Fri, Aug 24, 2018 at 09:59:42AM +0100, Burakov, Anatoly wrote: >>>>> On 24-Aug-18 5:49 AM, Tiwei Bie wrote: >>>>>> On Thu, Aug 23, 2018 at 03:01:30PM +0100, Burakov, Anatoly wrote: >>>>>>> On 23-Aug-18 12:19 PM, Sean Harte wrote: >>>>>>>> On Thu, 23 Aug 2018 at 10:05, Burakov, Anatoly >>>>>>>> <anatoly.burakov@intel.com> wrote: >>>>>>>>> >>>>>>>>> On 23-Aug-18 3:57 AM, Tiwei Bie wrote: >>>>>>>>>> Deadlock can occur when allocating memory if a vhost-kernel >>>>>>>>>> based virtio-user device is in use. Besides, it's possible >>>>>>>>>> to have much more than 64 non-contiguous hugepage backed >>>>>>>>>> memory regions due to the memory hotplug, which may cause >>>>>>>>>> problems when handling VHOST_SET_MEM_TABLE request. A better >>>>>>>>>> solution is to have the virtio-user pass all the VA ranges >>>>>>>>>> reserved by DPDK to vhost-kernel. >>>>>>>>>> >>>>>>>>>> Bugzilla ID: 81 >>>>>>>>>> Fixes: 12ecb2f63b12 ("net/virtio-user: support memory hotplug") >>>>>>>>>> Cc: stable@dpdk.org >>>>>>>>>> >>>>>>>>>> Reported-by: Seán Harte <seanbh@gmail.com> >>>>>>>>>> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com> >>>>>>>>>> --- >>>>>>>>>> drivers/net/virtio/virtio_user/vhost_kernel.c | 64 >>>>>>>>>> ++++++++----------- >>>>>>>>>> 1 file changed, 27 insertions(+), 37 deletions(-) >>>>>>>>>> >>>>>>>>>> diff --git a/drivers/net/virtio/virtio_user/vhost_kernel.c >>>>>>>>>> b/drivers/net/virtio/virtio_user/vhost_kernel.c >>>>>>>>>> index b2444096c..49bd1b821 100644 >>>>>>>>>> --- a/drivers/net/virtio/virtio_user/vhost_kernel.c >>>>>>>>>> +++ b/drivers/net/virtio/virtio_user/vhost_kernel.c >>>>>>>>>> @@ -70,41 +70,12 @@ static uint64_t vhost_req_user_to_kernel[] = { >>>>>>>>>> [VHOST_USER_SET_MEM_TABLE] = VHOST_SET_MEM_TABLE, >>>>>>>>>> }; >>>>>>>>>> >>>>>>>>>> -struct walk_arg { >>>>>>>>>> - struct vhost_memory_kernel *vm; >>>>>>>>>> - uint32_t region_nr; >>>>>>>>>> -}; >>>>>>>>>> -static int >>>>>>>>>> -add_memory_region(const struct rte_memseg_list *msl __rte_unused, >>>>>>>>>> - const struct rte_memseg *ms, size_t len, void *arg) >>>>>>>>>> -{ >>>>>>>>>> - struct walk_arg *wa = arg; >>>>>>>>>> - struct vhost_memory_region *mr; >>>>>>>>>> - void *start_addr; >>>>>>>>>> - >>>>>>>>>> - if (wa->region_nr >= max_regions) >>>>>>>>>> - return -1; >>>>>>>>>> - >>>>>>>>>> - mr = &wa->vm->regions[wa->region_nr++]; >>>>>>>>>> - start_addr = ms->addr; >>>>>>>>>> - >>>>>>>>>> - mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr; >>>>>>>>>> - mr->userspace_addr = (uint64_t)(uintptr_t)start_addr; >>>>>>>>>> - mr->memory_size = len; >>>>>>>>>> - mr->mmap_offset = 0; >>>>>>>>>> - >>>>>>>>>> - return 0; >>>>>>>>>> -} >>>>>>>>>> - >>>>>>>>>> -/* By default, vhost kernel module allows 64 regions, but DPDK >>>>>>>>>> allows >>>>>>>>>> - * 256 segments. As a relief, below function merges those >>>>>>>>>> virtually >>>>>>>>>> - * adjacent memsegs into one region. >>>>>>>>>> - */ >>>>>>>>>> static struct vhost_memory_kernel * >>>>>>>>>> prepare_vhost_memory_kernel(void) >>>>>>>>>> { >>>>>>>>>> + struct rte_mem_config *mcfg = >>>>>>>>>> rte_eal_get_configuration()->mem_config; >>>>>>>>>> struct vhost_memory_kernel *vm; >>>>>>>>>> - struct walk_arg wa; >>>>>>>>>> + uint32_t region_nr = 0, i; >>>>>>>>>> >>>>>>>>>> vm = malloc(sizeof(struct vhost_memory_kernel) + >>>>>>>>>> max_regions * >>>>>>>>>> @@ -112,15 +83,34 @@ prepare_vhost_memory_kernel(void) >>>>>>>>>> if (!vm) >>>>>>>>>> return NULL; >>>>>>>>>> >>>>>>>>>> - wa.region_nr = 0; >>>>>>>>>> - wa.vm = vm; >>>>>>>>>> + for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { >>>>>>>>>> + struct rte_memseg_list *msl = &mcfg->memsegs[i]; >>>>>>>>>> + struct vhost_memory_region *mr; >>>>>>>>>> + void *start_addr; >>>>>>>>>> + uint64_t len; >>>>>>>>> >>>>>>>>> There is a rte_memseg_list_walk() - please do not iterate over >>>>>>>>> memseg >>>>>>>>> lists manually. >>>>>>>>> >>>>>>>> >>>>>>>> rte_memseg_list_walk() can't be used here because >>>>>>>> prepare_vhost_memory_kernel() is sometimes called from a memory >>>>>>>> callback. It will then hang trying to get a read lock on >>>>>>>> memory_hotplug_lock. >>>>>>> >>>>>>> OK, so use rte_memseg_list_walk_thread_unsafe(). >>>>>>> >>>>>>>> I don't think the rte_memseg_list_walk_thread_unsafe() function is >>>>>>>> appropriate because prepare_vhost_memory_kernel() may not always be >>>>>>>> called from a memory callback. >>>>>>> >>>>>>> And how is this different? What you're doing here is identical to >>>>>>> calling >>>>>>> rte_memseg_list_walk_thread_unsafe() (that's precisely what it does >>>>>>> internally - check the code!), except that you're doing it manually >>>>>>> and not >>>>>>> using DPDK API, which makes your code dependent on internals of DPDK's >>>>>>> memory implementation. >>>>>>> >>>>>>> So, this function may or may not be called from a callback, but >>>>>>> you're using >>>>>>> thread-unsafe walk anyway. I think you should call either >>>>>>> thread-safe or >>>>>>> thread-unsafe version depending on whether you're being called from a >>>>>>> callback or not. >>>>>> >>>>>> Hmm, the real case is a bit more tricky. Even if this >>>>>> function isn't called from memory event callbacks, the >>>>>> "thread-safe" version list_walk() still can't be used. >>>>>> Because deadlock may happen. >>>>>> >>>>>> In virtio-user device start, it needs to do SET_MEM_TABLE >>>>>> for the vhost-backend. And to make sure that preparing and >>>>>> setting the memory table is atomic (and also to protect the >>>>>> device state), it needs a lock. So if it calls "thread-safe" >>>>>> version list_walk(), there will be two locks taken in >>>>>> below order: >>>>>> >>>>>> - the virtio-user device lock (taken by virtio_user_start_device()); >>>>>> - the memory hotplug lock (taken by rte_memseg_list_walk()); >>>>>> >>>>>> And above locks will be released in below order: >>>>>> >>>>>> - the memory hotplug lock (released by rte_memseg_list_walk()); >>>>>> - the virtio-user device lock (released by virtio_user_start_device()); >>>>>> >>>>>> And in virtio-user's memory event callback, it also needs >>>>>> to take the virtio-user device lock to make sure preparing >>>>>> and setting the memory table is atomic (and also to protect >>>>>> the device state), so the same device lock is needed here. >>>>>> And before virtio-user's memory event callback is called, >>>>>> the memory hotplug lock has already been taken by memory >>>>>> subsystem. So the locks are taken in below order: >>>>>> >>>>>> - the memory hotplug lock (It has been taken by memory subsystem >>>>>> before virtio-user's memory event callback being called); >>>>>> - the virtio-user device lock (taken by virtio_user_mem_event_cb()); >>>>>> >>>>>> So, if the virtio-user device start and memory callback >>>>>> events happen at the same time, deadlock may happen. >>>>>> >>>>>> And we can't always use rte_memseg_list_walk_thread_unsafe(), >>>>>> because by its definition, it's only expected to be called >>>>>> in memory callbacks: >>>>>> >>>>>> /** >>>>>> * ...... >>>>>> * @note This function does not perform any locking, and is only >>>>>> safe to call >>>>>> * from within memory-related callback functions. >>>>>> * ...... >>>>>> */ >>>>>> int __rte_experimental >>>>>> rte_memseg_list_walk_thread_unsafe(rte_memseg_list_walk_t func, void >>>>>> *arg); >>>>>> >>>>>> So both of rte_memseg_list_walk_thread_unsafe() and >>>>>> rte_memseg_list_walk() are not really suitable for this >>>>>> case. If we really want to use these helpers, we need >>>>>> to allow rte_memseg_*_walk_thread_unsafe() to be called >>>>>> when the callers have taken mcfg->memory_hotplug_lock, >>>>>> or add some extra memory APIs to allow callers to take >>>>>> the lock for rte_memseg_*_walk_thread_unsafe(). And we >>>>>> can take this lock in virtio_user_start_device() before >>>>>> taking the virtio-user device lock (so locks can be taken >>>>>> in the correct order). Thoughts? >>>>> >>>>> You can know if this function is called from memory callback. You can >>>>> probably also know if you're in the process of starting the device. The >>>>> solution seems clear to me - check both? :) >>>> >>>> Hmm.. You didn't get my point. :( >>>> >>>> I mean a lock within virtio driver is needed by virtio-user >>>> to avoid the race condition between the virtio-user device >>>> start and the virtio-user memory event callback (e.g. about >>>> the device state change). And to iterate the memseg lists, >>>> a lock within memory subsystem will be taken. So in both of >>>> the virtio device start and memory event handling, there are >>>> two locks will be taken -- And we need to take these locks >>>> in correct order to avoid deadlock, and it requires us to >>>> have a way to take the lock for rte_memseg_*_thread_unsafe() >>>> in callers. >>> >>> I'm afraid i'm still not getting your point :( >>> >>> You know that you can either get called from memory callback, or not >>> from memory callback. Both of these times, the virtio device is locked. >>> So, where does the race come from? You take out your device lock, and by >>> the time you need to iterate through memsegs, you know that you either >>> were or weren't called from memory callback, which means you can pick >>> either thread-safe or thread-unsafe version. >>> >>> Can you please draw up a step-by-step example where race can happen that >>> cannot be solved the way i suggested above? >>> >>> I.e. >>> >>> thread 1 thread 2 >>> do this >>> do that >> >> For the benefit of public discussion, the following is result of our >> internal discussion on this topic: the deadlock may happen because we >> take virtio lock and hotplug lock in different order. >> >> So, thread 1 may do device start, which will take out virtio device >> lock, then attempt to iterate memory regions thereby taking memory >> hotplug lock. At the same time, thread 2 might trigger an allocation, >> and virtio will receive a callback, which will lock hotplug, and then >> attempt to lock virtio device. >> >> In other words, >> >> Thread 1 Thread 2 >> lock virtio device >> lock hotplug >> (waits for hotplug unlock) >> (waits for virtio device unlock) >> >> The solution to this is not trivial, and we haven't come up with >> anything to fix this that doesn't involve pretty serious changes to the >> memory subsystem. Ideas welcome :) >> > > As part of device start, can the virtio-user driver take the hotplug > lock before the virtio lock (in Thread 1 in your example). It can > access the lock through rte_eal_get_configuration()->mem_config, as > the Mellanox driver currently does. If that would work, an API to lock > and unlock do it should probably be provided by the memory subsystem. > Although, it seems a bit error-prone and a future change could easily > break things. Hi Sean, I wasn't aware of the MLX driver accessing the lock directly, and i would strongly discourage everyone from doing so :) The main problem with either accessing the lock directly or providing the API to lock/unlock this lock is, if something in current thtread has accidentally triggered an allocation while holding that lock (which may happen silently in the background), it may lead to a deadlock as allocation may try to take out a write lock. One possible solution could be changing the rwlock to a recursive lock, but even then it presents several problems because 1) our recursive spinlocks are not rwlocks and thus they can't allow parallel read access, and 2) there is currently no way to upgrade/downgrade read lock to a write lock (not to mention the allocator doesn't know if it's supposed to release a lock, or downgrade it to a read lock), which would be needed for such a thing to work without stopping the world, so to speak. Another possible solution to this could be to stop page allocations from happening in the first place, while an external lock is held. This might just solve our problem (and possibly MLX's as well), but i haven't yet thought through the implications of such a change. Thanks, Anatoly > >> -- >> Thanks, >> Anatoly > -- Thanks, Anatoly ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-dev] [PATCH] net/virtio-user: fix memory hotplug support 2018-08-27 9:15 ` Burakov, Anatoly @ 2018-08-28 13:27 ` Sean Harte 2018-08-28 14:12 ` Burakov, Anatoly 0 siblings, 1 reply; 14+ messages in thread From: Sean Harte @ 2018-08-28 13:27 UTC (permalink / raw) To: anatoly.burakov; +Cc: tiwei.bie, maxime.coquelin, zhihong.wang, dev, stable On Mon, 27 Aug 2018 at 10:15, Burakov, Anatoly <anatoly.burakov@intel.com> wrote: > > On 24-Aug-18 4:51 PM, Sean Harte wrote: > > On Fri, 24 Aug 2018 at 16:19, Burakov, Anatoly > > <anatoly.burakov@intel.com> wrote: > >> > >> On 24-Aug-18 11:41 AM, Burakov, Anatoly wrote: > >>> On 24-Aug-18 10:35 AM, Tiwei Bie wrote: > >>>> On Fri, Aug 24, 2018 at 09:59:42AM +0100, Burakov, Anatoly wrote: > >>>>> On 24-Aug-18 5:49 AM, Tiwei Bie wrote: > >>>>>> On Thu, Aug 23, 2018 at 03:01:30PM +0100, Burakov, Anatoly wrote: > >>>>>>> On 23-Aug-18 12:19 PM, Sean Harte wrote: > >>>>>>>> On Thu, 23 Aug 2018 at 10:05, Burakov, Anatoly > >>>>>>>> <anatoly.burakov@intel.com> wrote: > >>>>>>>>> > >>>>>>>>> On 23-Aug-18 3:57 AM, Tiwei Bie wrote: > >>>>>>>>>> Deadlock can occur when allocating memory if a vhost-kernel > >>>>>>>>>> based virtio-user device is in use. Besides, it's possible > >>>>>>>>>> to have much more than 64 non-contiguous hugepage backed > >>>>>>>>>> memory regions due to the memory hotplug, which may cause > >>>>>>>>>> problems when handling VHOST_SET_MEM_TABLE request. A better > >>>>>>>>>> solution is to have the virtio-user pass all the VA ranges > >>>>>>>>>> reserved by DPDK to vhost-kernel. > >>>>>>>>>> > >>>>>>>>>> Bugzilla ID: 81 > >>>>>>>>>> Fixes: 12ecb2f63b12 ("net/virtio-user: support memory hotplug") > >>>>>>>>>> Cc: stable@dpdk.org > >>>>>>>>>> > >>>>>>>>>> Reported-by: Seán Harte <seanbh@gmail.com> > >>>>>>>>>> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com> > >>>>>>>>>> --- > >>>>>>>>>> drivers/net/virtio/virtio_user/vhost_kernel.c | 64 > >>>>>>>>>> ++++++++----------- > >>>>>>>>>> 1 file changed, 27 insertions(+), 37 deletions(-) > >>>>>>>>>> > >>>>>>>>>> diff --git a/drivers/net/virtio/virtio_user/vhost_kernel.c > >>>>>>>>>> b/drivers/net/virtio/virtio_user/vhost_kernel.c > >>>>>>>>>> index b2444096c..49bd1b821 100644 > >>>>>>>>>> --- a/drivers/net/virtio/virtio_user/vhost_kernel.c > >>>>>>>>>> +++ b/drivers/net/virtio/virtio_user/vhost_kernel.c > >>>>>>>>>> @@ -70,41 +70,12 @@ static uint64_t vhost_req_user_to_kernel[] = { > >>>>>>>>>> [VHOST_USER_SET_MEM_TABLE] = VHOST_SET_MEM_TABLE, > >>>>>>>>>> }; > >>>>>>>>>> > >>>>>>>>>> -struct walk_arg { > >>>>>>>>>> - struct vhost_memory_kernel *vm; > >>>>>>>>>> - uint32_t region_nr; > >>>>>>>>>> -}; > >>>>>>>>>> -static int > >>>>>>>>>> -add_memory_region(const struct rte_memseg_list *msl __rte_unused, > >>>>>>>>>> - const struct rte_memseg *ms, size_t len, void *arg) > >>>>>>>>>> -{ > >>>>>>>>>> - struct walk_arg *wa = arg; > >>>>>>>>>> - struct vhost_memory_region *mr; > >>>>>>>>>> - void *start_addr; > >>>>>>>>>> - > >>>>>>>>>> - if (wa->region_nr >= max_regions) > >>>>>>>>>> - return -1; > >>>>>>>>>> - > >>>>>>>>>> - mr = &wa->vm->regions[wa->region_nr++]; > >>>>>>>>>> - start_addr = ms->addr; > >>>>>>>>>> - > >>>>>>>>>> - mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr; > >>>>>>>>>> - mr->userspace_addr = (uint64_t)(uintptr_t)start_addr; > >>>>>>>>>> - mr->memory_size = len; > >>>>>>>>>> - mr->mmap_offset = 0; > >>>>>>>>>> - > >>>>>>>>>> - return 0; > >>>>>>>>>> -} > >>>>>>>>>> - > >>>>>>>>>> -/* By default, vhost kernel module allows 64 regions, but DPDK > >>>>>>>>>> allows > >>>>>>>>>> - * 256 segments. As a relief, below function merges those > >>>>>>>>>> virtually > >>>>>>>>>> - * adjacent memsegs into one region. > >>>>>>>>>> - */ > >>>>>>>>>> static struct vhost_memory_kernel * > >>>>>>>>>> prepare_vhost_memory_kernel(void) > >>>>>>>>>> { > >>>>>>>>>> + struct rte_mem_config *mcfg = > >>>>>>>>>> rte_eal_get_configuration()->mem_config; > >>>>>>>>>> struct vhost_memory_kernel *vm; > >>>>>>>>>> - struct walk_arg wa; > >>>>>>>>>> + uint32_t region_nr = 0, i; > >>>>>>>>>> > >>>>>>>>>> vm = malloc(sizeof(struct vhost_memory_kernel) + > >>>>>>>>>> max_regions * > >>>>>>>>>> @@ -112,15 +83,34 @@ prepare_vhost_memory_kernel(void) > >>>>>>>>>> if (!vm) > >>>>>>>>>> return NULL; > >>>>>>>>>> > >>>>>>>>>> - wa.region_nr = 0; > >>>>>>>>>> - wa.vm = vm; > >>>>>>>>>> + for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { > >>>>>>>>>> + struct rte_memseg_list *msl = &mcfg->memsegs[i]; > >>>>>>>>>> + struct vhost_memory_region *mr; > >>>>>>>>>> + void *start_addr; > >>>>>>>>>> + uint64_t len; > >>>>>>>>> > >>>>>>>>> There is a rte_memseg_list_walk() - please do not iterate over > >>>>>>>>> memseg > >>>>>>>>> lists manually. > >>>>>>>>> > >>>>>>>> > >>>>>>>> rte_memseg_list_walk() can't be used here because > >>>>>>>> prepare_vhost_memory_kernel() is sometimes called from a memory > >>>>>>>> callback. It will then hang trying to get a read lock on > >>>>>>>> memory_hotplug_lock. > >>>>>>> > >>>>>>> OK, so use rte_memseg_list_walk_thread_unsafe(). > >>>>>>> > >>>>>>>> I don't think the rte_memseg_list_walk_thread_unsafe() function is > >>>>>>>> appropriate because prepare_vhost_memory_kernel() may not always be > >>>>>>>> called from a memory callback. > >>>>>>> > >>>>>>> And how is this different? What you're doing here is identical to > >>>>>>> calling > >>>>>>> rte_memseg_list_walk_thread_unsafe() (that's precisely what it does > >>>>>>> internally - check the code!), except that you're doing it manually > >>>>>>> and not > >>>>>>> using DPDK API, which makes your code dependent on internals of DPDK's > >>>>>>> memory implementation. > >>>>>>> > >>>>>>> So, this function may or may not be called from a callback, but > >>>>>>> you're using > >>>>>>> thread-unsafe walk anyway. I think you should call either > >>>>>>> thread-safe or > >>>>>>> thread-unsafe version depending on whether you're being called from a > >>>>>>> callback or not. > >>>>>> > >>>>>> Hmm, the real case is a bit more tricky. Even if this > >>>>>> function isn't called from memory event callbacks, the > >>>>>> "thread-safe" version list_walk() still can't be used. > >>>>>> Because deadlock may happen. > >>>>>> > >>>>>> In virtio-user device start, it needs to do SET_MEM_TABLE > >>>>>> for the vhost-backend. And to make sure that preparing and > >>>>>> setting the memory table is atomic (and also to protect the > >>>>>> device state), it needs a lock. So if it calls "thread-safe" > >>>>>> version list_walk(), there will be two locks taken in > >>>>>> below order: > >>>>>> > >>>>>> - the virtio-user device lock (taken by virtio_user_start_device()); > >>>>>> - the memory hotplug lock (taken by rte_memseg_list_walk()); > >>>>>> > >>>>>> And above locks will be released in below order: > >>>>>> > >>>>>> - the memory hotplug lock (released by rte_memseg_list_walk()); > >>>>>> - the virtio-user device lock (released by virtio_user_start_device()); > >>>>>> > >>>>>> And in virtio-user's memory event callback, it also needs > >>>>>> to take the virtio-user device lock to make sure preparing > >>>>>> and setting the memory table is atomic (and also to protect > >>>>>> the device state), so the same device lock is needed here. > >>>>>> And before virtio-user's memory event callback is called, > >>>>>> the memory hotplug lock has already been taken by memory > >>>>>> subsystem. So the locks are taken in below order: > >>>>>> > >>>>>> - the memory hotplug lock (It has been taken by memory subsystem > >>>>>> before virtio-user's memory event callback being called); > >>>>>> - the virtio-user device lock (taken by virtio_user_mem_event_cb()); > >>>>>> > >>>>>> So, if the virtio-user device start and memory callback > >>>>>> events happen at the same time, deadlock may happen. > >>>>>> > >>>>>> And we can't always use rte_memseg_list_walk_thread_unsafe(), > >>>>>> because by its definition, it's only expected to be called > >>>>>> in memory callbacks: > >>>>>> > >>>>>> /** > >>>>>> * ...... > >>>>>> * @note This function does not perform any locking, and is only > >>>>>> safe to call > >>>>>> * from within memory-related callback functions. > >>>>>> * ...... > >>>>>> */ > >>>>>> int __rte_experimental > >>>>>> rte_memseg_list_walk_thread_unsafe(rte_memseg_list_walk_t func, void > >>>>>> *arg); > >>>>>> > >>>>>> So both of rte_memseg_list_walk_thread_unsafe() and > >>>>>> rte_memseg_list_walk() are not really suitable for this > >>>>>> case. If we really want to use these helpers, we need > >>>>>> to allow rte_memseg_*_walk_thread_unsafe() to be called > >>>>>> when the callers have taken mcfg->memory_hotplug_lock, > >>>>>> or add some extra memory APIs to allow callers to take > >>>>>> the lock for rte_memseg_*_walk_thread_unsafe(). And we > >>>>>> can take this lock in virtio_user_start_device() before > >>>>>> taking the virtio-user device lock (so locks can be taken > >>>>>> in the correct order). Thoughts? > >>>>> > >>>>> You can know if this function is called from memory callback. You can > >>>>> probably also know if you're in the process of starting the device. The > >>>>> solution seems clear to me - check both? :) > >>>> > >>>> Hmm.. You didn't get my point. :( > >>>> > >>>> I mean a lock within virtio driver is needed by virtio-user > >>>> to avoid the race condition between the virtio-user device > >>>> start and the virtio-user memory event callback (e.g. about > >>>> the device state change). And to iterate the memseg lists, > >>>> a lock within memory subsystem will be taken. So in both of > >>>> the virtio device start and memory event handling, there are > >>>> two locks will be taken -- And we need to take these locks > >>>> in correct order to avoid deadlock, and it requires us to > >>>> have a way to take the lock for rte_memseg_*_thread_unsafe() > >>>> in callers. > >>> > >>> I'm afraid i'm still not getting your point :( > >>> > >>> You know that you can either get called from memory callback, or not > >>> from memory callback. Both of these times, the virtio device is locked. > >>> So, where does the race come from? You take out your device lock, and by > >>> the time you need to iterate through memsegs, you know that you either > >>> were or weren't called from memory callback, which means you can pick > >>> either thread-safe or thread-unsafe version. > >>> > >>> Can you please draw up a step-by-step example where race can happen that > >>> cannot be solved the way i suggested above? > >>> > >>> I.e. > >>> > >>> thread 1 thread 2 > >>> do this > >>> do that > >> > >> For the benefit of public discussion, the following is result of our > >> internal discussion on this topic: the deadlock may happen because we > >> take virtio lock and hotplug lock in different order. > >> > >> So, thread 1 may do device start, which will take out virtio device > >> lock, then attempt to iterate memory regions thereby taking memory > >> hotplug lock. At the same time, thread 2 might trigger an allocation, > >> and virtio will receive a callback, which will lock hotplug, and then > >> attempt to lock virtio device. > >> > >> In other words, > >> > >> Thread 1 Thread 2 > >> lock virtio device > >> lock hotplug > >> (waits for hotplug unlock) > >> (waits for virtio device unlock) > >> > >> The solution to this is not trivial, and we haven't come up with > >> anything to fix this that doesn't involve pretty serious changes to the > >> memory subsystem. Ideas welcome :) > >> > > > > As part of device start, can the virtio-user driver take the hotplug > > lock before the virtio lock (in Thread 1 in your example). It can > > access the lock through rte_eal_get_configuration()->mem_config, as > > the Mellanox driver currently does. If that would work, an API to lock > > and unlock do it should probably be provided by the memory subsystem. > > Although, it seems a bit error-prone and a future change could easily > > break things. > > Hi Sean, > > I wasn't aware of the MLX driver accessing the lock directly, and i > would strongly discourage everyone from doing so :) > > The main problem with either accessing the lock directly or providing > the API to lock/unlock this lock is, if something in current thtread has > accidentally triggered an allocation while holding that lock (which may > happen silently in the background), it may lead to a deadlock as > allocation may try to take out a write lock. > Hi Anatoly, Can this situation already occur if something does an allocation in the context of a memory callback? I'm not very familiar with the memory hotplug management code: is it possible to re-arrange things so that the memory callbacks are called without holding the memory_hotplug_lock? > One possible solution could be changing the rwlock to a recursive lock, > but even then it presents several problems because 1) our recursive > spinlocks are not rwlocks and thus they can't allow parallel read > access, and 2) there is currently no way to upgrade/downgrade read lock > to a write lock (not to mention the allocator doesn't know if it's > supposed to release a lock, or downgrade it to a read lock), which would > be needed for such a thing to work without stopping the world, so to speak. > Is parallel read access needed for functional reasons or just performance? Is it expected that performance critical code would ever hold this lock? If parallel read access is needed for functional reasons, it should be possible to implement a RW lock that is recursive and allows upgrade/downgrade (I don't think you really want to downgrade, you just want a read lock to succeed if write lock is already hold, but you don't want to give up the write lock?) > Another possible solution to this could be to stop page allocations from > happening in the first place, while an external lock is held. This might > just solve our problem (and possibly MLX's as well), but i haven't yet > thought through the implications of such a change. > How would you detect if an external lock is held? Also, would it result in failed allocations even though there is memory available on the system? > Thanks, > Anatoly > > > > >> -- > >> Thanks, > >> Anatoly > > > > > -- > Thanks, > Anatoly ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-dev] [PATCH] net/virtio-user: fix memory hotplug support 2018-08-28 13:27 ` Sean Harte @ 2018-08-28 14:12 ` Burakov, Anatoly 0 siblings, 0 replies; 14+ messages in thread From: Burakov, Anatoly @ 2018-08-28 14:12 UTC (permalink / raw) To: Sean Harte; +Cc: tiwei.bie, maxime.coquelin, zhihong.wang, dev, stable On 28-Aug-18 2:27 PM, Sean Harte wrote: > On Mon, 27 Aug 2018 at 10:15, Burakov, Anatoly > <anatoly.burakov@intel.com> wrote: >> >> On 24-Aug-18 4:51 PM, Sean Harte wrote: >>> On Fri, 24 Aug 2018 at 16:19, Burakov, Anatoly >>> <anatoly.burakov@intel.com> wrote: >>>> >>>> On 24-Aug-18 11:41 AM, Burakov, Anatoly wrote: >>>>> On 24-Aug-18 10:35 AM, Tiwei Bie wrote: >>>>>> On Fri, Aug 24, 2018 at 09:59:42AM +0100, Burakov, Anatoly wrote: >>>>>>> On 24-Aug-18 5:49 AM, Tiwei Bie wrote: >>>>>>>> On Thu, Aug 23, 2018 at 03:01:30PM +0100, Burakov, Anatoly wrote: >>>>>>>>> On 23-Aug-18 12:19 PM, Sean Harte wrote: >>>>>>>>>> On Thu, 23 Aug 2018 at 10:05, Burakov, Anatoly >>>>>>>>>> <anatoly.burakov@intel.com> wrote: >>>>>>>>>>> >>>>>>>>>>> On 23-Aug-18 3:57 AM, Tiwei Bie wrote: >>>>>>>>>>>> Deadlock can occur when allocating memory if a vhost-kernel >>>>>>>>>>>> based virtio-user device is in use. Besides, it's possible >>>>>>>>>>>> to have much more than 64 non-contiguous hugepage backed >>>>>>>>>>>> memory regions due to the memory hotplug, which may cause >>>>>>>>>>>> problems when handling VHOST_SET_MEM_TABLE request. A better >>>>>>>>>>>> solution is to have the virtio-user pass all the VA ranges >>>>>>>>>>>> reserved by DPDK to vhost-kernel. >>>>>>>>>>>> >>>>>>>>>>>> Bugzilla ID: 81 >>>>>>>>>>>> Fixes: 12ecb2f63b12 ("net/virtio-user: support memory hotplug") >>>>>>>>>>>> Cc: stable@dpdk.org >>>>>>>>>>>> >>>>>>>>>>>> Reported-by: Seán Harte <seanbh@gmail.com> >>>>>>>>>>>> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com> >>>>>>>>>>>> --- >>>>>>>>>>>> drivers/net/virtio/virtio_user/vhost_kernel.c | 64 >>>>>>>>>>>> ++++++++----------- >>>>>>>>>>>> 1 file changed, 27 insertions(+), 37 deletions(-) >>>>>>>>>>>> >>>>>>>>>>>> diff --git a/drivers/net/virtio/virtio_user/vhost_kernel.c >>>>>>>>>>>> b/drivers/net/virtio/virtio_user/vhost_kernel.c >>>>>>>>>>>> index b2444096c..49bd1b821 100644 >>>>>>>>>>>> --- a/drivers/net/virtio/virtio_user/vhost_kernel.c >>>>>>>>>>>> +++ b/drivers/net/virtio/virtio_user/vhost_kernel.c >>>>>>>>>>>> @@ -70,41 +70,12 @@ static uint64_t vhost_req_user_to_kernel[] = { >>>>>>>>>>>> [VHOST_USER_SET_MEM_TABLE] = VHOST_SET_MEM_TABLE, >>>>>>>>>>>> }; >>>>>>>>>>>> >>>>>>>>>>>> -struct walk_arg { >>>>>>>>>>>> - struct vhost_memory_kernel *vm; >>>>>>>>>>>> - uint32_t region_nr; >>>>>>>>>>>> -}; >>>>>>>>>>>> -static int >>>>>>>>>>>> -add_memory_region(const struct rte_memseg_list *msl __rte_unused, >>>>>>>>>>>> - const struct rte_memseg *ms, size_t len, void *arg) >>>>>>>>>>>> -{ >>>>>>>>>>>> - struct walk_arg *wa = arg; >>>>>>>>>>>> - struct vhost_memory_region *mr; >>>>>>>>>>>> - void *start_addr; >>>>>>>>>>>> - >>>>>>>>>>>> - if (wa->region_nr >= max_regions) >>>>>>>>>>>> - return -1; >>>>>>>>>>>> - >>>>>>>>>>>> - mr = &wa->vm->regions[wa->region_nr++]; >>>>>>>>>>>> - start_addr = ms->addr; >>>>>>>>>>>> - >>>>>>>>>>>> - mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr; >>>>>>>>>>>> - mr->userspace_addr = (uint64_t)(uintptr_t)start_addr; >>>>>>>>>>>> - mr->memory_size = len; >>>>>>>>>>>> - mr->mmap_offset = 0; >>>>>>>>>>>> - >>>>>>>>>>>> - return 0; >>>>>>>>>>>> -} >>>>>>>>>>>> - >>>>>>>>>>>> -/* By default, vhost kernel module allows 64 regions, but DPDK >>>>>>>>>>>> allows >>>>>>>>>>>> - * 256 segments. As a relief, below function merges those >>>>>>>>>>>> virtually >>>>>>>>>>>> - * adjacent memsegs into one region. >>>>>>>>>>>> - */ >>>>>>>>>>>> static struct vhost_memory_kernel * >>>>>>>>>>>> prepare_vhost_memory_kernel(void) >>>>>>>>>>>> { >>>>>>>>>>>> + struct rte_mem_config *mcfg = >>>>>>>>>>>> rte_eal_get_configuration()->mem_config; >>>>>>>>>>>> struct vhost_memory_kernel *vm; >>>>>>>>>>>> - struct walk_arg wa; >>>>>>>>>>>> + uint32_t region_nr = 0, i; >>>>>>>>>>>> >>>>>>>>>>>> vm = malloc(sizeof(struct vhost_memory_kernel) + >>>>>>>>>>>> max_regions * >>>>>>>>>>>> @@ -112,15 +83,34 @@ prepare_vhost_memory_kernel(void) >>>>>>>>>>>> if (!vm) >>>>>>>>>>>> return NULL; >>>>>>>>>>>> >>>>>>>>>>>> - wa.region_nr = 0; >>>>>>>>>>>> - wa.vm = vm; >>>>>>>>>>>> + for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) { >>>>>>>>>>>> + struct rte_memseg_list *msl = &mcfg->memsegs[i]; >>>>>>>>>>>> + struct vhost_memory_region *mr; >>>>>>>>>>>> + void *start_addr; >>>>>>>>>>>> + uint64_t len; >>>>>>>>>>> >>>>>>>>>>> There is a rte_memseg_list_walk() - please do not iterate over >>>>>>>>>>> memseg >>>>>>>>>>> lists manually. >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> rte_memseg_list_walk() can't be used here because >>>>>>>>>> prepare_vhost_memory_kernel() is sometimes called from a memory >>>>>>>>>> callback. It will then hang trying to get a read lock on >>>>>>>>>> memory_hotplug_lock. >>>>>>>>> >>>>>>>>> OK, so use rte_memseg_list_walk_thread_unsafe(). >>>>>>>>> >>>>>>>>>> I don't think the rte_memseg_list_walk_thread_unsafe() function is >>>>>>>>>> appropriate because prepare_vhost_memory_kernel() may not always be >>>>>>>>>> called from a memory callback. >>>>>>>>> >>>>>>>>> And how is this different? What you're doing here is identical to >>>>>>>>> calling >>>>>>>>> rte_memseg_list_walk_thread_unsafe() (that's precisely what it does >>>>>>>>> internally - check the code!), except that you're doing it manually >>>>>>>>> and not >>>>>>>>> using DPDK API, which makes your code dependent on internals of DPDK's >>>>>>>>> memory implementation. >>>>>>>>> >>>>>>>>> So, this function may or may not be called from a callback, but >>>>>>>>> you're using >>>>>>>>> thread-unsafe walk anyway. I think you should call either >>>>>>>>> thread-safe or >>>>>>>>> thread-unsafe version depending on whether you're being called from a >>>>>>>>> callback or not. >>>>>>>> >>>>>>>> Hmm, the real case is a bit more tricky. Even if this >>>>>>>> function isn't called from memory event callbacks, the >>>>>>>> "thread-safe" version list_walk() still can't be used. >>>>>>>> Because deadlock may happen. >>>>>>>> >>>>>>>> In virtio-user device start, it needs to do SET_MEM_TABLE >>>>>>>> for the vhost-backend. And to make sure that preparing and >>>>>>>> setting the memory table is atomic (and also to protect the >>>>>>>> device state), it needs a lock. So if it calls "thread-safe" >>>>>>>> version list_walk(), there will be two locks taken in >>>>>>>> below order: >>>>>>>> >>>>>>>> - the virtio-user device lock (taken by virtio_user_start_device()); >>>>>>>> - the memory hotplug lock (taken by rte_memseg_list_walk()); >>>>>>>> >>>>>>>> And above locks will be released in below order: >>>>>>>> >>>>>>>> - the memory hotplug lock (released by rte_memseg_list_walk()); >>>>>>>> - the virtio-user device lock (released by virtio_user_start_device()); >>>>>>>> >>>>>>>> And in virtio-user's memory event callback, it also needs >>>>>>>> to take the virtio-user device lock to make sure preparing >>>>>>>> and setting the memory table is atomic (and also to protect >>>>>>>> the device state), so the same device lock is needed here. >>>>>>>> And before virtio-user's memory event callback is called, >>>>>>>> the memory hotplug lock has already been taken by memory >>>>>>>> subsystem. So the locks are taken in below order: >>>>>>>> >>>>>>>> - the memory hotplug lock (It has been taken by memory subsystem >>>>>>>> before virtio-user's memory event callback being called); >>>>>>>> - the virtio-user device lock (taken by virtio_user_mem_event_cb()); >>>>>>>> >>>>>>>> So, if the virtio-user device start and memory callback >>>>>>>> events happen at the same time, deadlock may happen. >>>>>>>> >>>>>>>> And we can't always use rte_memseg_list_walk_thread_unsafe(), >>>>>>>> because by its definition, it's only expected to be called >>>>>>>> in memory callbacks: >>>>>>>> >>>>>>>> /** >>>>>>>> * ...... >>>>>>>> * @note This function does not perform any locking, and is only >>>>>>>> safe to call >>>>>>>> * from within memory-related callback functions. >>>>>>>> * ...... >>>>>>>> */ >>>>>>>> int __rte_experimental >>>>>>>> rte_memseg_list_walk_thread_unsafe(rte_memseg_list_walk_t func, void >>>>>>>> *arg); >>>>>>>> >>>>>>>> So both of rte_memseg_list_walk_thread_unsafe() and >>>>>>>> rte_memseg_list_walk() are not really suitable for this >>>>>>>> case. If we really want to use these helpers, we need >>>>>>>> to allow rte_memseg_*_walk_thread_unsafe() to be called >>>>>>>> when the callers have taken mcfg->memory_hotplug_lock, >>>>>>>> or add some extra memory APIs to allow callers to take >>>>>>>> the lock for rte_memseg_*_walk_thread_unsafe(). And we >>>>>>>> can take this lock in virtio_user_start_device() before >>>>>>>> taking the virtio-user device lock (so locks can be taken >>>>>>>> in the correct order). Thoughts? >>>>>>> >>>>>>> You can know if this function is called from memory callback. You can >>>>>>> probably also know if you're in the process of starting the device. The >>>>>>> solution seems clear to me - check both? :) >>>>>> >>>>>> Hmm.. You didn't get my point. :( >>>>>> >>>>>> I mean a lock within virtio driver is needed by virtio-user >>>>>> to avoid the race condition between the virtio-user device >>>>>> start and the virtio-user memory event callback (e.g. about >>>>>> the device state change). And to iterate the memseg lists, >>>>>> a lock within memory subsystem will be taken. So in both of >>>>>> the virtio device start and memory event handling, there are >>>>>> two locks will be taken -- And we need to take these locks >>>>>> in correct order to avoid deadlock, and it requires us to >>>>>> have a way to take the lock for rte_memseg_*_thread_unsafe() >>>>>> in callers. >>>>> >>>>> I'm afraid i'm still not getting your point :( >>>>> >>>>> You know that you can either get called from memory callback, or not >>>>> from memory callback. Both of these times, the virtio device is locked. >>>>> So, where does the race come from? You take out your device lock, and by >>>>> the time you need to iterate through memsegs, you know that you either >>>>> were or weren't called from memory callback, which means you can pick >>>>> either thread-safe or thread-unsafe version. >>>>> >>>>> Can you please draw up a step-by-step example where race can happen that >>>>> cannot be solved the way i suggested above? >>>>> >>>>> I.e. >>>>> >>>>> thread 1 thread 2 >>>>> do this >>>>> do that >>>> >>>> For the benefit of public discussion, the following is result of our >>>> internal discussion on this topic: the deadlock may happen because we >>>> take virtio lock and hotplug lock in different order. >>>> >>>> So, thread 1 may do device start, which will take out virtio device >>>> lock, then attempt to iterate memory regions thereby taking memory >>>> hotplug lock. At the same time, thread 2 might trigger an allocation, >>>> and virtio will receive a callback, which will lock hotplug, and then >>>> attempt to lock virtio device. >>>> >>>> In other words, >>>> >>>> Thread 1 Thread 2 >>>> lock virtio device >>>> lock hotplug >>>> (waits for hotplug unlock) >>>> (waits for virtio device unlock) >>>> >>>> The solution to this is not trivial, and we haven't come up with >>>> anything to fix this that doesn't involve pretty serious changes to the >>>> memory subsystem. Ideas welcome :) >>>> >>> >>> As part of device start, can the virtio-user driver take the hotplug >>> lock before the virtio lock (in Thread 1 in your example). It can >>> access the lock through rte_eal_get_configuration()->mem_config, as >>> the Mellanox driver currently does. If that would work, an API to lock >>> and unlock do it should probably be provided by the memory subsystem. >>> Although, it seems a bit error-prone and a future change could easily >>> break things. >> >> Hi Sean, >> >> I wasn't aware of the MLX driver accessing the lock directly, and i >> would strongly discourage everyone from doing so :) >> >> The main problem with either accessing the lock directly or providing >> the API to lock/unlock this lock is, if something in current thtread has >> accidentally triggered an allocation while holding that lock (which may >> happen silently in the background), it may lead to a deadlock as >> allocation may try to take out a write lock. >> > > Hi Anatoly, > > Can this situation already occur if something does an allocation in > the context of a memory callback? I'm not very familiar with the > memory hotplug management code: is it possible to re-arrange things so > that the memory callbacks are called without holding the > memory_hotplug_lock? Yes, this situation will occur if someone tries to allocate something and triggers new page allocation/free in the context of a callback. And this is the kind of thing i was referring to when i said "without serious changes to the memory code" :) The main problem here is that these callbacks *must* be called while holding *some* kind of lock, be it read or write lock - otherwise, memory layout might change while you're still inside the callback. Making the lock recursive would help in this case somewhat, but with it it carries a danger of calling the callback recursively as well (you allocate, you end up in the callback, you allocate inside callback, you end up in the callback again, round and round we go). This is not a problem i see an easy solution to. An additional complication here is secondary processes - we need to synchronize memory maps between primary and secondary, and every one of these processes will receive the callback. They need to be triggered per process, and we need to make sure that some other process doesn't change the memory map in the meantime. Technically (cue xkcd), and assuming we could upgrade/downgrade the memory hotplug lock to read/write ones, we could reorganize callbacks in a way that they were only called while holding a read lock, so we could use the thread-safe walk functions, but the allocation inside the callback problem doesn't go away. And, coming back to this particular issue, this wouldn't have helped us in any way, because the issue here is not that we can't allocate inside a callback or arrive in a callback with a lock already held, but the fact that we take the two locks (virtio device lock and mem lock) in different order in two different places :) > >> One possible solution could be changing the rwlock to a recursive lock, >> but even then it presents several problems because 1) our recursive >> spinlocks are not rwlocks and thus they can't allow parallel read >> access, and 2) there is currently no way to upgrade/downgrade read lock >> to a write lock (not to mention the allocator doesn't know if it's >> supposed to release a lock, or downgrade it to a read lock), which would >> be needed for such a thing to work without stopping the world, so to speak. >> > > Is parallel read access needed for functional reasons or just > performance? Is it expected that performance critical code would ever > hold this lock? If parallel read access is needed for functional > reasons, it should be possible to implement a RW lock that is > recursive and allows upgrade/downgrade (I don't think you really want > to downgrade, you just want a read lock to succeed if write lock is > already hold, but you don't want to give up the write lock?) > Strictly speaking, the answer to both questions is no - we use it purely because i don't want to block other threads from reading the mem map unless absolutely required. It's not expected to be performance sensitive (if you want performant alloc/free, use mempool). So, replacing rwlocks with recursive locks is not the problem - the implications of allowing memory allocations inside callbacks - _that_ is the real issue with this proposal. But, as i said above, this wouldn't have fixed the issue we're having here anyway :) >> Another possible solution to this could be to stop page allocations from >> happening in the first place, while an external lock is held. This might >> just solve our problem (and possibly MLX's as well), but i haven't yet >> thought through the implications of such a change. >> > > How would you detect if an external lock is held? Also, would it > result in failed allocations even though there is memory available on > the system? By providing an API to take out that lock explicitly, instead of allowing backdoor access to mem config through rte_config. When user calls this lock API, we could store a flag that gets cleared when user calls the unlock function - that would take care of forbidding new page allocations even if we had recursive lock and an allocation were to happen while the lock is held. If i had my way, this rte_eal_get_configuration() API would've been gone as i don't think we should be providing such an API in the first place - this is basically inviting user code to depend on DPDK internals, which means changing them is a huge pain for everyone. > >> Thanks, >> Anatoly >> >>> >>>> -- >>>> Thanks, >>>> Anatoly >>> >> >> >> -- >> Thanks, >> Anatoly > -- Thanks, Anatoly ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2018-08-28 14:12 UTC | newest] Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-08-23 2:57 [dpdk-dev] [PATCH] net/virtio-user: fix memory hotplug support Tiwei Bie 2018-08-23 9:05 ` Burakov, Anatoly 2018-08-23 11:19 ` Sean Harte 2018-08-23 14:01 ` Burakov, Anatoly 2018-08-23 15:33 ` Sean Harte 2018-08-24 4:49 ` Tiwei Bie 2018-08-24 8:59 ` Burakov, Anatoly 2018-08-24 9:35 ` Tiwei Bie 2018-08-24 10:41 ` Burakov, Anatoly 2018-08-24 15:19 ` Burakov, Anatoly 2018-08-24 15:51 ` Sean Harte 2018-08-27 9:15 ` Burakov, Anatoly 2018-08-28 13:27 ` Sean Harte 2018-08-28 14:12 ` Burakov, Anatoly
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).