From mboxrd@z Thu Jan 1 00:00:00 1970
To: "Wang, Xiao W" , "Bie, Tiwei"
Cc: "alejandro.lucero@netronome.com" , "dev@dpdk.org" , "Wang, Zhihong" , "Ye, Xiaolong"
References: <20181213100910.13087-2-xiao.w.wang@intel.com> <20181214211612.167681-1-xiao.w.wang@intel.com> <20181214211612.167681-10-xiao.w.wang@intel.com>
From: Maxime Coquelin
Date: Mon, 17 Dec 2018 12:08:44 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.2.1
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Subject: Re: [dpdk-dev] [PATCH v4 09/10] net/ifc: support SW assisted VDPA live migration
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions
X-List-Received-Date: Mon, 17 Dec 2018 11:08:50 -0000

On 12/17/18 10:12 AM, Wang, Xiao W wrote:
> Hi Maxime,
>
>> -----Original Message-----
>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>> Sent: Sunday, December 16, 2018 1:35 AM
>> To: Wang, Xiao W ; Bie, Tiwei
>> Cc: alejandro.lucero@netronome.com; dev@dpdk.org; Wang, Zhihong
>> ; Ye, Xiaolong
>> Subject: Re: [PATCH v4 09/10] net/ifc: support SW assisted VDPA live migration
>>
>>
>>
>> On 12/14/18 10:16 PM, Xiao Wang wrote:
>>> In SW assisted live migration mode, driver will stop the device and
>>> setup a mediate virtio ring to relay the communication between the
>>> virtio driver and the VDPA device.
>>>
>>> This data path intervention will allow SW to help on guest dirty page
>>> logging for live migration.
>>>
>>> This SW fallback is event driven relay thread, so when the network
>>> throughput is low, this SW fallback will take little CPU resource, but
>>> when the throughput goes up, the relay thread's CPU usage will goes up
>>> accordinly.
>>
>> s/accordinly/accordingly/
>>
>
> Will fix it in next version.
>
>>>
>>> User needs to take all the factors including CPU usage, guest perf
>>> degradation, etc. into consideration when selecting the live migration
>>> support mode.
>>>
>>> Signed-off-by: Xiao Wang
>>> ---
>>>  drivers/net/ifc/base/ifcvf.h |   1 +
>>>  drivers/net/ifc/ifcvf_vdpa.c | 346 ++++++++++++++++++++++++++++++++++++++++++-
>>>  2 files changed, 344 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/net/ifc/base/ifcvf.h b/drivers/net/ifc/base/ifcvf.h
>>> index c15c69107..e8a30d2c6 100644
>>> --- a/drivers/net/ifc/base/ifcvf.h
>>> +++ b/drivers/net/ifc/base/ifcvf.h
>>> @@ -50,6 +50,7 @@
>>>  #define IFCVF_LM_ENABLE_VF 0x1
>>>  #define IFCVF_LM_ENABLE_PF 0x3
>>>  #define IFCVF_LOG_BASE 0x100000000000
>>> +#define IFCVF_MEDIATE_VRING 0x200000000000
>>
>> MEDIATED?
>
> "mediate" is used as adjective here.
>
>>
>>>
>>>  #define IFCVF_32_BIT_MASK 0xffffffff
>>>
>>> diff --git a/drivers/net/ifc/ifcvf_vdpa.c b/drivers/net/ifc/ifcvf_vdpa.c
>>> index f181c5a6e..61757d0b4 100644
>>> --- a/drivers/net/ifc/ifcvf_vdpa.c
>>> +++ b/drivers/net/ifc/ifcvf_vdpa.c
>>> @@ -63,6 +63,9 @@ struct ifcvf_internal {
>>>  	rte_atomic32_t running;
>>>  	rte_spinlock_t lock;
>>>  	bool sw_lm;
>
> [...]
>
>>> +static void *
>>> +vring_relay(void *arg)
>>> +{
>>> +	int i, vid, epfd, fd, nfds;
>>> +	struct ifcvf_internal *internal = (struct ifcvf_internal *)arg;
>>> +	struct rte_vhost_vring vring;
>>> +	struct rte_intr_handle *intr_handle;
>>> +	uint16_t qid, q_num;
>>> +	struct epoll_event events[IFCVF_MAX_QUEUES * 4];
>>> +	struct epoll_event ev;
>>> +	int nbytes;
>>> +	uint64_t buf;
>>> +
>>> +	vid = internal->vid;
>>> +	q_num = rte_vhost_get_vring_num(vid);
>>> +	/* prepare the mediate vring */
>>> +	for (qid = 0; qid < q_num; qid++) {
>>> +		rte_vhost_get_vring_base(vid, qid,
>>> +			&internal->m_vring[qid].avail->idx,
>>> +			&internal->m_vring[qid].used->idx);
>>> +		rte_vdpa_relay_vring_avail(vid, qid, &internal->m_vring[qid]);
>>> +	}
>>> +
>>> +	/* add notify fd and interrupt fd to epoll */
>>> +	epfd = epoll_create(IFCVF_MAX_QUEUES * 2);
>>> +	if (epfd < 0) {
>>> +		DRV_LOG(ERR, "failed to create epoll instance.");
>>> +		return NULL;
>>> +	}
>>> +	internal->epfd = epfd;
>>> +
>>> +	for (qid = 0; qid < q_num; qid++) {
>>> +		ev.events = EPOLLIN | EPOLLPRI;
>>> +		rte_vhost_get_vhost_vring(vid, qid, &vring);
>>> +		ev.data.u64 = qid << 1 | (uint64_t)vring.kickfd << 32;
>>> +		if (epoll_ctl(epfd, EPOLL_CTL_ADD, vring.kickfd, &ev) < 0) {
>>> +			DRV_LOG(ERR, "epoll add error: %s", strerror(errno));
>>> +			return NULL;
>>> +		}
>>> +	}
>>> +
>>> +	intr_handle = &internal->pdev->intr_handle;
>>> +	for (qid = 0; qid < q_num; qid++) {
>>> +		ev.events = EPOLLIN | EPOLLPRI;
>>> +		ev.data.u64 = 1 | qid << 1 |
>>> +			(uint64_t)intr_handle->efds[qid] << 32;
>>> +		if (epoll_ctl(epfd, EPOLL_CTL_ADD, intr_handle->efds[qid], &ev)
>>> +				< 0) {
>>> +			DRV_LOG(ERR, "epoll add error: %s", strerror(errno));
>>> +			return NULL;
>>> +		}
>>> +	}
>>> +
>>> +	/* start relay with a first kick */
>>> +	for (qid = 0; qid < q_num; qid++)
>>> +		ifcvf_notify_queue(&internal->hw, qid);
>>> +
>>> +	/* listen to the events and react accordingly */
>>> +	for (;;) {
>>> +		nfds = epoll_wait(epfd, events, q_num * 2, -1);
>>> +		if (nfds < 0) {
>>> +			if (errno == EINTR)
>>> +				continue;
>>> +			DRV_LOG(ERR, "epoll_wait return fail\n");
>>> +			return NULL;
>>> +		}
>>> +
>>> +		for (i = 0; i < nfds; i++) {
>>> +			fd = (uint32_t)(events[i].data.u64 >> 32);
>>> +			do {
>>> +				nbytes = read(fd, &buf, 8);
>>> +				if (nbytes < 0) {
>>> +					if (errno == EINTR ||
>>> +						errno == EWOULDBLOCK ||
>>> +						errno == EAGAIN)
>>> +						continue;
>>> +					DRV_LOG(INFO, "Error reading "
>>> +						"kickfd: %s",
>>> +						strerror(errno));
>>> +				}
>>> +				break;
>>> +			} while (1);
>>> +
>>> +			qid = events[i].data.u32 >> 1;
>>> +
>>> +			if (events[i].data.u32 & 1)
>>> +				update_used_ring(internal, qid);
>>> +			else
>>> +				update_avail_ring(internal, qid);
>>> +		}
>>> +	}
>>> +
>>> +	return NULL;
>>> +}
>>> +
>>> +static int
>>> +setup_vring_relay(struct ifcvf_internal *internal)
>>> +{
>>> +	int ret;
>>> +
>>> +	ret = pthread_create(&internal->tid, NULL, vring_relay,
>>> +			(void *)internal);
>>
>> So it will be scheduled without any affinity?
>> Shouldn't it use a pmd thread instead?
>
> The new thread will inherit the thread affinity from its parent thread. As you know, vdpa is trying to
> minimize CPU usage for virtio HW acceleration, and we assign just one core to vdpa daemon
> (doc/guides/sample_app_ug/vdpa.rst), so there's no dedicated pmd worker core.

OK, thanks for the clarification!

> Thanks,
> Xiao
>
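
P.S.: on the affinity point, here is a minimal, hypothetical sketch (plain pthreads, not the ifc driver code; the core id, names and empty thread body are placeholders) of how a relay-style worker could be created with an explicit CPU mask instead of inheriting the parent's:

/* Hypothetical illustration only -- not part of the patch. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* stand-in for an event-driven relay loop (epoll_wait on kick/irq fds) */
static void *
relay_worker(void *arg)
{
	(void)arg;
	return NULL;
}

int
main(void)
{
	pthread_t tid;
	pthread_attr_t attr;
	cpu_set_t cpuset;

	/* pin the worker to core 2 (placeholder) before it starts */
	CPU_ZERO(&cpuset);
	CPU_SET(2, &cpuset);

	pthread_attr_init(&attr);
	pthread_attr_setaffinity_np(&attr, sizeof(cpuset), &cpuset);

	if (pthread_create(&tid, &attr, relay_worker, NULL) != 0) {
		perror("pthread_create");
		return 1;
	}
	pthread_join(tid, NULL);
	pthread_attr_destroy(&attr);
	return 0;
}

Build with "gcc -pthread"; in a driver context the existing pthread_create() call would simply take such an attr instead of NULL.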