From: Maxime Coquelin
To: "Xia, Chenbo", Thomas Monjalon, David Marchand
Cc: dev, Stephen Hemminger, "Liang, Cunming", "Lu, Xiuchun", "Li, Miao", "Wu, Jingjing"
Subject: Re: [dpdk-dev] [PATCH 0/8] Introduce emudev library and iavf emudev driver
Date: Tue, 22 Dec 2020 09:48:44 +0100
Message-ID: <33a01a88-327b-844b-9f3c-03660bc2ceb3@redhat.com>
References: <20201218074736.93999-1-chenbo.xia@intel.com> <22c05dc3-bee9-4662-3f8d-a4dcd7635b42@redhat.com>

Hi Chenbo,

Thanks for the detailed reply.

On 12/22/20 4:09 AM, Xia, Chenbo wrote:
> Hi Maxime,
>
>> -----Original Message-----
>> From: Maxime Coquelin
>> Sent: Monday, December 21, 2020 8:02 PM
>> To: Xia, Chenbo; Thomas Monjalon; David Marchand
>> Cc: dev; Stephen Hemminger; Liang, Cunming; Lu, Xiuchun; Li, Miao;
>> Wu, Jingjing
>> Subject: Re: [dpdk-dev] [PATCH 0/8] Introduce emudev library and iavf
>> emudev driver
>>
>> On 12/21/20 10:52 AM, Maxime Coquelin wrote:
>>> Hi Chenbo,
>>>
>>> On 12/19/20 7:11 AM, Xia, Chenbo wrote:
>>>> Hi David,
>>>>
>>>>> -----Original Message-----
>>>>> From: David Marchand
>>>>> Sent: Friday, December 18, 2020 5:54 PM
>>>>> To: Xia, Chenbo
>>>>> Cc: dev; Thomas Monjalon; Stephen Hemminger; Liang, Cunming;
>>>>> Lu, Xiuchun; Li, Miao; Wu, Jingjing
>>>>> Subject: Re: [PATCH 0/8] Introduce emudev library and iavf emudev driver
>>>>>
>>>>> On Fri, Dec 18, 2020 at 9:02 AM Chenbo Xia wrote:
>>>>>>
>>>>>> This series introduces a new device abstraction called emudev for
>>>>>> emulated devices. A new library (librte_emudev) is implemented. The
>>>>>> first emudev driver is also introduced, which emulates the Intel
>>>>>> Adaptive Virtual Function (iavf) as a software network device.
>>>>>>
>>>>>> This series has a dependency on the librte_vfio_user patch series:
>>>>>> http://patchwork.dpdk.org/cover/85389/
>>>>>>
>>>>>> Background & Motivation
>>>>>> -----------------------
>>>>>> The disaggregated/multi-process QEMU is using VFIO-over-socket/vfio-user
>>>>>> as the main transport mechanism to disaggregate IO services from QEMU.
>>>>>> Therefore, librte_vfio_user is introduced in DPDK to accommodate
>>>>>> emulated devices for high-performance I/O. Although the vfio-user
>>>>>> library provides the possibility of emulating devices in DPDK, DPDK
>>>>>> does not have a device abstraction for emulated devices. A good device
>>>>>> abstraction will be useful for applications or high-performance data
>>>>>> path drivers. With this consideration, the emudev library is designed
>>>>>> and implemented. It also makes it possible to keep a modular design for
>>>>>> emulated devices by implementing data-path-related logic in a
>>>>>> standalone driver (e.g., an ethdev driver) and keeping the unrelated
>>>>>> logic in the emudev driver.
>>>>>
>>>>> Since you mention performance, how does it compare to vhost-user/virtio?
>>>>
>>>> I think it depends on the device specification (i.e., how complex its
>>>> data path handling is). A first try on the iavf spec shows better
>>>> performance than virtio in our simple tests.
>>>
>>> That's interesting! How big is the performance difference? And how do
>>> we explain it?
>>>
>>> If there are improvements that could be done in the Virtio
>>> specification, it would be great to know and work on their
>>> implementations. It worries me a bit that everyone is coming with their
>>> own device emulation every release, making things like live-migration
>>> difficult to achieve in the future.
>>
>> I did a quick review of the IAVF emudev driver to understand what factors
>> other than the ring layout could explain a performance gain.
>>
>> My understanding is that part of the performance gain may come from the
>> following things that are supported/implemented in the Vhost-user backend
>> and not in the IAVF driver:
>>
>> 1. Memory hotplug. It seems the datapath is not safe against memory
>> hotplug in the VM, which causes the memory tables to be updated
>> asynchronously from the datapath. In order to support it in the
>> Vhost-user library, we had to introduce locks to ensure the datapath
>> isn't accessing the shared memory while it is being remapped.
>
> I think it now uses a similar way to what vhost-user does.
>
> First, in the vfio-user patch series, we introduce a callback lock_dp to
> lock the data path when messages like DMA MAP/UNMAP come. It will lock
> the datapath in our data path driver.
>
> Note that the data path handling is in our data path driver:
> http://patchwork.dpdk.org/cover/85500/
>
> For modular design, the iavf_emu driver emulates the device but the iavf
> back-end driver does the data path handling.

My analysis was actually based on the data path driver series. My point
was that iavfbe_recv_pkts() and iavfbe_xmit_pkts() are not safe against
asynchronous changes like memory table updates. As far as I can see, the
access_lock isn't taken by these functions, so if, for example, a memory
table update happens during these functions' execution, it could lead to
undefined behaviour. The only thing checked there is whether the queue is
enabled when entering the function, but this value can change right after
having been checked.

For example, in the Vhost-user lib, we protected rte_vhost_dequeue_burst()
and rte_vhost_enqueue_burst() with a spinlock. Note that this spinlock is
per-virtqueue in order to avoid contention between the different queues.
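Just to illustrate what I mean, below is a minimal sketch of that kind of
per-queue locking. It is not the actual iavf back-end code: struct
emu_queue and emu_recv_pkts() are made-up names and the descriptor
processing is elided; only the locking pattern matters here.

#include <stdbool.h>
#include <stdint.h>

#include <rte_branch_prediction.h>
#include <rte_mbuf.h>
#include <rte_spinlock.h>

/* Hypothetical per-queue state: one lock per queue avoids contention
 * between queues while still protecting against asynchronous control
 * events (memory table updates, queue disable, ...). */
struct emu_queue {
	rte_spinlock_t access_lock;
	bool enabled;
	/* ring pointers, memory table reference, ... */
};

static uint16_t
emu_recv_pkts(struct emu_queue *q, struct rte_mbuf **pkts, uint16_t nb_pkts)
{
	uint16_t nb_rx = 0;

	(void)pkts;
	(void)nb_pkts;

	rte_spinlock_lock(&q->access_lock);

	/* Re-check the enable flag under the lock so that it cannot
	 * change between the check and the ring processing. */
	if (unlikely(!q->enabled))
		goto out;

	/* ... descriptor processing elided; the control path takes the
	 * same per-queue lock before remapping guest memory or
	 * disabling the queue ... */

out:
	rte_spinlock_unlock(&q->access_lock);
	return nb_rx;
}

With such a scheme, the control-path handler (e.g. on DMA MAP/UNMAP or
queue disable) takes the same per-queue lock, so the burst functions never
observe a half-updated memory table.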
>> 2. Live-migration. This feature does not seem supported in the driver,
>> as I could not find a dirty pages tracking mechanism. On the Vhost-user
>> side, supporting it implies adding more branch conditions in the hot
>> path, even when it is not used.
>
> Yes, we don't support this now in this version. And yes, when we support
> this feature, it will introduce complexity in the data path.
>
>> 3. No IOMMU support. Same here, this is supported in the Vhost-user
>> library, and adding its support in the IAVF driver would introduce some
>> more branches in the hot path even when not used.
>
> Yes, vIOMMU is not fully supported in the vfio-user spec for now, and I
> also agree that when we have to support it, it will slow down the data
> path.
>
>> 4. Out-of-bounds memory access checks. While
>> rte_iavf_emu_get_dma_vaddr() provides a way to ensure the full requested
>> length is mapped, the data path does not use it. It does not even ensure
>
> In fact, it uses it 😊. I think you may have missed our data path driver.
> Here's a use: http://patchwork.dpdk.org/patch/85504/

Sorry, I was not clear. What I meant is that rte_iavf_emu_get_dma_vaddr()
is indeed used, but its outputs aren't checked properly.

>> the translated address is non-NULL. It makes it trivial for a malicious
>> guest to make the hypervisor's vSwitch crash by simply passing random
>> values as the buffer's address and length. Fixing it is trivial, but it
>> will add several more checks and loops (if a buffer spans two pages) in
>> the hot path.
>
> I don't quite understand this one. First, rte_iavf_emu_get_dma_vaddr() is
> the only way to translate addresses. And I think this function will
> ensure the input address is valid. Looking at the vhost side,
> vhost_iova_to_vva() does similar things when vIOMMU is not used. Do I
> miss something? Just correct me if I am wrong.

I think rte_iavf_emu_get_dma_vaddr() is good; the issue is the way it is
used in iavfbe_recv_pkts() and iavfbe_xmit_pkts().

First, the returned address (desc_addr) is not checked. So if the start
address of the buffer is out of the guest memory ranges, the host app will
crash with a segmentation fault. It can happen in case of a bug in the
guest driver, or with a maliciously crafted descriptor.

Then, the len parameter of rte_iavf_emu_get_dma_vaddr() is a pointer: the
caller has to pass the length of the buffer it wants to map, and on return
it is updated with the contiguous length that is actually mapped. The
caller needs to check whether all of the requested length is mapped, and
loop until it is:
http://code.dpdk.org/dpdk/latest/source/lib/librte_vhost/virtio_net.c#L484

This is required because a buffer contiguous in IOVA (guest physical
address space in this case) is not necessarily contiguous in host virtual
address space. It is also required to be safe against untrusted guests, as
a malicious guest could pass ~0ULL in the descriptor buffer len of the Tx
path to crash the host application or overwrite its memory.

Currently, in the driver, the descriptor buf len is passed to the DMA map
function, but its value is not checked afterwards, so it only ensures the
first byte is mapped. For Tx, a 1-byte length is requested at DMA map, so
there too only the first byte of the buffer is ensured to be mapped (and
only if desc_addr is not NULL, which is not checked).
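For the record, here is roughly the pattern I have in mind, similar to the
vhost code linked above. The prototype of rte_iavf_emu_get_dma_vaddr()
below is only my assumption of its shape based on this discussion, and
map_guest_buf() is a made-up helper, not code from your series:

#include <stddef.h>
#include <stdint.h>

/* Assumed shape of the translation helper: returns the host virtual
 * address for a guest IOVA and updates *len with the length that is
 * contiguously mapped starting at that address (it may be smaller than
 * the requested length). */
void *rte_iavf_emu_get_dma_vaddr(void *mem, uint64_t iova, uint64_t *len);

/* Made-up helper: map a full guest buffer, looping over segments that
 * are not contiguous in host virtual address space. Returns the number
 * of host segments, or -1 if any part of the buffer is not mapped. */
static int
map_guest_buf(void *mem, uint64_t iova, uint64_t size,
	      void **vaddrs, uint64_t *lens, int max_segs)
{
	int nr = 0;

	while (size > 0) {
		uint64_t len = size;
		void *vaddr = rte_iavf_emu_get_dma_vaddr(mem, iova, &len);

		/* Reject descriptors pointing outside guest memory, and
		 * never trust the guest-provided length alone. */
		if (vaddr == NULL || len == 0 || nr == max_segs)
			return -1;

		vaddrs[nr] = vaddr;
		lens[nr] = len;
		nr++;

		iova += len;
		size -= len;
	}

	return nr;
}

With such a loop, both the NULL return and a shorter-than-requested
mapping are handled before any copy happens, which is what protects
against the ~0ULL length case on the Tx path.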
>> Other than that, there is for sure a performance gain due to all the
>> features Virtio-net supports that we have to check and handle in the hot
>> path, like indirect descriptors or mergeable buffers for example.
>
> I think iavf has similar features like indirect and mergeable to
> recv/xmit large pkts. But I believe there will be some feature difference
> between iavf and virtio/vhost.

Sure, but I meant it might not be necessary to implement all those
features (if they are optional in the spec) in your driver, just that in
the case of Vhost we have a legacy to support.

> I think you are correct that for this version, it's not fair to compare
> virtio/vhost with the iavf back-end, because the iavf back-end does not
> support some features yet and, besides, we have not optimized the data
> path of the iavf back-end. We expect the performance of the iavf back-end
> to be at the same level as virtio 1.1, and hopefully better because of
> the ring layout. But let's see when we can do a complete performance
> analysis 😊.

Thanks!
Maxime

> Thanks!
> Chenbo
>
>> Best regards,
>> Maxime
>>
>>> Regards,
>>> Maxime
>>>
>>>> Thanks!
>>>> Chenbo
>>>>
>>>>> --
>>>>> David Marchand