From: Dariusz Stojaczyk
Date: Wed, 13 Jun 2018 11:41:06 +0200
To: Stefan Hajnoczi
Cc: dev@dpdk.org, Maxime Coquelin, Tiwei Bie, Tetsuya Mukawa, Thomas Monjalon, yliu@fridaylinux.org, James Harris, Tomasz Kulasek, "Wodkowski, PawelX"
Subject: Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal

Hi Stefan,

I'm sorry for the late response. My email client filtered out this
mail. I fixed it just now.

On Fri, Jun 8, 2018 at 15:29 Stefan Hajnoczi wrote:
>
> On Thu, Jun 07, 2018 at 05:12:20PM +0200, Dariusz Stojaczyk wrote:
> > The proposed structure for the new library is described below.
> > * rte_vhost2.h
> >   - public API
> >   - registering targets with provided ops
> >   - unregistering targets
> >   - iova_to_vva()
> > * transport.h/.c
> >   - implements rte_vhost2.h
> >   - allows registering vhost transports, which are opaquely required by the
> >     rte_vhost2.h API (the target register function requires a transport name)
> >   - exposes a set of callbacks to be implemented by each transport
> > * vhost_user.c
> >   - vhost-user Unix domain socket transport
>
> This file should be called transport_unix.c or similar so it's clear
> that it only handles UNIX domain socket transport aspects, not general
> vhost-user protocol aspects. If the distinction is unclear then
> layering violations are easy to make in the future (especially when
> people other than you contribute to the code).

Ack. Also, the virtio-vhost-user transport still has to be placed in
the drivers/ directory, right?
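To make the transport/target split concrete, here is a rough sketch of the registry that transport.h/.c could expose. Every name in it is hypothetical, not the actual librte_vhost2 API: each transport registers an ops struct under a name, and the public target-register path resolves the transport by that name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct vhost_transport_ops {
	const char *name;
	/* start listening for a vhost-user connection on this target */
	int (*tgt_register)(const char *path, uint64_t flags);
	/* transport-specific reply path, e.g. sendmsg() on the UNIX socket */
	int (*send_reply)(int conn_fd, const void *msg, size_t len);
};

#define VHOST_MAX_TRANSPORTS 4
static const struct vhost_transport_ops *transports[VHOST_MAX_TRANSPORTS];
static int num_transports;

/* Called by each transport at init time (e.g. from a constructor). */
int vhost_transport_register(const struct vhost_transport_ops *ops)
{
	if (num_transports >= VHOST_MAX_TRANSPORTS)
		return -1;
	transports[num_transports++] = ops;
	return 0;
}

/* Used by the public target-register call to resolve a transport name. */
const struct vhost_transport_ops *vhost_transport_find(const char *name)
{
	for (int i = 0; i < num_transports; i++)
		if (strcmp(transports[i]->name, name) == 0)
			return transports[i];
	return NULL;
}
```

With something like this, the UNIX-socket transport and a later virtio-vhost-user transport (even one living under drivers/) would both register through the same hook, and target registration with an unknown transport name fails cleanly.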
We could move most of this library somewhere into drivers/, leaving
only rte_vhost2.h, transport.h and transport.c in lib/librte_vhost2.
What do you think?

> >   - does recvmsg()
> >   - uses the underlying vhost-user helper lib to process messages, but still
> >     handles some transport-specific ones, e.g. SET_MEM_TABLE
> >   - calls some of the rte_vhost2.h ops registered with a target
> > * fd_man.h/.c
> >   - polls provided descriptors, calls user callbacks on fd events
> >   - based on the original rte_vhost version
> >   - additionally allows calling user-provided callbacks on the poll thread
>
> This is general-purpose functionality that should be a core DPDK utility.
>
> Are you sure you cannot use the existing (e)poll functionality in DPDK?

We have to use poll here, and I haven't seen any DPDK APIs for poll(),
only rte_epoll_*. Since received vhost-user messages may be handled
asynchronously, we have to temporarily remove an fd from the poll
group for the time each message is handled. We do that by setting the
fd in the polled fds array to -1; the poll(2) man page explicitly
suggests this as a way to make poll() ignore an entry.

> > * vhost.h/.c
> >   - a transport-agnostic vhost-user library
> >   - calls most of the rte_vhost2.h ops registered with a target
> >   - manages virtqueue state
> >   - hopefully to be reused by virtio-vhost-user
> >   - exposes a set of callbacks to be implemented by each transport
> >     (e.g. for sending message replies)
> >
> > This series includes only the vhost-user transport. Virtio-vhost-user
> > is to be done later.
> >
> > The following items are still TBD:
> > * vhost-user slave channel
> > * get/set_config
> > * cfg_call() implementation
> > * IOTLB
> > * NUMA awareness
>
> This is important to think about while the API is still being designed.
>
> Some initial thoughts. NUMA affinity is optimal when:
>
> 1. The vring, indirect descriptor tables, and data buffers are allocated
>    on the same NUMA node.
>
> 2. The virtqueue interrupts go to vcpus associated with the same NUMA
>    node as the vring.
>
> 3. The guest NUMA node corresponds to the host NUMA node of the backing
>    storage device (e.g. an NVMe PCI adapter).
>
> This way memory is local to the NVMe PCI adapter on the way down the
> stack when submitting I/O and back up again when completing I/O.
>
> Achieving #1 & #2 is up to the guest drivers.
>
> Achieving #3 is up to the virtual machine configuration (QEMU command line).
>
> The role that DPDK plays in all of this is that each vhost-user
> virtqueue should be polled by a thread that has been placed on the same
> host NUMA node mentioned above for #1, #2, and #3.
>
> Per-virtqueue state should also be allocated on this host NUMA node.

I agree on all points.

> Device backends should be able to query this information so they, too,
> can allocate memory with optimal NUMA affinity.

Of course. Since we offer raw vq pointers, the backends will be able
to use get_mempolicy(2); no additional APIs are required.

> > * Live migration
>
> Another important feature to design in from the beginning.

This is being worked on at the moment.

Thanks,
D.