From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <darek.stojaczyk@gmail.com>
Received: from mail-io0-f182.google.com (mail-io0-f182.google.com
 [209.85.223.182]) by dpdk.org (Postfix) with ESMTP id B05831EE5D
 for <dev@dpdk.org>; Wed, 13 Jun 2018 11:41:18 +0200 (CEST)
Received: by mail-io0-f182.google.com with SMTP id l19-v6so2771544ioj.5
 for <dev@dpdk.org>; Wed, 13 Jun 2018 02:41:18 -0700 (PDT)
MIME-Version: 1.0
References: <1526648465-62579-1-git-send-email-dariuszx.stojaczyk@intel.com>
 <20180607151227.23660-1-darek.stojaczyk@gmail.com>
 <20180608100852.GA31164@stefanha-x1.localdomain>
In-Reply-To: <20180608100852.GA31164@stefanha-x1.localdomain>
From: Dariusz Stojaczyk <darek.stojaczyk@gmail.com>
Date: Wed, 13 Jun 2018 11:41:06 +0200
Message-ID: <CAH3KmLiWit+g5V=YN9pwBxQ_BgqT72+uDicHNN_m5E+Fg6t96w@mail.gmail.com>
To: Stefan Hajnoczi <stefanha@redhat.com>
Cc: dev@dpdk.org, Maxime Coquelin <maxime.coquelin@redhat.com>, 
 Tiwei Bie <tiwei.bie@intel.com>, Tetsuya Mukawa <mtetsuyah@gmail.com>, 
 Thomas Monjalon <thomas@monjalon.net>, yliu@fridaylinux.org, 
 James Harris <james.r.harris@intel.com>,
 Tomasz Kulasek <tomaszx.kulasek@intel.com>, 
 "Wodkowski, PawelX" <pawelx.wodkowski@intel.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Subject: Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Wed, 13 Jun 2018 09:41:19 -0000

Hi Stefan,
I'm sorry for the late response. My email client filtered out this
mail. I fixed it just now.

On Fri, 8 Jun 2018 at 15:29, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Jun 07, 2018 at 05:12:20PM +0200, Dariusz Stojaczyk wrote:
> > The proposed structure for the new library is described below.
> >  * rte_vhost2.h
> >    - public API
> >    - registering targets with provided ops
> >    - unregistering targets
> >    - iova_to_vva()
> >  * transport.h/.c
> >    - implements rte_vhost2.h
> >    - allows registering vhost transports, which are opaquely required by the
> >      rte_vhost2.h API (target register function requires transport name).
> >    - exposes a set of callbacks to be implemented by each transport
> >  * vhost_user.c
> >    - vhost-user Unix domain socket transport
>
> This file should be called transport_unix.c or similar so it's clear
> that it only handles UNIX domain socket transport aspects, not general
> vhost-user protocol aspects.  If the distinction is unclear then
> layering violations are easy to make in the future (especially when
> people other than you contribute to the code).

Ack. Also, the virtio-vhost-user transport still has to be placed in
the drivers/ directory, right? We could move most of this library
somewhere into drivers/, leaving only rte_vhost2.h, transport.h and
transport.c in lib/librte_vhost2. What do you think?

>
> >    - does recvmsg()
> >    - uses the underlying vhost-user helper lib to process messages, but still
> >      handles some transport-specific ones, e.g. SET_MEM_TABLE
> >    - calls some of the rte_vhost2.h ops registered with a target
> >  * fd_man.h/.c
> >    - polls provided descriptors, calls user callbacks on fd events
> >    - based on the original rte_vhost version
> >    - additionally allows calling user-provided callbacks on the poll thread
>
> This is general-purpose functionality that should be a core DPDK utility.
>
> Are you sure you cannot use existing (e)poll functionality in DPDK?

We have to use poll() here, and I haven't seen any DPDK APIs for
poll(), only rte_epoll_*. Since received vhost-user messages may be
handled asynchronously, we have to temporarily remove an fd from the
poll set while each message is being handled. We do that by setting
the fd in the polled fds array to -1. The poll(2) man page explicitly
describes a negative fd as a way to make poll() ignore an entry.
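
A minimal sketch of that masking trick, for illustration only. The
struct conn type and the conn_pause()/conn_resume()/conn_mask_demo()
names are hypothetical and are not part of rte_vhost2 or fd_man; the
only real API used is POSIX poll(), pipe(), write() and close().

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical bookkeeping for one connection; not an rte_vhost2 type. */
struct conn {
    int fd;       /* the real descriptor, kept here while the slot is masked */
    size_t slot;  /* index of this connection in the pollfd array */
};

/* Mask the connection: poll() ignores entries whose fd is negative. */
void conn_pause(struct pollfd *fds, const struct conn *c)
{
    assert(fds != NULL && c != NULL);
    fds[c->slot].fd = -1;
    fds[c->slot].revents = 0;
}

/* Unmask the connection once its message has been fully handled. */
void conn_resume(struct pollfd *fds, const struct conn *c)
{
    assert(fds != NULL && c != NULL);
    fds[c->slot].fd = c->fd;
}

/*
 * Usage sketch: a readable pipe end is reported by poll(), ignored while
 * masked, and reported again after it is restored.
 * Returns 0 if all three observations match, -1 otherwise.
 */
int conn_mask_demo(void)
{
    int p[2];

    if (pipe(p) != 0)
        return -1;
    if (write(p[1], "x", 1) != 1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    struct pollfd fds[1] = { { .fd = p[0], .events = POLLIN } };
    struct conn c = { .fd = p[0], .slot = 0 };

    int before = poll(fds, 1, 0);  /* data pending, fd visible */
    conn_pause(fds, &c);
    int masked = poll(fds, 1, 0);  /* data pending, fd masked */
    conn_resume(fds, &c);
    int after = poll(fds, 1, 0);   /* data pending, fd visible again */

    close(p[0]);
    close(p[1]);
    return (before == 1 && masked == 0 && after == 1) ? 0 : -1;
}
```

The real descriptor has to be remembered somewhere outside the pollfd
array, since the slot itself only holds -1 while the message is in
flight. With epoll the equivalent would be an EPOLL_CTL_DEL and
EPOLL_CTL_ADD pair, i.e. two extra syscalls per message.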

>
> >  * vhost.h/.c
> >    - a transport-agnostic vhost-user library
> >    - calls most of the rte_vhost2.h ops registered with a target
> >    - manages virtqueues state
> >    - hopefully to be reused by the virtio-vhost-user
> >    - exposes a set of callbacks to be implemented by each transport
> >      (for e.g. sending message replies)
> >
> > This series includes only vhost-user transport. Virtio-vhost-user
> > is to be done later.
> >
> > The following items are still TBD:
> >   * vhost-user slave channel
> >   * get/set_config
> >   * cfg_call() implementation
> >   * IOTLB
> >   * NUMA awareness
>
> This is important to think about while the API is still being designed.
>
> Some initial thoughts.  NUMA affinity is optimal when:
>
> 1. The vring, indirect descriptor tables, and data buffers are allocated
>    on the same NUMA node.
>
> 2. The virtqueue interrupts go to vcpus associated with the same NUMA
>    node as the vring.
>
> 3. The guest NUMA node corresponds to the host NUMA node of the backing
>    storage device (e.g. NVMe PCI adapter).
>
> This way memory is local to the NVMe PCI adapter on the way down the
> stack when submitting I/O and back up again when completing I/O.
>
> Achieving #1 & #2 is up to the guest drivers.
>
> Achieving #3 is up to virtual machine configuration (QEMU command-line).
>
> The role that DPDK plays in all of this is that each vhost-user
> virtqueue should be polled by a thread that has been placed on the same
> host NUMA node mentioned above for #1, #2, and #3.
>
> Per-virtqueue state should also be allocated on this host NUMA node.

I agree on all points.

> Device backends should be able to query this information so they, too,
> can allocate memory with optimal NUMA affinity.

Of course. Since we expose raw vq pointers, backends will be able to
call get_mempolicy(2) on them to find the NUMA node. No additional
APIs are required.
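
As a rough sketch of what such a query could look like on Linux: the
addr_to_numa_node() helper name is hypothetical, and the raw syscall is
used here only to avoid assuming that libnuma's <numaif.h> wrapper is
installed. The flags MPOL_F_NODE | MPOL_F_ADDR ask the kernel for the
node of the page backing the given address.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/mempolicy.h>

/*
 * Hypothetical helper: return the NUMA node of the page backing addr,
 * or -1 if the kernel cannot answer (for example no NUMA support, or
 * the syscall is filtered inside a container).
 */
int addr_to_numa_node(void *addr)
{
    int node = -1;

    assert(addr != NULL);
    /* get_mempolicy(mode, nodemask, maxnode, addr, flags) */
    if (syscall(SYS_get_mempolicy, &node, NULL, 0UL, addr,
                (unsigned long)(MPOL_F_NODE | MPOL_F_ADDR)) != 0)
        return -1;
    return node;
}
```

A backend would pass the vq pointer it already holds and then allocate
its own per-queue state on the returned node, falling back to any node
when the helper returns -1.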

>
> >   * Live migration
>
> Another important feature to design in from the beginning.

This is being worked on at the moment.

Thanks,
D.