From: Dariusz Stojaczyk <darek.stojaczyk@gmail.com>
To: Stefan Hajnoczi <stefanha@redhat.com>
Cc: dev@dpdk.org, Maxime Coquelin <maxime.coquelin@redhat.com>,
	 Tiwei Bie <tiwei.bie@intel.com>,
	Tetsuya Mukawa <mtetsuyah@gmail.com>,
	 Thomas Monjalon <thomas@monjalon.net>,
	yliu@fridaylinux.org,  James Harris <james.r.harris@intel.com>,
	Tomasz Kulasek <tomaszx.kulasek@intel.com>,
	 "Wodkowski, PawelX" <pawelx.wodkowski@intel.com>
Subject: Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal
Date: Wed, 13 Jun 2018 11:41:06 +0200
Message-ID: <CAH3KmLiWit+g5V=YN9pwBxQ_BgqT72+uDicHNN_m5E+Fg6t96w@mail.gmail.com>
In-Reply-To: <20180608100852.GA31164@stefanha-x1.localdomain>

Hi Stefan,
I'm sorry for the late response. My email client filtered out this
mail. I fixed it just now.

On Fri, 8 Jun 2018 at 15:29, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Jun 07, 2018 at 05:12:20PM +0200, Dariusz Stojaczyk wrote:
> > The proposed structure for the new library is described below.
> >  * rte_vhost2.h
> >    - public API
> >    - registering targets with provided ops
> >    - unregistering targets
> >    - iova_to_vva()
> >  * transport.h/.c
> >    - implements rte_vhost2.h
> >    - allows registering vhost transports, which are opaquely required by the
> >      rte_vhost2.h API (target register function requires transport name).
> >    - exposes a set of callbacks to be implemented by each transport
> >  * vhost_user.c
> >    - vhost-user Unix domain socket transport
>
> This file should be called transport_unix.c or similar so it's clear
> that it only handles UNIX domain socket transport aspects, not general
> vhost-user protocol aspects.  If the distinction is unclear then
> layering violations are easy to make in the future (especially when
> people other than you contribute to the code).

Ack. Also, the virtio-vhost-user transport still has to be placed in
the drivers/ directory, right? We could move most of this library
somewhere into drivers/, leaving only rte_vhost2.h, transport.h and
transport.c in lib/librte_vhost2. What do you think?

>
> >    - does recvmsg()
> >    - uses the underlying vhost-user helper lib to process messages, but still
> >      handles some transport-specific ones, e.g. SET_MEM_TABLE
> >    - calls some of the rte_vhost2.h ops registered with a target
> >  * fd_man.h/.c
> >    - polls provided descriptors, calls user callbacks on fd events
> >    - based on the original rte_vhost version
> >    - additionally allows calling user-provided callbacks on the poll thread
>
> This is general-purpose functionality that should be a core DPDK utility.
>
> Are you sure you cannot use existing (e)poll functionality in DPDK?

We have to use poll() here, and I haven't seen any DPDK APIs for poll(),
only rte_epoll_*. Since received vhost-user messages may be handled
asynchronously, we have to temporarily remove an fd from the poll
group while each message is being handled. We do it by setting that
fd's entry in the polled fds array to -1. The man page for poll(2)
explicitly describes a negative fd as a way to make poll() ignore an
entry.
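
To illustrate the technique, here is a minimal sketch using a plain
pipe instead of a vhost-user socket. The helper names (slot_suspend,
slot_resume, suspend_demo) are made up for this example and are not
part of fd_man or rte_vhost2.

```c
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: stop polling this slot while a message is
 * being handled. Returns the real fd so it can be restored later. */
static int slot_suspend(struct pollfd *slot)
{
	int fd = slot->fd;

	slot->fd = -1;		/* negative fd: poll() ignores this entry */
	slot->revents = 0;
	return fd;
}

/* Hypothetical helper: put the fd back once handling has finished. */
static void slot_resume(struct pollfd *slot, int fd)
{
	slot->fd = fd;
}

/* Returns 0 if a readable fd is reported, then ignored while
 * suspended, then reported again after resuming; 1 if not;
 * -1 if the pipe could not be set up. */
int suspend_demo(void)
{
	int p[2];
	struct pollfd pfd;
	int before, during, after, saved;

	if (pipe(p) != 0)
		return -1;
	if (write(p[1], "x", 1) != 1) {
		close(p[0]);
		close(p[1]);
		return -1;
	}

	pfd.fd = p[0];
	pfd.events = POLLIN;
	pfd.revents = 0;

	before = poll(&pfd, 1, 0);	/* data pending: entry is ready */
	saved = slot_suspend(&pfd);
	during = poll(&pfd, 1, 0);	/* entry ignored: nothing ready */
	slot_resume(&pfd, saved);
	after = poll(&pfd, 1, 0);	/* data still pending: ready again */

	close(p[0]);
	close(p[1]);
	return (before == 1 && during == 0 && after == 1) ? 0 : 1;
}
```

Storing -1 (rather than negating the fd, as the man page suggests)
also works for fd 0, at the cost of having to remember the original
value somewhere.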

>
> >  * vhost.h/.c
> >    - a transport-agnostic vhost-user library
> >    - calls most of the rte_vhost2.h ops registered with a target
> >    - manages virtqueues state
> >    - hopefully to be reused by the virtio-vhost-user
> >    - exposes a set of callbacks to be implemented by each transport
> >      (for e.g. sending message replies)
> >
> > This series includes only vhost-user transport. Virtio-vhost-user
> > is to be done later.
> >
> > The following items are still TBD:
> >   * vhost-user slave channel
> >   * get/set_config
> >   * cfg_call() implementation
> >   * IOTLB
> >   * NUMA awareness
>
> This is important to think about while the API is still being designed.
>
> Some initial thoughts.  NUMA affinity is optimal when:
>
> 1. The vring, indirect descriptor tables, and data buffers are allocated
>    on the same NUMA node.
>
> 2. The virtqueue interrupts go to vcpus associated with the same NUMA
>    node as the vring.
>
> 3. The guest NUMA node corresponds to the host NUMA node of the backing
>    storage device (e.g. NVMe PCI adapter).
>
> This way memory is local to the NVMe PCI adapter on the way down the
> stack when submitting I/O and back up again when completing I/O.
>
> Achieving #1 & #2 is up to the guest drivers.
>
> Achieving #3 is up to virtual machine configuration (QEMU command-line).
>
> The role that DPDK plays in all of this is that each vhost-user
> virtqueue should be polled by a thread that has been placed on the same
> host NUMA node mentioned above for #1, #2, and #3.
>
> Per-virtqueue state should also be allocated on this host NUMA node.

I agree on all points.

> Device backends should be able to query this information so they, too,
> can allocate memory with optimal NUMA affinity.

Of course. Since we offer raw vq pointers, they will be able to call
get_mempolicy(2) on them to find the backing NUMA node. No additional
APIs required.
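
A rough sketch of such a query, assuming Linux and using the raw
syscall so that libnuma is not needed. numa_node_of_addr is a
hypothetical helper name, and the flag values are only defined here
because <numaif.h> may not be installed.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values from the Linux UAPI; normally provided by <numaif.h>. */
#ifndef MPOL_F_NODE
#define MPOL_F_NODE (1 << 0)
#endif
#ifndef MPOL_F_ADDR
#define MPOL_F_ADDR (1 << 1)
#endif

/* Hypothetical helper: returns the NUMA node of the page backing
 * addr (e.g. a vring pointer), or -1 with errno set on failure. */
int numa_node_of_addr(void *addr)
{
	int node = -1;

	/* With MPOL_F_NODE | MPOL_F_ADDR the kernel stores the node ID
	 * of the page containing addr in the "mode" argument. */
	if (syscall(SYS_get_mempolicy, &node, NULL, 0UL, addr,
		    (unsigned long)(MPOL_F_NODE | MPOL_F_ADDR)) != 0)
		return -1;
	return node;
}
```

The call can fail, for example on kernels built without NUMA support
or inside sandboxes that filter this syscall, so a backend would need
a fallback such as not pinning its allocations at all.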

>
> >   * Live migration
>
> Another important feature to design in from the beginning.

This is being worked on at the moment.
Thanks,
D.
