From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <chuckylinchuckylin@gmail.com>
Received: from mail-oi1-f195.google.com (mail-oi1-f195.google.com
 [209.85.167.195]) by dpdk.org (Postfix) with ESMTP id A0ABA1B1EF
 for <dev@dpdk.org>; Mon, 29 Apr 2019 06:07:18 +0200 (CEST)
Received: by mail-oi1-f195.google.com with SMTP id d62so1387583oib.13
 for <dev@dpdk.org>; Sun, 28 Apr 2019 21:07:18 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
 h=mime-version:references:in-reply-to:from:date:message-id:subject:to
 :cc:content-transfer-encoding;
 bh=6Xva+exm39q5vWGssLXPFWlCMSjjPNX4c72V0jBFME8=;
 b=gVB6ig0jc4hxPs4HzuFyrjyM+9dwsQ3g/WgJdw8uaxvyTDwFOQOp/fv4hQ43TfK9fC
 TYQUyZ47ykEBZKRBstNMXcFdc6Q0h/EkgHnkOTxDRbryap1gr2p1VsHaNu7H+KYxw5nN
 7VjCCB4MeuICSH1oeuvNN9V1hd/PLDai4a1+s3vhikY0C1ro4aiG2pvJ9Fx7V5OJJxne
 1wZrt3Dg0g4LnhwbFYlgrT2GznkvO4P2rHde3Dy4vyyEFB/6K6YsWJlrVt3r9m4vJm7v
 xGgOfBCix7+gSjDMyruEdTc4GSFq9dAX2JvBx89jXHzp5y+DK96+Ez6vzApFlYwugOzQ
 h+Vw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc:content-transfer-encoding;
 bh=6Xva+exm39q5vWGssLXPFWlCMSjjPNX4c72V0jBFME8=;
 b=E9lozDc2alfuhHnfi/IokP/XDRy3iOyZPPh4myP5foESjoTI8+b40kF+IwZ5wQ6SF2
 cv2d9hJxP/+QFlAWARdSFHV7qXibP8rMS55FFgvSFdyZAnA7uaEvDEENkA8GicLYnkSD
 vI2kZOOHwaC1ODYIlpATRdOqTA0LXv426+ceALcRGPUhkLXMfJBilsod/DzCDWmIZLjz
 Ia9TN7e8PZrNVhiEY78EEYu//+9PzvuqqD9OftCON5zY7JnNT5Vr2LjmDjXUcdsqpNwr
 wG693gq/JH7dSacVm87uTtBdPOPB+8UF7NHdnYibiIg256RHdUoVIxzgVJfrAIt8dpI8
 PjLA==
X-Gm-Message-State: APjAAAWHWWK5A36enlnl2lCjXjJr2MHeYp9kIoK0keHeYFqMxZZypqzQ
 y2ujUqZDpqRV/8QDaVCcxB0KydfSanNPfKhYukA=
X-Google-Smtp-Source: APXvYqzoY6FGY9MHZ1QbtC6X07ls+15bj40CM1RQIooeslkdg/wn4qMJOEatjXoVVDnN9X47GJaQSddM4LC7X3Vi8F4=
X-Received: by 2002:aca:4108:: with SMTP id o8mr15465392oia.162.1556510837589; 
 Sun, 28 Apr 2019 21:07:17 -0700 (PDT)
MIME-Version: 1.0
References: <1556190633-8099-1-git-send-email-chuckylinchuckylin@gmail.com>
 <1556271621-8594-1-git-send-email-chuckylinchuckylin@gmail.com>
 <20190428111748.GA16470@___>
In-Reply-To: <20190428111748.GA16470@___>
From: lin li <chuckylinchuckylin@gmail.com>
Date: Mon, 29 Apr 2019 12:07:05 +0800
Message-ID: <CAF+hgq2hjNob=Tvz2y6ES4rOXaXKOJstiMpbC9bGh1diMb3wCg@mail.gmail.com>
To: Tiwei Bie <tiwei.bie@intel.com>
Cc: maxime.coquelin@redhat.com, zhihong.wang@intel.com, dev@dpdk.org, 
 dariusz.stojaczyk@intel.com, changpeng.liu@intel.com, 
 james.r.harris@intel.com, lilin24 <lilin24@baidu.com>, Ni Xun <nixun@baidu.com>,
 Zhang Yu <zhangyu31@baidu.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Subject: Re: [dpdk-dev] [PATCH v3] vhost: support inflight share memory
	protocol feature
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Mon, 29 Apr 2019 04:07:19 -0000

Tiwei Bie <tiwei.bie@intel.com> =E4=BA=8E2019=E5=B9=B44=E6=9C=8828=E6=97=A5=
=E5=91=A8=E6=97=A5 =E4=B8=8B=E5=8D=887:18=E5=86=99=E9=81=93=EF=BC=9A
>
> On Fri, Apr 26, 2019 at 05:40:21AM -0400, Li Lin wrote:
> > From: lilin24 <lilin24@baidu.com>
> >
> > This patch introduces two new messages VHOST_USER_GET_INFLIGHT_FD
> > and VHOST_USER_SET_INFLIGHT_FD to support transferring a shared
> > buffer between qemu and backend.
> >
> > Firstly, qemu uses VHOST_USER_GET_INFLIGHT_FD to get the
> > shared buffer from backend. Then qemu should send it back
> > through VHOST_USER_SET_INFLIGHT_FD each time we start vhost-user.
> >
> > This shared buffer is used to process inflight I/O when backend
> > reconnect.
> >
> > Signed-off-by: lilin24 <lilin24@baidu.com>
>
> You need to use your real name here.
>
> > Signed-off-by: Ni Xun <nixun@baidu.com>
> > Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
> >
> > v2:
> > - add set&clr inflight entry function
> > v3:
> > - simplified some function implementations
> >
> > ---
>
> You can put change logs here (i.e. after ---).
>
> >  lib/librte_vhost/rte_vhost.h  |  53 ++++++++++
> >  lib/librte_vhost/vhost.c      |  42 ++++++++
> >  lib/librte_vhost/vhost.h      |  11 ++
> >  lib/librte_vhost/vhost_user.c | 231 ++++++++++++++++++++++++++++++++++=
++++++++
> >  lib/librte_vhost/vhost_user.h |  16 ++-
> >  5 files changed, 351 insertions(+), 2 deletions(-)
> >
> > diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.=
h
> > index d2c0c93f4..bc25591a8 100644
> > --- a/lib/librte_vhost/rte_vhost.h
> > +++ b/lib/librte_vhost/rte_vhost.h
> > @@ -71,6 +71,10 @@ extern "C" {
> >  #define VHOST_USER_PROTOCOL_F_HOST_NOTIFIER 11
> >  #endif
> >
> > +#ifndef VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD
> > +#define VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD 12
> > +#endif
> > +
> >  /** Indicate whether protocol features negotiation is supported. */
> >  #ifndef VHOST_USER_F_PROTOCOL_FEATURES
> >  #define VHOST_USER_F_PROTOCOL_FEATURES       30
> > @@ -98,12 +102,26 @@ struct rte_vhost_memory {
> >       struct rte_vhost_mem_region regions[];
> >  };
> >
> > +typedef struct VhostUserInflightEntry {
> > +     uint8_t inflight;
> > +} VhostUserInflightEntry;
> > +
> > +typedef struct VhostInflightInfo {
> > +     uint16_t version;
> > +     uint16_t last_inflight_io;
> > +     uint16_t used_idx;
> > +     VhostUserInflightEntry desc[0];
> > +} VhostInflightInfo;
>
> Is there any details on above structure? Why does it not match
> QueueRegionSplit or QueueRegionPacked structures described in
> qemu/docs/interop/vhost-user.txt?

Qemu have its vhost-user backend=EF=BC=8Cqemu did the submission of IO in i=
t.
The implementation of dpdk is more general. It is just to mark inflight ent=
ry.
The submission of inflight entry is handle over to different backends.
They have their own ways to handle it, such as spdk.
So there are some differences in data structure.

>
> > +
> >  struct rte_vhost_vring {
> >       struct vring_desc       *desc;
> >       struct vring_avail      *avail;
> >       struct vring_used       *used;
> >       uint64_t                log_guest_addr;
> >
> > +     VhostInflightInfo       *inflight;
> > +     int                inflight_flag;
> > +
> >       /** Deprecated, use rte_vhost_vring_call() instead. */
> >       int                     callfd;
> >
> > @@ -632,6 +650,41 @@ int rte_vhost_get_vhost_vring(int vid, uint16_t vr=
ing_idx,
> >  int rte_vhost_vring_call(int vid, uint16_t vring_idx);
> >
> >  /**
> > + * set inflight flag for a entry.
> > + *
> > + * @param vring
> > + *  the structure to hold the requested vring info
> > + * @param idx
> > + *  inflight entry index
> > + */
> > +void rte_vhost_set_inflight(struct rte_vhost_vring *vring,
> > +             uint16_t idx);
> > +
> > +/**
> > + * clear inflight flag for a entry.
> > + *
> > + * @param vring
> > + *  the structure to hold the requested vring info
> > + * @param last_used_idx
> > + *  next free used_idx
> > + * @param idx
> > + *  inflight entry index
> > + */
> > +void rte_vhost_clr_inflight(struct rte_vhost_vring *vring,
> > +             uint16_t last_used_idx, uint16_t idx);
> > +
> > +/**
> > + * set last inflight io index.
> > + *
> > + * @param vring
> > + *  the structure to hold the requested vring info
> > + * @param idx
> > + *  inflight entry index
> > + */
> > +void rte_vhost_set_last_inflight_io(struct rte_vhost_vring *vring,
> > +             uint16_t idx);
>
> New APIs should be experimental and also need to be
> added in rte_vhost_version.map file.
>
> > +
> > +/**
> >   * Get vhost RX queue avail count.
> >   *
> >   * @param vid
> > diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
> > index 163f4595e..9ba692935 100644
> > --- a/lib/librte_vhost/vhost.c
> > +++ b/lib/librte_vhost/vhost.c
> > @@ -76,6 +76,8 @@ cleanup_vq(struct vhost_virtqueue *vq, int destroy)
> >               close(vq->callfd);
> >       if (vq->kickfd >=3D 0)
> >               close(vq->kickfd);
> > +     if (vq->inflight)
> > +             vq->inflight =3D NULL;
> >  }
> >
> >  /*
> > @@ -589,6 +591,11 @@ rte_vhost_get_vhost_vring(int vid, uint16_t vring_=
idx,
> >       vring->kickfd  =3D vq->kickfd;
> >       vring->size    =3D vq->size;
> >
> > +     vring->inflight =3D vq->inflight;
> > +
> > +     vring->inflight_flag =3D vq->inflight_flag;
> > +     vq->inflight_flag =3D 0;
>
> This will break the ABI. Better to introduce a new API to do this.
>
> > +
> >       return 0;
> >  }
> >
> > @@ -617,6 +624,41 @@ rte_vhost_vring_call(int vid, uint16_t vring_idx)
> >       return 0;
> >  }
> >
> > +void
> > +rte_vhost_set_inflight(struct rte_vhost_vring *vring, uint16_t idx)
> > +{
> > +     VhostInflightInfo *inflight =3D vring->inflight;
> > +     if (unlikely(!inflight))
> > +             return;
> > +     inflight->desc[idx].inflight =3D 1;
> > +}
> > +
> > +void
> > +rte_vhost_clr_inflight(struct rte_vhost_vring *vring,
> > +     uint16_t last_used_idx, uint16_t idx)
> > +{
> > +     VhostInflightInfo *inflight =3D vring->inflight;
> > +
> > +     if (unlikely(!inflight))
> > +             return;
> > +
> > +     rte_compiler_barrier();
> > +     inflight->desc[idx].inflight =3D 0;
> > +     rte_compiler_barrier();
> > +     inflight->used_idx =3D last_used_idx;
> > +}
> > +
> > +void
> > +rte_vhost_set_last_inflight_io(struct rte_vhost_vring *vring, uint16_t=
 idx)
> > +{
> > +     VhostInflightInfo *inflight =3D vring->inflight;
> > +
> > +     if (unlikely(!inflight))
> > +             return;
> > +
> > +     inflight->last_inflight_io =3D idx;
> > +}
>
> The rte_vhost_vring is used to return information to apps.
>
> IIUC, you want to update vq->inflight. So the rte_vhost_vring
> param should be replaced by vid + vring_idx.
>
> > +
> >  uint16_t
> >  rte_vhost_avail_entries(int vid, uint16_t queue_id)
> >  {
> > diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> > index e9138dfab..537d74c71 100644
> > --- a/lib/librte_vhost/vhost.h
> > +++ b/lib/librte_vhost/vhost.h
> > @@ -128,6 +128,10 @@ struct vhost_virtqueue {
> >       /* Physical address of used ring, for logging */
> >       uint64_t                log_guest_addr;
> >
> > +     /* Inflight share memory info */
> > +     VhostInflightInfo       *inflight;
> > +     bool                inflight_flag;
> > +
> >       uint16_t                nr_zmbuf;
> >       uint16_t                zmbuf_size;
> >       uint16_t                last_zmbuf_idx;
> > @@ -286,6 +290,12 @@ struct guest_page {
> >       uint64_t size;
> >  };
> >
> > +typedef struct VuDevInflightInfo {
> > +     int fd;
> > +     void *addr;
> > +     uint64_t size;
> > +} VuDevInflightInfo;
>
> Please follow DPDK's coding style when defining internal structure.
>
> > +
> >  /**
> >   * Device structure contains all configuration information relating
> >   * to the device.
> > @@ -303,6 +313,7 @@ struct virtio_net {
> >       uint32_t                nr_vring;
> >       int                     dequeue_zero_copy;
> >       struct vhost_virtqueue  *virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
> > +     VuDevInflightInfo inflight_info;
> >  #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
> >       char                    ifname[IF_NAME_SZ];
> >       uint64_t                log_size;
> > diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_use=
r.c
> [...]
> > +static uint32_t get_pervq_shm_size(uint16_t queue_size)
> > +{
> > +     return ALIGN_UP(sizeof(VhostUserInflightEntry) * queue_size +
> > +             sizeof(uint16_t) * 3, INFLIGHT_ALIGNMENT);
> > +}
> > +
> > +static int
> > +vhost_user_get_inflight_fd(struct virtio_net **pdev, VhostUserMsg *msg=
,
> > +             int main_fd __rte_unused)
> > +{
> > +     int fd;
> > +     uint64_t mmap_size;
> > +     void *addr;
> > +     uint16_t num_queues, queue_size;
> > +     struct virtio_net *dev =3D *pdev;
> > +
> > +     if (msg->size !=3D sizeof(msg->payload.inflight)) {
> > +             RTE_LOG(ERR, VHOST_CONFIG,
> > +                     "Invalid get_inflight_fd message size is %d",
> > +                     msg->size);
> > +             msg->payload.inflight.mmap_size =3D 0;
>
> In this case, you can't touch the payload.
>
> > +             return 0;
>
> And an error should be returned.
>
> > +     }
> > +
> > +     num_queues =3D msg->payload.inflight.num_queues;
> > +     queue_size =3D msg->payload.inflight.queue_size;
> > +
> > +     RTE_LOG(INFO, VHOST_CONFIG, "get_inflight_fd num_queues: %u\n",
> > +                     msg->payload.inflight.num_queues);
> > +     RTE_LOG(INFO, VHOST_CONFIG, "get_inflight_fd queue_size: %u\n",
> > +                     msg->payload.inflight.queue_size);
> > +     mmap_size =3D num_queues * get_pervq_shm_size(queue_size);
> > +
> > +     addr =3D inflight_mem_alloc("vhost-inflight", mmap_size, &fd);
> > +     if (!addr) {
> > +             RTE_LOG(ERR, VHOST_CONFIG, "Failed to alloc vhost infligh=
t area");
> > +                     msg->payload.inflight.mmap_size =3D 0;
> > +             return 0;
>
> Should always return RTE_VHOST_MSG_RESULT_* in
> message handler.
>
> > +     }
> > +     memset(addr, 0, mmap_size);
> > +
> > +     dev->inflight_info.addr =3D addr;
> > +     dev->inflight_info.size =3D msg->payload.inflight.mmap_size =3D m=
map_size;
> > +     dev->inflight_info.fd =3D msg->fds[0] =3D fd;
> > +     msg->payload.inflight.mmap_offset =3D 0;
> > +     msg->fd_num =3D 1;
> > +
> > +     RTE_LOG(INFO, VHOST_CONFIG,
> > +                     "send inflight mmap_size: %lu\n",
> > +                     msg->payload.inflight.mmap_size);
> > +     RTE_LOG(INFO, VHOST_CONFIG,
> > +                     "send inflight mmap_offset: %lu\n",
> > +                     msg->payload.inflight.mmap_offset);
> > +     RTE_LOG(INFO, VHOST_CONFIG,
> > +                     "send inflight fd: %d\n", msg->fds[0]);
> > +
> > +     return RTE_VHOST_MSG_RESULT_REPLY;
> > +}
> > +
> > +static int
> > +vhost_user_set_inflight_fd(struct virtio_net **pdev, VhostUserMsg *msg=
,
> > +             int main_fd __rte_unused)
> > +{
> > +     int fd, i;
> > +     uint64_t mmap_size, mmap_offset;
> > +     uint16_t num_queues, queue_size;
> > +     uint32_t pervq_inflight_size;
> > +     void *rc;
> > +     struct vhost_virtqueue *vq;
> > +     struct virtio_net *dev =3D *pdev;
> > +
> > +     fd =3D msg->fds[0];
> > +     if (msg->size !=3D sizeof(msg->payload.inflight) || fd < 0) {
> > +             RTE_LOG(ERR, VHOST_CONFIG, "Invalid set_inflight_fd messa=
ge size is %d,fd is %d\n",
> > +                     msg->size, fd);
> > +             return -1;
>
> Ditto.
>
> > +     }
> > +
> > +     mmap_size =3D msg->payload.inflight.mmap_size;
> > +     mmap_offset =3D msg->payload.inflight.mmap_offset;
> > +     num_queues =3D msg->payload.inflight.num_queues;
> > +     queue_size =3D msg->payload.inflight.queue_size;
> > +     pervq_inflight_size =3D get_pervq_shm_size(queue_size);
> > +
> > +     RTE_LOG(INFO, VHOST_CONFIG,
> > +             "set_inflight_fd mmap_size: %lu\n", mmap_size);
> > +     RTE_LOG(INFO, VHOST_CONFIG,
> > +             "set_inflight_fd mmap_offset: %lu\n", mmap_offset);
> > +     RTE_LOG(INFO, VHOST_CONFIG,
> > +             "set_inflight_fd num_queues: %u\n", num_queues);
> > +     RTE_LOG(INFO, VHOST_CONFIG,
> > +             "set_inflight_fd queue_size: %u\n", queue_size);
> > +     RTE_LOG(INFO, VHOST_CONFIG,
> > +             "set_inflight_fd fd: %d\n", fd);
> > +     RTE_LOG(INFO, VHOST_CONFIG,
> > +             "set_inflight_fd pervq_inflight_size: %d\n",
> > +             pervq_inflight_size);
> > +
> > +     if (dev->inflight_info.addr)
> > +             munmap(dev->inflight_info.addr, dev->inflight_info.size);
> > +
> > +     rc =3D mmap(0, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED,
> > +                     fd, mmap_offset);
>
> Why call it rc? Maybe addr is a better name?

In some scenarios, shared memory is reallocated or resized by qemu, so
again mmap is needed.

>
> > +     if (rc =3D=3D MAP_FAILED) {
> > +             RTE_LOG(ERR, VHOST_CONFIG, "failed to mmap share memory.\=
n");
> > +             return -1;
>
> Should always return RTE_VHOST_MSG_RESULT_* in
> message handler.
>
> > +     }
> > +
> > +     if (dev->inflight_info.fd)
> > +             close(dev->inflight_info.fd);
> > +
> > +     dev->inflight_info.fd =3D fd;
> > +     dev->inflight_info.addr =3D rc;
> > +     dev->inflight_info.size =3D mmap_size;
> > +
> > +     for (i =3D 0; i < num_queues; i++) {
> > +             vq =3D dev->virtqueue[i];
> > +             vq->inflight =3D (VhostInflightInfo *)rc;
> > +             rc =3D (void *)((char *)rc + pervq_inflight_size);
> > +     }
> > +
> > +     return RTE_VHOST_MSG_RESULT_OK;
> > +}
> > +
> >  static int
> >  vhost_user_set_vring_call(struct virtio_net **pdev, struct VhostUserMs=
g *msg,
> >                       int main_fd __rte_unused)
> > @@ -1202,6 +1402,29 @@ static int vhost_user_set_vring_err(struct virti=
o_net **pdev __rte_unused,
> >  }
> >
> >  static int
> > +vhost_check_queue_inflights(struct vhost_virtqueue *vq)
> > +{
> > +     uint16_t i =3D 0;
> > +
> > +     if ((!vq->inflight))
> > +             return RTE_VHOST_MSG_RESULT_ERR;
>
> Should check whether the feature is negotiated first.
>
> > +
> > +     if (!vq->inflight->version) {
> > +             vq->inflight->version =3D INFLIGHT_VERSION;
> > +             return RTE_VHOST_MSG_RESULT_OK;
> > +     }
> > +
> > +     for (i =3D 0; i < vq->size; i++) {
> > +             if (vq->inflight->desc[i].inflight =3D=3D 1) {
> > +                     vq->inflight_flag =3D 1;
> > +                     break;
> > +             }
> > +     }
> > +
> > +     return RTE_VHOST_MSG_RESULT_OK;
> > +}
> > +
> > +static int
> >  vhost_user_set_vring_kick(struct virtio_net **pdev, struct VhostUserMs=
g *msg,
> >                       int main_fd __rte_unused)
> >  {
> > @@ -1242,6 +1465,12 @@ vhost_user_set_vring_kick(struct virtio_net **pd=
ev, struct VhostUserMsg *msg,
> >               close(vq->kickfd);
> >       vq->kickfd =3D file.fd;
> >
> > +     if (vhost_check_queue_inflights(vq)) {
> > +             RTE_LOG(ERR, VHOST_CONFIG,
> > +                     "Failed to inflights for vq: %d\n", file.index);
> > +             return RTE_VHOST_MSG_RESULT_ERR;
> > +     }
> > +
> >       return RTE_VHOST_MSG_RESULT_OK;
> >  }
> >
> > @@ -1762,6 +1991,8 @@ static vhost_message_handler_t vhost_message_hand=
lers[VHOST_USER_MAX] =3D {
> >       [VHOST_USER_POSTCOPY_ADVISE] =3D vhost_user_set_postcopy_advise,
> >       [VHOST_USER_POSTCOPY_LISTEN] =3D vhost_user_set_postcopy_listen,
> >       [VHOST_USER_POSTCOPY_END] =3D vhost_user_postcopy_end,
> > +     [VHOST_USER_GET_INFLIGHT_FD] =3D vhost_user_get_inflight_fd,
> > +     [VHOST_USER_SET_INFLIGHT_FD] =3D vhost_user_set_inflight_fd,
> >  };
> >
> >
> > diff --git a/lib/librte_vhost/vhost_user.h b/lib/librte_vhost/vhost_use=
r.h
> > index 2a650fe4b..35f969b1b 100644
> > --- a/lib/librte_vhost/vhost_user.h
> > +++ b/lib/librte_vhost/vhost_user.h
> > @@ -23,7 +23,8 @@
> >                                        (1ULL << VHOST_USER_PROTOCOL_F_C=
RYPTO_SESSION) | \
> >                                        (1ULL << VHOST_USER_PROTOCOL_F_S=
LAVE_SEND_FD) | \
> >                                        (1ULL << VHOST_USER_PROTOCOL_F_H=
OST_NOTIFIER) | \
> > -                                      (1ULL << VHOST_USER_PROTOCOL_F_P=
AGEFAULT))
> > +                                     (1ULL << VHOST_USER_PROTOCOL_F_PA=
GEFAULT) | \
> > +                                     (1ULL << VHOST_USER_PROTOCOL_F_IN=
FLIGHT_SHMFD))
>
> It will advertise this feature for builtin net and crypto
> backends. It's probably not what you intended.
>

Indeed, this feature is mainly used for spdk-like backends. You mean
this function is disabled by default=EF=BC=9F

> >
> >  typedef enum VhostUserRequest {
> >       VHOST_USER_NONE =3D 0,
> > @@ -54,7 +55,9 @@ typedef enum VhostUserRequest {
> >       VHOST_USER_POSTCOPY_ADVISE =3D 28,
> >       VHOST_USER_POSTCOPY_LISTEN =3D 29,
> >       VHOST_USER_POSTCOPY_END =3D 30,
> > -     VHOST_USER_MAX =3D 31
> > +     VHOST_USER_GET_INFLIGHT_FD =3D 31,
> > +     VHOST_USER_SET_INFLIGHT_FD =3D 32,
> > +     VHOST_USER_MAX =3D 33
> >  } VhostUserRequest;
> >
> >  typedef enum VhostUserSlaveRequest {
> > @@ -112,6 +115,13 @@ typedef struct VhostUserVringArea {
> >       uint64_t offset;
> >  } VhostUserVringArea;
> >
> > +typedef struct VhostUserInflight {
> > +     uint64_t mmap_size;
> > +     uint64_t mmap_offset;
> > +     uint16_t num_queues;
> > +     uint16_t queue_size;
> > +} VhostUserInflight;
> > +
> >  typedef struct VhostUserMsg {
> >       union {
> >               uint32_t master; /* a VhostUserRequest value */
> > @@ -131,6 +141,7 @@ typedef struct VhostUserMsg {
> >               struct vhost_vring_addr addr;
> >               VhostUserMemory memory;
> >               VhostUserLog    log;
> > +             VhostUserInflight inflight;
> >               struct vhost_iotlb_msg iotlb;
> >               VhostUserCryptoSessionParam crypto_session;
> >               VhostUserVringArea area;
> > @@ -148,6 +159,7 @@ typedef struct VhostUserMsg {
> >  /* vhost_user.c */
> >  int vhost_user_msg_handler(int vid, int fd);
> >  int vhost_user_iotlb_miss(struct virtio_net *dev, uint64_t iova, uint8=
_t perm);
> > +void *inflight_mem_alloc(const char *name, size_t size, int *fd);
> >
> >  /* socket.c */
> >  int read_fd_message(int sockfd, char *buf, int buflen, int *fds, int m=
ax_fds,
> > --
> > 2.11.0
> >

Thank you for your comments. They will be revised in the next version.

From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dev-bounces@dpdk.org>
Received: from dpdk.org (dpdk.org [92.243.14.124])
	by dpdk.space (Postfix) with ESMTP id 51823A0679
	for <public@inbox.dpdk.org>; Mon, 29 Apr 2019 06:07:21 +0200 (CEST)
Received: from [92.243.14.124] (localhost [127.0.0.1])
	by dpdk.org (Postfix) with ESMTP id 66DD51B1F0;
	Mon, 29 Apr 2019 06:07:20 +0200 (CEST)
Received: from mail-oi1-f195.google.com (mail-oi1-f195.google.com
 [209.85.167.195]) by dpdk.org (Postfix) with ESMTP id A0ABA1B1EF
 for <dev@dpdk.org>; Mon, 29 Apr 2019 06:07:18 +0200 (CEST)
Received: by mail-oi1-f195.google.com with SMTP id d62so1387583oib.13
 for <dev@dpdk.org>; Sun, 28 Apr 2019 21:07:18 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
 h=mime-version:references:in-reply-to:from:date:message-id:subject:to
 :cc:content-transfer-encoding;
 bh=6Xva+exm39q5vWGssLXPFWlCMSjjPNX4c72V0jBFME8=;
 b=gVB6ig0jc4hxPs4HzuFyrjyM+9dwsQ3g/WgJdw8uaxvyTDwFOQOp/fv4hQ43TfK9fC
 TYQUyZ47ykEBZKRBstNMXcFdc6Q0h/EkgHnkOTxDRbryap1gr2p1VsHaNu7H+KYxw5nN
 7VjCCB4MeuICSH1oeuvNN9V1hd/PLDai4a1+s3vhikY0C1ro4aiG2pvJ9Fx7V5OJJxne
 1wZrt3Dg0g4LnhwbFYlgrT2GznkvO4P2rHde3Dy4vyyEFB/6K6YsWJlrVt3r9m4vJm7v
 xGgOfBCix7+gSjDMyruEdTc4GSFq9dAX2JvBx89jXHzp5y+DK96+Ez6vzApFlYwugOzQ
 h+Vw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:references:in-reply-to:from:date
 :message-id:subject:to:cc:content-transfer-encoding;
 bh=6Xva+exm39q5vWGssLXPFWlCMSjjPNX4c72V0jBFME8=;
 b=E9lozDc2alfuhHnfi/IokP/XDRy3iOyZPPh4myP5foESjoTI8+b40kF+IwZ5wQ6SF2
 cv2d9hJxP/+QFlAWARdSFHV7qXibP8rMS55FFgvSFdyZAnA7uaEvDEENkA8GicLYnkSD
 vI2kZOOHwaC1ODYIlpATRdOqTA0LXv426+ceALcRGPUhkLXMfJBilsod/DzCDWmIZLjz
 Ia9TN7e8PZrNVhiEY78EEYu//+9PzvuqqD9OftCON5zY7JnNT5Vr2LjmDjXUcdsqpNwr
 wG693gq/JH7dSacVm87uTtBdPOPB+8UF7NHdnYibiIg256RHdUoVIxzgVJfrAIt8dpI8
 PjLA==
X-Gm-Message-State: APjAAAWHWWK5A36enlnl2lCjXjJr2MHeYp9kIoK0keHeYFqMxZZypqzQ
 y2ujUqZDpqRV/8QDaVCcxB0KydfSanNPfKhYukA=
X-Google-Smtp-Source: APXvYqzoY6FGY9MHZ1QbtC6X07ls+15bj40CM1RQIooeslkdg/wn4qMJOEatjXoVVDnN9X47GJaQSddM4LC7X3Vi8F4=
X-Received: by 2002:aca:4108:: with SMTP id o8mr15465392oia.162.1556510837589; 
 Sun, 28 Apr 2019 21:07:17 -0700 (PDT)
MIME-Version: 1.0
References: <1556190633-8099-1-git-send-email-chuckylinchuckylin@gmail.com>
 <1556271621-8594-1-git-send-email-chuckylinchuckylin@gmail.com>
 <20190428111748.GA16470@___>
In-Reply-To: <20190428111748.GA16470@___>
From: lin li <chuckylinchuckylin@gmail.com>
Date: Mon, 29 Apr 2019 12:07:05 +0800
Message-ID:
 <CAF+hgq2hjNob=Tvz2y6ES4rOXaXKOJstiMpbC9bGh1diMb3wCg@mail.gmail.com>
To: Tiwei Bie <tiwei.bie@intel.com>
Cc: maxime.coquelin@redhat.com, zhihong.wang@intel.com, dev@dpdk.org, 
 dariusz.stojaczyk@intel.com, changpeng.liu@intel.com, 
 james.r.harris@intel.com, lilin24 <lilin24@baidu.com>, Ni Xun <nixun@baidu.com>,
 Zhang Yu <zhangyu31@baidu.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Subject: Re: [dpdk-dev] [PATCH v3] vhost: support inflight share memory
	protocol feature
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org
Sender: "dev" <dev-bounces@dpdk.org>
Message-ID: <20190429040705.7jjSJege98zNEHM8or1MwJtUa4GX4opBs326rfuxeYM@z>

Tiwei Bie <tiwei.bie@intel.com> =E4=BA=8E2019=E5=B9=B44=E6=9C=8828=E6=97=A5=
=E5=91=A8=E6=97=A5 =E4=B8=8B=E5=8D=887:18=E5=86=99=E9=81=93=EF=BC=9A
>
> On Fri, Apr 26, 2019 at 05:40:21AM -0400, Li Lin wrote:
> > From: lilin24 <lilin24@baidu.com>
> >
> > This patch introduces two new messages VHOST_USER_GET_INFLIGHT_FD
> > and VHOST_USER_SET_INFLIGHT_FD to support transferring a shared
> > buffer between qemu and backend.
> >
> > Firstly, qemu uses VHOST_USER_GET_INFLIGHT_FD to get the
> > shared buffer from backend. Then qemu should send it back
> > through VHOST_USER_SET_INFLIGHT_FD each time we start vhost-user.
> >
> > This shared buffer is used to process inflight I/O when backend
> > reconnect.
> >
> > Signed-off-by: lilin24 <lilin24@baidu.com>
>
> You need to use your real name here.
>
> > Signed-off-by: Ni Xun <nixun@baidu.com>
> > Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
> >
> > v2:
> > - add set&clr inflight entry function
> > v3:
> > - simplified some function implementations
> >
> > ---
>
> You can put change logs here (i.e. after ---).
>
> >  lib/librte_vhost/rte_vhost.h  |  53 ++++++++++
> >  lib/librte_vhost/vhost.c      |  42 ++++++++
> >  lib/librte_vhost/vhost.h      |  11 ++
> >  lib/librte_vhost/vhost_user.c | 231 ++++++++++++++++++++++++++++++++++=
++++++++
> >  lib/librte_vhost/vhost_user.h |  16 ++-
> >  5 files changed, 351 insertions(+), 2 deletions(-)
> >
> > diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.=
h
> > index d2c0c93f4..bc25591a8 100644
> > --- a/lib/librte_vhost/rte_vhost.h
> > +++ b/lib/librte_vhost/rte_vhost.h
> > @@ -71,6 +71,10 @@ extern "C" {
> >  #define VHOST_USER_PROTOCOL_F_HOST_NOTIFIER 11
> >  #endif
> >
> > +#ifndef VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD
> > +#define VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD 12
> > +#endif
> > +
> >  /** Indicate whether protocol features negotiation is supported. */
> >  #ifndef VHOST_USER_F_PROTOCOL_FEATURES
> >  #define VHOST_USER_F_PROTOCOL_FEATURES       30
> > @@ -98,12 +102,26 @@ struct rte_vhost_memory {
> >       struct rte_vhost_mem_region regions[];
> >  };
> >
> > +typedef struct VhostUserInflightEntry {
> > +     uint8_t inflight;
> > +} VhostUserInflightEntry;
> > +
> > +typedef struct VhostInflightInfo {
> > +     uint16_t version;
> > +     uint16_t last_inflight_io;
> > +     uint16_t used_idx;
> > +     VhostUserInflightEntry desc[0];
> > +} VhostInflightInfo;
>
> Is there any details on above structure? Why does it not match
> QueueRegionSplit or QueueRegionPacked structures described in
> qemu/docs/interop/vhost-user.txt?

Qemu have its vhost-user backend=EF=BC=8Cqemu did the submission of IO in i=
t.
The implementation of dpdk is more general. It is just to mark inflight ent=
ry.
The submission of inflight entry is handle over to different backends.
They have their own ways to handle it, such as spdk.
So there are some differences in data structure.

>
> > +
> >  struct rte_vhost_vring {
> >       struct vring_desc       *desc;
> >       struct vring_avail      *avail;
> >       struct vring_used       *used;
> >       uint64_t                log_guest_addr;
> >
> > +     VhostInflightInfo       *inflight;
> > +     int                inflight_flag;
> > +
> >       /** Deprecated, use rte_vhost_vring_call() instead. */
> >       int                     callfd;
> >
> > @@ -632,6 +650,41 @@ int rte_vhost_get_vhost_vring(int vid, uint16_t vr=
ing_idx,
> >  int rte_vhost_vring_call(int vid, uint16_t vring_idx);
> >
> >  /**
> > + * set inflight flag for a entry.
> > + *
> > + * @param vring
> > + *  the structure to hold the requested vring info
> > + * @param idx
> > + *  inflight entry index
> > + */
> > +void rte_vhost_set_inflight(struct rte_vhost_vring *vring,
> > +             uint16_t idx);
> > +
> > +/**
> > + * clear inflight flag for a entry.
> > + *
> > + * @param vring
> > + *  the structure to hold the requested vring info
> > + * @param last_used_idx
> > + *  next free used_idx
> > + * @param idx
> > + *  inflight entry index
> > + */
> > +void rte_vhost_clr_inflight(struct rte_vhost_vring *vring,
> > +             uint16_t last_used_idx, uint16_t idx);
> > +
> > +/**
> > + * set last inflight io index.
> > + *
> > + * @param vring
> > + *  the structure to hold the requested vring info
> > + * @param idx
> > + *  inflight entry index
> > + */
> > +void rte_vhost_set_last_inflight_io(struct rte_vhost_vring *vring,
> > +             uint16_t idx);
>
> New APIs should be experimental and also need to be
> added in rte_vhost_version.map file.
>
> > +
> > +/**
> >   * Get vhost RX queue avail count.
> >   *
> >   * @param vid
> > diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
> > index 163f4595e..9ba692935 100644
> > --- a/lib/librte_vhost/vhost.c
> > +++ b/lib/librte_vhost/vhost.c
> > @@ -76,6 +76,8 @@ cleanup_vq(struct vhost_virtqueue *vq, int destroy)
> >               close(vq->callfd);
> >       if (vq->kickfd >=3D 0)
> >               close(vq->kickfd);
> > +     if (vq->inflight)
> > +             vq->inflight =3D NULL;
> >  }
> >
> >  /*
> > @@ -589,6 +591,11 @@ rte_vhost_get_vhost_vring(int vid, uint16_t vring_=
idx,
> >       vring->kickfd  =3D vq->kickfd;
> >       vring->size    =3D vq->size;
> >
> > +     vring->inflight =3D vq->inflight;
> > +
> > +     vring->inflight_flag =3D vq->inflight_flag;
> > +     vq->inflight_flag =3D 0;
>
> This will break the ABI. Better to introduce a new API to do this.
>
> > +
> >       return 0;
> >  }
> >
> > @@ -617,6 +624,41 @@ rte_vhost_vring_call(int vid, uint16_t vring_idx)
> >       return 0;
> >  }
> >
> > +void
> > +rte_vhost_set_inflight(struct rte_vhost_vring *vring, uint16_t idx)
> > +{
> > +     VhostInflightInfo *inflight =3D vring->inflight;
> > +     if (unlikely(!inflight))
> > +             return;
> > +     inflight->desc[idx].inflight =3D 1;
> > +}
> > +
> > +void
> > +rte_vhost_clr_inflight(struct rte_vhost_vring *vring,
> > +     uint16_t last_used_idx, uint16_t idx)
> > +{
> > +     VhostInflightInfo *inflight =3D vring->inflight;
> > +
> > +     if (unlikely(!inflight))
> > +             return;
> > +
> > +     rte_compiler_barrier();
> > +     inflight->desc[idx].inflight =3D 0;
> > +     rte_compiler_barrier();
> > +     inflight->used_idx =3D last_used_idx;
> > +}
> > +
> > +void
> > +rte_vhost_set_last_inflight_io(struct rte_vhost_vring *vring, uint16_t=
 idx)
> > +{
> > +     VhostInflightInfo *inflight =3D vring->inflight;
> > +
> > +     if (unlikely(!inflight))
> > +             return;
> > +
> > +     inflight->last_inflight_io =3D idx;
> > +}
>
> The rte_vhost_vring is used to return information to apps.
>
> IIUC, you want to update vq->inflight. So the rte_vhost_vring
> param should be replaced by vid + vring_idx.
>
> > +
> >  uint16_t
> >  rte_vhost_avail_entries(int vid, uint16_t queue_id)
> >  {
> > diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> > index e9138dfab..537d74c71 100644
> > --- a/lib/librte_vhost/vhost.h
> > +++ b/lib/librte_vhost/vhost.h
> > @@ -128,6 +128,10 @@ struct vhost_virtqueue {
> >       /* Physical address of used ring, for logging */
> >       uint64_t                log_guest_addr;
> >
> > +     /* Inflight share memory info */
> > +     VhostInflightInfo       *inflight;
> > +     bool                inflight_flag;
> > +
> >       uint16_t                nr_zmbuf;
> >       uint16_t                zmbuf_size;
> >       uint16_t                last_zmbuf_idx;
> > @@ -286,6 +290,12 @@ struct guest_page {
> >       uint64_t size;
> >  };
> >
> > +typedef struct VuDevInflightInfo {
> > +     int fd;
> > +     void *addr;
> > +     uint64_t size;
> > +} VuDevInflightInfo;
>
> Please follow DPDK's coding style when defining internal structure.
>
> > +
> >  /**
> >   * Device structure contains all configuration information relating
> >   * to the device.
> > @@ -303,6 +313,7 @@ struct virtio_net {
> >       uint32_t                nr_vring;
> >       int                     dequeue_zero_copy;
> >       struct vhost_virtqueue  *virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
> > +     VuDevInflightInfo inflight_info;
> >  #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
> >       char                    ifname[IF_NAME_SZ];
> >       uint64_t                log_size;
> > diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_use=
r.c
> [...]
> > +static uint32_t get_pervq_shm_size(uint16_t queue_size)
> > +{
> > +     return ALIGN_UP(sizeof(VhostUserInflightEntry) * queue_size +
> > +             sizeof(uint16_t) * 3, INFLIGHT_ALIGNMENT);
> > +}
> > +
> > +static int
> > +vhost_user_get_inflight_fd(struct virtio_net **pdev, VhostUserMsg *msg=
,
> > +             int main_fd __rte_unused)
> > +{
> > +     int fd;
> > +     uint64_t mmap_size;
> > +     void *addr;
> > +     uint16_t num_queues, queue_size;
> > +     struct virtio_net *dev =3D *pdev;
> > +
> > +     if (msg->size !=3D sizeof(msg->payload.inflight)) {
> > +             RTE_LOG(ERR, VHOST_CONFIG,
> > +                     "Invalid get_inflight_fd message size is %d",
> > +                     msg->size);
> > +             msg->payload.inflight.mmap_size =3D 0;
>
> In this case, you can't touch the payload.
>
> > +             return 0;
>
> And an error should be returned.
>
> > +     }
> > +
> > +     num_queues =3D msg->payload.inflight.num_queues;
> > +     queue_size =3D msg->payload.inflight.queue_size;
> > +
> > +     RTE_LOG(INFO, VHOST_CONFIG, "get_inflight_fd num_queues: %u\n",
> > +                     msg->payload.inflight.num_queues);
> > +     RTE_LOG(INFO, VHOST_CONFIG, "get_inflight_fd queue_size: %u\n",
> > +                     msg->payload.inflight.queue_size);
> > +     mmap_size =3D num_queues * get_pervq_shm_size(queue_size);
> > +
> > +     addr =3D inflight_mem_alloc("vhost-inflight", mmap_size, &fd);
> > +     if (!addr) {
> > +             RTE_LOG(ERR, VHOST_CONFIG, "Failed to alloc vhost infligh=
t area");
> > +                     msg->payload.inflight.mmap_size =3D 0;
> > +             return 0;
>
> Should always return RTE_VHOST_MSG_RESULT_* in
> message handler.
>
> > +     }
> > +     memset(addr, 0, mmap_size);
> > +
> > +     dev->inflight_info.addr =3D addr;
> > +     dev->inflight_info.size =3D msg->payload.inflight.mmap_size =3D m=
map_size;
> > +     dev->inflight_info.fd =3D msg->fds[0] =3D fd;
> > +     msg->payload.inflight.mmap_offset =3D 0;
> > +     msg->fd_num =3D 1;
> > +
> > +     RTE_LOG(INFO, VHOST_CONFIG,
> > +                     "send inflight mmap_size: %lu\n",
> > +                     msg->payload.inflight.mmap_size);
> > +     RTE_LOG(INFO, VHOST_CONFIG,
> > +                     "send inflight mmap_offset: %lu\n",
> > +                     msg->payload.inflight.mmap_offset);
> > +     RTE_LOG(INFO, VHOST_CONFIG,
> > +                     "send inflight fd: %d\n", msg->fds[0]);
> > +
> > +     return RTE_VHOST_MSG_RESULT_REPLY;
> > +}
> > +
> > +static int
> > +vhost_user_set_inflight_fd(struct virtio_net **pdev, VhostUserMsg *msg=
,
> > +             int main_fd __rte_unused)
> > +{
> > +     int fd, i;
> > +     uint64_t mmap_size, mmap_offset;
> > +     uint16_t num_queues, queue_size;
> > +     uint32_t pervq_inflight_size;
> > +     void *rc;
> > +     struct vhost_virtqueue *vq;
> > +     struct virtio_net *dev =3D *pdev;
> > +
> > +     fd =3D msg->fds[0];
> > +     if (msg->size !=3D sizeof(msg->payload.inflight) || fd < 0) {
> > +             RTE_LOG(ERR, VHOST_CONFIG, "Invalid set_inflight_fd messa=
ge size is %d,fd is %d\n",
> > +                     msg->size, fd);
> > +             return -1;
>
> Ditto.
>
> > +     }
> > +
> > +     mmap_size =3D msg->payload.inflight.mmap_size;
> > +     mmap_offset =3D msg->payload.inflight.mmap_offset;
> > +     num_queues =3D msg->payload.inflight.num_queues;
> > +     queue_size =3D msg->payload.inflight.queue_size;
> > +     pervq_inflight_size =3D get_pervq_shm_size(queue_size);
> > +
> > +     RTE_LOG(INFO, VHOST_CONFIG,
> > +             "set_inflight_fd mmap_size: %lu\n", mmap_size);
> > +     RTE_LOG(INFO, VHOST_CONFIG,
> > +             "set_inflight_fd mmap_offset: %lu\n", mmap_offset);
> > +     RTE_LOG(INFO, VHOST_CONFIG,
> > +             "set_inflight_fd num_queues: %u\n", num_queues);
> > +     RTE_LOG(INFO, VHOST_CONFIG,
> > +             "set_inflight_fd queue_size: %u\n", queue_size);
> > +     RTE_LOG(INFO, VHOST_CONFIG,
> > +             "set_inflight_fd fd: %d\n", fd);
> > +     RTE_LOG(INFO, VHOST_CONFIG,
> > +             "set_inflight_fd pervq_inflight_size: %d\n",
> > +             pervq_inflight_size);
> > +
> > +     if (dev->inflight_info.addr)
> > +             munmap(dev->inflight_info.addr, dev->inflight_info.size);
> > +
> > +     rc =3D mmap(0, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED,
> > +                     fd, mmap_offset);
>
> Why call it rc? Maybe addr is a better name?

In some scenarios, shared memory is reallocated or resized by qemu, so
again mmap is needed.

>
> > +     if (rc =3D=3D MAP_FAILED) {
> > +             RTE_LOG(ERR, VHOST_CONFIG, "failed to mmap share memory.\=
n");
> > +             return -1;
>
> Should always return RTE_VHOST_MSG_RESULT_* in
> message handler.
>
> > +     }
> > +
> > +     if (dev->inflight_info.fd)
> > +             close(dev->inflight_info.fd);
> > +
> > +     dev->inflight_info.fd =3D fd;
> > +     dev->inflight_info.addr =3D rc;
> > +     dev->inflight_info.size =3D mmap_size;
> > +
> > +     for (i =3D 0; i < num_queues; i++) {
> > +             vq =3D dev->virtqueue[i];
> > +             vq->inflight =3D (VhostInflightInfo *)rc;
> > +             rc =3D (void *)((char *)rc + pervq_inflight_size);
> > +     }
> > +
> > +     return RTE_VHOST_MSG_RESULT_OK;
> > +}
> > +
> >  static int
> >  vhost_user_set_vring_call(struct virtio_net **pdev, struct VhostUserMs=
g *msg,
> >                       int main_fd __rte_unused)
> > @@ -1202,6 +1402,29 @@ static int vhost_user_set_vring_err(struct virti=
o_net **pdev __rte_unused,
> >  }
> >
> >  static int
> > +vhost_check_queue_inflights(struct vhost_virtqueue *vq)
> > +{
> > +     uint16_t i =3D 0;
> > +
> > +     if ((!vq->inflight))
> > +             return RTE_VHOST_MSG_RESULT_ERR;
>
> Should check whether the feature is negotiated first.
>
> > +
> > +     if (!vq->inflight->version) {
> > +             vq->inflight->version =3D INFLIGHT_VERSION;
> > +             return RTE_VHOST_MSG_RESULT_OK;
> > +     }
> > +
> > +     for (i =3D 0; i < vq->size; i++) {
> > +             if (vq->inflight->desc[i].inflight =3D=3D 1) {
> > +                     vq->inflight_flag =3D 1;
> > +                     break;
> > +             }
> > +     }
> > +
> > +     return RTE_VHOST_MSG_RESULT_OK;
> > +}
> > +
> > +static int
> >  vhost_user_set_vring_kick(struct virtio_net **pdev, struct VhostUserMs=
g *msg,
> >                       int main_fd __rte_unused)
> >  {
> > @@ -1242,6 +1465,12 @@ vhost_user_set_vring_kick(struct virtio_net **pd=
ev, struct VhostUserMsg *msg,
> >               close(vq->kickfd);
> >       vq->kickfd =3D file.fd;
> >
> > +     if (vhost_check_queue_inflights(vq)) {
> > +             RTE_LOG(ERR, VHOST_CONFIG,
> > +                     "Failed to inflights for vq: %d\n", file.index);
> > +             return RTE_VHOST_MSG_RESULT_ERR;
> > +     }
> > +
> >       return RTE_VHOST_MSG_RESULT_OK;
> >  }
> >
> > @@ -1762,6 +1991,8 @@ static vhost_message_handler_t vhost_message_hand=
lers[VHOST_USER_MAX] =3D {
> >       [VHOST_USER_POSTCOPY_ADVISE] =3D vhost_user_set_postcopy_advise,
> >       [VHOST_USER_POSTCOPY_LISTEN] =3D vhost_user_set_postcopy_listen,
> >       [VHOST_USER_POSTCOPY_END] =3D vhost_user_postcopy_end,
> > +     [VHOST_USER_GET_INFLIGHT_FD] =3D vhost_user_get_inflight_fd,
> > +     [VHOST_USER_SET_INFLIGHT_FD] =3D vhost_user_set_inflight_fd,
> >  };
> >
> >
> > diff --git a/lib/librte_vhost/vhost_user.h b/lib/librte_vhost/vhost_use=
r.h
> > index 2a650fe4b..35f969b1b 100644
> > --- a/lib/librte_vhost/vhost_user.h
> > +++ b/lib/librte_vhost/vhost_user.h
> > @@ -23,7 +23,8 @@
> >                                        (1ULL << VHOST_USER_PROTOCOL_F_C=
RYPTO_SESSION) | \
> >                                        (1ULL << VHOST_USER_PROTOCOL_F_S=
LAVE_SEND_FD) | \
> >                                        (1ULL << VHOST_USER_PROTOCOL_F_H=
OST_NOTIFIER) | \
> > -                                      (1ULL << VHOST_USER_PROTOCOL_F_P=
AGEFAULT))
> > +                                     (1ULL << VHOST_USER_PROTOCOL_F_PA=
GEFAULT) | \
> > +                                     (1ULL << VHOST_USER_PROTOCOL_F_IN=
FLIGHT_SHMFD))
>
> It will advertise this feature for builtin net and crypto
> backends. It's probably not what you intended.
>

Indeed, this feature is mainly used for spdk-like backends. You mean
this function is disabled by default=EF=BC=9F

> >
> >  typedef enum VhostUserRequest {
> >       VHOST_USER_NONE =3D 0,
> > @@ -54,7 +55,9 @@ typedef enum VhostUserRequest {
> >       VHOST_USER_POSTCOPY_ADVISE =3D 28,
> >       VHOST_USER_POSTCOPY_LISTEN =3D 29,
> >       VHOST_USER_POSTCOPY_END =3D 30,
> > -     VHOST_USER_MAX =3D 31
> > +     VHOST_USER_GET_INFLIGHT_FD =3D 31,
> > +     VHOST_USER_SET_INFLIGHT_FD =3D 32,
> > +     VHOST_USER_MAX =3D 33
> >  } VhostUserRequest;
> >
> >  typedef enum VhostUserSlaveRequest {
> > @@ -112,6 +115,13 @@ typedef struct VhostUserVringArea {
> >       uint64_t offset;
> >  } VhostUserVringArea;
> >
> > +typedef struct VhostUserInflight {
> > +     uint64_t mmap_size;
> > +     uint64_t mmap_offset;
> > +     uint16_t num_queues;
> > +     uint16_t queue_size;
> > +} VhostUserInflight;
> > +
> >  typedef struct VhostUserMsg {
> >       union {
> >               uint32_t master; /* a VhostUserRequest value */
> > @@ -131,6 +141,7 @@ typedef struct VhostUserMsg {
> >               struct vhost_vring_addr addr;
> >               VhostUserMemory memory;
> >               VhostUserLog    log;
> > +             VhostUserInflight inflight;
> >               struct vhost_iotlb_msg iotlb;
> >               VhostUserCryptoSessionParam crypto_session;
> >               VhostUserVringArea area;
> > @@ -148,6 +159,7 @@ typedef struct VhostUserMsg {
> >  /* vhost_user.c */
> >  int vhost_user_msg_handler(int vid, int fd);
> >  int vhost_user_iotlb_miss(struct virtio_net *dev, uint64_t iova, uint8=
_t perm);
> > +void *inflight_mem_alloc(const char *name, size_t size, int *fd);
> >
> >  /* socket.c */
> >  int read_fd_message(int sockfd, char *buf, int buflen, int *fds, int m=
ax_fds,
> > --
> > 2.11.0
> >

Thank you for your comments. They will be revised in the next version.