From mboxrd@z Thu Jan 1 00:00:00 1970
From: Maxime Coquelin
To: Dariusz Stojaczyk, dev@dpdk.org, Tiwei Bie, Tetsuya Mukawa,
 Thomas Monjalon
Cc: yliu@fridaylinux.org, Stefan Hajnoczi, James Harris
References: <1525958573-184361-1-git-send-email-dariuszx.stojaczyk@intel.com>
 <1526648465-62579-1-git-send-email-dariuszx.stojaczyk@intel.com>
Message-ID: <27ce772e-9f01-dff9-1f82-b99924efa950@redhat.com>
In-Reply-To: <1526648465-62579-1-git-send-email-dariuszx.stojaczyk@intel.com>
Date: Fri, 18 May 2018 15:50:45 +0200
Subject: Re: [dpdk-dev] [RFC v2] vhost: new rte_vhost2 API proposal
List-Id: DPDK patches and discussions
Hi Dariusz,

On 05/18/2018 03:01 PM, Dariusz Stojaczyk wrote:
> rte_vhost is not vhost-user spec compliant. Some Vhost drivers have
> already been confirmed not to work with rte_vhost. virtio-user-scsi-pci
> in QEMU 2.12 doesn't fully initialize its management queues at the
> SeaBIOS stage. This is perfectly fine from the Vhost-user spec
> perspective, but doesn't meet rte_vhost expectations. rte_vhost waits
> for all queues to be fully initialized before it allows the entire
> device to be processed. Fixing rte_vhost directly would require quite
> a big amount of changes, which would completely break backwards
> compatibility.
>
> This rte_vhost2 library is intended to smooth out the transition.
> It exposes a low-level API for implementing new Vhost-user slaves.
> The existing rte_vhost is about to be refactored to use the rte_vhost2
> library underneath, and demanding backends could then use rte_vhost2
> directly.

I like the idea, and the proposed way to smooth the transition.

I will certainly have other comments later, but please find below the
ones I have for the moment.

> We're designing a fresh library here, so this opens up a great
> number of possibilities and improvements we can make to the public
> API that will pay off significantly for the sake of future
> specification/library extensions.
>
> rte_vhost2 abstracts away most Vhost-user/virtio-vhost-user specifics
> and allows developers to implement Vhost devices with ease.
> It calls user-provided callbacks once a proper device initialization
> state has been reached. That is - memory mappings have changed,
> virtqueues are ready to be processed, features have changed at
> runtime, etc.
>
> Compared to rte_vhost, this lib additionally allows the following:
>  * ability to start/stop particular queues
>  * most callbacks are now asynchronous - this greatly simplifies
>    event handling for asynchronous applications and doesn't
>    make anything harder for synchronous ones.
>  * this is a low-level API. It doesn't have any vhost-net, nvme
>    or crypto references. These backend-specific libraries will
>    later be refactored to use *this* generic library underneath.
>    This implies that the library doesn't do any virtqueue processing;
>    it only delivers vring addresses to the user, so the user can
>    process virtqueues himself.
>  * abstracting away Unix domain sockets (vhost-user) and PCI
>    (virtio-vhost-user).
>  * the API imposes how public functions can be called and how
>    internal data can change, so there's only minimal work required
>    to ensure thread-safety. Possibly no mutexes are required at all.
>  * full Virtio 1.0/vhost-user specification compliance.
>
> This patch only introduces the API. Some additional functions
> for vDPA might still be required, but everything present here
> so far shouldn't need changing.
>
> Signed-off-by: Dariusz Stojaczyk
> ---
>  lib/librte_vhost2/rte_vhost2.h | 331 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 331 insertions(+)
>  create mode 100644 lib/librte_vhost2/rte_vhost2.h
>
> diff --git a/lib/librte_vhost2/rte_vhost2.h b/lib/librte_vhost2/rte_vhost2.h
> new file mode 100644
> index 0000000..385b093
> --- /dev/null
> +++ b/lib/librte_vhost2/rte_vhost2.h
> @@ -0,0 +1,331 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright (c) Intel Corporation.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> +     * Redistributions in binary form must reproduce the above copyright
> +       notice, this list of conditions and the following disclaimer in
> +       the documentation and/or other materials provided with the
> +       distribution.
> +     * Neither the name of Intel Corporation nor the names of its
> +       contributors may be used to endorse or promote products derived
> +       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef _RTE_VHOST2_H_
> +#define _RTE_VHOST2_H_
> +
> +/**
> + * @file
> + * This library abstracts away most Vhost-user/virtio-vhost-user
> + * specifics and allows developers to implement Vhost devices with
> + * ease. It calls user-provided callbacks once a proper device
> + * initialization state has been reached. That is - memory mappings
> + * have changed, virtqueues are ready to be processed, features have
> + * changed at runtime, etc.
> + */
> +
> +#include
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/* Not C++-aware. */
> +#include
> +
> +#define RTE_VHOST2_MEMORY_MAX_NREGIONS	8
> +
> +#define RTE_VHOST2_CLIENT	(1ULL << 0)
> +#define RTE_VHOST2_NO_RECONNECT	(1ULL << 1)
> +
> +enum rte_vhost2_set_config_type {
> +	/** Config changed on request by the vhost driver. */
> +	VHOST_SET_CONFIG_TYPE_MASTER = 0,
> +	/** Config is being restored after a successful migration. */
> +	VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> +};
> +
> +#define RTE_VHOST2_MSG_VERSION_MASK	(0x3)
> +#define RTE_VHOST2_MSG_REPLY_MASK	(0x1 << 2)
> +#define RTE_VHOST2_MSG_NEED_REPLY	(0x1 << 3)
> +
> +struct rte_vhost2_msg {
> +	uint32_t id;
> +	uint32_t flags;
> +	uint32_t size; /**< The following payload size. */
> +	void *payload;
> +	int fds[RTE_VHOST2_MEMORY_MAX_NREGIONS];
> +};
> +
> +/** Single memory region. Both physically and virtually contiguous. */
> +struct rte_vhost2_mem_region {
> +	uint64_t guest_phys_addr;
> +	uint64_t guest_user_addr;
> +	uint64_t host_user_addr;
> +	uint64_t size;
> +	void *mmap_addr;
> +	uint64_t mmap_size;
> +	int fd;
> +};
> +
> +struct rte_vhost2_memory {
> +	uint32_t nregions;
> +	struct rte_vhost2_mem_region regions[];
> +};
> +
> +/**
> + * Vhost device created and managed by rte_vhost2. Accessible via
> + * \c rte_vhost2_tgt_ops callbacks. This is only a part of the real
> + * vhost device data. This struct is published just for inline vdev
> + * functions to access their data directly.
> + */
> +struct rte_vhost2_dev {
> +	struct rte_vhost2_memory *mem;
> +	bool iommu; /**< \c VIRTIO_F_IOMMU_PLATFORM has been negotiated */
> +};
> +
> +/**
> + * Virtqueue created and managed by rte_vhost2. Accessible via
> + * \c rte_vhost2_tgt_ops callbacks.
> + */
> +struct rte_vhost2_vq {
> +	struct vring_desc *desc;
> +	struct vring_avail *avail;
> +	struct vring_used *used;
> +	/** Available only if \c VHOST_F_LOG_ALL has been negotiated. */
> +	void *log;
> +	uint16_t size;
> +};
> +
> +/**
> + * Device/queue related callbacks, all optional. Provided callback
> + * parameters are guaranteed not to be NULL unless explicitly specified.
> + */
> +struct rte_vhost2_tgt_ops {
> +	/**
> +	 * New driver connected. If this is completed with a non-zero
> +	 * status, rte_vhost2 will terminate the connection.
> +	 */
> +	void (*device_create)(struct rte_vhost2_dev *vdev);
> +	/**
> +	 * Device is ready to operate. vdev data is now initialized. This
> +	 * callback may be called multiple times as e.g. memory mappings
> +	 * can change dynamically. All queues are guaranteed to be stopped
> +	 * by now.
> +	 */
> +	void (*device_init)(struct rte_vhost2_dev *vdev);
> +	/**
> +	 * Features have changed at runtime. This is called at least once
> +	 * during initialization, before `device_init`. Queues might still
> +	 * be running at this point.
> +	 */
> +	void (*device_features_changed)(struct rte_vhost2_dev *vdev,
> +			uint64_t features);
> +	/**
> +	 * Start processing vq. The `vq` is guaranteed not to be modified
> +	 * before `queue_stop` is called.
> +	 */
> +	void (*queue_start)(struct rte_vhost2_dev *vdev,
> +			struct rte_vhost2_vq *vq);
> +	/**
> +	 * Stop processing vq. It shouldn't be accessed after this callback
> +	 * completes (via \c rte_vhost2_tgt_cb_complete). This can be called
> +	 * prior to shutdown or before actions that require changing vhost
> +	 * device/vq state.
> +	 */
> +	void (*queue_stop)(struct rte_vhost2_dev *vdev,
> +			struct rte_vhost2_vq *vq);
> +	/** Device disconnected. All queues are guaranteed to be stopped. */
> +	void (*device_destroy)(struct rte_vhost2_dev *vdev);
> +	/**
> +	 * Custom vhost-user message handler. This is called for
> +	 * backend-specific messages (net/crypto/scsi) that weren't
> +	 * recognized by the generic message parser. `msg` is available
> +	 * until \c rte_vhost2_tgt_cb_complete is called.
> +	 */
> +	void (*custom_msg)(struct rte_vhost2_dev *vdev,
> +			struct rte_vhost2_msg *msg);
> +
> +	/** Interrupt handler, synchronous. */
> +	void (*queue_kick)(struct rte_vhost2_dev *vdev,
> +			struct rte_vhost2_vq *vq);
> +	/**
> +	 * Full device config read, synchronous. Return 0 if `len` bytes
> +	 * of `config` have been successfully set, -1 otherwise.
> +	 */
> +	int (*get_config)(struct rte_vhost2_dev *vdev, uint8_t *config,
> +			uint32_t len);
> +	/**
> +	 * Device config changed by the driver, synchronous. `type`
> +	 * indicates the reason for the change.
> +	 */
> +	int (*set_config)(struct rte_vhost2_dev *vdev, uint8_t *config,
> +			uint32_t offset, uint32_t len,
> +			enum rte_vhost2_set_config_type type);
> +
> +	void *reserved[8]; /**< Reserved for future extension */
> +};
> +
> +/**
> + * Registers a new vhost target accepting remote connections. Multiple
> + * transports are available. It is possible to create a Vhost-user
> + * Unix domain socket polling local connections, or to connect to a
> + * physical Virtio device and install an interrupt handler.
> + *
> + * This function is thread-safe.
> + *
> + * \param trtype type of the transport used, e.g. "vhost-user",
> + * "PCI-vhost-user", "PCI-vDPA".
> + * \param trid identifier of the device. For PCI this would be the BDF
> + * address, for vhost-user the socket name.
> + * \param trflags additional options for the specified transport
> + * \param trctx additional data for the specified transport. Can be NULL.
> + * \param tgt_ops callbacks to be called upon reaching specific
> + * initialization states.
> + * \param features supported Virtio features. To be negotiated with the
> + * driver's ones. rte_vhost2 will append a couple of generic feature bits
> + * which are required by the Virtio spec. TODO list these features here
> + * \return 0 on success, negative errno otherwise
> + */
> +int rte_vhost2_tgt_register(const char *trtype, const char *trid,
> +		uint64_t trflags, void *trctx,
> +		struct rte_vhost2_tgt_ops *tgt_ops,
> +		uint64_t features);

Couldn't the register API also pass the vdev? That way, the backend
could keep the rte_vhost2_dev in its device struct.

> +
> +/**
> + * Finish an async device tgt ops callback. Unless a tgt op has been
> + * documented as 'synchronous', this function must be called at the end
> + * of the op handler. It can be called either before or after the op
> + * handler returns. rte_vhost2 won't call the next tgt ops callback
> + * until the previous one has been finished.
> + *
> + * This function is thread-safe.
> + *
> + * \param vdev vhost device
> + * \param rc 0 on success, negative errno otherwise. If a non-zero value
> + * is given, the current callback will be perceived as failed. A queue
> + * that failed to start won't need to be stopped.
> + */
> +void rte_vhost2_tgt_cb_complete(struct rte_vhost2_dev *vdev, int rc);
> +
> +/**
> + * Unregisters a vhost target asynchronously. All active queues will be
> + * stopped and all devices destroyed.
> + *
> + * This function is thread-safe.
> + *
> + * \param cb_fn callback to be called on finish. It'll be called from
> + * the same thread that calls \c rte_vhost2_tgt_ops.
> + * \param cb_arg argument for \c cb_fn
> + * \return 0 on success, negative errno otherwise. `cb_fn` won't be
> + * called if a non-zero value is returned.
> + */
> +int rte_vhost2_tgt_unregister(const char *trtype, const char *trid,
> +		void (*cb_fn)(void *arg), void *cb_arg);
> +
> +/**
> + * Bypass VIRTIO_F_IOMMU_PLATFORM and translate gpa directly.
> + *
> + * This function is thread-safe.
> + *
> + * \param mem vhost device memory
> + * \param gpa guest physical address
> + * \param len length of the memory to translate (in bytes). If the
> + * requested memory chunk crosses a memory region boundary, *len will
> + * be set to the remaining, maximum length of virtually contiguous
> + * memory. In such a case the user will be required to call another
> + * gpa_to_vva(gpa + *len).
> + * \return vhost virtual address or NULL if the requested `gpa` is not
> + * mapped.
> + */
> +static inline void *
> +rte_vhost2_gpa_to_vva(struct rte_vhost2_memory *mem, uint64_t gpa,
> +		uint64_t *len)
> +{
> +	struct rte_vhost2_mem_region *r;
> +	uint32_t i;
> +
> +	for (i = 0; i < mem->nregions; i++) {
> +		r = &mem->regions[i];
> +		if (gpa >= r->guest_phys_addr &&
> +				gpa < r->guest_phys_addr + r->size) {
> +
> +			if (unlikely(*len > r->guest_phys_addr + r->size - gpa)) {
> +				*len = r->guest_phys_addr + r->size - gpa;
> +			}
> +
> +			return (void *)(uintptr_t)(gpa - r->guest_phys_addr +
> +					r->host_user_addr);
> +		}
> +	}
> +	*len = 0;
> +
> +	return NULL;
> +}

Maybe we could take the opportunity to only have rte_vhost2_iova_to_vva.

> +
> +/**
> + * Translate an I/O virtual address to the vhost address space.
> + *
> + * If VIRTIO_F_IOMMU_PLATFORM has been negotiated, this might
> + * potentially send a TLB miss and wait for the TLB update response.
> + * If VIRTIO_F_IOMMU_PLATFORM has not been negotiated, `iova` is a
> + * physical address and `perm` is ignored.
> + *
> + * This function is thread-safe.
> + *
> + * \param vdev vhost device
> + * \param vq virtqueue. Must be started.
> + * \param iova I/O virtual address
> + * \param len length of the memory to translate (in bytes). If the
> + * requested memory chunk crosses a memory region boundary, *len will
> + * be set to the remaining, maximum length of virtually contiguous
> + * memory. In such a case the user will be required to call another
> + * iova_to_vva(iova + *len).
> + * \param perm VHOST_ACCESS_RO, VHOST_ACCESS_WO or VHOST_ACCESS_RW
> + * \return vhost virtual address or NULL if the requested `iova` is not
> + * mapped or `perm` doesn't match.
> + */
> +static inline void *
> +rte_vhost2_iova_to_vva(struct rte_vhost2_dev *vdev,
> +		struct rte_vhost2_vq *vq, uint64_t iova, uint32_t *len,
> +		uint8_t perm)
> +{
> +	void *__vhost_iova_to_vva(struct virtio_net *dev,
> +			struct vhost_virtqueue *vq, uint64_t iova,
> +			uint64_t size, uint8_t perm);
> +
> +	if (!vdev->iommu) {
> +		return rte_vhost2_gpa_to_vva(vdev->mem, iova, len);
> +	}
> +
> +	return __vhost_iova_to_vva(vdev, vq, iova, len, perm);
> +}
> +
> +/**
> + * Notify the driver about a vq change. This is an eventfd_write for
> + * vhost-user or an MMIO write for PCI devices.
> + *
> + * \param vdev vhost device
> + * \param vq virtqueue. Must be started.
> + */
> +void rte_vhost2_dev_call(struct rte_vhost2_dev *vdev,
> +		struct rte_vhost2_vq *vq);
> +
> +/**
> + * Notify the driver about a device config change. This will result in
> + * \c rte_vhost2_tgt_ops->get_config being called.
> + *
> + * \param vdev vhost device
> + */
> +void rte_vhost2_dev_cfg_call(struct rte_vhost2_dev *vdev);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_VHOST2_H_ */
>