From mboxrd@z Thu Jan  1 00:00:00 1970
From: Dariusz Stojaczyk
To: dev@dpdk.org, Maxime Coquelin, Tiwei Bie, Tetsuya Mukawa,
	Thomas Monjalon
Cc: yliu@fridaylinux.org, Stefan Hajnoczi, James Harris,
	Dariusz Stojaczyk
Date: Fri, 18 May 2018 15:01:05 +0200
Message-Id: <1526648465-62579-1-git-send-email-dariuszx.stojaczyk@intel.com>
X-Mailer: git-send-email 2.7.4
In-Reply-To: <1525958573-184361-1-git-send-email-dariuszx.stojaczyk@intel.com>
References: <1525958573-184361-1-git-send-email-dariuszx.stojaczyk@intel.com>
Subject: [dpdk-dev] [RFC v2] vhost: new rte_vhost API proposal
List-Id: DPDK patches and discussions

rte_vhost is not vhost-user spec compliant. Some vhost drivers have
already been confirmed not to work with it. virtio-user-scsi-pci in
QEMU 2.12 doesn't fully initialize its management queues at the SeaBIOS
stage. This is perfectly fine from the vhost-user spec perspective, but
doesn't meet rte_vhost expectations: rte_vhost waits for all queues to
be fully initialized before it allows the entire device to be
processed.

Fixing rte_vhost directly would require a significant amount of
changes, which would completely break backwards compatibility.

The rte_vhost2 library introduced here is intended to smooth out the
transition. It exposes a low-level API for implementing new vhost-user
slaves. The existing rte_vhost is to be refactored to use the rte_vhost2
library underneath, and demanding backends could then use rte_vhost2
directly.

Since we're designing a fresh library here, this opens up a great
number of possibilities and public API improvements that will pay off
significantly for future specification/library extensions.

rte_vhost2 abstracts away most vhost-user/virtio-vhost-user specifics
and allows developers to implement vhost devices with ease. It calls
user-provided callbacks once a proper device initialization state has
been reached, that is: memory mappings have changed, virtqueues are
ready to be processed, features have changed at runtime, etc.

Compared to rte_vhost, this library additionally offers the following:
* the ability to start/stop particular queues
* most callbacks are now asynchronous - this greatly simplifies event
  handling for asynchronous applications and doesn't make anything
  harder for synchronous ones
* a low-level API with no vhost-net, NVMe, or crypto references; those
  backend-specific libraries will later be refactored to use *this*
  generic library underneath. This implies that the library doesn't do
  any virtqueue processing; it only delivers vring addresses to the
  user, so the user can process virtqueues on their own
* abstracting away Unix domain sockets (vhost-user) and PCI
  (virtio-vhost-user)
* an API that imposes how public functions can be called and how
  internal data can change, so only minimal work is required to ensure
  thread safety - possibly no mutexes are required at all
* full Virtio 1.0/vhost-user specification compliance
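
To give an idea of the intended usage, below is a rough sketch of a
minimal backend built on top of this API. It is purely illustrative and
not part of this patch - the my_* names, the socket path and the
skipped ring handling are made up:

#include <stdint.h>
#include <rte_vhost2.h>

static void
my_device_init(struct rte_vhost2_dev *vdev)
{
	/* allocate backend-specific device state here, then complete
	 * the asynchronous callback */
	rte_vhost2_tgt_cb_complete(vdev, 0);
}

static void
my_queue_start(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq)
{
	/* from now on the backend may access vq->desc/avail/used */
	rte_vhost2_tgt_cb_complete(vdev, 0);
}

static void
my_queue_stop(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq)
{
	/* quiesce all processing of `vq` before completing */
	rte_vhost2_tgt_cb_complete(vdev, 0);
}

static void
my_queue_kick(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq)
{
	uint64_t len = vq->desc[0].len;
	void *buf;

	/* translate the first descriptor; a real backend would walk
	 * the avail ring and handle descriptor chains */
	buf = rte_vhost2_gpa_to_vva(vdev->mem, vq->desc[0].addr, &len);
	if (buf == NULL)
		return;

	/* ... process the request, fill the used ring ... */
	rte_vhost2_dev_call(vdev, vq);
}

static struct rte_vhost2_tgt_ops my_tgt_ops = {
	.device_init = my_device_init,
	.queue_start = my_queue_start,
	.queue_stop = my_queue_stop,
	.queue_kick = my_queue_kick,
};

int
my_backend_start(const char *path, uint64_t virtio_features)
{
	/* expose a vhost-user socket at `path`; rte_vhost2 will call
	 * my_tgt_ops callbacks as drivers connect and negotiate */
	return rte_vhost2_tgt_register("vhost-user", path, 0, NULL,
			&my_tgt_ops, virtio_features);
}

rte_vhost2 would then accept driver connections on the socket, call
device_create/device_init, deliver vring addresses through queue_start,
and forward kicks to queue_kick; every asynchronous callback is
acknowledged with rte_vhost2_tgt_cb_complete().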

This patch only introduces the API. Some additional functions for vDPA
might still be required, but everything present here so far shouldn't
need changing.

Signed-off-by: Dariusz Stojaczyk
---
 lib/librte_vhost2/rte_vhost2.h | 331 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 331 insertions(+)
 create mode 100644 lib/librte_vhost2/rte_vhost2.h

diff --git a/lib/librte_vhost2/rte_vhost2.h b/lib/librte_vhost2/rte_vhost2.h
new file mode 100644
index 0000000..385b093
--- /dev/null
+++ b/lib/librte_vhost2/rte_vhost2.h
@@ -0,0 +1,331 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright (c) Intel Corporation.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *   * Redistributions of source code must retain the above copyright
+ *     notice, this list of conditions and the following disclaimer.
+ *   * Redistributions in binary form must reproduce the above copyright
+ *     notice, this list of conditions and the following disclaimer in
+ *     the documentation and/or other materials provided with the
+ *     distribution.
+ *   * Neither the name of Intel Corporation nor the names of its
+ *     contributors may be used to endorse or promote products derived
+ *     from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_VHOST2_H_
+#define _RTE_VHOST2_H_
+
+/**
+ * @file
+ * This library abstracts away most Vhost-user/virtio-vhost-user specifics
+ * and allows developers to implement Vhost devices with ease.
+ * It calls user-provided callbacks once proper device initialization
+ * state has been reached. That is - memory mappings have changed,
+ * virtqueues are ready to be processed, features have changed at runtime, etc.
+ */
+
+#include
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/* Not C++-aware. */
+#include
+
+#define RTE_VHOST2_MEMORY_MAX_NREGIONS 8
+
+#define RTE_VHOST2_CLIENT (1ULL << 0)
+#define RTE_VHOST2_NO_RECONNECT (1ULL << 1)
+
+enum rte_vhost2_set_config_type {
+	/** Config changed on request by the vhost driver. */
+	VHOST_SET_CONFIG_TYPE_MASTER = 0,
+	/** Config is being restored after a successful migration.
+	 */
+	VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
+};
+
+#define RTE_VHOST2_MSG_VERSION_MASK (0x3)
+#define RTE_VHOST2_MSG_REPLY_MASK (0x1 << 2)
+#define RTE_VHOST2_MSG_NEED_REPLY (0x1 << 3)
+
+struct rte_vhost2_msg {
+	uint32_t id;
+	uint32_t flags;
+	uint32_t size; /**< The size of the following payload. */
+	void *payload;
+	int fds[RTE_VHOST2_MEMORY_MAX_NREGIONS];
+};
+
+/** Single memory region. Both physically and virtually contiguous. */
+struct rte_vhost2_mem_region {
+	uint64_t guest_phys_addr;
+	uint64_t guest_user_addr;
+	uint64_t host_user_addr;
+	uint64_t size;
+	void *mmap_addr;
+	uint64_t mmap_size;
+	int fd;
+};
+
+struct rte_vhost2_memory {
+	uint32_t nregions;
+	struct rte_vhost2_mem_region regions[];
+};
+
+/**
+ * Vhost device created and managed by rte_vhost2. Accessible via
+ * \c rte_vhost2_tgt_ops callbacks. This is only a part of the real
+ * vhost device data. This struct is published just for inline vdev
+ * functions to access their data directly.
+ */
+struct rte_vhost2_dev {
+	struct rte_vhost2_memory *mem;
+	bool iommu; /**< \c VIRTIO_F_IOMMU_PLATFORM has been negotiated */
+};
+
+/**
+ * Virtqueue created and managed by rte_vhost2. Accessible via
+ * \c rte_vhost2_tgt_ops callbacks.
+ */
+struct rte_vhost2_vq {
+	struct vring_desc *desc;
+	struct vring_avail *avail;
+	struct vring_used *used;
+	/* available only if \c VHOST_F_LOG_ALL has been negotiated */
+	void *log;
+	uint16_t size;
+};
+
+/**
+ * Device/queue related callbacks, all optional. Provided callback
+ * parameters are guaranteed not to be NULL unless explicitly specified.
+ */
+struct rte_vhost2_tgt_ops {
+	/**
+	 * New driver connected. If this is completed with a non-zero status,
+	 * rte_vhost2 will terminate the connection.
+	 */
+	void (*device_create)(struct rte_vhost2_dev *vdev);
+	/**
+	 * Device is ready to operate. vdev data is now initialized. This callback
+	 * may be called multiple times as e.g. memory mappings can change
+	 * dynamically. All queues are guaranteed to be stopped by now.
+	 */
+	void (*device_init)(struct rte_vhost2_dev *vdev);
+	/**
+	 * Features have changed at runtime. This is called at least once during
+	 * initialization before `device_init`. Queues might still be running
+	 * at this point.
+	 */
+	void (*device_features_changed)(struct rte_vhost2_dev *vdev,
+			uint64_t features);
+	/**
+	 * Start processing vq. The `vq` is guaranteed not to be modified before
+	 * `queue_stop` is called.
+	 */
+	void (*queue_start)(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq);
+	/**
+	 * Stop processing vq. It shouldn't be accessed after this callback
+	 * completes (via \c rte_vhost2_tgt_cb_complete). This can be called
+	 * prior to shutdown or before actions that require changing vhost
+	 * device/vq state.
+	 */
+	void (*queue_stop)(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq);
+	/** Device disconnected. All queues are guaranteed to be stopped by now. */
+	void (*device_destroy)(struct rte_vhost2_dev *vdev);
+	/**
+	 * Custom vhost-user message handler. This is called for
+	 * backend-specific messages (net/crypto/scsi) that weren't recognized
+	 * by the generic message parser. `msg` is available until
+	 * \c rte_vhost2_tgt_cb_complete is called.
+	 */
+	void (*custom_msg)(struct rte_vhost2_dev *vdev, struct rte_vhost2_msg *msg);
+
+	/** Interrupt handler, synchronous. */
+	void (*queue_kick)(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq);
+	/**
+	 * Full device config read, synchronous. Return 0 if `len` bytes of
+	 * `config` have been successfully set, -1 otherwise.
+	 */
+	int (*get_config)(struct rte_vhost2_dev *vdev, uint8_t *config,
+			uint32_t len);
+	/**
+	 * Device config changed by the driver, synchronous. `type` indicates
+	 * the reason for the change.
+	 */
+	int (*set_config)(struct rte_vhost2_dev *vdev, uint8_t *config,
+			uint32_t offset, uint32_t len,
+			enum rte_vhost2_set_config_type type);
+
+	void *reserved[8]; /**< Reserved for future extension */
+};
+
+/**
+ * Registers a new vhost target accepting remote connections. Multiple
+ * transports are available. It is possible to create a Vhost-user
+ * Unix domain socket polling for local connections or to connect to a
+ * physical Virtio device and install an interrupt handler.
+ *
+ * This function is thread-safe.
+ *
+ * \param trtype type of the transport used, e.g. "vhost-user",
+ * "PCI-vhost-user", "PCI-vDPA".
+ * \param trid identifier of the device. For PCI this would be the BDF address,
+ * for vhost-user the socket name.
+ * \param trflags additional options for the specified transport
+ * \param trctx additional data for the specified transport. Can be NULL.
+ * \param tgt_ops callbacks to be called upon reaching specific initialization
+ * states.
+ * \param features supported Virtio features, to be negotiated with the
+ * driver's ones. rte_vhost2 will append a couple of generic feature bits
+ * which are required by the Virtio spec. TODO list these features here
+ * \return 0 on success, negative errno otherwise
+ */
+int rte_vhost2_tgt_register(const char *trtype, const char *trid,
+		uint64_t trflags, void *trctx,
+		struct rte_vhost2_tgt_ops *tgt_ops,
+		uint64_t features);
+
+/**
+ * Finish an async device tgt ops callback. Unless a tgt op has been documented
+ * as 'synchronous', this function must be called at the end of the op handler.
+ * It can be called either before or after the op handler returns. rte_vhost2
+ * won't call any further tgt ops callbacks until the current one has been
+ * finished.
+ *
+ * This function is thread-safe.
+ *
+ * \param vdev vhost device
+ * \param rc 0 on success, negative errno otherwise. If a non-zero value is
+ * given, the current callback will be perceived as failed. A queue that failed
+ * to start won't need to be stopped.
+ */
+void rte_vhost2_tgt_cb_complete(struct rte_vhost2_dev *vdev, int rc);
+
+/**
+ * Unregisters a vhost target asynchronously. All active queues will be
+ * stopped and all devices destroyed.
+ *
+ * This function is thread-safe.
+ *
+ * \param cb_fn callback to be called on finish. It'll be called from the same
+ * thread that calls \c rte_vhost2_tgt_ops.
+ * \param cb_arg argument for \c cb_fn
+ * \return 0 on success, negative errno otherwise. `cb_fn` won't be called
+ * if a non-zero value is returned.
+ */
+int rte_vhost2_tgt_unregister(const char *trtype, const char *trid,
+		void (*cb_fn)(void *arg), void *cb_arg);
+
+/**
+ * Bypass VIRTIO_F_IOMMU_PLATFORM and translate gpa directly.
+ *
+ * This function is thread-safe.
+ *
+ * \param mem vhost device memory
+ * \param gpa guest physical address
+ * \param len length of the memory to translate (in bytes). If the requested
+ * memory chunk crosses a memory region boundary, *len will be set to
+ * the remaining, maximum length of virtually contiguous memory. In such
+ * case the user will be required to call another gpa_to_vva(gpa + *len).
+ * \return vhost virtual address or NULL if the requested `gpa` is not mapped.
+ */
+static inline void *
+rte_vhost2_gpa_to_vva(struct rte_vhost2_memory *mem, uint64_t gpa, uint64_t *len)
+{
+	struct rte_vhost2_mem_region *r;
+	uint32_t i;
+
+	for (i = 0; i < mem->nregions; i++) {
+		r = &mem->regions[i];
+		if (gpa >= r->guest_phys_addr &&
+		    gpa < r->guest_phys_addr + r->size) {
+
+			if (unlikely(*len > r->guest_phys_addr + r->size - gpa)) {
+				*len = r->guest_phys_addr + r->size - gpa;
+			}
+
+			return (void *)(uintptr_t)(gpa - r->guest_phys_addr +
+					r->host_user_addr);
+		}
+	}
+	*len = 0;
+
+	return NULL;
+}
+
+/**
+ * Translate an I/O virtual address to the vhost address space.
+ *
+ * If VIRTIO_F_IOMMU_PLATFORM has been negotiated, this might send a TLB miss
+ * request and wait for the TLB update response.
+ * If VIRTIO_F_IOMMU_PLATFORM has not been negotiated, `iova` is a physical
+ * address and `perm` is ignored.
+ *
+ * This function is thread-safe.
+ *
+ * \param vdev vhost device
+ * \param vq virtqueue. Must be started.
+ * \param iova I/O virtual address
+ * \param len length of the memory to translate (in bytes). If the requested
+ * memory chunk crosses a memory region boundary, *len will be set to
+ * the remaining, maximum length of virtually contiguous memory. In such
+ * case the user will be required to call another iova_to_vva(iova + *len).
+ * \param perm VHOST_ACCESS_RO, VHOST_ACCESS_WO or VHOST_ACCESS_RW
+ * \return vhost virtual address or NULL if the requested `iova` is not mapped
+ * or the `perm` doesn't match.
+ */
+static inline void *
+rte_vhost2_iova_to_vva(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq,
+		uint64_t iova, uint32_t *len, uint8_t perm)
+{
+	void *__vhost_iova_to_vva(struct virtio_net *dev,
+			struct vhost_virtqueue *vq, uint64_t iova,
+			uint64_t size, uint8_t perm);
+
+	if (!vdev->iommu) {
+		return rte_vhost2_gpa_to_vva(vdev->mem, iova, len);
+	}
+
+	return __vhost_iova_to_vva(vdev, vq, iova, len, perm);
+}
+
+/**
+ * Notify the driver about a vq change. This is an eventfd_write for vhost-user
+ * or an MMIO write for PCI devices.
+ *
+ * \param vdev vhost device
+ * \param vq virtqueue. Must be started.
+ */
+void rte_vhost2_dev_call(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq);
+
+/**
+ * Notify the driver about a device config change. This will result in \c
+ * rte_vhost2_tgt_ops->get_config being called.
+ *
+ * \param vdev vhost device
+ */
+void rte_vhost2_dev_cfg_call(struct rte_vhost2_dev *vdev);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_VHOST2_H_ */
-- 
2.7.4