From: Jin Yu
To: Thomas Monjalon, John McNamara, Marko Kovacevic, Maxime Coquelin, Tiwei Bie, Zhihong Wang
Cc: dev@dpdk.org, Jin Yu
Date: Fri, 1 Nov 2019 18:42:46 +0800
Message-Id: <20191101104246.64030-1-jin.yu@intel.com>
In-Reply-To: <20191028193756.27757-1-jin.yu@intel.com>
References: <20191028193756.27757-1-jin.yu@intel.com>
Subject: [dpdk-dev] [PATCH v5] vhost: add vhost-user-blk example which support inflight

A vhost-user-blk example that supports the inflight feature. It uses
the new APIs introduced in the first patch, so it shows how these APIs
work to support the inflight feature.

Signed-off-by: Jin Yu
---
v1 - add the case.
v2 - add the rte_vhost prefix.
v3 - add packed ring support v4 - fix build, MAINTAINERS and add guides v5 - fix ci/intel-compilation errors --- MAINTAINERS | 2 + doc/guides/sample_app_ug/index.rst | 1 + doc/guides/sample_app_ug/vhost_blk.rst | 63 ++ examples/meson.build | 2 +- examples/vhost_blk/Makefile | 68 ++ examples/vhost_blk/blk.c | 125 +++ examples/vhost_blk/blk_spec.h | 95 ++ examples/vhost_blk/meson.build | 21 + examples/vhost_blk/vhost_blk.c | 1094 ++++++++++++++++++++++++ examples/vhost_blk/vhost_blk.h | 127 +++ examples/vhost_blk/vhost_blk_compat.c | 173 ++++ 11 files changed, 1770 insertions(+), 1 deletion(-) create mode 100644 doc/guides/sample_app_ug/vhost_blk.rst create mode 100644 examples/vhost_blk/Makefile create mode 100644 examples/vhost_blk/blk.c create mode 100644 examples/vhost_blk/blk_spec.h create mode 100644 examples/vhost_blk/meson.build create mode 100644 examples/vhost_blk/vhost_blk.c create mode 100644 examples/vhost_blk/vhost_blk.h create mode 100644 examples/vhost_blk/vhost_blk_compat.c diff --git a/MAINTAINERS b/MAINTAINERS index 717c31801..c22a8312e 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -839,6 +839,8 @@ F: lib/librte_vhost/ F: doc/guides/prog_guide/vhost_lib.rst F: examples/vhost/ F: doc/guides/sample_app_ug/vhost.rst +F: examples/vhost_blk/ +F: doc/guides/sample_app_ug/vhost_blk.rst F: examples/vhost_crypto/ F: examples/vdpa/ F: doc/guides/sample_app_ug/vdpa.rst diff --git a/doc/guides/sample_app_ug/index.rst b/doc/guides/sample_app_ug/index.rst index a3737c118..613f483f3 100644 --- a/doc/guides/sample_app_ug/index.rst +++ b/doc/guides/sample_app_ug/index.rst @@ -40,6 +40,7 @@ Sample Applications User Guides packet_ordering vmdq_dcb_forwarding vhost + vhost_blk vhost_crypto vdpa ip_pipeline diff --git a/doc/guides/sample_app_ug/vhost_blk.rst b/doc/guides/sample_app_ug/vhost_blk.rst new file mode 100644 index 000000000..39096e2e4 --- /dev/null +++ b/doc/guides/sample_app_ug/vhost_blk.rst @@ -0,0 +1,63 @@ +.. 
SPDX-License-Identifier: BSD-3-Clause + Copyright(c) 2010-2017 Intel Corporation. + +Vhost_blk Sample Application +============================= + +The vhost_blk sample application implements a simple block device, +which is used as the backend of the Qemu vhost-user-blk device. Users can extend +the existing example to use other types of block device (e.g. AIO) besides the +memory-based block device. Similar to the vhost-user-net device, the sample +application uses a domain socket to communicate with Qemu, and the virtio +ring (split or packed format) is processed by the vhost_blk sample application. + +The sample application reuses a lot of code from the SPDK (Storage Performance +Development Kit, https://github.com/spdk/spdk) vhost-user-blk target. +For uses of the DPDK vhost library in the storage area, users can take SPDK as +a reference as well. + +Testing steps +------------- + +This section shows how to start a VM with the block device as +a fast data path for critical applications. + +Compiling the Application +------------------------- + +To compile the sample application see :doc:`compiling`. + +The application is located in the ``examples`` sub-directory. + +You will also need to build DPDK both on the host and inside the guest. + +Start the vhost_blk example +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: console + + ./vhost_blk -m 1024 + +.. _vhost_blk_app_run_vm: + +Start the VM +~~~~~~~~~~~~ + +.. code-block:: console + + qemu-system-x86_64 -machine accel=kvm \ + -m $mem -object memory-backend-file,id=mem,size=$mem,\ + mem-path=/dev/hugepages,share=on -numa node,memdev=mem \ + -drive file=os.img,if=none,id=disk \ + -device ide-hd,drive=disk,bootindex=0 \ + -chardev socket,id=char0,reconnect=1,path=/tmp/vhost.socket \ + -device vhost-user-blk-pci,ring_packed=1,chardev=char0,num-queues=1 \ + ... + +.. note:: + You must check whether your Qemu supports "vhost-user-blk"; + Qemu v4.0 or newer is required. 
+ reconnect=1 enables live recovery: Qemu can reconnect to vhost_blk + after the vhost_blk example is restarted. + ring_packed=1 means the device supports the packed ring, which requires a guest kernel + version >= 5.0. diff --git a/examples/meson.build b/examples/meson.build index 98ae50a49..10a6bd7ef 100644 --- a/examples/meson.build +++ b/examples/meson.build @@ -42,7 +42,7 @@ all_examples = [ 'skeleton', 'tep_termination', 'timer', 'vdpa', 'vhost', 'vhost_crypto', - 'vm_power_manager', + 'vhost_blk', 'vm_power_manager', 'vm_power_manager/guest_cli', 'vmdq', 'vmdq_dcb', ] diff --git a/examples/vhost_blk/Makefile b/examples/vhost_blk/Makefile new file mode 100644 index 000000000..a10a90071 --- /dev/null +++ b/examples/vhost_blk/Makefile @@ -0,0 +1,68 @@ +# SPDX-License-Identifier: BSD-3-Clause +# Copyright(c) 2010-2014 Intel Corporation + +# binary name +APP = vhost-blk + +# all sources are stored in SRCS-y +SRCS-y := blk.c vhost_blk.c vhost_blk_compat.c + +# Build using pkg-config variables if possible +$(shell pkg-config --exists libdpdk) +ifeq ($(.SHELLSTATUS),0) + +all: shared +.PHONY: shared static +shared: build/$(APP)-shared + ln -sf $(APP)-shared build/$(APP) +static: build/$(APP)-static + ln -sf $(APP)-static build/$(APP) + +LDFLAGS += -pthread + +PC_FILE := $(shell pkg-config --path libdpdk) +CFLAGS += -O3 $(shell pkg-config --cflags libdpdk) +LDFLAGS_SHARED = $(shell pkg-config --libs libdpdk) +LDFLAGS_STATIC = -Wl,-Bstatic $(shell pkg-config --static --libs libdpdk) + +CFLAGS += -DALLOW_EXPERIMENTAL_API + +build/$(APP)-shared: $(SRCS-y) Makefile $(PC_FILE) | build + $(CC) $(CFLAGS) $(SRCS-y) -o $@ $(LDFLAGS) $(LDFLAGS_SHARED) + +build/$(APP)-static: $(SRCS-y) Makefile $(PC_FILE) | build + $(CC) $(CFLAGS) $(SRCS-y) -o $@ $(LDFLAGS) $(LDFLAGS_STATIC) + +build: + @mkdir -p $@ + +.PHONY: clean +clean: + rm -f build/$(APP) build/$(APP)-static build/$(APP)-shared + test -d build && rmdir -p build || true + +else # Build using legacy build system + +ifeq ($(RTE_SDK),) 
+$(error "Please define RTE_SDK environment variable") +endif + +# Default target, detect a build directory, by looking for a path with a .config +RTE_TARGET ?= $(notdir $(abspath $(dir $(firstword $(wildcard $(RTE_SDK)/*/.config))))) + +include $(RTE_SDK)/mk/rte.vars.mk + +ifneq ($(CONFIG_RTE_EXEC_ENV_LINUX),y) +$(info This application can only operate in a linux environment, \ +please change the definition of the RTE_TARGET environment variable) +all: +else + +CFLAGS += -DALLOW_EXPERIMENTAL_API +CFLAGS += -O2 -D_FILE_OFFSET_BITS=64 +CFLAGS += $(WERROR_FLAGS) + +include $(RTE_SDK)/mk/rte.extapp.mk + +endif +endif diff --git a/examples/vhost_blk/blk.c b/examples/vhost_blk/blk.c new file mode 100644 index 000000000..424ed3015 --- /dev/null +++ b/examples/vhost_blk/blk.c @@ -0,0 +1,125 @@ +// SPDX-License-Identifier: BSD-3-Clause +// Copyright(c) 2010-2019 Intel Corporation + +/** + * This work is largely based on the "vhost-user-blk" implementation by + * SPDK(https://github.com/spdk/spdk). 
+ */ + +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include + +#include "vhost_blk.h" +#include "blk_spec.h" + +static void +vhost_strcpy_pad(void *dst, const char *src, size_t size, int pad) +{ + size_t len; + + len = strlen(src); + if (len < size) { + memcpy(dst, src, len); + memset((char *)dst + len, pad, size - len); + } else { + memcpy(dst, src, size); + } +} + +static int +vhost_bdev_blk_readwrite(struct vhost_block_dev *bdev, + struct vhost_blk_task *task, + uint64_t lba_512, __rte_unused uint32_t xfer_len) +{ + uint32_t i; + uint64_t offset; + uint32_t nbytes = 0; + + offset = lba_512 * 512; + + for (i = 0; i < task->iovs_cnt; i++) { + if (task->dxfer_dir == BLK_DIR_TO_DEV) + memcpy(bdev->data + offset, task->iovs[i].iov_base, + task->iovs[i].iov_len); + else + memcpy(task->iovs[i].iov_base, bdev->data + offset, + task->iovs[i].iov_len); + offset += task->iovs[i].iov_len; + nbytes += task->iovs[i].iov_len; + } + + return nbytes; +} + +int +vhost_bdev_process_blk_commands(struct vhost_block_dev *bdev, + struct vhost_blk_task *task) +{ + int used_len; + + if (unlikely(task->data_len > (bdev->blockcnt * bdev->blocklen))) { + fprintf(stderr, "read or write beyond capacity\n"); + return VIRTIO_BLK_S_UNSUPP; + } + + switch (task->req->type) { + case VIRTIO_BLK_T_IN: + if (unlikely(task->data_len == 0 || + (task->data_len & (512 - 1)) != 0)) { + fprintf(stderr, + "%s - passed IO buffer is not multiple of 512b" + "(req_idx = %"PRIu16").\n", + task->req->type ? "WRITE" : "READ", + task->head_idx); + return VIRTIO_BLK_S_UNSUPP; + } + + task->dxfer_dir = BLK_DIR_FROM_DEV; + vhost_bdev_blk_readwrite(bdev, task, + task->req->sector, task->data_len); + break; + case VIRTIO_BLK_T_OUT: + if (unlikely(task->data_len == 0 || + (task->data_len & (512 - 1)) != 0)) { + fprintf(stderr, + "%s - passed IO buffer is not multiple of 512b" + "(req_idx = %"PRIu16").\n", + task->req->type ? 
"WRITE" : "READ", + task->head_idx); + return VIRTIO_BLK_S_UNSUPP; + } + + if (task->readtype) { + fprintf(stderr, "type isn't right\n"); + return VIRTIO_BLK_S_IOERR; + } + task->dxfer_dir = BLK_DIR_TO_DEV; + vhost_bdev_blk_readwrite(bdev, task, + task->req->sector, task->data_len); + break; + case VIRTIO_BLK_T_GET_ID: + if (!task->iovs_cnt || !task->data_len) + return VIRTIO_BLK_S_UNSUPP; + used_len = min(VIRTIO_BLK_ID_BYTES, task->data_len); + vhost_strcpy_pad(task->iovs[0].iov_base, + bdev->product_name, used_len, ' '); + break; + default: + fprintf(stderr, "unsupported cmd\n"); + return VIRTIO_BLK_S_UNSUPP; + } + + return VIRTIO_BLK_S_OK; +} diff --git a/examples/vhost_blk/blk_spec.h b/examples/vhost_blk/blk_spec.h new file mode 100644 index 000000000..5875e2f86 --- /dev/null +++ b/examples/vhost_blk/blk_spec.h @@ -0,0 +1,95 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2019 Intel Corporation + */ + +#ifndef _BLK_SPEC_H +#define _BLK_SPEC_H + +#include + +#ifndef VHOST_USER_MEMORY_MAX_NREGIONS +#define VHOST_USER_MEMORY_MAX_NREGIONS 8 +#endif + +#ifndef VHOST_USER_MAX_CONFIG_SIZE +#define VHOST_USER_MAX_CONFIG_SIZE 256 +#endif + +#ifndef VHOST_USER_PROTOCOL_F_CONFIG +#define VHOST_USER_PROTOCOL_F_CONFIG 9 +#endif + +#ifndef VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD +#define VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD 12 +#endif + +#define VIRTIO_BLK_ID_BYTES 20 /* ID string length */ + +#define VIRTIO_BLK_T_IN 0 +#define VIRTIO_BLK_T_OUT 1 +#define VIRTIO_BLK_T_FLUSH 4 +#define VIRTIO_BLK_T_GET_ID 8 +#define VIRTIO_BLK_T_DISCARD 11 +#define VIRTIO_BLK_T_WRITE_ZEROES 13 + +#define VIRTIO_BLK_S_OK 0 +#define VIRTIO_BLK_S_IOERR 1 +#define VIRTIO_BLK_S_UNSUPP 2 + +enum vhost_user_request { + VHOST_USER_NONE = 0, + VHOST_USER_GET_FEATURES = 1, + VHOST_USER_SET_FEATURES = 2, + VHOST_USER_SET_OWNER = 3, + VHOST_USER_RESET_OWNER = 4, + VHOST_USER_SET_MEM_TABLE = 5, + VHOST_USER_SET_LOG_BASE = 6, + VHOST_USER_SET_LOG_FD = 7, + VHOST_USER_SET_VRING_NUM = 8, + 
VHOST_USER_SET_VRING_ADDR = 9, + VHOST_USER_SET_VRING_BASE = 10, + VHOST_USER_GET_VRING_BASE = 11, + VHOST_USER_SET_VRING_KICK = 12, + VHOST_USER_SET_VRING_CALL = 13, + VHOST_USER_SET_VRING_ERR = 14, + VHOST_USER_GET_PROTOCOL_FEATURES = 15, + VHOST_USER_SET_PROTOCOL_FEATURES = 16, + VHOST_USER_GET_QUEUE_NUM = 17, + VHOST_USER_SET_VRING_ENABLE = 18, + VHOST_USER_MAX +}; + +/** Get/set config msg payload */ +struct vhost_user_config { + uint32_t offset; + uint32_t size; + uint32_t flags; + uint8_t region[VHOST_USER_MAX_CONFIG_SIZE]; +}; + +/** Fixed-size vhost_memory struct */ +struct vhost_memory_padded { + uint32_t nregions; + uint32_t padding; + struct vhost_memory_region regions[VHOST_USER_MEMORY_MAX_NREGIONS]; +}; + +struct vhost_user_msg { + enum vhost_user_request request; + +#define VHOST_USER_VERSION_MASK 0x3 +#define VHOST_USER_REPLY_MASK (0x1 << 2) + uint32_t flags; + uint32_t size; /**< the following payload size */ + union { +#define VHOST_USER_VRING_IDX_MASK 0xff +#define VHOST_USER_VRING_NOFD_MASK (0x1 << 8) + uint64_t u64; + struct vhost_vring_state state; + struct vhost_vring_addr addr; + struct vhost_memory_padded memory; + struct vhost_user_config cfg; + } payload; +} __attribute((packed)); + +#endif diff --git a/examples/vhost_blk/meson.build b/examples/vhost_blk/meson.build new file mode 100644 index 000000000..857367192 --- /dev/null +++ b/examples/vhost_blk/meson.build @@ -0,0 +1,21 @@ +# SPDX-License-Identifier: BSD-3-Clause +# Copyright(c) 2017 Intel Corporation + +# meson file, for building this example as part of a main DPDK build. 
+# +# To build this example as a standalone application with an already-installed +# DPDK instance, use 'make' + +if not is_linux + build = false +endif + +if not cc.has_header('linux/virtio_blk.h') + build = false +endif + +deps += 'vhost' +allow_experimental_apis = true +sources = files( + 'blk.c', 'vhost_blk.c', 'vhost_blk_compat.c' +) diff --git a/examples/vhost_blk/vhost_blk.c b/examples/vhost_blk/vhost_blk.c new file mode 100644 index 000000000..24807c82f --- /dev/null +++ b/examples/vhost_blk/vhost_blk.c @@ -0,0 +1,1094 @@ +// SPDX-License-Identifier: BSD-3-Clause +// Copyright(c) 2010-2017 Intel Corporation + +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include + +#include "vhost_blk.h" +#include "blk_spec.h" + +#define VIRTQ_DESC_F_NEXT 1 +#define VIRTQ_DESC_F_AVAIL (1 << 7) +#define VIRTQ_DESC_F_USED (1 << 15) + +#define MAX_TASK 12 + +#define VHOST_BLK_FEATURES ((1ULL << VIRTIO_F_RING_PACKED) | \ + (1ULL << VIRTIO_F_VERSION_1) |\ + (1ULL << VIRTIO_F_NOTIFY_ON_EMPTY) | \ + (1ULL << VHOST_USER_F_PROTOCOL_FEATURES)) + +/* Path to folder where character device will be created. Can be set by user. 
*/ +static char dev_pathname[PATH_MAX] = ""; +static sem_t exit_sem; +static int g_should_stop = -1; + +struct vhost_blk_ctrlr * +vhost_blk_ctrlr_find(const char *ctrlr_name) +{ + if (ctrlr_name == NULL) + return NULL; + + /* currently we only support 1 socket file fd */ + return g_vhost_ctrlr; +} + +static uint64_t gpa_to_vva(int vid, uint64_t gpa, uint64_t *len) +{ + char path[PATH_MAX]; + struct vhost_blk_ctrlr *ctrlr; + int ret = 0; + + ret = rte_vhost_get_ifname(vid, path, PATH_MAX); + if (ret) { + fprintf(stderr, "Cannot get socket name\n"); + assert(ret == 0); + } + + ctrlr = vhost_blk_ctrlr_find(path); + if (!ctrlr) { + fprintf(stderr, "Controller is not ready\n"); + assert(ctrlr != NULL); + } + + assert(ctrlr->mem != NULL); + + return rte_vhost_va_from_guest_pa(ctrlr->mem, gpa, len); +} + +static struct vring_packed_desc * +descriptor_get_next_packed(struct rte_vhost_vring *vq, + uint16_t *idx) +{ + if (vq->desc_packed[*idx % vq->size].flags & VIRTQ_DESC_F_NEXT) { + *idx += 1; + return &vq->desc_packed[*idx % vq->size]; + } + + return NULL; +} + +static bool +descriptor_has_next_packed(struct vring_packed_desc *cur_desc) +{ + return !!(cur_desc->flags & VRING_DESC_F_NEXT); +} + +static bool +descriptor_is_wr_packed(struct vring_packed_desc *cur_desc) +{ + return !!(cur_desc->flags & VRING_DESC_F_WRITE); +} + +static struct rte_vhost_inflight_desc_packed * +inflight_desc_get_next(struct rte_vhost_inflight_info_packed *inflight_packed, + struct rte_vhost_inflight_desc_packed *cur_desc) +{ + if (!!(cur_desc->flags & VIRTQ_DESC_F_NEXT)) + return &inflight_packed->desc[cur_desc->next]; + + return NULL; +} + +static bool +inflight_desc_has_next(struct rte_vhost_inflight_desc_packed *cur_desc) +{ + return !!(cur_desc->flags & VRING_DESC_F_NEXT); +} + +static bool +inflight_desc_is_wr(struct rte_vhost_inflight_desc_packed *cur_desc) +{ + return !!(cur_desc->flags & VRING_DESC_F_WRITE); +} + +static void +inflight_process_payload_chain_packed(struct 
inflight_blk_task *task) +{ + void *data; + uint64_t chunck_len; + struct vhost_blk_task *blk_task; + struct rte_vhost_inflight_desc_packed *desc; + + blk_task = &task->blk_task; + blk_task->iovs_cnt = 0; + + do { + desc = task->inflight_desc; + chunck_len = desc->len; + data = (void *)(uintptr_t)gpa_to_vva(blk_task->bdev->vid, + desc->addr, + &chunck_len); + if (!data || chunck_len != desc->len) { + fprintf(stderr, "failed to translate desc address.\n"); + return; + } + + blk_task->iovs[blk_task->iovs_cnt].iov_base = data; + blk_task->iovs[blk_task->iovs_cnt].iov_len = desc->len; + blk_task->data_len += desc->len; + blk_task->iovs_cnt++; + task->inflight_desc = inflight_desc_get_next( + task->inflight_packed, desc); + } while (inflight_desc_has_next(task->inflight_desc)); + + chunck_len = task->inflight_desc->len; + blk_task->status = (void *)(uintptr_t)gpa_to_vva( + blk_task->bdev->vid, task->inflight_desc->addr, &chunck_len); + if (!blk_task->status || chunck_len != task->inflight_desc->len) + fprintf(stderr, "failed to translate desc address.\n"); +} + +static void +inflight_submit_completion_packed(struct inflight_blk_task *task, + uint32_t q_idx, uint16_t *used_id, + bool *used_wrap_counter) +{ + struct vhost_blk_ctrlr *ctrlr; + struct rte_vhost_vring *vq; + struct vring_packed_desc *desc; + int ret; + + ctrlr = vhost_blk_ctrlr_find(dev_pathname); + vq = task->blk_task.vq; + + ret = rte_vhost_set_last_inflight_io_packed(ctrlr->bdev->vid, q_idx, + task->blk_task.head_idx); + if (ret != 0) + fprintf(stderr, "failed to set last inflight io\n"); + + desc = &vq->desc_packed[*used_id]; + desc->id = task->blk_task.buffer_id; + rte_smp_mb(); + if (*used_wrap_counter) + desc->flags |= VIRTQ_DESC_F_AVAIL | VIRTQ_DESC_F_USED; + else + desc->flags &= ~(VIRTQ_DESC_F_AVAIL | VIRTQ_DESC_F_USED); + rte_smp_mb(); + + *used_id += task->blk_task.iovs_cnt + 2; + if (*used_id >= vq->size) { + *used_id -= vq->size; + *used_wrap_counter = !(*used_wrap_counter); + } + + ret = 
rte_vhost_clr_inflight_desc_packed(ctrlr->bdev->vid, q_idx, + task->blk_task.head_idx); + if (ret != 0) + fprintf(stderr, "failed to clear inflight io\n"); + + /* Send an interrupt back to the guest VM so that it knows + * a completion is ready to be processed. + */ + rte_vhost_vring_call(task->blk_task.bdev->vid, q_idx); +} + +static void +submit_completion_packed(struct vhost_blk_task *task, uint32_t q_idx, + uint16_t *used_id, bool *used_wrap_counter) +{ + struct vhost_blk_ctrlr *ctrlr; + struct rte_vhost_vring *vq; + struct vring_packed_desc *desc; + int ret; + + ctrlr = vhost_blk_ctrlr_find(dev_pathname); + vq = task->vq; + + ret = rte_vhost_set_last_inflight_io_packed(ctrlr->bdev->vid, q_idx, + task->inflight_idx); + if (ret != 0) + fprintf(stderr, "failed to set last inflight io\n"); + + desc = &vq->desc_packed[*used_id]; + desc->id = task->buffer_id; + rte_smp_mb(); + if (*used_wrap_counter) + desc->flags |= VIRTQ_DESC_F_AVAIL | VIRTQ_DESC_F_USED; + else + desc->flags &= ~(VIRTQ_DESC_F_AVAIL | VIRTQ_DESC_F_USED); + rte_smp_mb(); + + *used_id += task->iovs_cnt + 2; + if (*used_id >= vq->size) { + *used_id -= vq->size; + *used_wrap_counter = !(*used_wrap_counter); + } + + ret = rte_vhost_clr_inflight_desc_packed(ctrlr->bdev->vid, q_idx, + task->inflight_idx); + if (ret != 0) + fprintf(stderr, "failed to clear inflight io\n"); + + /* Send an interrupt back to the guest VM so that it knows + * a completion is ready to be processed. 
+ */ + rte_vhost_vring_call(task->bdev->vid, q_idx); +} + +static void +vhost_process_payload_chain_packed(struct vhost_blk_task *task, + uint16_t *idx) +{ + void *data; + uint64_t chunck_len; + + task->iovs_cnt = 0; + + do { + chunck_len = task->desc_packed->len; + data = (void *)(uintptr_t)gpa_to_vva(task->bdev->vid, + task->desc_packed->addr, + &chunck_len); + if (!data || chunck_len != task->desc_packed->len) { + fprintf(stderr, "failed to translate desc address.\n"); + return; + } + + task->iovs[task->iovs_cnt].iov_base = data; + task->iovs[task->iovs_cnt].iov_len = task->desc_packed->len; + task->data_len += task->desc_packed->len; + task->iovs_cnt++; + task->desc_packed = descriptor_get_next_packed(task->vq, idx); + } while (descriptor_has_next_packed(task->desc_packed)); + + task->last_idx = *idx % task->vq->size; + chunck_len = task->desc_packed->len; + task->status = (void *)(uintptr_t)gpa_to_vva(task->bdev->vid, + task->desc_packed->addr, + &chunck_len); + if (!task->status || chunck_len != task->desc_packed->len) + fprintf(stderr, "failed to translate desc address.\n"); +} + + +static int +descriptor_is_available(struct rte_vhost_vring *vring, uint16_t idx, + bool avail_wrap_counter) +{ + uint16_t flags = vring->desc_packed[idx].flags; + + return ((!!(flags & VIRTQ_DESC_F_AVAIL) == avail_wrap_counter) && + (!!(flags & VIRTQ_DESC_F_USED) != avail_wrap_counter)); +} + +static void +process_requestq_packed(struct vhost_blk_ctrlr *ctrlr, uint32_t q_idx) +{ + bool avail_wrap_counter, used_wrap_counter; + uint16_t avail_idx, used_idx; + int ret; + uint64_t chunck_len; + struct vhost_blk_queue *blk_vq; + struct rte_vhost_vring *vq; + struct vhost_blk_task *task; + + blk_vq = &ctrlr->bdev->queues[q_idx]; + vq = &blk_vq->vq; + + avail_idx = blk_vq->last_avail_idx; + avail_wrap_counter = blk_vq->avail_wrap_counter; + used_idx = blk_vq->last_used_idx; + used_wrap_counter = blk_vq->used_wrap_counter; + + task = rte_zmalloc(NULL, sizeof(*task), 0); + assert(task != 
NULL); + task->vq = vq; + task->bdev = ctrlr->bdev; + + while (descriptor_is_available(vq, avail_idx, avail_wrap_counter)) { + task->head_idx = avail_idx; + task->desc_packed = &task->vq->desc_packed[task->head_idx]; + task->iovs_cnt = 0; + task->data_len = 0; + task->req = NULL; + task->status = NULL; + + /* does not support indirect descriptors */ + assert((task->desc_packed->flags & VRING_DESC_F_INDIRECT) == 0); + + chunck_len = task->desc_packed->len; + task->req = (void *)(uintptr_t)gpa_to_vva(task->bdev->vid, + task->desc_packed->addr, &chunck_len); + if (!task->req || chunck_len != task->desc_packed->len) { + fprintf(stderr, "failed to translate desc address.\n"); + rte_free(task); + return; + } + + task->desc_packed = descriptor_get_next_packed(task->vq, + &avail_idx); + assert(task->desc_packed != NULL); + if (!descriptor_has_next_packed(task->desc_packed)) { + task->dxfer_dir = BLK_DIR_NONE; + task->last_idx = avail_idx % vq->size; + chunck_len = task->desc_packed->len; + task->status = (void *)(uintptr_t) + gpa_to_vva(task->bdev->vid, + task->desc_packed->addr, + &chunck_len); + if (!task->status || + chunck_len != task->desc_packed->len) { + fprintf(stderr, + "failed to translate desc address.\n"); + rte_free(task); + return; + } + } else { + task->readtype = descriptor_is_wr_packed( + task->desc_packed); + vhost_process_payload_chain_packed(task, &avail_idx); + } + task->buffer_id = vq->desc_packed[task->last_idx].id; + rte_vhost_set_inflight_desc_packed(ctrlr->bdev->vid, q_idx, + task->head_idx, + task->last_idx, + &task->inflight_idx); + + if (++avail_idx >= vq->size) { + avail_idx -= vq->size; + avail_wrap_counter = !avail_wrap_counter; + } + blk_vq->last_avail_idx = avail_idx; + blk_vq->avail_wrap_counter = avail_wrap_counter; + + ret = vhost_bdev_process_blk_commands(ctrlr->bdev, task); + if (ret) { + /* invalid response */ + *task->status = VIRTIO_BLK_S_IOERR; + } else { + /* successfully */ + *task->status = VIRTIO_BLK_S_OK; + } + + 
submit_completion_packed(task, q_idx, &used_idx, + &used_wrap_counter); + blk_vq->last_used_idx = used_idx; + blk_vq->used_wrap_counter = used_wrap_counter; + } + + rte_free(task); +} + +static void +submit_inflight_vq_packed(struct vhost_blk_ctrlr *ctrlr, + uint16_t q_idx) +{ + bool used_wrap_counter; + int req_idx, ret; + uint16_t used_idx; + uint64_t chunck_len; + struct vhost_blk_queue *blk_vq; + struct rte_vhost_ring_inflight *inflight_vq; + struct rte_vhost_resubmit_info *resubmit_info; + struct rte_vhost_vring *vq; + struct inflight_blk_task *task; + struct vhost_blk_task *blk_task; + struct rte_vhost_inflight_info_packed *inflight_info; + + blk_vq = &ctrlr->bdev->queues[q_idx]; + vq = &blk_vq->vq; + inflight_vq = &blk_vq->inflight_vq; + resubmit_info = inflight_vq->resubmit_inflight; + inflight_info = inflight_vq->inflight_packed; + used_idx = blk_vq->last_used_idx; + used_wrap_counter = blk_vq->used_wrap_counter; + + task = rte_malloc(NULL, sizeof(*task), 0); + if (!task) { + fprintf(stderr, "failed to allocate memory\n"); + return; + } + blk_task = &task->blk_task; + blk_task->vq = vq; + blk_task->bdev = ctrlr->bdev; + task->inflight_packed = inflight_vq->inflight_packed; + + while (resubmit_info->resubmit_num-- > 0) { + req_idx = resubmit_info->resubmit_num; + blk_task->head_idx = + resubmit_info->resubmit_list[req_idx].index; + task->inflight_desc = + &inflight_info->desc[blk_task->head_idx]; + task->blk_task.iovs_cnt = 0; + task->blk_task.data_len = 0; + task->blk_task.req = NULL; + task->blk_task.status = NULL; + + /* update the avail idx too + * as it's initial value equals to used idx + */ + blk_vq->last_avail_idx += task->inflight_desc->num; + if (blk_vq->last_avail_idx >= vq->size) { + blk_vq->last_avail_idx -= vq->size; + blk_vq->avail_wrap_counter = + !blk_vq->avail_wrap_counter; + } + + /* does not support indirect descriptors */ + assert(task->inflight_desc != NULL); + assert((task->inflight_desc->flags & + VRING_DESC_F_INDIRECT) == 0); + + 
chunck_len = task->inflight_desc->len; + blk_task->req = (void *)(uintptr_t) + gpa_to_vva(blk_task->bdev->vid, + task->inflight_desc->addr, + &chunck_len); + if (!blk_task->req || + chunck_len != task->inflight_desc->len) { + fprintf(stderr, "failed to translate desc address.\n"); + rte_free(task); + return; + } + + task->inflight_desc = inflight_desc_get_next( + task->inflight_packed, task->inflight_desc); + assert(task->inflight_desc != NULL); + if (!inflight_desc_has_next(task->inflight_desc)) { + blk_task->dxfer_dir = BLK_DIR_NONE; + chunck_len = task->inflight_desc->len; + blk_task->status = (void *)(uintptr_t) + gpa_to_vva(blk_task->bdev->vid, + task->inflight_desc->addr, + &chunck_len); + if (!blk_task->status || + chunck_len != task->inflight_desc->len) { + fprintf(stderr, + "failed to translate desc address.\n"); + rte_free(task); + return; + } + } else { + blk_task->readtype = + inflight_desc_is_wr(task->inflight_desc); + inflight_process_payload_chain_packed(task); + } + + blk_task->buffer_id = task->inflight_desc->id; + + ret = vhost_bdev_process_blk_commands(ctrlr->bdev, blk_task); + if (ret) + /* invalid response */ + *blk_task->status = VIRTIO_BLK_S_IOERR; + else + /* successfully */ + *blk_task->status = VIRTIO_BLK_S_OK; + + inflight_submit_completion_packed(task, q_idx, &used_idx, + &used_wrap_counter); + + blk_vq->last_used_idx = used_idx; + blk_vq->used_wrap_counter = used_wrap_counter; + } + + rte_free(task); +} + +static struct vring_desc * +descriptor_get_next_split(struct vring_desc *vq_desc, + struct vring_desc *cur_desc) +{ + return &vq_desc[cur_desc->next]; +} + +static bool +descriptor_has_next_split(struct vring_desc *cur_desc) +{ + return !!(cur_desc->flags & VRING_DESC_F_NEXT); +} + +static bool +descriptor_is_wr_split(struct vring_desc *cur_desc) +{ + return !!(cur_desc->flags & VRING_DESC_F_WRITE); +} + +static void +vhost_process_payload_chain_split(struct vhost_blk_task *task) +{ + void *data; + uint64_t chunck_len; + + 
task->iovs_cnt = 0; + + do { + chunck_len = task->desc_split->len; + data = (void *)(uintptr_t)gpa_to_vva(task->bdev->vid, + task->desc_split->addr, + &chunck_len); + if (!data || chunck_len != task->desc_split->len) { + fprintf(stderr, "failed to translate desc address.\n"); + return; + } + + task->iovs[task->iovs_cnt].iov_base = data; + task->iovs[task->iovs_cnt].iov_len = task->desc_split->len; + task->data_len += task->desc_split->len; + task->iovs_cnt++; + task->desc_split = + descriptor_get_next_split(task->vq->desc, task->desc_split); + } while (descriptor_has_next_split(task->desc_split)); + + chunck_len = task->desc_split->len; + task->status = (void *)(uintptr_t)gpa_to_vva(task->bdev->vid, + task->desc_split->addr, + &chunck_len); + if (!task->status || chunck_len != task->desc_split->len) + fprintf(stderr, "failed to translate desc address.\n"); +} + +static void +submit_completion_split(struct vhost_blk_task *task, uint32_t vid, + uint32_t q_idx) +{ + struct rte_vhost_vring *vq; + struct vring_used *used; + + vq = task->vq; + used = vq->used; + + rte_vhost_set_last_inflight_io_split(vid, q_idx, task->req_idx); + + /* Fill out the next entry in the "used" ring. id = the + * index of the descriptor that contained the blk request. + * len = the total amount of data transferred for the blk + * request. We must report the correct len, for variable + * length blk CDBs, where we may return less data than + * allocated by the guest VM. + */ + used->ring[used->idx & (vq->size - 1)].id = task->req_idx; + used->ring[used->idx & (vq->size - 1)].len = task->data_len; + rte_smp_mb(); + used->idx++; + rte_smp_mb(); + + rte_vhost_clr_inflight_desc_split(vid, q_idx, used->idx, task->req_idx); + + /* Send an interrupt back to the guest VM so that it knows + * a completion is ready to be processed. 
+ */ + rte_vhost_vring_call(task->bdev->vid, q_idx); +} + +static void +submit_inflight_vq_split(struct vhost_blk_ctrlr *ctrlr, + uint32_t q_idx) +{ + struct vhost_blk_queue *blk_vq; + struct rte_vhost_ring_inflight *inflight_vq; + struct rte_vhost_resubmit_info *resubmit_inflight; + struct rte_vhost_resubmit_desc *resubmit_list; + struct vhost_blk_task *task; + int req_idx; + uint64_t chunck_len; + int ret; + + blk_vq = &ctrlr->bdev->queues[q_idx]; + inflight_vq = &blk_vq->inflight_vq; + resubmit_inflight = inflight_vq->resubmit_inflight; + resubmit_list = resubmit_inflight->resubmit_list; + + task = rte_zmalloc(NULL, sizeof(*task), 0); + assert(task != NULL); + + task->ctrlr = ctrlr; + task->bdev = ctrlr->bdev; + task->vq = &blk_vq->vq; + + while (resubmit_inflight->resubmit_num-- > 0) { + req_idx = resubmit_list[resubmit_inflight->resubmit_num].index; + task->req_idx = req_idx; + task->desc_split = &task->vq->desc[task->req_idx]; + task->iovs_cnt = 0; + task->data_len = 0; + task->req = NULL; + task->status = NULL; + + /* does not support indirect descriptors */ + assert(task->desc_split != NULL); + assert((task->desc_split->flags & VRING_DESC_F_INDIRECT) == 0); + + chunck_len = task->desc_split->len; + task->req = (void *)(uintptr_t)gpa_to_vva(task->bdev->vid, + task->desc_split->addr, &chunck_len); + if (!task->req || chunck_len != task->desc_split->len) { + fprintf(stderr, "failed to translate desc address.\n"); + rte_free(task); + return; + } + + task->desc_split = descriptor_get_next_split(task->vq->desc, + task->desc_split); + if (!descriptor_has_next_split(task->desc_split)) { + task->dxfer_dir = BLK_DIR_NONE; + chunck_len = task->desc_split->len; + task->status = (void *)(uintptr_t) + gpa_to_vva(task->bdev->vid, + task->desc_split->addr, + &chunck_len); + if (!task->status || + chunck_len != task->desc_split->len) { + fprintf(stderr, + "failed to translate desc address.\n"); + rte_free(task); + return; + } + } else { + task->readtype = + 
descriptor_is_wr_split(task->desc_split); + vhost_process_payload_chain_split(task); + } + + ret = vhost_bdev_process_blk_commands(ctrlr->bdev, task); + if (ret) { + /* invalid response */ + *task->status = VIRTIO_BLK_S_IOERR; + } else { + /* successfully */ + *task->status = VIRTIO_BLK_S_OK; + } + submit_completion_split(task, ctrlr->bdev->vid, q_idx); + } + + rte_free(task); +} + +static void +process_requestq_split(struct vhost_blk_ctrlr *ctrlr, uint32_t q_idx) +{ + int ret; + int req_idx; + uint16_t last_idx; + uint64_t chunck_len; + struct vhost_blk_queue *blk_vq; + struct rte_vhost_vring *vq; + struct vhost_blk_task *task; + + blk_vq = &ctrlr->bdev->queues[q_idx]; + vq = &blk_vq->vq; + + task = rte_zmalloc(NULL, sizeof(*task), 0); + assert(task != NULL); + task->ctrlr = ctrlr; + task->bdev = ctrlr->bdev; + task->vq = vq; + + while (vq->avail->idx != blk_vq->last_avail_idx) { + last_idx = blk_vq->last_avail_idx & (vq->size - 1); + req_idx = vq->avail->ring[last_idx]; + task->req_idx = req_idx; + task->desc_split = &task->vq->desc[task->req_idx]; + task->iovs_cnt = 0; + task->data_len = 0; + task->req = NULL; + task->status = NULL; + + rte_vhost_set_inflight_desc_split(ctrlr->bdev->vid, q_idx, + task->req_idx); + + /* does not support indirect descriptors */ + assert((task->desc_split->flags & VRING_DESC_F_INDIRECT) == 0); + + chunck_len = task->desc_split->len; + task->req = (void *)(uintptr_t)gpa_to_vva(task->bdev->vid, + task->desc_split->addr, &chunck_len); + if (!task->req || chunck_len != task->desc_split->len) { + fprintf(stderr, "failed to translate desc address.\n"); + rte_free(task); + return; + } + + task->desc_split = descriptor_get_next_split(task->vq->desc, + task->desc_split); + if (!descriptor_has_next_split(task->desc_split)) { + task->dxfer_dir = BLK_DIR_NONE; + chunck_len = task->desc_split->len; + task->status = (void *)(uintptr_t) + gpa_to_vva(task->bdev->vid, + task->desc_split->addr, + &chunck_len); + if (!task->status || + chunck_len != 
task->desc_split->len) { + fprintf(stderr, + "failed to translate desc address.\n"); + rte_free(task); + return; + } + } else { + task->readtype = + descriptor_is_wr_split(task->desc_split); + vhost_process_payload_chain_split(task); + } + blk_vq->last_avail_idx++; + + ret = vhost_bdev_process_blk_commands(ctrlr->bdev, task); + if (ret) { + /* invalid response */ + *task->status = VIRTIO_BLK_S_IOERR; + } else { + /* successfully */ + *task->status = VIRTIO_BLK_S_OK; + } + + submit_completion_split(task, ctrlr->bdev->vid, q_idx); + } + + rte_free(task); +} + +static void * +ctrlr_worker(void *arg) +{ + struct vhost_blk_ctrlr *ctrlr = (struct vhost_blk_ctrlr *)arg; + struct vhost_blk_queue *blk_vq; + struct rte_vhost_ring_inflight *inflight_vq; + cpu_set_t cpuset; + pthread_t thread; + int i; + + fprintf(stdout, "Ctrlr Worker Thread start\n"); + + if (ctrlr == NULL || ctrlr->bdev == NULL) { + fprintf(stderr, + "%s: Error, invalid argument passed to worker thread\n", + __func__); + exit(0); + } + + thread = pthread_self(); + CPU_ZERO(&cpuset); + CPU_SET(0, &cpuset); + pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset); + + for (i = 0; i < NUM_OF_BLK_QUEUES; i++) { + blk_vq = &ctrlr->bdev->queues[i]; + inflight_vq = &blk_vq->inflight_vq; + if (inflight_vq->resubmit_inflight != NULL && + inflight_vq->resubmit_inflight->resubmit_num != 0) { + if (ctrlr->packed_ring) + submit_inflight_vq_packed(ctrlr, i); + else + submit_inflight_vq_split(ctrlr, i); + } + } + + while (!g_should_stop && ctrlr->bdev != NULL) { + for (i = 0; i < NUM_OF_BLK_QUEUES; i++) { + if (ctrlr->packed_ring) + process_requestq_packed(ctrlr, i); + else + process_requestq_split(ctrlr, i); + } + } + + g_should_stop = 2; + fprintf(stdout, "Ctrlr Worker Thread Exiting\n"); + sem_post(&exit_sem); + return NULL; +} + +static int +new_device(int vid) +{ + struct vhost_blk_ctrlr *ctrlr; + struct vhost_blk_queue *blk_vq; + struct rte_vhost_vring *vq; + uint64_t features; + pthread_t tid; + int i, ret; + + 
ctrlr = vhost_blk_ctrlr_find(dev_pathname); + if (!ctrlr) { + fprintf(stderr, "Controller is not ready\n"); + return -1; + } + + if (ctrlr->started) + return 0; + + ctrlr->bdev->vid = vid; + ret = rte_vhost_get_negotiated_features(vid, &features); + if (ret) { + fprintf(stderr, "failed to get the negotiated features\n"); + return -1; + } + ctrlr->packed_ring = !!(features & (1ULL << VIRTIO_F_RING_PACKED)); + + ret = rte_vhost_get_mem_table(vid, &ctrlr->mem); + if (ret) + fprintf(stderr, "Get Controller memory region failed\n"); + assert(ctrlr->mem != NULL); + + /* Disable Notifications and init last idx */ + for (i = 0; i < NUM_OF_BLK_QUEUES; i++) { + blk_vq = &ctrlr->bdev->queues[i]; + vq = &blk_vq->vq; + + ret = rte_vhost_get_vhost_vring(ctrlr->bdev->vid, i, vq); + assert(ret == 0); + + ret = rte_vhost_get_vring_base(ctrlr->bdev->vid, i, + &blk_vq->last_avail_idx, + &blk_vq->last_used_idx); + assert(ret == 0); + + ret = rte_vhost_get_vhost_ring_inflight(ctrlr->bdev->vid, i, + &blk_vq->inflight_vq); + assert(ret == 0); + + if (ctrlr->packed_ring) { + /* for the reconnection */ + ret = rte_vhost_get_vring_base_from_inflight( + ctrlr->bdev->vid, i, + &blk_vq->last_avail_idx, + &blk_vq->last_used_idx); + + blk_vq->avail_wrap_counter = blk_vq->last_avail_idx & + (1 << 15); + blk_vq->last_avail_idx = blk_vq->last_avail_idx & + 0x7fff; + blk_vq->used_wrap_counter = blk_vq->last_used_idx & + (1 << 15); + blk_vq->last_used_idx = blk_vq->last_used_idx & + 0x7fff; + } + + rte_vhost_enable_guest_notification(vid, i, 0); + } + + /* start polling vring */ + g_should_stop = 0; + fprintf(stdout, "New Device %s, Device ID %d\n", dev_pathname, vid); + if (pthread_create(&tid, NULL, &ctrlr_worker, ctrlr) != 0) { + fprintf(stderr, "Failed to start worker thread\n"); + return -1; + } + + /* device has been started */ + ctrlr->started = 1; + pthread_detach(tid); + return 0; +} + +static void +destroy_device(int vid) +{ + char path[PATH_MAX]; + struct vhost_blk_ctrlr *ctrlr; + struct
vhost_blk_queue *blk_vq; + int i, ret; + + ret = rte_vhost_get_ifname(vid, path, PATH_MAX); + if (ret) { + fprintf(stderr, "Destroy Ctrlr Failed\n"); + return; + } + + fprintf(stdout, "Destroy %s Device ID %d\n", path, vid); + ctrlr = vhost_blk_ctrlr_find(path); + if (!ctrlr) { + fprintf(stderr, "Destroy Ctrlr Failed\n"); + return; + } + + if (!ctrlr->started) + return; + + g_should_stop = 1; + while (g_should_stop != 2) + ; + + for (i = 0; i < NUM_OF_BLK_QUEUES; i++) { + blk_vq = &ctrlr->bdev->queues[i]; + if (ctrlr->packed_ring) { + blk_vq->last_avail_idx |= (blk_vq->avail_wrap_counter << + 15); + blk_vq->last_used_idx |= (blk_vq->used_wrap_counter << + 15); + } + rte_vhost_set_vring_base(ctrlr->bdev->vid, i, + blk_vq->last_avail_idx, + blk_vq->last_used_idx); + } + + free(ctrlr->mem); + + ctrlr->started = 0; + sem_wait(&exit_sem); +} + +static int +new_connection(int vid) +{ + /* extend the proper features for block device */ + vhost_session_install_rte_compat_hooks(vid); + + return 0; +} + +struct vhost_device_ops vhost_blk_device_ops = { + .new_device = new_device, + .destroy_device = destroy_device, + .new_connection = new_connection, +}; + +static struct vhost_block_dev * +vhost_blk_bdev_construct(const char *bdev_name, + const char *bdev_serial, uint32_t blk_size, uint64_t blk_cnt, + bool wce_enable) +{ + struct vhost_block_dev *bdev; + + bdev = rte_zmalloc(NULL, sizeof(*bdev), RTE_CACHE_LINE_SIZE); + if (!bdev) + return NULL; + + strncpy(bdev->name, bdev_name, sizeof(bdev->name)); + strncpy(bdev->product_name, bdev_serial, sizeof(bdev->product_name)); + bdev->blocklen = blk_size; + bdev->blockcnt = blk_cnt; + bdev->write_cache = wce_enable; + + fprintf(stdout, "blocklen=%d, blockcnt=%"PRIx64"\n", bdev->blocklen, + bdev->blockcnt); + + /* use memory as disk storage space */ + bdev->data = rte_zmalloc(NULL, blk_cnt * blk_size, 0); + if (!bdev->data) { + fprintf(stderr, "not enough reserved huge memory for disk\n"); + rte_free(bdev); + return NULL; + } + + return
bdev; +} + +static struct vhost_blk_ctrlr * +vhost_blk_ctrlr_construct(const char *ctrlr_name) +{ + int ret; + struct vhost_blk_ctrlr *ctrlr; + char *path; + char cwd[PATH_MAX]; + + /* always use current directory */ + path = getcwd(cwd, PATH_MAX); + if (!path) { + fprintf(stderr, "Cannot get current working directory\n"); + return NULL; + } + snprintf(dev_pathname, sizeof(dev_pathname), "%s/%s", path, ctrlr_name); + + if (access(dev_pathname, F_OK) != -1) { + if (unlink(dev_pathname) != 0) + rte_exit(EXIT_FAILURE, "Cannot remove %s.\n", + dev_pathname); + } + + if (rte_vhost_driver_register(dev_pathname, 0) != 0) { + fprintf(stderr, "socket %s already exists\n", dev_pathname); + return NULL; + } + + ret = rte_vhost_driver_set_features(dev_pathname, VHOST_BLK_FEATURES); + if (ret != 0) { + fprintf(stderr, "Set vhost driver features failed\n"); + rte_vhost_driver_unregister(dev_pathname); + return NULL; + } + + /* set proper features */ + vhost_dev_install_rte_compat_hooks(dev_pathname); + + ctrlr = rte_zmalloc(NULL, sizeof(*ctrlr), RTE_CACHE_LINE_SIZE); + if (!ctrlr) { + rte_vhost_driver_unregister(dev_pathname); + return NULL; + } + + /* hardcoded block device information with 128MiB */ + ctrlr->bdev = vhost_blk_bdev_construct("malloc0", "vhost_blk_malloc0", + 4096, 32768, 0); + if (!ctrlr->bdev) { + rte_free(ctrlr); + rte_vhost_driver_unregister(dev_pathname); + return NULL; + } + + rte_vhost_driver_callback_register(dev_pathname, + &vhost_blk_device_ops); + + return ctrlr; +} + +static void +signal_handler(__rte_unused int signum) +{ + struct vhost_blk_ctrlr *ctrlr; + + if (access(dev_pathname, F_OK) == 0) + unlink(dev_pathname); + + if (g_should_stop != -1) { + g_should_stop = 1; + while (g_should_stop != 2) + ; + } + + ctrlr = vhost_blk_ctrlr_find(dev_pathname); + if (ctrlr != NULL) { + if (ctrlr->bdev != NULL) { + rte_free(ctrlr->bdev->data); + rte_free(ctrlr->bdev); + } + rte_free(ctrlr); + } + + rte_vhost_driver_unregister(dev_pathname); + exit(0); +} + 
+int main(int argc, char *argv[]) +{ + int ret; + + signal(SIGINT, signal_handler); + + /* init EAL */ + ret = rte_eal_init(argc, argv); + if (ret < 0) + rte_exit(EXIT_FAILURE, "Error with EAL initialization\n"); + + g_vhost_ctrlr = vhost_blk_ctrlr_construct("vhost.socket"); + if (g_vhost_ctrlr == NULL) { + fprintf(stderr, "Construct vhost blk controller failed\n"); + return -1; + } + + if (sem_init(&exit_sem, 0, 0) < 0) { + fprintf(stderr, "Error init exit_sem\n"); + return -1; + } + + rte_vhost_driver_start(dev_pathname); + + /* loop until the application exits */ + while (1) + sleep(1); + + return 0; +} + diff --git a/examples/vhost_blk/vhost_blk.h b/examples/vhost_blk/vhost_blk.h new file mode 100644 index 000000000..933e2b7c5 --- /dev/null +++ b/examples/vhost_blk/vhost_blk.h @@ -0,0 +1,127 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2010-2017 Intel Corporation + */ + +#ifndef _VHOST_BLK_H_ +#define _VHOST_BLK_H_ + +#include +#include +#include +#include +#include +#include + +#include + +#ifndef VIRTIO_F_RING_PACKED +#define VIRTIO_F_RING_PACKED 34 + +struct vring_packed_desc { + /* Buffer Address. */ + __le64 addr; + /* Buffer Length. */ + __le32 len; + /* Buffer ID. */ + __le16 id; + /* The flags depending on descriptor type. */ + __le16 flags; +}; +#endif + +struct vhost_blk_queue { + struct rte_vhost_vring vq; + struct rte_vhost_ring_inflight inflight_vq; + uint16_t last_avail_idx; + uint16_t last_used_idx; + bool avail_wrap_counter; + bool used_wrap_counter; +}; + +#define NUM_OF_BLK_QUEUES 1 + +#define min(a, b) (((a) < (b)) ? (a) : (b)) + +struct vhost_block_dev { + /** ID for vhost library. */ + int vid; + /** Queues for the block device */ + struct vhost_blk_queue queues[NUM_OF_BLK_QUEUES]; + /** Unique name for this block device. */ + char name[64]; + + /** Unique product name for this kind of block device. 
*/ + char product_name[256]; + + /** Size in bytes of a logical block for the backend */ + uint32_t blocklen; + + /** Number of blocks */ + uint64_t blockcnt; + + /** write cache enabled, not used at the moment */ + int write_cache; + + /** use memory as disk storage space */ + uint8_t *data; +}; + +struct vhost_blk_ctrlr { + uint8_t started; + uint8_t packed_ring; + uint8_t need_restart; + /** Only support 1 LUN for the example */ + struct vhost_block_dev *bdev; + /** VM memory region */ + struct rte_vhost_memory *mem; +} __rte_cache_aligned; + +#define VHOST_BLK_MAX_IOVS 128 + +enum blk_data_dir { + BLK_DIR_NONE = 0, + BLK_DIR_TO_DEV = 1, + BLK_DIR_FROM_DEV = 2, +}; + +struct vhost_blk_task { + uint8_t readtype; + uint8_t req_idx; + uint16_t head_idx; + uint16_t last_idx; + uint16_t inflight_idx; + uint16_t buffer_id; + uint32_t dxfer_dir; + uint32_t data_len; + struct virtio_blk_outhdr *req; + + volatile uint8_t *status; + + struct iovec iovs[VHOST_BLK_MAX_IOVS]; + uint32_t iovs_cnt; + struct vring_packed_desc *desc_packed; + struct vring_desc *desc_split; + struct rte_vhost_vring *vq; + struct vhost_block_dev *bdev; + struct vhost_blk_ctrlr *ctrlr; +}; + +struct inflight_blk_task { + struct vhost_blk_task blk_task; + struct rte_vhost_inflight_desc_packed *inflight_desc; + struct rte_vhost_inflight_info_packed *inflight_packed; +}; + +struct vhost_blk_ctrlr *g_vhost_ctrlr; +struct vhost_device_ops vhost_blk_device_ops; + +int vhost_bdev_process_blk_commands(struct vhost_block_dev *bdev, + struct vhost_blk_task *task); + +void vhost_session_install_rte_compat_hooks(uint32_t vid); + +void vhost_dev_install_rte_compat_hooks(const char *path); + +struct vhost_blk_ctrlr *vhost_blk_ctrlr_find(const char *ctrlr_name); + +#endif /* _VHOST_BLK_H_ */ diff --git a/examples/vhost_blk/vhost_blk_compat.c b/examples/vhost_blk/vhost_blk_compat.c new file mode 100644 index 000000000..4accfa498 --- /dev/null +++ b/examples/vhost_blk/vhost_blk_compat.c @@ -0,0 +1,173 @@ +// 
SPDX-License-Identifier: BSD-3-Clause +// Copyright(c) 2010-2017 Intel Corporation + +#ifndef _VHOST_BLK_COMPAT_H_ +#define _VHOST_BLK_COMPAT_H_ + +#include +#include +#include +#include + +#include +#include "vhost_blk.h" +#include "blk_spec.h" + +#define VHOST_MAX_VQUEUES 256 +#define SPDK_VHOST_MAX_VQ_SIZE 1024 + +#define VHOST_USER_GET_CONFIG 24 +#define VHOST_USER_SET_CONFIG 25 + +static int +vhost_blk_get_config(struct vhost_block_dev *bdev, uint8_t *config, + uint32_t len) +{ + struct virtio_blk_config blkcfg; + uint32_t blk_size; + uint64_t blkcnt; + + if (bdev == NULL) { + /* We can't just return -1 here as this GET_CONFIG message might + * be caused by a QEMU VM reboot. Returning -1 will indicate an + * error to QEMU, who might then decide to terminate itself. + * We don't want that. A simple reboot shouldn't break the + * system. + * + * Presenting a block device with block size 0 and block count 0 + * doesn't cause any problems on QEMU side and the virtio-pci + * device is even still available inside the VM, but there will + * be no block device created for it - the kernel drivers will + * silently reject it. 
+ */ + blk_size = 0; + blkcnt = 0; + } else { + blk_size = bdev->blocklen; + blkcnt = bdev->blockcnt; + } + + memset(&blkcfg, 0, sizeof(blkcfg)); + blkcfg.blk_size = blk_size; + /* minimum I/O size in blocks */ + blkcfg.min_io_size = 1; + /* expressed in 512 Bytes sectors */ + blkcfg.capacity = (blkcnt * blk_size) / 512; + /* QEMU can overwrite this value when started */ + blkcfg.num_queues = VHOST_MAX_VQUEUES; + + fprintf(stdout, "block device:blk_size = %u, blkcnt = %llu\n", + blk_size, (unsigned long long)blkcnt); + + memcpy(config, &blkcfg, min(len, sizeof(blkcfg))); + + return 0; +} + +static enum rte_vhost_msg_result +extern_vhost_pre_msg_handler(int vid, void *_msg) +{ + char path[PATH_MAX]; + struct vhost_blk_ctrlr *ctrlr; + struct vhost_user_msg *msg = _msg; + int ret; + + ret = rte_vhost_get_ifname(vid, path, PATH_MAX); + if (ret) { + fprintf(stderr, "Cannot get socket name\n"); + return RTE_VHOST_MSG_RESULT_ERR; + } + + ctrlr = vhost_blk_ctrlr_find(path); + if (!ctrlr) { + fprintf(stderr, "Controller is not ready\n"); + return RTE_VHOST_MSG_RESULT_ERR; + } + + switch ((int)msg->request) { + case VHOST_USER_GET_VRING_BASE: + case VHOST_USER_SET_VRING_BASE: + case VHOST_USER_SET_VRING_ADDR: + case VHOST_USER_SET_VRING_NUM: + case VHOST_USER_SET_VRING_KICK: + case VHOST_USER_SET_VRING_CALL: + case VHOST_USER_SET_MEM_TABLE: + break; + case VHOST_USER_GET_CONFIG: { + int rc = 0; + + rc = vhost_blk_get_config(ctrlr->bdev, + msg->payload.cfg.region, + msg->payload.cfg.size); + if (rc != 0) + msg->size = 0; + + return RTE_VHOST_MSG_RESULT_REPLY; + } + case VHOST_USER_SET_CONFIG: + default: + break; + } + + return RTE_VHOST_MSG_RESULT_NOT_HANDLED; +} + +static enum rte_vhost_msg_result +extern_vhost_post_msg_handler(int vid, void *_msg) +{ + char path[PATH_MAX]; + struct vhost_blk_ctrlr *ctrlr; + struct vhost_user_msg *msg = _msg; + int ret; + + ret = rte_vhost_get_ifname(vid, path, PATH_MAX); + if (ret) { + fprintf(stderr, "Cannot get socket name\n"); + return RTE_VHOST_MSG_RESULT_ERR; + } + + ctrlr = vhost_blk_ctrlr_find(path); + if (!ctrlr) { + 
fprintf(stderr, "Controller is not ready\n"); + return RTE_VHOST_MSG_RESULT_ERR; + } + + switch (msg->request) { + case VHOST_USER_SET_FEATURES: + case VHOST_USER_SET_VRING_KICK: + default: + break; + } + + return RTE_VHOST_MSG_RESULT_NOT_HANDLED; +} + +struct rte_vhost_user_extern_ops g_extern_vhost_ops = { + .pre_msg_handle = extern_vhost_pre_msg_handler, + .post_msg_handle = extern_vhost_post_msg_handler, +}; + +void +vhost_session_install_rte_compat_hooks(uint32_t vid) +{ + int rc; + + rc = rte_vhost_extern_callback_register(vid, &g_extern_vhost_ops, NULL); + if (rc != 0) + fprintf(stderr, + "rte_vhost_extern_callback_register() failed for vid = %d\n", + vid); +} + +void +vhost_dev_install_rte_compat_hooks(const char *path) +{ + uint64_t protocol_features = 0; + + rte_vhost_driver_get_protocol_features(path, &protocol_features); + protocol_features |= (1ULL << VHOST_USER_PROTOCOL_F_CONFIG); + protocol_features |= (1ULL << VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD); + rte_vhost_driver_set_protocol_features(path, protocol_features); +} + +#endif -- 2.17.2
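
Note on the packed-ring reconnection path in new_device() and destroy_device(): the ring base exchanged with the vhost library carries the wrap counter in bit 15 and the ring index in the low 15 bits. The sketch below isolates that packing so the round trip is easy to check. The names ring_pos, ring_pos_unpack() and ring_pos_pack() are hypothetical; they are not part of this patch or of librte_vhost, and the sketch only assumes the bit layout used in the code above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
#define WRAP_BIT   (1u << 15)
#define INDEX_MASK 0x7fffu

struct ring_pos {
	uint16_t idx;	/* 15-bit ring index */
	bool wrap;	/* wrap counter */
};

/* Split a 16-bit ring base into index and wrap counter. */
static struct ring_pos
ring_pos_unpack(uint16_t base)
{
	struct ring_pos pos;

	pos.wrap = (base & WRAP_BIT) != 0;
	pos.idx = (uint16_t)(base & INDEX_MASK);
	return pos;
}

/* Merge index and wrap counter back into a 16-bit ring base. */
static uint16_t
ring_pos_pack(struct ring_pos pos)
{
	return (uint16_t)((pos.idx & INDEX_MASK) |
			  ((unsigned int)pos.wrap << 15));
}
```

Unpacking 0x8005 gives index 5 with the wrap counter set, and packing that pair again yields 0x8005, which mirrors what new_device() reads and destroy_device() writes back through rte_vhost_set_vring_base().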