From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28]) by dpdk.org (Postfix) with ESMTP id 237E911C5 for ; Wed, 11 Jan 2017 03:42:36 +0100 (CET) Received: from int-mx11.intmail.prod.int.phx2.redhat.com (int-mx11.intmail.prod.int.phx2.redhat.com [10.5.11.24]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 7D5DB6DD86; Wed, 11 Jan 2017 02:42:36 +0000 (UTC) Received: from [10.72.5.47] (vpn1-5-47.pek2.redhat.com [10.72.5.47]) by int-mx11.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id v0B2gUYJ001168 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 10 Jan 2017 21:42:33 -0500 To: "Tan, Jianfeng" , dev@dpdk.org References: <1480689075-66977-1-git-send-email-jianfeng.tan@intel.com> <1482477266-39199-1-git-send-email-jianfeng.tan@intel.com> <1482477266-39199-6-git-send-email-jianfeng.tan@intel.com> <46af618f-c01b-3571-78fc-12d10859a4a1@redhat.com> <4fa4d06c-d359-df12-a073-7c2c2540b634@intel.com> Cc: yuanhan.liu@linux.intel.com, ferruh.yigit@intel.com, cunming.liang@intel.com From: Jason Wang Message-ID: Date: Wed, 11 Jan 2017 10:42:28 +0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.5.1 MIME-Version: 1.0 In-Reply-To: <4fa4d06c-d359-df12-a073-7c2c2540b634@intel.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit X-Scanned-By: MIMEDefang 2.68 on 10.5.11.24 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.28]); Wed, 11 Jan 2017 02:42:36 +0000 (UTC) Subject: Re: [dpdk-dev] [PATCH v2 5/7] net/virtio_user: add vhost kernel support X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 11 Jan 2017 02:42:36 -0000 On 2017年01月10日 14:11, Tan, Jianfeng wrote: > > Hi Jason, > > > On 1/9/2017 12:39 PM, Jason Wang wrote: >> >> >> On 2016年12月23日 15:14, Jianfeng Tan wrote: >>> This patch add support vhost kernel as the backend for virtio_user. >>> Three main hook functions are added: >>> - vhost_kernel_setup() to open char device, each vq pair needs one >>> vhostfd; >>> - vhost_kernel_ioctl() to communicate control messages with vhost >>> kernel module; >>> - vhost_kernel_enable_queue_pair() to open tap device and set it >>> as the backend of corresonding vhost fd (that is to say, vq pair). >>> >>> Signed-off-by: Jianfeng Tan >>> --- >>> > [...] >>> +/* TUNSETIFF ifr flags */ >>> +#define IFF_TAP 0x0002 >>> +#define IFF_NO_PI 0x1000 >>> +#define IFF_ONE_QUEUE 0x2000 >>> +#define IFF_VNET_HDR 0x4000 >>> +#define IFF_MULTI_QUEUE 0x0100 >>> +#define IFF_ATTACH_QUEUE 0x0200 >>> +#define IFF_DETACH_QUEUE 0x0400 >> >> Do we really want to duplicate those things which has been exposed by >> uapi here? > > You mean those defined in ? Redefine those common > macros, or include standard header file, with respective pros and > cons. DPDK prefers the redefinition way as far as I understand, > doesn't it? > Well, if you really want to do this, you may want to use an independent file. Then you can sync it with linux headers with a bash script. >> >>> + >>> +/* Constants */ >>> +#define TUN_DEF_SNDBUF (1ull << 20) >>> +#define PATH_NET_TUN "/dev/net/tun" >>> +#define VHOST_KERNEL_MAX_REGIONS 64 >> >> Unfortunate not a constant any more since c9ce42f72fd0 vhost: add >> max_mem_regions module parameter. > > Yes, I was considering to ignore this in the initial release. But it's > not a big effort, I'll try to fix it in latest version. > >> >>> + >>> +static uint64_t vhost_req_user_to_kernel[] = { >>> + [VHOST_USER_SET_OWNER] = VHOST_SET_OWNER, >>> + [VHOST_USER_RESET_OWNER] = VHOST_RESET_OWNER, >>> + [VHOST_USER_SET_FEATURES] = VHOST_SET_FEATURES, >>> + [VHOST_USER_GET_FEATURES] = VHOST_GET_FEATURES, >>> + [VHOST_USER_SET_VRING_CALL] = VHOST_SET_VRING_CALL, >>> + [VHOST_USER_SET_VRING_NUM] = VHOST_SET_VRING_NUM, >>> + [VHOST_USER_SET_VRING_BASE] = VHOST_SET_VRING_BASE, >>> + [VHOST_USER_GET_VRING_BASE] = VHOST_GET_VRING_BASE, >>> + [VHOST_USER_SET_VRING_ADDR] = VHOST_SET_VRING_ADDR, >>> + [VHOST_USER_SET_VRING_KICK] = VHOST_SET_VRING_KICK, >>> + [VHOST_USER_SET_MEM_TABLE] = VHOST_SET_MEM_TABLE, >>> +}; >>> + >>> +/* By default, vhost kernel module allows 64 regions, but DPDK allows >>> + * 256 segments. As a relief, below function merges those virtually >>> + * adjacent memsegs into one region. >>> + */ >>> +static struct vhost_memory_kernel * >>> +prepare_vhost_memory_kernel(void) >>> +{ >>> + uint32_t i, j, k = 0; >>> + struct rte_memseg *seg; >>> + struct vhost_memory_region *mr; >>> + struct vhost_memory_kernel *vm; >>> + >>> + vm = malloc(sizeof(struct vhost_memory_kernel) + >>> + VHOST_KERNEL_MAX_REGIONS * >>> + sizeof(struct vhost_memory_region)); >>> + >>> + for (i = 0; i < RTE_MAX_MEMSEG; ++i) { >>> + seg = &rte_eal_get_configuration()->mem_config->memseg[i]; >>> + if (!seg->addr) >>> + break; >> >> If we're sure the number of regions is less than 64(or the module >> parameter read from /sys), can we avoid the iteration here? > > The "if" statement under the "for" statement can save us from all > RTE_MAX_MEMSEG iteration. But if we know the number of regions is short than the limit, there's even no need for this? > >> >>> + >>> + int new_region = 1; >>> + >>> + for (j = 0; j < k; ++j) { >>> + mr = &vm->regions[j]; >>> + >>> + if (mr->userspace_addr + mr->memory_size == >>> + (uint64_t)seg->addr) { >>> + mr->memory_size += seg->len; >>> + new_region = 0; >>> + break; >>> + } >>> + >>> + if ((uint64_t)seg->addr + seg->len == >>> + mr->userspace_addr) { >>> + mr->guest_phys_addr = (uint64_t)seg->addr; >>> + mr->userspace_addr = (uint64_t)seg->addr; >>> + mr->memory_size += seg->len; >>> + new_region = 0; >>> + break; >>> + } >>> + } >>> + >>> + if (new_region == 0) >>> + continue; >>> + >>> + mr = &vm->regions[k++]; >>> + mr->guest_phys_addr = (uint64_t)seg->addr; /* use vaddr >>> here! */ >>> + mr->userspace_addr = (uint64_t)seg->addr; >>> + mr->memory_size = seg->len; >>> + mr->mmap_offset = 0; >>> + >>> + if (k >= VHOST_KERNEL_MAX_REGIONS) { >>> + free(vm); >>> + return NULL; >>> + } >>> + } >>> + >>> + vm->nregions = k; >>> + vm->padding = 0; >>> + return vm; >>> +} >>> + >>> +static int >>> +vhost_kernel_ioctl(struct virtio_user_dev *dev, >>> + enum vhost_user_request req, >>> + void *arg) >>> +{ >>> + int i, ret = -1; >>> + uint64_t req_kernel; >>> + struct vhost_memory_kernel *vm = NULL; >>> + >>> + req_kernel = vhost_req_user_to_kernel[req]; >>> + >>> + if (req_kernel == VHOST_SET_MEM_TABLE) { >>> + vm = prepare_vhost_memory_kernel(); >>> + if (!vm) >>> + return -1; >>> + arg = (void *)vm; >>> + } >>> + >>> + /* Does not work when VIRTIO_F_IOMMU_PLATFORM now, why? */ >> >> I think the reason is when VIRTIO_F_IOMMU_PLATFORM is negotiated, all >> address should be iova instead of gpa. >> > > Yes, I agree. As we don't have to do memory protection in a single > process, so it's completely useless here, right? Yes if there's no IOMMU concept in this case. > >>> + if (req_kernel == VHOST_SET_FEATURES) >>> + *(uint64_t *)arg &= ~(1ULL << VIRTIO_F_IOMMU_PLATFORM); >>> + >>> + for (i = 0; i < VHOST_KERNEL_MAX_QUEUES; ++i) { >>> + if (dev->vhostfds[i] < 0) >>> + continue; >>> + > [...] >>> + if (!enable) { >>> + if (dev->tapfds[pair_idx]) { >>> + close(dev->tapfds[pair_idx]); >>> + dev->tapfds[pair_idx] = -1; >>> + } >>> + return vhost_kernel_set_backend(vhostfd, -1); >> >> If this is used to for thing like ethtool -L in guest, we should use >> TUNSETQUEUE here. > > Oops, I was missing this ioctl operation. Let me fix it in next version. > >> >>> + } else if (dev->tapfds[pair_idx] >= 0) { >>> + return 0; >>> + } >>> + >>> + if ((dev->features & (1ULL << VIRTIO_NET_F_MRG_RXBUF)) || >>> + (dev->features & (1ULL << VIRTIO_F_VERSION_1))) >>> + hdr_size = sizeof(struct virtio_net_hdr_mrg_rxbuf); >>> + else >>> + hdr_size = sizeof(struct virtio_net_hdr); >>> + >>> + /* TODO: >>> + * 1. verify we can get/set vnet_hdr_len, tap_probe_vnet_hdr_len >>> + * 2. get number of memory regions from vhost module parameter >>> + * max_mem_regions, supported in newer version linux kernel >>> + */ >>> + tapfd = open(PATH_NET_TUN, O_RDWR); >>> + if (tapfd < 0) { >>> + PMD_DRV_LOG(ERR, "fail to open %s: %s", >>> + PATH_NET_TUN, strerror(errno)); >>> + return -1; >>> + } >>> + >>> + /* Construct ifr */ >>> + memset(&ifr, 0, sizeof(ifr)); >>> + ifr.ifr_flags = IFF_TAP | IFF_NO_PI; >>> + >>> + if (ioctl(tapfd, TUNGETFEATURES, &features) == -1) { >>> + PMD_DRV_LOG(ERR, "TUNGETFEATURES failed: %s", >>> strerror(errno)); >>> + goto error; >>> + } >>> + if (features & IFF_ONE_QUEUE) >>> + ifr.ifr_flags |= IFF_ONE_QUEUE; >>> + >>> + /* Let tap instead of vhost-net handle vnet header, as the >>> latter does >>> + * not support offloading. And in this case, we should not set >>> feature >>> + * bit VHOST_NET_F_VIRTIO_NET_HDR. >>> + */ >>> + if (features & IFF_VNET_HDR) { >>> + ifr.ifr_flags |= IFF_VNET_HDR; >>> + } else { >>> + PMD_DRV_LOG(ERR, "TAP does not support IFF_VNET_HDR"); >>> + goto error; >>> + } >>> + >>> + if (dev->ifname) >>> + strncpy(ifr.ifr_name, dev->ifname, IFNAMSIZ); >>> + else >>> + strncpy(ifr.ifr_name, "tap%d", IFNAMSIZ); >>> + if (ioctl(tapfd, TUNSETIFF, (void *)&ifr) == -1) { >>> + PMD_DRV_LOG(ERR, "TUNSETIFF failed: %s", strerror(errno)); >>> + goto error; >>> + } >> >> This requires CAP_NET_ADMIN, so we should really consider to accept a >> pre-created fd here. > > It sounds like a future work for me. So far, all DPDK apps are running > in privileged mode (including CAP_NET_ADMIN?). > That's not safe. Accepting a per-created fd can solve this, and can even support macvtap. >> >>> + >>> + fcntl(tapfd, F_SETFL, O_NONBLOCK); >>> + >>> + if (ioctl(tapfd, TUNSETVNETHDRSZ, &hdr_size) < 0) { >>> + PMD_DRV_LOG(ERR, "TUNSETVNETHDRSZ failed: %s", >>> strerror(errno)); >>> + goto error; >>> + } >>> + >>> + if (ioctl(tapfd, TUNSETSNDBUF, &sndbuf) < 0) { >>> + PMD_DRV_LOG(ERR, "TUNSETSNDBUF failed: %s", strerror(errno)); >>> + goto error; >>> + } >> >> Let's use INT_MAX as default here to survive from evil consumer here. > > Oh yes, I will fix it. > > Thanks, > Jianfeng