* [dpdk-dev] [PATCH v1 0/5] vhost add vectorized data path
@ 2020-08-19 3:24 Marvin Liu
2020-08-19 3:24 ` [dpdk-dev] [PATCH v1 1/5] vhost: " Marvin Liu
` (4 more replies)
0 siblings, 5 replies; 36+ messages in thread
From: Marvin Liu @ 2020-08-19 3:24 UTC (permalink / raw)
To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu
The packed ring format was introduced in virtio spec 1.1. All
descriptors are compacted into one single ring when the packed ring
format is in use, so it is straightforward to accelerate ring
operations with SIMD instructions.
This patch set introduces a vectorized data path in the vhost library.
When the vectorized option is enabled, operations such as descriptor
checks, descriptor writeback and address translation are accelerated
by SIMD instructions. Vhost applications can choose whether to use
vectorized acceleration, just like the external buffer and zero-copy
features.
If the platform or ring format does not support the vectorized
functions, vhost falls back to the default batch functions, so there
is no impact on the current data path.
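Applications can request the vectorized path when registering the
vhost-user socket. Below is a minimal sketch (the socket path is
illustrative; the flag itself is added in patch 1). Users of the vhost
PMD can pass the new vectorized=1 devarg instead.

    #include <rte_vhost.h>  /* rte_vhost_driver_register(), RTE_VHOST_USER_* */

    uint64_t flags = RTE_VHOST_USER_VECTORIZED;

    /* register the vhost-user socket with the vectorized flag set */
    if (rte_vhost_driver_register("/tmp/vhost-user0.sock", flags) < 0)
        return -1;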
Marvin Liu (5):
vhost: add vectorized data path
vhost: reuse packed ring functions
vhost: prepare memory regions addresses
vhost: add packed ring vectorized dequeue
vhost: add packed ring vectorized enqueue
doc/guides/nics/vhost.rst | 5 +
doc/guides/prog_guide/vhost_lib.rst | 12 ++
drivers/net/vhost/rte_eth_vhost.c | 17 +-
lib/librte_vhost/Makefile | 13 ++
lib/librte_vhost/meson.build | 16 ++
lib/librte_vhost/rte_vhost.h | 1 +
lib/librte_vhost/socket.c | 5 +
lib/librte_vhost/vhost.c | 11 ++
lib/librte_vhost/vhost.h | 235 ++++++++++++++++++++++
lib/librte_vhost/vhost_user.c | 11 ++
lib/librte_vhost/vhost_vec_avx.c | 292 ++++++++++++++++++++++++++++
lib/librte_vhost/virtio_net.c | 257 ++++--------------------
12 files changed, 659 insertions(+), 216 deletions(-)
create mode 100644 lib/librte_vhost/vhost_vec_avx.c
--
2.17.1
* [dpdk-dev] [PATCH v1 1/5] vhost: add vectorized data path
2020-08-19 3:24 [dpdk-dev] [PATCH v1 0/5] vhost add vectorized data path Marvin Liu
@ 2020-08-19 3:24 ` Marvin Liu
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 0/5] vhost " Marvin Liu
2020-10-09 8:14 ` [dpdk-dev] [PATCH v3 " Marvin Liu
2020-08-19 3:24 ` [dpdk-dev] [PATCH v1 2/5] vhost: reuse packed ring functions Marvin Liu
` (3 subsequent siblings)
4 siblings, 2 replies; 36+ messages in thread
From: Marvin Liu @ 2020-08-19 3:24 UTC (permalink / raw)
To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu
Packed ring operations are split into batch and single functions for
performance reasons. Ring operations in the batch functions can be
accelerated by SIMD instructions such as AVX512.
This patch therefore introduces a vectorized parameter in vhost. The
vectorized data path is selected when the platform and ring format
match the requirements; otherwise vhost falls back to the original
data path.
Signed-off-by: Marvin Liu <yong.liu@intel.com>
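The new devarg can be exercised by hot-plugging a vhost PMD port from
the application; a minimal sketch is shown below (the device name and
socket path are illustrative). The same argument string can be passed
through the EAL --vdev option.

    #include <rte_bus_vdev.h>  /* rte_vdev_init() */

    /* create a vhost PMD port with the vectorized data path requested */
    if (rte_vdev_init("net_vhost0",
            "iface=/tmp/vhost0.sock,queues=1,vectorized=1") < 0)
        return -1;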
diff --git a/doc/guides/nics/vhost.rst b/doc/guides/nics/vhost.rst
index d36f3120b2..efdaf4de09 100644
--- a/doc/guides/nics/vhost.rst
+++ b/doc/guides/nics/vhost.rst
@@ -64,6 +64,11 @@ The user can specify below arguments in `--vdev` option.
It is used to enable external buffer support in vhost library.
(Default: 0 (disabled))
+#. ``vectorized``:
+
+ It is used to enable vectorized data path support in vhost library.
+ (Default: 0 (disabled))
+
Vhost PMD event handling
------------------------
diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst
index b892eec67a..d5d421441c 100644
--- a/doc/guides/prog_guide/vhost_lib.rst
+++ b/doc/guides/prog_guide/vhost_lib.rst
@@ -162,6 +162,18 @@ The following is an overview of some key Vhost API functions:
It is disabled by default.
+ - ``RTE_VHOST_USER_VECTORIZED``
+ Vectorized data path will be used when this flag is set. When the packed
+ ring is enabled, available descriptors are stored by the frontend driver in
+ sequence. SIMD instructions like AVX can be used to handle multiple
+ descriptors simultaneously, which accelerates the throughput of ring operations.
+
+ * Only the packed ring has a vectorized data path.
+
+ * Vhost will fall back to the normal datapath if vectorization is not supported.
+
+ It is disabled by default.
+
* ``rte_vhost_driver_set_features(path, features)``
This function sets the feature bits the vhost-user driver supports. The
diff --git a/drivers/net/vhost/rte_eth_vhost.c b/drivers/net/vhost/rte_eth_vhost.c
index e55278af69..2ba5a2a076 100644
--- a/drivers/net/vhost/rte_eth_vhost.c
+++ b/drivers/net/vhost/rte_eth_vhost.c
@@ -35,6 +35,7 @@ enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
#define ETH_VHOST_VIRTIO_NET_F_HOST_TSO "tso"
#define ETH_VHOST_LINEAR_BUF "linear-buffer"
#define ETH_VHOST_EXT_BUF "ext-buffer"
+#define ETH_VHOST_VECTORIZED "vectorized"
#define VHOST_MAX_PKT_BURST 32
static const char *valid_arguments[] = {
@@ -47,6 +48,7 @@ static const char *valid_arguments[] = {
ETH_VHOST_VIRTIO_NET_F_HOST_TSO,
ETH_VHOST_LINEAR_BUF,
ETH_VHOST_EXT_BUF,
+ ETH_VHOST_VECTORIZED,
NULL
};
@@ -1507,6 +1509,7 @@ rte_pmd_vhost_probe(struct rte_vdev_device *dev)
int tso = 0;
int linear_buf = 0;
int ext_buf = 0;
+ int vectorized = 0;
struct rte_eth_dev *eth_dev;
const char *name = rte_vdev_device_name(dev);
@@ -1626,6 +1629,17 @@ rte_pmd_vhost_probe(struct rte_vdev_device *dev)
flags |= RTE_VHOST_USER_EXTBUF_SUPPORT;
}
+ if (rte_kvargs_count(kvlist, ETH_VHOST_VECTORIZED) == 1) {
+ ret = rte_kvargs_process(kvlist,
+ ETH_VHOST_VECTORIZED,
+ &open_int, &vectorized);
+ if (ret < 0)
+ goto out_free;
+
+ if (vectorized == 1)
+ flags |= RTE_VHOST_USER_VECTORIZED;
+ }
+
if (dev->device.numa_node == SOCKET_ID_ANY)
dev->device.numa_node = rte_socket_id();
@@ -1679,4 +1693,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_vhost,
"postcopy-support=<0|1> "
"tso=<0|1> "
"linear-buffer=<0|1> "
- "ext-buffer=<0|1>");
+ "ext-buffer=<0|1> "
+ "vectorized=<0|1>");
diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
index a94c84134d..c7f946c6c1 100644
--- a/lib/librte_vhost/rte_vhost.h
+++ b/lib/librte_vhost/rte_vhost.h
@@ -36,6 +36,7 @@ extern "C" {
/* support only linear buffers (no chained mbufs) */
#define RTE_VHOST_USER_LINEARBUF_SUPPORT (1ULL << 6)
#define RTE_VHOST_USER_ASYNC_COPY (1ULL << 7)
+#define RTE_VHOST_USER_VECTORIZED (1ULL << 8)
/* Features. */
#ifndef VIRTIO_NET_F_GUEST_ANNOUNCE
diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
index 73e1dca95e..cc11244693 100644
--- a/lib/librte_vhost/socket.c
+++ b/lib/librte_vhost/socket.c
@@ -43,6 +43,7 @@ struct vhost_user_socket {
bool extbuf;
bool linearbuf;
bool async_copy;
+ bool vectorized;
/*
* The "supported_features" indicates the feature bits the
@@ -245,6 +246,9 @@ vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
dev->async_copy = 1;
}
+ if (vsocket->vectorized)
+ vhost_enable_vectorized(vid);
+
VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
if (vsocket->notify_ops->new_connection) {
@@ -881,6 +885,7 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
vsocket->dequeue_zero_copy = flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
vsocket->extbuf = flags & RTE_VHOST_USER_EXTBUF_SUPPORT;
vsocket->linearbuf = flags & RTE_VHOST_USER_LINEARBUF_SUPPORT;
+ vsocket->vectorized = flags & RTE_VHOST_USER_VECTORIZED;
if (vsocket->dequeue_zero_copy &&
(flags & RTE_VHOST_USER_IOMMU_SUPPORT)) {
diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index 8f20a0818f..50bf033a9d 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -752,6 +752,17 @@ vhost_enable_linearbuf(int vid)
dev->linearbuf = 1;
}
+void
+vhost_enable_vectorized(int vid)
+{
+ struct virtio_net *dev = get_device(vid);
+
+ if (dev == NULL)
+ return;
+
+ dev->vectorized = 1;
+}
+
int
rte_vhost_get_mtu(int vid, uint16_t *mtu)
{
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 632f66d532..b556eb3bf6 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -383,6 +383,7 @@ struct virtio_net {
int async_copy;
int extbuf;
int linearbuf;
+ int vectorized;
struct vhost_virtqueue *virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
struct inflight_mem_info *inflight_info;
#define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
@@ -721,6 +722,7 @@ void vhost_enable_dequeue_zero_copy(int vid);
void vhost_set_builtin_virtio_net(int vid, bool enable);
void vhost_enable_extbuf(int vid);
void vhost_enable_linearbuf(int vid);
+void vhost_enable_vectorized(int vid);
int vhost_enable_guest_notification(struct virtio_net *dev,
struct vhost_virtqueue *vq, int enable);
--
2.17.1
* [dpdk-dev] [PATCH v1 2/5] vhost: reuse packed ring functions
2020-08-19 3:24 [dpdk-dev] [PATCH v1 0/5] vhost add vectorized data path Marvin Liu
2020-08-19 3:24 ` [dpdk-dev] [PATCH v1 1/5] vhost: " Marvin Liu
@ 2020-08-19 3:24 ` Marvin Liu
2020-08-19 3:24 ` [dpdk-dev] [PATCH v1 3/5] vhost: prepare memory regions addresses Marvin Liu
` (2 subsequent siblings)
4 siblings, 0 replies; 36+ messages in thread
From: Marvin Liu @ 2020-08-19 3:24 UTC (permalink / raw)
To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu
Move the parse_ethernet, offload and extbuf functions to the header
file. These functions will be reused by the vhost vectorized path.
Signed-off-by: Marvin Liu <yong.liu@intel.com>
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index b556eb3bf6..5a5c945551 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -20,6 +20,10 @@
#include <rte_rwlock.h>
#include <rte_malloc.h>
+#include <rte_ip.h>
+#include <rte_tcp.h>
+#include <rte_udp.h>
+#include <rte_sctp.h>
#include "rte_vhost.h"
#include "rte_vdpa.h"
#include "rte_vdpa_dev.h"
@@ -905,4 +909,215 @@ put_zmbuf(struct zcopy_mbuf *zmbuf)
zmbuf->in_use = 0;
}
+static __rte_always_inline bool
+virtio_net_is_inorder(struct virtio_net *dev)
+{
+ return dev->features & (1ULL << VIRTIO_F_IN_ORDER);
+}
+
+static __rte_always_inline void
+parse_ethernet(struct rte_mbuf *m, uint16_t *l4_proto, void **l4_hdr)
+{
+ struct rte_ipv4_hdr *ipv4_hdr;
+ struct rte_ipv6_hdr *ipv6_hdr;
+ void *l3_hdr = NULL;
+ struct rte_ether_hdr *eth_hdr;
+ uint16_t ethertype;
+
+ eth_hdr = rte_pktmbuf_mtod(m, struct rte_ether_hdr *);
+
+ m->l2_len = sizeof(struct rte_ether_hdr);
+ ethertype = rte_be_to_cpu_16(eth_hdr->ether_type);
+
+ if (ethertype == RTE_ETHER_TYPE_VLAN) {
+ struct rte_vlan_hdr *vlan_hdr =
+ (struct rte_vlan_hdr *)(eth_hdr + 1);
+
+ m->l2_len += sizeof(struct rte_vlan_hdr);
+ ethertype = rte_be_to_cpu_16(vlan_hdr->eth_proto);
+ }
+
+ l3_hdr = (char *)eth_hdr + m->l2_len;
+
+ switch (ethertype) {
+ case RTE_ETHER_TYPE_IPV4:
+ ipv4_hdr = l3_hdr;
+ *l4_proto = ipv4_hdr->next_proto_id;
+ m->l3_len = (ipv4_hdr->version_ihl & 0x0f) * 4;
+ *l4_hdr = (char *)l3_hdr + m->l3_len;
+ m->ol_flags |= PKT_TX_IPV4;
+ break;
+ case RTE_ETHER_TYPE_IPV6:
+ ipv6_hdr = l3_hdr;
+ *l4_proto = ipv6_hdr->proto;
+ m->l3_len = sizeof(struct rte_ipv6_hdr);
+ *l4_hdr = (char *)l3_hdr + m->l3_len;
+ m->ol_flags |= PKT_TX_IPV6;
+ break;
+ default:
+ m->l3_len = 0;
+ *l4_proto = 0;
+ *l4_hdr = NULL;
+ break;
+ }
+}
+
+static __rte_always_inline bool
+virtio_net_with_host_offload(struct virtio_net *dev)
+{
+ if (dev->features &
+ ((1ULL << VIRTIO_NET_F_CSUM) |
+ (1ULL << VIRTIO_NET_F_HOST_ECN) |
+ (1ULL << VIRTIO_NET_F_HOST_TSO4) |
+ (1ULL << VIRTIO_NET_F_HOST_TSO6) |
+ (1ULL << VIRTIO_NET_F_HOST_UFO)))
+ return true;
+
+ return false;
+}
+
+static __rte_always_inline void
+vhost_dequeue_offload(struct virtio_net_hdr *hdr, struct rte_mbuf *m)
+{
+ uint16_t l4_proto = 0;
+ void *l4_hdr = NULL;
+ struct rte_tcp_hdr *tcp_hdr = NULL;
+
+ if (hdr->flags == 0 && hdr->gso_type == VIRTIO_NET_HDR_GSO_NONE)
+ return;
+
+ parse_ethernet(m, &l4_proto, &l4_hdr);
+ if (hdr->flags == VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+ if (hdr->csum_start == (m->l2_len + m->l3_len)) {
+ switch (hdr->csum_offset) {
+ case (offsetof(struct rte_tcp_hdr, cksum)):
+ if (l4_proto == IPPROTO_TCP)
+ m->ol_flags |= PKT_TX_TCP_CKSUM;
+ break;
+ case (offsetof(struct rte_udp_hdr, dgram_cksum)):
+ if (l4_proto == IPPROTO_UDP)
+ m->ol_flags |= PKT_TX_UDP_CKSUM;
+ break;
+ case (offsetof(struct rte_sctp_hdr, cksum)):
+ if (l4_proto == IPPROTO_SCTP)
+ m->ol_flags |= PKT_TX_SCTP_CKSUM;
+ break;
+ default:
+ break;
+ }
+ }
+ }
+
+ if (l4_hdr && hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
+ switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
+ case VIRTIO_NET_HDR_GSO_TCPV4:
+ case VIRTIO_NET_HDR_GSO_TCPV6:
+ tcp_hdr = l4_hdr;
+ m->ol_flags |= PKT_TX_TCP_SEG;
+ m->tso_segsz = hdr->gso_size;
+ m->l4_len = (tcp_hdr->data_off & 0xf0) >> 2;
+ break;
+ case VIRTIO_NET_HDR_GSO_UDP:
+ m->ol_flags |= PKT_TX_UDP_SEG;
+ m->tso_segsz = hdr->gso_size;
+ m->l4_len = sizeof(struct rte_udp_hdr);
+ break;
+ default:
+ VHOST_LOG_DATA(WARNING,
+ "unsupported gso type %u.\n", hdr->gso_type);
+ break;
+ }
+ }
+}
+
+static void
+virtio_dev_extbuf_free(void *addr __rte_unused, void *opaque)
+{
+ rte_free(opaque);
+}
+
+static int
+virtio_dev_extbuf_alloc(struct rte_mbuf *pkt, uint32_t size)
+{
+ struct rte_mbuf_ext_shared_info *shinfo = NULL;
+ uint32_t total_len = RTE_PKTMBUF_HEADROOM + size;
+ uint16_t buf_len;
+ rte_iova_t iova;
+ void *buf;
+
+ /* Try to use pkt buffer to store shinfo to reduce the amount of memory
+ * required, otherwise store shinfo in the new buffer.
+ */
+ if (rte_pktmbuf_tailroom(pkt) >= sizeof(*shinfo))
+ shinfo = rte_pktmbuf_mtod(pkt,
+ struct rte_mbuf_ext_shared_info *);
+ else {
+ total_len += sizeof(*shinfo) + sizeof(uintptr_t);
+ total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
+ }
+
+ if (unlikely(total_len > UINT16_MAX))
+ return -ENOSPC;
+
+ buf_len = total_len;
+ buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
+ if (unlikely(buf == NULL))
+ return -ENOMEM;
+
+ /* Initialize shinfo */
+ if (shinfo) {
+ shinfo->free_cb = virtio_dev_extbuf_free;
+ shinfo->fcb_opaque = buf;
+ rte_mbuf_ext_refcnt_set(shinfo, 1);
+ } else {
+ shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
+ virtio_dev_extbuf_free, buf);
+ if (unlikely(shinfo == NULL)) {
+ rte_free(buf);
+ VHOST_LOG_DATA(ERR, "Failed to init shinfo\n");
+ return -1;
+ }
+ }
+
+ iova = rte_malloc_virt2iova(buf);
+ rte_pktmbuf_attach_extbuf(pkt, buf, iova, buf_len, shinfo);
+ rte_pktmbuf_reset_headroom(pkt);
+
+ return 0;
+}
+
+/*
+ * Allocate a host supported pktmbuf.
+ */
+static __rte_always_inline struct rte_mbuf *
+virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
+ uint32_t data_len)
+{
+ struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
+
+ if (unlikely(pkt == NULL)) {
+ VHOST_LOG_DATA(ERR,
+ "Failed to allocate memory for mbuf.\n");
+ return NULL;
+ }
+
+ if (rte_pktmbuf_tailroom(pkt) >= data_len)
+ return pkt;
+
+ /* attach an external buffer if supported */
+ if (dev->extbuf && !virtio_dev_extbuf_alloc(pkt, data_len))
+ return pkt;
+
+ /* check if chained buffers are allowed */
+ if (!dev->linearbuf)
+ return pkt;
+
+ /* Data doesn't fit into the buffer and the host supports
+ * only linear buffers
+ */
+ rte_pktmbuf_free(pkt);
+
+ return NULL;
+}
+
#endif /* _VHOST_NET_CDEV_H_ */
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index bd9303c8a9..6107662685 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -32,12 +32,6 @@ rxvq_is_mergeable(struct virtio_net *dev)
return dev->features & (1ULL << VIRTIO_NET_F_MRG_RXBUF);
}
-static __rte_always_inline bool
-virtio_net_is_inorder(struct virtio_net *dev)
-{
- return dev->features & (1ULL << VIRTIO_F_IN_ORDER);
-}
-
static bool
is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t nr_vring)
{
@@ -1804,121 +1798,6 @@ rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
return virtio_dev_rx_async_submit(dev, queue_id, pkts, count);
}
-static inline bool
-virtio_net_with_host_offload(struct virtio_net *dev)
-{
- if (dev->features &
- ((1ULL << VIRTIO_NET_F_CSUM) |
- (1ULL << VIRTIO_NET_F_HOST_ECN) |
- (1ULL << VIRTIO_NET_F_HOST_TSO4) |
- (1ULL << VIRTIO_NET_F_HOST_TSO6) |
- (1ULL << VIRTIO_NET_F_HOST_UFO)))
- return true;
-
- return false;
-}
-
-static void
-parse_ethernet(struct rte_mbuf *m, uint16_t *l4_proto, void **l4_hdr)
-{
- struct rte_ipv4_hdr *ipv4_hdr;
- struct rte_ipv6_hdr *ipv6_hdr;
- void *l3_hdr = NULL;
- struct rte_ether_hdr *eth_hdr;
- uint16_t ethertype;
-
- eth_hdr = rte_pktmbuf_mtod(m, struct rte_ether_hdr *);
-
- m->l2_len = sizeof(struct rte_ether_hdr);
- ethertype = rte_be_to_cpu_16(eth_hdr->ether_type);
-
- if (ethertype == RTE_ETHER_TYPE_VLAN) {
- struct rte_vlan_hdr *vlan_hdr =
- (struct rte_vlan_hdr *)(eth_hdr + 1);
-
- m->l2_len += sizeof(struct rte_vlan_hdr);
- ethertype = rte_be_to_cpu_16(vlan_hdr->eth_proto);
- }
-
- l3_hdr = (char *)eth_hdr + m->l2_len;
-
- switch (ethertype) {
- case RTE_ETHER_TYPE_IPV4:
- ipv4_hdr = l3_hdr;
- *l4_proto = ipv4_hdr->next_proto_id;
- m->l3_len = (ipv4_hdr->version_ihl & 0x0f) * 4;
- *l4_hdr = (char *)l3_hdr + m->l3_len;
- m->ol_flags |= PKT_TX_IPV4;
- break;
- case RTE_ETHER_TYPE_IPV6:
- ipv6_hdr = l3_hdr;
- *l4_proto = ipv6_hdr->proto;
- m->l3_len = sizeof(struct rte_ipv6_hdr);
- *l4_hdr = (char *)l3_hdr + m->l3_len;
- m->ol_flags |= PKT_TX_IPV6;
- break;
- default:
- m->l3_len = 0;
- *l4_proto = 0;
- *l4_hdr = NULL;
- break;
- }
-}
-
-static __rte_always_inline void
-vhost_dequeue_offload(struct virtio_net_hdr *hdr, struct rte_mbuf *m)
-{
- uint16_t l4_proto = 0;
- void *l4_hdr = NULL;
- struct rte_tcp_hdr *tcp_hdr = NULL;
-
- if (hdr->flags == 0 && hdr->gso_type == VIRTIO_NET_HDR_GSO_NONE)
- return;
-
- parse_ethernet(m, &l4_proto, &l4_hdr);
- if (hdr->flags == VIRTIO_NET_HDR_F_NEEDS_CSUM) {
- if (hdr->csum_start == (m->l2_len + m->l3_len)) {
- switch (hdr->csum_offset) {
- case (offsetof(struct rte_tcp_hdr, cksum)):
- if (l4_proto == IPPROTO_TCP)
- m->ol_flags |= PKT_TX_TCP_CKSUM;
- break;
- case (offsetof(struct rte_udp_hdr, dgram_cksum)):
- if (l4_proto == IPPROTO_UDP)
- m->ol_flags |= PKT_TX_UDP_CKSUM;
- break;
- case (offsetof(struct rte_sctp_hdr, cksum)):
- if (l4_proto == IPPROTO_SCTP)
- m->ol_flags |= PKT_TX_SCTP_CKSUM;
- break;
- default:
- break;
- }
- }
- }
-
- if (l4_hdr && hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
- switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
- case VIRTIO_NET_HDR_GSO_TCPV4:
- case VIRTIO_NET_HDR_GSO_TCPV6:
- tcp_hdr = l4_hdr;
- m->ol_flags |= PKT_TX_TCP_SEG;
- m->tso_segsz = hdr->gso_size;
- m->l4_len = (tcp_hdr->data_off & 0xf0) >> 2;
- break;
- case VIRTIO_NET_HDR_GSO_UDP:
- m->ol_flags |= PKT_TX_UDP_SEG;
- m->tso_segsz = hdr->gso_size;
- m->l4_len = sizeof(struct rte_udp_hdr);
- break;
- default:
- VHOST_LOG_DATA(WARNING,
- "unsupported gso type %u.\n", hdr->gso_type);
- break;
- }
- }
-}
-
static __rte_noinline void
copy_vnet_hdr_from_desc(struct virtio_net_hdr *hdr,
struct buf_vector *buf_vec)
@@ -2145,96 +2024,6 @@ get_zmbuf(struct vhost_virtqueue *vq)
return NULL;
}
-static void
-virtio_dev_extbuf_free(void *addr __rte_unused, void *opaque)
-{
- rte_free(opaque);
-}
-
-static int
-virtio_dev_extbuf_alloc(struct rte_mbuf *pkt, uint32_t size)
-{
- struct rte_mbuf_ext_shared_info *shinfo = NULL;
- uint32_t total_len = RTE_PKTMBUF_HEADROOM + size;
- uint16_t buf_len;
- rte_iova_t iova;
- void *buf;
-
- /* Try to use pkt buffer to store shinfo to reduce the amount of memory
- * required, otherwise store shinfo in the new buffer.
- */
- if (rte_pktmbuf_tailroom(pkt) >= sizeof(*shinfo))
- shinfo = rte_pktmbuf_mtod(pkt,
- struct rte_mbuf_ext_shared_info *);
- else {
- total_len += sizeof(*shinfo) + sizeof(uintptr_t);
- total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
- }
-
- if (unlikely(total_len > UINT16_MAX))
- return -ENOSPC;
-
- buf_len = total_len;
- buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
- if (unlikely(buf == NULL))
- return -ENOMEM;
-
- /* Initialize shinfo */
- if (shinfo) {
- shinfo->free_cb = virtio_dev_extbuf_free;
- shinfo->fcb_opaque = buf;
- rte_mbuf_ext_refcnt_set(shinfo, 1);
- } else {
- shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
- virtio_dev_extbuf_free, buf);
- if (unlikely(shinfo == NULL)) {
- rte_free(buf);
- VHOST_LOG_DATA(ERR, "Failed to init shinfo\n");
- return -1;
- }
- }
-
- iova = rte_malloc_virt2iova(buf);
- rte_pktmbuf_attach_extbuf(pkt, buf, iova, buf_len, shinfo);
- rte_pktmbuf_reset_headroom(pkt);
-
- return 0;
-}
-
-/*
- * Allocate a host supported pktmbuf.
- */
-static __rte_always_inline struct rte_mbuf *
-virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
- uint32_t data_len)
-{
- struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
-
- if (unlikely(pkt == NULL)) {
- VHOST_LOG_DATA(ERR,
- "Failed to allocate memory for mbuf.\n");
- return NULL;
- }
-
- if (rte_pktmbuf_tailroom(pkt) >= data_len)
- return pkt;
-
- /* attach an external buffer if supported */
- if (dev->extbuf && !virtio_dev_extbuf_alloc(pkt, data_len))
- return pkt;
-
- /* check if chained buffers are allowed */
- if (!dev->linearbuf)
- return pkt;
-
- /* Data doesn't fit into the buffer and the host supports
- * only linear buffers
- */
- rte_pktmbuf_free(pkt);
-
- return NULL;
-}
-
static __rte_noinline uint16_t
virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
--
2.17.1
* [dpdk-dev] [PATCH v1 3/5] vhost: prepare memory regions addresses
2020-08-19 3:24 [dpdk-dev] [PATCH v1 0/5] vhost add vectorized data path Marvin Liu
2020-08-19 3:24 ` [dpdk-dev] [PATCH v1 1/5] vhost: " Marvin Liu
2020-08-19 3:24 ` [dpdk-dev] [PATCH v1 2/5] vhost: reuse packed ring functions Marvin Liu
@ 2020-08-19 3:24 ` Marvin Liu
2020-08-19 3:24 ` [dpdk-dev] [PATCH v1 4/5] vhost: add packed ring vectorized dequeue Marvin Liu
2020-08-19 3:24 ` [dpdk-dev] [PATCH v1 5/5] vhost: add packed ring vectorized enqueue Marvin Liu
4 siblings, 0 replies; 36+ messages in thread
From: Marvin Liu @ 2020-08-19 3:24 UTC (permalink / raw)
To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu
Prepare the guest physical addresses of the memory regions for the
vectorized data path. This information will be utilized by SIMD
instructions to find the matching region index.
Signed-off-by: Marvin Liu <yong.liu@intel.com>
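For reference, a scalar sketch of the lookup that these arrays enable
is shown below; the helper name is illustrative, and the AVX512 path in
the following patches performs the same comparisons on all regions at
once.

    /* Find the region that fully contains the descriptor buffer and
     * translate its guest physical address to a host virtual address.
     */
    static inline uintptr_t
    gpa_to_hva_scalar(struct virtio_net *dev, uint64_t addr, uint64_t len)
    {
        uint32_t i;

        for (i = 0; i < dev->mem->nregions; i++) {
            if (addr >= dev->regions_low_addrs[i] &&
                addr + len < dev->regions_high_addrs[i])
                return (uintptr_t)(addr +
                    dev->mem->regions[i].host_user_addr -
                    dev->mem->regions[i].guest_phys_addr);
        }

        return 0; /* no matching region */
    }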
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 5a5c945551..4a81f18f01 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -52,6 +52,8 @@
#define ASYNC_MAX_POLL_SEG 255
+#define MAX_NREGIONS 8
+
#define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST * 2)
#define VHOST_MAX_ASYNC_VEC (BUF_VECTOR_MAX * 2)
@@ -375,6 +377,8 @@ struct inflight_mem_info {
struct virtio_net {
/* Frontend (QEMU) memory and memory region information */
struct rte_vhost_memory *mem;
+ uint64_t regions_low_addrs[MAX_NREGIONS];
+ uint64_t regions_high_addrs[MAX_NREGIONS];
uint64_t features;
uint64_t protocol_features;
int vid;
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index c3c924faec..89e75e9e71 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -1291,6 +1291,17 @@ vhost_user_set_mem_table(struct virtio_net **pdev, struct VhostUserMsg *msg,
}
}
+ RTE_BUILD_BUG_ON(VHOST_MEMORY_MAX_NREGIONS != 8);
+ if (dev->vectorized) {
+ for (i = 0; i < memory->nregions; i++) {
+ dev->regions_low_addrs[i] =
+ memory->regions[i].guest_phys_addr;
+ dev->regions_high_addrs[i] =
+ memory->regions[i].guest_phys_addr +
+ memory->regions[i].memory_size;
+ }
+ }
+
for (i = 0; i < dev->nr_vring; i++) {
struct vhost_virtqueue *vq = dev->virtqueue[i];
--
2.17.1
* [dpdk-dev] [PATCH v1 4/5] vhost: add packed ring vectorized dequeue
2020-08-19 3:24 [dpdk-dev] [PATCH v1 0/5] vhost add vectorized data path Marvin Liu
` (2 preceding siblings ...)
2020-08-19 3:24 ` [dpdk-dev] [PATCH v1 3/5] vhost: prepare memory regions addresses Marvin Liu
@ 2020-08-19 3:24 ` Marvin Liu
2020-09-18 13:44 ` Maxime Coquelin
2020-08-19 3:24 ` [dpdk-dev] [PATCH v1 5/5] vhost: add packed ring vectorized enqueue Marvin Liu
4 siblings, 1 reply; 36+ messages in thread
From: Marvin Liu @ 2020-08-19 3:24 UTC (permalink / raw)
To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu
Optimize the vhost packed ring dequeue path with SIMD instructions.
The status check and writeback of four descriptors are batch handled
with AVX512 instructions. Address translation operations are also
accelerated by AVX512 instructions.
If the platform or compiler does not support vectorization, vhost
falls back to the default path.
Signed-off-by: Marvin Liu <yong.liu@intel.com>
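The batched flag check below replaces a per-descriptor loop; a scalar
sketch of the condition it evaluates is shown here for reference (the
helper name is illustrative and memory ordering barriers are omitted).

    /* All four descriptors of the batch must be available: the AVAIL
     * flag equals the ring's wrap counter and the USED flag differs
     * from it.
     */
    static inline int
    batch_is_avail_scalar(struct vhost_virtqueue *vq, uint16_t avail_idx)
    {
        uint16_t flags, i;

        for (i = 0; i < PACKED_BATCH_SIZE; i++) {
            flags = vq->desc_packed[avail_idx + i].flags;
            if (!!(flags & VRING_DESC_F_AVAIL) != vq->avail_wrap_counter ||
                !!(flags & VRING_DESC_F_USED) == vq->avail_wrap_counter)
                return 0;
        }

        return 1;
    }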
diff --git a/lib/librte_vhost/Makefile b/lib/librte_vhost/Makefile
index 4f2f3e47da..c0cd7d498f 100644
--- a/lib/librte_vhost/Makefile
+++ b/lib/librte_vhost/Makefile
@@ -31,6 +31,13 @@ CFLAGS += -DVHOST_ICC_UNROLL_PRAGMA
endif
endif
+ifneq ($(FORCE_DISABLE_AVX512), y)
+ CC_AVX512_SUPPORT=\
+ $(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \
+ sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \
+ grep -q AVX512 && echo 1)
+endif
+
ifeq ($(CONFIG_RTE_LIBRTE_VHOST_NUMA),y)
LDLIBS += -lnuma
endif
@@ -40,6 +47,12 @@ LDLIBS += -lrte_eal -lrte_mempool -lrte_mbuf -lrte_ethdev -lrte_net
SRCS-$(CONFIG_RTE_LIBRTE_VHOST) := fd_man.c iotlb.c socket.c vhost.c \
vhost_user.c virtio_net.c vdpa.c
+ifeq ($(CC_AVX512_SUPPORT), 1)
+CFLAGS += -DCC_AVX512_SUPPORT
+SRCS-$(CONFIG_RTE_LIBRTE_VHOST) += vhost_vec_avx.c
+CFLAGS_vhost_vec_avx.o += -mavx512f -mavx512bw -mavx512vl
+endif
+
# install includes
SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h rte_vdpa.h \
rte_vdpa_dev.h rte_vhost_async.h
diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
index cc9aa65c67..c1481802d7 100644
--- a/lib/librte_vhost/meson.build
+++ b/lib/librte_vhost/meson.build
@@ -8,6 +8,22 @@ endif
if has_libnuma == 1
dpdk_conf.set10('RTE_LIBRTE_VHOST_NUMA', true)
endif
+
+if arch_subdir == 'x86'
+ if not machine_args.contains('-mno-avx512f')
+ if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw')
+ cflags += ['-DCC_AVX512_SUPPORT']
+ vhost_avx512_lib = static_library('vhost_avx512_lib',
+ 'vhost_vec_avx.c',
+ dependencies: [static_rte_eal, static_rte_mempool,
+ static_rte_mbuf, static_rte_ethdev, static_rte_net],
+ include_directories: includes,
+ c_args: [cflags, '-mavx512f', '-mavx512bw', '-mavx512vl'])
+ objs += vhost_avx512_lib.extract_objects('vhost_vec_avx.c')
+ endif
+ endif
+endif
+
if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 4a81f18f01..fc7daf2145 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -1124,4 +1124,12 @@ virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
return NULL;
}
+int
+vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
+ struct vhost_virtqueue *vq,
+ struct rte_mempool *mbuf_pool,
+ struct rte_mbuf **pkts,
+ uint16_t avail_idx,
+ uintptr_t *desc_addrs,
+ uint16_t *ids);
#endif /* _VHOST_NET_CDEV_H_ */
diff --git a/lib/librte_vhost/vhost_vec_avx.c b/lib/librte_vhost/vhost_vec_avx.c
new file mode 100644
index 0000000000..e8361d18fa
--- /dev/null
+++ b/lib/librte_vhost/vhost_vec_avx.c
@@ -0,0 +1,152 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2016 Intel Corporation
+ */
+#include <stdint.h>
+
+#include "vhost.h"
+
+#define BYTE_SIZE 8
+/* reference count offset in mbuf rearm data */
+#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
+ offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
+/* segment number offset in mbuf rearm data */
+#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
+ offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
+
+/* default rearm data */
+#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
+ 1ULL << REFCNT_BITS_OFFSET)
+
+#define DESC_FLAGS_SHORT_OFFSET (offsetof(struct vring_packed_desc, flags) / \
+ sizeof(uint16_t))
+
+#define DESC_FLAGS_SHORT_SIZE (sizeof(struct vring_packed_desc) / \
+ sizeof(uint16_t))
+#define BATCH_FLAGS_MASK (1 << DESC_FLAGS_SHORT_OFFSET | \
+ 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE) | \
+ 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 2) | \
+ 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 3))
+
+#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
+ offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
+
+#define PACKED_FLAGS_MASK ((0ULL | VRING_DESC_F_AVAIL | VRING_DESC_F_USED) \
+ << FLAGS_BITS_OFFSET)
+#define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) << FLAGS_BITS_OFFSET)
+#define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
+ FLAGS_BITS_OFFSET)
+
+#define DESC_FLAGS_POS 0xaa
+#define MBUF_LENS_POS 0x6666
+
+int
+vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
+ struct vhost_virtqueue *vq,
+ struct rte_mempool *mbuf_pool,
+ struct rte_mbuf **pkts,
+ uint16_t avail_idx,
+ uintptr_t *desc_addrs,
+ uint16_t *ids)
+{
+ struct vring_packed_desc *descs = vq->desc_packed;
+ uint32_t descs_status;
+ void *desc_addr;
+ uint16_t i;
+ uint8_t cmp_low, cmp_high, cmp_result;
+ uint64_t lens[PACKED_BATCH_SIZE];
+
+ if (unlikely(avail_idx & PACKED_BATCH_MASK))
+ return -1;
+
+ /* load 4 descs */
+ desc_addr = &vq->desc_packed[avail_idx];
+ __m512i desc_vec = _mm512_loadu_si512(desc_addr);
+
+ /* burst check four status */
+ __m512i avail_flag_vec;
+ if (vq->avail_wrap_counter)
+#if defined(RTE_ARCH_I686)
+ avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG, 0x0,
+ PACKED_FLAGS_MASK, 0x0);
+#else
+ avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+ PACKED_AVAIL_FLAG);
+
+#endif
+ else
+#if defined(RTE_ARCH_I686)
+ avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG_WRAP,
+ 0x0, PACKED_AVAIL_FLAG_WRAP, 0x0);
+#else
+ avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+ PACKED_AVAIL_FLAG_WRAP);
+#endif
+
+ descs_status = _mm512_cmp_epu16_mask(desc_vec, avail_flag_vec,
+ _MM_CMPINT_NE);
+ if (descs_status & BATCH_FLAGS_MASK)
+ return -1;
+
+ /* check buffer fit into one region & translate address */
+ __m512i regions_low_addrs =
+ _mm512_loadu_si512((void *)&dev->regions_low_addrs);
+ __m512i regions_high_addrs =
+ _mm512_loadu_si512((void *)&dev->regions_high_addrs);
+ vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+ uint64_t addr_low = descs[avail_idx + i].addr;
+ uint64_t addr_high = addr_low + descs[avail_idx + i].len;
+ __m512i low_addr_vec = _mm512_set1_epi64(addr_low);
+ __m512i high_addr_vec = _mm512_set1_epi64(addr_high);
+
+ cmp_low = _mm512_cmp_epi64_mask(low_addr_vec,
+ regions_low_addrs, _MM_CMPINT_NLT);
+ cmp_high = _mm512_cmp_epi64_mask(high_addr_vec,
+ regions_high_addrs, _MM_CMPINT_LT);
+ cmp_result = cmp_low & cmp_high;
+ int index = __builtin_ctz(cmp_result);
+ if (unlikely((uint32_t)index >= dev->mem->nregions))
+ goto free_buf;
+
+ desc_addrs[i] = addr_low +
+ dev->mem->regions[index].host_user_addr -
+ dev->mem->regions[index].guest_phys_addr;
+ lens[i] = descs[avail_idx + i].len;
+ rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
+
+ pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool, lens[i]);
+ if (!pkts[i])
+ goto free_buf;
+ }
+
+ if (unlikely(virtio_net_is_inorder(dev))) {
+ ids[PACKED_BATCH_SIZE - 1] =
+ descs[avail_idx + PACKED_BATCH_SIZE - 1].id;
+ } else {
+ vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
+ ids[i] = descs[avail_idx + i].id;
+ }
+
+ uint64_t addrs[PACKED_BATCH_SIZE << 1];
+ /* store mbuf data_len, pkt_len */
+ vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+ addrs[i << 1] = (uint64_t)pkts[i]->rx_descriptor_fields1;
+ addrs[(i << 1) + 1] = (uint64_t)pkts[i]->rx_descriptor_fields1
+ + sizeof(uint64_t);
+ }
+
+ /* save pkt_len and data_len into mbufs */
+ __m512i value_vec = _mm512_maskz_shuffle_epi32(MBUF_LENS_POS, desc_vec,
+ 0xAA);
+ __m512i offsets_vec = _mm512_maskz_set1_epi32(MBUF_LENS_POS,
+ (uint32_t)-12);
+ value_vec = _mm512_add_epi32(value_vec, offsets_vec);
+ __m512i vindex = _mm512_loadu_si512((void *)addrs);
+ _mm512_i64scatter_epi64(0, vindex, value_vec, 1);
+
+ return 0;
+free_buf:
+ for (i = 0; i < PACKED_BATCH_SIZE; i++)
+ rte_pktmbuf_free(pkts[i]);
+
+ return -1;
+}
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 6107662685..e4d2e2e7d6 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -2249,6 +2249,28 @@ vhost_reserve_avail_batch_packed(struct virtio_net *dev,
return -1;
}
+static __rte_always_inline int
+vhost_handle_avail_batch_packed(struct virtio_net *dev,
+ struct vhost_virtqueue *vq,
+ struct rte_mempool *mbuf_pool,
+ struct rte_mbuf **pkts,
+ uint16_t avail_idx,
+ uintptr_t *desc_addrs,
+ uint16_t *ids)
+{
+ if (unlikely(dev->vectorized))
+#ifdef CC_AVX512_SUPPORT
+ return vhost_reserve_avail_batch_packed_avx(dev, vq, mbuf_pool,
+ pkts, avail_idx, desc_addrs, ids);
+#else
+ return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool,
+ pkts, avail_idx, desc_addrs, ids);
+
+#endif
+ return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
+ avail_idx, desc_addrs, ids);
+}
+
static __rte_always_inline int
virtio_dev_tx_batch_packed(struct virtio_net *dev,
struct vhost_virtqueue *vq,
@@ -2261,8 +2283,9 @@ virtio_dev_tx_batch_packed(struct virtio_net *dev,
uint16_t ids[PACKED_BATCH_SIZE];
uint16_t i;
- if (vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
- avail_idx, desc_addrs, ids))
+
+ if (vhost_handle_avail_batch_packed(dev, vq, mbuf_pool, pkts,
+ avail_idx, desc_addrs, ids))
return -1;
vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
--
2.17.1
* [dpdk-dev] [PATCH v1 5/5] vhost: add packed ring vectorized enqueue
2020-08-19 3:24 [dpdk-dev] [PATCH v1 0/5] vhost add vectorized data path Marvin Liu
` (3 preceding siblings ...)
2020-08-19 3:24 ` [dpdk-dev] [PATCH v1 4/5] vhost: add packed ring vectorized dequeue Marvin Liu
@ 2020-08-19 3:24 ` Marvin Liu
4 siblings, 0 replies; 36+ messages in thread
From: Marvin Liu @ 2020-08-19 3:24 UTC (permalink / raw)
To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu
Optimize the vhost packed ring enqueue path with SIMD instructions.
The status and length of four descriptors are batch handled with
AVX512 instructions. Address translation operations are also
accelerated by AVX512 instructions.
Signed-off-by: Marvin Liu <yong.liu@intel.com>
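A scalar sketch of the per-batch preconditions that the AVX512 code
below verifies in one shot is given here for reference; the helper name
is illustrative, and descriptor availability is checked the same way as
on the dequeue side.

    /* Enqueue the batch only if every mbuf is a fresh single-segment
     * buffer (refcnt 1, nb_segs 1) and every descriptor has room for
     * the packet data plus the virtio-net header.
     */
    static inline int
    rx_batch_fits_scalar(struct virtio_net *dev, struct vhost_virtqueue *vq,
            struct rte_mbuf **pkts, uint16_t avail_idx)
    {
        uint16_t i;

        for (i = 0; i < PACKED_BATCH_SIZE; i++) {
            if (pkts[i]->nb_segs != 1 ||
                rte_mbuf_refcnt_read(pkts[i]) != 1)
                return 0;
            if (vq->desc_packed[avail_idx + i].len <
                pkts[i]->pkt_len + dev->vhost_hlen)
                return 0;
        }

        return 1;
    }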
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index fc7daf2145..b78b2c5c1b 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -1132,4 +1132,10 @@ vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
uint16_t avail_idx,
uintptr_t *desc_addrs,
uint16_t *ids);
+
+int
+virtio_dev_rx_batch_packed_avx(struct virtio_net *dev,
+ struct vhost_virtqueue *vq,
+ struct rte_mbuf **pkts);
+
#endif /* _VHOST_NET_CDEV_H_ */
diff --git a/lib/librte_vhost/vhost_vec_avx.c b/lib/librte_vhost/vhost_vec_avx.c
index e8361d18fa..12b902253a 100644
--- a/lib/librte_vhost/vhost_vec_avx.c
+++ b/lib/librte_vhost/vhost_vec_avx.c
@@ -35,9 +35,15 @@
#define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) << FLAGS_BITS_OFFSET)
#define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
FLAGS_BITS_OFFSET)
+#define PACKED_WRITE_AVAIL_FLAG (PACKED_AVAIL_FLAG | \
+ ((0ULL | VRING_DESC_F_WRITE) << FLAGS_BITS_OFFSET))
+#define PACKED_WRITE_AVAIL_FLAG_WRAP (PACKED_AVAIL_FLAG_WRAP | \
+ ((0ULL | VRING_DESC_F_WRITE) << FLAGS_BITS_OFFSET))
#define DESC_FLAGS_POS 0xaa
#define MBUF_LENS_POS 0x6666
+#define DESC_LENS_POS 0x4444
+#define DESC_LENS_FLAGS_POS 0xB0B0B0B0
int
vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
@@ -150,3 +156,137 @@ vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
return -1;
}
+
+int
+virtio_dev_rx_batch_packed_avx(struct virtio_net *dev,
+ struct vhost_virtqueue *vq,
+ struct rte_mbuf **pkts)
+{
+ struct vring_packed_desc *descs = vq->desc_packed;
+ uint16_t avail_idx = vq->last_avail_idx;
+ uint64_t desc_addrs[PACKED_BATCH_SIZE];
+ uint32_t buf_offset = dev->vhost_hlen;
+ uint32_t desc_status;
+ uint64_t lens[PACKED_BATCH_SIZE];
+ uint16_t i;
+ void *desc_addr;
+ uint8_t cmp_low, cmp_high, cmp_result;
+
+ if (unlikely(avail_idx & PACKED_BATCH_MASK))
+ return -1;
+
+ /* check refcnt and nb_segs */
+ __m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA);
+
+ /* load four mbufs rearm data */
+ __m256i mbufs = _mm256_set_epi64x(
+ *pkts[3]->rearm_data,
+ *pkts[2]->rearm_data,
+ *pkts[1]->rearm_data,
+ *pkts[0]->rearm_data);
+
+ uint16_t cmp = _mm256_cmpneq_epu16_mask(mbufs, mbuf_ref);
+ if (cmp & MBUF_LENS_POS)
+ return -1;
+
+ /* check desc status */
+ desc_addr = &vq->desc_packed[avail_idx];
+ __m512i desc_vec = _mm512_loadu_si512(desc_addr);
+
+ __m512i avail_flag_vec;
+ __m512i used_flag_vec;
+ if (vq->avail_wrap_counter) {
+#if defined(RTE_ARCH_I686)
+ avail_flag_vec = _mm512_set4_epi64(PACKED_WRITE_AVAIL_FLAG,
+ 0x0, PACKED_WRITE_AVAIL_FLAG, 0x0);
+ used_flag_vec = _mm512_set4_epi64(PACKED_FLAGS_MASK, 0x0,
+ PACKED_FLAGS_MASK, 0x0);
+#else
+ avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+ PACKED_WRITE_AVAIL_FLAG);
+ used_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+ PACKED_FLAGS_MASK);
+#endif
+ } else {
+#if defined(RTE_ARCH_I686)
+ avail_flag_vec = _mm512_set4_epi64(
+ PACKED_WRITE_AVAIL_FLAG_WRAP, 0x0,
+ PACKED_WRITE_AVAIL_FLAG, 0x0);
+ used_flag_vec = _mm512_set4_epi64(0x0, 0x0, 0x0, 0x0);
+#else
+ avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+ PACKED_WRITE_AVAIL_FLAG_WRAP);
+ used_flag_vec = _mm512_setzero_epi32();
+#endif
+ }
+
+ desc_status = _mm512_mask_cmp_epu16_mask(BATCH_FLAGS_MASK, desc_vec,
+ avail_flag_vec, _MM_CMPINT_NE);
+ if (desc_status)
+ return -1;
+
+ /* check buffer fit into one region & translate address */
+ __m512i regions_low_addrs =
+ _mm512_loadu_si512((void *)&dev->regions_low_addrs);
+ __m512i regions_high_addrs =
+ _mm512_loadu_si512((void *)&dev->regions_high_addrs);
+ vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+ uint64_t addr_low = descs[avail_idx + i].addr;
+ uint64_t addr_high = addr_low + descs[avail_idx + i].len;
+ __m512i low_addr_vec = _mm512_set1_epi64(addr_low);
+ __m512i high_addr_vec = _mm512_set1_epi64(addr_high);
+
+ cmp_low = _mm512_cmp_epi64_mask(low_addr_vec,
+ regions_low_addrs, _MM_CMPINT_NLT);
+ cmp_high = _mm512_cmp_epi64_mask(high_addr_vec,
+ regions_high_addrs, _MM_CMPINT_LT);
+ cmp_result = cmp_low & cmp_high;
+ int index = __builtin_ctz(cmp_result);
+ if (unlikely((uint32_t)index >= dev->mem->nregions))
+ return -1;
+
+ desc_addrs[i] = addr_low +
+ dev->mem->regions[index].host_user_addr -
+ dev->mem->regions[index].guest_phys_addr;
+ rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[i], void *, 0));
+ }
+
+ /* check length is enough */
+ __m512i pkt_lens = _mm512_set_epi32(
+ 0, pkts[3]->pkt_len, 0, 0,
+ 0, pkts[2]->pkt_len, 0, 0,
+ 0, pkts[1]->pkt_len, 0, 0,
+ 0, pkts[0]->pkt_len, 0, 0);
+
+ __m512i mbuf_len_offset = _mm512_maskz_set1_epi32(DESC_LENS_POS,
+ dev->vhost_hlen);
+ __m512i buf_len_vec = _mm512_add_epi32(pkt_lens, mbuf_len_offset);
+ uint16_t lens_cmp = _mm512_mask_cmp_epu32_mask(DESC_LENS_POS,
+ desc_vec, buf_len_vec, _MM_CMPINT_LT);
+ if (lens_cmp)
+ return -1;
+
+ vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+ rte_memcpy((void *)(uintptr_t)(desc_addrs[i] + buf_offset),
+ rte_pktmbuf_mtod_offset(pkts[i], void *, 0),
+ pkts[i]->pkt_len);
+ }
+
+ if (unlikely((dev->features & (1ULL << VHOST_F_LOG_ALL)))) {
+ vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+ lens[i] = descs[avail_idx + i].len;
+ vhost_log_cache_write_iova(dev, vq,
+ descs[avail_idx + i].addr, lens[i]);
+ }
+ }
+
+ vq_inc_last_avail_packed(vq, PACKED_BATCH_SIZE);
+ vq_inc_last_used_packed(vq, PACKED_BATCH_SIZE);
+ /* save len and flags, skip addr and id */
+ __m512i desc_updated = _mm512_mask_add_epi16(desc_vec,
+ DESC_LENS_FLAGS_POS, buf_len_vec,
+ used_flag_vec);
+ _mm512_storeu_si512(desc_addr, desc_updated);
+
+ return 0;
+}
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index e4d2e2e7d6..5c56a8d6ff 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -1354,6 +1354,21 @@ virtio_dev_rx_single_packed(struct virtio_net *dev,
return 0;
}
+static __rte_always_inline int
+virtio_dev_rx_handle_batch_packed(struct virtio_net *dev,
+ struct vhost_virtqueue *vq,
+ struct rte_mbuf **pkts)
+
+{
+ if (unlikely(dev->vectorized))
+#ifdef CC_AVX512_SUPPORT
+ return virtio_dev_rx_batch_packed_avx(dev, vq, pkts);
+#else
+ return virtio_dev_rx_batch_packed(dev, vq, pkts);
+#endif
+ return virtio_dev_rx_batch_packed(dev, vq, pkts);
+}
+
static __rte_noinline uint32_t
virtio_dev_rx_packed(struct virtio_net *dev,
struct vhost_virtqueue *__rte_restrict vq,
@@ -1367,8 +1382,8 @@ virtio_dev_rx_packed(struct virtio_net *dev,
rte_prefetch0(&vq->desc_packed[vq->last_avail_idx]);
if (remained >= PACKED_BATCH_SIZE) {
- if (!virtio_dev_rx_batch_packed(dev, vq,
- &pkts[pkt_idx])) {
+ if (!virtio_dev_rx_handle_batch_packed(dev, vq,
+ &pkts[pkt_idx])) {
pkt_idx += PACKED_BATCH_SIZE;
remained -= PACKED_BATCH_SIZE;
continue;
--
2.17.1
* Re: [dpdk-dev] [PATCH v1 4/5] vhost: add packed ring vectorized dequeue
2020-08-19 3:24 ` [dpdk-dev] [PATCH v1 4/5] vhost: add packed ring vectorized dequeue Marvin Liu
@ 2020-09-18 13:44 ` Maxime Coquelin
2020-09-21 6:26 ` Liu, Yong
0 siblings, 1 reply; 36+ messages in thread
From: Maxime Coquelin @ 2020-09-18 13:44 UTC (permalink / raw)
To: Marvin Liu, chenbo.xia, zhihong.wang; +Cc: dev
On 8/19/20 5:24 AM, Marvin Liu wrote:
> Optimize vhost packed ring dequeue path with SIMD instructions. Four
> descriptors status check and writeback are batched handled with AVX512
> instructions. Address translation operations are also accelerated by
> AVX512 instructions.
>
> If platform or compiler not support vectorization, will fallback to
> default path.
>
> Signed-off-by: Marvin Liu <yong.liu@intel.com>
>
> diff --git a/lib/librte_vhost/Makefile b/lib/librte_vhost/Makefile
> index 4f2f3e47da..c0cd7d498f 100644
> --- a/lib/librte_vhost/Makefile
> +++ b/lib/librte_vhost/Makefile
> @@ -31,6 +31,13 @@ CFLAGS += -DVHOST_ICC_UNROLL_PRAGMA
> endif
> endif
>
> +ifneq ($(FORCE_DISABLE_AVX512), y)
> + CC_AVX512_SUPPORT=\
> + $(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \
> + sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \
> + grep -q AVX512 && echo 1)
> +endif
> +
> ifeq ($(CONFIG_RTE_LIBRTE_VHOST_NUMA),y)
> LDLIBS += -lnuma
> endif
> @@ -40,6 +47,12 @@ LDLIBS += -lrte_eal -lrte_mempool -lrte_mbuf -lrte_ethdev -lrte_net
> SRCS-$(CONFIG_RTE_LIBRTE_VHOST) := fd_man.c iotlb.c socket.c vhost.c \
> vhost_user.c virtio_net.c vdpa.c
>
> +ifeq ($(CC_AVX512_SUPPORT), 1)
> +CFLAGS += -DCC_AVX512_SUPPORT
> +SRCS-$(CONFIG_RTE_LIBRTE_VHOST) += vhost_vec_avx.c
> +CFLAGS_vhost_vec_avx.o += -mavx512f -mavx512bw -mavx512vl
> +endif
> +
> # install includes
> SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h rte_vdpa.h \
> rte_vdpa_dev.h rte_vhost_async.h
> diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
> index cc9aa65c67..c1481802d7 100644
> --- a/lib/librte_vhost/meson.build
> +++ b/lib/librte_vhost/meson.build
> @@ -8,6 +8,22 @@ endif
> if has_libnuma == 1
> dpdk_conf.set10('RTE_LIBRTE_VHOST_NUMA', true)
> endif
> +
> +if arch_subdir == 'x86'
> + if not machine_args.contains('-mno-avx512f')
> + if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw')
> + cflags += ['-DCC_AVX512_SUPPORT']
> + vhost_avx512_lib = static_library('vhost_avx512_lib',
> + 'vhost_vec_avx.c',
> + dependencies: [static_rte_eal, static_rte_mempool,
> + static_rte_mbuf, static_rte_ethdev, static_rte_net],
> + include_directories: includes,
> + c_args: [cflags, '-mavx512f', '-mavx512bw', '-mavx512vl'])
> + objs += vhost_avx512_lib.extract_objects('vhost_vec_avx.c')
> + endif
> + endif
> +endif
> +
> if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
> cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
> elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
> diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> index 4a81f18f01..fc7daf2145 100644
> --- a/lib/librte_vhost/vhost.h
> +++ b/lib/librte_vhost/vhost.h
> @@ -1124,4 +1124,12 @@ virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
> return NULL;
> }
>
> +int
> +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> + struct vhost_virtqueue *vq,
> + struct rte_mempool *mbuf_pool,
> + struct rte_mbuf **pkts,
> + uint16_t avail_idx,
> + uintptr_t *desc_addrs,
> + uint16_t *ids);
> #endif /* _VHOST_NET_CDEV_H_ */
> diff --git a/lib/librte_vhost/vhost_vec_avx.c b/lib/librte_vhost/vhost_vec_avx.c
> new file mode 100644
> index 0000000000..e8361d18fa
> --- /dev/null
> +++ b/lib/librte_vhost/vhost_vec_avx.c
> @@ -0,0 +1,152 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2016 Intel Corporation
> + */
> +#include <stdint.h>
> +
> +#include "vhost.h"
> +
> +#define BYTE_SIZE 8
> +/* reference count offset in mbuf rearm data */
> +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
> + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> +/* segment number offset in mbuf rearm data */
> +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
> + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> +
> +/* default rearm data */
> +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
> + 1ULL << REFCNT_BITS_OFFSET)
> +
> +#define DESC_FLAGS_SHORT_OFFSET (offsetof(struct vring_packed_desc, flags) / \
> + sizeof(uint16_t))
> +
> +#define DESC_FLAGS_SHORT_SIZE (sizeof(struct vring_packed_desc) / \
> + sizeof(uint16_t))
> +#define BATCH_FLAGS_MASK (1 << DESC_FLAGS_SHORT_OFFSET | \
> + 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE) | \
> + 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 2) | \
> + 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 3))
> +
> +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
> + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
> +
> +#define PACKED_FLAGS_MASK ((0ULL | VRING_DESC_F_AVAIL | VRING_DESC_F_USED) \
> + << FLAGS_BITS_OFFSET)
> +#define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) << FLAGS_BITS_OFFSET)
> +#define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
> + FLAGS_BITS_OFFSET)
> +
> +#define DESC_FLAGS_POS 0xaa
> +#define MBUF_LENS_POS 0x6666
> +
> +int
> +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> + struct vhost_virtqueue *vq,
> + struct rte_mempool *mbuf_pool,
> + struct rte_mbuf **pkts,
> + uint16_t avail_idx,
> + uintptr_t *desc_addrs,
> + uint16_t *ids)
> +{
> + struct vring_packed_desc *descs = vq->desc_packed;
> + uint32_t descs_status;
> + void *desc_addr;
> + uint16_t i;
> + uint8_t cmp_low, cmp_high, cmp_result;
> + uint64_t lens[PACKED_BATCH_SIZE];
> +
> + if (unlikely(avail_idx & PACKED_BATCH_MASK))
> + return -1;
> +
> + /* load 4 descs */
> + desc_addr = &vq->desc_packed[avail_idx];
> + __m512i desc_vec = _mm512_loadu_si512(desc_addr);
> +
> + /* burst check four status */
> + __m512i avail_flag_vec;
> + if (vq->avail_wrap_counter)
> +#if defined(RTE_ARCH_I686)
> + avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG, 0x0,
> + PACKED_FLAGS_MASK, 0x0);
> +#else
> + avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> + PACKED_AVAIL_FLAG);
> +
> +#endif
> + else
> +#if defined(RTE_ARCH_I686)
> + avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG_WRAP,
> + 0x0, PACKED_AVAIL_FLAG_WRAP, 0x0);
> +#else
> + avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> + PACKED_AVAIL_FLAG_WRAP);
> +#endif
> +
> + descs_status = _mm512_cmp_epu16_mask(desc_vec, avail_flag_vec,
> + _MM_CMPINT_NE);
> + if (descs_status & BATCH_FLAGS_MASK)
> + return -1;
> +
> + /* check buffer fit into one region & translate address */
> + __m512i regions_low_addrs =
> + _mm512_loadu_si512((void *)&dev->regions_low_addrs);
> + __m512i regions_high_addrs =
> + _mm512_loadu_si512((void *)&dev->regions_high_addrs);
> + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> + uint64_t addr_low = descs[avail_idx + i].addr;
> + uint64_t addr_high = addr_low + descs[avail_idx + i].len;
> + __m512i low_addr_vec = _mm512_set1_epi64(addr_low);
> + __m512i high_addr_vec = _mm512_set1_epi64(addr_high);
> +
> + cmp_low = _mm512_cmp_epi64_mask(low_addr_vec,
> + regions_low_addrs, _MM_CMPINT_NLT);
> + cmp_high = _mm512_cmp_epi64_mask(high_addr_vec,
> + regions_high_addrs, _MM_CMPINT_LT);
> + cmp_result = cmp_low & cmp_high;
> + int index = __builtin_ctz(cmp_result);
> + if (unlikely((uint32_t)index >= dev->mem->nregions))
> + goto free_buf;
> +
> + desc_addrs[i] = addr_low +
> + dev->mem->regions[index].host_user_addr -
> + dev->mem->regions[index].guest_phys_addr;
> + lens[i] = descs[avail_idx + i].len;
> + rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> +
> + pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool, lens[i]);
> + if (!pkts[i])
> + goto free_buf;
> + }
The above does not support vIOMMU, does it?
The more the packed datapath evolves, the more it gets optimized for a
very specific configuration.
In v19.11, indirect descriptors and chained buffers are handled as a
fallback. And now vIOMMU support is handled as a fallback.
I personally don't like the path it is taking, as it is adding a lot of
complexity on top of that.
* Re: [dpdk-dev] [PATCH v1 4/5] vhost: add packed ring vectorized dequeue
2020-09-18 13:44 ` Maxime Coquelin
@ 2020-09-21 6:26 ` Liu, Yong
2020-09-21 7:47 ` Liu, Yong
0 siblings, 1 reply; 36+ messages in thread
From: Liu, Yong @ 2020-09-21 6:26 UTC (permalink / raw)
To: Maxime Coquelin, Xia, Chenbo, Wang, Zhihong; +Cc: dev
> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Friday, September 18, 2020 9:45 PM
> To: Liu, Yong <yong.liu@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v1 4/5] vhost: add packed ring vectorized dequeue
>
>
>
> On 8/19/20 5:24 AM, Marvin Liu wrote:
> > Optimize vhost packed ring dequeue path with SIMD instructions. Four
> > descriptors status check and writeback are batched handled with AVX512
> > instructions. Address translation operations are also accelerated by
> > AVX512 instructions.
> >
> > If platform or compiler not support vectorization, will fallback to
> > default path.
> >
> > Signed-off-by: Marvin Liu <yong.liu@intel.com>
> >
> > diff --git a/lib/librte_vhost/Makefile b/lib/librte_vhost/Makefile
> > index 4f2f3e47da..c0cd7d498f 100644
> > --- a/lib/librte_vhost/Makefile
> > +++ b/lib/librte_vhost/Makefile
> > @@ -31,6 +31,13 @@ CFLAGS += -DVHOST_ICC_UNROLL_PRAGMA
> > endif
> > endif
> >
> > +ifneq ($(FORCE_DISABLE_AVX512), y)
> > + CC_AVX512_SUPPORT=\
> > + $(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \
> > + sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' | \
> > + grep -q AVX512 && echo 1)
> > +endif
> > +
> > ifeq ($(CONFIG_RTE_LIBRTE_VHOST_NUMA),y)
> > LDLIBS += -lnuma
> > endif
> > @@ -40,6 +47,12 @@ LDLIBS += -lrte_eal -lrte_mempool -lrte_mbuf -
> lrte_ethdev -lrte_net
> > SRCS-$(CONFIG_RTE_LIBRTE_VHOST) := fd_man.c iotlb.c socket.c vhost.c \
> > vhost_user.c virtio_net.c vdpa.c
> >
> > +ifeq ($(CC_AVX512_SUPPORT), 1)
> > +CFLAGS += -DCC_AVX512_SUPPORT
> > +SRCS-$(CONFIG_RTE_LIBRTE_VHOST) += vhost_vec_avx.c
> > +CFLAGS_vhost_vec_avx.o += -mavx512f -mavx512bw -mavx512vl
> > +endif
> > +
> > # install includes
> > SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h
> rte_vdpa.h \
> > rte_vdpa_dev.h
> rte_vhost_async.h
> > diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
> > index cc9aa65c67..c1481802d7 100644
> > --- a/lib/librte_vhost/meson.build
> > +++ b/lib/librte_vhost/meson.build
> > @@ -8,6 +8,22 @@ endif
> > if has_libnuma == 1
> > dpdk_conf.set10('RTE_LIBRTE_VHOST_NUMA', true)
> > endif
> > +
> > +if arch_subdir == 'x86'
> > + if not machine_args.contains('-mno-avx512f')
> > + if cc.has_argument('-mavx512f') and cc.has_argument('-
> mavx512vl') and cc.has_argument('-mavx512bw')
> > + cflags += ['-DCC_AVX512_SUPPORT']
> > + vhost_avx512_lib = static_library('vhost_avx512_lib',
> > + 'vhost_vec_avx.c',
> > + dependencies: [static_rte_eal,
> static_rte_mempool,
> > + static_rte_mbuf, static_rte_ethdev,
> static_rte_net],
> > + include_directories: includes,
> > + c_args: [cflags, '-mavx512f', '-mavx512bw', '-
> mavx512vl'])
> > + objs += vhost_avx512_lib.extract_objects('vhost_vec_avx.c')
> > + endif
> > + endif
> > +endif
> > +
> > if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
> > cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
> > elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
> > diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> > index 4a81f18f01..fc7daf2145 100644
> > --- a/lib/librte_vhost/vhost.h
> > +++ b/lib/librte_vhost/vhost.h
> > @@ -1124,4 +1124,12 @@ virtio_dev_pktmbuf_alloc(struct virtio_net
> *dev, struct rte_mempool *mp,
> > return NULL;
> > }
> >
> > +int
> > +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> > + struct vhost_virtqueue *vq,
> > + struct rte_mempool *mbuf_pool,
> > + struct rte_mbuf **pkts,
> > + uint16_t avail_idx,
> > + uintptr_t *desc_addrs,
> > + uint16_t *ids);
> > #endif /* _VHOST_NET_CDEV_H_ */
> > diff --git a/lib/librte_vhost/vhost_vec_avx.c
> b/lib/librte_vhost/vhost_vec_avx.c
> > new file mode 100644
> > index 0000000000..e8361d18fa
> > --- /dev/null
> > +++ b/lib/librte_vhost/vhost_vec_avx.c
> > @@ -0,0 +1,152 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2010-2016 Intel Corporation
> > + */
> > +#include <stdint.h>
> > +
> > +#include "vhost.h"
> > +
> > +#define BYTE_SIZE 8
> > +/* reference count offset in mbuf rearm data */
> > +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
> > + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> > +/* segment number offset in mbuf rearm data */
> > +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
> > + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> > +
> > +/* default rearm data */
> > +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
> > + 1ULL << REFCNT_BITS_OFFSET)
> > +
> > +#define DESC_FLAGS_SHORT_OFFSET (offsetof(struct vring_packed_desc,
> flags) / \
> > + sizeof(uint16_t))
> > +
> > +#define DESC_FLAGS_SHORT_SIZE (sizeof(struct vring_packed_desc) / \
> > + sizeof(uint16_t))
> > +#define BATCH_FLAGS_MASK (1 << DESC_FLAGS_SHORT_OFFSET | \
> > + 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE) | \
> > + 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 2) |
> \
> > + 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 3))
> > +
> > +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
> > + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
> > +
> > +#define PACKED_FLAGS_MASK ((0ULL | VRING_DESC_F_AVAIL |
> VRING_DESC_F_USED) \
> > + << FLAGS_BITS_OFFSET)
> > +#define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) <<
> FLAGS_BITS_OFFSET)
> > +#define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
> > + FLAGS_BITS_OFFSET)
> > +
> > +#define DESC_FLAGS_POS 0xaa
> > +#define MBUF_LENS_POS 0x6666
> > +
> > +int
> > +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> > + struct vhost_virtqueue *vq,
> > + struct rte_mempool *mbuf_pool,
> > + struct rte_mbuf **pkts,
> > + uint16_t avail_idx,
> > + uintptr_t *desc_addrs,
> > + uint16_t *ids)
> > +{
> > + struct vring_packed_desc *descs = vq->desc_packed;
> > + uint32_t descs_status;
> > + void *desc_addr;
> > + uint16_t i;
> > + uint8_t cmp_low, cmp_high, cmp_result;
> > + uint64_t lens[PACKED_BATCH_SIZE];
> > +
> > + if (unlikely(avail_idx & PACKED_BATCH_MASK))
> > + return -1;
> > +
> > + /* load 4 descs */
> > + desc_addr = &vq->desc_packed[avail_idx];
> > + __m512i desc_vec = _mm512_loadu_si512(desc_addr);
> > +
> > + /* burst check four status */
> > + __m512i avail_flag_vec;
> > + if (vq->avail_wrap_counter)
> > +#if defined(RTE_ARCH_I686)
> > + avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG,
> 0x0,
> > + PACKED_FLAGS_MASK, 0x0);
> > +#else
> > + avail_flag_vec =
> _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> > + PACKED_AVAIL_FLAG);
> > +
> > +#endif
> > + else
> > +#if defined(RTE_ARCH_I686)
> > + avail_flag_vec =
> _mm512_set4_epi64(PACKED_AVAIL_FLAG_WRAP,
> > + 0x0, PACKED_AVAIL_FLAG_WRAP,
> 0x0);
> > +#else
> > + avail_flag_vec =
> _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> > + PACKED_AVAIL_FLAG_WRAP);
> > +#endif
> > +
> > + descs_status = _mm512_cmp_epu16_mask(desc_vec, avail_flag_vec,
> > + _MM_CMPINT_NE);
> > + if (descs_status & BATCH_FLAGS_MASK)
> > + return -1;
> > +
> > + /* check buffer fit into one region & translate address */
> > + __m512i regions_low_addrs =
> > + _mm512_loadu_si512((void *)&dev->regions_low_addrs);
> > + __m512i regions_high_addrs =
> > + _mm512_loadu_si512((void *)&dev->regions_high_addrs);
> > + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > + uint64_t addr_low = descs[avail_idx + i].addr;
> > + uint64_t addr_high = addr_low + descs[avail_idx + i].len;
> > + __m512i low_addr_vec = _mm512_set1_epi64(addr_low);
> > + __m512i high_addr_vec = _mm512_set1_epi64(addr_high);
> > +
> > + cmp_low = _mm512_cmp_epi64_mask(low_addr_vec,
> > + regions_low_addrs, _MM_CMPINT_NLT);
> > + cmp_high = _mm512_cmp_epi64_mask(high_addr_vec,
> > + regions_high_addrs, _MM_CMPINT_LT);
> > + cmp_result = cmp_low & cmp_high;
> > + int index = __builtin_ctz(cmp_result);
> > + if (unlikely((uint32_t)index >= dev->mem->nregions))
> > + goto free_buf;
> > +
> > + desc_addrs[i] = addr_low +
> > + dev->mem->regions[index].host_user_addr -
> > + dev->mem->regions[index].guest_phys_addr;
> > + lens[i] = descs[avail_idx + i].len;
> > + rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> > +
> > + pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool, lens[i]);
> > + if (!pkts[i])
> > + goto free_buf;
> > + }
>
> The above does not support vIOMMU, isn't it?
>
> The more the packed datapath evolves, the more it gets optimized for a
> very specific configuration.
>
> In v19.11, indirect descriptors and chained buffers are handled as a
> fallback. And now vIOMMU support is handled as a fallback.
>
Hi Maxime,
Thanks for pointing out the missing feature. The first version of the patch set lacks vIOMMU support.
The v2 patch set will close the feature gap between the vectorized functions and the original batch functions.
So no additional fallback is introduced by the vectorized patch set.
IMHO, the complexity introduced by the current packed ring optimization is there to handle the gap between performance-oriented frontends (like a PMD) and normal network traffic (like TCP).
The vectorized datapath focuses on enhancing the performance of the batched functions. From a functional point of view, there is no difference between the vectorized batched functions and the original batched functions.
The current packed ring path remains the same if the vectorized option is not enabled, so I think the complexity won't increase too much. If there is any concern, please let me know.
BTW, the vectorized path can help performance a lot when vIOMMU is enabled.
Regards,
Marvin
> I personally don't like the path it is taking as it is adding a lot of
> complexity on top of that.
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* [dpdk-dev] [PATCH v2 0/5] vhost add vectorized data path
2020-08-19 3:24 ` [dpdk-dev] [PATCH v1 1/5] vhost: " Marvin Liu
@ 2020-09-21 6:48 ` Marvin Liu
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 1/5] vhost: " Marvin Liu
` (5 more replies)
2020-10-09 8:14 ` [dpdk-dev] [PATCH v3 " Marvin Liu
1 sibling, 6 replies; 36+ messages in thread
From: Marvin Liu @ 2020-09-21 6:48 UTC (permalink / raw)
To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu
The packed ring format was introduced in virtio spec 1.1. All descriptors
are compacted into one single ring when the packed ring format is on. It is
straightforward that ring operations can be accelerated by utilizing
SIMD instructions.
This patch set introduces a vectorized data path in the vhost library. If
the vectorized option is on, operations like descriptor check, descriptor
writeback and address translation are accelerated by SIMD instructions. The
vhost application can choose whether to use vectorized acceleration, just
like the external buffer and zero copy features.
If the platform or ring format does not support the vectorized functions,
vhost falls back to the default batch functions. There is no impact on the
current data path.
v2:
* add vIOMMU support
* add dequeue offloading
* rebase code
Marvin Liu (5):
vhost: add vectorized data path
vhost: reuse packed ring functions
vhost: prepare memory regions addresses
vhost: add packed ring vectorized dequeue
vhost: add packed ring vectorized enqueue
doc/guides/nics/vhost.rst | 5 +
doc/guides/prog_guide/vhost_lib.rst | 12 +
drivers/net/vhost/rte_eth_vhost.c | 17 +-
lib/librte_vhost/meson.build | 16 ++
lib/librte_vhost/rte_vhost.h | 1 +
lib/librte_vhost/socket.c | 5 +
lib/librte_vhost/vhost.c | 11 +
lib/librte_vhost/vhost.h | 235 +++++++++++++++++++
lib/librte_vhost/vhost_user.c | 11 +
lib/librte_vhost/vhost_vec_avx.c | 338 ++++++++++++++++++++++++++++
lib/librte_vhost/virtio_net.c | 257 ++++-----------------
11 files changed, 692 insertions(+), 216 deletions(-)
create mode 100644 lib/librte_vhost/vhost_vec_avx.c
--
2.17.1
^ permalink raw reply [flat|nested] 36+ messages in thread
* [dpdk-dev] [PATCH v2 1/5] vhost: add vectorized data path
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 0/5] vhost " Marvin Liu
@ 2020-09-21 6:48 ` Marvin Liu
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 2/5] vhost: reuse packed ring functions Marvin Liu
` (4 subsequent siblings)
5 siblings, 0 replies; 36+ messages in thread
From: Marvin Liu @ 2020-09-21 6:48 UTC (permalink / raw)
To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu
Packed ring operations are split into batch and single functions from a
performance perspective. Ring operations in the batch functions can be
accelerated by SIMD instructions like AVX512.
So introduce a vectorized parameter in vhost. The vectorized data path is
selected if the platform and ring format match the requirements; otherwise
vhost falls back to the original data path.
Signed-off-by: Marvin Liu <yong.liu@intel.com>
diff --git a/doc/guides/nics/vhost.rst b/doc/guides/nics/vhost.rst
index d36f3120b2..efdaf4de09 100644
--- a/doc/guides/nics/vhost.rst
+++ b/doc/guides/nics/vhost.rst
@@ -64,6 +64,11 @@ The user can specify below arguments in `--vdev` option.
It is used to enable external buffer support in vhost library.
(Default: 0 (disabled))
+#. ``vectorized``:
+
+ It is used to enable vectorized data path support in vhost library.
+ (Default: 0 (disabled))
+
Vhost PMD event handling
------------------------
diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst
index b892eec67a..d5d421441c 100644
--- a/doc/guides/prog_guide/vhost_lib.rst
+++ b/doc/guides/prog_guide/vhost_lib.rst
@@ -162,6 +162,18 @@ The following is an overview of some key Vhost API functions:
It is disabled by default.
+ - ``RTE_VHOST_USER_VECTORIZED``
+ The vectorized data path will be used when this flag is set. When packed ring
+ is enabled, available descriptors are stored in sequence by the frontend
+ driver. SIMD instructions like AVX can be used to handle multiple descriptors
+ simultaneously, which can accelerate the throughput of ring operations.
+
+ * Only the packed ring has a vectorized data path.
+
+ * It will fall back to the normal data path if there is no vectorization support.
+
+ It is disabled by default.
+
* ``rte_vhost_driver_set_features(path, features)``
This function sets the feature bits the vhost-user driver supports. The
diff --git a/drivers/net/vhost/rte_eth_vhost.c b/drivers/net/vhost/rte_eth_vhost.c
index e55278af69..2ba5a2a076 100644
--- a/drivers/net/vhost/rte_eth_vhost.c
+++ b/drivers/net/vhost/rte_eth_vhost.c
@@ -35,6 +35,7 @@ enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
#define ETH_VHOST_VIRTIO_NET_F_HOST_TSO "tso"
#define ETH_VHOST_LINEAR_BUF "linear-buffer"
#define ETH_VHOST_EXT_BUF "ext-buffer"
+#define ETH_VHOST_VECTORIZED "vectorized"
#define VHOST_MAX_PKT_BURST 32
static const char *valid_arguments[] = {
@@ -47,6 +48,7 @@ static const char *valid_arguments[] = {
ETH_VHOST_VIRTIO_NET_F_HOST_TSO,
ETH_VHOST_LINEAR_BUF,
ETH_VHOST_EXT_BUF,
+ ETH_VHOST_VECTORIZED,
NULL
};
@@ -1507,6 +1509,7 @@ rte_pmd_vhost_probe(struct rte_vdev_device *dev)
int tso = 0;
int linear_buf = 0;
int ext_buf = 0;
+ int vectorized = 0;
struct rte_eth_dev *eth_dev;
const char *name = rte_vdev_device_name(dev);
@@ -1626,6 +1629,17 @@ rte_pmd_vhost_probe(struct rte_vdev_device *dev)
flags |= RTE_VHOST_USER_EXTBUF_SUPPORT;
}
+ if (rte_kvargs_count(kvlist, ETH_VHOST_VECTORIZED) == 1) {
+ ret = rte_kvargs_process(kvlist,
+ ETH_VHOST_VECTORIZED,
+ &open_int, &vectorized);
+ if (ret < 0)
+ goto out_free;
+
+ if (vectorized == 1)
+ flags |= RTE_VHOST_USER_VECTORIZED;
+ }
+
if (dev->device.numa_node == SOCKET_ID_ANY)
dev->device.numa_node = rte_socket_id();
@@ -1679,4 +1693,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_vhost,
"postcopy-support=<0|1> "
"tso=<0|1> "
"linear-buffer=<0|1> "
- "ext-buffer=<0|1>");
+ "ext-buffer=<0|1> "
+ "vectorized=<0|1>");
diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
index a94c84134d..c7f946c6c1 100644
--- a/lib/librte_vhost/rte_vhost.h
+++ b/lib/librte_vhost/rte_vhost.h
@@ -36,6 +36,7 @@ extern "C" {
/* support only linear buffers (no chained mbufs) */
#define RTE_VHOST_USER_LINEARBUF_SUPPORT (1ULL << 6)
#define RTE_VHOST_USER_ASYNC_COPY (1ULL << 7)
+#define RTE_VHOST_USER_VECTORIZED (1ULL << 8)
/* Features. */
#ifndef VIRTIO_NET_F_GUEST_ANNOUNCE
diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
index 73e1dca95e..cc11244693 100644
--- a/lib/librte_vhost/socket.c
+++ b/lib/librte_vhost/socket.c
@@ -43,6 +43,7 @@ struct vhost_user_socket {
bool extbuf;
bool linearbuf;
bool async_copy;
+ bool vectorized;
/*
* The "supported_features" indicates the feature bits the
@@ -245,6 +246,9 @@ vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
dev->async_copy = 1;
}
+ if (vsocket->vectorized)
+ vhost_enable_vectorized(vid);
+
VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
if (vsocket->notify_ops->new_connection) {
@@ -881,6 +885,7 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
vsocket->dequeue_zero_copy = flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
vsocket->extbuf = flags & RTE_VHOST_USER_EXTBUF_SUPPORT;
vsocket->linearbuf = flags & RTE_VHOST_USER_LINEARBUF_SUPPORT;
+ vsocket->vectorized = flags & RTE_VHOST_USER_VECTORIZED;
if (vsocket->dequeue_zero_copy &&
(flags & RTE_VHOST_USER_IOMMU_SUPPORT)) {
diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index 8f20a0818f..50bf033a9d 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -752,6 +752,17 @@ vhost_enable_linearbuf(int vid)
dev->linearbuf = 1;
}
+void
+vhost_enable_vectorized(int vid)
+{
+ struct virtio_net *dev = get_device(vid);
+
+ if (dev == NULL)
+ return;
+
+ dev->vectorized = 1;
+}
+
int
rte_vhost_get_mtu(int vid, uint16_t *mtu)
{
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 632f66d532..b556eb3bf6 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -383,6 +383,7 @@ struct virtio_net {
int async_copy;
int extbuf;
int linearbuf;
+ int vectorized;
struct vhost_virtqueue *virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
struct inflight_mem_info *inflight_info;
#define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
@@ -721,6 +722,7 @@ void vhost_enable_dequeue_zero_copy(int vid);
void vhost_set_builtin_virtio_net(int vid, bool enable);
void vhost_enable_extbuf(int vid);
void vhost_enable_linearbuf(int vid);
+void vhost_enable_vectorized(int vid);
int vhost_enable_guest_notification(struct virtio_net *dev,
struct vhost_virtqueue *vq, int enable);
--
2.17.1
^ permalink raw reply [flat|nested] 36+ messages in thread
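For reference, here is a minimal sketch (not part of the patch) of how an
application could request the vectorized data path through the new
RTE_VHOST_USER_VECTORIZED flag; the socket path, helper name and the call to
rte_vhost_driver_start() are illustrative assumptions, and error handling is
kept to the bare minimum:

#include <rte_vhost.h>

/* Sketch only: register a vhost-user socket with the vectorized data path
 * requested. If the platform or ring format cannot use it, vhost falls back
 * to the regular batch functions, so setting the flag is safe.
 */
static int
vhost_socket_setup(const char *path)
{
	uint64_t flags = RTE_VHOST_USER_VECTORIZED;

	if (rte_vhost_driver_register(path, flags) < 0)
		return -1;

	return rte_vhost_driver_start(path);
}

With the vhost PMD, the same request would go through the new devarg, for
example something like --vdev 'eth_vhost0,iface=/tmp/sock0,vectorized=1'
(device name and socket path here are placeholders).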
* [dpdk-dev] [PATCH v2 2/5] vhost: reuse packed ring functions
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 0/5] vhost " Marvin Liu
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 1/5] vhost: " Marvin Liu
@ 2020-09-21 6:48 ` Marvin Liu
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 3/5] vhost: prepare memory regions addresses Marvin Liu
` (3 subsequent siblings)
5 siblings, 0 replies; 36+ messages in thread
From: Marvin Liu @ 2020-09-21 6:48 UTC (permalink / raw)
To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu
Move the parse_ethernet, offload and extbuf functions to the header file.
These functions will be reused by the vhost vectorized path.
Signed-off-by: Marvin Liu <yong.liu@intel.com>
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index b556eb3bf6..5a5c945551 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -20,6 +20,10 @@
#include <rte_rwlock.h>
#include <rte_malloc.h>
+#include <rte_ip.h>
+#include <rte_tcp.h>
+#include <rte_udp.h>
+#include <rte_sctp.h>
#include "rte_vhost.h"
#include "rte_vdpa.h"
#include "rte_vdpa_dev.h"
@@ -905,4 +909,215 @@ put_zmbuf(struct zcopy_mbuf *zmbuf)
zmbuf->in_use = 0;
}
+static __rte_always_inline bool
+virtio_net_is_inorder(struct virtio_net *dev)
+{
+ return dev->features & (1ULL << VIRTIO_F_IN_ORDER);
+}
+
+static __rte_always_inline void
+parse_ethernet(struct rte_mbuf *m, uint16_t *l4_proto, void **l4_hdr)
+{
+ struct rte_ipv4_hdr *ipv4_hdr;
+ struct rte_ipv6_hdr *ipv6_hdr;
+ void *l3_hdr = NULL;
+ struct rte_ether_hdr *eth_hdr;
+ uint16_t ethertype;
+
+ eth_hdr = rte_pktmbuf_mtod(m, struct rte_ether_hdr *);
+
+ m->l2_len = sizeof(struct rte_ether_hdr);
+ ethertype = rte_be_to_cpu_16(eth_hdr->ether_type);
+
+ if (ethertype == RTE_ETHER_TYPE_VLAN) {
+ struct rte_vlan_hdr *vlan_hdr =
+ (struct rte_vlan_hdr *)(eth_hdr + 1);
+
+ m->l2_len += sizeof(struct rte_vlan_hdr);
+ ethertype = rte_be_to_cpu_16(vlan_hdr->eth_proto);
+ }
+
+ l3_hdr = (char *)eth_hdr + m->l2_len;
+
+ switch (ethertype) {
+ case RTE_ETHER_TYPE_IPV4:
+ ipv4_hdr = l3_hdr;
+ *l4_proto = ipv4_hdr->next_proto_id;
+ m->l3_len = (ipv4_hdr->version_ihl & 0x0f) * 4;
+ *l4_hdr = (char *)l3_hdr + m->l3_len;
+ m->ol_flags |= PKT_TX_IPV4;
+ break;
+ case RTE_ETHER_TYPE_IPV6:
+ ipv6_hdr = l3_hdr;
+ *l4_proto = ipv6_hdr->proto;
+ m->l3_len = sizeof(struct rte_ipv6_hdr);
+ *l4_hdr = (char *)l3_hdr + m->l3_len;
+ m->ol_flags |= PKT_TX_IPV6;
+ break;
+ default:
+ m->l3_len = 0;
+ *l4_proto = 0;
+ *l4_hdr = NULL;
+ break;
+ }
+}
+
+static __rte_always_inline bool
+virtio_net_with_host_offload(struct virtio_net *dev)
+{
+ if (dev->features &
+ ((1ULL << VIRTIO_NET_F_CSUM) |
+ (1ULL << VIRTIO_NET_F_HOST_ECN) |
+ (1ULL << VIRTIO_NET_F_HOST_TSO4) |
+ (1ULL << VIRTIO_NET_F_HOST_TSO6) |
+ (1ULL << VIRTIO_NET_F_HOST_UFO)))
+ return true;
+
+ return false;
+}
+
+static __rte_always_inline void
+vhost_dequeue_offload(struct virtio_net_hdr *hdr, struct rte_mbuf *m)
+{
+ uint16_t l4_proto = 0;
+ void *l4_hdr = NULL;
+ struct rte_tcp_hdr *tcp_hdr = NULL;
+
+ if (hdr->flags == 0 && hdr->gso_type == VIRTIO_NET_HDR_GSO_NONE)
+ return;
+
+ parse_ethernet(m, &l4_proto, &l4_hdr);
+ if (hdr->flags == VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+ if (hdr->csum_start == (m->l2_len + m->l3_len)) {
+ switch (hdr->csum_offset) {
+ case (offsetof(struct rte_tcp_hdr, cksum)):
+ if (l4_proto == IPPROTO_TCP)
+ m->ol_flags |= PKT_TX_TCP_CKSUM;
+ break;
+ case (offsetof(struct rte_udp_hdr, dgram_cksum)):
+ if (l4_proto == IPPROTO_UDP)
+ m->ol_flags |= PKT_TX_UDP_CKSUM;
+ break;
+ case (offsetof(struct rte_sctp_hdr, cksum)):
+ if (l4_proto == IPPROTO_SCTP)
+ m->ol_flags |= PKT_TX_SCTP_CKSUM;
+ break;
+ default:
+ break;
+ }
+ }
+ }
+
+ if (l4_hdr && hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
+ switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
+ case VIRTIO_NET_HDR_GSO_TCPV4:
+ case VIRTIO_NET_HDR_GSO_TCPV6:
+ tcp_hdr = l4_hdr;
+ m->ol_flags |= PKT_TX_TCP_SEG;
+ m->tso_segsz = hdr->gso_size;
+ m->l4_len = (tcp_hdr->data_off & 0xf0) >> 2;
+ break;
+ case VIRTIO_NET_HDR_GSO_UDP:
+ m->ol_flags |= PKT_TX_UDP_SEG;
+ m->tso_segsz = hdr->gso_size;
+ m->l4_len = sizeof(struct rte_udp_hdr);
+ break;
+ default:
+ VHOST_LOG_DATA(WARNING,
+ "unsupported gso type %u.\n", hdr->gso_type);
+ break;
+ }
+ }
+}
+
+static void
+virtio_dev_extbuf_free(void *addr __rte_unused, void *opaque)
+{
+ rte_free(opaque);
+}
+
+static int
+virtio_dev_extbuf_alloc(struct rte_mbuf *pkt, uint32_t size)
+{
+ struct rte_mbuf_ext_shared_info *shinfo = NULL;
+ uint32_t total_len = RTE_PKTMBUF_HEADROOM + size;
+ uint16_t buf_len;
+ rte_iova_t iova;
+ void *buf;
+
+ /* Try to use pkt buffer to store shinfo to reduce the amount of memory
+ * required, otherwise store shinfo in the new buffer.
+ */
+ if (rte_pktmbuf_tailroom(pkt) >= sizeof(*shinfo))
+ shinfo = rte_pktmbuf_mtod(pkt,
+ struct rte_mbuf_ext_shared_info *);
+ else {
+ total_len += sizeof(*shinfo) + sizeof(uintptr_t);
+ total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
+ }
+
+ if (unlikely(total_len > UINT16_MAX))
+ return -ENOSPC;
+
+ buf_len = total_len;
+ buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
+ if (unlikely(buf == NULL))
+ return -ENOMEM;
+
+ /* Initialize shinfo */
+ if (shinfo) {
+ shinfo->free_cb = virtio_dev_extbuf_free;
+ shinfo->fcb_opaque = buf;
+ rte_mbuf_ext_refcnt_set(shinfo, 1);
+ } else {
+ shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
+ virtio_dev_extbuf_free, buf);
+ if (unlikely(shinfo == NULL)) {
+ rte_free(buf);
+ VHOST_LOG_DATA(ERR, "Failed to init shinfo\n");
+ return -1;
+ }
+ }
+
+ iova = rte_malloc_virt2iova(buf);
+ rte_pktmbuf_attach_extbuf(pkt, buf, iova, buf_len, shinfo);
+ rte_pktmbuf_reset_headroom(pkt);
+
+ return 0;
+}
+
+/*
+ * Allocate a host supported pktmbuf.
+ */
+static __rte_always_inline struct rte_mbuf *
+virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
+ uint32_t data_len)
+{
+ struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
+
+ if (unlikely(pkt == NULL)) {
+ VHOST_LOG_DATA(ERR,
+ "Failed to allocate memory for mbuf.\n");
+ return NULL;
+ }
+
+ if (rte_pktmbuf_tailroom(pkt) >= data_len)
+ return pkt;
+
+ /* attach an external buffer if supported */
+ if (dev->extbuf && !virtio_dev_extbuf_alloc(pkt, data_len))
+ return pkt;
+
+ /* check if chained buffers are allowed */
+ if (!dev->linearbuf)
+ return pkt;
+
+ /* Data doesn't fit into the buffer and the host supports
+ * only linear buffers
+ */
+ rte_pktmbuf_free(pkt);
+
+ return NULL;
+}
+
#endif /* _VHOST_NET_CDEV_H_ */
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index bd9303c8a9..6107662685 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -32,12 +32,6 @@ rxvq_is_mergeable(struct virtio_net *dev)
return dev->features & (1ULL << VIRTIO_NET_F_MRG_RXBUF);
}
-static __rte_always_inline bool
-virtio_net_is_inorder(struct virtio_net *dev)
-{
- return dev->features & (1ULL << VIRTIO_F_IN_ORDER);
-}
-
static bool
is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t nr_vring)
{
@@ -1804,121 +1798,6 @@ rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
return virtio_dev_rx_async_submit(dev, queue_id, pkts, count);
}
-static inline bool
-virtio_net_with_host_offload(struct virtio_net *dev)
-{
- if (dev->features &
- ((1ULL << VIRTIO_NET_F_CSUM) |
- (1ULL << VIRTIO_NET_F_HOST_ECN) |
- (1ULL << VIRTIO_NET_F_HOST_TSO4) |
- (1ULL << VIRTIO_NET_F_HOST_TSO6) |
- (1ULL << VIRTIO_NET_F_HOST_UFO)))
- return true;
-
- return false;
-}
-
-static void
-parse_ethernet(struct rte_mbuf *m, uint16_t *l4_proto, void **l4_hdr)
-{
- struct rte_ipv4_hdr *ipv4_hdr;
- struct rte_ipv6_hdr *ipv6_hdr;
- void *l3_hdr = NULL;
- struct rte_ether_hdr *eth_hdr;
- uint16_t ethertype;
-
- eth_hdr = rte_pktmbuf_mtod(m, struct rte_ether_hdr *);
-
- m->l2_len = sizeof(struct rte_ether_hdr);
- ethertype = rte_be_to_cpu_16(eth_hdr->ether_type);
-
- if (ethertype == RTE_ETHER_TYPE_VLAN) {
- struct rte_vlan_hdr *vlan_hdr =
- (struct rte_vlan_hdr *)(eth_hdr + 1);
-
- m->l2_len += sizeof(struct rte_vlan_hdr);
- ethertype = rte_be_to_cpu_16(vlan_hdr->eth_proto);
- }
-
- l3_hdr = (char *)eth_hdr + m->l2_len;
-
- switch (ethertype) {
- case RTE_ETHER_TYPE_IPV4:
- ipv4_hdr = l3_hdr;
- *l4_proto = ipv4_hdr->next_proto_id;
- m->l3_len = (ipv4_hdr->version_ihl & 0x0f) * 4;
- *l4_hdr = (char *)l3_hdr + m->l3_len;
- m->ol_flags |= PKT_TX_IPV4;
- break;
- case RTE_ETHER_TYPE_IPV6:
- ipv6_hdr = l3_hdr;
- *l4_proto = ipv6_hdr->proto;
- m->l3_len = sizeof(struct rte_ipv6_hdr);
- *l4_hdr = (char *)l3_hdr + m->l3_len;
- m->ol_flags |= PKT_TX_IPV6;
- break;
- default:
- m->l3_len = 0;
- *l4_proto = 0;
- *l4_hdr = NULL;
- break;
- }
-}
-
-static __rte_always_inline void
-vhost_dequeue_offload(struct virtio_net_hdr *hdr, struct rte_mbuf *m)
-{
- uint16_t l4_proto = 0;
- void *l4_hdr = NULL;
- struct rte_tcp_hdr *tcp_hdr = NULL;
-
- if (hdr->flags == 0 && hdr->gso_type == VIRTIO_NET_HDR_GSO_NONE)
- return;
-
- parse_ethernet(m, &l4_proto, &l4_hdr);
- if (hdr->flags == VIRTIO_NET_HDR_F_NEEDS_CSUM) {
- if (hdr->csum_start == (m->l2_len + m->l3_len)) {
- switch (hdr->csum_offset) {
- case (offsetof(struct rte_tcp_hdr, cksum)):
- if (l4_proto == IPPROTO_TCP)
- m->ol_flags |= PKT_TX_TCP_CKSUM;
- break;
- case (offsetof(struct rte_udp_hdr, dgram_cksum)):
- if (l4_proto == IPPROTO_UDP)
- m->ol_flags |= PKT_TX_UDP_CKSUM;
- break;
- case (offsetof(struct rte_sctp_hdr, cksum)):
- if (l4_proto == IPPROTO_SCTP)
- m->ol_flags |= PKT_TX_SCTP_CKSUM;
- break;
- default:
- break;
- }
- }
- }
-
- if (l4_hdr && hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
- switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
- case VIRTIO_NET_HDR_GSO_TCPV4:
- case VIRTIO_NET_HDR_GSO_TCPV6:
- tcp_hdr = l4_hdr;
- m->ol_flags |= PKT_TX_TCP_SEG;
- m->tso_segsz = hdr->gso_size;
- m->l4_len = (tcp_hdr->data_off & 0xf0) >> 2;
- break;
- case VIRTIO_NET_HDR_GSO_UDP:
- m->ol_flags |= PKT_TX_UDP_SEG;
- m->tso_segsz = hdr->gso_size;
- m->l4_len = sizeof(struct rte_udp_hdr);
- break;
- default:
- VHOST_LOG_DATA(WARNING,
- "unsupported gso type %u.\n", hdr->gso_type);
- break;
- }
- }
-}
-
static __rte_noinline void
copy_vnet_hdr_from_desc(struct virtio_net_hdr *hdr,
struct buf_vector *buf_vec)
@@ -2145,96 +2024,6 @@ get_zmbuf(struct vhost_virtqueue *vq)
return NULL;
}
-static void
-virtio_dev_extbuf_free(void *addr __rte_unused, void *opaque)
-{
- rte_free(opaque);
-}
-
-static int
-virtio_dev_extbuf_alloc(struct rte_mbuf *pkt, uint32_t size)
-{
- struct rte_mbuf_ext_shared_info *shinfo = NULL;
- uint32_t total_len = RTE_PKTMBUF_HEADROOM + size;
- uint16_t buf_len;
- rte_iova_t iova;
- void *buf;
-
- /* Try to use pkt buffer to store shinfo to reduce the amount of memory
- * required, otherwise store shinfo in the new buffer.
- */
- if (rte_pktmbuf_tailroom(pkt) >= sizeof(*shinfo))
- shinfo = rte_pktmbuf_mtod(pkt,
- struct rte_mbuf_ext_shared_info *);
- else {
- total_len += sizeof(*shinfo) + sizeof(uintptr_t);
- total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
- }
-
- if (unlikely(total_len > UINT16_MAX))
- return -ENOSPC;
-
- buf_len = total_len;
- buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
- if (unlikely(buf == NULL))
- return -ENOMEM;
-
- /* Initialize shinfo */
- if (shinfo) {
- shinfo->free_cb = virtio_dev_extbuf_free;
- shinfo->fcb_opaque = buf;
- rte_mbuf_ext_refcnt_set(shinfo, 1);
- } else {
- shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
- virtio_dev_extbuf_free, buf);
- if (unlikely(shinfo == NULL)) {
- rte_free(buf);
- VHOST_LOG_DATA(ERR, "Failed to init shinfo\n");
- return -1;
- }
- }
-
- iova = rte_malloc_virt2iova(buf);
- rte_pktmbuf_attach_extbuf(pkt, buf, iova, buf_len, shinfo);
- rte_pktmbuf_reset_headroom(pkt);
-
- return 0;
-}
-
-/*
- * Allocate a host supported pktmbuf.
- */
-static __rte_always_inline struct rte_mbuf *
-virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
- uint32_t data_len)
-{
- struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
-
- if (unlikely(pkt == NULL)) {
- VHOST_LOG_DATA(ERR,
- "Failed to allocate memory for mbuf.\n");
- return NULL;
- }
-
- if (rte_pktmbuf_tailroom(pkt) >= data_len)
- return pkt;
-
- /* attach an external buffer if supported */
- if (dev->extbuf && !virtio_dev_extbuf_alloc(pkt, data_len))
- return pkt;
-
- /* check if chained buffers are allowed */
- if (!dev->linearbuf)
- return pkt;
-
- /* Data doesn't fit into the buffer and the host supports
- * only linear buffers
- */
- rte_pktmbuf_free(pkt);
-
- return NULL;
-}
-
static __rte_noinline uint16_t
virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
--
2.17.1
^ permalink raw reply [flat|nested] 36+ messages in thread
* [dpdk-dev] [PATCH v2 3/5] vhost: prepare memory regions addresses
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 0/5] vhost " Marvin Liu
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 1/5] vhost: " Marvin Liu
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 2/5] vhost: reuse packed ring functions Marvin Liu
@ 2020-09-21 6:48 ` Marvin Liu
2020-10-06 15:06 ` Maxime Coquelin
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 4/5] vhost: add packed ring vectorized dequeue Marvin Liu
` (2 subsequent siblings)
5 siblings, 1 reply; 36+ messages in thread
From: Marvin Liu @ 2020-09-21 6:48 UTC (permalink / raw)
To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu
Prepare the memory regions' guest physical addresses for the vectorized data
path. This information will be utilized by SIMD instructions to find the
matched region index.
Signed-off-by: Marvin Liu <yong.liu@intel.com>
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 5a5c945551..4a81f18f01 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -52,6 +52,8 @@
#define ASYNC_MAX_POLL_SEG 255
+#define MAX_NREGIONS 8
+
#define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST * 2)
#define VHOST_MAX_ASYNC_VEC (BUF_VECTOR_MAX * 2)
@@ -375,6 +377,8 @@ struct inflight_mem_info {
struct virtio_net {
/* Frontend (QEMU) memory and memory region information */
struct rte_vhost_memory *mem;
+ uint64_t regions_low_addrs[MAX_NREGIONS];
+ uint64_t regions_high_addrs[MAX_NREGIONS];
uint64_t features;
uint64_t protocol_features;
int vid;
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index c3c924faec..89e75e9e71 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -1291,6 +1291,17 @@ vhost_user_set_mem_table(struct virtio_net **pdev, struct VhostUserMsg *msg,
}
}
+ RTE_BUILD_BUG_ON(VHOST_MEMORY_MAX_NREGIONS != 8);
+ if (dev->vectorized) {
+ for (i = 0; i < memory->nregions; i++) {
+ dev->regions_low_addrs[i] =
+ memory->regions[i].guest_phys_addr;
+ dev->regions_high_addrs[i] =
+ memory->regions[i].guest_phys_addr +
+ memory->regions[i].memory_size;
+ }
+ }
+
for (i = 0; i < dev->nr_vring; i++) {
struct vhost_virtqueue *vq = dev->virtqueue[i];
--
2.17.1
^ permalink raw reply [flat|nested] 36+ messages in thread
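For clarity, a scalar sketch (not part of the patch) of the lookup that the
prepared regions_low_addrs/regions_high_addrs arrays are meant to vectorize
in the following patches: find the region whose boundaries fully contain the
descriptor buffer, then translate the guest physical address into a host
virtual address. The helper name is made up for illustration and assumes
vhost.h is included:

#include "vhost.h"

static inline uintptr_t
gpa_range_to_vva_scalar(struct virtio_net *dev, uint64_t addr, uint64_t len)
{
	uint32_t i;

	for (i = 0; i < dev->mem->nregions; i++) {
		struct rte_vhost_mem_region *reg = &dev->mem->regions[i];

		/* the whole buffer must fit into a single region */
		if (addr >= reg->guest_phys_addr &&
		    addr + len <= reg->guest_phys_addr + reg->size)
			return (uintptr_t)(addr - reg->guest_phys_addr +
					reg->host_user_addr);
	}

	return 0;
}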
* [dpdk-dev] [PATCH v2 4/5] vhost: add packed ring vectorized dequeue
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 0/5] vhost " Marvin Liu
` (2 preceding siblings ...)
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 3/5] vhost: prepare memory regions addresses Marvin Liu
@ 2020-09-21 6:48 ` Marvin Liu
2020-10-06 14:59 ` Maxime Coquelin
2020-10-06 15:18 ` Maxime Coquelin
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 5/5] vhost: add packed ring vectorized enqueue Marvin Liu
2020-10-06 13:34 ` [dpdk-dev] [PATCH v2 0/5] vhost add vectorized data path Maxime Coquelin
5 siblings, 2 replies; 36+ messages in thread
From: Marvin Liu @ 2020-09-21 6:48 UTC (permalink / raw)
To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu
Optimize the vhost packed ring dequeue path with SIMD instructions. The
status check and writeback of four descriptors are handled in one batch
with AVX512 instructions. Address translation operations are also
accelerated by AVX512 instructions.
If the platform or compiler does not support vectorization, it will fall
back to the default path.
Signed-off-by: Marvin Liu <yong.liu@intel.com>
diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
index cc9aa65c67..c1481802d7 100644
--- a/lib/librte_vhost/meson.build
+++ b/lib/librte_vhost/meson.build
@@ -8,6 +8,22 @@ endif
if has_libnuma == 1
dpdk_conf.set10('RTE_LIBRTE_VHOST_NUMA', true)
endif
+
+if arch_subdir == 'x86'
+ if not machine_args.contains('-mno-avx512f')
+ if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw')
+ cflags += ['-DCC_AVX512_SUPPORT']
+ vhost_avx512_lib = static_library('vhost_avx512_lib',
+ 'vhost_vec_avx.c',
+ dependencies: [static_rte_eal, static_rte_mempool,
+ static_rte_mbuf, static_rte_ethdev, static_rte_net],
+ include_directories: includes,
+ c_args: [cflags, '-mavx512f', '-mavx512bw', '-mavx512vl'])
+ objs += vhost_avx512_lib.extract_objects('vhost_vec_avx.c')
+ endif
+ endif
+endif
+
if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 4a81f18f01..fc7daf2145 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -1124,4 +1124,12 @@ virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
return NULL;
}
+int
+vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
+ struct vhost_virtqueue *vq,
+ struct rte_mempool *mbuf_pool,
+ struct rte_mbuf **pkts,
+ uint16_t avail_idx,
+ uintptr_t *desc_addrs,
+ uint16_t *ids);
#endif /* _VHOST_NET_CDEV_H_ */
diff --git a/lib/librte_vhost/vhost_vec_avx.c b/lib/librte_vhost/vhost_vec_avx.c
new file mode 100644
index 0000000000..dc5322d002
--- /dev/null
+++ b/lib/librte_vhost/vhost_vec_avx.c
@@ -0,0 +1,181 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2016 Intel Corporation
+ */
+#include <stdint.h>
+
+#include "vhost.h"
+
+#define BYTE_SIZE 8
+/* reference count offset in mbuf rearm data */
+#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
+ offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
+/* segment number offset in mbuf rearm data */
+#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
+ offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
+
+/* default rearm data */
+#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
+ 1ULL << REFCNT_BITS_OFFSET)
+
+#define DESC_FLAGS_SHORT_OFFSET (offsetof(struct vring_packed_desc, flags) / \
+ sizeof(uint16_t))
+
+#define DESC_FLAGS_SHORT_SIZE (sizeof(struct vring_packed_desc) / \
+ sizeof(uint16_t))
+#define BATCH_FLAGS_MASK (1 << DESC_FLAGS_SHORT_OFFSET | \
+ 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE) | \
+ 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 2) | \
+ 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 3))
+
+#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
+ offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
+
+#define PACKED_FLAGS_MASK ((0ULL | VRING_DESC_F_AVAIL | VRING_DESC_F_USED) \
+ << FLAGS_BITS_OFFSET)
+#define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) << FLAGS_BITS_OFFSET)
+#define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
+ FLAGS_BITS_OFFSET)
+
+#define DESC_FLAGS_POS 0xaa
+#define MBUF_LENS_POS 0x6666
+
+int
+vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
+ struct vhost_virtqueue *vq,
+ struct rte_mempool *mbuf_pool,
+ struct rte_mbuf **pkts,
+ uint16_t avail_idx,
+ uintptr_t *desc_addrs,
+ uint16_t *ids)
+{
+ struct vring_packed_desc *descs = vq->desc_packed;
+ uint32_t descs_status;
+ void *desc_addr;
+ uint16_t i;
+ uint8_t cmp_low, cmp_high, cmp_result;
+ uint64_t lens[PACKED_BATCH_SIZE];
+ struct virtio_net_hdr *hdr;
+
+ if (unlikely(avail_idx & PACKED_BATCH_MASK))
+ return -1;
+
+ /* load 4 descs */
+ desc_addr = &vq->desc_packed[avail_idx];
+ __m512i desc_vec = _mm512_loadu_si512(desc_addr);
+
+ /* burst check four status */
+ __m512i avail_flag_vec;
+ if (vq->avail_wrap_counter)
+#if defined(RTE_ARCH_I686)
+ avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG, 0x0,
+ PACKED_FLAGS_MASK, 0x0);
+#else
+ avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+ PACKED_AVAIL_FLAG);
+
+#endif
+ else
+#if defined(RTE_ARCH_I686)
+ avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG_WRAP,
+ 0x0, PACKED_AVAIL_FLAG_WRAP, 0x0);
+#else
+ avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+ PACKED_AVAIL_FLAG_WRAP);
+#endif
+
+ descs_status = _mm512_cmp_epu16_mask(desc_vec, avail_flag_vec,
+ _MM_CMPINT_NE);
+ if (descs_status & BATCH_FLAGS_MASK)
+ return -1;
+
+ if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) {
+ vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+ uint64_t size = (uint64_t)descs[avail_idx + i].len;
+ desc_addrs[i] = __vhost_iova_to_vva(dev, vq,
+ descs[avail_idx + i].addr, &size,
+ VHOST_ACCESS_RO);
+
+ if (!desc_addrs[i])
+ goto free_buf;
+ lens[i] = descs[avail_idx + i].len;
+ rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
+
+ pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool,
+ lens[i]);
+ if (!pkts[i])
+ goto free_buf;
+ }
+ } else {
+ /* check buffer fit into one region & translate address */
+ __m512i regions_low_addrs =
+ _mm512_loadu_si512((void *)&dev->regions_low_addrs);
+ __m512i regions_high_addrs =
+ _mm512_loadu_si512((void *)&dev->regions_high_addrs);
+ vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+ uint64_t addr_low = descs[avail_idx + i].addr;
+ uint64_t addr_high = addr_low +
+ descs[avail_idx + i].len;
+ __m512i low_addr_vec = _mm512_set1_epi64(addr_low);
+ __m512i high_addr_vec = _mm512_set1_epi64(addr_high);
+
+ cmp_low = _mm512_cmp_epi64_mask(low_addr_vec,
+ regions_low_addrs, _MM_CMPINT_NLT);
+ cmp_high = _mm512_cmp_epi64_mask(high_addr_vec,
+ regions_high_addrs, _MM_CMPINT_LT);
+ cmp_result = cmp_low & cmp_high;
+ int index = __builtin_ctz(cmp_result);
+ if (unlikely((uint32_t)index >= dev->mem->nregions))
+ goto free_buf;
+
+ desc_addrs[i] = addr_low +
+ dev->mem->regions[index].host_user_addr -
+ dev->mem->regions[index].guest_phys_addr;
+ lens[i] = descs[avail_idx + i].len;
+ rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
+
+ pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool,
+ lens[i]);
+ if (!pkts[i])
+ goto free_buf;
+ }
+ }
+
+ if (virtio_net_with_host_offload(dev)) {
+ vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+ hdr = (struct virtio_net_hdr *)(desc_addrs[i]);
+ vhost_dequeue_offload(hdr, pkts[i]);
+ }
+ }
+
+ if (unlikely(virtio_net_is_inorder(dev))) {
+ ids[PACKED_BATCH_SIZE - 1] =
+ descs[avail_idx + PACKED_BATCH_SIZE - 1].id;
+ } else {
+ vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
+ ids[i] = descs[avail_idx + i].id;
+ }
+
+ uint64_t addrs[PACKED_BATCH_SIZE << 1];
+ /* store mbuf data_len, pkt_len */
+ vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+ addrs[i << 1] = (uint64_t)pkts[i]->rx_descriptor_fields1;
+ addrs[(i << 1) + 1] = (uint64_t)pkts[i]->rx_descriptor_fields1
+ + sizeof(uint64_t);
+ }
+
+ /* save pkt_len and data_len into mbufs */
+ __m512i value_vec = _mm512_maskz_shuffle_epi32(MBUF_LENS_POS, desc_vec,
+ 0xAA);
+ __m512i offsets_vec = _mm512_maskz_set1_epi32(MBUF_LENS_POS,
+ (uint32_t)-12);
+ value_vec = _mm512_add_epi32(value_vec, offsets_vec);
+ __m512i vindex = _mm512_loadu_si512((void *)addrs);
+ _mm512_i64scatter_epi64(0, vindex, value_vec, 1);
+
+ return 0;
+free_buf:
+ for (i = 0; i < PACKED_BATCH_SIZE; i++)
+ rte_pktmbuf_free(pkts[i]);
+
+ return -1;
+}
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 6107662685..e4d2e2e7d6 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -2249,6 +2249,28 @@ vhost_reserve_avail_batch_packed(struct virtio_net *dev,
return -1;
}
+static __rte_always_inline int
+vhost_handle_avail_batch_packed(struct virtio_net *dev,
+ struct vhost_virtqueue *vq,
+ struct rte_mempool *mbuf_pool,
+ struct rte_mbuf **pkts,
+ uint16_t avail_idx,
+ uintptr_t *desc_addrs,
+ uint16_t *ids)
+{
+ if (unlikely(dev->vectorized))
+#ifdef CC_AVX512_SUPPORT
+ return vhost_reserve_avail_batch_packed_avx(dev, vq, mbuf_pool,
+ pkts, avail_idx, desc_addrs, ids);
+#else
+ return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool,
+ pkts, avail_idx, desc_addrs, ids);
+
+#endif
+ return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
+ avail_idx, desc_addrs, ids);
+}
+
static __rte_always_inline int
virtio_dev_tx_batch_packed(struct virtio_net *dev,
struct vhost_virtqueue *vq,
@@ -2261,8 +2283,9 @@ virtio_dev_tx_batch_packed(struct virtio_net *dev,
uint16_t ids[PACKED_BATCH_SIZE];
uint16_t i;
- if (vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
- avail_idx, desc_addrs, ids))
+
+ if (vhost_handle_avail_batch_packed(dev, vq, mbuf_pool, pkts,
+ avail_idx, desc_addrs, ids))
return -1;
vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
--
2.17.1
^ permalink raw reply [flat|nested] 36+ messages in thread
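As a side note, a scalar sketch of roughly what the single masked
_mm512_cmp_epu16_mask() in the patch above verifies before a batch of four
descriptors is taken. The helper name is made up and assumes vhost.h is
included; note that the vectorized version compares the full 16-bit flags
word, so it is stricter and also rejects descriptors with extra flag bits set:

#include "vhost.h"

static inline int
batch_descs_avail_scalar(struct vhost_virtqueue *vq, uint16_t avail_idx)
{
	/* expected AVAIL/USED combination for the current wrap counter */
	uint16_t expected = vq->avail_wrap_counter ?
		VRING_DESC_F_AVAIL : VRING_DESC_F_USED;
	uint16_t i;

	for (i = 0; i < PACKED_BATCH_SIZE; i++) {
		uint16_t flags = vq->desc_packed[avail_idx + i].flags;

		if ((flags & (VRING_DESC_F_AVAIL | VRING_DESC_F_USED)) !=
				expected)
			return -1;
	}

	return 0;
}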
* [dpdk-dev] [PATCH v2 5/5] vhost: add packed ring vectorized enqueue
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 0/5] vhost " Marvin Liu
` (3 preceding siblings ...)
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 4/5] vhost: add packed ring vectorized dequeue Marvin Liu
@ 2020-09-21 6:48 ` Marvin Liu
2020-10-06 15:00 ` Maxime Coquelin
2020-10-06 13:34 ` [dpdk-dev] [PATCH v2 0/5] vhost add vectorized data path Maxime Coquelin
5 siblings, 1 reply; 36+ messages in thread
From: Marvin Liu @ 2020-09-21 6:48 UTC (permalink / raw)
To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu
Optimize the vhost packed ring enqueue path with SIMD instructions. The
status and length of four descriptors are handled in one batch with AVX512
instructions. Address translation operations are also accelerated by
AVX512 instructions.
Signed-off-by: Marvin Liu <yong.liu@intel.com>
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index fc7daf2145..b78b2c5c1b 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -1132,4 +1132,10 @@ vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
uint16_t avail_idx,
uintptr_t *desc_addrs,
uint16_t *ids);
+
+int
+virtio_dev_rx_batch_packed_avx(struct virtio_net *dev,
+ struct vhost_virtqueue *vq,
+ struct rte_mbuf **pkts);
+
#endif /* _VHOST_NET_CDEV_H_ */
diff --git a/lib/librte_vhost/vhost_vec_avx.c b/lib/librte_vhost/vhost_vec_avx.c
index dc5322d002..7d2250ed86 100644
--- a/lib/librte_vhost/vhost_vec_avx.c
+++ b/lib/librte_vhost/vhost_vec_avx.c
@@ -35,9 +35,15 @@
#define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) << FLAGS_BITS_OFFSET)
#define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
FLAGS_BITS_OFFSET)
+#define PACKED_WRITE_AVAIL_FLAG (PACKED_AVAIL_FLAG | \
+ ((0ULL | VRING_DESC_F_WRITE) << FLAGS_BITS_OFFSET))
+#define PACKED_WRITE_AVAIL_FLAG_WRAP (PACKED_AVAIL_FLAG_WRAP | \
+ ((0ULL | VRING_DESC_F_WRITE) << FLAGS_BITS_OFFSET))
#define DESC_FLAGS_POS 0xaa
#define MBUF_LENS_POS 0x6666
+#define DESC_LENS_POS 0x4444
+#define DESC_LENS_FLAGS_POS 0xB0B0B0B0
int
vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
@@ -179,3 +185,154 @@ vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
return -1;
}
+
+int
+virtio_dev_rx_batch_packed_avx(struct virtio_net *dev,
+ struct vhost_virtqueue *vq,
+ struct rte_mbuf **pkts)
+{
+ struct vring_packed_desc *descs = vq->desc_packed;
+ uint16_t avail_idx = vq->last_avail_idx;
+ uint64_t desc_addrs[PACKED_BATCH_SIZE];
+ uint32_t buf_offset = dev->vhost_hlen;
+ uint32_t desc_status;
+ uint64_t lens[PACKED_BATCH_SIZE];
+ uint16_t i;
+ void *desc_addr;
+ uint8_t cmp_low, cmp_high, cmp_result;
+
+ if (unlikely(avail_idx & PACKED_BATCH_MASK))
+ return -1;
+
+ /* check refcnt and nb_segs */
+ __m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA);
+
+ /* load four mbufs rearm data */
+ __m256i mbufs = _mm256_set_epi64x(
+ *pkts[3]->rearm_data,
+ *pkts[2]->rearm_data,
+ *pkts[1]->rearm_data,
+ *pkts[0]->rearm_data);
+
+ uint16_t cmp = _mm256_cmpneq_epu16_mask(mbufs, mbuf_ref);
+ if (cmp & MBUF_LENS_POS)
+ return -1;
+
+ /* check desc status */
+ desc_addr = &vq->desc_packed[avail_idx];
+ __m512i desc_vec = _mm512_loadu_si512(desc_addr);
+
+ __m512i avail_flag_vec;
+ __m512i used_flag_vec;
+ if (vq->avail_wrap_counter) {
+#if defined(RTE_ARCH_I686)
+ avail_flag_vec = _mm512_set4_epi64(PACKED_WRITE_AVAIL_FLAG,
+ 0x0, PACKED_WRITE_AVAIL_FLAG, 0x0);
+ used_flag_vec = _mm512_set4_epi64(PACKED_FLAGS_MASK, 0x0,
+ PACKED_FLAGS_MASK, 0x0);
+#else
+ avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+ PACKED_WRITE_AVAIL_FLAG);
+ used_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+ PACKED_FLAGS_MASK);
+#endif
+ } else {
+#if defined(RTE_ARCH_I686)
+ avail_flag_vec = _mm512_set4_epi64(
+ PACKED_WRITE_AVAIL_FLAG_WRAP, 0x0,
+ PACKED_WRITE_AVAIL_FLAG, 0x0);
+ used_flag_vec = _mm512_set4_epi64(0x0, 0x0, 0x0, 0x0);
+#else
+ avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+ PACKED_WRITE_AVAIL_FLAG_WRAP);
+ used_flag_vec = _mm512_setzero_epi32();
+#endif
+ }
+
+ desc_status = _mm512_mask_cmp_epu16_mask(BATCH_FLAGS_MASK, desc_vec,
+ avail_flag_vec, _MM_CMPINT_NE);
+ if (desc_status)
+ return -1;
+
+ if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) {
+ vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+ uint64_t size = (uint64_t)descs[avail_idx + i].len;
+ desc_addrs[i] = __vhost_iova_to_vva(dev, vq,
+ descs[avail_idx + i].addr, &size,
+ VHOST_ACCESS_RW);
+
+ if (!desc_addrs[i])
+ return -1;
+
+ rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[i], void *,
+ 0));
+ }
+ } else {
+ /* check buffer fit into one region & translate address */
+ __m512i regions_low_addrs =
+ _mm512_loadu_si512((void *)&dev->regions_low_addrs);
+ __m512i regions_high_addrs =
+ _mm512_loadu_si512((void *)&dev->regions_high_addrs);
+ vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+ uint64_t addr_low = descs[avail_idx + i].addr;
+ uint64_t addr_high = addr_low +
+ descs[avail_idx + i].len;
+ __m512i low_addr_vec = _mm512_set1_epi64(addr_low);
+ __m512i high_addr_vec = _mm512_set1_epi64(addr_high);
+
+ cmp_low = _mm512_cmp_epi64_mask(low_addr_vec,
+ regions_low_addrs, _MM_CMPINT_NLT);
+ cmp_high = _mm512_cmp_epi64_mask(high_addr_vec,
+ regions_high_addrs, _MM_CMPINT_LT);
+ cmp_result = cmp_low & cmp_high;
+ int index = __builtin_ctz(cmp_result);
+ if (unlikely((uint32_t)index >= dev->mem->nregions))
+ return -1;
+
+ desc_addrs[i] = addr_low +
+ dev->mem->regions[index].host_user_addr -
+ dev->mem->regions[index].guest_phys_addr;
+ rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[i], void *,
+ 0));
+ }
+ }
+
+ /* check length is enough */
+ __m512i pkt_lens = _mm512_set_epi32(
+ 0, pkts[3]->pkt_len, 0, 0,
+ 0, pkts[2]->pkt_len, 0, 0,
+ 0, pkts[1]->pkt_len, 0, 0,
+ 0, pkts[0]->pkt_len, 0, 0);
+
+ __m512i mbuf_len_offset = _mm512_maskz_set1_epi32(DESC_LENS_POS,
+ dev->vhost_hlen);
+ __m512i buf_len_vec = _mm512_add_epi32(pkt_lens, mbuf_len_offset);
+ uint16_t lens_cmp = _mm512_mask_cmp_epu32_mask(DESC_LENS_POS,
+ desc_vec, buf_len_vec, _MM_CMPINT_LT);
+ if (lens_cmp)
+ return -1;
+
+ vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+ rte_memcpy((void *)(uintptr_t)(desc_addrs[i] + buf_offset),
+ rte_pktmbuf_mtod_offset(pkts[i], void *, 0),
+ pkts[i]->pkt_len);
+ }
+
+ if (unlikely((dev->features & (1ULL << VHOST_F_LOG_ALL)))) {
+ vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+ lens[i] = descs[avail_idx + i].len;
+ vhost_log_cache_write_iova(dev, vq,
+ descs[avail_idx + i].addr, lens[i]);
+ }
+ }
+
+ vq_inc_last_avail_packed(vq, PACKED_BATCH_SIZE);
+ vq_inc_last_used_packed(vq, PACKED_BATCH_SIZE);
+ /* save len and flags, skip addr and id */
+ __m512i desc_updated = _mm512_mask_add_epi16(desc_vec,
+ DESC_LENS_FLAGS_POS, buf_len_vec,
+ used_flag_vec);
+ _mm512_storeu_si512(desc_addr, desc_updated);
+
+ return 0;
+}
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index e4d2e2e7d6..5c56a8d6ff 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -1354,6 +1354,21 @@ virtio_dev_rx_single_packed(struct virtio_net *dev,
return 0;
}
+static __rte_always_inline int
+virtio_dev_rx_handle_batch_packed(struct virtio_net *dev,
+ struct vhost_virtqueue *vq,
+ struct rte_mbuf **pkts)
+
+{
+ if (unlikely(dev->vectorized))
+#ifdef CC_AVX512_SUPPORT
+ return virtio_dev_rx_batch_packed_avx(dev, vq, pkts);
+#else
+ return virtio_dev_rx_batch_packed(dev, vq, pkts);
+#endif
+ return virtio_dev_rx_batch_packed(dev, vq, pkts);
+}
+
static __rte_noinline uint32_t
virtio_dev_rx_packed(struct virtio_net *dev,
struct vhost_virtqueue *__rte_restrict vq,
@@ -1367,8 +1382,8 @@ virtio_dev_rx_packed(struct virtio_net *dev,
rte_prefetch0(&vq->desc_packed[vq->last_avail_idx]);
if (remained >= PACKED_BATCH_SIZE) {
- if (!virtio_dev_rx_batch_packed(dev, vq,
- &pkts[pkt_idx])) {
+ if (!virtio_dev_rx_handle_batch_packed(dev, vq,
+ &pkts[pkt_idx])) {
pkt_idx += PACKED_BATCH_SIZE;
remained -= PACKED_BATCH_SIZE;
continue;
--
2.17.1
^ permalink raw reply [flat|nested] 36+ messages in thread
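For readability, a scalar sketch of what the single 256-bit rearm-data
compare in the enqueue patch above checks: the batch is only taken when every
mbuf is a plain, non-shared single segment (refcnt == 1 and nb_segs == 1).
The helper name is made up and assumes vhost.h is included for
PACKED_BATCH_SIZE:

#include <rte_mbuf.h>
#include "vhost.h"

static inline int
batch_mbufs_are_simple(struct rte_mbuf **pkts)
{
	uint16_t i;

	for (i = 0; i < PACKED_BATCH_SIZE; i++) {
		if (rte_mbuf_refcnt_read(pkts[i]) != 1 ||
				pkts[i]->nb_segs != 1)
			return -1;
	}

	return 0;
}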
* Re: [dpdk-dev] [PATCH v1 4/5] vhost: add packed ring vectorized dequeue
2020-09-21 6:26 ` Liu, Yong
@ 2020-09-21 7:47 ` Liu, Yong
0 siblings, 0 replies; 36+ messages in thread
From: Liu, Yong @ 2020-09-21 7:47 UTC (permalink / raw)
To: Maxime Coquelin, Xia, Chenbo, Wang, Zhihong; +Cc: dev
> -----Original Message-----
> From: Liu, Yong
> Sent: Monday, September 21, 2020 2:27 PM
> To: 'Maxime Coquelin' <maxime.coquelin@redhat.com>; Xia, Chenbo
> <chenbo.xia@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org
> Subject: RE: [PATCH v1 4/5] vhost: add packed ring vectorized dequeue
>
>
>
> > -----Original Message-----
> > From: Maxime Coquelin <maxime.coquelin@redhat.com>
> > Sent: Friday, September 18, 2020 9:45 PM
> > To: Liu, Yong <yong.liu@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>;
> > Wang, Zhihong <zhihong.wang@intel.com>
> > Cc: dev@dpdk.org
> > Subject: Re: [PATCH v1 4/5] vhost: add packed ring vectorized dequeue
> >
> >
> >
> > On 8/19/20 5:24 AM, Marvin Liu wrote:
> > > Optimize vhost packed ring dequeue path with SIMD instructions. Four
> > > descriptors status check and writeback are batched handled with
> AVX512
> > > instructions. Address translation operations are also accelerated by
> > > AVX512 instructions.
> > >
> > > If platform or compiler not support vectorization, will fallback to
> > > default path.
> > >
> > > Signed-off-by: Marvin Liu <yong.liu@intel.com>
> > >
> > > diff --git a/lib/librte_vhost/Makefile b/lib/librte_vhost/Makefile
> > > index 4f2f3e47da..c0cd7d498f 100644
> > > --- a/lib/librte_vhost/Makefile
> > > +++ b/lib/librte_vhost/Makefile
> > > @@ -31,6 +31,13 @@ CFLAGS += -DVHOST_ICC_UNROLL_PRAGMA
> > > endif
> > > endif
> > >
> > > +ifneq ($(FORCE_DISABLE_AVX512), y)
> > > + CC_AVX512_SUPPORT=\
> > > + $(shell $(CC) -march=native -dM -E - </dev/null 2>&1 | \
> > > + sed '/./{H;$$!d} ; x ; /AVX512F/!d; /AVX512BW/!d; /AVX512VL/!d' |
> \
> > > + grep -q AVX512 && echo 1)
> > > +endif
> > > +
> > > ifeq ($(CONFIG_RTE_LIBRTE_VHOST_NUMA),y)
> > > LDLIBS += -lnuma
> > > endif
> > > @@ -40,6 +47,12 @@ LDLIBS += -lrte_eal -lrte_mempool -lrte_mbuf -
> > lrte_ethdev -lrte_net
> > > SRCS-$(CONFIG_RTE_LIBRTE_VHOST) := fd_man.c iotlb.c socket.c vhost.c
> \
> > > vhost_user.c virtio_net.c vdpa.c
> > >
> > > +ifeq ($(CC_AVX512_SUPPORT), 1)
> > > +CFLAGS += -DCC_AVX512_SUPPORT
> > > +SRCS-$(CONFIG_RTE_LIBRTE_VHOST) += vhost_vec_avx.c
> > > +CFLAGS_vhost_vec_avx.o += -mavx512f -mavx512bw -mavx512vl
> > > +endif
> > > +
> > > # install includes
> > > SYMLINK-$(CONFIG_RTE_LIBRTE_VHOST)-include += rte_vhost.h
> > rte_vdpa.h \
> > > rte_vdpa_dev.h
> > rte_vhost_async.h
> > > diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
> > > index cc9aa65c67..c1481802d7 100644
> > > --- a/lib/librte_vhost/meson.build
> > > +++ b/lib/librte_vhost/meson.build
> > > @@ -8,6 +8,22 @@ endif
> > > if has_libnuma == 1
> > > dpdk_conf.set10('RTE_LIBRTE_VHOST_NUMA', true)
> > > endif
> > > +
> > > +if arch_subdir == 'x86'
> > > + if not machine_args.contains('-mno-avx512f')
> > > + if cc.has_argument('-mavx512f') and cc.has_argument('-
> > mavx512vl') and cc.has_argument('-mavx512bw')
> > > + cflags += ['-DCC_AVX512_SUPPORT']
> > > + vhost_avx512_lib = static_library('vhost_avx512_lib',
> > > + 'vhost_vec_avx.c',
> > > + dependencies: [static_rte_eal,
> > static_rte_mempool,
> > > + static_rte_mbuf, static_rte_ethdev,
> > static_rte_net],
> > > + include_directories: includes,
> > > + c_args: [cflags, '-mavx512f', '-mavx512bw', '-
> > mavx512vl'])
> > > + objs +=
> vhost_avx512_lib.extract_objects('vhost_vec_avx.c')
> > > + endif
> > > + endif
> > > +endif
> > > +
> > > if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
> > > cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
> > > elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
> > > diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> > > index 4a81f18f01..fc7daf2145 100644
> > > --- a/lib/librte_vhost/vhost.h
> > > +++ b/lib/librte_vhost/vhost.h
> > > @@ -1124,4 +1124,12 @@ virtio_dev_pktmbuf_alloc(struct virtio_net
> > *dev, struct rte_mempool *mp,
> > > return NULL;
> > > }
> > >
> > > +int
> > > +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> > > + struct vhost_virtqueue *vq,
> > > + struct rte_mempool *mbuf_pool,
> > > + struct rte_mbuf **pkts,
> > > + uint16_t avail_idx,
> > > + uintptr_t *desc_addrs,
> > > + uint16_t *ids);
> > > #endif /* _VHOST_NET_CDEV_H_ */
> > > diff --git a/lib/librte_vhost/vhost_vec_avx.c
> > b/lib/librte_vhost/vhost_vec_avx.c
> > > new file mode 100644
> > > index 0000000000..e8361d18fa
> > > --- /dev/null
> > > +++ b/lib/librte_vhost/vhost_vec_avx.c
> > > @@ -0,0 +1,152 @@
> > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > + * Copyright(c) 2010-2016 Intel Corporation
> > > + */
> > > +#include <stdint.h>
> > > +
> > > +#include "vhost.h"
> > > +
> > > +#define BYTE_SIZE 8
> > > +/* reference count offset in mbuf rearm data */
> > > +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
> > > + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> > > +/* segment number offset in mbuf rearm data */
> > > +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
> > > + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> > > +
> > > +/* default rearm data */
> > > +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
> > > + 1ULL << REFCNT_BITS_OFFSET)
> > > +
> > > +#define DESC_FLAGS_SHORT_OFFSET (offsetof(struct
> vring_packed_desc,
> > flags) / \
> > > + sizeof(uint16_t))
> > > +
> > > +#define DESC_FLAGS_SHORT_SIZE (sizeof(struct vring_packed_desc) / \
> > > + sizeof(uint16_t))
> > > +#define BATCH_FLAGS_MASK (1 << DESC_FLAGS_SHORT_OFFSET | \
> > > + 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE) | \
> > > + 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 2) |
> > \
> > > + 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 3))
> > > +
> > > +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) -
> \
> > > + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
> > > +
> > > +#define PACKED_FLAGS_MASK ((0ULL | VRING_DESC_F_AVAIL |
> > VRING_DESC_F_USED) \
> > > + << FLAGS_BITS_OFFSET)
> > > +#define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) <<
> > FLAGS_BITS_OFFSET)
> > > +#define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) <<
> \
> > > + FLAGS_BITS_OFFSET)
> > > +
> > > +#define DESC_FLAGS_POS 0xaa
> > > +#define MBUF_LENS_POS 0x6666
> > > +
> > > +int
> > > +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> > > + struct vhost_virtqueue *vq,
> > > + struct rte_mempool *mbuf_pool,
> > > + struct rte_mbuf **pkts,
> > > + uint16_t avail_idx,
> > > + uintptr_t *desc_addrs,
> > > + uint16_t *ids)
> > > +{
> > > + struct vring_packed_desc *descs = vq->desc_packed;
> > > + uint32_t descs_status;
> > > + void *desc_addr;
> > > + uint16_t i;
> > > + uint8_t cmp_low, cmp_high, cmp_result;
> > > + uint64_t lens[PACKED_BATCH_SIZE];
> > > +
> > > + if (unlikely(avail_idx & PACKED_BATCH_MASK))
> > > + return -1;
> > > +
> > > + /* load 4 descs */
> > > + desc_addr = &vq->desc_packed[avail_idx];
> > > + __m512i desc_vec = _mm512_loadu_si512(desc_addr);
> > > +
> > > + /* burst check four status */
> > > + __m512i avail_flag_vec;
> > > + if (vq->avail_wrap_counter)
> > > +#if defined(RTE_ARCH_I686)
> > > + avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG,
> > 0x0,
> > > + PACKED_FLAGS_MASK, 0x0);
> > > +#else
> > > + avail_flag_vec =
> > _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> > > + PACKED_AVAIL_FLAG);
> > > +
> > > +#endif
> > > + else
> > > +#if defined(RTE_ARCH_I686)
> > > + avail_flag_vec =
> > _mm512_set4_epi64(PACKED_AVAIL_FLAG_WRAP,
> > > + 0x0, PACKED_AVAIL_FLAG_WRAP,
> > 0x0);
> > > +#else
> > > + avail_flag_vec =
> > _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> > > + PACKED_AVAIL_FLAG_WRAP);
> > > +#endif
> > > +
> > > + descs_status = _mm512_cmp_epu16_mask(desc_vec, avail_flag_vec,
> > > + _MM_CMPINT_NE);
> > > + if (descs_status & BATCH_FLAGS_MASK)
> > > + return -1;
> > > +
> > > + /* check buffer fit into one region & translate address */
> > > + __m512i regions_low_addrs =
> > > + _mm512_loadu_si512((void *)&dev->regions_low_addrs);
> > > + __m512i regions_high_addrs =
> > > + _mm512_loadu_si512((void *)&dev->regions_high_addrs);
> > > + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > > + uint64_t addr_low = descs[avail_idx + i].addr;
> > > + uint64_t addr_high = addr_low + descs[avail_idx + i].len;
> > > + __m512i low_addr_vec = _mm512_set1_epi64(addr_low);
> > > + __m512i high_addr_vec = _mm512_set1_epi64(addr_high);
> > > +
> > > + cmp_low = _mm512_cmp_epi64_mask(low_addr_vec,
> > > + regions_low_addrs, _MM_CMPINT_NLT);
> > > + cmp_high = _mm512_cmp_epi64_mask(high_addr_vec,
> > > + regions_high_addrs, _MM_CMPINT_LT);
> > > + cmp_result = cmp_low & cmp_high;
> > > + int index = __builtin_ctz(cmp_result);
> > > + if (unlikely((uint32_t)index >= dev->mem->nregions))
> > > + goto free_buf;
> > > +
> > > + desc_addrs[i] = addr_low +
> > > + dev->mem->regions[index].host_user_addr -
> > > + dev->mem->regions[index].guest_phys_addr;
> > > + lens[i] = descs[avail_idx + i].len;
> > > + rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> > > +
> > > + pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool, lens[i]);
> > > + if (!pkts[i])
> > > + goto free_buf;
> > > + }
> >
> > The above does not support vIOMMU, isn't it?
> >
> > The more the packed datapath evolves, the more it gets optimized for a
> > very specific configuration.
> >
> > In v19.11, indirect descriptors and chained buffers are handled as a
> > fallback. And now vIOMMU support is handled as a fallback.
> >
>
> Hi Maxime,
> Thanks for pointing out the missing feature. The first version of the
> patch set lacks vIOMMU support.
> The v2 patch set will close the feature gap between the vectorized
> functions and the original batch functions.
> So no additional fallback is introduced by the vectorized patch set.
>
> IMHO, the complexity introduced by the current packed ring optimization
> is there to handle the gap between performance-oriented frontends (like
> a PMD) and normal network traffic (like TCP).
> The vectorized datapath focuses on enhancing the performance of the
> batched functions. From a functional point of view, there is no
> difference between the vectorized batched functions and the original
> batched functions.
> The current packed ring path remains the same if the vectorized option
> is not enabled, so I think the complexity won't increase too much. If
> there is any concern, please let me know.
>
> BTW, the vectorized path can help performance a lot when vIOMMU is enabled.
>
After double checking, most of the performance difference came from runtime
settings. The performance gain of the vectorized path is not so obvious.
> Regards,
> Marvin
>
> > I personally don't like the path it is taking as it is adding a lot of
> > complexity on top of that.
> >
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/5] vhost add vectorized data path
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 0/5] vhost " Marvin Liu
` (4 preceding siblings ...)
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 5/5] vhost: add packed ring vectorized enqueue Marvin Liu
@ 2020-10-06 13:34 ` Maxime Coquelin
2020-10-08 6:20 ` Liu, Yong
5 siblings, 1 reply; 36+ messages in thread
From: Maxime Coquelin @ 2020-10-06 13:34 UTC (permalink / raw)
To: Marvin Liu, chenbo.xia, zhihong.wang; +Cc: dev
Hi,
On 9/21/20 8:48 AM, Marvin Liu wrote:
> The packed ring format was introduced in virtio spec 1.1. All descriptors
> are compacted into one single ring when the packed ring format is on. It is
> straightforward that ring operations can be accelerated by utilizing
> SIMD instructions.
>
> This patch set introduces a vectorized data path in the vhost library. If
> the vectorized option is on, operations like descriptor check, descriptor
> writeback and address translation are accelerated by SIMD instructions. The
> vhost application can choose whether to use vectorized acceleration, just
> like the external buffer and zero copy features.
>
> If the platform or ring format does not support the vectorized functions,
> vhost falls back to the default batch functions. There is no impact on the
> current data path.
As a pre-requisite, I'd like some performance numbers in both loopback
and PVP to figure out if adding such complexity is worth it, given we
will have to support it for at least one year.
Thanks,
Maxime
> v2:
> * add vIOMMU support
> * add dequeue offloading
> * rebase code
>
> Marvin Liu (5):
> vhost: add vectorized data path
> vhost: reuse packed ring functions
> vhost: prepare memory regions addresses
> vhost: add packed ring vectorized dequeue
> vhost: add packed ring vectorized enqueue
>
> doc/guides/nics/vhost.rst | 5 +
> doc/guides/prog_guide/vhost_lib.rst | 12 +
> drivers/net/vhost/rte_eth_vhost.c | 17 +-
> lib/librte_vhost/meson.build | 16 ++
> lib/librte_vhost/rte_vhost.h | 1 +
> lib/librte_vhost/socket.c | 5 +
> lib/librte_vhost/vhost.c | 11 +
> lib/librte_vhost/vhost.h | 235 +++++++++++++++++++
> lib/librte_vhost/vhost_user.c | 11 +
> lib/librte_vhost/vhost_vec_avx.c | 338 ++++++++++++++++++++++++++++
> lib/librte_vhost/virtio_net.c | 257 ++++-----------------
> 11 files changed, 692 insertions(+), 216 deletions(-)
> create mode 100644 lib/librte_vhost/vhost_vec_avx.c
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v2 4/5] vhost: add packed ring vectorized dequeue
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 4/5] vhost: add packed ring vectorized dequeue Marvin Liu
@ 2020-10-06 14:59 ` Maxime Coquelin
2020-10-08 7:05 ` Liu, Yong
2020-10-06 15:18 ` Maxime Coquelin
1 sibling, 1 reply; 36+ messages in thread
From: Maxime Coquelin @ 2020-10-06 14:59 UTC (permalink / raw)
To: Marvin Liu, chenbo.xia, zhihong.wang; +Cc: dev
On 9/21/20 8:48 AM, Marvin Liu wrote:
> Optimize vhost packed ring dequeue path with SIMD instructions. Four
> descriptors status check and writeback are batched handled with AVX512
> instructions. Address translation operations are also accelerated by
> AVX512 instructions.
>
> If platform or compiler not support vectorization, will fallback to
> default path.
>
> Signed-off-by: Marvin Liu <yong.liu@intel.com>
>
> diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
> index cc9aa65c67..c1481802d7 100644
> --- a/lib/librte_vhost/meson.build
> +++ b/lib/librte_vhost/meson.build
> @@ -8,6 +8,22 @@ endif
> if has_libnuma == 1
> dpdk_conf.set10('RTE_LIBRTE_VHOST_NUMA', true)
> endif
> +
> +if arch_subdir == 'x86'
> + if not machine_args.contains('-mno-avx512f')
> + if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw')
> + cflags += ['-DCC_AVX512_SUPPORT']
> + vhost_avx512_lib = static_library('vhost_avx512_lib',
> + 'vhost_vec_avx.c',
> + dependencies: [static_rte_eal, static_rte_mempool,
> + static_rte_mbuf, static_rte_ethdev, static_rte_net],
> + include_directories: includes,
> + c_args: [cflags, '-mavx512f', '-mavx512bw', '-mavx512vl'])
> + objs += vhost_avx512_lib.extract_objects('vhost_vec_avx.c')
> + endif
> + endif
> +endif
Not a Meson expert, but I wonder how I can disable CC_AVX512_SUPPORT.
I checked the DPDK doc, but I could not find how to pass -mno-avx512f to
the machine_args.
> +
> if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
> cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
> elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
> diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> index 4a81f18f01..fc7daf2145 100644
> --- a/lib/librte_vhost/vhost.h
> +++ b/lib/librte_vhost/vhost.h
> @@ -1124,4 +1124,12 @@ virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
> return NULL;
> }
>
> +int
> +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> + struct vhost_virtqueue *vq,
> + struct rte_mempool *mbuf_pool,
> + struct rte_mbuf **pkts,
> + uint16_t avail_idx,
> + uintptr_t *desc_addrs,
> + uint16_t *ids);
> #endif /* _VHOST_NET_CDEV_H_ */
> diff --git a/lib/librte_vhost/vhost_vec_avx.c b/lib/librte_vhost/vhost_vec_avx.c
> new file mode 100644
> index 0000000000..dc5322d002
> --- /dev/null
> +++ b/lib/librte_vhost/vhost_vec_avx.c
For consistency it should be prefixed with virtio_net, not vhost.
> @@ -0,0 +1,181 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2016 Intel Corporation
> + */
> +#include <stdint.h>
> +
> +#include "vhost.h"
> +
> +#define BYTE_SIZE 8
> +/* reference count offset in mbuf rearm data */
> +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
> + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> +/* segment number offset in mbuf rearm data */
> +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
> + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> +
> +/* default rearm data */
> +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
> + 1ULL << REFCNT_BITS_OFFSET)
> +
> +#define DESC_FLAGS_SHORT_OFFSET (offsetof(struct vring_packed_desc, flags) / \
> + sizeof(uint16_t))
> +
> +#define DESC_FLAGS_SHORT_SIZE (sizeof(struct vring_packed_desc) / \
> + sizeof(uint16_t))
> +#define BATCH_FLAGS_MASK (1 << DESC_FLAGS_SHORT_OFFSET | \
> + 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE) | \
> + 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 2) | \
> + 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 3))
> +
> +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
> + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
> +
> +#define PACKED_FLAGS_MASK ((0ULL | VRING_DESC_F_AVAIL | VRING_DESC_F_USED) \
> + << FLAGS_BITS_OFFSET)
> +#define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) << FLAGS_BITS_OFFSET)
> +#define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
> + FLAGS_BITS_OFFSET)
> +
> +#define DESC_FLAGS_POS 0xaa
> +#define MBUF_LENS_POS 0x6666
> +
> +int
> +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> + struct vhost_virtqueue *vq,
> + struct rte_mempool *mbuf_pool,
> + struct rte_mbuf **pkts,
> + uint16_t avail_idx,
> + uintptr_t *desc_addrs,
> + uint16_t *ids)
> +{
> + struct vring_packed_desc *descs = vq->desc_packed;
> + uint32_t descs_status;
> + void *desc_addr;
> + uint16_t i;
> + uint8_t cmp_low, cmp_high, cmp_result;
> + uint64_t lens[PACKED_BATCH_SIZE];
> + struct virtio_net_hdr *hdr;
> +
> + if (unlikely(avail_idx & PACKED_BATCH_MASK))
> + return -1;
> +
> + /* load 4 descs */
> + desc_addr = &vq->desc_packed[avail_idx];
> + __m512i desc_vec = _mm512_loadu_si512(desc_addr);
Unlike the split ring, the packed ring specification does not mandate the
ring size to be a power of two. So checking that avail_idx is aligned on 64
bytes is not enough, given a descriptor is 16 bytes.
You need to also check against the ring size to prevent out-of-bounds
accesses.
I see the non-vectorized batch processing you introduced in v19.11 also
makes that wrong assumption. Please fix it.
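As an illustration only, the extra bound check being asked for could look
like the sketch below. The helper name is hypothetical; vq->size and the
PACKED_BATCH_SIZE/PACKED_BATCH_MASK constants are the ones already used by
the batch path:

/* Hypothetical helper: reject a batch that is unaligned or that would
 * read past the end of a non-power-of-two packed ring.
 */
static __rte_always_inline int
batch_within_ring(struct vhost_virtqueue *vq, uint16_t avail_idx)
{
	if (unlikely(avail_idx & PACKED_BATCH_MASK))
		return -1;

	if (unlikely(avail_idx + PACKED_BATCH_SIZE > vq->size))
		return -1;

	return 0;
}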
Also, I wonder whether it is assumed that &vq->desc_packed[avail_idx]
is aligned on a cache line. Meaning, does the intrinsic below have such a
requirement?
> + /* burst check four status */
> + __m512i avail_flag_vec;
> + if (vq->avail_wrap_counter)
> +#if defined(RTE_ARCH_I686)
> + avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG, 0x0,
> + PACKED_FLAGS_MASK, 0x0);
> +#else
> + avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> + PACKED_AVAIL_FLAG);
> +
> +#endif
> + else
> +#if defined(RTE_ARCH_I686)
> + avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG_WRAP,
> + 0x0, PACKED_AVAIL_FLAG_WRAP, 0x0);
> +#else
> + avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> + PACKED_AVAIL_FLAG_WRAP);
> +#endif
> +
> + descs_status = _mm512_cmp_epu16_mask(desc_vec, avail_flag_vec,
> + _MM_CMPINT_NE);
> + if (descs_status & BATCH_FLAGS_MASK)
> + return -1;
> +
> + if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) {
> + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> + uint64_t size = (uint64_t)descs[avail_idx + i].len;
> + desc_addrs[i] = __vhost_iova_to_vva(dev, vq,
> + descs[avail_idx + i].addr, &size,
> + VHOST_ACCESS_RO);
> +
> + if (!desc_addrs[i])
> + goto free_buf;
> + lens[i] = descs[avail_idx + i].len;
> + rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> +
> + pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool,
> + lens[i]);
> + if (!pkts[i])
> + goto free_buf;
> + }
> + } else {
> + /* check buffer fit into one region & translate address */
> + __m512i regions_low_addrs =
> + _mm512_loadu_si512((void *)&dev->regions_low_addrs);
> + __m512i regions_high_addrs =
> + _mm512_loadu_si512((void *)&dev->regions_high_addrs);
> + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> + uint64_t addr_low = descs[avail_idx + i].addr;
> + uint64_t addr_high = addr_low +
> + descs[avail_idx + i].len;
> + __m512i low_addr_vec = _mm512_set1_epi64(addr_low);
> + __m512i high_addr_vec = _mm512_set1_epi64(addr_high);
> +
> + cmp_low = _mm512_cmp_epi64_mask(low_addr_vec,
> + regions_low_addrs, _MM_CMPINT_NLT);
> + cmp_high = _mm512_cmp_epi64_mask(high_addr_vec,
> + regions_high_addrs, _MM_CMPINT_LT);
> + cmp_result = cmp_low & cmp_high;
> + int index = __builtin_ctz(cmp_result);
> + if (unlikely((uint32_t)index >= dev->mem->nregions))
> + goto free_buf;
> +
> + desc_addrs[i] = addr_low +
> + dev->mem->regions[index].host_user_addr -
> + dev->mem->regions[index].guest_phys_addr;
> + lens[i] = descs[avail_idx + i].len;
> + rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> +
> + pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool,
> + lens[i]);
> + if (!pkts[i])
> + goto free_buf;
> + }
> + }
> +
> + if (virtio_net_with_host_offload(dev)) {
> + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> + hdr = (struct virtio_net_hdr *)(desc_addrs[i]);
> + vhost_dequeue_offload(hdr, pkts[i]);
> + }
> + }
> +
> + if (unlikely(virtio_net_is_inorder(dev))) {
> + ids[PACKED_BATCH_SIZE - 1] =
> + descs[avail_idx + PACKED_BATCH_SIZE - 1].id;
Isn't in-order a likely case? Maybe just remove the unlikely.
> + } else {
> + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
> + ids[i] = descs[avail_idx + i].id;
> + }
> +
> + uint64_t addrs[PACKED_BATCH_SIZE << 1];
> + /* store mbuf data_len, pkt_len */
> + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> + addrs[i << 1] = (uint64_t)pkts[i]->rx_descriptor_fields1;
> + addrs[(i << 1) + 1] = (uint64_t)pkts[i]->rx_descriptor_fields1
> + + sizeof(uint64_t);
> + }
> +
> + /* save pkt_len and data_len into mbufs */
> + __m512i value_vec = _mm512_maskz_shuffle_epi32(MBUF_LENS_POS, desc_vec,
> + 0xAA);
> + __m512i offsets_vec = _mm512_maskz_set1_epi32(MBUF_LENS_POS,
> + (uint32_t)-12);
> + value_vec = _mm512_add_epi32(value_vec, offsets_vec);
> + __m512i vindex = _mm512_loadu_si512((void *)addrs);
> + _mm512_i64scatter_epi64(0, vindex, value_vec, 1);
> +
> + return 0;
> +free_buf:
> + for (i = 0; i < PACKED_BATCH_SIZE; i++)
> + rte_pktmbuf_free(pkts[i]);
> +
> + return -1;
> +}
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index 6107662685..e4d2e2e7d6 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -2249,6 +2249,28 @@ vhost_reserve_avail_batch_packed(struct virtio_net *dev,
> return -1;
> }
>
> +static __rte_always_inline int
> +vhost_handle_avail_batch_packed(struct virtio_net *dev,
> + struct vhost_virtqueue *vq,
> + struct rte_mempool *mbuf_pool,
> + struct rte_mbuf **pkts,
> + uint16_t avail_idx,
> + uintptr_t *desc_addrs,
> + uint16_t *ids)
> +{
> + if (unlikely(dev->vectorized))
> +#ifdef CC_AVX512_SUPPORT
> + return vhost_reserve_avail_batch_packed_avx(dev, vq, mbuf_pool,
> + pkts, avail_idx, desc_addrs, ids);
> +#else
> + return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool,
> + pkts, avail_idx, desc_addrs, ids);
> +
> +#endif
> + return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
> + avail_idx, desc_addrs, ids);
> +}
It should be as below to not have any performance impact when
CC_AVX512_SUPPORT is not set:
#ifdef CC_AVX512_SUPPORT
if (unlikely(dev->vectorized))
return vhost_reserve_avail_batch_packed_avx(dev, vq, mbuf_pool,
pkts, avail_idx, desc_addrs, ids);
#else
return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
avail_idx, desc_addrs, ids);
#endif
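For what it is worth, a complete sketch of the wrapper along those lines,
which keeps the default batch path reachable when AVX512 support is
compiled in but the vectorized option is off (an illustration only, not the
final patch):

static __rte_always_inline int
vhost_handle_avail_batch_packed(struct virtio_net *dev,
				struct vhost_virtqueue *vq,
				struct rte_mempool *mbuf_pool,
				struct rte_mbuf **pkts,
				uint16_t avail_idx,
				uintptr_t *desc_addrs,
				uint16_t *ids)
{
#ifdef CC_AVX512_SUPPORT
	/* Only taken when the user enabled the vectorized option. */
	if (dev->vectorized)
		return vhost_reserve_avail_batch_packed_avx(dev, vq, mbuf_pool,
				pkts, avail_idx, desc_addrs, ids);
#endif
	/* Default batch path, also used when AVX512 is compiled out. */
	return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
			avail_idx, desc_addrs, ids);
}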
> +
> static __rte_always_inline int
> virtio_dev_tx_batch_packed(struct virtio_net *dev,
> struct vhost_virtqueue *vq,
> @@ -2261,8 +2283,9 @@ virtio_dev_tx_batch_packed(struct virtio_net *dev,
> uint16_t ids[PACKED_BATCH_SIZE];
> uint16_t i;
>
> - if (vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
> - avail_idx, desc_addrs, ids))
> +
> + if (vhost_handle_avail_batch_packed(dev, vq, mbuf_pool, pkts,
> + avail_idx, desc_addrs, ids))
> return -1;
>
> vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v2 5/5] vhost: add packed ring vectorized enqueue
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 5/5] vhost: add packed ring vectorized enqueue Marvin Liu
@ 2020-10-06 15:00 ` Maxime Coquelin
2020-10-08 7:09 ` Liu, Yong
0 siblings, 1 reply; 36+ messages in thread
From: Maxime Coquelin @ 2020-10-06 15:00 UTC (permalink / raw)
To: Marvin Liu, chenbo.xia, zhihong.wang; +Cc: dev
On 9/21/20 8:48 AM, Marvin Liu wrote:
> Optimize vhost packed ring enqueue path with SIMD instructions. Four
> descriptors status and length are batched handled with AVX512
> instructions. Address translation operations are also accelerated
> by AVX512 instructions.
>
> Signed-off-by: Marvin Liu <yong.liu@intel.com>
>
> diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> index fc7daf2145..b78b2c5c1b 100644
> --- a/lib/librte_vhost/vhost.h
> +++ b/lib/librte_vhost/vhost.h
> @@ -1132,4 +1132,10 @@ vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> uint16_t avail_idx,
> uintptr_t *desc_addrs,
> uint16_t *ids);
> +
> +int
> +virtio_dev_rx_batch_packed_avx(struct virtio_net *dev,
> + struct vhost_virtqueue *vq,
> + struct rte_mbuf **pkts);
> +
> #endif /* _VHOST_NET_CDEV_H_ */
> diff --git a/lib/librte_vhost/vhost_vec_avx.c b/lib/librte_vhost/vhost_vec_avx.c
> index dc5322d002..7d2250ed86 100644
> --- a/lib/librte_vhost/vhost_vec_avx.c
> +++ b/lib/librte_vhost/vhost_vec_avx.c
> @@ -35,9 +35,15 @@
> #define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) << FLAGS_BITS_OFFSET)
> #define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
> FLAGS_BITS_OFFSET)
> +#define PACKED_WRITE_AVAIL_FLAG (PACKED_AVAIL_FLAG | \
> + ((0ULL | VRING_DESC_F_WRITE) << FLAGS_BITS_OFFSET))
> +#define PACKED_WRITE_AVAIL_FLAG_WRAP (PACKED_AVAIL_FLAG_WRAP | \
> + ((0ULL | VRING_DESC_F_WRITE) << FLAGS_BITS_OFFSET))
>
> #define DESC_FLAGS_POS 0xaa
> #define MBUF_LENS_POS 0x6666
> +#define DESC_LENS_POS 0x4444
> +#define DESC_LENS_FLAGS_POS 0xB0B0B0B0
>
> int
> vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> @@ -179,3 +185,154 @@ vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
>
> return -1;
> }
> +
> +int
> +virtio_dev_rx_batch_packed_avx(struct virtio_net *dev,
> + struct vhost_virtqueue *vq,
> + struct rte_mbuf **pkts)
> +{
> + struct vring_packed_desc *descs = vq->desc_packed;
> + uint16_t avail_idx = vq->last_avail_idx;
> + uint64_t desc_addrs[PACKED_BATCH_SIZE];
> + uint32_t buf_offset = dev->vhost_hlen;
> + uint32_t desc_status;
> + uint64_t lens[PACKED_BATCH_SIZE];
> + uint16_t i;
> + void *desc_addr;
> + uint8_t cmp_low, cmp_high, cmp_result;
> +
> + if (unlikely(avail_idx & PACKED_BATCH_MASK))
> + return -1;
Same comment as for patch 4. Packed ring size may not be a pow2.
> + /* check refcnt and nb_segs */
> + __m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA);
> +
> + /* load four mbufs rearm data */
> + __m256i mbufs = _mm256_set_epi64x(
> + *pkts[3]->rearm_data,
> + *pkts[2]->rearm_data,
> + *pkts[1]->rearm_data,
> + *pkts[0]->rearm_data);
> +
> + uint16_t cmp = _mm256_cmpneq_epu16_mask(mbufs, mbuf_ref);
> + if (cmp & MBUF_LENS_POS)
> + return -1;
> +
> + /* check desc status */
> + desc_addr = &vq->desc_packed[avail_idx];
> + __m512i desc_vec = _mm512_loadu_si512(desc_addr);
> +
> + __m512i avail_flag_vec;
> + __m512i used_flag_vec;
> + if (vq->avail_wrap_counter) {
> +#if defined(RTE_ARCH_I686)
Is supporting AVX512 on i686 really useful/necessary?
> + avail_flag_vec = _mm512_set4_epi64(PACKED_WRITE_AVAIL_FLAG,
> + 0x0, PACKED_WRITE_AVAIL_FLAG, 0x0);
> + used_flag_vec = _mm512_set4_epi64(PACKED_FLAGS_MASK, 0x0,
> + PACKED_FLAGS_MASK, 0x0);
> +#else
> + avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> + PACKED_WRITE_AVAIL_FLAG);
> + used_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> + PACKED_FLAGS_MASK);
> +#endif
> + } else {
> +#if defined(RTE_ARCH_I686)
> + avail_flag_vec = _mm512_set4_epi64(
> + PACKED_WRITE_AVAIL_FLAG_WRAP, 0x0,
> + PACKED_WRITE_AVAIL_FLAG, 0x0);
> + used_flag_vec = _mm512_set4_epi64(0x0, 0x0, 0x0, 0x0);
> +#else
> + avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> + PACKED_WRITE_AVAIL_FLAG_WRAP);
> + used_flag_vec = _mm512_setzero_epi32();
> +#endif
> + }
> +
> + desc_status = _mm512_mask_cmp_epu16_mask(BATCH_FLAGS_MASK, desc_vec,
> + avail_flag_vec, _MM_CMPINT_NE);
> + if (desc_status)
> + return -1;
> +
> + if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) {
> + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> + uint64_t size = (uint64_t)descs[avail_idx + i].len;
> + desc_addrs[i] = __vhost_iova_to_vva(dev, vq,
> + descs[avail_idx + i].addr, &size,
> + VHOST_ACCESS_RW);
> +
> + if (!desc_addrs[i])
> + return -1;
> +
> + rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[i], void *,
> + 0));
> + }
> + } else {
> + /* check buffer fit into one region & translate address */
> + __m512i regions_low_addrs =
> + _mm512_loadu_si512((void *)&dev->regions_low_addrs);
> + __m512i regions_high_addrs =
> + _mm512_loadu_si512((void *)&dev->regions_high_addrs);
> + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> + uint64_t addr_low = descs[avail_idx + i].addr;
> + uint64_t addr_high = addr_low +
> + descs[avail_idx + i].len;
> + __m512i low_addr_vec = _mm512_set1_epi64(addr_low);
> + __m512i high_addr_vec = _mm512_set1_epi64(addr_high);
> +
> + cmp_low = _mm512_cmp_epi64_mask(low_addr_vec,
> + regions_low_addrs, _MM_CMPINT_NLT);
> + cmp_high = _mm512_cmp_epi64_mask(high_addr_vec,
> + regions_high_addrs, _MM_CMPINT_LT);
> + cmp_result = cmp_low & cmp_high;
> + int index = __builtin_ctz(cmp_result);
> + if (unlikely((uint32_t)index >= dev->mem->nregions))
> + return -1;
> +
> + desc_addrs[i] = addr_low +
> + dev->mem->regions[index].host_user_addr -
> + dev->mem->regions[index].guest_phys_addr;
> + rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[i], void *,
> + 0));
> + }
> + }
> +
> + /* check length is enough */
> + __m512i pkt_lens = _mm512_set_epi32(
> + 0, pkts[3]->pkt_len, 0, 0,
> + 0, pkts[2]->pkt_len, 0, 0,
> + 0, pkts[1]->pkt_len, 0, 0,
> + 0, pkts[0]->pkt_len, 0, 0);
> +
> + __m512i mbuf_len_offset = _mm512_maskz_set1_epi32(DESC_LENS_POS,
> + dev->vhost_hlen);
> + __m512i buf_len_vec = _mm512_add_epi32(pkt_lens, mbuf_len_offset);
> + uint16_t lens_cmp = _mm512_mask_cmp_epu32_mask(DESC_LENS_POS,
> + desc_vec, buf_len_vec, _MM_CMPINT_LT);
> + if (lens_cmp)
> + return -1;
> +
> + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> + rte_memcpy((void *)(uintptr_t)(desc_addrs[i] + buf_offset),
> + rte_pktmbuf_mtod_offset(pkts[i], void *, 0),
> + pkts[i]->pkt_len);
> + }
> +
> + if (unlikely((dev->features & (1ULL << VHOST_F_LOG_ALL)))) {
> + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> + lens[i] = descs[avail_idx + i].len;
> + vhost_log_cache_write_iova(dev, vq,
> + descs[avail_idx + i].addr, lens[i]);
> + }
> + }
> +
> + vq_inc_last_avail_packed(vq, PACKED_BATCH_SIZE);
> + vq_inc_last_used_packed(vq, PACKED_BATCH_SIZE);
> + /* save len and flags, skip addr and id */
> + __m512i desc_updated = _mm512_mask_add_epi16(desc_vec,
> + DESC_LENS_FLAGS_POS, buf_len_vec,
> + used_flag_vec);
> + _mm512_storeu_si512(desc_addr, desc_updated);
> +
> + return 0;
> +}
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index e4d2e2e7d6..5c56a8d6ff 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -1354,6 +1354,21 @@ virtio_dev_rx_single_packed(struct virtio_net *dev,
> return 0;
> }
>
> +static __rte_always_inline int
> +virtio_dev_rx_handle_batch_packed(struct virtio_net *dev,
> + struct vhost_virtqueue *vq,
> + struct rte_mbuf **pkts)
> +
> +{
> + if (unlikely(dev->vectorized))
> +#ifdef CC_AVX512_SUPPORT
> + return virtio_dev_rx_batch_packed_avx(dev, vq, pkts);
> +#else
> + return virtio_dev_rx_batch_packed(dev, vq, pkts);
> +#endif
> + return virtio_dev_rx_batch_packed(dev, vq, pkts);
It should be as below to not have any performance impact when
CC_AVX512_SUPPORT is not set:
#ifdef CC_AVX512_SUPPORT
if (unlikely(dev->vectorized))
return virtio_dev_rx_batch_packed_avx(dev, vq, pkts);
#else
return virtio_dev_rx_batch_packed(dev, vq, pkts);
#endif
> +}
> +
> static __rte_noinline uint32_t
> virtio_dev_rx_packed(struct virtio_net *dev,
> struct vhost_virtqueue *__rte_restrict vq,
> @@ -1367,8 +1382,8 @@ virtio_dev_rx_packed(struct virtio_net *dev,
> rte_prefetch0(&vq->desc_packed[vq->last_avail_idx]);
>
> if (remained >= PACKED_BATCH_SIZE) {
> - if (!virtio_dev_rx_batch_packed(dev, vq,
> - &pkts[pkt_idx])) {
> + if (!virtio_dev_rx_handle_batch_packed(dev, vq,
> + &pkts[pkt_idx])) {
> pkt_idx += PACKED_BATCH_SIZE;
> remained -= PACKED_BATCH_SIZE;
> continue;
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v2 3/5] vhost: prepare memory regions addresses
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 3/5] vhost: prepare memory regions addresses Marvin Liu
@ 2020-10-06 15:06 ` Maxime Coquelin
0 siblings, 0 replies; 36+ messages in thread
From: Maxime Coquelin @ 2020-10-06 15:06 UTC (permalink / raw)
To: Marvin Liu, chenbo.xia, zhihong.wang; +Cc: dev
On 9/21/20 8:48 AM, Marvin Liu wrote:
> Prepare memory regions guest physical addresses for vectorized data
> path. These information will be utilized by SIMD instructions to find
> matched region index.
>
> Signed-off-by: Marvin Liu <yong.liu@intel.com>
>
> diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> index 5a5c945551..4a81f18f01 100644
> --- a/lib/librte_vhost/vhost.h
> +++ b/lib/librte_vhost/vhost.h
> @@ -52,6 +52,8 @@
>
> #define ASYNC_MAX_POLL_SEG 255
>
> +#define MAX_NREGIONS 8
> +
> #define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST * 2)
> #define VHOST_MAX_ASYNC_VEC (BUF_VECTOR_MAX * 2)
>
> @@ -375,6 +377,8 @@ struct inflight_mem_info {
> struct virtio_net {
> /* Frontend (QEMU) memory and memory region information */
> struct rte_vhost_memory *mem;
> + uint64_t regions_low_addrs[MAX_NREGIONS];
> + uint64_t regions_high_addrs[MAX_NREGIONS];
It eats two cache lines, so it would be better to have it in a dedicated,
dynamically allocated structure.
It would be better for the non-vectorized path, as it would avoid polluting
the cache with useless data in that case. And it would be better for the
vectorized path too, as when the datapath needs to use it, it would touch
exactly two cache lines instead of three.
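One possible shape for such a structure, just to illustrate the idea (the
struct and field names are hypothetical):

/* Hypothetical container for the region boundaries used by the
 * vectorized path, allocated only when the vectorized option is set so
 * that struct virtio_net stays compact for the default path.
 */
struct vhost_vec_mem_regions {
	uint64_t low_addrs[MAX_NREGIONS];  /* guest_phys_addr of each region */
	uint64_t high_addrs[MAX_NREGIONS]; /* guest_phys_addr + memory_size */
};

/* In struct virtio_net, the two inline arrays would then become: */
struct vhost_vec_mem_regions *vec_regions; /* NULL unless dev->vectorized */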
> uint64_t features;
> uint64_t protocol_features;
> int vid;
> diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
> index c3c924faec..89e75e9e71 100644
> --- a/lib/librte_vhost/vhost_user.c
> +++ b/lib/librte_vhost/vhost_user.c
> @@ -1291,6 +1291,17 @@ vhost_user_set_mem_table(struct virtio_net **pdev, struct VhostUserMsg *msg,
> }
> }
>
> + RTE_BUILD_BUG_ON(VHOST_MEMORY_MAX_NREGIONS != 8);
> + if (dev->vectorized) {
> + for (i = 0; i < memory->nregions; i++) {
> + dev->regions_low_addrs[i] =
> + memory->regions[i].guest_phys_addr;
> + dev->regions_high_addrs[i] =
> + memory->regions[i].guest_phys_addr +
> + memory->regions[i].memory_size;
> + }
> + }
> +
> for (i = 0; i < dev->nr_vring; i++) {
> struct vhost_virtqueue *vq = dev->virtqueue[i];
>
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v2 4/5] vhost: add packed ring vectorized dequeue
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 4/5] vhost: add packed ring vectorized dequeue Marvin Liu
2020-10-06 14:59 ` Maxime Coquelin
@ 2020-10-06 15:18 ` Maxime Coquelin
2020-10-09 7:59 ` Liu, Yong
1 sibling, 1 reply; 36+ messages in thread
From: Maxime Coquelin @ 2020-10-06 15:18 UTC (permalink / raw)
To: Marvin Liu, chenbo.xia, zhihong.wang; +Cc: dev
On 9/21/20 8:48 AM, Marvin Liu wrote:
> Optimize vhost packed ring dequeue path with SIMD instructions. Four
> descriptors status check and writeback are batched handled with AVX512
> instructions. Address translation operations are also accelerated by
> AVX512 instructions.
>
> If platform or compiler not support vectorization, will fallback to
> default path.
>
> Signed-off-by: Marvin Liu <yong.liu@intel.com>
>
> diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
> index cc9aa65c67..c1481802d7 100644
> --- a/lib/librte_vhost/meson.build
> +++ b/lib/librte_vhost/meson.build
> @@ -8,6 +8,22 @@ endif
> if has_libnuma == 1
> dpdk_conf.set10('RTE_LIBRTE_VHOST_NUMA', true)
> endif
> +
> +if arch_subdir == 'x86'
> + if not machine_args.contains('-mno-avx512f')
> + if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw')
> + cflags += ['-DCC_AVX512_SUPPORT']
> + vhost_avx512_lib = static_library('vhost_avx512_lib',
> + 'vhost_vec_avx.c',
> + dependencies: [static_rte_eal, static_rte_mempool,
> + static_rte_mbuf, static_rte_ethdev, static_rte_net],
> + include_directories: includes,
> + c_args: [cflags, '-mavx512f', '-mavx512bw', '-mavx512vl'])
> + objs += vhost_avx512_lib.extract_objects('vhost_vec_avx.c')
> + endif
> + endif
> +endif
> +
> if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
> cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
> elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
> diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> index 4a81f18f01..fc7daf2145 100644
> --- a/lib/librte_vhost/vhost.h
> +++ b/lib/librte_vhost/vhost.h
> @@ -1124,4 +1124,12 @@ virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
> return NULL;
> }
>
> +int
> +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> + struct vhost_virtqueue *vq,
> + struct rte_mempool *mbuf_pool,
> + struct rte_mbuf **pkts,
> + uint16_t avail_idx,
> + uintptr_t *desc_addrs,
> + uint16_t *ids);
> #endif /* _VHOST_NET_CDEV_H_ */
> diff --git a/lib/librte_vhost/vhost_vec_avx.c b/lib/librte_vhost/vhost_vec_avx.c
> new file mode 100644
> index 0000000000..dc5322d002
> --- /dev/null
> +++ b/lib/librte_vhost/vhost_vec_avx.c
> @@ -0,0 +1,181 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2016 Intel Corporation
> + */
> +#include <stdint.h>
> +
> +#include "vhost.h"
> +
> +#define BYTE_SIZE 8
> +/* reference count offset in mbuf rearm data */
> +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
> + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> +/* segment number offset in mbuf rearm data */
> +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
> + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> +
> +/* default rearm data */
> +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
> + 1ULL << REFCNT_BITS_OFFSET)
> +
> +#define DESC_FLAGS_SHORT_OFFSET (offsetof(struct vring_packed_desc, flags) / \
> + sizeof(uint16_t))
> +
> +#define DESC_FLAGS_SHORT_SIZE (sizeof(struct vring_packed_desc) / \
> + sizeof(uint16_t))
> +#define BATCH_FLAGS_MASK (1 << DESC_FLAGS_SHORT_OFFSET | \
> + 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE) | \
> + 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 2) | \
> + 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 3))
> +
> +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
> + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
> +
> +#define PACKED_FLAGS_MASK ((0ULL | VRING_DESC_F_AVAIL | VRING_DESC_F_USED) \
> + << FLAGS_BITS_OFFSET)
> +#define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) << FLAGS_BITS_OFFSET)
> +#define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
> + FLAGS_BITS_OFFSET)
> +
> +#define DESC_FLAGS_POS 0xaa
> +#define MBUF_LENS_POS 0x6666
> +
> +int
> +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> + struct vhost_virtqueue *vq,
> + struct rte_mempool *mbuf_pool,
> + struct rte_mbuf **pkts,
> + uint16_t avail_idx,
> + uintptr_t *desc_addrs,
> + uint16_t *ids)
> +{
> + struct vring_packed_desc *descs = vq->desc_packed;
> + uint32_t descs_status;
> + void *desc_addr;
> + uint16_t i;
> + uint8_t cmp_low, cmp_high, cmp_result;
> + uint64_t lens[PACKED_BATCH_SIZE];
> + struct virtio_net_hdr *hdr;
> +
> + if (unlikely(avail_idx & PACKED_BATCH_MASK))
> + return -1;
> +
> + /* load 4 descs */
> + desc_addr = &vq->desc_packed[avail_idx];
> + __m512i desc_vec = _mm512_loadu_si512(desc_addr);
> +
> + /* burst check four status */
> + __m512i avail_flag_vec;
> + if (vq->avail_wrap_counter)
> +#if defined(RTE_ARCH_I686)
> + avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG, 0x0,
> + PACKED_FLAGS_MASK, 0x0);
> +#else
> + avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> + PACKED_AVAIL_FLAG);
> +
> +#endif
> + else
> +#if defined(RTE_ARCH_I686)
> + avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG_WRAP,
> + 0x0, PACKED_AVAIL_FLAG_WRAP, 0x0);
> +#else
> + avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> + PACKED_AVAIL_FLAG_WRAP);
> +#endif
> +
> + descs_status = _mm512_cmp_epu16_mask(desc_vec, avail_flag_vec,
> + _MM_CMPINT_NE);
> + if (descs_status & BATCH_FLAGS_MASK)
> + return -1;
> +
Also, please try to factorize the code to avoid duplication between the Tx
and Rx paths for descriptor address translation:
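For illustration, such a shared helper could look like the sketch below.
The helper name is hypothetical and the body is lifted from the non-vIOMMU
else branch of the block quoted next, using the regions_low_addrs and
regions_high_addrs arrays introduced in patch 3:

/* Hypothetical shared helper for the non-vIOMMU case: find the region
 * that fully contains [addr, addr + len) and translate GPA to HVA.
 */
static __rte_always_inline int
vhost_vec_gpa_to_vva(struct virtio_net *dev, uint64_t addr, uint64_t len,
		uintptr_t *vva)
{
	__m512i low = _mm512_loadu_si512((void *)&dev->regions_low_addrs);
	__m512i high = _mm512_loadu_si512((void *)&dev->regions_high_addrs);
	__m512i addr_low_vec = _mm512_set1_epi64(addr);
	__m512i addr_high_vec = _mm512_set1_epi64(addr + len);
	uint8_t cmp_result;
	int index;

	cmp_result = _mm512_cmp_epi64_mask(addr_low_vec, low, _MM_CMPINT_NLT) &
		_mm512_cmp_epi64_mask(addr_high_vec, high, _MM_CMPINT_LT);
	if (unlikely(cmp_result == 0))
		return -1;

	index = __builtin_ctz(cmp_result);
	if (unlikely((uint32_t)index >= dev->mem->nregions))
		return -1;

	*vva = (uintptr_t)(addr + dev->mem->regions[index].host_user_addr -
			dev->mem->regions[index].guest_phys_addr);
	return 0;
}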
> + if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) {
> + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> + uint64_t size = (uint64_t)descs[avail_idx + i].len;
> + desc_addrs[i] = __vhost_iova_to_vva(dev, vq,
> + descs[avail_idx + i].addr, &size,
> + VHOST_ACCESS_RO);
> +
> + if (!desc_addrs[i])
> + goto free_buf;
> + lens[i] = descs[avail_idx + i].len;
> + rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> +
> + pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool,
> + lens[i]);
> + if (!pkts[i])
> + goto free_buf;
> + }
> + } else {
> + /* check buffer fit into one region & translate address */
> + __m512i regions_low_addrs =
> + _mm512_loadu_si512((void *)&dev->regions_low_addrs);
> + __m512i regions_high_addrs =
> + _mm512_loadu_si512((void *)&dev->regions_high_addrs);
> + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> + uint64_t addr_low = descs[avail_idx + i].addr;
> + uint64_t addr_high = addr_low +
> + descs[avail_idx + i].len;
> + __m512i low_addr_vec = _mm512_set1_epi64(addr_low);
> + __m512i high_addr_vec = _mm512_set1_epi64(addr_high);
> +
> + cmp_low = _mm512_cmp_epi64_mask(low_addr_vec,
> + regions_low_addrs, _MM_CMPINT_NLT);
> + cmp_high = _mm512_cmp_epi64_mask(high_addr_vec,
> + regions_high_addrs, _MM_CMPINT_LT);
> + cmp_result = cmp_low & cmp_high;
> + int index = __builtin_ctz(cmp_result);
> + if (unlikely((uint32_t)index >= dev->mem->nregions))
> + goto free_buf;
> +
> + desc_addrs[i] = addr_low +
> + dev->mem->regions[index].host_user_addr -
> + dev->mem->regions[index].guest_phys_addr;
> + lens[i] = descs[avail_idx + i].len;
> + rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> +
> + pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool,
> + lens[i]);
> + if (!pkts[i])
> + goto free_buf;
> + }
> + }
> +
> + if (virtio_net_with_host_offload(dev)) {
> + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> + hdr = (struct virtio_net_hdr *)(desc_addrs[i]);
> + vhost_dequeue_offload(hdr, pkts[i]);
> + }
> + }
> +
> + if (unlikely(virtio_net_is_inorder(dev))) {
> + ids[PACKED_BATCH_SIZE - 1] =
> + descs[avail_idx + PACKED_BATCH_SIZE - 1].id;
> + } else {
> + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
> + ids[i] = descs[avail_idx + i].id;
> + }
> +
> + uint64_t addrs[PACKED_BATCH_SIZE << 1];
> + /* store mbuf data_len, pkt_len */
> + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> + addrs[i << 1] = (uint64_t)pkts[i]->rx_descriptor_fields1;
> + addrs[(i << 1) + 1] = (uint64_t)pkts[i]->rx_descriptor_fields1
> + + sizeof(uint64_t);
> + }
> +
> + /* save pkt_len and data_len into mbufs */
> + __m512i value_vec = _mm512_maskz_shuffle_epi32(MBUF_LENS_POS, desc_vec,
> + 0xAA);
> + __m512i offsets_vec = _mm512_maskz_set1_epi32(MBUF_LENS_POS,
> + (uint32_t)-12);
> + value_vec = _mm512_add_epi32(value_vec, offsets_vec);
> + __m512i vindex = _mm512_loadu_si512((void *)addrs);
> + _mm512_i64scatter_epi64(0, vindex, value_vec, 1);
> +
> + return 0;
> +free_buf:
> + for (i = 0; i < PACKED_BATCH_SIZE; i++)
> + rte_pktmbuf_free(pkts[i]);
> +
> + return -1;
> +}
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index 6107662685..e4d2e2e7d6 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -2249,6 +2249,28 @@ vhost_reserve_avail_batch_packed(struct virtio_net *dev,
> return -1;
> }
>
> +static __rte_always_inline int
> +vhost_handle_avail_batch_packed(struct virtio_net *dev,
> + struct vhost_virtqueue *vq,
> + struct rte_mempool *mbuf_pool,
> + struct rte_mbuf **pkts,
> + uint16_t avail_idx,
> + uintptr_t *desc_addrs,
> + uint16_t *ids)
> +{
> + if (unlikely(dev->vectorized))
> +#ifdef CC_AVX512_SUPPORT
> + return vhost_reserve_avail_batch_packed_avx(dev, vq, mbuf_pool,
> + pkts, avail_idx, desc_addrs, ids);
> +#else
> + return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool,
> + pkts, avail_idx, desc_addrs, ids);
> +
> +#endif
> + return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
> + avail_idx, desc_addrs, ids);
> +}
> +
> static __rte_always_inline int
> virtio_dev_tx_batch_packed(struct virtio_net *dev,
> struct vhost_virtqueue *vq,
> @@ -2261,8 +2283,9 @@ virtio_dev_tx_batch_packed(struct virtio_net *dev,
> uint16_t ids[PACKED_BATCH_SIZE];
> uint16_t i;
>
> - if (vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
> - avail_idx, desc_addrs, ids))
> +
> + if (vhost_handle_avail_batch_packed(dev, vq, mbuf_pool, pkts,
> + avail_idx, desc_addrs, ids))
> return -1;
>
> vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/5] vhost add vectorized data path
2020-10-06 13:34 ` [dpdk-dev] [PATCH v2 0/5] vhost add vectorized data path Maxime Coquelin
@ 2020-10-08 6:20 ` Liu, Yong
0 siblings, 0 replies; 36+ messages in thread
From: Liu, Yong @ 2020-10-08 6:20 UTC (permalink / raw)
To: Maxime Coquelin, Xia, Chenbo, Wang, Zhihong; +Cc: dev
> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Tuesday, October 6, 2020 9:34 PM
> To: Liu, Yong <yong.liu@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v2 0/5] vhost add vectorized data path
>
> Hi,
>
> On 9/21/20 8:48 AM, Marvin Liu wrote:
> > Packed ring format is imported since virtio spec 1.1. All descriptors
> > are compacted into one single ring when packed ring format is on. It is
> > straight forward that ring operations can be accelerated by utilizing
> > SIMD instructions.
> >
> > This patch set will introduce vectorized data path in vhost library. If
> > vectorized option is on, operations like descs check, descs writeback,
> > address translation will be accelerated by SIMD instructions. Vhost
> > application can choose whether using vectorized acceleration, it is
> > like external buffer and zero copy features.
> >
> > If platform or ring format not support vectorized function, vhost will
> > fallback to use default batch function. There will be no impact in current
> > data path.
>
> As a pre-requisite, I'd like some performance numbers in both loopback
> and PVP to figure out if adding such complexity is worth it, given we
> will have to support it for at least one year.
>
Thanks for suggestion, will add some reference numbers in next version.
> Thanks,
> Maxime
>
> > v2:
> > * add vIOMMU support
> > * add dequeue offloading
> > * rebase code
> >
> > Marvin Liu (5):
> > vhost: add vectorized data path
> > vhost: reuse packed ring functions
> > vhost: prepare memory regions addresses
> > vhost: add packed ring vectorized dequeue
> > vhost: add packed ring vectorized enqueue
> >
> > doc/guides/nics/vhost.rst | 5 +
> > doc/guides/prog_guide/vhost_lib.rst | 12 +
> > drivers/net/vhost/rte_eth_vhost.c | 17 +-
> > lib/librte_vhost/meson.build | 16 ++
> > lib/librte_vhost/rte_vhost.h | 1 +
> > lib/librte_vhost/socket.c | 5 +
> > lib/librte_vhost/vhost.c | 11 +
> > lib/librte_vhost/vhost.h | 235 +++++++++++++++++++
> > lib/librte_vhost/vhost_user.c | 11 +
> > lib/librte_vhost/vhost_vec_avx.c | 338
> ++++++++++++++++++++++++++++
> > lib/librte_vhost/virtio_net.c | 257 ++++-----------------
> > 11 files changed, 692 insertions(+), 216 deletions(-)
> > create mode 100644 lib/librte_vhost/vhost_vec_avx.c
> >
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v2 4/5] vhost: add packed ring vectorized dequeue
2020-10-06 14:59 ` Maxime Coquelin
@ 2020-10-08 7:05 ` Liu, Yong
0 siblings, 0 replies; 36+ messages in thread
From: Liu, Yong @ 2020-10-08 7:05 UTC (permalink / raw)
To: Maxime Coquelin, Xia, Chenbo, Wang, Zhihong; +Cc: dev
> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Tuesday, October 6, 2020 10:59 PM
> To: Liu, Yong <yong.liu@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v2 4/5] vhost: add packed ring vectorized dequeue
>
>
>
> On 9/21/20 8:48 AM, Marvin Liu wrote:
> > Optimize vhost packed ring dequeue path with SIMD instructions. Four
> > descriptors status check and writeback are batched handled with AVX512
> > instructions. Address translation operations are also accelerated by
> > AVX512 instructions.
> >
> > If platform or compiler not support vectorization, will fallback to
> > default path.
> >
> > Signed-off-by: Marvin Liu <yong.liu@intel.com>
> >
> > diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
> > index cc9aa65c67..c1481802d7 100644
> > --- a/lib/librte_vhost/meson.build
> > +++ b/lib/librte_vhost/meson.build
> > @@ -8,6 +8,22 @@ endif
> > if has_libnuma == 1
> > dpdk_conf.set10('RTE_LIBRTE_VHOST_NUMA', true)
> > endif
> > +
> > +if arch_subdir == 'x86'
> > + if not machine_args.contains('-mno-avx512f')
> > + if cc.has_argument('-mavx512f') and cc.has_argument('-
> mavx512vl') and cc.has_argument('-mavx512bw')
> > + cflags += ['-DCC_AVX512_SUPPORT']
> > + vhost_avx512_lib = static_library('vhost_avx512_lib',
> > + 'vhost_vec_avx.c',
> > + dependencies: [static_rte_eal,
> static_rte_mempool,
> > + static_rte_mbuf, static_rte_ethdev,
> static_rte_net],
> > + include_directories: includes,
> > + c_args: [cflags, '-mavx512f', '-mavx512bw', '-
> mavx512vl'])
> > + objs += vhost_avx512_lib.extract_objects('vhost_vec_avx.c')
> > + endif
> > + endif
> > +endif
>
> Not a Meson expert, but I wonder how I can disable CC_AVX512_SUPPORT.
> I checked the DPDK doc, but I could not find how to pass -mno-avx512f to
> the machine_args.
Hi Maxime,
Right now the -mno-avx512f flag is set only if the binutils check script found the issue https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90028.
So the AVX512 code is built in whenever the compiler supports it. An alternative is to introduce a new option in the meson build.
Thanks,
Marvin
>
> > +
> > if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
> > cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
> > elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
> > diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> > index 4a81f18f01..fc7daf2145 100644
> > --- a/lib/librte_vhost/vhost.h
> > +++ b/lib/librte_vhost/vhost.h
> > @@ -1124,4 +1124,12 @@ virtio_dev_pktmbuf_alloc(struct virtio_net
> *dev, struct rte_mempool *mp,
> > return NULL;
> > }
> >
> > +int
> > +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> > + struct vhost_virtqueue *vq,
> > + struct rte_mempool *mbuf_pool,
> > + struct rte_mbuf **pkts,
> > + uint16_t avail_idx,
> > + uintptr_t *desc_addrs,
> > + uint16_t *ids);
> > #endif /* _VHOST_NET_CDEV_H_ */
> > diff --git a/lib/librte_vhost/vhost_vec_avx.c
> b/lib/librte_vhost/vhost_vec_avx.c
> > new file mode 100644
> > index 0000000000..dc5322d002
> > --- /dev/null
> > +++ b/lib/librte_vhost/vhost_vec_avx.c
>
> For consistency it should be prefixed with virtio_net, not vhost.
>
> > @@ -0,0 +1,181 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2010-2016 Intel Corporation
> > + */
> > +#include <stdint.h>
> > +
> > +#include "vhost.h"
> > +
> > +#define BYTE_SIZE 8
> > +/* reference count offset in mbuf rearm data */
> > +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
> > + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> > +/* segment number offset in mbuf rearm data */
> > +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
> > + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> > +
> > +/* default rearm data */
> > +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
> > + 1ULL << REFCNT_BITS_OFFSET)
> > +
> > +#define DESC_FLAGS_SHORT_OFFSET (offsetof(struct vring_packed_desc,
> flags) / \
> > + sizeof(uint16_t))
> > +
> > +#define DESC_FLAGS_SHORT_SIZE (sizeof(struct vring_packed_desc) / \
> > + sizeof(uint16_t))
> > +#define BATCH_FLAGS_MASK (1 << DESC_FLAGS_SHORT_OFFSET | \
> > + 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE) | \
> > + 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 2) |
> \
> > + 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 3))
> > +
> > +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
> > + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
> > +
> > +#define PACKED_FLAGS_MASK ((0ULL | VRING_DESC_F_AVAIL |
> VRING_DESC_F_USED) \
> > + << FLAGS_BITS_OFFSET)
> > +#define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) <<
> FLAGS_BITS_OFFSET)
> > +#define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
> > + FLAGS_BITS_OFFSET)
> > +
> > +#define DESC_FLAGS_POS 0xaa
> > +#define MBUF_LENS_POS 0x6666
> > +
> > +int
> > +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> > + struct vhost_virtqueue *vq,
> > + struct rte_mempool *mbuf_pool,
> > + struct rte_mbuf **pkts,
> > + uint16_t avail_idx,
> > + uintptr_t *desc_addrs,
> > + uint16_t *ids)
> > +{
> > + struct vring_packed_desc *descs = vq->desc_packed;
> > + uint32_t descs_status;
> > + void *desc_addr;
> > + uint16_t i;
> > + uint8_t cmp_low, cmp_high, cmp_result;
> > + uint64_t lens[PACKED_BATCH_SIZE];
> > + struct virtio_net_hdr *hdr;
> > +
> > + if (unlikely(avail_idx & PACKED_BATCH_MASK))
> > + return -1;
> > +
> > + /* load 4 descs */
> > + desc_addr = &vq->desc_packed[avail_idx];
> > + __m512i desc_vec = _mm512_loadu_si512(desc_addr);
>
> Unlike split ring, packed ring specification does not mandate the ring
> size to be a power of two. So checking avail_idx is aligned on 64 bytes
> is not enough given a descriptor is 16 bytes.
>
> You need to also check against ring size to prevent out of bounds
> accesses.
>
> I see the non-vectorized batch processing you introduced in v19.11 also
> makes that wrong assumption. Please fix it.
>
> Also, I wonder whether it is assumed that &vq->desc_packed[avail_idx];
> is aligned on a cache-line. Meaning, does below intrinsics have such a
> requirement?
>
Got it, the packed ring size may be an arbitrary number. In v19.11 the batch handling function already checks that the available index is not oversized.
I forgot that check in the vectorized path, will fix it in the next release.
In the vectorized path the load intrinsic _mm512_loadu_si512 does not need cache-aligned memory, so there is no special alignment requirement.
> > + /* burst check four status */
> > + __m512i avail_flag_vec;
> > + if (vq->avail_wrap_counter)
> > +#if defined(RTE_ARCH_I686)
> > + avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG,
> 0x0,
> > + PACKED_FLAGS_MASK, 0x0);
> > +#else
> > + avail_flag_vec =
> _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> > + PACKED_AVAIL_FLAG);
> > +
> > +#endif
> > + else
> > +#if defined(RTE_ARCH_I686)
> > + avail_flag_vec =
> _mm512_set4_epi64(PACKED_AVAIL_FLAG_WRAP,
> > + 0x0, PACKED_AVAIL_FLAG_WRAP,
> 0x0);
> > +#else
> > + avail_flag_vec =
> _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> > + PACKED_AVAIL_FLAG_WRAP);
> > +#endif
> > +
> > + descs_status = _mm512_cmp_epu16_mask(desc_vec, avail_flag_vec,
> > + _MM_CMPINT_NE);
> > + if (descs_status & BATCH_FLAGS_MASK)
> > + return -1;
> > +
> > + if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) {
> > + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > + uint64_t size = (uint64_t)descs[avail_idx + i].len;
> > + desc_addrs[i] = __vhost_iova_to_vva(dev, vq,
> > + descs[avail_idx + i].addr, &size,
> > + VHOST_ACCESS_RO);
> > +
> > + if (!desc_addrs[i])
> > + goto free_buf;
> > + lens[i] = descs[avail_idx + i].len;
> > + rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> > +
> > + pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool,
> > + lens[i]);
> > + if (!pkts[i])
> > + goto free_buf;
> > + }
> > + } else {
> > + /* check buffer fit into one region & translate address */
> > + __m512i regions_low_addrs =
> > + _mm512_loadu_si512((void *)&dev-
> >regions_low_addrs);
> > + __m512i regions_high_addrs =
> > + _mm512_loadu_si512((void *)&dev-
> >regions_high_addrs);
> > + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > + uint64_t addr_low = descs[avail_idx + i].addr;
> > + uint64_t addr_high = addr_low +
> > + descs[avail_idx + i].len;
> > + __m512i low_addr_vec =
> _mm512_set1_epi64(addr_low);
> > + __m512i high_addr_vec =
> _mm512_set1_epi64(addr_high);
> > +
> > + cmp_low =
> _mm512_cmp_epi64_mask(low_addr_vec,
> > + regions_low_addrs,
> _MM_CMPINT_NLT);
> > + cmp_high =
> _mm512_cmp_epi64_mask(high_addr_vec,
> > + regions_high_addrs,
> _MM_CMPINT_LT);
> > + cmp_result = cmp_low & cmp_high;
> > + int index = __builtin_ctz(cmp_result);
> > + if (unlikely((uint32_t)index >= dev->mem->nregions))
> > + goto free_buf;
> > +
> > + desc_addrs[i] = addr_low +
> > + dev->mem->regions[index].host_user_addr -
> > + dev->mem->regions[index].guest_phys_addr;
> > + lens[i] = descs[avail_idx + i].len;
> > + rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> > +
> > + pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool,
> > + lens[i]);
> > + if (!pkts[i])
> > + goto free_buf;
> > + }
> > + }
> > +
> > + if (virtio_net_with_host_offload(dev)) {
> > + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > + hdr = (struct virtio_net_hdr *)(desc_addrs[i]);
> > + vhost_dequeue_offload(hdr, pkts[i]);
> > + }
> > + }
> > +
> > + if (unlikely(virtio_net_is_inorder(dev))) {
> > + ids[PACKED_BATCH_SIZE - 1] =
> > + descs[avail_idx + PACKED_BATCH_SIZE - 1].id;
>
> Isn't in-order a likely case? Maybe just remove the unlikely.
>
The in-order option depends on feature negotiation, will remove the unlikely.
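For reference, the in-order check boils down to a negotiated feature bit,
so its outcome is fixed for the lifetime of the device; a sketch of what
the existing helper in vhost.h amounts to:

static __rte_always_inline bool
virtio_net_is_inorder(struct virtio_net *dev)
{
	/* Decided once at feature negotiation time, not per packet. */
	return dev->features & (1ULL << VIRTIO_F_IN_ORDER);
}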
> > + } else {
> > + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
> > + ids[i] = descs[avail_idx + i].id;
> > + }
> > +
> > + uint64_t addrs[PACKED_BATCH_SIZE << 1];
> > + /* store mbuf data_len, pkt_len */
> > + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > + addrs[i << 1] = (uint64_t)pkts[i]->rx_descriptor_fields1;
> > + addrs[(i << 1) + 1] = (uint64_t)pkts[i]->rx_descriptor_fields1
> > + + sizeof(uint64_t);
> > + }
> > +
> > + /* save pkt_len and data_len into mbufs */
> > + __m512i value_vec =
> _mm512_maskz_shuffle_epi32(MBUF_LENS_POS, desc_vec,
> > + 0xAA);
> > + __m512i offsets_vec = _mm512_maskz_set1_epi32(MBUF_LENS_POS,
> > + (uint32_t)-12);
> > + value_vec = _mm512_add_epi32(value_vec, offsets_vec);
> > + __m512i vindex = _mm512_loadu_si512((void *)addrs);
> > + _mm512_i64scatter_epi64(0, vindex, value_vec, 1);
> > +
> > + return 0;
> > +free_buf:
> > + for (i = 0; i < PACKED_BATCH_SIZE; i++)
> > + rte_pktmbuf_free(pkts[i]);
> > +
> > + return -1;
> > +}
> > diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> > index 6107662685..e4d2e2e7d6 100644
> > --- a/lib/librte_vhost/virtio_net.c
> > +++ b/lib/librte_vhost/virtio_net.c
> > @@ -2249,6 +2249,28 @@ vhost_reserve_avail_batch_packed(struct
> virtio_net *dev,
> > return -1;
> > }
> >
> > +static __rte_always_inline int
> > +vhost_handle_avail_batch_packed(struct virtio_net *dev,
> > + struct vhost_virtqueue *vq,
> > + struct rte_mempool *mbuf_pool,
> > + struct rte_mbuf **pkts,
> > + uint16_t avail_idx,
> > + uintptr_t *desc_addrs,
> > + uint16_t *ids)
> > +{
> > + if (unlikely(dev->vectorized))
> > +#ifdef CC_AVX512_SUPPORT
> > + return vhost_reserve_avail_batch_packed_avx(dev, vq,
> mbuf_pool,
> > + pkts, avail_idx, desc_addrs, ids);
> > +#else
> > + return vhost_reserve_avail_batch_packed(dev, vq,
> mbuf_pool,
> > + pkts, avail_idx, desc_addrs, ids);
> > +
> > +#endif
> > + return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
> > + avail_idx, desc_addrs, ids);
> > +}
>
>
> It should be as below to not have any performance impact when
> CC_AVX512_SUPPORT is not set:
>
> #ifdef CC_AVX512_SUPPORT
> if (unlikely(dev->vectorized))
> return vhost_reserve_avail_batch_packed_avx(dev, vq,
> mbuf_pool,
> pkts, avail_idx, desc_addrs, ids);
> #else
> return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
> avail_idx, desc_addrs, ids);
> #endif
Got it, will change it in the next release.
> > +
> > static __rte_always_inline int
> > virtio_dev_tx_batch_packed(struct virtio_net *dev,
> > struct vhost_virtqueue *vq,
> > @@ -2261,8 +2283,9 @@ virtio_dev_tx_batch_packed(struct virtio_net
> *dev,
> > uint16_t ids[PACKED_BATCH_SIZE];
> > uint16_t i;
> >
> > - if (vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
> > - avail_idx, desc_addrs, ids))
> > +
> > + if (vhost_handle_avail_batch_packed(dev, vq, mbuf_pool, pkts,
> > + avail_idx, desc_addrs, ids))
> > return -1;
> >
> > vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
> >
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v2 5/5] vhost: add packed ring vectorized enqueue
2020-10-06 15:00 ` Maxime Coquelin
@ 2020-10-08 7:09 ` Liu, Yong
0 siblings, 0 replies; 36+ messages in thread
From: Liu, Yong @ 2020-10-08 7:09 UTC (permalink / raw)
To: Maxime Coquelin, Xia, Chenbo, Wang, Zhihong; +Cc: dev
> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Tuesday, October 6, 2020 11:00 PM
> To: Liu, Yong <yong.liu@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v2 5/5] vhost: add packed ring vectorized enqueue
>
>
>
> On 9/21/20 8:48 AM, Marvin Liu wrote:
> > Optimize vhost packed ring enqueue path with SIMD instructions. Four
> > descriptors status and length are batched handled with AVX512
> > instructions. Address translation operations are also accelerated
> > by AVX512 instructions.
> >
> > Signed-off-by: Marvin Liu <yong.liu@intel.com>
> >
> > diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> > index fc7daf2145..b78b2c5c1b 100644
> > --- a/lib/librte_vhost/vhost.h
> > +++ b/lib/librte_vhost/vhost.h
> > @@ -1132,4 +1132,10 @@ vhost_reserve_avail_batch_packed_avx(struct
> virtio_net *dev,
> > uint16_t avail_idx,
> > uintptr_t *desc_addrs,
> > uint16_t *ids);
> > +
> > +int
> > +virtio_dev_rx_batch_packed_avx(struct virtio_net *dev,
> > + struct vhost_virtqueue *vq,
> > + struct rte_mbuf **pkts);
> > +
> > #endif /* _VHOST_NET_CDEV_H_ */
> > diff --git a/lib/librte_vhost/vhost_vec_avx.c
> b/lib/librte_vhost/vhost_vec_avx.c
> > index dc5322d002..7d2250ed86 100644
> > --- a/lib/librte_vhost/vhost_vec_avx.c
> > +++ b/lib/librte_vhost/vhost_vec_avx.c
> > @@ -35,9 +35,15 @@
> > #define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) <<
> FLAGS_BITS_OFFSET)
> > #define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
> > FLAGS_BITS_OFFSET)
> > +#define PACKED_WRITE_AVAIL_FLAG (PACKED_AVAIL_FLAG | \
> > + ((0ULL | VRING_DESC_F_WRITE) << FLAGS_BITS_OFFSET))
> > +#define PACKED_WRITE_AVAIL_FLAG_WRAP
> (PACKED_AVAIL_FLAG_WRAP | \
> > + ((0ULL | VRING_DESC_F_WRITE) << FLAGS_BITS_OFFSET))
> >
> > #define DESC_FLAGS_POS 0xaa
> > #define MBUF_LENS_POS 0x6666
> > +#define DESC_LENS_POS 0x4444
> > +#define DESC_LENS_FLAGS_POS 0xB0B0B0B0
> >
> > int
> > vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> > @@ -179,3 +185,154 @@ vhost_reserve_avail_batch_packed_avx(struct
> virtio_net *dev,
> >
> > return -1;
> > }
> > +
> > +int
> > +virtio_dev_rx_batch_packed_avx(struct virtio_net *dev,
> > + struct vhost_virtqueue *vq,
> > + struct rte_mbuf **pkts)
> > +{
> > + struct vring_packed_desc *descs = vq->desc_packed;
> > + uint16_t avail_idx = vq->last_avail_idx;
> > + uint64_t desc_addrs[PACKED_BATCH_SIZE];
> > + uint32_t buf_offset = dev->vhost_hlen;
> > + uint32_t desc_status;
> > + uint64_t lens[PACKED_BATCH_SIZE];
> > + uint16_t i;
> > + void *desc_addr;
> > + uint8_t cmp_low, cmp_high, cmp_result;
> > +
> > + if (unlikely(avail_idx & PACKED_BATCH_MASK))
> > + return -1;
>
> Same comment as for patch 4. Packed ring size may not be a pow2.
>
Thanks, will fix in next version.
> > + /* check refcnt and nb_segs */
> > + __m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA);
> > +
> > + /* load four mbufs rearm data */
> > + __m256i mbufs = _mm256_set_epi64x(
> > + *pkts[3]->rearm_data,
> > + *pkts[2]->rearm_data,
> > + *pkts[1]->rearm_data,
> > + *pkts[0]->rearm_data);
> > +
> > + uint16_t cmp = _mm256_cmpneq_epu16_mask(mbufs, mbuf_ref);
> > + if (cmp & MBUF_LENS_POS)
> > + return -1;
> > +
> > + /* check desc status */
> > + desc_addr = &vq->desc_packed[avail_idx];
> > + __m512i desc_vec = _mm512_loadu_si512(desc_addr);
> > +
> > + __m512i avail_flag_vec;
> > + __m512i used_flag_vec;
> > + if (vq->avail_wrap_counter) {
> > +#if defined(RTE_ARCH_I686)
>
> Is supporting AVX512 on i686 really useful/necessary?
>
It is useless from a functional point of view. It is only there so that the i686 build compiles successfully.
> > + avail_flag_vec =
> _mm512_set4_epi64(PACKED_WRITE_AVAIL_FLAG,
> > + 0x0, PACKED_WRITE_AVAIL_FLAG,
> 0x0);
> > + used_flag_vec = _mm512_set4_epi64(PACKED_FLAGS_MASK,
> 0x0,
> > + PACKED_FLAGS_MASK, 0x0);
> > +#else
> > + avail_flag_vec =
> _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> > + PACKED_WRITE_AVAIL_FLAG);
> > + used_flag_vec =
> _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> > + PACKED_FLAGS_MASK);
> > +#endif
> > + } else {
> > +#if defined(RTE_ARCH_I686)
> > + avail_flag_vec = _mm512_set4_epi64(
> > + PACKED_WRITE_AVAIL_FLAG_WRAP,
> 0x0,
> > + PACKED_WRITE_AVAIL_FLAG, 0x0);
> > + used_flag_vec = _mm512_set4_epi64(0x0, 0x0, 0x0, 0x0);
> > +#else
> > + avail_flag_vec =
> _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> > + PACKED_WRITE_AVAIL_FLAG_WRAP);
> > + used_flag_vec = _mm512_setzero_epi32();
> > +#endif
> > + }
> > +
> > + desc_status =
> _mm512_mask_cmp_epu16_mask(BATCH_FLAGS_MASK, desc_vec,
> > + avail_flag_vec, _MM_CMPINT_NE);
> > + if (desc_status)
> > + return -1;
> > +
> > + if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) {
> > + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > + uint64_t size = (uint64_t)descs[avail_idx + i].len;
> > + desc_addrs[i] = __vhost_iova_to_vva(dev, vq,
> > + descs[avail_idx + i].addr, &size,
> > + VHOST_ACCESS_RW);
> > +
> > + if (!desc_addrs[i])
> > + return -1;
> > +
> > + rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[i], void
> *,
> > + 0));
> > + }
> > + } else {
> > + /* check buffer fit into one region & translate address */
> > + __m512i regions_low_addrs =
> > + _mm512_loadu_si512((void *)&dev-
> >regions_low_addrs);
> > + __m512i regions_high_addrs =
> > + _mm512_loadu_si512((void *)&dev-
> >regions_high_addrs);
> > + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > + uint64_t addr_low = descs[avail_idx + i].addr;
> > + uint64_t addr_high = addr_low +
> > + descs[avail_idx + i].len;
> > + __m512i low_addr_vec =
> _mm512_set1_epi64(addr_low);
> > + __m512i high_addr_vec =
> _mm512_set1_epi64(addr_high);
> > +
> > + cmp_low =
> _mm512_cmp_epi64_mask(low_addr_vec,
> > + regions_low_addrs,
> _MM_CMPINT_NLT);
> > + cmp_high =
> _mm512_cmp_epi64_mask(high_addr_vec,
> > + regions_high_addrs,
> _MM_CMPINT_LT);
> > + cmp_result = cmp_low & cmp_high;
> > + int index = __builtin_ctz(cmp_result);
> > + if (unlikely((uint32_t)index >= dev->mem->nregions))
> > + return -1;
> > +
> > + desc_addrs[i] = addr_low +
> > + dev->mem->regions[index].host_user_addr -
> > + dev->mem->regions[index].guest_phys_addr;
> > + rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[i], void
> *,
> > + 0));
> > + }
> > + }
> > +
> > + /* check length is enough */
> > + __m512i pkt_lens = _mm512_set_epi32(
> > + 0, pkts[3]->pkt_len, 0, 0,
> > + 0, pkts[2]->pkt_len, 0, 0,
> > + 0, pkts[1]->pkt_len, 0, 0,
> > + 0, pkts[0]->pkt_len, 0, 0);
> > +
> > + __m512i mbuf_len_offset =
> _mm512_maskz_set1_epi32(DESC_LENS_POS,
> > + dev->vhost_hlen);
> > + __m512i buf_len_vec = _mm512_add_epi32(pkt_lens,
> mbuf_len_offset);
> > + uint16_t lens_cmp =
> _mm512_mask_cmp_epu32_mask(DESC_LENS_POS,
> > + desc_vec, buf_len_vec, _MM_CMPINT_LT);
> > + if (lens_cmp)
> > + return -1;
> > +
> > + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > + rte_memcpy((void *)(uintptr_t)(desc_addrs[i] + buf_offset),
> > + rte_pktmbuf_mtod_offset(pkts[i], void *, 0),
> > + pkts[i]->pkt_len);
> > + }
> > +
> > + if (unlikely((dev->features & (1ULL << VHOST_F_LOG_ALL)))) {
> > + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > + lens[i] = descs[avail_idx + i].len;
> > + vhost_log_cache_write_iova(dev, vq,
> > + descs[avail_idx + i].addr, lens[i]);
> > + }
> > + }
> > +
> > + vq_inc_last_avail_packed(vq, PACKED_BATCH_SIZE);
> > + vq_inc_last_used_packed(vq, PACKED_BATCH_SIZE);
> > + /* save len and flags, skip addr and id */
> > + __m512i desc_updated = _mm512_mask_add_epi16(desc_vec,
> > + DESC_LENS_FLAGS_POS, buf_len_vec,
> > + used_flag_vec);
> > + _mm512_storeu_si512(desc_addr, desc_updated);
> > +
> > + return 0;
> > +}
> > diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> > index e4d2e2e7d6..5c56a8d6ff 100644
> > --- a/lib/librte_vhost/virtio_net.c
> > +++ b/lib/librte_vhost/virtio_net.c
> > @@ -1354,6 +1354,21 @@ virtio_dev_rx_single_packed(struct virtio_net *dev,
> > return 0;
> > }
> >
> > +static __rte_always_inline int
> > +virtio_dev_rx_handle_batch_packed(struct virtio_net *dev,
> > + struct vhost_virtqueue *vq,
> > + struct rte_mbuf **pkts)
> > +
> > +{
> > + if (unlikely(dev->vectorized))
> > +#ifdef CC_AVX512_SUPPORT
> > + return virtio_dev_rx_batch_packed_avx(dev, vq, pkts);
> > +#else
> > + return virtio_dev_rx_batch_packed(dev, vq, pkts);
> > +#endif
> > + return virtio_dev_rx_batch_packed(dev, vq, pkts);
>
> It should be as below to not have any performance impact when
> CC_AVX512_SUPPORT is not set:
>
> #ifdef CC_AVX512_SUPPORT
> if (unlikely(dev->vectorized))
> return virtio_dev_rx_batch_packed_avx(dev, vq, pkts);
> #else
> return virtio_dev_rx_batch_packed(dev, vq, pkts);
> #endif
>
Got it, will fix in the next version.
> > +}
> > +
> > static __rte_noinline uint32_t
> > virtio_dev_rx_packed(struct virtio_net *dev,
> > struct vhost_virtqueue *__rte_restrict vq,
> > @@ -1367,8 +1382,8 @@ virtio_dev_rx_packed(struct virtio_net *dev,
> > rte_prefetch0(&vq->desc_packed[vq->last_avail_idx]);
> >
> > if (remained >= PACKED_BATCH_SIZE) {
> > - if (!virtio_dev_rx_batch_packed(dev, vq,
> > - &pkts[pkt_idx])) {
> > + if (!virtio_dev_rx_handle_batch_packed(dev, vq,
> > + &pkts[pkt_idx])) {
> > pkt_idx += PACKED_BATCH_SIZE;
> > remained -= PACKED_BATCH_SIZE;
> > continue;
> >
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v2 4/5] vhost: add packed ring vectorized dequeue
2020-10-06 15:18 ` Maxime Coquelin
@ 2020-10-09 7:59 ` Liu, Yong
0 siblings, 0 replies; 36+ messages in thread
From: Liu, Yong @ 2020-10-09 7:59 UTC (permalink / raw)
To: Maxime Coquelin, Xia, Chenbo, Wang, Zhihong; +Cc: dev
> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Tuesday, October 6, 2020 11:19 PM
> To: Liu, Yong <yong.liu@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v2 4/5] vhost: add packed ring vectorized dequeue
>
>
>
> On 9/21/20 8:48 AM, Marvin Liu wrote:
> > Optimize vhost packed ring dequeue path with SIMD instructions. Four
> > descriptors status check and writeback are batched handled with AVX512
> > instructions. Address translation operations are also accelerated by
> > AVX512 instructions.
> >
> > If platform or compiler not support vectorization, will fallback to
> > default path.
> >
> > Signed-off-by: Marvin Liu <yong.liu@intel.com>
> >
> > diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
> > index cc9aa65c67..c1481802d7 100644
> > --- a/lib/librte_vhost/meson.build
> > +++ b/lib/librte_vhost/meson.build
> > @@ -8,6 +8,22 @@ endif
> > if has_libnuma == 1
> > dpdk_conf.set10('RTE_LIBRTE_VHOST_NUMA', true)
> > endif
> > +
> > +if arch_subdir == 'x86'
> > + if not machine_args.contains('-mno-avx512f')
> > + if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw')
> > + cflags += ['-DCC_AVX512_SUPPORT']
> > + vhost_avx512_lib = static_library('vhost_avx512_lib',
> > + 'vhost_vec_avx.c',
> > + dependencies: [static_rte_eal, static_rte_mempool,
> > + static_rte_mbuf, static_rte_ethdev, static_rte_net],
> > + include_directories: includes,
> > + c_args: [cflags, '-mavx512f', '-mavx512bw', '-mavx512vl'])
> > + objs += vhost_avx512_lib.extract_objects('vhost_vec_avx.c')
> > + endif
> > + endif
> > +endif
> > +
> > if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
> > cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
> > elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
> > diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> > index 4a81f18f01..fc7daf2145 100644
> > --- a/lib/librte_vhost/vhost.h
> > +++ b/lib/librte_vhost/vhost.h
> > @@ -1124,4 +1124,12 @@ virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
> > return NULL;
> > }
> >
> > +int
> > +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> > + struct vhost_virtqueue *vq,
> > + struct rte_mempool *mbuf_pool,
> > + struct rte_mbuf **pkts,
> > + uint16_t avail_idx,
> > + uintptr_t *desc_addrs,
> > + uint16_t *ids);
> > #endif /* _VHOST_NET_CDEV_H_ */
> > diff --git a/lib/librte_vhost/vhost_vec_avx.c b/lib/librte_vhost/vhost_vec_avx.c
> > new file mode 100644
> > index 0000000000..dc5322d002
> > --- /dev/null
> > +++ b/lib/librte_vhost/vhost_vec_avx.c
> > @@ -0,0 +1,181 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2010-2016 Intel Corporation
> > + */
> > +#include <stdint.h>
> > +
> > +#include "vhost.h"
> > +
> > +#define BYTE_SIZE 8
> > +/* reference count offset in mbuf rearm data */
> > +#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
> > + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> > +/* segment number offset in mbuf rearm data */
> > +#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
> > + offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
> > +
> > +/* default rearm data */
> > +#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
> > + 1ULL << REFCNT_BITS_OFFSET)
> > +
> > +#define DESC_FLAGS_SHORT_OFFSET (offsetof(struct vring_packed_desc, flags) / \
> > + sizeof(uint16_t))
> > +
> > +#define DESC_FLAGS_SHORT_SIZE (sizeof(struct vring_packed_desc) / \
> > + sizeof(uint16_t))
> > +#define BATCH_FLAGS_MASK (1 << DESC_FLAGS_SHORT_OFFSET | \
> > + 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE) | \
> > + 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 2) | \
> > + 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 3))
> > +
> > +#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
> > + offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
> > +
> > +#define PACKED_FLAGS_MASK ((0ULL | VRING_DESC_F_AVAIL | VRING_DESC_F_USED) \
> > + << FLAGS_BITS_OFFSET)
> > +#define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) << FLAGS_BITS_OFFSET)
> > +#define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
> > + FLAGS_BITS_OFFSET)
> > +
> > +#define DESC_FLAGS_POS 0xaa
> > +#define MBUF_LENS_POS 0x6666
> > +
> > +int
> > +vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
> > + struct vhost_virtqueue *vq,
> > + struct rte_mempool *mbuf_pool,
> > + struct rte_mbuf **pkts,
> > + uint16_t avail_idx,
> > + uintptr_t *desc_addrs,
> > + uint16_t *ids)
> > +{
> > + struct vring_packed_desc *descs = vq->desc_packed;
> > + uint32_t descs_status;
> > + void *desc_addr;
> > + uint16_t i;
> > + uint8_t cmp_low, cmp_high, cmp_result;
> > + uint64_t lens[PACKED_BATCH_SIZE];
> > + struct virtio_net_hdr *hdr;
> > +
> > + if (unlikely(avail_idx & PACKED_BATCH_MASK))
> > + return -1;
> > +
> > + /* load 4 descs */
> > + desc_addr = &vq->desc_packed[avail_idx];
> > + __m512i desc_vec = _mm512_loadu_si512(desc_addr);
> > +
> > + /* burst check four status */
> > + __m512i avail_flag_vec;
> > + if (vq->avail_wrap_counter)
> > +#if defined(RTE_ARCH_I686)
> > + avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG, 0x0,
> > + PACKED_FLAGS_MASK, 0x0);
> > +#else
> > + avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> > + PACKED_AVAIL_FLAG);
> > +
> > +#endif
> > + else
> > +#if defined(RTE_ARCH_I686)
> > + avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG_WRAP,
> > + 0x0, PACKED_AVAIL_FLAG_WRAP, 0x0);
> > +#else
> > + avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
> > + PACKED_AVAIL_FLAG_WRAP);
> > +#endif
> > +
> > + descs_status = _mm512_cmp_epu16_mask(desc_vec, avail_flag_vec,
> > + _MM_CMPINT_NE);
> > + if (descs_status & BATCH_FLAGS_MASK)
> > + return -1;
> > +
>
>
> Also, please try to factorize code to avoid duplication between Tx and
> Rx paths for desc address translation:
Hi Maxime,
I have factorized the translation function in the Rx and Tx paths, but there is a slight performance drop after the change.
Since the vectorized data path is focused on performance, I'd like to keep the current implementation.
Thanks,
Marvin
> > + if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) {
> > + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > + uint64_t size = (uint64_t)descs[avail_idx + i].len;
> > + desc_addrs[i] = __vhost_iova_to_vva(dev, vq,
> > + descs[avail_idx + i].addr, &size,
> > + VHOST_ACCESS_RO);
> > +
> > + if (!desc_addrs[i])
> > + goto free_buf;
> > + lens[i] = descs[avail_idx + i].len;
> > + rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> > +
> > + pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool,
> > + lens[i]);
> > + if (!pkts[i])
> > + goto free_buf;
> > + }
> > + } else {
> > + /* check buffer fit into one region & translate address */
> > + __m512i regions_low_addrs =
> > + _mm512_loadu_si512((void *)&dev->regions_low_addrs);
> > + __m512i regions_high_addrs =
> > + _mm512_loadu_si512((void *)&dev->regions_high_addrs);
> > + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > + uint64_t addr_low = descs[avail_idx + i].addr;
> > + uint64_t addr_high = addr_low +
> > + descs[avail_idx + i].len;
> > + __m512i low_addr_vec = _mm512_set1_epi64(addr_low);
> > + __m512i high_addr_vec = _mm512_set1_epi64(addr_high);
> > +
> > + cmp_low = _mm512_cmp_epi64_mask(low_addr_vec,
> > + regions_low_addrs, _MM_CMPINT_NLT);
> > + cmp_high = _mm512_cmp_epi64_mask(high_addr_vec,
> > + regions_high_addrs, _MM_CMPINT_LT);
> > + cmp_result = cmp_low & cmp_high;
> > + int index = __builtin_ctz(cmp_result);
> > + if (unlikely((uint32_t)index >= dev->mem->nregions))
> > + goto free_buf;
> > +
> > + desc_addrs[i] = addr_low +
> > + dev->mem->regions[index].host_user_addr -
> > + dev->mem->regions[index].guest_phys_addr;
> > + lens[i] = descs[avail_idx + i].len;
> > + rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
> > +
> > + pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool,
> > + lens[i]);
> > + if (!pkts[i])
> > + goto free_buf;
> > + }
> > + }
> > +
> > + if (virtio_net_with_host_offload(dev)) {
> > + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > + hdr = (struct virtio_net_hdr *)(desc_addrs[i]);
> > + vhost_dequeue_offload(hdr, pkts[i]);
> > + }
> > + }
> > +
> > + if (unlikely(virtio_net_is_inorder(dev))) {
> > + ids[PACKED_BATCH_SIZE - 1] =
> > + descs[avail_idx + PACKED_BATCH_SIZE - 1].id;
> > + } else {
> > + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
> > + ids[i] = descs[avail_idx + i].id;
> > + }
> > +
> > + uint64_t addrs[PACKED_BATCH_SIZE << 1];
> > + /* store mbuf data_len, pkt_len */
> > + vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
> > + addrs[i << 1] = (uint64_t)pkts[i]->rx_descriptor_fields1;
> > + addrs[(i << 1) + 1] = (uint64_t)pkts[i]->rx_descriptor_fields1
> > + + sizeof(uint64_t);
> > + }
> > +
> > + /* save pkt_len and data_len into mbufs */
> > + __m512i value_vec = _mm512_maskz_shuffle_epi32(MBUF_LENS_POS, desc_vec,
> > + 0xAA);
> > + __m512i offsets_vec = _mm512_maskz_set1_epi32(MBUF_LENS_POS,
> > + (uint32_t)-12);
> > + value_vec = _mm512_add_epi32(value_vec, offsets_vec);
> > + __m512i vindex = _mm512_loadu_si512((void *)addrs);
> > + _mm512_i64scatter_epi64(0, vindex, value_vec, 1);
> > +
> > + return 0;
> > +free_buf:
> > + for (i = 0; i < PACKED_BATCH_SIZE; i++)
> > + rte_pktmbuf_free(pkts[i]);
> > +
> > + return -1;
> > +}
> > diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> > index 6107662685..e4d2e2e7d6 100644
> > --- a/lib/librte_vhost/virtio_net.c
> > +++ b/lib/librte_vhost/virtio_net.c
> > @@ -2249,6 +2249,28 @@ vhost_reserve_avail_batch_packed(struct virtio_net *dev,
> > return -1;
> > }
> >
> > +static __rte_always_inline int
> > +vhost_handle_avail_batch_packed(struct virtio_net *dev,
> > + struct vhost_virtqueue *vq,
> > + struct rte_mempool *mbuf_pool,
> > + struct rte_mbuf **pkts,
> > + uint16_t avail_idx,
> > + uintptr_t *desc_addrs,
> > + uint16_t *ids)
> > +{
> > + if (unlikely(dev->vectorized))
> > +#ifdef CC_AVX512_SUPPORT
> > + return vhost_reserve_avail_batch_packed_avx(dev, vq, mbuf_pool,
> > + pkts, avail_idx, desc_addrs, ids);
> > +#else
> > + return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool,
> > + pkts, avail_idx, desc_addrs, ids);
> > +
> > +#endif
> > + return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
> > + avail_idx, desc_addrs, ids);
> > +}
> > +
> > static __rte_always_inline int
> > virtio_dev_tx_batch_packed(struct virtio_net *dev,
> > struct vhost_virtqueue *vq,
> > @@ -2261,8 +2283,9 @@ virtio_dev_tx_batch_packed(struct virtio_net *dev,
> > uint16_t ids[PACKED_BATCH_SIZE];
> > uint16_t i;
> >
> > - if (vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
> > - avail_idx, desc_addrs, ids))
> > +
> > + if (vhost_handle_avail_batch_packed(dev, vq, mbuf_pool, pkts,
> > + avail_idx, desc_addrs, ids))
> > return -1;
> >
> > vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
> >
^ permalink raw reply [flat|nested] 36+ messages in thread
* [dpdk-dev] [PATCH v3 0/5] vhost add vectorized data path
2020-08-19 3:24 ` [dpdk-dev] [PATCH v1 1/5] vhost: " Marvin Liu
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 0/5] vhost " Marvin Liu
@ 2020-10-09 8:14 ` Marvin Liu
2020-10-09 8:14 ` [dpdk-dev] [PATCH v3 1/5] vhost: " Marvin Liu
` (5 more replies)
1 sibling, 6 replies; 36+ messages in thread
From: Marvin Liu @ 2020-10-09 8:14 UTC (permalink / raw)
To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu
Packed ring format is imported since virtio spec 1.1. All descriptors
are compacted into one single ring when packed ring format is on. It is
straight forward that ring operations can be accelerated by utilizing
SIMD instructions.
This patch set will introduce vectorized data path in vhost library. If
vectorized option is on, operations like descs check, descs writeback,
address translation will be accelerated by SIMD instructions. On a Skylake
server, it brings about a 6% performance gain in the loopback case and around
a 4% performance gain in the PvP case.
A vhost application can choose whether to use vectorized acceleration, just
like the external buffer feature. If the platform or ring format does not
support the vectorized function, vhost will fall back to the default batch
function. There will be no impact on the current data path.
v3:
* rename vectorized datapath file
* eliminate the impact when avx512 disabled
* dynamically allocate memory regions structure
* remove unlikely hint for in_order
v2:
* add vIOMMU support
* add dequeue offloading
* rebase code
Marvin Liu (5):
vhost: add vectorized data path
vhost: reuse packed ring functions
vhost: prepare memory regions addresses
vhost: add packed ring vectorized dequeue
vhost: add packed ring vectorized enqueue
doc/guides/nics/vhost.rst | 5 +
doc/guides/prog_guide/vhost_lib.rst | 12 +
drivers/net/vhost/rte_eth_vhost.c | 17 +-
lib/librte_vhost/meson.build | 16 ++
lib/librte_vhost/rte_vhost.h | 1 +
lib/librte_vhost/socket.c | 5 +
lib/librte_vhost/vhost.c | 11 +
lib/librte_vhost/vhost.h | 239 +++++++++++++++++++
lib/librte_vhost/vhost_user.c | 26 +++
lib/librte_vhost/virtio_net.c | 258 ++++-----------------
lib/librte_vhost/virtio_net_avx.c | 344 ++++++++++++++++++++++++++++
11 files changed, 718 insertions(+), 216 deletions(-)
create mode 100644 lib/librte_vhost/virtio_net_avx.c
--
2.17.1
^ permalink raw reply [flat|nested] 36+ messages in thread
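As context for the cover letter above, a minimal sketch of how an application could opt in to the new data path through the vhost library API is shown below. The socket path is only an example, and the flag is a hint: vhost falls back to the default batch functions when AVX512 or the packed ring format is not available.

#include <rte_vhost.h>

/* Register a vhost-user socket with the vectorized data path requested.
 * RTE_VHOST_USER_VECTORIZED is the flag added by patch 1/5; the socket
 * path passed by the caller is a placeholder.
 */
static int
register_vectorized_socket(const char *path)
{
	uint64_t flags = RTE_VHOST_USER_VECTORIZED;

	if (rte_vhost_driver_register(path, flags) < 0)
		return -1;

	return rte_vhost_driver_start(path);
}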
* [dpdk-dev] [PATCH v3 1/5] vhost: add vectorized data path
2020-10-09 8:14 ` [dpdk-dev] [PATCH v3 " Marvin Liu
@ 2020-10-09 8:14 ` Marvin Liu
2020-10-09 8:14 ` [dpdk-dev] [PATCH v3 2/5] vhost: reuse packed ring functions Marvin Liu
` (4 subsequent siblings)
5 siblings, 0 replies; 36+ messages in thread
From: Marvin Liu @ 2020-10-09 8:14 UTC (permalink / raw)
To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu
Packed ring operations are split into batch and single functions for
performance perspective. Ring operations in batch function can be
accelerated by SIMD instructions like AVX512.
So introduce vectorized parameter in vhost. Vectorized data path can be
selected if platform and ring format matched requirements. Otherwise
will fallback to original data path.
Signed-off-by: Marvin Liu <yong.liu@intel.com>
diff --git a/doc/guides/nics/vhost.rst b/doc/guides/nics/vhost.rst
index d36f3120b..efdaf4de0 100644
--- a/doc/guides/nics/vhost.rst
+++ b/doc/guides/nics/vhost.rst
@@ -64,6 +64,11 @@ The user can specify below arguments in `--vdev` option.
It is used to enable external buffer support in vhost library.
(Default: 0 (disabled))
+#. ``vectorized``:
+
+ It is used to enable vectorized data path support in vhost library.
+ (Default: 0 (disabled))
+
Vhost PMD event handling
------------------------
diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst
index ba4c62aeb..5ef3844a0 100644
--- a/doc/guides/prog_guide/vhost_lib.rst
+++ b/doc/guides/prog_guide/vhost_lib.rst
@@ -118,6 +118,18 @@ The following is an overview of some key Vhost API functions:
It is disabled by default.
+ - ``RTE_VHOST_USER_VECTORIZED``
+ Vectorized data path will be used when this flag is set. When packed ring
+ is enabled, available descriptors are stored by the frontend driver in sequence.
+ SIMD instructions like AVX can be used to handle multiple descriptors
+ simultaneously, which accelerates the throughput of ring operations.
+
+ * Only the packed ring has a vectorized data path.
+
+ * It will fall back to the normal data path if vectorization is not supported.
+
+ It is disabled by default.
+
* ``rte_vhost_driver_set_features(path, features)``
This function sets the feature bits the vhost-user driver supports. The
diff --git a/drivers/net/vhost/rte_eth_vhost.c b/drivers/net/vhost/rte_eth_vhost.c
index 66efecb32..8f71054ad 100644
--- a/drivers/net/vhost/rte_eth_vhost.c
+++ b/drivers/net/vhost/rte_eth_vhost.c
@@ -34,6 +34,7 @@ enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
#define ETH_VHOST_VIRTIO_NET_F_HOST_TSO "tso"
#define ETH_VHOST_LINEAR_BUF "linear-buffer"
#define ETH_VHOST_EXT_BUF "ext-buffer"
+#define ETH_VHOST_VECTORIZED "vectorized"
#define VHOST_MAX_PKT_BURST 32
static const char *valid_arguments[] = {
@@ -45,6 +46,7 @@ static const char *valid_arguments[] = {
ETH_VHOST_VIRTIO_NET_F_HOST_TSO,
ETH_VHOST_LINEAR_BUF,
ETH_VHOST_EXT_BUF,
+ ETH_VHOST_VECTORIZED,
NULL
};
@@ -1509,6 +1511,7 @@ rte_pmd_vhost_probe(struct rte_vdev_device *dev)
int tso = 0;
int linear_buf = 0;
int ext_buf = 0;
+ int vectorized = 0;
struct rte_eth_dev *eth_dev;
const char *name = rte_vdev_device_name(dev);
@@ -1618,6 +1621,17 @@ rte_pmd_vhost_probe(struct rte_vdev_device *dev)
flags |= RTE_VHOST_USER_EXTBUF_SUPPORT;
}
+ if (rte_kvargs_count(kvlist, ETH_VHOST_VECTORIZED) == 1) {
+ ret = rte_kvargs_process(kvlist,
+ ETH_VHOST_VECTORIZED,
+ &open_int, &vectorized);
+ if (ret < 0)
+ goto out_free;
+
+ if (vectorized == 1)
+ flags |= RTE_VHOST_USER_VECTORIZED;
+ }
+
if (dev->device.numa_node == SOCKET_ID_ANY)
dev->device.numa_node = rte_socket_id();
@@ -1666,4 +1680,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_vhost,
"postcopy-support=<0|1> "
"tso=<0|1> "
"linear-buffer=<0|1> "
- "ext-buffer=<0|1>");
+ "ext-buffer=<0|1> "
+ "vectorized=<0|1>");
diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
index 010f16086..c49c1aca2 100644
--- a/lib/librte_vhost/rte_vhost.h
+++ b/lib/librte_vhost/rte_vhost.h
@@ -36,6 +36,7 @@ extern "C" {
/* support only linear buffers (no chained mbufs) */
#define RTE_VHOST_USER_LINEARBUF_SUPPORT (1ULL << 6)
#define RTE_VHOST_USER_ASYNC_COPY (1ULL << 7)
+#define RTE_VHOST_USER_VECTORIZED (1ULL << 8)
/* Features. */
#ifndef VIRTIO_NET_F_GUEST_ANNOUNCE
diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
index 0169d3648..e492c8c87 100644
--- a/lib/librte_vhost/socket.c
+++ b/lib/librte_vhost/socket.c
@@ -42,6 +42,7 @@ struct vhost_user_socket {
bool extbuf;
bool linearbuf;
bool async_copy;
+ bool vectorized;
/*
* The "supported_features" indicates the feature bits the
@@ -241,6 +242,9 @@ vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
dev->async_copy = 1;
}
+ if (vsocket->vectorized)
+ vhost_enable_vectorized(vid);
+
VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
if (vsocket->notify_ops->new_connection) {
@@ -876,6 +880,7 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
vsocket->vdpa_dev = NULL;
vsocket->extbuf = flags & RTE_VHOST_USER_EXTBUF_SUPPORT;
vsocket->linearbuf = flags & RTE_VHOST_USER_LINEARBUF_SUPPORT;
+ vsocket->vectorized = flags & RTE_VHOST_USER_VECTORIZED;
vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
if (vsocket->async_copy &&
diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index c7cd34e42..4b5ef10a8 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -738,6 +738,17 @@ vhost_enable_linearbuf(int vid)
dev->linearbuf = 1;
}
+void
+vhost_enable_vectorized(int vid)
+{
+ struct virtio_net *dev = get_device(vid);
+
+ if (dev == NULL)
+ return;
+
+ dev->vectorized = 1;
+}
+
int
rte_vhost_get_mtu(int vid, uint16_t *mtu)
{
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 20ccdc9bd..87583c0b6 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -363,6 +363,7 @@ struct virtio_net {
int async_copy;
int extbuf;
int linearbuf;
+ int vectorized;
struct vhost_virtqueue *virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
struct inflight_mem_info *inflight_info;
#define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
@@ -700,6 +701,7 @@ void vhost_set_ifname(int, const char *if_name, unsigned int if_len);
void vhost_set_builtin_virtio_net(int vid, bool enable);
void vhost_enable_extbuf(int vid);
void vhost_enable_linearbuf(int vid);
+void vhost_enable_vectorized(int vid);
int vhost_enable_guest_notification(struct virtio_net *dev,
struct vhost_virtqueue *vq, int enable);
--
2.17.1
^ permalink raw reply [flat|nested] 36+ messages in thread
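For the PMD side of this patch, the sketch below shows one way to create a vhost vdev with the new "vectorized" argument; the device name, socket path and queue count are assumptions for illustration, and "vectorized=1" maps to the RTE_VHOST_USER_VECTORIZED registration flag internally.

#include <rte_bus_vdev.h>

/* Create a vhost PMD port with the vectorized data path requested.
 * "net_vhost0" and the iface path are example values only.
 */
static int
create_vectorized_vhost_port(void)
{
	return rte_vdev_init("net_vhost0",
			"iface=/tmp/vhost-user0.sock,queues=1,vectorized=1");
}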
* [dpdk-dev] [PATCH v3 2/5] vhost: reuse packed ring functions
2020-10-09 8:14 ` [dpdk-dev] [PATCH v3 " Marvin Liu
2020-10-09 8:14 ` [dpdk-dev] [PATCH v3 1/5] vhost: " Marvin Liu
@ 2020-10-09 8:14 ` Marvin Liu
2020-10-09 8:14 ` [dpdk-dev] [PATCH v3 3/5] vhost: prepare memory regions addresses Marvin Liu
` (3 subsequent siblings)
5 siblings, 0 replies; 36+ messages in thread
From: Marvin Liu @ 2020-10-09 8:14 UTC (permalink / raw)
To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu
Move parse_ethernet, offload, extbuf functions to header file. These
functions will be reused by vhost vectorized path.
Signed-off-by: Marvin Liu <yong.liu@intel.com>
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 87583c0b6..12b7699cf 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -20,6 +20,10 @@
#include <rte_rwlock.h>
#include <rte_malloc.h>
+#include <rte_ip.h>
+#include <rte_tcp.h>
+#include <rte_udp.h>
+#include <rte_sctp.h>
#include "rte_vhost.h"
#include "rte_vdpa.h"
#include "rte_vdpa_dev.h"
@@ -878,4 +882,214 @@ mbuf_is_consumed(struct rte_mbuf *m)
return true;
}
+static __rte_always_inline bool
+virtio_net_is_inorder(struct virtio_net *dev)
+{
+ return dev->features & (1ULL << VIRTIO_F_IN_ORDER);
+}
+
+static void
+parse_ethernet(struct rte_mbuf *m, uint16_t *l4_proto, void **l4_hdr)
+{
+ struct rte_ipv4_hdr *ipv4_hdr;
+ struct rte_ipv6_hdr *ipv6_hdr;
+ void *l3_hdr = NULL;
+ struct rte_ether_hdr *eth_hdr;
+ uint16_t ethertype;
+
+ eth_hdr = rte_pktmbuf_mtod(m, struct rte_ether_hdr *);
+
+ m->l2_len = sizeof(struct rte_ether_hdr);
+ ethertype = rte_be_to_cpu_16(eth_hdr->ether_type);
+
+ if (ethertype == RTE_ETHER_TYPE_VLAN) {
+ struct rte_vlan_hdr *vlan_hdr =
+ (struct rte_vlan_hdr *)(eth_hdr + 1);
+
+ m->l2_len += sizeof(struct rte_vlan_hdr);
+ ethertype = rte_be_to_cpu_16(vlan_hdr->eth_proto);
+ }
+
+ l3_hdr = (char *)eth_hdr + m->l2_len;
+
+ switch (ethertype) {
+ case RTE_ETHER_TYPE_IPV4:
+ ipv4_hdr = l3_hdr;
+ *l4_proto = ipv4_hdr->next_proto_id;
+ m->l3_len = (ipv4_hdr->version_ihl & 0x0f) * 4;
+ *l4_hdr = (char *)l3_hdr + m->l3_len;
+ m->ol_flags |= PKT_TX_IPV4;
+ break;
+ case RTE_ETHER_TYPE_IPV6:
+ ipv6_hdr = l3_hdr;
+ *l4_proto = ipv6_hdr->proto;
+ m->l3_len = sizeof(struct rte_ipv6_hdr);
+ *l4_hdr = (char *)l3_hdr + m->l3_len;
+ m->ol_flags |= PKT_TX_IPV6;
+ break;
+ default:
+ m->l3_len = 0;
+ *l4_proto = 0;
+ *l4_hdr = NULL;
+ break;
+ }
+}
+
+static inline bool
+virtio_net_with_host_offload(struct virtio_net *dev)
+{
+ if (dev->features &
+ ((1ULL << VIRTIO_NET_F_CSUM) |
+ (1ULL << VIRTIO_NET_F_HOST_ECN) |
+ (1ULL << VIRTIO_NET_F_HOST_TSO4) |
+ (1ULL << VIRTIO_NET_F_HOST_TSO6) |
+ (1ULL << VIRTIO_NET_F_HOST_UFO)))
+ return true;
+
+ return false;
+}
+
+static __rte_always_inline void
+vhost_dequeue_offload(struct virtio_net_hdr *hdr, struct rte_mbuf *m)
+{
+ uint16_t l4_proto = 0;
+ void *l4_hdr = NULL;
+ struct rte_tcp_hdr *tcp_hdr = NULL;
+
+ if (hdr->flags == 0 && hdr->gso_type == VIRTIO_NET_HDR_GSO_NONE)
+ return;
+
+ parse_ethernet(m, &l4_proto, &l4_hdr);
+ if (hdr->flags == VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+ if (hdr->csum_start == (m->l2_len + m->l3_len)) {
+ switch (hdr->csum_offset) {
+ case (offsetof(struct rte_tcp_hdr, cksum)):
+ if (l4_proto == IPPROTO_TCP)
+ m->ol_flags |= PKT_TX_TCP_CKSUM;
+ break;
+ case (offsetof(struct rte_udp_hdr, dgram_cksum)):
+ if (l4_proto == IPPROTO_UDP)
+ m->ol_flags |= PKT_TX_UDP_CKSUM;
+ break;
+ case (offsetof(struct rte_sctp_hdr, cksum)):
+ if (l4_proto == IPPROTO_SCTP)
+ m->ol_flags |= PKT_TX_SCTP_CKSUM;
+ break;
+ default:
+ break;
+ }
+ }
+ }
+
+ if (l4_hdr && hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
+ switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
+ case VIRTIO_NET_HDR_GSO_TCPV4:
+ case VIRTIO_NET_HDR_GSO_TCPV6:
+ tcp_hdr = l4_hdr;
+ m->ol_flags |= PKT_TX_TCP_SEG;
+ m->tso_segsz = hdr->gso_size;
+ m->l4_len = (tcp_hdr->data_off & 0xf0) >> 2;
+ break;
+ case VIRTIO_NET_HDR_GSO_UDP:
+ m->ol_flags |= PKT_TX_UDP_SEG;
+ m->tso_segsz = hdr->gso_size;
+ m->l4_len = sizeof(struct rte_udp_hdr);
+ break;
+ default:
+ VHOST_LOG_DATA(WARNING,
+ "unsupported gso type %u.\n", hdr->gso_type);
+ break;
+ }
+ }
+}
+
+static void
+virtio_dev_extbuf_free(void *addr __rte_unused, void *opaque)
+{
+ rte_free(opaque);
+}
+
+static int
+virtio_dev_extbuf_alloc(struct rte_mbuf *pkt, uint32_t size)
+{
+ struct rte_mbuf_ext_shared_info *shinfo = NULL;
+ uint32_t total_len = RTE_PKTMBUF_HEADROOM + size;
+ uint16_t buf_len;
+ rte_iova_t iova;
+ void *buf;
+
+ /* Try to use pkt buffer to store shinfo to reduce the amount of memory
+ * required, otherwise store shinfo in the new buffer.
+ */
+ if (rte_pktmbuf_tailroom(pkt) >= sizeof(*shinfo))
+ shinfo = rte_pktmbuf_mtod(pkt,
+ struct rte_mbuf_ext_shared_info *);
+ else {
+ total_len += sizeof(*shinfo) + sizeof(uintptr_t);
+ total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
+ }
+
+ if (unlikely(total_len > UINT16_MAX))
+ return -ENOSPC;
+
+ buf_len = total_len;
+ buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
+ if (unlikely(buf == NULL))
+ return -ENOMEM;
+
+ /* Initialize shinfo */
+ if (shinfo) {
+ shinfo->free_cb = virtio_dev_extbuf_free;
+ shinfo->fcb_opaque = buf;
+ rte_mbuf_ext_refcnt_set(shinfo, 1);
+ } else {
+ shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
+ virtio_dev_extbuf_free, buf);
+ if (unlikely(shinfo == NULL)) {
+ rte_free(buf);
+ VHOST_LOG_DATA(ERR, "Failed to init shinfo\n");
+ return -1;
+ }
+ }
+
+ iova = rte_malloc_virt2iova(buf);
+ rte_pktmbuf_attach_extbuf(pkt, buf, iova, buf_len, shinfo);
+ rte_pktmbuf_reset_headroom(pkt);
+
+ return 0;
+}
+
+/*
+ * Allocate a host supported pktmbuf.
+ */
+static __rte_always_inline struct rte_mbuf *
+virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
+ uint32_t data_len)
+{
+ struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
+
+ if (unlikely(pkt == NULL)) {
+ VHOST_LOG_DATA(ERR,
+ "Failed to allocate memory for mbuf.\n");
+ return NULL;
+ }
+
+ if (rte_pktmbuf_tailroom(pkt) >= data_len)
+ return pkt;
+
+ /* attach an external buffer if supported */
+ if (dev->extbuf && !virtio_dev_extbuf_alloc(pkt, data_len))
+ return pkt;
+
+ /* check if chained buffers are allowed */
+ if (!dev->linearbuf)
+ return pkt;
+
+ /* Data doesn't fit into the buffer and the host supports
+ * only linear buffers
+ */
+ rte_pktmbuf_free(pkt);
+
+ return NULL;
+}
#endif /* _VHOST_NET_CDEV_H_ */
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 0a0bea1a5..9757ed053 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -32,12 +32,6 @@ rxvq_is_mergeable(struct virtio_net *dev)
return dev->features & (1ULL << VIRTIO_NET_F_MRG_RXBUF);
}
-static __rte_always_inline bool
-virtio_net_is_inorder(struct virtio_net *dev)
-{
- return dev->features & (1ULL << VIRTIO_F_IN_ORDER);
-}
-
static bool
is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t nr_vring)
{
@@ -1804,121 +1798,6 @@ rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
return virtio_dev_rx_async_submit(dev, queue_id, pkts, count);
}
-static inline bool
-virtio_net_with_host_offload(struct virtio_net *dev)
-{
- if (dev->features &
- ((1ULL << VIRTIO_NET_F_CSUM) |
- (1ULL << VIRTIO_NET_F_HOST_ECN) |
- (1ULL << VIRTIO_NET_F_HOST_TSO4) |
- (1ULL << VIRTIO_NET_F_HOST_TSO6) |
- (1ULL << VIRTIO_NET_F_HOST_UFO)))
- return true;
-
- return false;
-}
-
-static void
-parse_ethernet(struct rte_mbuf *m, uint16_t *l4_proto, void **l4_hdr)
-{
- struct rte_ipv4_hdr *ipv4_hdr;
- struct rte_ipv6_hdr *ipv6_hdr;
- void *l3_hdr = NULL;
- struct rte_ether_hdr *eth_hdr;
- uint16_t ethertype;
-
- eth_hdr = rte_pktmbuf_mtod(m, struct rte_ether_hdr *);
-
- m->l2_len = sizeof(struct rte_ether_hdr);
- ethertype = rte_be_to_cpu_16(eth_hdr->ether_type);
-
- if (ethertype == RTE_ETHER_TYPE_VLAN) {
- struct rte_vlan_hdr *vlan_hdr =
- (struct rte_vlan_hdr *)(eth_hdr + 1);
-
- m->l2_len += sizeof(struct rte_vlan_hdr);
- ethertype = rte_be_to_cpu_16(vlan_hdr->eth_proto);
- }
-
- l3_hdr = (char *)eth_hdr + m->l2_len;
-
- switch (ethertype) {
- case RTE_ETHER_TYPE_IPV4:
- ipv4_hdr = l3_hdr;
- *l4_proto = ipv4_hdr->next_proto_id;
- m->l3_len = (ipv4_hdr->version_ihl & 0x0f) * 4;
- *l4_hdr = (char *)l3_hdr + m->l3_len;
- m->ol_flags |= PKT_TX_IPV4;
- break;
- case RTE_ETHER_TYPE_IPV6:
- ipv6_hdr = l3_hdr;
- *l4_proto = ipv6_hdr->proto;
- m->l3_len = sizeof(struct rte_ipv6_hdr);
- *l4_hdr = (char *)l3_hdr + m->l3_len;
- m->ol_flags |= PKT_TX_IPV6;
- break;
- default:
- m->l3_len = 0;
- *l4_proto = 0;
- *l4_hdr = NULL;
- break;
- }
-}
-
-static __rte_always_inline void
-vhost_dequeue_offload(struct virtio_net_hdr *hdr, struct rte_mbuf *m)
-{
- uint16_t l4_proto = 0;
- void *l4_hdr = NULL;
- struct rte_tcp_hdr *tcp_hdr = NULL;
-
- if (hdr->flags == 0 && hdr->gso_type == VIRTIO_NET_HDR_GSO_NONE)
- return;
-
- parse_ethernet(m, &l4_proto, &l4_hdr);
- if (hdr->flags == VIRTIO_NET_HDR_F_NEEDS_CSUM) {
- if (hdr->csum_start == (m->l2_len + m->l3_len)) {
- switch (hdr->csum_offset) {
- case (offsetof(struct rte_tcp_hdr, cksum)):
- if (l4_proto == IPPROTO_TCP)
- m->ol_flags |= PKT_TX_TCP_CKSUM;
- break;
- case (offsetof(struct rte_udp_hdr, dgram_cksum)):
- if (l4_proto == IPPROTO_UDP)
- m->ol_flags |= PKT_TX_UDP_CKSUM;
- break;
- case (offsetof(struct rte_sctp_hdr, cksum)):
- if (l4_proto == IPPROTO_SCTP)
- m->ol_flags |= PKT_TX_SCTP_CKSUM;
- break;
- default:
- break;
- }
- }
- }
-
- if (l4_hdr && hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
- switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
- case VIRTIO_NET_HDR_GSO_TCPV4:
- case VIRTIO_NET_HDR_GSO_TCPV6:
- tcp_hdr = l4_hdr;
- m->ol_flags |= PKT_TX_TCP_SEG;
- m->tso_segsz = hdr->gso_size;
- m->l4_len = (tcp_hdr->data_off & 0xf0) >> 2;
- break;
- case VIRTIO_NET_HDR_GSO_UDP:
- m->ol_flags |= PKT_TX_UDP_SEG;
- m->tso_segsz = hdr->gso_size;
- m->l4_len = sizeof(struct rte_udp_hdr);
- break;
- default:
- VHOST_LOG_DATA(WARNING,
- "unsupported gso type %u.\n", hdr->gso_type);
- break;
- }
- }
-}
-
static __rte_noinline void
copy_vnet_hdr_from_desc(struct virtio_net_hdr *hdr,
struct buf_vector *buf_vec)
@@ -2083,96 +1962,6 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
return error;
}
-static void
-virtio_dev_extbuf_free(void *addr __rte_unused, void *opaque)
-{
- rte_free(opaque);
-}
-
-static int
-virtio_dev_extbuf_alloc(struct rte_mbuf *pkt, uint32_t size)
-{
- struct rte_mbuf_ext_shared_info *shinfo = NULL;
- uint32_t total_len = RTE_PKTMBUF_HEADROOM + size;
- uint16_t buf_len;
- rte_iova_t iova;
- void *buf;
-
- /* Try to use pkt buffer to store shinfo to reduce the amount of memory
- * required, otherwise store shinfo in the new buffer.
- */
- if (rte_pktmbuf_tailroom(pkt) >= sizeof(*shinfo))
- shinfo = rte_pktmbuf_mtod(pkt,
- struct rte_mbuf_ext_shared_info *);
- else {
- total_len += sizeof(*shinfo) + sizeof(uintptr_t);
- total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
- }
-
- if (unlikely(total_len > UINT16_MAX))
- return -ENOSPC;
-
- buf_len = total_len;
- buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
- if (unlikely(buf == NULL))
- return -ENOMEM;
-
- /* Initialize shinfo */
- if (shinfo) {
- shinfo->free_cb = virtio_dev_extbuf_free;
- shinfo->fcb_opaque = buf;
- rte_mbuf_ext_refcnt_set(shinfo, 1);
- } else {
- shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
- virtio_dev_extbuf_free, buf);
- if (unlikely(shinfo == NULL)) {
- rte_free(buf);
- VHOST_LOG_DATA(ERR, "Failed to init shinfo\n");
- return -1;
- }
- }
-
- iova = rte_malloc_virt2iova(buf);
- rte_pktmbuf_attach_extbuf(pkt, buf, iova, buf_len, shinfo);
- rte_pktmbuf_reset_headroom(pkt);
-
- return 0;
-}
-
-/*
- * Allocate a host supported pktmbuf.
- */
-static __rte_always_inline struct rte_mbuf *
-virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
- uint32_t data_len)
-{
- struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
-
- if (unlikely(pkt == NULL)) {
- VHOST_LOG_DATA(ERR,
- "Failed to allocate memory for mbuf.\n");
- return NULL;
- }
-
- if (rte_pktmbuf_tailroom(pkt) >= data_len)
- return pkt;
-
- /* attach an external buffer if supported */
- if (dev->extbuf && !virtio_dev_extbuf_alloc(pkt, data_len))
- return pkt;
-
- /* check if chained buffers are allowed */
- if (!dev->linearbuf)
- return pkt;
-
- /* Data doesn't fit into the buffer and the host supports
- * only linear buffers
- */
- rte_pktmbuf_free(pkt);
-
- return NULL;
-}
-
static __rte_noinline uint16_t
virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
--
2.17.1
^ permalink raw reply [flat|nested] 36+ messages in thread
* [dpdk-dev] [PATCH v3 3/5] vhost: prepare memory regions addresses
2020-10-09 8:14 ` [dpdk-dev] [PATCH v3 " Marvin Liu
2020-10-09 8:14 ` [dpdk-dev] [PATCH v3 1/5] vhost: " Marvin Liu
2020-10-09 8:14 ` [dpdk-dev] [PATCH v3 2/5] vhost: reuse packed ring functions Marvin Liu
@ 2020-10-09 8:14 ` Marvin Liu
2020-10-09 8:14 ` [dpdk-dev] [PATCH v3 4/5] vhost: add packed ring vectorized dequeue Marvin Liu
` (2 subsequent siblings)
5 siblings, 0 replies; 36+ messages in thread
From: Marvin Liu @ 2020-10-09 8:14 UTC (permalink / raw)
To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu
Prepare the guest physical address ranges of the memory regions for the
vectorized data path. This information will be utilized by SIMD
instructions to find the matching region index.
Signed-off-by: Marvin Liu <yong.liu@intel.com>
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 12b7699cf..a19fe9423 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -52,6 +52,8 @@
#define ASYNC_MAX_POLL_SEG 255
+#define MAX_NREGIONS 8
+
#define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST * 2)
#define VHOST_MAX_ASYNC_VEC (BUF_VECTOR_MAX * 2)
@@ -349,6 +351,11 @@ struct inflight_mem_info {
uint64_t size;
};
+struct mem_regions_range {
+ uint64_t regions_low_addrs[MAX_NREGIONS];
+ uint64_t regions_high_addrs[MAX_NREGIONS];
+};
+
/**
* Device structure contains all configuration information relating
* to the device.
@@ -356,6 +363,7 @@ struct inflight_mem_info {
struct virtio_net {
/* Frontend (QEMU) memory and memory region information */
struct rte_vhost_memory *mem;
+ struct mem_regions_range *regions_range;
uint64_t features;
uint64_t protocol_features;
int vid;
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index 4deceb3e0..2d2a2a1a3 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -185,6 +185,11 @@ vhost_backend_cleanup(struct virtio_net *dev)
dev->inflight_info = NULL;
}
+ if (dev->regions_range) {
+ free(dev->regions_range);
+ dev->regions_range = NULL;
+ }
+
if (dev->slave_req_fd >= 0) {
close(dev->slave_req_fd);
dev->slave_req_fd = -1;
@@ -1230,6 +1235,27 @@ vhost_user_set_mem_table(struct virtio_net **pdev, struct VhostUserMsg *msg,
}
}
+ RTE_BUILD_BUG_ON(VHOST_MEMORY_MAX_NREGIONS != 8);
+ if (dev->vectorized) {
+ if (dev->regions_range == NULL) {
+ dev->regions_range = calloc(1,
+ sizeof(struct mem_regions_range));
+ if (!dev->regions_range) {
+ VHOST_LOG_CONFIG(ERR,
+ "failed to alloc dev vectorized area\n");
+ return RTE_VHOST_MSG_RESULT_ERR;
+ }
+ }
+
+ for (i = 0; i < memory->nregions; i++) {
+ dev->regions_range->regions_low_addrs[i] =
+ memory->regions[i].guest_phys_addr;
+ dev->regions_range->regions_high_addrs[i] =
+ memory->regions[i].guest_phys_addr +
+ memory->regions[i].memory_size;
+ }
+ }
+
for (i = 0; i < dev->nr_vring; i++) {
struct vhost_virtqueue *vq = dev->virtqueue[i];
--
2.17.1
^ permalink raw reply [flat|nested] 36+ messages in thread
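To make the purpose of the cached low/high address arrays clearer, here is a scalar sketch of the lookup that the AVX512 code in the following patches performs in parallel for four descriptors. It is an illustration of the idea only (the helper name is made up), not code from the patch.

/* Scalar equivalent of the vectorized region lookup: find the region
 * whose [low, high) range contains the whole buffer and translate the
 * guest physical address into a host virtual address. The AVX512 path
 * performs the same comparison against all regions at once with
 * _mm512_cmp_epi64_mask.
 */
static inline uintptr_t
translate_desc_addr(struct virtio_net *dev, uint64_t addr, uint64_t len)
{
	struct mem_regions_range *range = dev->regions_range;
	uint32_t i;

	for (i = 0; i < dev->mem->nregions; i++) {
		if (addr >= range->regions_low_addrs[i] &&
				addr + len < range->regions_high_addrs[i])
			return addr +
				dev->mem->regions[i].host_user_addr -
				dev->mem->regions[i].guest_phys_addr;
	}

	return 0; /* buffer does not fit into a single region */
}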
* [dpdk-dev] [PATCH v3 4/5] vhost: add packed ring vectorized dequeue
2020-10-09 8:14 ` [dpdk-dev] [PATCH v3 " Marvin Liu
` (2 preceding siblings ...)
2020-10-09 8:14 ` [dpdk-dev] [PATCH v3 3/5] vhost: prepare memory regions addresses Marvin Liu
@ 2020-10-09 8:14 ` Marvin Liu
2020-10-09 8:14 ` [dpdk-dev] [PATCH v3 5/5] vhost: add packed ring vectorized enqueue Marvin Liu
2020-10-12 8:21 ` [dpdk-dev] [PATCH v3 0/5] vhost add vectorized data path Maxime Coquelin
5 siblings, 0 replies; 36+ messages in thread
From: Marvin Liu @ 2020-10-09 8:14 UTC (permalink / raw)
To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu
Optimize the vhost packed ring dequeue path with SIMD instructions. The
status check and write-back of four descriptors are batch-handled with
AVX512 instructions. Address translation operations are also accelerated
by AVX512 instructions.
If the platform or compiler does not support vectorization, the code will
fall back to the default path.
Signed-off-by: Marvin Liu <yong.liu@intel.com>
diff --git a/lib/librte_vhost/meson.build b/lib/librte_vhost/meson.build
index cc9aa65c6..5eadcbae4 100644
--- a/lib/librte_vhost/meson.build
+++ b/lib/librte_vhost/meson.build
@@ -8,6 +8,22 @@ endif
if has_libnuma == 1
dpdk_conf.set10('RTE_LIBRTE_VHOST_NUMA', true)
endif
+
+if arch_subdir == 'x86'
+ if not machine_args.contains('-mno-avx512f')
+ if cc.has_argument('-mavx512f') and cc.has_argument('-mavx512vl') and cc.has_argument('-mavx512bw')
+ cflags += ['-DCC_AVX512_SUPPORT']
+ vhost_avx512_lib = static_library('vhost_avx512_lib',
+ 'virtio_net_avx.c',
+ dependencies: [static_rte_eal, static_rte_mempool,
+ static_rte_mbuf, static_rte_ethdev, static_rte_net],
+ include_directories: includes,
+ c_args: [cflags, '-mavx512f', '-mavx512bw', '-mavx512vl'])
+ objs += vhost_avx512_lib.extract_objects('virtio_net_avx.c')
+ endif
+ endif
+endif
+
if (toolchain == 'gcc' and cc.version().version_compare('>=8.3.0'))
cflags += '-DVHOST_GCC_UNROLL_PRAGMA'
elif (toolchain == 'clang' and cc.version().version_compare('>=3.7.0'))
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index a19fe9423..b270c424b 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -1100,4 +1100,15 @@ virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
return NULL;
}
+
+int
+vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
+ struct vhost_virtqueue *vq,
+ struct rte_mempool *mbuf_pool,
+ struct rte_mbuf **pkts,
+ uint16_t avail_idx,
+ uintptr_t *desc_addrs,
+ uint16_t *ids);
+
+
#endif /* _VHOST_NET_CDEV_H_ */
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 9757ed053..3bc6b9b20 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -2136,6 +2136,28 @@ vhost_reserve_avail_batch_packed(struct virtio_net *dev,
return -1;
}
+static __rte_always_inline int
+vhost_handle_avail_batch_packed(struct virtio_net *dev,
+ struct vhost_virtqueue *vq,
+ struct rte_mempool *mbuf_pool,
+ struct rte_mbuf **pkts,
+ uint16_t avail_idx,
+ uintptr_t *desc_addrs,
+ uint16_t *ids)
+{
+#ifdef CC_AVX512_SUPPORT
+ if (unlikely(dev->vectorized))
+ return vhost_reserve_avail_batch_packed_avx(dev, vq, mbuf_pool,
+ pkts, avail_idx, desc_addrs, ids);
+ else
+ return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool,
+ pkts, avail_idx, desc_addrs, ids);
+#else
+ return vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
+ avail_idx, desc_addrs, ids);
+#endif
+}
+
static __rte_always_inline int
virtio_dev_tx_batch_packed(struct virtio_net *dev,
struct vhost_virtqueue *vq,
@@ -2148,8 +2170,9 @@ virtio_dev_tx_batch_packed(struct virtio_net *dev,
uint16_t ids[PACKED_BATCH_SIZE];
uint16_t i;
- if (vhost_reserve_avail_batch_packed(dev, vq, mbuf_pool, pkts,
- avail_idx, desc_addrs, ids))
+
+ if (vhost_handle_avail_batch_packed(dev, vq, mbuf_pool, pkts,
+ avail_idx, desc_addrs, ids))
return -1;
vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
diff --git a/lib/librte_vhost/virtio_net_avx.c b/lib/librte_vhost/virtio_net_avx.c
new file mode 100644
index 000000000..e10b2a285
--- /dev/null
+++ b/lib/librte_vhost/virtio_net_avx.c
@@ -0,0 +1,184 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2016 Intel Corporation
+ */
+#include <stdint.h>
+
+#include "vhost.h"
+
+#define BYTE_SIZE 8
+/* reference count offset in mbuf rearm data */
+#define REFCNT_BITS_OFFSET ((offsetof(struct rte_mbuf, refcnt) - \
+ offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
+/* segment number offset in mbuf rearm data */
+#define SEG_NUM_BITS_OFFSET ((offsetof(struct rte_mbuf, nb_segs) - \
+ offsetof(struct rte_mbuf, rearm_data)) * BYTE_SIZE)
+
+/* default rearm data */
+#define DEFAULT_REARM_DATA (1ULL << SEG_NUM_BITS_OFFSET | \
+ 1ULL << REFCNT_BITS_OFFSET)
+
+#define DESC_FLAGS_SHORT_OFFSET (offsetof(struct vring_packed_desc, flags) / \
+ sizeof(uint16_t))
+
+#define DESC_FLAGS_SHORT_SIZE (sizeof(struct vring_packed_desc) / \
+ sizeof(uint16_t))
+#define BATCH_FLAGS_MASK (1 << DESC_FLAGS_SHORT_OFFSET | \
+ 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE) | \
+ 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 2) | \
+ 1 << (DESC_FLAGS_SHORT_OFFSET + DESC_FLAGS_SHORT_SIZE * 3))
+
+#define FLAGS_BITS_OFFSET ((offsetof(struct vring_packed_desc, flags) - \
+ offsetof(struct vring_packed_desc, len)) * BYTE_SIZE)
+
+#define PACKED_FLAGS_MASK ((0ULL | VRING_DESC_F_AVAIL | VRING_DESC_F_USED) \
+ << FLAGS_BITS_OFFSET)
+#define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) << FLAGS_BITS_OFFSET)
+#define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
+ FLAGS_BITS_OFFSET)
+
+#define DESC_FLAGS_POS 0xaa
+#define MBUF_LENS_POS 0x6666
+
+int
+vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
+ struct vhost_virtqueue *vq,
+ struct rte_mempool *mbuf_pool,
+ struct rte_mbuf **pkts,
+ uint16_t avail_idx,
+ uintptr_t *desc_addrs,
+ uint16_t *ids)
+{
+ struct vring_packed_desc *descs = vq->desc_packed;
+ uint32_t descs_status;
+ void *desc_addr;
+ uint16_t i;
+ uint8_t cmp_low, cmp_high, cmp_result;
+ uint64_t lens[PACKED_BATCH_SIZE];
+ struct virtio_net_hdr *hdr;
+
+ if (unlikely(avail_idx & PACKED_BATCH_MASK))
+ return -1;
+ if (unlikely((avail_idx + PACKED_BATCH_SIZE) > vq->size))
+ return -1;
+
+ /* load 4 descs */
+ desc_addr = &vq->desc_packed[avail_idx];
+ __m512i desc_vec = _mm512_loadu_si512(desc_addr);
+
+ /* burst check four status */
+ __m512i avail_flag_vec;
+ if (vq->avail_wrap_counter)
+#if defined(RTE_ARCH_I686)
+ avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG, 0x0,
+ PACKED_FLAGS_MASK, 0x0);
+#else
+ avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+ PACKED_AVAIL_FLAG);
+
+#endif
+ else
+#if defined(RTE_ARCH_I686)
+ avail_flag_vec = _mm512_set4_epi64(PACKED_AVAIL_FLAG_WRAP,
+ 0x0, PACKED_AVAIL_FLAG_WRAP, 0x0);
+#else
+ avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+ PACKED_AVAIL_FLAG_WRAP);
+#endif
+
+ descs_status = _mm512_cmp_epu16_mask(desc_vec, avail_flag_vec,
+ _MM_CMPINT_NE);
+ if (descs_status & BATCH_FLAGS_MASK)
+ return -1;
+
+ if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) {
+ vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+ uint64_t size = (uint64_t)descs[avail_idx + i].len;
+ desc_addrs[i] = __vhost_iova_to_vva(dev, vq,
+ descs[avail_idx + i].addr, &size,
+ VHOST_ACCESS_RO);
+
+ if (!desc_addrs[i])
+ goto free_buf;
+ lens[i] = descs[avail_idx + i].len;
+ rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
+
+ pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool,
+ lens[i]);
+ if (!pkts[i])
+ goto free_buf;
+ }
+ } else {
+ /* check buffer fit into one region & translate address */
+ struct mem_regions_range *range = dev->regions_range;
+ __m512i regions_low_addrs =
+ _mm512_loadu_si512((void *)&range->regions_low_addrs);
+ __m512i regions_high_addrs =
+ _mm512_loadu_si512((void *)&range->regions_high_addrs);
+ vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+ uint64_t addr_low = descs[avail_idx + i].addr;
+ uint64_t addr_high = addr_low +
+ descs[avail_idx + i].len;
+ __m512i low_addr_vec = _mm512_set1_epi64(addr_low);
+ __m512i high_addr_vec = _mm512_set1_epi64(addr_high);
+
+ cmp_low = _mm512_cmp_epi64_mask(low_addr_vec,
+ regions_low_addrs, _MM_CMPINT_NLT);
+ cmp_high = _mm512_cmp_epi64_mask(high_addr_vec,
+ regions_high_addrs, _MM_CMPINT_LT);
+ cmp_result = cmp_low & cmp_high;
+ int index = __builtin_ctz(cmp_result);
+ if (unlikely((uint32_t)index >= dev->mem->nregions))
+ goto free_buf;
+
+ desc_addrs[i] = addr_low +
+ dev->mem->regions[index].host_user_addr -
+ dev->mem->regions[index].guest_phys_addr;
+ lens[i] = descs[avail_idx + i].len;
+ rte_prefetch0((void *)(uintptr_t)desc_addrs[i]);
+
+ pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool,
+ lens[i]);
+ if (!pkts[i])
+ goto free_buf;
+ }
+ }
+
+ if (virtio_net_with_host_offload(dev)) {
+ vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+ hdr = (struct virtio_net_hdr *)(desc_addrs[i]);
+ vhost_dequeue_offload(hdr, pkts[i]);
+ }
+ }
+
+ if (virtio_net_is_inorder(dev)) {
+ ids[PACKED_BATCH_SIZE - 1] =
+ descs[avail_idx + PACKED_BATCH_SIZE - 1].id;
+ } else {
+ vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE)
+ ids[i] = descs[avail_idx + i].id;
+ }
+
+ uint64_t addrs[PACKED_BATCH_SIZE << 1];
+ /* store mbuf data_len, pkt_len */
+ vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+ addrs[i << 1] = (uint64_t)pkts[i]->rx_descriptor_fields1;
+ addrs[(i << 1) + 1] = (uint64_t)pkts[i]->rx_descriptor_fields1
+ + sizeof(uint64_t);
+ }
+
+ /* save pkt_len and data_len into mbufs */
+ __m512i value_vec = _mm512_maskz_shuffle_epi32(MBUF_LENS_POS, desc_vec,
+ 0xAA);
+ __m512i offsets_vec = _mm512_maskz_set1_epi32(MBUF_LENS_POS,
+ (uint32_t)-12);
+ value_vec = _mm512_add_epi32(value_vec, offsets_vec);
+ __m512i vindex = _mm512_loadu_si512((void *)addrs);
+ _mm512_i64scatter_epi64(0, vindex, value_vec, 1);
+
+ return 0;
+free_buf:
+ for (i = 0; i < PACKED_BATCH_SIZE; i++)
+ rte_pktmbuf_free(pkts[i]);
+
+ return -1;
+}
--
2.17.1
^ permalink raw reply [flat|nested] 36+ messages in thread
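As a reading aid for the flag handling in this patch, the scalar sketch below expresses the availability condition that the single 64-byte load plus masked 16-bit compare verifies for a dequeue batch. The helper name is made up; note that the AVX512 compare is in fact stricter, since it matches the whole flags word rather than only the AVAIL/USED bits.

/* Scalar view of the batched availability check: a batch of four packed
 * descriptors is consumable only when every descriptor's AVAIL/USED bits
 * match the ring's current wrap state.
 */
static inline int
batch_descs_available(struct vhost_virtqueue *vq, uint16_t avail_idx)
{
	uint16_t expected = vq->avail_wrap_counter ?
			VRING_DESC_F_AVAIL : VRING_DESC_F_USED;
	uint16_t i;

	for (i = 0; i < PACKED_BATCH_SIZE; i++) {
		uint16_t flags = vq->desc_packed[avail_idx + i].flags;

		if ((flags & (VRING_DESC_F_AVAIL | VRING_DESC_F_USED)) !=
				expected)
			return 0;
	}

	return 1;
}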
* [dpdk-dev] [PATCH v3 5/5] vhost: add packed ring vectorized enqueue
2020-10-09 8:14 ` [dpdk-dev] [PATCH v3 " Marvin Liu
` (3 preceding siblings ...)
2020-10-09 8:14 ` [dpdk-dev] [PATCH v3 4/5] vhost: add packed ring vectorized dequeue Marvin Liu
@ 2020-10-09 8:14 ` Marvin Liu
2020-10-12 8:21 ` [dpdk-dev] [PATCH v3 0/5] vhost add vectorized data path Maxime Coquelin
5 siblings, 0 replies; 36+ messages in thread
From: Marvin Liu @ 2020-10-09 8:14 UTC (permalink / raw)
To: maxime.coquelin, chenbo.xia, zhihong.wang; +Cc: dev, Marvin Liu
Optimize the vhost packed ring enqueue path with SIMD instructions. The
status and length of four descriptors are batch-handled with AVX512
instructions. Address translation operations are also accelerated
by AVX512 instructions.
Signed-off-by: Marvin Liu <yong.liu@intel.com>
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index b270c424b..84dc289e9 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -1110,5 +1110,9 @@ vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
uintptr_t *desc_addrs,
uint16_t *ids);
+int
+virtio_dev_rx_batch_packed_avx(struct virtio_net *dev,
+ struct vhost_virtqueue *vq,
+ struct rte_mbuf **pkts);
#endif /* _VHOST_NET_CDEV_H_ */
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 3bc6b9b20..3e49c88ac 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -1354,6 +1354,22 @@ virtio_dev_rx_single_packed(struct virtio_net *dev,
return 0;
}
+static __rte_always_inline int
+virtio_dev_rx_handle_batch_packed(struct virtio_net *dev,
+ struct vhost_virtqueue *vq,
+ struct rte_mbuf **pkts)
+
+{
+#ifdef CC_AVX512_SUPPORT
+ if (unlikely(dev->vectorized))
+ return virtio_dev_rx_batch_packed_avx(dev, vq, pkts);
+ else
+ return virtio_dev_rx_batch_packed(dev, vq, pkts);
+#else
+ return virtio_dev_rx_batch_packed(dev, vq, pkts);
+#endif
+}
+
static __rte_noinline uint32_t
virtio_dev_rx_packed(struct virtio_net *dev,
struct vhost_virtqueue *__rte_restrict vq,
@@ -1367,8 +1383,8 @@ virtio_dev_rx_packed(struct virtio_net *dev,
rte_prefetch0(&vq->desc_packed[vq->last_avail_idx]);
if (remained >= PACKED_BATCH_SIZE) {
- if (!virtio_dev_rx_batch_packed(dev, vq,
- &pkts[pkt_idx])) {
+ if (!virtio_dev_rx_handle_batch_packed(dev, vq,
+ &pkts[pkt_idx])) {
pkt_idx += PACKED_BATCH_SIZE;
remained -= PACKED_BATCH_SIZE;
continue;
diff --git a/lib/librte_vhost/virtio_net_avx.c b/lib/librte_vhost/virtio_net_avx.c
index e10b2a285..aa47b15ae 100644
--- a/lib/librte_vhost/virtio_net_avx.c
+++ b/lib/librte_vhost/virtio_net_avx.c
@@ -35,9 +35,15 @@
#define PACKED_AVAIL_FLAG ((0ULL | VRING_DESC_F_AVAIL) << FLAGS_BITS_OFFSET)
#define PACKED_AVAIL_FLAG_WRAP ((0ULL | VRING_DESC_F_USED) << \
FLAGS_BITS_OFFSET)
+#define PACKED_WRITE_AVAIL_FLAG (PACKED_AVAIL_FLAG | \
+ ((0ULL | VRING_DESC_F_WRITE) << FLAGS_BITS_OFFSET))
+#define PACKED_WRITE_AVAIL_FLAG_WRAP (PACKED_AVAIL_FLAG_WRAP | \
+ ((0ULL | VRING_DESC_F_WRITE) << FLAGS_BITS_OFFSET))
#define DESC_FLAGS_POS 0xaa
#define MBUF_LENS_POS 0x6666
+#define DESC_LENS_POS 0x4444
+#define DESC_LENS_FLAGS_POS 0xB0B0B0B0
int
vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
@@ -182,3 +188,157 @@ vhost_reserve_avail_batch_packed_avx(struct virtio_net *dev,
return -1;
}
+
+int
+virtio_dev_rx_batch_packed_avx(struct virtio_net *dev,
+ struct vhost_virtqueue *vq,
+ struct rte_mbuf **pkts)
+{
+ struct vring_packed_desc *descs = vq->desc_packed;
+ uint16_t avail_idx = vq->last_avail_idx;
+ uint64_t desc_addrs[PACKED_BATCH_SIZE];
+ uint32_t buf_offset = dev->vhost_hlen;
+ uint32_t desc_status;
+ uint64_t lens[PACKED_BATCH_SIZE];
+ uint16_t i;
+ void *desc_addr;
+ uint8_t cmp_low, cmp_high, cmp_result;
+
+ if (unlikely(avail_idx & PACKED_BATCH_MASK))
+ return -1;
+ if (unlikely((avail_idx + PACKED_BATCH_SIZE) > vq->size))
+ return -1;
+
+ /* check refcnt and nb_segs */
+ __m256i mbuf_ref = _mm256_set1_epi64x(DEFAULT_REARM_DATA);
+
+ /* load four mbufs rearm data */
+ __m256i mbufs = _mm256_set_epi64x(
+ *pkts[3]->rearm_data,
+ *pkts[2]->rearm_data,
+ *pkts[1]->rearm_data,
+ *pkts[0]->rearm_data);
+
+ uint16_t cmp = _mm256_cmpneq_epu16_mask(mbufs, mbuf_ref);
+ if (cmp & MBUF_LENS_POS)
+ return -1;
+
+ /* check desc status */
+ desc_addr = &vq->desc_packed[avail_idx];
+ __m512i desc_vec = _mm512_loadu_si512(desc_addr);
+
+ __m512i avail_flag_vec;
+ __m512i used_flag_vec;
+ if (vq->avail_wrap_counter) {
+#if defined(RTE_ARCH_I686)
+ avail_flag_vec = _mm512_set4_epi64(PACKED_WRITE_AVAIL_FLAG,
+ 0x0, PACKED_WRITE_AVAIL_FLAG, 0x0);
+ used_flag_vec = _mm512_set4_epi64(PACKED_FLAGS_MASK, 0x0,
+ PACKED_FLAGS_MASK, 0x0);
+#else
+ avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+ PACKED_WRITE_AVAIL_FLAG);
+ used_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+ PACKED_FLAGS_MASK);
+#endif
+ } else {
+#if defined(RTE_ARCH_I686)
+ avail_flag_vec = _mm512_set4_epi64(
+ PACKED_WRITE_AVAIL_FLAG_WRAP, 0x0,
+ PACKED_WRITE_AVAIL_FLAG, 0x0);
+ used_flag_vec = _mm512_set4_epi64(0x0, 0x0, 0x0, 0x0);
+#else
+ avail_flag_vec = _mm512_maskz_set1_epi64(DESC_FLAGS_POS,
+ PACKED_WRITE_AVAIL_FLAG_WRAP);
+ used_flag_vec = _mm512_setzero_epi32();
+#endif
+ }
+
+ desc_status = _mm512_mask_cmp_epu16_mask(BATCH_FLAGS_MASK, desc_vec,
+ avail_flag_vec, _MM_CMPINT_NE);
+ if (desc_status)
+ return -1;
+
+ if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) {
+ vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+ uint64_t size = (uint64_t)descs[avail_idx + i].len;
+ desc_addrs[i] = __vhost_iova_to_vva(dev, vq,
+ descs[avail_idx + i].addr, &size,
+ VHOST_ACCESS_RW);
+
+ if (!desc_addrs[i])
+ return -1;
+
+ rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[i], void *,
+ 0));
+ }
+ } else {
+ /* check buffer fit into one region & translate address */
+ struct mem_regions_range *range = dev->regions_range;
+ __m512i regions_low_addrs =
+ _mm512_loadu_si512((void *)&range->regions_low_addrs);
+ __m512i regions_high_addrs =
+ _mm512_loadu_si512((void *)&range->regions_high_addrs);
+ vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+ uint64_t addr_low = descs[avail_idx + i].addr;
+ uint64_t addr_high = addr_low +
+ descs[avail_idx + i].len;
+ __m512i low_addr_vec = _mm512_set1_epi64(addr_low);
+ __m512i high_addr_vec = _mm512_set1_epi64(addr_high);
+
+ cmp_low = _mm512_cmp_epi64_mask(low_addr_vec,
+ regions_low_addrs, _MM_CMPINT_NLT);
+ cmp_high = _mm512_cmp_epi64_mask(high_addr_vec,
+ regions_high_addrs, _MM_CMPINT_LT);
+ cmp_result = cmp_low & cmp_high;
+ int index = __builtin_ctz(cmp_result);
+ if (unlikely((uint32_t)index >= dev->mem->nregions))
+ return -1;
+
+ desc_addrs[i] = addr_low +
+ dev->mem->regions[index].host_user_addr -
+ dev->mem->regions[index].guest_phys_addr;
+ rte_prefetch0(rte_pktmbuf_mtod_offset(pkts[i], void *,
+ 0));
+ }
+ }
+
+ /* check length is enough */
+ __m512i pkt_lens = _mm512_set_epi32(
+ 0, pkts[3]->pkt_len, 0, 0,
+ 0, pkts[2]->pkt_len, 0, 0,
+ 0, pkts[1]->pkt_len, 0, 0,
+ 0, pkts[0]->pkt_len, 0, 0);
+
+ __m512i mbuf_len_offset = _mm512_maskz_set1_epi32(DESC_LENS_POS,
+ dev->vhost_hlen);
+ __m512i buf_len_vec = _mm512_add_epi32(pkt_lens, mbuf_len_offset);
+ uint16_t lens_cmp = _mm512_mask_cmp_epu32_mask(DESC_LENS_POS,
+ desc_vec, buf_len_vec, _MM_CMPINT_LT);
+ if (lens_cmp)
+ return -1;
+
+ vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+ rte_memcpy((void *)(uintptr_t)(desc_addrs[i] + buf_offset),
+ rte_pktmbuf_mtod_offset(pkts[i], void *, 0),
+ pkts[i]->pkt_len);
+ }
+
+ if (unlikely((dev->features & (1ULL << VHOST_F_LOG_ALL)))) {
+ vhost_for_each_try_unroll(i, 0, PACKED_BATCH_SIZE) {
+ lens[i] = descs[avail_idx + i].len;
+ vhost_log_cache_write_iova(dev, vq,
+ descs[avail_idx + i].addr, lens[i]);
+ }
+ }
+
+ vq_inc_last_avail_packed(vq, PACKED_BATCH_SIZE);
+ vq_inc_last_used_packed(vq, PACKED_BATCH_SIZE);
+ /* save len and flags, skip addr and id */
+ __m512i desc_updated = _mm512_mask_add_epi16(desc_vec,
+ DESC_LENS_FLAGS_POS, buf_len_vec,
+ used_flag_vec);
+ _mm512_storeu_si512(desc_addr, desc_updated);
+
+ return 0;
+}
--
2.17.1
^ permalink raw reply [flat|nested] 36+ messages in thread
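For the enqueue side, a scalar sketch of the per-descriptor work that the masked compare, add and single 64-byte store replace is given below. It only illustrates the intent; the helper name is made up and error handling is reduced to a return code.

/* Scalar view of the batched enqueue write-back: each mbuf must fit in
 * its descriptor (pkt_len plus the virtio-net header), after which the
 * used length and the wrap-dependent "used" flags are written back.
 */
static inline int
check_and_write_back_one(struct virtio_net *dev, struct vhost_virtqueue *vq,
		uint16_t idx, struct rte_mbuf *pkt, uint16_t used_flags)
{
	struct vring_packed_desc *desc = &vq->desc_packed[idx];
	uint32_t buf_len = pkt->pkt_len + dev->vhost_hlen;

	if (desc->len < buf_len)
		return -1;

	desc->len = buf_len;
	desc->flags = used_flags;

	return 0;
}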
* Re: [dpdk-dev] [PATCH v3 0/5] vhost add vectorized data path
2020-10-09 8:14 ` [dpdk-dev] [PATCH v3 " Marvin Liu
` (4 preceding siblings ...)
2020-10-09 8:14 ` [dpdk-dev] [PATCH v3 5/5] vhost: add packed ring vectorized enqueue Marvin Liu
@ 2020-10-12 8:21 ` Maxime Coquelin
2020-10-12 9:10 ` Liu, Yong
2020-10-15 15:28 ` Liu, Yong
5 siblings, 2 replies; 36+ messages in thread
From: Maxime Coquelin @ 2020-10-12 8:21 UTC (permalink / raw)
To: Marvin Liu, chenbo.xia, zhihong.wang; +Cc: dev
Hi Marvin,
On 10/9/20 10:14 AM, Marvin Liu wrote:
> Packed ring format is imported since virtio spec 1.1. All descriptors
> are compacted into one single ring when packed ring format is on. It is
> straight forward that ring operations can be accelerated by utilizing
> SIMD instructions.
>
> This patch set will introduce vectorized data path in vhost library. If
> vectorized option is on, operations like descs check, descs writeback,
> address translation will be accelerated by SIMD instructions. On skylake
> server, it can bring 6% performance gain in loopback case and around 4%
> performance gain in PvP case.
IMHO, 4% gain on PVP is not a significant gain if we compare to the
added complexity. Moreover, I guess this is 4% gain with testpmd-based
PVP? If this is the case it may be even lower with OVS-DPDK PVP
benchmark, I will try to do a benchmark this week.
Thanks,
Maxime
> Vhost application can choose whether using vectorized acceleration, just
> like external buffer feature. If platform or ring format not support
> vectorized function, vhost will fallback to use default batch function.
> There will be no impact in current data path.
>
> v3:
> * rename vectorized datapath file
> * eliminate the impact when avx512 disabled
> * dynamically allocate memory regions structure
> * remove unlikely hint for in_order
>
> v2:
> * add vIOMMU support
> * add dequeue offloading
> * rebase code
>
> Marvin Liu (5):
> vhost: add vectorized data path
> vhost: reuse packed ring functions
> vhost: prepare memory regions addresses
> vhost: add packed ring vectorized dequeue
> vhost: add packed ring vectorized enqueue
>
> doc/guides/nics/vhost.rst | 5 +
> doc/guides/prog_guide/vhost_lib.rst | 12 +
> drivers/net/vhost/rte_eth_vhost.c | 17 +-
> lib/librte_vhost/meson.build | 16 ++
> lib/librte_vhost/rte_vhost.h | 1 +
> lib/librte_vhost/socket.c | 5 +
> lib/librte_vhost/vhost.c | 11 +
> lib/librte_vhost/vhost.h | 239 +++++++++++++++++++
> lib/librte_vhost/vhost_user.c | 26 +++
> lib/librte_vhost/virtio_net.c | 258 ++++-----------------
> lib/librte_vhost/virtio_net_avx.c | 344 ++++++++++++++++++++++++++++
> 11 files changed, 718 insertions(+), 216 deletions(-)
> create mode 100644 lib/librte_vhost/virtio_net_avx.c
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v3 0/5] vhost add vectorized data path
2020-10-12 8:21 ` [dpdk-dev] [PATCH v3 0/5] vhost add vectorized data path Maxime Coquelin
@ 2020-10-12 9:10 ` Liu, Yong
2020-10-12 9:57 ` Maxime Coquelin
2020-10-15 15:28 ` Liu, Yong
1 sibling, 1 reply; 36+ messages in thread
From: Liu, Yong @ 2020-10-12 9:10 UTC (permalink / raw)
To: Maxime Coquelin, Xia, Chenbo, Wang, Zhihong; +Cc: dev
> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Monday, October 12, 2020 4:22 PM
> To: Liu, Yong <yong.liu@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v3 0/5] vhost add vectorized data path
>
> Hi Marvin,
>
> On 10/9/20 10:14 AM, Marvin Liu wrote:
> > Packed ring format is imported since virtio spec 1.1. All descriptors
> > are compacted into one single ring when packed ring format is on. It is
> > straight forward that ring operations can be accelerated by utilizing
> > SIMD instructions.
> >
> > This patch set will introduce vectorized data path in vhost library. If
> > vectorized option is on, operations like descs check, descs writeback,
> > address translation will be accelerated by SIMD instructions. On skylake
> > server, it can bring 6% performance gain in loopback case and around 4%
> > performance gain in PvP case.
>
> IMHO, 4% gain on PVP is not a significant gain if we compare to the
> added complexity. Moreover, I guess this is 4% gain with testpmd-based
> PVP? If this is the case it may be even lower with OVS-DPDK PVP
> benchmark, I will try to do a benchmark this week.
>
Maxime,
I have observed around 3% gain with OVS-DPDK with the first version. But the number is not reliable as the datapath has changed since then.
I will try again after fixing the OVS integration issue with the latest DPDK.
> Thanks,
> Maxime
>
> > Vhost application can choose whether using vectorized acceleration, just
> > like external buffer feature. If platform or ring format not support
> > vectorized function, vhost will fallback to use default batch function.
> > There will be no impact in current data path.
> >
> > v3:
> > * rename vectorized datapath file
> > * eliminate the impact when avx512 disabled
> > * dynamically allocate memory regions structure
> > * remove unlikely hint for in_order
> >
> > v2:
> > * add vIOMMU support
> > * add dequeue offloading
> > * rebase code
> >
> > Marvin Liu (5):
> > vhost: add vectorized data path
> > vhost: reuse packed ring functions
> > vhost: prepare memory regions addresses
> > vhost: add packed ring vectorized dequeue
> > vhost: add packed ring vectorized enqueue
> >
> > doc/guides/nics/vhost.rst | 5 +
> > doc/guides/prog_guide/vhost_lib.rst | 12 +
> > drivers/net/vhost/rte_eth_vhost.c | 17 +-
> > lib/librte_vhost/meson.build | 16 ++
> > lib/librte_vhost/rte_vhost.h | 1 +
> > lib/librte_vhost/socket.c | 5 +
> > lib/librte_vhost/vhost.c | 11 +
> > lib/librte_vhost/vhost.h | 239 +++++++++++++++++++
> > lib/librte_vhost/vhost_user.c | 26 +++
> > lib/librte_vhost/virtio_net.c | 258 ++++-----------------
> > lib/librte_vhost/virtio_net_avx.c | 344 ++++++++++++++++++++++++++++
> > 11 files changed, 718 insertions(+), 216 deletions(-)
> > create mode 100644 lib/librte_vhost/virtio_net_avx.c
> >
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v3 0/5] vhost add vectorized data path
2020-10-12 9:10 ` Liu, Yong
@ 2020-10-12 9:57 ` Maxime Coquelin
2020-10-12 13:24 ` Liu, Yong
0 siblings, 1 reply; 36+ messages in thread
From: Maxime Coquelin @ 2020-10-12 9:57 UTC (permalink / raw)
To: Liu, Yong, Xia, Chenbo, Wang, Zhihong; +Cc: dev
Hi Marvin,
On 10/12/20 11:10 AM, Liu, Yong wrote:
>
>
>> -----Original Message-----
>> From: Maxime Coquelin <maxime.coquelin@redhat.com>
>> Sent: Monday, October 12, 2020 4:22 PM
>> To: Liu, Yong <yong.liu@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>;
>> Wang, Zhihong <zhihong.wang@intel.com>
>> Cc: dev@dpdk.org
>> Subject: Re: [PATCH v3 0/5] vhost add vectorized data path
>>
>> Hi Marvin,
>>
>> On 10/9/20 10:14 AM, Marvin Liu wrote:
>>> Packed ring format is imported since virtio spec 1.1. All descriptors
>>> are compacted into one single ring when packed ring format is on. It is
>>> straight forward that ring operations can be accelerated by utilizing
>>> SIMD instructions.
>>>
>>> This patch set will introduce vectorized data path in vhost library. If
>>> vectorized option is on, operations like descs check, descs writeback,
>>> address translation will be accelerated by SIMD instructions. On skylake
>>> server, it can bring 6% performance gain in loopback case and around 4%
>>> performance gain in PvP case.
>>
>> IMHO, 4% gain on PVP is not a significant gain if we compare to the
>> added complexity. Moreover, I guess this is 4% gain with testpmd-based
>> PVP? If this is the case it may be even lower with OVS-DPDK PVP
>> benchmark, I will try to do a benchmark this week.
>>
>
> Maxime,
> I have observed around 3% gain with OVS-DPDK with the first version. But the number is not reliable as the datapath has changed since then.
> I will try again after fixing the OVS integration issue with the latest DPDK.
Thanks for the information.
Also, wouldn't using AVX512 lower the CPU frequency?
If so, could it have an impact on the workload running on the other
CPUs?
Thanks,
Maxime
>> Thanks,
>> Maxime
>>
>>> Vhost application can choose whether using vectorized acceleration, just
>>> like external buffer feature. If platform or ring format not support
>>> vectorized function, vhost will fallback to use default batch function.
>>> There will be no impact in current data path.
>>>
>>> v3:
>>> * rename vectorized datapath file
>>> * eliminate the impact when avx512 disabled
>>> * dynamically allocate memory regions structure
>>> * remove unlikely hint for in_order
>>>
>>> v2:
>>> * add vIOMMU support
>>> * add dequeue offloading
>>> * rebase code
>>>
>>> Marvin Liu (5):
>>> vhost: add vectorized data path
>>> vhost: reuse packed ring functions
>>> vhost: prepare memory regions addresses
>>> vhost: add packed ring vectorized dequeue
>>> vhost: add packed ring vectorized enqueue
>>>
>>> doc/guides/nics/vhost.rst | 5 +
>>> doc/guides/prog_guide/vhost_lib.rst | 12 +
>>> drivers/net/vhost/rte_eth_vhost.c | 17 +-
>>> lib/librte_vhost/meson.build | 16 ++
>>> lib/librte_vhost/rte_vhost.h | 1 +
>>> lib/librte_vhost/socket.c | 5 +
>>> lib/librte_vhost/vhost.c | 11 +
>>> lib/librte_vhost/vhost.h | 239 +++++++++++++++++++
>>> lib/librte_vhost/vhost_user.c | 26 +++
>>> lib/librte_vhost/virtio_net.c | 258 ++++-----------------
>>> lib/librte_vhost/virtio_net_avx.c | 344 ++++++++++++++++++++++++++++
>>> 11 files changed, 718 insertions(+), 216 deletions(-)
>>> create mode 100644 lib/librte_vhost/virtio_net_avx.c
>>>
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v3 0/5] vhost add vectorized data path
2020-10-12 9:57 ` Maxime Coquelin
@ 2020-10-12 13:24 ` Liu, Yong
0 siblings, 0 replies; 36+ messages in thread
From: Liu, Yong @ 2020-10-12 13:24 UTC (permalink / raw)
To: Maxime Coquelin, Xia, Chenbo, Wang, Zhihong; +Cc: dev
> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Monday, October 12, 2020 5:57 PM
> To: Liu, Yong <yong.liu@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v3 0/5] vhost add vectorized data path
>
> Hi Marvin,
>
> On 10/12/20 11:10 AM, Liu, Yong wrote:
> >
> >
> >> -----Original Message-----
> >> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> >> Sent: Monday, October 12, 2020 4:22 PM
> >> To: Liu, Yong <yong.liu@intel.com>; Xia, Chenbo
> <chenbo.xia@intel.com>;
> >> Wang, Zhihong <zhihong.wang@intel.com>
> >> Cc: dev@dpdk.org
> >> Subject: Re: [PATCH v3 0/5] vhost add vectorized data path
> >>
> >> Hi Marvin,
> >>
> >> On 10/9/20 10:14 AM, Marvin Liu wrote:
> >>> Packed ring format is imported since virtio spec 1.1. All descriptors
> >>> are compacted into one single ring when packed ring format is on. It is
> >>> straight forward that ring operations can be accelerated by utilizing
> >>> SIMD instructions.
> >>>
> >>> This patch set will introduce vectorized data path in vhost library. If
> >>> vectorized option is on, operations like descs check, descs writeback,
> >>> address translation will be accelerated by SIMD instructions. On skylake
> >>> server, it can bring 6% performance gain in loopback case and around 4%
> >>> performance gain in PvP case.
> >>
> >> IMHO, 4% gain on PVP is not a significant gain if we compare to the
> >> added complexity. Moreover, I guess this is 4% gain with testpmd-based
> >> PVP? If this is the case it may be even lower with OVS-DPDK PVP
> >> benchmark, I will try to do a benchmark this week.
> >>
> >
> > Maxime,
> > I have observed around 3% gain with OVS-DPDK with the first version. But the
> number is not reliable as the datapath has changed since then.
> > I will try again after fixing the OVS integration issue with the latest DPDK.
>
> Thanks for the information.
>
> Also, wouldn't using AVX512 lower the CPU frequency?
> If so, could it have an impact on the workload running on the other
> CPUs?
>
All AVX512 instructions used in vhost are lightweight ones, so the CPU frequency won't be affected.
Theoretically, system performance won't be affected as long as only lightweight instructions are used.
Thanks.
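
As an aside on the platform-support question in this sub-thread: a path like this is normally only taken when the CPU actually reports the required ISA extension at run time. The sketch below is purely illustrative and is not the selection logic used by this patch set; the helper name and the vectorized_requested flag are made up. It only shows the public DPDK CPU-flag check such a selector would typically be gated on.

#include <stdbool.h>
#include <rte_cpuflags.h>

/*
 * Illustrative helper: take the AVX-512 batch path only when the user
 * asked for it and the CPU reports AVX-512F; otherwise fall back to the
 * default batch functions.
 */
static bool
use_avx512_data_path(bool vectorized_requested)
{
	if (!vectorized_requested)
		return false;
#if defined(RTE_ARCH_X86_64)
	return rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) > 0;
#else
	return false;
#endif
}

A real selector would likely also honour the EAL --force-max-simd-bitwidth setting (rte_vect_get_max_simd_bitwidth()), as other AVX-512 code paths in DPDK do.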
> Thanks,
> Maxime
>
> >> Thanks,
> >> Maxime
> >>
> >>> Vhost application can choose whether using vectorized acceleration,
> just
> >>> like external buffer feature. If platform or ring format not support
> >>> vectorized function, vhost will fallback to use default batch function.
> >>> There will be no impact in current data path.
> >>>
> >>> v3:
> >>> * rename vectorized datapath file
> >>> * eliminate the impact when avx512 disabled
> >>> * dynamically allocate memory regions structure
> >>> * remove unlikely hint for in_order
> >>>
> >>> v2:
> >>> * add vIOMMU support
> >>> * add dequeue offloading
> >>> * rebase code
> >>>
> >>> Marvin Liu (5):
> >>> vhost: add vectorized data path
> >>> vhost: reuse packed ring functions
> >>> vhost: prepare memory regions addresses
> >>> vhost: add packed ring vectorized dequeue
> >>> vhost: add packed ring vectorized enqueue
> >>>
> >>> doc/guides/nics/vhost.rst | 5 +
> >>> doc/guides/prog_guide/vhost_lib.rst | 12 +
> >>> drivers/net/vhost/rte_eth_vhost.c | 17 +-
> >>> lib/librte_vhost/meson.build | 16 ++
> >>> lib/librte_vhost/rte_vhost.h | 1 +
> >>> lib/librte_vhost/socket.c | 5 +
> >>> lib/librte_vhost/vhost.c | 11 +
> >>> lib/librte_vhost/vhost.h | 239 +++++++++++++++++++
> >>> lib/librte_vhost/vhost_user.c | 26 +++
> >>> lib/librte_vhost/virtio_net.c | 258 ++++-----------------
> >>> lib/librte_vhost/virtio_net_avx.c | 344
> ++++++++++++++++++++++++++++
> >>> 11 files changed, 718 insertions(+), 216 deletions(-)
> >>> create mode 100644 lib/librte_vhost/virtio_net_avx.c
> >>>
> >
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v3 0/5] vhost add vectorized data path
2020-10-12 8:21 ` [dpdk-dev] [PATCH v3 0/5] vhost add vectorized data path Maxime Coquelin
2020-10-12 9:10 ` Liu, Yong
@ 2020-10-15 15:28 ` Liu, Yong
2020-10-15 15:35 ` Maxime Coquelin
1 sibling, 1 reply; 36+ messages in thread
From: Liu, Yong @ 2020-10-15 15:28 UTC (permalink / raw)
To: Maxime Coquelin, Xia, Chenbo, Wang, Zhihong; +Cc: dev
Hi All,
The performance gain from the vectorized datapath in OVS-DPDK is around 1%, and meanwhile it has a small impact on the original datapath.
On the other hand, it increases the complexity of vhost (a new parameter is introduced, and memory region information must be prepared for address translation).
After weighing the pros and cons, I'd like to withdraw this patch set. Thanks for your time.
Regards,
Marvin
> -----Original Message-----
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> Sent: Monday, October 12, 2020 4:22 PM
> To: Liu, Yong <yong.liu@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>;
> Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v3 0/5] vhost add vectorized data path
>
> Hi Marvin,
>
> On 10/9/20 10:14 AM, Marvin Liu wrote:
> > Packed ring format is imported since virtio spec 1.1. All descriptors
> > are compacted into one single ring when packed ring format is on. It is
> > straight forward that ring operations can be accelerated by utilizing
> > SIMD instructions.
> >
> > This patch set will introduce vectorized data path in vhost library. If
> > vectorized option is on, operations like descs check, descs writeback,
> > address translation will be accelerated by SIMD instructions. On skylake
> > server, it can bring 6% performance gain in loopback case and around 4%
> > performance gain in PvP case.
>
> IMHO, 4% gain on PVP is not a significant gain if we compare to the
> added complexity. Moreover, I guess this is 4% gain with testpmd-based
> PVP? If this is the case it may be even lower with OVS-DPDK PVP
> benchmark, I will try to do a benchmark this week.
>
> Thanks,
> Maxime
>
> > Vhost application can choose whether using vectorized acceleration, just
> > like external buffer feature. If platform or ring format not support
> > vectorized function, vhost will fallback to use default batch function.
> > There will be no impact in current data path.
> >
> > v3:
> > * rename vectorized datapath file
> > * eliminate the impact when avx512 disabled
> > * dynamically allocate memory regions structure
> > * remove unlikely hint for in_order
> >
> > v2:
> > * add vIOMMU support
> > * add dequeue offloading
> > * rebase code
> >
> > Marvin Liu (5):
> > vhost: add vectorized data path
> > vhost: reuse packed ring functions
> > vhost: prepare memory regions addresses
> > vhost: add packed ring vectorized dequeue
> > vhost: add packed ring vectorized enqueue
> >
> > doc/guides/nics/vhost.rst | 5 +
> > doc/guides/prog_guide/vhost_lib.rst | 12 +
> > drivers/net/vhost/rte_eth_vhost.c | 17 +-
> > lib/librte_vhost/meson.build | 16 ++
> > lib/librte_vhost/rte_vhost.h | 1 +
> > lib/librte_vhost/socket.c | 5 +
> > lib/librte_vhost/vhost.c | 11 +
> > lib/librte_vhost/vhost.h | 239 +++++++++++++++++++
> > lib/librte_vhost/vhost_user.c | 26 +++
> > lib/librte_vhost/virtio_net.c | 258 ++++-----------------
> > lib/librte_vhost/virtio_net_avx.c | 344 ++++++++++++++++++++++++++++
> > 11 files changed, 718 insertions(+), 216 deletions(-)
> > create mode 100644 lib/librte_vhost/virtio_net_avx.c
> >
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v3 0/5] vhost add vectorized data path
2020-10-15 15:28 ` Liu, Yong
@ 2020-10-15 15:35 ` Maxime Coquelin
0 siblings, 0 replies; 36+ messages in thread
From: Maxime Coquelin @ 2020-10-15 15:35 UTC (permalink / raw)
To: Liu, Yong, Xia, Chenbo, Wang, Zhihong; +Cc: dev
Hi Marvin,
On 10/15/20 5:28 PM, Liu, Yong wrote:
> Hi All,
> The performance gain from the vectorized datapath in OVS-DPDK is around 1%, and meanwhile it has a small impact on the original datapath.
> On the other hand, it increases the complexity of vhost (a new parameter is introduced, and memory region information must be prepared for address translation).
> After weighing the pros and cons, I'd like to withdraw this patch set. Thanks for your time.
Thanks for running the test with the new version.
I have removed it from Patchwork.
Thanks,
Maxime
> Regards,
> Marvin
>
>> -----Original Message-----
>> From: Maxime Coquelin <maxime.coquelin@redhat.com>
>> Sent: Monday, October 12, 2020 4:22 PM
>> To: Liu, Yong <yong.liu@intel.com>; Xia, Chenbo <chenbo.xia@intel.com>;
>> Wang, Zhihong <zhihong.wang@intel.com>
>> Cc: dev@dpdk.org
>> Subject: Re: [PATCH v3 0/5] vhost add vectorized data path
>>
>> Hi Marvin,
>>
>> On 10/9/20 10:14 AM, Marvin Liu wrote:
>>> Packed ring format is imported since virtio spec 1.1. All descriptors
>>> are compacted into one single ring when packed ring format is on. It is
>>> straight forward that ring operations can be accelerated by utilizing
>>> SIMD instructions.
>>>
>>> This patch set will introduce vectorized data path in vhost library. If
>>> vectorized option is on, operations like descs check, descs writeback,
>>> address translation will be accelerated by SIMD instructions. On skylake
>>> server, it can bring 6% performance gain in loopback case and around 4%
>>> performance gain in PvP case.
>>
>> IMHO, 4% gain on PVP is not a significant gain if we compare to the
>> added complexity. Moreover, I guess this is 4% gain with testpmd-based
>> PVP? If this is the case it may be even lower with OVS-DPDK PVP
>> benchmark, I will try to do a benchmark this week.
>>
>> Thanks,
>> Maxime
>>
>>> Vhost application can choose whether using vectorized acceleration, just
>>> like external buffer feature. If platform or ring format not support
>>> vectorized function, vhost will fallback to use default batch function.
>>> There will be no impact in current data path.
>>>
>>> v3:
>>> * rename vectorized datapath file
>>> * eliminate the impact when avx512 disabled
>>> * dynamically allocate memory regions structure
>>> * remove unlikely hint for in_order
>>>
>>> v2:
>>> * add vIOMMU support
>>> * add dequeue offloading
>>> * rebase code
>>>
>>> Marvin Liu (5):
>>> vhost: add vectorized data path
>>> vhost: reuse packed ring functions
>>> vhost: prepare memory regions addresses
>>> vhost: add packed ring vectorized dequeue
>>> vhost: add packed ring vectorized enqueue
>>>
>>> doc/guides/nics/vhost.rst | 5 +
>>> doc/guides/prog_guide/vhost_lib.rst | 12 +
>>> drivers/net/vhost/rte_eth_vhost.c | 17 +-
>>> lib/librte_vhost/meson.build | 16 ++
>>> lib/librte_vhost/rte_vhost.h | 1 +
>>> lib/librte_vhost/socket.c | 5 +
>>> lib/librte_vhost/vhost.c | 11 +
>>> lib/librte_vhost/vhost.h | 239 +++++++++++++++++++
>>> lib/librte_vhost/vhost_user.c | 26 +++
>>> lib/librte_vhost/virtio_net.c | 258 ++++-----------------
>>> lib/librte_vhost/virtio_net_avx.c | 344 ++++++++++++++++++++++++++++
>>> 11 files changed, 718 insertions(+), 216 deletions(-)
>>> create mode 100644 lib/librte_vhost/virtio_net_avx.c
>>>
>
^ permalink raw reply [flat|nested] 36+ messages in thread
end of thread, other threads:[~2020-10-15 15:35 UTC | newest]
Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-19 3:24 [dpdk-dev] [PATCH v1 0/5] vhost add vectorized data path Marvin Liu
2020-08-19 3:24 ` [dpdk-dev] [PATCH v1 1/5] vhost: " Marvin Liu
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 0/5] vhost " Marvin Liu
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 1/5] vhost: " Marvin Liu
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 2/5] vhost: reuse packed ring functions Marvin Liu
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 3/5] vhost: prepare memory regions addresses Marvin Liu
2020-10-06 15:06 ` Maxime Coquelin
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 4/5] vhost: add packed ring vectorized dequeue Marvin Liu
2020-10-06 14:59 ` Maxime Coquelin
2020-10-08 7:05 ` Liu, Yong
2020-10-06 15:18 ` Maxime Coquelin
2020-10-09 7:59 ` Liu, Yong
2020-09-21 6:48 ` [dpdk-dev] [PATCH v2 5/5] vhost: add packed ring vectorized enqueue Marvin Liu
2020-10-06 15:00 ` Maxime Coquelin
2020-10-08 7:09 ` Liu, Yong
2020-10-06 13:34 ` [dpdk-dev] [PATCH v2 0/5] vhost add vectorized data path Maxime Coquelin
2020-10-08 6:20 ` Liu, Yong
2020-10-09 8:14 ` [dpdk-dev] [PATCH v3 " Marvin Liu
2020-10-09 8:14 ` [dpdk-dev] [PATCH v3 1/5] vhost: " Marvin Liu
2020-10-09 8:14 ` [dpdk-dev] [PATCH v3 2/5] vhost: reuse packed ring functions Marvin Liu
2020-10-09 8:14 ` [dpdk-dev] [PATCH v3 3/5] vhost: prepare memory regions addresses Marvin Liu
2020-10-09 8:14 ` [dpdk-dev] [PATCH v3 4/5] vhost: add packed ring vectorized dequeue Marvin Liu
2020-10-09 8:14 ` [dpdk-dev] [PATCH v3 5/5] vhost: add packed ring vectorized enqueue Marvin Liu
2020-10-12 8:21 ` [dpdk-dev] [PATCH v3 0/5] vhost add vectorized data path Maxime Coquelin
2020-10-12 9:10 ` Liu, Yong
2020-10-12 9:57 ` Maxime Coquelin
2020-10-12 13:24 ` Liu, Yong
2020-10-15 15:28 ` Liu, Yong
2020-10-15 15:35 ` Maxime Coquelin
2020-08-19 3:24 ` [dpdk-dev] [PATCH v1 2/5] vhost: reuse packed ring functions Marvin Liu
2020-08-19 3:24 ` [dpdk-dev] [PATCH v1 3/5] vhost: prepare memory regions addresses Marvin Liu
2020-08-19 3:24 ` [dpdk-dev] [PATCH v1 4/5] vhost: add packed ring vectorized dequeue Marvin Liu
2020-09-18 13:44 ` Maxime Coquelin
2020-09-21 6:26 ` Liu, Yong
2020-09-21 7:47 ` Liu, Yong
2020-08-19 3:24 ` [dpdk-dev] [PATCH v1 5/5] vhost: add packed ring vectorized enqueue Marvin Liu