DPDK patches and discussions
* [dpdk-dev] [PATCH v3 0/5] vhost: I-cache pressure optimizations
@ 2019-05-29 13:04 Maxime Coquelin
  2019-05-29 13:04 ` [dpdk-dev] [PATCH v3 1/5] vhost: un-inline dirty pages logging functions Maxime Coquelin
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: Maxime Coquelin @ 2019-05-29 13:04 UTC (permalink / raw)
  To: dev, tiwei.bie, david.marchand, jfreimann, bruce.richardson,
	zhihong.wang, konstantin.ananyev, mattias.ronnblom
  Cc: Maxime Coquelin

Some OVS-DPDK PVP benchmarks show a performance drop
when switching from DPDK v17.11 to v18.11.

With the addition of packed ring layout support,
rte_vhost_enqueue_burst and rte_vhost_dequeue_burst
became very large, while only part of their instructions
is executed at runtime (either the packed or the split
ring path, depending on what the device negotiated).

This series aims at reducing the I-cache pressure,
first by un-inlining the split and packed ring functions,
and also by moving code considered cold into dedicated
functions (dirty page logging, and the fragmented
descriptor buffer handling added for CVE-2018-1059).
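
As an illustration of the pattern used throughout the series
(a minimal, generic sketch with hypothetical names, not the
actual vhost code), the idea is to keep only a cheap test in
the hot path and move the cold body behind a regular call:

#include <stdint.h>

#define unlikely(x) __builtin_expect(!!(x), 0)

/* Cold body: kept out of line so it does not bloat the hot path. */
static void __attribute__((noinline))
handle_rare_case(uint64_t addr, uint64_t len)
{
	/* e.g. dirty page logging, fragmented buffer handling, ... */
	(void)addr;
	(void)len;
}

/* Hot wrapper: the common case only pays for a test and a branch,
 * so the cold instructions no longer sit in the middle of the
 * burst functions' I-cache footprint. */
static inline void
maybe_handle_rare_case(int rare_condition, uint64_t addr, uint64_t len)
{
	if (unlikely(rare_condition))
		handle_rare_case(addr, len);
}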

With the series applied, the size of the enqueue and
dequeue split paths is reduced significantly:

+---------+--------------------+---------------------+
| Version | Enqueue split path |  Dequeue split path |
+---------+--------------------+---------------------+
| v19.05  | 16461B             | 25521B              |
| +series | 7286B              | 11285B              |
+---------+--------------------+---------------------+

Using the perf tool to monitor the iTLB-load-misses event
while running the PVP benchmark with testpmd as vswitch,
we can see the number of iTLB misses being reduced:

- v19.05:
# perf stat --repeat 10  -C 2,3  -e iTLB-load-miss -- sleep 10

 Performance counter stats for 'CPU(s) 2,3' (10 runs):

             2,438      iTLB-load-miss                                                ( +- 13.43% )

       10.00058928 +- 0.00000336 seconds time elapsed  ( +-  0.00% )

- +series:
# perf stat --repeat 10  -C 2,3  -e iTLB-load-miss -- sleep 10

 Performance counter stats for 'CPU(s) 2,3' (10 runs):

                55      iTLB-load-miss                                                ( +- 10.08% )

       10.00059466 +- 0.00000283 seconds time elapsed  ( +-  0.00% )

The series also forces the inlining of some rte_memcpy
helpers: since the addition of packed ring support, some of
them were no longer inlined but emitted as standalone
functions in the virtio_net object file, which was not
expected.
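
For reference, the distinction at play here, as a generic
sketch rather than the actual rte_memcpy.h code: 'static
inline' is only a hint, whereas the always_inline attribute
(which __rte_always_inline is built on) removes the
compiler's choice:

#include <stddef.h>
#include <string.h>

/* A plain 'static inline' helper may still be emitted as a local,
 * out-of-line function in an object file that calls it many times. */
static inline void *
copy_hint_only(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

/* With always_inline, the compiler inlines it at every call site. */
static inline __attribute__((always_inline)) void *
copy_forced(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}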

Finally, the series simplifies descriptor buffer
prefetching by doing it in the recently introduced
descriptor buffer mapping function.
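
Concretely, the idea is to issue the prefetch once, at the
single place where the buffer address first becomes known,
instead of repeating it before each later access. A generic
sketch, with hypothetical names and __builtin_prefetch
standing in for DPDK's rte_prefetch0:

#include <stdint.h>

struct buf_entry {
	uintptr_t buf_addr;
	uint32_t buf_len;
};

/* Record one mapped buffer and prefetch it right away. */
static inline void
map_one_buffer(struct buf_entry *vec, uintptr_t addr, uint32_t len)
{
	vec->buf_addr = addr;
	vec->buf_len = len;

	/* Earliest point where the address is known: one prefetch here
	 * replaces the ones previously scattered over the callers. */
	__builtin_prefetch((const void *)addr, 0 /* read */, 3 /* keep */);
}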

v3:
===
 - Prefix alloc_copy_ind_table with vhost_ (Mattias)
 - Remove double new line (Tiwei)
 - Fix grammar error in patch 3's commit message (Jens)
 - Force noinline for header copy functions (Mattias)
 - Fix dst assignment in copy_hdr_from_desc (Tiwei)

v2:
===
 - Fix checkpatch issue
 - Reset author for patch 5 (David)
 - Force non-inlining in patch 2 (David)
 - Fix typo in patch 3 commit message (David)

Maxime Coquelin (5):
  vhost: un-inline dirty pages logging functions
  vhost: do not inline packed and split functions
  vhost: do not inline unlikely fragmented buffers code
  vhost: simplify descriptor's buffer prefetching
  eal/x86: force inlining of all memcpy and mov helpers

 .../common/include/arch/x86/rte_memcpy.h      |  18 +-
 lib/librte_vhost/vdpa.c                       |   2 +-
 lib/librte_vhost/vhost.c                      | 164 +++++++++++++++++
 lib/librte_vhost/vhost.h                      | 165 ++----------------
 lib/librte_vhost/virtio_net.c                 | 140 +++++++--------
 5 files changed, 251 insertions(+), 238 deletions(-)

-- 
2.21.0



* [dpdk-dev] [PATCH v3 1/5] vhost: un-inline dirty pages logging functions
  2019-05-29 13:04 [dpdk-dev] [PATCH v3 0/5] vhost: I-cache pressure optimizations Maxime Coquelin
@ 2019-05-29 13:04 ` Maxime Coquelin
  2019-05-29 13:04 ` [dpdk-dev] [PATCH v3 2/5] vhost: do not inline packed and split functions Maxime Coquelin
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Maxime Coquelin @ 2019-05-29 13:04 UTC (permalink / raw)
  To: dev, tiwei.bie, david.marchand, jfreimann, bruce.richardson,
	zhihong.wang, konstantin.ananyev, mattias.ronnblom
  Cc: Maxime Coquelin

In order to reduce the I-cache pressure, this patch removes
the inlining of the dirty page logging functions, which can
be considered cold path.

Indeed, these functions are only called during live
migration, so they are not exercised most of the time.
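
For context, the dirty log written below uses one bit per
4kB guest page (VHOST_LOG_PAGE). A small standalone example
of the byte/bit arithmetic, with made-up address and length
values:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define VHOST_LOG_PAGE 4096

int main(void)
{
	uint64_t addr = 4090, len = 100; /* write crossing a page boundary */
	uint64_t page;

	for (page = addr / VHOST_LOG_PAGE;
			page * VHOST_LOG_PAGE < addr + len; page++)
		printf("page %" PRIu64 " -> log byte %" PRIu64 ", bit %" PRIu64 "\n",
				page, page / 8, page % 8);

	return 0;
}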

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Tiwei Bie <tiwei.bie@intel.com>
---
 lib/librte_vhost/vhost.c | 131 +++++++++++++++++++++++++++++++++++++++
 lib/librte_vhost/vhost.h | 129 ++++----------------------------------
 2 files changed, 143 insertions(+), 117 deletions(-)

diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index 163f4595ef..7d427b60a5 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -69,6 +69,137 @@ __vhost_iova_to_vva(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	return 0;
 }
 
+#define VHOST_LOG_PAGE	4096
+
+/*
+ * Atomically set a bit in memory.
+ */
+static __rte_always_inline void
+vhost_set_bit(unsigned int nr, volatile uint8_t *addr)
+{
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION < 70100)
+	/*
+	 * __sync_ built-ins are deprecated, but __atomic_ ones
+	 * are sub-optimized in older GCC versions.
+	 */
+	__sync_fetch_and_or_1(addr, (1U << nr));
+#else
+	__atomic_fetch_or(addr, (1U << nr), __ATOMIC_RELAXED);
+#endif
+}
+
+static __rte_always_inline void
+vhost_log_page(uint8_t *log_base, uint64_t page)
+{
+	vhost_set_bit(page % 8, &log_base[page / 8]);
+}
+
+void
+__vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
+{
+	uint64_t page;
+
+	if (unlikely(!dev->log_base || !len))
+		return;
+
+	if (unlikely(dev->log_size <= ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
+		return;
+
+	/* To make sure guest memory updates are committed before logging */
+	rte_smp_wmb();
+
+	page = addr / VHOST_LOG_PAGE;
+	while (page * VHOST_LOG_PAGE < addr + len) {
+		vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
+		page += 1;
+	}
+}
+
+void
+__vhost_log_cache_sync(struct virtio_net *dev, struct vhost_virtqueue *vq)
+{
+	unsigned long *log_base;
+	int i;
+
+	if (unlikely(!dev->log_base))
+		return;
+
+	rte_smp_wmb();
+
+	log_base = (unsigned long *)(uintptr_t)dev->log_base;
+
+	for (i = 0; i < vq->log_cache_nb_elem; i++) {
+		struct log_cache_entry *elem = vq->log_cache + i;
+
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION < 70100)
+		/*
+		 * '__sync' builtins are deprecated, but '__atomic' ones
+		 * are sub-optimized in older GCC versions.
+		 */
+		__sync_fetch_and_or(log_base + elem->offset, elem->val);
+#else
+		__atomic_fetch_or(log_base + elem->offset, elem->val,
+				__ATOMIC_RELAXED);
+#endif
+	}
+
+	rte_smp_wmb();
+
+	vq->log_cache_nb_elem = 0;
+}
+
+static __rte_always_inline void
+vhost_log_cache_page(struct virtio_net *dev, struct vhost_virtqueue *vq,
+			uint64_t page)
+{
+	uint32_t bit_nr = page % (sizeof(unsigned long) << 3);
+	uint32_t offset = page / (sizeof(unsigned long) << 3);
+	int i;
+
+	for (i = 0; i < vq->log_cache_nb_elem; i++) {
+		struct log_cache_entry *elem = vq->log_cache + i;
+
+		if (elem->offset == offset) {
+			elem->val |= (1UL << bit_nr);
+			return;
+		}
+	}
+
+	if (unlikely(i >= VHOST_LOG_CACHE_NR)) {
+		/*
+		 * No more room for a new log cache entry,
+		 * so write the dirty log map directly.
+		 */
+		rte_smp_wmb();
+		vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
+
+		return;
+	}
+
+	vq->log_cache[i].offset = offset;
+	vq->log_cache[i].val = (1UL << bit_nr);
+	vq->log_cache_nb_elem++;
+}
+
+void
+__vhost_log_cache_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
+			uint64_t addr, uint64_t len)
+{
+	uint64_t page;
+
+	if (unlikely(!dev->log_base || !len))
+		return;
+
+	if (unlikely(dev->log_size <= ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
+		return;
+
+	page = addr / VHOST_LOG_PAGE;
+	while (page * VHOST_LOG_PAGE < addr + len) {
+		vhost_log_cache_page(dev, vq, page);
+		page += 1;
+	}
+}
+
 void
 cleanup_vq(struct vhost_virtqueue *vq, int destroy)
 {
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index e9138dfab4..3ab7b4950f 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -350,138 +350,33 @@ desc_is_avail(struct vring_packed_desc *desc, bool wrap_counter)
 		wrap_counter != !!(flags & VRING_DESC_F_USED);
 }
 
-#define VHOST_LOG_PAGE	4096
-
-/*
- * Atomically set a bit in memory.
- */
-static __rte_always_inline void
-vhost_set_bit(unsigned int nr, volatile uint8_t *addr)
-{
-#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION < 70100)
-	/*
-	 * __sync_ built-ins are deprecated, but __atomic_ ones
-	 * are sub-optimized in older GCC versions.
-	 */
-	__sync_fetch_and_or_1(addr, (1U << nr));
-#else
-	__atomic_fetch_or(addr, (1U << nr), __ATOMIC_RELAXED);
-#endif
-}
-
-static __rte_always_inline void
-vhost_log_page(uint8_t *log_base, uint64_t page)
-{
-	vhost_set_bit(page % 8, &log_base[page / 8]);
-}
+void __vhost_log_cache_write(struct virtio_net *dev,
+		struct vhost_virtqueue *vq,
+		uint64_t addr, uint64_t len);
+void __vhost_log_cache_sync(struct virtio_net *dev,
+		struct vhost_virtqueue *vq);
+void __vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len);
 
 static __rte_always_inline void
 vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
 {
-	uint64_t page;
-
-	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
-		   !dev->log_base || !len))
-		return;
-
-	if (unlikely(dev->log_size <= ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
-		return;
-
-	/* To make sure guest memory updates are committed before logging */
-	rte_smp_wmb();
-
-	page = addr / VHOST_LOG_PAGE;
-	while (page * VHOST_LOG_PAGE < addr + len) {
-		vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
-		page += 1;
-	}
+	if (unlikely(dev->features & (1ULL << VHOST_F_LOG_ALL)))
+		__vhost_log_write(dev, addr, len);
 }
 
 static __rte_always_inline void
 vhost_log_cache_sync(struct virtio_net *dev, struct vhost_virtqueue *vq)
 {
-	unsigned long *log_base;
-	int i;
-
-	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
-		   !dev->log_base))
-		return;
-
-	rte_smp_wmb();
-
-	log_base = (unsigned long *)(uintptr_t)dev->log_base;
-
-	for (i = 0; i < vq->log_cache_nb_elem; i++) {
-		struct log_cache_entry *elem = vq->log_cache + i;
-
-#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION < 70100)
-		/*
-		 * '__sync' builtins are deprecated, but '__atomic' ones
-		 * are sub-optimized in older GCC versions.
-		 */
-		__sync_fetch_and_or(log_base + elem->offset, elem->val);
-#else
-		__atomic_fetch_or(log_base + elem->offset, elem->val,
-				__ATOMIC_RELAXED);
-#endif
-	}
-
-	rte_smp_wmb();
-
-	vq->log_cache_nb_elem = 0;
-}
-
-static __rte_always_inline void
-vhost_log_cache_page(struct virtio_net *dev, struct vhost_virtqueue *vq,
-			uint64_t page)
-{
-	uint32_t bit_nr = page % (sizeof(unsigned long) << 3);
-	uint32_t offset = page / (sizeof(unsigned long) << 3);
-	int i;
-
-	for (i = 0; i < vq->log_cache_nb_elem; i++) {
-		struct log_cache_entry *elem = vq->log_cache + i;
-
-		if (elem->offset == offset) {
-			elem->val |= (1UL << bit_nr);
-			return;
-		}
-	}
-
-	if (unlikely(i >= VHOST_LOG_CACHE_NR)) {
-		/*
-		 * No more room for a new log cache entry,
-		 * so write the dirty log map directly.
-		 */
-		rte_smp_wmb();
-		vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
-
-		return;
-	}
-
-	vq->log_cache[i].offset = offset;
-	vq->log_cache[i].val = (1UL << bit_nr);
-	vq->log_cache_nb_elem++;
+	if (unlikely(dev->features & (1ULL << VHOST_F_LOG_ALL)))
+		__vhost_log_cache_sync(dev, vq);
 }
 
 static __rte_always_inline void
 vhost_log_cache_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			uint64_t addr, uint64_t len)
 {
-	uint64_t page;
-
-	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
-		   !dev->log_base || !len))
-		return;
-
-	if (unlikely(dev->log_size <= ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
-		return;
-
-	page = addr / VHOST_LOG_PAGE;
-	while (page * VHOST_LOG_PAGE < addr + len) {
-		vhost_log_cache_page(dev, vq, page);
-		page += 1;
-	}
+	if (unlikely(dev->features & (1ULL << VHOST_F_LOG_ALL)))
+		__vhost_log_cache_write(dev, vq, addr, len);
 }
 
 static __rte_always_inline void
-- 
2.21.0



* [dpdk-dev] [PATCH v3 2/5] vhost: do not inline packed and split functions
  2019-05-29 13:04 [dpdk-dev] [PATCH v3 0/5] vhost: I-cache pressure optimizations Maxime Coquelin
  2019-05-29 13:04 ` [dpdk-dev] [PATCH v3 1/5] vhost: un-inline dirty pages logging functions Maxime Coquelin
@ 2019-05-29 13:04 ` Maxime Coquelin
  2019-05-29 13:04 ` [dpdk-dev] [PATCH v3 3/5] vhost: do not inline unlikely fragmented buffers code Maxime Coquelin
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Maxime Coquelin @ 2019-05-29 13:04 UTC (permalink / raw)
  To: dev, tiwei.bie, david.marchand, jfreimann, bruce.richardson,
	zhihong.wang, konstantin.ananyev, mattias.ronnblom
  Cc: Maxime Coquelin

At runtime, either the packed Tx/Rx functions or the split
Tx/Rx functions are always called, depending on the ring
layout negotiated, never both.

This patch removes the forced inlining of these functions in
order to reduce the I-cache pressure.

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Tiwei Bie <tiwei.bie@intel.com>
---
 lib/librte_vhost/virtio_net.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index a6a33a1013..8aeb180016 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -771,7 +771,7 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	return error;
 }
 
-static __rte_always_inline uint32_t
+static __rte_noinline uint32_t
 virtio_dev_rx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	struct rte_mbuf **pkts, uint32_t count)
 {
@@ -830,7 +830,7 @@ virtio_dev_rx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	return pkt_idx;
 }
 
-static __rte_always_inline uint32_t
+static __rte_noinline uint32_t
 virtio_dev_rx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	struct rte_mbuf **pkts, uint32_t count)
 {
@@ -1300,7 +1300,7 @@ get_zmbuf(struct vhost_virtqueue *vq)
 	return NULL;
 }
 
-static __rte_always_inline uint16_t
+static __rte_noinline uint16_t
 virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
 {
@@ -1422,7 +1422,7 @@ virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	return i;
 }
 
-static __rte_always_inline uint16_t
+static __rte_noinline uint16_t
 virtio_dev_tx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
 {
-- 
2.21.0



* [dpdk-dev] [PATCH v3 3/5] vhost: do not inline unlikely fragmented buffers code
  2019-05-29 13:04 [dpdk-dev] [PATCH v3 0/5] vhost: I-cache pressure optimizations Maxime Coquelin
  2019-05-29 13:04 ` [dpdk-dev] [PATCH v3 1/5] vhost: un-inline dirty pages logging functions Maxime Coquelin
  2019-05-29 13:04 ` [dpdk-dev] [PATCH v3 2/5] vhost: do not inline packed and split functions Maxime Coquelin
@ 2019-05-29 13:04 ` Maxime Coquelin
  2019-05-29 13:04 ` [dpdk-dev] [PATCH v3 4/5] vhost: simplify descriptor's buffer prefetching Maxime Coquelin
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Maxime Coquelin @ 2019-05-29 13:04 UTC (permalink / raw)
  To: dev, tiwei.bie, david.marchand, jfreimann, bruce.richardson,
	zhihong.wang, konstantin.ananyev, mattias.ronnblom
  Cc: Maxime Coquelin

Handling of fragmented virtio-net headers and indirect
descriptor tables was implemented to fix CVE-2018-1059. It
should never happen with healthy guests, and so is already
considered an unlikely code path.

This patch moves this code into dedicated non-inline
functions to reduce the I-cache pressure.

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Tiwei Bie <tiwei.bie@intel.com>
---
 lib/librte_vhost/vdpa.c       |   2 +-
 lib/librte_vhost/vhost.c      |  33 +++++++++++
 lib/librte_vhost/vhost.h      |  36 +-----------
 lib/librte_vhost/virtio_net.c | 100 +++++++++++++++++++---------------
 4 files changed, 93 insertions(+), 78 deletions(-)

diff --git a/lib/librte_vhost/vdpa.c b/lib/librte_vhost/vdpa.c
index e915488432..24a6698e91 100644
--- a/lib/librte_vhost/vdpa.c
+++ b/lib/librte_vhost/vdpa.c
@@ -181,7 +181,7 @@ rte_vdpa_relay_vring_used(int vid, uint16_t qid, void *vring_m)
 				return -1;
 
 			if (unlikely(dlen < vq->desc[desc_id].len)) {
-				idesc = alloc_copy_ind_table(dev, vq,
+				idesc = vhost_alloc_copy_ind_table(dev, vq,
 						vq->desc[desc_id].addr,
 						vq->desc[desc_id].len);
 				if (unlikely(!idesc))
diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index 7d427b60a5..981837b5dd 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -200,6 +200,39 @@ __vhost_log_cache_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	}
 }
 
+void *
+vhost_alloc_copy_ind_table(struct virtio_net *dev, struct vhost_virtqueue *vq,
+		uint64_t desc_addr, uint64_t desc_len)
+{
+	void *idesc;
+	uint64_t src, dst;
+	uint64_t len, remain = desc_len;
+
+	idesc = rte_malloc(__func__, desc_len, 0);
+	if (unlikely(!idesc))
+		return NULL;
+
+	dst = (uint64_t)(uintptr_t)idesc;
+
+	while (remain) {
+		len = remain;
+		src = vhost_iova_to_vva(dev, vq, desc_addr, &len,
+				VHOST_ACCESS_RO);
+		if (unlikely(!src || !len)) {
+			rte_free(idesc);
+			return NULL;
+		}
+
+		rte_memcpy((void *)(uintptr_t)dst, (void *)(uintptr_t)src, len);
+
+		remain -= len;
+		dst += len;
+		desc_addr += len;
+	}
+
+	return idesc;
+}
+
 void
 cleanup_vq(struct vhost_virtqueue *vq, int destroy)
 {
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 3ab7b4950f..691f535530 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -488,6 +488,9 @@ void vhost_backend_cleanup(struct virtio_net *dev);
 
 uint64_t __vhost_iova_to_vva(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			uint64_t iova, uint64_t *len, uint8_t perm);
+void *vhost_alloc_copy_ind_table(struct virtio_net *dev,
+			struct vhost_virtqueue *vq,
+			uint64_t desc_addr, uint64_t desc_len);
 int vring_translate(struct virtio_net *dev, struct vhost_virtqueue *vq);
 void vring_invalidate(struct virtio_net *dev, struct vhost_virtqueue *vq);
 
@@ -601,39 +604,6 @@ vhost_vring_call_packed(struct virtio_net *dev, struct vhost_virtqueue *vq)
 		eventfd_write(vq->callfd, (eventfd_t)1);
 }
 
-static __rte_always_inline void *
-alloc_copy_ind_table(struct virtio_net *dev, struct vhost_virtqueue *vq,
-		uint64_t desc_addr, uint64_t desc_len)
-{
-	void *idesc;
-	uint64_t src, dst;
-	uint64_t len, remain = desc_len;
-
-	idesc = rte_malloc(__func__, desc_len, 0);
-	if (unlikely(!idesc))
-		return 0;
-
-	dst = (uint64_t)(uintptr_t)idesc;
-
-	while (remain) {
-		len = remain;
-		src = vhost_iova_to_vva(dev, vq, desc_addr, &len,
-				VHOST_ACCESS_RO);
-		if (unlikely(!src || !len)) {
-			rte_free(idesc);
-			return 0;
-		}
-
-		rte_memcpy((void *)(uintptr_t)dst, (void *)(uintptr_t)src, len);
-
-		remain -= len;
-		dst += len;
-		desc_addr += len;
-	}
-
-	return idesc;
-}
-
 static __rte_always_inline void
 free_ind_table(void *idesc)
 {
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 8aeb180016..4564e9bcc9 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -337,7 +337,7 @@ fill_vec_buf_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			 * The indirect desc table is not contiguous
 			 * in process VA space, we have to copy it.
 			 */
-			idesc = alloc_copy_ind_table(dev, vq,
+			idesc = vhost_alloc_copy_ind_table(dev, vq,
 					vq->desc[idx].addr, vq->desc[idx].len);
 			if (unlikely(!idesc))
 				return -1;
@@ -454,7 +454,8 @@ fill_vec_buf_packed_indirect(struct virtio_net *dev,
 		 * The indirect desc table is not contiguous
 		 * in process VA space, we have to copy it.
 		 */
-		idescs = alloc_copy_ind_table(dev, vq, desc->addr, desc->len);
+		idescs = vhost_alloc_copy_ind_table(dev,
+				vq, desc->addr, desc->len);
 		if (unlikely(!idescs))
 			return -1;
 
@@ -610,6 +611,36 @@ reserve_avail_buf_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	return 0;
 }
 
+static __rte_noinline void
+copy_vnet_hdr_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
+		struct buf_vector *buf_vec,
+		struct virtio_net_hdr_mrg_rxbuf *hdr)
+{
+	uint64_t len;
+	uint64_t remain = dev->vhost_hlen;
+	uint64_t src = (uint64_t)(uintptr_t)hdr, dst;
+	uint64_t iova = buf_vec->buf_iova;
+
+	while (remain) {
+		len = RTE_MIN(remain,
+				buf_vec->buf_len);
+		dst = buf_vec->buf_addr;
+		rte_memcpy((void *)(uintptr_t)dst,
+				(void *)(uintptr_t)src,
+				len);
+
+		PRINT_PACKET(dev, (uintptr_t)dst,
+				(uint32_t)len, 0);
+		vhost_log_cache_write(dev, vq,
+				iova, len);
+
+		remain -= len;
+		iova += len;
+		src += len;
+		buf_vec++;
+	}
+}
+
 static __rte_always_inline int
 copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			    struct rte_mbuf *m, struct buf_vector *buf_vec,
@@ -703,30 +734,7 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
 						num_buffers);
 
 			if (unlikely(hdr == &tmp_hdr)) {
-				uint64_t len;
-				uint64_t remain = dev->vhost_hlen;
-				uint64_t src = (uint64_t)(uintptr_t)hdr, dst;
-				uint64_t iova = buf_vec[0].buf_iova;
-				uint16_t hdr_vec_idx = 0;
-
-				while (remain) {
-					len = RTE_MIN(remain,
-						buf_vec[hdr_vec_idx].buf_len);
-					dst = buf_vec[hdr_vec_idx].buf_addr;
-					rte_memcpy((void *)(uintptr_t)dst,
-							(void *)(uintptr_t)src,
-							len);
-
-					PRINT_PACKET(dev, (uintptr_t)dst,
-							(uint32_t)len, 0);
-					vhost_log_cache_write(dev, vq,
-							iova, len);
-
-					remain -= len;
-					iova += len;
-					src += len;
-					hdr_vec_idx++;
-				}
+				copy_vnet_hdr_to_desc(dev, vq, buf_vec, hdr);
 			} else {
 				PRINT_PACKET(dev, (uintptr_t)hdr_addr,
 						dev->vhost_hlen, 0);
@@ -1063,6 +1071,27 @@ vhost_dequeue_offload(struct virtio_net_hdr *hdr, struct rte_mbuf *m)
 	}
 }
 
+static __rte_noinline void
+copy_vnet_hdr_from_desc(struct virtio_net_hdr *hdr,
+		struct buf_vector *buf_vec)
+{
+	uint64_t len;
+	uint64_t remain = sizeof(struct virtio_net_hdr);
+	uint64_t src;
+	uint64_t dst = (uint64_t)(uintptr_t)hdr;
+
+	while (remain) {
+		len = RTE_MIN(remain, buf_vec->buf_len);
+		src = buf_vec->buf_addr;
+		rte_memcpy((void *)(uintptr_t)dst,
+				(void *)(uintptr_t)src, len);
+
+		remain -= len;
+		dst += len;
+		buf_vec++;
+	}
+}
+
 static __rte_always_inline int
 copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 		  struct buf_vector *buf_vec, uint16_t nr_vec,
@@ -1094,28 +1123,11 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 
 	if (virtio_net_with_host_offload(dev)) {
 		if (unlikely(buf_len < sizeof(struct virtio_net_hdr))) {
-			uint64_t len;
-			uint64_t remain = sizeof(struct virtio_net_hdr);
-			uint64_t src;
-			uint64_t dst = (uint64_t)(uintptr_t)&tmp_hdr;
-			uint16_t hdr_vec_idx = 0;
-
 			/*
 			 * No luck, the virtio-net header doesn't fit
 			 * in a contiguous virtual area.
 			 */
-			while (remain) {
-				len = RTE_MIN(remain,
-					buf_vec[hdr_vec_idx].buf_len);
-				src = buf_vec[hdr_vec_idx].buf_addr;
-				rte_memcpy((void *)(uintptr_t)dst,
-						   (void *)(uintptr_t)src, len);
-
-				remain -= len;
-				dst += len;
-				hdr_vec_idx++;
-			}
-
+			copy_vnet_hdr_from_desc(&tmp_hdr, buf_vec);
 			hdr = &tmp_hdr;
 		} else {
 			hdr = (struct virtio_net_hdr *)((uintptr_t)buf_addr);
-- 
2.21.0



* [dpdk-dev] [PATCH v3 4/5] vhost: simplify descriptor's buffer prefetching
  2019-05-29 13:04 [dpdk-dev] [PATCH v3 0/5] vhost: I-cache pressure optimizations Maxime Coquelin
                   ` (2 preceding siblings ...)
  2019-05-29 13:04 ` [dpdk-dev] [PATCH v3 3/5] vhost: do not inline unlikely fragmented buffers code Maxime Coquelin
@ 2019-05-29 13:04 ` Maxime Coquelin
  2019-05-29 13:04 ` [dpdk-dev] [PATCH v3 5/5] eal/x86: force inlining of all memcpy and mov helpers Maxime Coquelin
  2019-06-05 12:32 ` [dpdk-dev] [PATCH v3 0/5] vhost: I-cache pressure optimizations Maxime Coquelin
  5 siblings, 0 replies; 11+ messages in thread
From: Maxime Coquelin @ 2019-05-29 13:04 UTC (permalink / raw)
  To: dev, tiwei.bie, david.marchand, jfreimann, bruce.richardson,
	zhihong.wang, konstantin.ananyev, mattias.ronnblom
  Cc: Maxime Coquelin

Now that we have a single function to map the descriptor
buffers, let's prefetch them there, as it is the earliest
place we can do it.

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reviewed-by: Tiwei Bie <tiwei.bie@intel.com>
---
 lib/librte_vhost/virtio_net.c | 32 ++------------------------------
 1 file changed, 2 insertions(+), 30 deletions(-)

diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 4564e9bcc9..8f0e784f77 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -286,6 +286,8 @@ map_one_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
 		if (unlikely(!desc_addr))
 			return -1;
 
+		rte_prefetch0((void *)(uintptr_t)desc_addr);
+
 		buf_vec[vec_id].buf_iova = desc_iova;
 		buf_vec[vec_id].buf_addr = desc_addr;
 		buf_vec[vec_id].buf_len  = desc_chunck_len;
@@ -666,9 +668,6 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	buf_iova = buf_vec[vec_idx].buf_iova;
 	buf_len = buf_vec[vec_idx].buf_len;
 
-	if (nr_vec > 1)
-		rte_prefetch0((void *)(uintptr_t)buf_vec[1].buf_addr);
-
 	if (unlikely(buf_len < dev->vhost_hlen && nr_vec <= 1)) {
 		error = -1;
 		goto out;
@@ -711,10 +710,6 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			buf_iova = buf_vec[vec_idx].buf_iova;
 			buf_len = buf_vec[vec_idx].buf_len;
 
-			/* Prefetch next buffer address. */
-			if (vec_idx + 1 < nr_vec)
-				rte_prefetch0((void *)(uintptr_t)
-						buf_vec[vec_idx + 1].buf_addr);
 			buf_offset = 0;
 			buf_avail  = buf_len;
 		}
@@ -812,8 +807,6 @@ virtio_dev_rx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			break;
 		}
 
-		rte_prefetch0((void *)(uintptr_t)buf_vec[0].buf_addr);
-
 		VHOST_LOG_DEBUG(VHOST_DATA, "(%d) current index %d | end index %d\n",
 			dev->vid, vq->last_avail_idx,
 			vq->last_avail_idx + num_buffers);
@@ -861,8 +854,6 @@ virtio_dev_rx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			break;
 		}
 
-		rte_prefetch0((void *)(uintptr_t)buf_vec[0].buf_addr);
-
 		VHOST_LOG_DEBUG(VHOST_DATA, "(%d) current index %d | end index %d\n",
 			dev->vid, vq->last_avail_idx,
 			vq->last_avail_idx + num_buffers);
@@ -1118,9 +1109,6 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 		goto out;
 	}
 
-	if (likely(nr_vec > 1))
-		rte_prefetch0((void *)(uintptr_t)buf_vec[1].buf_addr);
-
 	if (virtio_net_with_host_offload(dev)) {
 		if (unlikely(buf_len < sizeof(struct virtio_net_hdr))) {
 			/*
@@ -1131,7 +1119,6 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			hdr = &tmp_hdr;
 		} else {
 			hdr = (struct virtio_net_hdr *)((uintptr_t)buf_addr);
-			rte_prefetch0(hdr);
 		}
 	}
 
@@ -1161,9 +1148,6 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 		buf_avail = buf_vec[vec_idx].buf_len - dev->vhost_hlen;
 	}
 
-	rte_prefetch0((void *)(uintptr_t)
-			(buf_addr + buf_offset));
-
 	PRINT_PACKET(dev,
 			(uintptr_t)(buf_addr + buf_offset),
 			(uint32_t)buf_avail, 0);
@@ -1229,14 +1213,6 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			buf_iova = buf_vec[vec_idx].buf_iova;
 			buf_len = buf_vec[vec_idx].buf_len;
 
-			/*
-			 * Prefecth desc n + 1 buffer while
-			 * desc n buffer is processed.
-			 */
-			if (vec_idx + 1 < nr_vec)
-				rte_prefetch0((void *)(uintptr_t)
-						buf_vec[vec_idx + 1].buf_addr);
-
 			buf_offset = 0;
 			buf_avail  = buf_len;
 
@@ -1380,8 +1356,6 @@ virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 		if (likely(dev->dequeue_zero_copy == 0))
 			update_shadow_used_ring_split(vq, head_idx, 0);
 
-		rte_prefetch0((void *)(uintptr_t)buf_vec[0].buf_addr);
-
 		pkts[i] = rte_pktmbuf_alloc(mbuf_pool);
 		if (unlikely(pkts[i] == NULL)) {
 			RTE_LOG(ERR, VHOST_DATA,
@@ -1491,8 +1465,6 @@ virtio_dev_tx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			update_shadow_used_ring_packed(vq, buf_id, 0,
 					desc_count);
 
-		rte_prefetch0((void *)(uintptr_t)buf_vec[0].buf_addr);
-
 		pkts[i] = rte_pktmbuf_alloc(mbuf_pool);
 		if (unlikely(pkts[i] == NULL)) {
 			RTE_LOG(ERR, VHOST_DATA,
-- 
2.21.0



* [dpdk-dev] [PATCH v3 5/5] eal/x86: force inlining of all memcpy and mov helpers
  2019-05-29 13:04 [dpdk-dev] [PATCH v3 0/5] vhost: I-cache pressure optimizations Maxime Coquelin
                   ` (3 preceding siblings ...)
  2019-05-29 13:04 ` [dpdk-dev] [PATCH v3 4/5] vhost: simplify descriptor's buffer prefetching Maxime Coquelin
@ 2019-05-29 13:04 ` Maxime Coquelin
  2019-06-05 12:53   ` Bruce Richardson
  2019-06-05 12:32 ` [dpdk-dev] [PATCH v3 0/5] vhost: I-cache pressure optimizations Maxime Coquelin
  5 siblings, 1 reply; 11+ messages in thread
From: Maxime Coquelin @ 2019-05-29 13:04 UTC (permalink / raw)
  To: dev, tiwei.bie, david.marchand, jfreimann, bruce.richardson,
	zhihong.wang, konstantin.ananyev, mattias.ronnblom
  Cc: Maxime Coquelin

Some helpers in the header file are force-inlined while
others are only inlined; this patch forces inlining for all
of them.

This avoids them being emitted as standalone functions when
called multiple times in the same object file. For example,
when packed ring support was added to the vhost-user
library, rte_memcpy_generic was no longer inlined.

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
 .../common/include/arch/x86/rte_memcpy.h       | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index 7b758094df..ba44c4a328 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -115,7 +115,7 @@ rte_mov256(uint8_t *dst, const uint8_t *src)
  * Copy 128-byte blocks from one location to another,
  * locations should not overlap.
  */
-static inline void
+static __rte_always_inline void
 rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
 {
 	__m512i zmm0, zmm1;
@@ -163,7 +163,7 @@ rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
 	}
 }
 
-static inline void *
+static __rte_always_inline void *
 rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
 	uintptr_t dstu = (uintptr_t)dst;
@@ -330,7 +330,7 @@ rte_mov64(uint8_t *dst, const uint8_t *src)
  * Copy 128 bytes from one location to another,
  * locations should not overlap.
  */
-static inline void
+static __rte_always_inline void
 rte_mov128(uint8_t *dst, const uint8_t *src)
 {
 	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
@@ -343,7 +343,7 @@ rte_mov128(uint8_t *dst, const uint8_t *src)
  * Copy 128-byte blocks from one location to another,
  * locations should not overlap.
  */
-static inline void
+static __rte_always_inline void
 rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
 {
 	__m256i ymm0, ymm1, ymm2, ymm3;
@@ -363,7 +363,7 @@ rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
 	}
 }
 
-static inline void *
+static __rte_always_inline void *
 rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
 	uintptr_t dstu = (uintptr_t)dst;
@@ -523,7 +523,7 @@ rte_mov64(uint8_t *dst, const uint8_t *src)
  * Copy 128 bytes from one location to another,
  * locations should not overlap.
  */
-static inline void
+static __rte_always_inline void
 rte_mov128(uint8_t *dst, const uint8_t *src)
 {
 	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
@@ -655,7 +655,7 @@ __extension__ ({                                                      \
     }                                                                 \
 })
 
-static inline void *
+static __rte_always_inline void *
 rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
 	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
@@ -800,7 +800,7 @@ rte_memcpy_generic(void *dst, const void *src, size_t n)
 
 #endif /* RTE_MACHINE_CPUFLAG */
 
-static inline void *
+static __rte_always_inline void *
 rte_memcpy_aligned(void *dst, const void *src, size_t n)
 {
 	void *ret = dst;
@@ -860,7 +860,7 @@ rte_memcpy_aligned(void *dst, const void *src, size_t n)
 	return ret;
 }
 
-static inline void *
+static __rte_always_inline void *
 rte_memcpy(void *dst, const void *src, size_t n)
 {
 	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
-- 
2.21.0



* Re: [dpdk-dev] [PATCH v3 0/5] vhost: I-cache pressure optimizations
  2019-05-29 13:04 [dpdk-dev] [PATCH v3 0/5] vhost: I-cache pressure optimizations Maxime Coquelin
                   ` (4 preceding siblings ...)
  2019-05-29 13:04 ` [dpdk-dev] [PATCH v3 5/5] eal/x86: force inlining of all memcpy and mov helpers Maxime Coquelin
@ 2019-06-05 12:32 ` Maxime Coquelin
  2019-06-05 12:52   ` Bruce Richardson
  5 siblings, 1 reply; 11+ messages in thread
From: Maxime Coquelin @ 2019-06-05 12:32 UTC (permalink / raw)
  To: dev, tiwei.bie, david.marchand, jfreimann, bruce.richardson,
	zhihong.wang, konstantin.ananyev, mattias.ronnblom



On 5/29/19 3:04 PM, Maxime Coquelin wrote:
> Maxime Coquelin (5):
>    vhost: un-inline dirty pages logging functions
>    vhost: do not inline packed and split functions
>    vhost: do not inline unlikely fragmented buffers code
>    vhost: simplify descriptor's buffer prefetching
>    eal/x86: force inlining of all memcpy and mov helpers
> 
>   .../common/include/arch/x86/rte_memcpy.h      |  18 +-
>   lib/librte_vhost/vdpa.c                       |   2 +-
>   lib/librte_vhost/vhost.c                      | 164 +++++++++++++++++
>   lib/librte_vhost/vhost.h                      | 165 ++----------------
>   lib/librte_vhost/virtio_net.c                 | 140 +++++++--------
>   5 files changed, 251 insertions(+), 238 deletions(-)
> 


Applied patches 1 to 4 to dpdk-next-virtio/master.

Bruce, I'm assigning patch 5 to you in Patchwork, as this is not
vhost/virtio specific.

Thanks,
Maxime


* Re: [dpdk-dev] [PATCH v3 0/5] vhost: I-cache pressure optimizations
  2019-06-05 12:32 ` [dpdk-dev] [PATCH v3 0/5] vhost: I-cache pressure optimizations Maxime Coquelin
@ 2019-06-05 12:52   ` Bruce Richardson
  2019-06-05 13:00     ` Maxime Coquelin
  0 siblings, 1 reply; 11+ messages in thread
From: Bruce Richardson @ 2019-06-05 12:52 UTC (permalink / raw)
  To: Maxime Coquelin
  Cc: dev, tiwei.bie, david.marchand, jfreimann, zhihong.wang,
	konstantin.ananyev, mattias.ronnblom

On Wed, Jun 05, 2019 at 02:32:27PM +0200, Maxime Coquelin wrote:
> 
> Applied patches 1 to 4 to dpdk-next-virtio/master.
> 
> Bruce, I'm assigning patch 5 to you in Patchwork, as this is not
> vhost/virtio specific.
> 
Patch looks ok to me, but I'm not the one to apply it.

/Bruce


* Re: [dpdk-dev] [PATCH v3 5/5] eal/x86: force inlining of all memcpy and mov helpers
  2019-05-29 13:04 ` [dpdk-dev] [PATCH v3 5/5] eal/x86: force inlining of all memcpy and mov helpers Maxime Coquelin
@ 2019-06-05 12:53   ` Bruce Richardson
  2019-06-06  9:33     ` Maxime Coquelin
  0 siblings, 1 reply; 11+ messages in thread
From: Bruce Richardson @ 2019-06-05 12:53 UTC (permalink / raw)
  To: Maxime Coquelin
  Cc: dev, tiwei.bie, david.marchand, jfreimann, zhihong.wang,
	konstantin.ananyev, mattias.ronnblom

On Wed, May 29, 2019 at 03:04:20PM +0200, Maxime Coquelin wrote:
> Some helpers in the header file are force-inlined while
> others are only inlined; this patch forces inlining for all
> of them.
> 
> This avoids them being emitted as standalone functions when
> called multiple times in the same object file. For example,
> when packed ring support was added to the vhost-user
> library, rte_memcpy_generic was no longer inlined.
> 
> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>



* Re: [dpdk-dev] [PATCH v3 0/5] vhost: I-cache pressure optimizations
  2019-06-05 12:52   ` Bruce Richardson
@ 2019-06-05 13:00     ` Maxime Coquelin
  0 siblings, 0 replies; 11+ messages in thread
From: Maxime Coquelin @ 2019-06-05 13:00 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: dev, tiwei.bie, david.marchand, jfreimann, zhihong.wang,
	konstantin.ananyev, mattias.ronnblom



On 6/5/19 2:52 PM, Bruce Richardson wrote:
> On Wed, Jun 05, 2019 at 02:32:27PM +0200, Maxime Coquelin wrote:
>>
>> Applied patches 1 to 4 to dpdk-next-virtio/master.
>>
>> Bruce, I'm assigning patch 5 to you in Patchwork, as this is not
>> vhost/virtio specific.
>>
> Patch looks ok to me, but I'm not the one to apply it.

Ok, my bad. I'll switch to the right maintainer.

Thanks for the ack,
Maxime

> /Bruce
> 


* Re: [dpdk-dev] [PATCH v3 5/5] eal/x86: force inlining of all memcpy and mov helpers
  2019-06-05 12:53   ` Bruce Richardson
@ 2019-06-06  9:33     ` Maxime Coquelin
  0 siblings, 0 replies; 11+ messages in thread
From: Maxime Coquelin @ 2019-06-06  9:33 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: dev, tiwei.bie, david.marchand, jfreimann, zhihong.wang,
	konstantin.ananyev, mattias.ronnblom



On 6/5/19 2:53 PM, Bruce Richardson wrote:
> On Wed, May 29, 2019 at 03:04:20PM +0200, Maxime Coquelin wrote:
>> Some helpers in the header file are force-inlined while
>> others are only inlined; this patch forces inlining for all
>> of them.
>>
>> This avoids them being emitted as standalone functions when
>> called multiple times in the same object file. For example,
>> when packed ring support was added to the vhost-user
>> library, rte_memcpy_generic was no longer inlined.
>>
>> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> 

Thanks, finally applied it on dpdk-next-virtio tree as per Thomas'
request.



Thread overview: 11+ messages
2019-05-29 13:04 [dpdk-dev] [PATCH v3 0/5] vhost: I-cache pressure optimizations Maxime Coquelin
2019-05-29 13:04 ` [dpdk-dev] [PATCH v3 1/5] vhost: un-inline dirty pages logging functions Maxime Coquelin
2019-05-29 13:04 ` [dpdk-dev] [PATCH v3 2/5] vhost: do not inline packed and split functions Maxime Coquelin
2019-05-29 13:04 ` [dpdk-dev] [PATCH v3 3/5] vhost: do not inline unlikely fragmented buffers code Maxime Coquelin
2019-05-29 13:04 ` [dpdk-dev] [PATCH v3 4/5] vhost: simplify descriptor's buffer prefetching Maxime Coquelin
2019-05-29 13:04 ` [dpdk-dev] [PATCH v3 5/5] eal/x86: force inlining of all memcpy and mov helpers Maxime Coquelin
2019-06-05 12:53   ` Bruce Richardson
2019-06-06  9:33     ` Maxime Coquelin
2019-06-05 12:32 ` [dpdk-dev] [PATCH v3 0/5] vhost: I-cache pressure optimizations Maxime Coquelin
2019-06-05 12:52   ` Bruce Richardson
2019-06-05 13:00     ` Maxime Coquelin
