DPDK patches and discussions
* [dpdk-dev] [PATCH v2 0/5] vhost: I-cache pressure optimizations
@ 2019-05-17 15:06 Maxime Coquelin
  2019-05-17 15:06 ` [dpdk-dev] [PATCH v2 1/5] vhost: un-inline dirty pages logging functions Maxime Coquelin
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Maxime Coquelin @ 2019-05-17 15:06 UTC (permalink / raw)
  To: dev, tiwei.bie, jfreimann, zhihong.wang, bruce.richardson,
	konstantin.ananyev, david.marchand
  Cc: Maxime Coquelin

Some OVS-DPDK PVP benchmarks show a performance drop
when switching from DPDK v17.11 to v18.11.

With the addition of packed ring layout support,
rte_vhost_enqueue_burst and rte_vhost_dequeue_burst
became very large, while only part of their
instructions is executed at runtime (either the
packed or the split ring path is taken, never both).

This series aims at reducing the I-cache pressure,
first by un-inlining the split and packed ring
functions, and also by moving code considered cold
into dedicated functions (dirty page logging, and
the fragmented descriptor buffer handling added
for CVE-2018-1059).
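
As an illustration, the hot/cold split pattern used across the
series looks roughly as follows (a minimal sketch with simplified
names, not code taken from the patches):

/* vhost.c: the cold body lives out of line, only its
 * prototype is visible to the callers */
void __log_write_cold(struct virtio_net *dev, uint64_t addr,
		uint64_t len);

/* vhost.h: hot wrapper, so that only a predicted-not-taken
 * branch remains in the callers' instruction stream */
static __rte_always_inline void
log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
{
	if (unlikely(dev->features & (1ULL << VHOST_F_LOG_ALL)))
		__log_write_cold(dev, addr, len);
}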

With the series applied, the size of the enqueue and
dequeue split paths is reduced significantly:

+---------+--------------------+---------------------+
| Version | Enqueue split path |  Dequeue split path |
+---------+--------------------+---------------------+
| v19.05  | 16461B             | 25521B              |
| +series | 7286B              | 11285B              |
+---------+--------------------+---------------------+
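
Such sizes can be measured, for instance, by summing the sizes
of the symbols making up each path as reported by nm (the object
path below is an assumption, it depends on the build system):

# nm -S --size-sort build/lib/librte_vhost/virtio_net.o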

Using the perf tool to monitor iTLB load misses
while doing a PVP benchmark with testpmd as the
vswitch, we can see the number of iTLB misses
being greatly reduced:

- v19.05:
# perf stat --repeat 10  -C 2,3  -e iTLB-load-miss -- sleep 10

 Performance counter stats for 'CPU(s) 2,3' (10 runs):

             2,438      iTLB-load-miss                                                ( +- 13.43% )

       10.00058928 +- 0.00000336 seconds time elapsed  ( +-  0.00% )

- +series:
# perf stat --repeat 10  -C 2,3  -e iTLB-load-miss -- sleep 10

 Performance counter stats for 'CPU(s) 2,3' (10 runs):

                55      iTLB-load-miss                                                ( +- 10.08% )

       10.00059466 +- 0.00000283 seconds time elapsed  ( +-  0.00% )

The series also forces the inlining of some rte_memcpy
helpers: with the addition of packed ring support, some
of them were no longer inlined but emitted as functions
in the virtio_net object file, which was not expected.

Finally, the series simplifies descriptor buffer
prefetching by doing it in the recently introduced
descriptor buffer mapping function.

v2:
===
 - Fix checkpatch issue
 - Reset author for patch 5 (David)
 - Force non-inlining in patch 2 (David)
 - Fix typo in patch 3 commit message (David)

Maxime Coquelin (5):
  vhost: un-inline dirty pages logging functions
  vhost: do not inline packed and split functions
  vhost: do not inline unlikely fragmented buffers code
  vhost: simplify descriptor's buffer prefetching
  eal/x86: force inlining of all memcpy and mov helpers

 .../common/include/arch/x86/rte_memcpy.h      |  18 +-
 lib/librte_vhost/vhost.c                      | 165 ++++++++++++++++++
 lib/librte_vhost/vhost.h                      | 164 ++---------------
 lib/librte_vhost/virtio_net.c                 | 143 +++++++--------
 4 files changed, 251 insertions(+), 239 deletions(-)

-- 
2.21.0



* [dpdk-dev] [PATCH v2 1/5] vhost: un-inline dirty pages logging functions
  2019-05-17 15:06 [dpdk-dev] [PATCH v2 0/5] vhost: I-cache pressure optimizations Maxime Coquelin
@ 2019-05-17 15:06 ` Maxime Coquelin
  2019-05-20  5:18   ` Tiwei Bie
  2019-05-17 15:06 ` [dpdk-dev] [PATCH v2 2/5] vhost: do not inline packed and split functions Maxime Coquelin
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Maxime Coquelin @ 2019-05-17 15:06 UTC (permalink / raw)
  To: dev, tiwei.bie, jfreimann, zhihong.wang, bruce.richardson,
	konstantin.ananyev, david.marchand
  Cc: Maxime Coquelin

In order to reduce the I-cache pressure, this patch removes
the inlining of the dirty pages logging functions, which can
be considered cold paths.

Indeed, these functions are only called while doing live
migration, so they are not called most of the time.
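
One way to confirm that the helpers are now emitted out of line
is to look for their symbols in the object file (a sketch, the
object path depends on the build system):

# nm build/lib/librte_vhost/vhost.o | grep __vhost_log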

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
 lib/librte_vhost/vhost.c | 132 +++++++++++++++++++++++++++++++++++++++
 lib/librte_vhost/vhost.h | 129 ++++----------------------------------
 2 files changed, 144 insertions(+), 117 deletions(-)

diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index 163f4595ef..4a54ad6bd1 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -69,6 +69,138 @@ __vhost_iova_to_vva(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	return 0;
 }
 
+#define VHOST_LOG_PAGE	4096
+
+/*
+ * Atomically set a bit in memory.
+ */
+static __rte_always_inline void
+vhost_set_bit(unsigned int nr, volatile uint8_t *addr)
+{
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION < 70100)
+	/*
+	 * __sync_ built-ins are deprecated, but __atomic_ ones
+	 * are sub-optimized in older GCC versions.
+	 */
+	__sync_fetch_and_or_1(addr, (1U << nr));
+#else
+	__atomic_fetch_or(addr, (1U << nr), __ATOMIC_RELAXED);
+#endif
+}
+
+static __rte_always_inline void
+vhost_log_page(uint8_t *log_base, uint64_t page)
+{
+	vhost_set_bit(page % 8, &log_base[page / 8]);
+}
+
+void
+__vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
+{
+	uint64_t page;
+
+	if (unlikely(!dev->log_base || !len))
+		return;
+
+	if (unlikely(dev->log_size <= ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
+		return;
+
+	/* To make sure guest memory updates are committed before logging */
+	rte_smp_wmb();
+
+	page = addr / VHOST_LOG_PAGE;
+	while (page * VHOST_LOG_PAGE < addr + len) {
+		vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
+		page += 1;
+	}
+}
+
+void
+__vhost_log_cache_sync(struct virtio_net *dev, struct vhost_virtqueue *vq)
+{
+	unsigned long *log_base;
+	int i;
+
+	if (unlikely(!dev->log_base))
+		return;
+
+	rte_smp_wmb();
+
+	log_base = (unsigned long *)(uintptr_t)dev->log_base;
+
+	for (i = 0; i < vq->log_cache_nb_elem; i++) {
+		struct log_cache_entry *elem = vq->log_cache + i;
+
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION < 70100)
+		/*
+		 * '__sync' builtins are deprecated, but '__atomic' ones
+		 * are sub-optimized in older GCC versions.
+		 */
+		__sync_fetch_and_or(log_base + elem->offset, elem->val);
+#else
+		__atomic_fetch_or(log_base + elem->offset, elem->val,
+				__ATOMIC_RELAXED);
+#endif
+	}
+
+	rte_smp_wmb();
+
+	vq->log_cache_nb_elem = 0;
+}
+
+static __rte_always_inline void
+vhost_log_cache_page(struct virtio_net *dev, struct vhost_virtqueue *vq,
+			uint64_t page)
+{
+	uint32_t bit_nr = page % (sizeof(unsigned long) << 3);
+	uint32_t offset = page / (sizeof(unsigned long) << 3);
+	int i;
+
+	for (i = 0; i < vq->log_cache_nb_elem; i++) {
+		struct log_cache_entry *elem = vq->log_cache + i;
+
+		if (elem->offset == offset) {
+			elem->val |= (1UL << bit_nr);
+			return;
+		}
+	}
+
+	if (unlikely(i >= VHOST_LOG_CACHE_NR)) {
+		/*
+		 * No more room for a new log cache entry,
+		 * so write the dirty log map directly.
+		 */
+		rte_smp_wmb();
+		vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
+
+		return;
+	}
+
+	vq->log_cache[i].offset = offset;
+	vq->log_cache[i].val = (1UL << bit_nr);
+	vq->log_cache_nb_elem++;
+}
+
+void
+__vhost_log_cache_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
+			uint64_t addr, uint64_t len)
+{
+	uint64_t page;
+
+	if (unlikely(!dev->log_base || !len))
+		return;
+
+	if (unlikely(dev->log_size <= ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
+		return;
+
+	page = addr / VHOST_LOG_PAGE;
+	while (page * VHOST_LOG_PAGE < addr + len) {
+		vhost_log_cache_page(dev, vq, page);
+		page += 1;
+	}
+}
+
+
 void
 cleanup_vq(struct vhost_virtqueue *vq, int destroy)
 {
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index e9138dfab4..3ab7b4950f 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -350,138 +350,33 @@ desc_is_avail(struct vring_packed_desc *desc, bool wrap_counter)
 		wrap_counter != !!(flags & VRING_DESC_F_USED);
 }
 
-#define VHOST_LOG_PAGE	4096
-
-/*
- * Atomically set a bit in memory.
- */
-static __rte_always_inline void
-vhost_set_bit(unsigned int nr, volatile uint8_t *addr)
-{
-#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION < 70100)
-	/*
-	 * __sync_ built-ins are deprecated, but __atomic_ ones
-	 * are sub-optimized in older GCC versions.
-	 */
-	__sync_fetch_and_or_1(addr, (1U << nr));
-#else
-	__atomic_fetch_or(addr, (1U << nr), __ATOMIC_RELAXED);
-#endif
-}
-
-static __rte_always_inline void
-vhost_log_page(uint8_t *log_base, uint64_t page)
-{
-	vhost_set_bit(page % 8, &log_base[page / 8]);
-}
+void __vhost_log_cache_write(struct virtio_net *dev,
+		struct vhost_virtqueue *vq,
+		uint64_t addr, uint64_t len);
+void __vhost_log_cache_sync(struct virtio_net *dev,
+		struct vhost_virtqueue *vq);
+void __vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len);
 
 static __rte_always_inline void
 vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
 {
-	uint64_t page;
-
-	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
-		   !dev->log_base || !len))
-		return;
-
-	if (unlikely(dev->log_size <= ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
-		return;
-
-	/* To make sure guest memory updates are committed before logging */
-	rte_smp_wmb();
-
-	page = addr / VHOST_LOG_PAGE;
-	while (page * VHOST_LOG_PAGE < addr + len) {
-		vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
-		page += 1;
-	}
+	if (unlikely(dev->features & (1ULL << VHOST_F_LOG_ALL)))
+		__vhost_log_write(dev, addr, len);
 }
 
 static __rte_always_inline void
 vhost_log_cache_sync(struct virtio_net *dev, struct vhost_virtqueue *vq)
 {
-	unsigned long *log_base;
-	int i;
-
-	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
-		   !dev->log_base))
-		return;
-
-	rte_smp_wmb();
-
-	log_base = (unsigned long *)(uintptr_t)dev->log_base;
-
-	for (i = 0; i < vq->log_cache_nb_elem; i++) {
-		struct log_cache_entry *elem = vq->log_cache + i;
-
-#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION < 70100)
-		/*
-		 * '__sync' builtins are deprecated, but '__atomic' ones
-		 * are sub-optimized in older GCC versions.
-		 */
-		__sync_fetch_and_or(log_base + elem->offset, elem->val);
-#else
-		__atomic_fetch_or(log_base + elem->offset, elem->val,
-				__ATOMIC_RELAXED);
-#endif
-	}
-
-	rte_smp_wmb();
-
-	vq->log_cache_nb_elem = 0;
-}
-
-static __rte_always_inline void
-vhost_log_cache_page(struct virtio_net *dev, struct vhost_virtqueue *vq,
-			uint64_t page)
-{
-	uint32_t bit_nr = page % (sizeof(unsigned long) << 3);
-	uint32_t offset = page / (sizeof(unsigned long) << 3);
-	int i;
-
-	for (i = 0; i < vq->log_cache_nb_elem; i++) {
-		struct log_cache_entry *elem = vq->log_cache + i;
-
-		if (elem->offset == offset) {
-			elem->val |= (1UL << bit_nr);
-			return;
-		}
-	}
-
-	if (unlikely(i >= VHOST_LOG_CACHE_NR)) {
-		/*
-		 * No more room for a new log cache entry,
-		 * so write the dirty log map directly.
-		 */
-		rte_smp_wmb();
-		vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
-
-		return;
-	}
-
-	vq->log_cache[i].offset = offset;
-	vq->log_cache[i].val = (1UL << bit_nr);
-	vq->log_cache_nb_elem++;
+	if (unlikely(dev->features & (1ULL << VHOST_F_LOG_ALL)))
+		__vhost_log_cache_sync(dev, vq);
 }
 
 static __rte_always_inline void
 vhost_log_cache_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			uint64_t addr, uint64_t len)
 {
-	uint64_t page;
-
-	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
-		   !dev->log_base || !len))
-		return;
-
-	if (unlikely(dev->log_size <= ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
-		return;
-
-	page = addr / VHOST_LOG_PAGE;
-	while (page * VHOST_LOG_PAGE < addr + len) {
-		vhost_log_cache_page(dev, vq, page);
-		page += 1;
-	}
+	if (unlikely(dev->features & (1ULL << VHOST_F_LOG_ALL)))
+		__vhost_log_cache_write(dev, vq, addr, len);
 }
 
 static __rte_always_inline void
-- 
2.21.0



* [dpdk-dev] [PATCH v2 2/5] vhost: do not inline packed and split functions
  2019-05-17 15:06 [dpdk-dev] [PATCH v2 0/5] vhost: I-cache pressure optimizations Maxime Coquelin
  2019-05-17 15:06 ` [dpdk-dev] [PATCH v2 1/5] vhost: un-inline dirty pages logging functions Maxime Coquelin
@ 2019-05-17 15:06 ` Maxime Coquelin
  2019-05-20  5:30   ` Tiwei Bie
  2019-05-17 15:06 ` [dpdk-dev] [PATCH v2 3/5] vhost: do not inline unlikely fragmented buffers code Maxime Coquelin
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Maxime Coquelin @ 2019-05-17 15:06 UTC (permalink / raw)
  To: dev, tiwei.bie, jfreimann, zhihong.wang, bruce.richardson,
	konstantin.ananyev, david.marchand
  Cc: Maxime Coquelin

At runtime, either the packed Tx/Rx functions or the split
Tx/Rx functions will always be called; the two paths are
never mixed.

This patch removes the forced inlining in order to reduce
the I-cache pressure.
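
For reference, the attributes involved are defined in
rte_common.h roughly as follows (paraphrased, not copied
from the header):

#define __rte_always_inline inline __attribute__((always_inline))
#define __rte_noinline __attribute__((noinline))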

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
 lib/librte_vhost/virtio_net.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index a6a33a1013..8aeb180016 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -771,7 +771,7 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	return error;
 }
 
-static __rte_always_inline uint32_t
+static __rte_noinline uint32_t
 virtio_dev_rx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	struct rte_mbuf **pkts, uint32_t count)
 {
@@ -830,7 +830,7 @@ virtio_dev_rx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	return pkt_idx;
 }
 
-static __rte_always_inline uint32_t
+static __rte_noinline uint32_t
 virtio_dev_rx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	struct rte_mbuf **pkts, uint32_t count)
 {
@@ -1300,7 +1300,7 @@ get_zmbuf(struct vhost_virtqueue *vq)
 	return NULL;
 }
 
-static __rte_always_inline uint16_t
+static __rte_noinline uint16_t
 virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
 {
@@ -1422,7 +1422,7 @@ virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	return i;
 }
 
-static __rte_always_inline uint16_t
+static __rte_noinline uint16_t
 virtio_dev_tx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
 {
-- 
2.21.0



* [dpdk-dev] [PATCH v2 3/5] vhost: do not inline unlikely fragmented buffers code
  2019-05-17 15:06 [dpdk-dev] [PATCH v2 0/5] vhost: I-cache pressure optimizations Maxime Coquelin
  2019-05-17 15:06 ` [dpdk-dev] [PATCH v2 1/5] vhost: un-inline dirty pages logging functions Maxime Coquelin
  2019-05-17 15:06 ` [dpdk-dev] [PATCH v2 2/5] vhost: do not inline packed and split functions Maxime Coquelin
@ 2019-05-17 15:06 ` Maxime Coquelin
  2019-05-20  5:51   ` Tiwei Bie
  2019-05-17 15:06 ` [dpdk-dev] [PATCH v2 4/5] vhost: simplify descriptor's buffer prefetching Maxime Coquelin
  2019-05-17 15:06 ` [dpdk-dev] [PATCH v2 5/5] eal/x86: force inlining of all memcpy and mov helpers Maxime Coquelin
  4 siblings, 1 reply; 12+ messages in thread
From: Maxime Coquelin @ 2019-05-17 15:06 UTC (permalink / raw)
  To: dev, tiwei.bie, jfreimann, zhihong.wang, bruce.richardson,
	konstantin.ananyev, david.marchand
  Cc: Maxime Coquelin

Handling of fragmented virtio-net headers and indirect
descriptor tables was implemented to fix CVE-2018-1059. Such
fragmentation should never happen with healthy guests, so
these are already considered unlikely code paths.

This patch moves these bits into non-inline dedicated functions
to reduce the I-cache pressure.
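
For reference, the unlikely() hint used to mark these paths
expands to a GCC branch-prediction builtin (paraphrased from
rte_branch_prediction.h):

#define unlikely(x) __builtin_expect(!!(x), 0)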

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
 lib/librte_vhost/vhost.c      |  33 +++++++++++
 lib/librte_vhost/vhost.h      |  35 +-----------
 lib/librte_vhost/virtio_net.c | 103 +++++++++++++++++++---------------
 3 files changed, 92 insertions(+), 79 deletions(-)

diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index 4a54ad6bd1..8a4379bc13 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -201,6 +201,39 @@ __vhost_log_cache_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
 }
 
 
+void *
+alloc_copy_ind_table(struct virtio_net *dev, struct vhost_virtqueue *vq,
+		uint64_t desc_addr, uint64_t desc_len)
+{
+	void *idesc;
+	uint64_t src, dst;
+	uint64_t len, remain = desc_len;
+
+	idesc = rte_malloc(__func__, desc_len, 0);
+	if (unlikely(!idesc))
+		return NULL;
+
+	dst = (uint64_t)(uintptr_t)idesc;
+
+	while (remain) {
+		len = remain;
+		src = vhost_iova_to_vva(dev, vq, desc_addr, &len,
+				VHOST_ACCESS_RO);
+		if (unlikely(!src || !len)) {
+			rte_free(idesc);
+			return NULL;
+		}
+
+		rte_memcpy((void *)(uintptr_t)dst, (void *)(uintptr_t)src, len);
+
+		remain -= len;
+		dst += len;
+		desc_addr += len;
+	}
+
+	return idesc;
+}
+
 void
 cleanup_vq(struct vhost_virtqueue *vq, int destroy)
 {
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 3ab7b4950f..ab26454e1c 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -488,6 +488,8 @@ void vhost_backend_cleanup(struct virtio_net *dev);
 
 uint64_t __vhost_iova_to_vva(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			uint64_t iova, uint64_t *len, uint8_t perm);
+void *alloc_copy_ind_table(struct virtio_net *dev, struct vhost_virtqueue *vq,
+			uint64_t desc_addr, uint64_t desc_len);
 int vring_translate(struct virtio_net *dev, struct vhost_virtqueue *vq);
 void vring_invalidate(struct virtio_net *dev, struct vhost_virtqueue *vq);
 
@@ -601,39 +603,6 @@ vhost_vring_call_packed(struct virtio_net *dev, struct vhost_virtqueue *vq)
 		eventfd_write(vq->callfd, (eventfd_t)1);
 }
 
-static __rte_always_inline void *
-alloc_copy_ind_table(struct virtio_net *dev, struct vhost_virtqueue *vq,
-		uint64_t desc_addr, uint64_t desc_len)
-{
-	void *idesc;
-	uint64_t src, dst;
-	uint64_t len, remain = desc_len;
-
-	idesc = rte_malloc(__func__, desc_len, 0);
-	if (unlikely(!idesc))
-		return 0;
-
-	dst = (uint64_t)(uintptr_t)idesc;
-
-	while (remain) {
-		len = remain;
-		src = vhost_iova_to_vva(dev, vq, desc_addr, &len,
-				VHOST_ACCESS_RO);
-		if (unlikely(!src || !len)) {
-			rte_free(idesc);
-			return 0;
-		}
-
-		rte_memcpy((void *)(uintptr_t)dst, (void *)(uintptr_t)src, len);
-
-		remain -= len;
-		dst += len;
-		desc_addr += len;
-	}
-
-	return idesc;
-}
-
 static __rte_always_inline void
 free_ind_table(void *idesc)
 {
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 8aeb180016..8ba526c070 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -610,6 +610,36 @@ reserve_avail_buf_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	return 0;
 }
 
+static void
+copy_vnet_hdr_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
+		struct buf_vector *buf_vec,
+		struct virtio_net_hdr_mrg_rxbuf *hdr)
+{
+	uint64_t len;
+	uint64_t remain = dev->vhost_hlen;
+	uint64_t src = (uint64_t)(uintptr_t)hdr, dst;
+	uint64_t iova = buf_vec->buf_iova;
+
+	while (remain) {
+		len = RTE_MIN(remain,
+				buf_vec->buf_len);
+		dst = buf_vec->buf_addr;
+		rte_memcpy((void *)(uintptr_t)dst,
+				(void *)(uintptr_t)src,
+				len);
+
+		PRINT_PACKET(dev, (uintptr_t)dst,
+				(uint32_t)len, 0);
+		vhost_log_cache_write(dev, vq,
+				iova, len);
+
+		remain -= len;
+		iova += len;
+		src += len;
+		buf_vec++;
+	}
+}
+
 static __rte_always_inline int
 copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			    struct rte_mbuf *m, struct buf_vector *buf_vec,
@@ -703,30 +733,7 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
 						num_buffers);
 
 			if (unlikely(hdr == &tmp_hdr)) {
-				uint64_t len;
-				uint64_t remain = dev->vhost_hlen;
-				uint64_t src = (uint64_t)(uintptr_t)hdr, dst;
-				uint64_t iova = buf_vec[0].buf_iova;
-				uint16_t hdr_vec_idx = 0;
-
-				while (remain) {
-					len = RTE_MIN(remain,
-						buf_vec[hdr_vec_idx].buf_len);
-					dst = buf_vec[hdr_vec_idx].buf_addr;
-					rte_memcpy((void *)(uintptr_t)dst,
-							(void *)(uintptr_t)src,
-							len);
-
-					PRINT_PACKET(dev, (uintptr_t)dst,
-							(uint32_t)len, 0);
-					vhost_log_cache_write(dev, vq,
-							iova, len);
-
-					remain -= len;
-					iova += len;
-					src += len;
-					hdr_vec_idx++;
-				}
+				copy_vnet_hdr_to_desc(dev, vq, buf_vec, hdr);
 			} else {
 				PRINT_PACKET(dev, (uintptr_t)hdr_addr,
 						dev->vhost_hlen, 0);
@@ -1063,6 +1070,31 @@ vhost_dequeue_offload(struct virtio_net_hdr *hdr, struct rte_mbuf *m)
 	}
 }
 
+static void
+copy_vnet_hdr_from_desc(struct virtio_net_hdr *hdr,
+		struct buf_vector *buf_vec)
+{
+	uint64_t len;
+	uint64_t remain = sizeof(struct virtio_net_hdr);
+	uint64_t src;
+	uint64_t dst = (uint64_t)(uintptr_t)&hdr;
+
+	/*
+	 * No luck, the virtio-net header doesn't fit
+	 * in a contiguous virtual area.
+	 */
+	while (remain) {
+		len = RTE_MIN(remain, buf_vec->buf_len);
+		src = buf_vec->buf_addr;
+		rte_memcpy((void *)(uintptr_t)dst,
+				(void *)(uintptr_t)src, len);
+
+		remain -= len;
+		dst += len;
+		buf_vec++;
+	}
+}
+
 static __rte_always_inline int
 copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 		  struct buf_vector *buf_vec, uint16_t nr_vec,
@@ -1094,28 +1126,7 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 
 	if (virtio_net_with_host_offload(dev)) {
 		if (unlikely(buf_len < sizeof(struct virtio_net_hdr))) {
-			uint64_t len;
-			uint64_t remain = sizeof(struct virtio_net_hdr);
-			uint64_t src;
-			uint64_t dst = (uint64_t)(uintptr_t)&tmp_hdr;
-			uint16_t hdr_vec_idx = 0;
-
-			/*
-			 * No luck, the virtio-net header doesn't fit
-			 * in a contiguous virtual area.
-			 */
-			while (remain) {
-				len = RTE_MIN(remain,
-					buf_vec[hdr_vec_idx].buf_len);
-				src = buf_vec[hdr_vec_idx].buf_addr;
-				rte_memcpy((void *)(uintptr_t)dst,
-						   (void *)(uintptr_t)src, len);
-
-				remain -= len;
-				dst += len;
-				hdr_vec_idx++;
-			}
-
+			copy_vnet_hdr_from_desc(&tmp_hdr, buf_vec);
 			hdr = &tmp_hdr;
 		} else {
 			hdr = (struct virtio_net_hdr *)((uintptr_t)buf_addr);
-- 
2.21.0



* [dpdk-dev] [PATCH v2 4/5] vhost: simplify descriptor's buffer prefetching
  2019-05-17 15:06 [dpdk-dev] [PATCH v2 0/5] vhost: I-cache pressure optimizations Maxime Coquelin
                   ` (2 preceding siblings ...)
  2019-05-17 15:06 ` [dpdk-dev] [PATCH v2 3/5] vhost: do not inline unlikely fragmented buffers code Maxime Coquelin
@ 2019-05-17 15:06 ` Maxime Coquelin
  2019-05-29  8:05   ` Tiwei Bie
  2019-05-17 15:06 ` [dpdk-dev] [PATCH v2 5/5] eal/x86: force inlining of all memcpy and mov helpers Maxime Coquelin
  4 siblings, 1 reply; 12+ messages in thread
From: Maxime Coquelin @ 2019-05-17 15:06 UTC (permalink / raw)
  To: dev, tiwei.bie, jfreimann, zhihong.wang, bruce.richardson,
	konstantin.ananyev, david.marchand
  Cc: Maxime Coquelin

Now that we have a single function to map the descriptor
buffers, let's prefetch them there, as it is the earliest
place we can do it.
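
For reference, rte_prefetch0() asks the CPU to bring the cache
line into all cache levels; on x86 it boils down to a prefetcht0
hint (a sketch paraphrasing the rte_prefetch.h implementation):

static inline void rte_prefetch0(const volatile void *p)
{
	asm volatile ("prefetcht0 %[p]" : : [p] "m" (*(const volatile char *)p));
}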

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
 lib/librte_vhost/virtio_net.c | 32 ++------------------------------
 1 file changed, 2 insertions(+), 30 deletions(-)

diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 8ba526c070..364c468be8 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -286,6 +286,8 @@ map_one_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
 		if (unlikely(!desc_addr))
 			return -1;
 
+		rte_prefetch0((void *)(uintptr_t)desc_addr);
+
 		buf_vec[vec_id].buf_iova = desc_iova;
 		buf_vec[vec_id].buf_addr = desc_addr;
 		buf_vec[vec_id].buf_len  = desc_chunck_len;
@@ -665,9 +667,6 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	buf_iova = buf_vec[vec_idx].buf_iova;
 	buf_len = buf_vec[vec_idx].buf_len;
 
-	if (nr_vec > 1)
-		rte_prefetch0((void *)(uintptr_t)buf_vec[1].buf_addr);
-
 	if (unlikely(buf_len < dev->vhost_hlen && nr_vec <= 1)) {
 		error = -1;
 		goto out;
@@ -710,10 +709,6 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			buf_iova = buf_vec[vec_idx].buf_iova;
 			buf_len = buf_vec[vec_idx].buf_len;
 
-			/* Prefetch next buffer address. */
-			if (vec_idx + 1 < nr_vec)
-				rte_prefetch0((void *)(uintptr_t)
-						buf_vec[vec_idx + 1].buf_addr);
 			buf_offset = 0;
 			buf_avail  = buf_len;
 		}
@@ -811,8 +806,6 @@ virtio_dev_rx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			break;
 		}
 
-		rte_prefetch0((void *)(uintptr_t)buf_vec[0].buf_addr);
-
 		VHOST_LOG_DEBUG(VHOST_DATA, "(%d) current index %d | end index %d\n",
 			dev->vid, vq->last_avail_idx,
 			vq->last_avail_idx + num_buffers);
@@ -860,8 +853,6 @@ virtio_dev_rx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			break;
 		}
 
-		rte_prefetch0((void *)(uintptr_t)buf_vec[0].buf_addr);
-
 		VHOST_LOG_DEBUG(VHOST_DATA, "(%d) current index %d | end index %d\n",
 			dev->vid, vq->last_avail_idx,
 			vq->last_avail_idx + num_buffers);
@@ -1121,16 +1112,12 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 		goto out;
 	}
 
-	if (likely(nr_vec > 1))
-		rte_prefetch0((void *)(uintptr_t)buf_vec[1].buf_addr);
-
 	if (virtio_net_with_host_offload(dev)) {
 		if (unlikely(buf_len < sizeof(struct virtio_net_hdr))) {
 			copy_vnet_hdr_from_desc(&tmp_hdr, buf_vec);
 			hdr = &tmp_hdr;
 		} else {
 			hdr = (struct virtio_net_hdr *)((uintptr_t)buf_addr);
-			rte_prefetch0(hdr);
 		}
 	}
 
@@ -1160,9 +1147,6 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 		buf_avail = buf_vec[vec_idx].buf_len - dev->vhost_hlen;
 	}
 
-	rte_prefetch0((void *)(uintptr_t)
-			(buf_addr + buf_offset));
-
 	PRINT_PACKET(dev,
 			(uintptr_t)(buf_addr + buf_offset),
 			(uint32_t)buf_avail, 0);
@@ -1228,14 +1212,6 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			buf_iova = buf_vec[vec_idx].buf_iova;
 			buf_len = buf_vec[vec_idx].buf_len;
 
-			/*
-			 * Prefecth desc n + 1 buffer while
-			 * desc n buffer is processed.
-			 */
-			if (vec_idx + 1 < nr_vec)
-				rte_prefetch0((void *)(uintptr_t)
-						buf_vec[vec_idx + 1].buf_addr);
-
 			buf_offset = 0;
 			buf_avail  = buf_len;
 
@@ -1379,8 +1355,6 @@ virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 		if (likely(dev->dequeue_zero_copy == 0))
 			update_shadow_used_ring_split(vq, head_idx, 0);
 
-		rte_prefetch0((void *)(uintptr_t)buf_vec[0].buf_addr);
-
 		pkts[i] = rte_pktmbuf_alloc(mbuf_pool);
 		if (unlikely(pkts[i] == NULL)) {
 			RTE_LOG(ERR, VHOST_DATA,
@@ -1490,8 +1464,6 @@ virtio_dev_tx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			update_shadow_used_ring_packed(vq, buf_id, 0,
 					desc_count);
 
-		rte_prefetch0((void *)(uintptr_t)buf_vec[0].buf_addr);
-
 		pkts[i] = rte_pktmbuf_alloc(mbuf_pool);
 		if (unlikely(pkts[i] == NULL)) {
 			RTE_LOG(ERR, VHOST_DATA,
-- 
2.21.0



* [dpdk-dev] [PATCH v2 5/5] eal/x86: force inlining of all memcpy and mov helpers
  2019-05-17 15:06 [dpdk-dev] [PATCH v2 0/5] vhost: I-cache pressure optimizations Maxime Coquelin
                   ` (3 preceding siblings ...)
  2019-05-17 15:06 ` [dpdk-dev] [PATCH v2 4/5] vhost: simplify descriptor's buffer prefetching Maxime Coquelin
@ 2019-05-17 15:06 ` Maxime Coquelin
  2019-05-20  8:30   ` David Marchand
  4 siblings, 1 reply; 12+ messages in thread
From: Maxime Coquelin @ 2019-05-17 15:06 UTC (permalink / raw)
  To: dev, tiwei.bie, jfreimann, zhihong.wang, bruce.richardson,
	konstantin.ananyev, david.marchand
  Cc: Maxime Coquelin

Some helpers in the header file are force-inlined while
others are only inlined; this patch forces inlining for all
of them.

This avoids them being emitted as functions when called
multiple times in the same object file. For example, when
packed ring support was added to the vhost-user library,
rte_memcpy_generic was no longer inlined.
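
A stand-alone illustration of the difference (not DPDK code):
"static inline" is only a hint, so a large helper called from
several sites may be emitted as a local out-of-line function,
while always_inline forces expansion at each call site:

#include <string.h>

static inline void
copy_hinted(char *dst, const char *src, size_t n)
{
	memcpy(dst, src, n); /* compiler may keep this out of line */
}

static inline __attribute__((always_inline)) void
copy_forced(char *dst, const char *src, size_t n)
{
	memcpy(dst, src, n); /* always expanded into the caller */
}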

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
 .../common/include/arch/x86/rte_memcpy.h       | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index 7b758094df..ba44c4a328 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -115,7 +115,7 @@ rte_mov256(uint8_t *dst, const uint8_t *src)
  * Copy 128-byte blocks from one location to another,
  * locations should not overlap.
  */
-static inline void
+static __rte_always_inline void
 rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
 {
 	__m512i zmm0, zmm1;
@@ -163,7 +163,7 @@ rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
 	}
 }
 
-static inline void *
+static __rte_always_inline void *
 rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
 	uintptr_t dstu = (uintptr_t)dst;
@@ -330,7 +330,7 @@ rte_mov64(uint8_t *dst, const uint8_t *src)
  * Copy 128 bytes from one location to another,
  * locations should not overlap.
  */
-static inline void
+static __rte_always_inline void
 rte_mov128(uint8_t *dst, const uint8_t *src)
 {
 	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
@@ -343,7 +343,7 @@ rte_mov128(uint8_t *dst, const uint8_t *src)
  * Copy 128-byte blocks from one location to another,
  * locations should not overlap.
  */
-static inline void
+static __rte_always_inline void
 rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
 {
 	__m256i ymm0, ymm1, ymm2, ymm3;
@@ -363,7 +363,7 @@ rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
 	}
 }
 
-static inline void *
+static __rte_always_inline void *
 rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
 	uintptr_t dstu = (uintptr_t)dst;
@@ -523,7 +523,7 @@ rte_mov64(uint8_t *dst, const uint8_t *src)
  * Copy 128 bytes from one location to another,
  * locations should not overlap.
  */
-static inline void
+static __rte_always_inline void
 rte_mov128(uint8_t *dst, const uint8_t *src)
 {
 	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
@@ -655,7 +655,7 @@ __extension__ ({                                                      \
     }                                                                 \
 })
 
-static inline void *
+static __rte_always_inline void *
 rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
 	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
@@ -800,7 +800,7 @@ rte_memcpy_generic(void *dst, const void *src, size_t n)
 
 #endif /* RTE_MACHINE_CPUFLAG */
 
-static inline void *
+static __rte_always_inline void *
 rte_memcpy_aligned(void *dst, const void *src, size_t n)
 {
 	void *ret = dst;
@@ -860,7 +860,7 @@ rte_memcpy_aligned(void *dst, const void *src, size_t n)
 	return ret;
 }
 
-static inline void *
+static __rte_always_inline void *
 rte_memcpy(void *dst, const void *src, size_t n)
 {
 	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
-- 
2.21.0



* Re: [dpdk-dev] [PATCH v2 1/5] vhost: un-inline dirty pages logging functions
  2019-05-17 15:06 ` [dpdk-dev] [PATCH v2 1/5] vhost: un-inline dirty pages logging functions Maxime Coquelin
@ 2019-05-20  5:18   ` Tiwei Bie
  0 siblings, 0 replies; 12+ messages in thread
From: Tiwei Bie @ 2019-05-20  5:18 UTC (permalink / raw)
  To: Maxime Coquelin
  Cc: dev, jfreimann, zhihong.wang, bruce.richardson,
	konstantin.ananyev, david.marchand

On Fri, May 17, 2019 at 05:06:09PM +0200, Maxime Coquelin wrote:
[...]
> +void
> +__vhost_log_cache_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
> +			uint64_t addr, uint64_t len)
> +{
> +	uint64_t page;
> +
> +	if (unlikely(!dev->log_base || !len))
> +		return;
> +
> +	if (unlikely(dev->log_size <= ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
> +		return;
> +
> +	page = addr / VHOST_LOG_PAGE;
> +	while (page * VHOST_LOG_PAGE < addr + len) {
> +		vhost_log_cache_page(dev, vq, page);
> +		page += 1;
> +	}
> +}
> +
> +

Just need one empty line here.

For the rest,
Reviewed-by: Tiwei Bie <tiwei.bie@intel.com>

>  void
>  cleanup_vq(struct vhost_virtqueue *vq, int destroy)
>  {


* Re: [dpdk-dev] [PATCH v2 2/5] vhost: do not inline packed and split functions
  2019-05-17 15:06 ` [dpdk-dev] [PATCH v2 2/5] vhost: do not inline packed and split functions Maxime Coquelin
@ 2019-05-20  5:30   ` Tiwei Bie
  0 siblings, 0 replies; 12+ messages in thread
From: Tiwei Bie @ 2019-05-20  5:30 UTC (permalink / raw)
  To: Maxime Coquelin
  Cc: dev, jfreimann, zhihong.wang, bruce.richardson,
	konstantin.ananyev, david.marchand

On Fri, May 17, 2019 at 05:06:10PM +0200, Maxime Coquelin wrote:
> At runtime, either the packed Tx/Rx functions or the split
> Tx/Rx functions will always be called; the two paths are
> never mixed.
> 
> This patch removes the forced inlining in order to reduce
> the I-cache pressure.
> 
> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> ---
>  lib/librte_vhost/virtio_net.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index a6a33a1013..8aeb180016 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -771,7 +771,7 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  	return error;
>  }
>  
> -static __rte_always_inline uint32_t
> +static __rte_noinline uint32_t
>  virtio_dev_rx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  	struct rte_mbuf **pkts, uint32_t count)
>  {
> @@ -830,7 +830,7 @@ virtio_dev_rx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  	return pkt_idx;
>  }
>  
> -static __rte_always_inline uint32_t
> +static __rte_noinline uint32_t
>  virtio_dev_rx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  	struct rte_mbuf **pkts, uint32_t count)
>  {
> @@ -1300,7 +1300,7 @@ get_zmbuf(struct vhost_virtqueue *vq)
>  	return NULL;
>  }
>  
> -static __rte_always_inline uint16_t
> +static __rte_noinline uint16_t
>  virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
>  {
> @@ -1422,7 +1422,7 @@ virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  	return i;
>  }
>  
> -static __rte_always_inline uint16_t
> +static __rte_noinline uint16_t
>  virtio_dev_tx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
>  {
> -- 
> 2.21.0

Reviewed-by: Tiwei Bie <tiwei.bie@intel.com>


* Re: [dpdk-dev] [PATCH v2 3/5] vhost: do not inline unlikely fragmented buffers code
  2019-05-17 15:06 ` [dpdk-dev] [PATCH v2 3/5] vhost: do not inline unlikely fragmented buffers code Maxime Coquelin
@ 2019-05-20  5:51   ` Tiwei Bie
  2019-05-24 13:50     ` Maxime Coquelin
  0 siblings, 1 reply; 12+ messages in thread
From: Tiwei Bie @ 2019-05-20  5:51 UTC (permalink / raw)
  To: Maxime Coquelin
  Cc: dev, jfreimann, zhihong.wang, bruce.richardson,
	konstantin.ananyev, david.marchand

On Fri, May 17, 2019 at 05:06:11PM +0200, Maxime Coquelin wrote:
[...]
>  
> +static void
> +copy_vnet_hdr_from_desc(struct virtio_net_hdr *hdr,
> +		struct buf_vector *buf_vec)
> +{
> +	uint64_t len;
> +	uint64_t remain = sizeof(struct virtio_net_hdr);
> +	uint64_t src;
> +	uint64_t dst = (uint64_t)(uintptr_t)&hdr;

typo: s/&hdr/hdr/

> +
> +	/*
> +	 * No luck, the virtio-net header doesn't fit
> +	 * in a contiguous virtual area.
> +	 */
> +	while (remain) {
> +		len = RTE_MIN(remain, buf_vec->buf_len);
> +		src = buf_vec->buf_addr;
> +		rte_memcpy((void *)(uintptr_t)dst,
> +				(void *)(uintptr_t)src, len);
> +
> +		remain -= len;
> +		dst += len;
> +		buf_vec++;
> +	}
> +}
> +
>  static __rte_always_inline int
>  copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  		  struct buf_vector *buf_vec, uint16_t nr_vec,
> @@ -1094,28 +1126,7 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  
>  	if (virtio_net_with_host_offload(dev)) {
>  		if (unlikely(buf_len < sizeof(struct virtio_net_hdr))) {
> -			uint64_t len;
> -			uint64_t remain = sizeof(struct virtio_net_hdr);
> -			uint64_t src;
> -			uint64_t dst = (uint64_t)(uintptr_t)&tmp_hdr;
> -			uint16_t hdr_vec_idx = 0;
> -
> -			/*
> -			 * No luck, the virtio-net header doesn't fit
> -			 * in a contiguous virtual area.
> -			 */

It's better not to move the above comments.

For the rest,
Reviewed-by: Tiwei Bie <tiwei.bie@intel.com>

> -			while (remain) {
> -				len = RTE_MIN(remain,
> -					buf_vec[hdr_vec_idx].buf_len);
> -				src = buf_vec[hdr_vec_idx].buf_addr;
> -				rte_memcpy((void *)(uintptr_t)dst,
> -						   (void *)(uintptr_t)src, len);
> -
> -				remain -= len;
> -				dst += len;
> -				hdr_vec_idx++;
> -			}
> -
> +			copy_vnet_hdr_from_desc(&tmp_hdr, buf_vec);
>  			hdr = &tmp_hdr;
>  		} else {
>  			hdr = (struct virtio_net_hdr *)((uintptr_t)buf_addr);
> -- 
> 2.21.0
> 


* Re: [dpdk-dev] [PATCH v2 5/5] eal/x86: force inlining of all memcpy and mov helpers
  2019-05-17 15:06 ` [dpdk-dev] [PATCH v2 5/5] eal/x86: force inlining of all memcpy and mov helpers Maxime Coquelin
@ 2019-05-20  8:30   ` David Marchand
  0 siblings, 0 replies; 12+ messages in thread
From: David Marchand @ 2019-05-20  8:30 UTC (permalink / raw)
  To: Maxime Coquelin, Bruce Richardson
  Cc: dev, Tiwei Bie, Jens Freimann, Zhihong Wang, Ananyev, Konstantin

On Fri, May 17, 2019 at 5:14 PM Maxime Coquelin <maxime.coquelin@redhat.com>
wrote:

> Some helpers in the header file are force-inlined while
> others are only inlined; this patch forces inlining for all
> of them.
>
> This avoids them being emitted as functions when called
> multiple times in the same object file. For example, when
> packed ring support was added to the vhost-user library,
> rte_memcpy_generic was no longer inlined.
>

Weird that we have only some functions marked as always inlined in commit:
https://git.dpdk.org/dpdk/commit/?id=1c9467a6efd8d85b5bbbf7004a4407cae2d09431

Bruce, is there a reason for this?


-- 
David Marchand


* Re: [dpdk-dev] [PATCH v2 3/5] vhost: do not inline unlikely fragmented buffers code
  2019-05-20  5:51   ` Tiwei Bie
@ 2019-05-24 13:50     ` Maxime Coquelin
  0 siblings, 0 replies; 12+ messages in thread
From: Maxime Coquelin @ 2019-05-24 13:50 UTC (permalink / raw)
  To: Tiwei Bie
  Cc: dev, jfreimann, zhihong.wang, bruce.richardson,
	konstantin.ananyev, david.marchand



On 5/20/19 7:51 AM, Tiwei Bie wrote:
> On Fri, May 17, 2019 at 05:06:11PM +0200, Maxime Coquelin wrote:
> [...]
>>   
>> +static void
>> +copy_vnet_hdr_from_desc(struct virtio_net_hdr *hdr,
>> +		struct buf_vector *buf_vec)
>> +{
>> +	uint64_t len;
>> +	uint64_t remain = sizeof(struct virtio_net_hdr);
>> +	uint64_t src;
>> +	uint64_t dst = (uint64_t)(uintptr_t)&hdr;
> 
> typo: s/&hdr/hdr/

Nice catch! It wasn't spotted at build time due to the cast.

>> +
>> +	/*
>> +	 * No luck, the virtio-net header doesn't fit
>> +	 * in a contiguous virtual area.
>> +	 */
>> +	while (remain) {
>> +		len = RTE_MIN(remain, buf_vec->buf_len);
>> +		src = buf_vec->buf_addr;
>> +		rte_memcpy((void *)(uintptr_t)dst,
>> +				(void *)(uintptr_t)src, len);
>> +
>> +		remain -= len;
>> +		dst += len;
>> +		buf_vec++;
>> +	}
>> +}
>> +
>>   static __rte_always_inline int
>>   copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
>>   		  struct buf_vector *buf_vec, uint16_t nr_vec,
>> @@ -1094,28 +1126,7 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
>>   
>>   	if (virtio_net_with_host_offload(dev)) {
>>   		if (unlikely(buf_len < sizeof(struct virtio_net_hdr))) {
>> -			uint64_t len;
>> -			uint64_t remain = sizeof(struct virtio_net_hdr);
>> -			uint64_t src;
>> -			uint64_t dst = (uint64_t)(uintptr_t)&tmp_hdr;
>> -			uint16_t hdr_vec_idx = 0;
>> -
>> -			/*
>> -			 * No luck, the virtio-net header doesn't fit
>> -			 * in a contiguous virtual area.
>> -			 */
> 
> It's better to not move above comments.

Right, I will move it back here.

> 
> For the rest,
> Reviewed-by: Tiwei Bie <tiwei.bie@intel.com>


Thanks,
Maxime

> 
>> -			while (remain) {
>> -				len = RTE_MIN(remain,
>> -					buf_vec[hdr_vec_idx].buf_len);
>> -				src = buf_vec[hdr_vec_idx].buf_addr;
>> -				rte_memcpy((void *)(uintptr_t)dst,
>> -						   (void *)(uintptr_t)src, len);
>> -
>> -				remain -= len;
>> -				dst += len;
>> -				hdr_vec_idx++;
>> -			}
>> -
>> +			copy_vnet_hdr_from_desc(&tmp_hdr, buf_vec);
>>   			hdr = &tmp_hdr;
>>   		} else {
>>   			hdr = (struct virtio_net_hdr *)((uintptr_t)buf_addr);
>> -- 
>> 2.21.0
>>


* Re: [dpdk-dev] [PATCH v2 4/5] vhost: simplify descriptor's buffer prefetching
  2019-05-17 15:06 ` [dpdk-dev] [PATCH v2 4/5] vhost: simplify descriptor's buffer prefetching Maxime Coquelin
@ 2019-05-29  8:05   ` Tiwei Bie
  0 siblings, 0 replies; 12+ messages in thread
From: Tiwei Bie @ 2019-05-29  8:05 UTC (permalink / raw)
  To: Maxime Coquelin
  Cc: dev, jfreimann, zhihong.wang, bruce.richardson,
	konstantin.ananyev, david.marchand

On Fri, May 17, 2019 at 05:06:12PM +0200, Maxime Coquelin wrote:
> Now that we have a single function to map the descriptor
> buffers, let's prefetch them there, as it is the earliest
> place we can do it.
> 
> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> ---
>  lib/librte_vhost/virtio_net.c | 32 ++------------------------------
>  1 file changed, 2 insertions(+), 30 deletions(-)

Reviewed-by: Tiwei Bie <tiwei.bie@intel.com>

