* [dpdk-dev] [PATCH] vhost: optimize vhost memcpy
@ 2016-12-02  1:19 Zhihong Wang
  2016-12-02 10:30 ` Thomas Monjalon
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Zhihong Wang @ 2016-12-02  1:19 UTC (permalink / raw)
  To: dev; +Cc: yuanhan.liu, Zhihong Wang
This patch optimizes Vhost performance for large packets when the
Mergeable Rx buffer feature is enabled. It introduces a dedicated
memcpy function for vhost enqueue/dequeue to replace rte_memcpy.
The reason is that rte_memcpy is for general cases, it handles
unaligned copies and make store aligned, it even makes load aligned
for micro architectures like Ivy Bridge. However alignment handling
comes at a price: It introduces extra load/store instructions.
Vhost memcpy is rather special: The copy is aligned, and remote,
and there is header write along which is also remote. In this case
the memcpy instruction stream should be simplified, to reduce extra
load/store, therefore reduce the probability of load/store buffer
full caused pipeline stall, to let the actual memcpy instructions
be issued and let H/W prefetcher goes to work as early as possible.
Performance gain is visible when packet size:
 1. Larger than 512 bytes on AVX/SSE platforms like Ivy Bridge
 2. Larger than 256 bytes on AVX2 platforms like Haswell
 3. Larger than 512 bytes on AVX512 platforms like Skylake
Up to 20% gain can be achieved by this patch for PVP traffic. The
test can also be conducted without NIC, by using loopback traffic
between Vhost and Virtio. For example, increase TXONLY_DEF_PACKET_LEN
to the requested packet size in testpmd.h, rebuild and start testpmd
in both host and guest, then "start" on one side and "start tx_first 32"
on the other.
Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
 lib/librte_vhost/virtio_net.c | 72 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 69 insertions(+), 3 deletions(-)
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 595f67c..cd6f21a 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -50,6 +50,72 @@
 #define MAX_PKT_BURST 32
 #define VHOST_LOG_PAGE	4096
 
+/**
+ * This function is used to for vhost memcpy, to replace rte_memcpy.
+ * The reason is that rte_memcpy is for general cases, where vhost
+ * memcpy is a rather special case: The copy is aligned, and remote,
+ * and there is header write along which is also remote. In this case
+ * the memcpy instruction stream should be simplified to reduce extra
+ * load/store, therefore reduce the probability of load/store buffer
+ * full caused pipeline stall, to let the actual memcpy instructions
+ * be issued and let H/W prefetcher goes to work as early as possible.
+ */
+static inline void __attribute__((always_inline))
+vhost_memcpy(void *dst, const void *src, size_t n)
+{
+	/* Copy size <= 16 bytes */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dst = *(const uint8_t *)src;
+			src = (const uint8_t *)src + 1;
+			dst = (uint8_t *)dst + 1;
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dst = *(const uint16_t *)src;
+			src = (const uint16_t *)src + 1;
+			dst = (uint16_t *)dst + 1;
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dst = *(const uint32_t *)src;
+			src = (const uint32_t *)src + 1;
+			dst = (uint32_t *)dst + 1;
+		}
+		if (n & 0x08)
+			*(uint64_t *)dst = *(const uint64_t *)src;
+
+		return;
+	}
+
+	/* Copy 16 <= size <= 32 bytes */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+
+		return;
+	}
+
+	/* Copy 32 < size <= 64 bytes */
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov32((uint8_t *)dst - 32 + n,
+				(const uint8_t *)src - 32 + n);
+
+		return;
+	}
+
+	/* Copy 64 bytes blocks */
+	for (; n >= 64; n -= 64) {
+		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+		dst = (uint8_t *)dst + 64;
+		src = (const uint8_t *)src + 64;
+	}
+
+	/* Copy whatever left */
+	rte_mov64((uint8_t *)dst - 64 + n,
+			(const uint8_t *)src - 64 + n);
+}
+
 static inline void __attribute__((always_inline))
 vhost_log_page(uint8_t *log_base, uint64_t page)
 {
@@ -246,7 +312,7 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vring_desc *descs,
 		}
 
 		cpy_len = RTE_MIN(desc_avail, mbuf_avail);
-		rte_memcpy((void *)((uintptr_t)(desc_addr + desc_offset)),
+		vhost_memcpy((void *)((uintptr_t)(desc_addr + desc_offset)),
 			rte_pktmbuf_mtod_offset(m, void *, mbuf_offset),
 			cpy_len);
 		vhost_log_write(dev, desc->addr + desc_offset, cpy_len);
@@ -522,7 +588,7 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct rte_mbuf *m,
 		}
 
 		cpy_len = RTE_MIN(desc_avail, mbuf_avail);
-		rte_memcpy((void *)((uintptr_t)(desc_addr + desc_offset)),
+		vhost_memcpy((void *)((uintptr_t)(desc_addr + desc_offset)),
 			rte_pktmbuf_mtod_offset(m, void *, mbuf_offset),
 			cpy_len);
 		vhost_log_write(dev, buf_vec[vec_idx].buf_addr + desc_offset,
@@ -856,7 +922,7 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vring_desc *descs,
 			 */
 			mbuf_avail = cpy_len;
 		} else {
-			rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *,
+			vhost_memcpy(rte_pktmbuf_mtod_offset(cur, void *,
 							   mbuf_offset),
 				(void *)((uintptr_t)(desc_addr + desc_offset)),
 				cpy_len);
-- 
2.7.4
^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [dpdk-dev] [PATCH] vhost: optimize vhost memcpy
  2016-12-02  1:19 [dpdk-dev] [PATCH] vhost: optimize vhost memcpy Zhihong Wang
@ 2016-12-02 10:30 ` Thomas Monjalon
  2016-12-05  8:27 ` Yuanhan Liu
  2016-12-07  1:31 ` [dpdk-dev] [PATCH v2] eal: optimize aligned rte_memcpy Zhihong Wang
  2 siblings, 0 replies; 10+ messages in thread
From: Thomas Monjalon @ 2016-12-02 10:30 UTC (permalink / raw)
  To: Zhihong Wang; +Cc: dev, yuanhan.liu
2016-12-01 20:19, Zhihong Wang:
> Up to 20% gain can be achieved by this patch for PVP traffic.
Really nice!
^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [dpdk-dev] [PATCH] vhost: optimize vhost memcpy
  2016-12-02  1:19 [dpdk-dev] [PATCH] vhost: optimize vhost memcpy Zhihong Wang
  2016-12-02 10:30 ` Thomas Monjalon
@ 2016-12-05  8:27 ` Yuanhan Liu
  2016-12-05 10:27   ` Wang, Zhihong
  2016-12-07  1:31 ` [dpdk-dev] [PATCH v2] eal: optimize aligned rte_memcpy Zhihong Wang
  2 siblings, 1 reply; 10+ messages in thread
From: Yuanhan Liu @ 2016-12-05  8:27 UTC (permalink / raw)
  To: Zhihong Wang; +Cc: dev
On Thu, Dec 01, 2016 at 08:19:42PM -0500, Zhihong Wang wrote:
> This patch optimizes Vhost performance for large packets when the
> Mergeable Rx buffer feature is enabled. It introduces a dedicated
> memcpy function for vhost enqueue/dequeue to replace rte_memcpy.
> 
> The reason is that rte_memcpy is for general cases, it handles
> unaligned copies and make store aligned, it even makes load aligned
> for micro architectures like Ivy Bridge. However alignment handling
> comes at a price: It introduces extra load/store instructions.
> 
> Vhost memcpy is rather special: The copy is aligned, and remote,
> and there is header write along which is also remote. In this case
> the memcpy instruction stream should be simplified, to reduce extra
> load/store, therefore reduce the probability of load/store buffer
> full caused pipeline stall, to let the actual memcpy instructions
> be issued and let H/W prefetcher goes to work as early as possible.
...
>  
> +/**
> + * This function is used to for vhost memcpy, to replace rte_memcpy.
> + * The reason is that rte_memcpy is for general cases, where vhost
> + * memcpy is a rather special case: The copy is aligned, and remote,
> + * and there is header write along which is also remote. In this case
> + * the memcpy instruction stream should be simplified to reduce extra
> + * load/store, therefore reduce the probability of load/store buffer
> + * full caused pipeline stall, to let the actual memcpy instructions
> + * be issued and let H/W prefetcher goes to work as early as possible.
> + */
> +static inline void __attribute__((always_inline))
> +vhost_memcpy(void *dst, const void *src, size_t n)
I like this function a lot, since it's really simple and straightforward!
Moreover, it performs better.
But, I don't quite like how this function is proposed:
- rte_movX are more like internal help functions that should be used only
  in corresponding rte_memcpy.h file.
- It's a good optimization, however, it will not benefit for other use
  cases, though vhost is the most typical case here.
- The optimization proves to be good for X86, but think there is no
  guarantee it may behave well for other platforms, say ARM.
I still would suggest you to go this way: move this function into x86's
rte_memcpy.h and call it when the data is well aligned.
	--yliu
> +{
> +	/* Copy size <= 16 bytes */
> +	if (n < 16) {
> +		if (n & 0x01) {
> +			*(uint8_t *)dst = *(const uint8_t *)src;
> +			src = (const uint8_t *)src + 1;
> +			dst = (uint8_t *)dst + 1;
> +		}
> +		if (n & 0x02) {
> +			*(uint16_t *)dst = *(const uint16_t *)src;
> +			src = (const uint16_t *)src + 1;
> +			dst = (uint16_t *)dst + 1;
> +		}
> +		if (n & 0x04) {
> +			*(uint32_t *)dst = *(const uint32_t *)src;
> +			src = (const uint32_t *)src + 1;
> +			dst = (uint32_t *)dst + 1;
> +		}
> +		if (n & 0x08)
> +			*(uint64_t *)dst = *(const uint64_t *)src;
> +
> +		return;
> +	}
> +
> +	/* Copy 16 <= size <= 32 bytes */
> +	if (n <= 32) {
> +		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> +		rte_mov16((uint8_t *)dst - 16 + n,
> +				(const uint8_t *)src - 16 + n);
> +
> +		return;
> +	}
> +
> +	/* Copy 32 < size <= 64 bytes */
> +	if (n <= 64) {
> +		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> +		rte_mov32((uint8_t *)dst - 32 + n,
> +				(const uint8_t *)src - 32 + n);
> +
> +		return;
> +	}
> +
> +	/* Copy 64 bytes blocks */
> +	for (; n >= 64; n -= 64) {
> +		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> +		dst = (uint8_t *)dst + 64;
> +		src = (const uint8_t *)src + 64;
> +	}
> +
> +	/* Copy whatever left */
> +	rte_mov64((uint8_t *)dst - 64 + n,
> +			(const uint8_t *)src - 64 + n);
> +}
^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [dpdk-dev] [PATCH] vhost: optimize vhost memcpy
  2016-12-05  8:27 ` Yuanhan Liu
@ 2016-12-05 10:27   ` Wang, Zhihong
  2016-12-05 10:37     ` Yuanhan Liu
  0 siblings, 1 reply; 10+ messages in thread
From: Wang, Zhihong @ 2016-12-05 10:27 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Thomas Monjalon
> I like this function a lot, since it's really simple and straightforward!
> Moreover, it performs better.
> 
> But, I don't quite like how this function is proposed:
> 
> - rte_movX are more like internal help functions that should be used only
>   in corresponding rte_memcpy.h file.
> 
> - It's a good optimization, however, it will not benefit for other use
>   cases, though vhost is the most typical case here.
> 
> - The optimization proves to be good for X86, but think there is no
>   guarantee it may behave well for other platforms, say ARM.
> 
> I still would suggest you to go this way: move this function into x86's
> rte_memcpy.h and call it when the data is well aligned.
Do you mean to add something like rte_memcpy_aligned() in 
lib/librte_eal/common/include/generic/rte_memcpy.h?
I thought of this way before, and didn't choose it because it requires
changes in eal. But it would be a clean solution, I'd certainly like
to implement it this way if people are okay with it.
Thanks
Zhihong
> 
> 	--yliu
^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [dpdk-dev] [PATCH] vhost: optimize vhost memcpy
  2016-12-05 10:27   ` Wang, Zhihong
@ 2016-12-05 10:37     ` Yuanhan Liu
  2016-12-07  6:11       ` Wang, Zhihong
  0 siblings, 1 reply; 10+ messages in thread
From: Yuanhan Liu @ 2016-12-05 10:37 UTC (permalink / raw)
  To: Wang, Zhihong; +Cc: dev, Thomas Monjalon
On Mon, Dec 05, 2016 at 10:27:00AM +0000, Wang, Zhihong wrote:
> > I like this function a lot, since it's really simple and straightforward!
> > Moreover, it performs better.
> > 
> > But, I don't quite like how this function is proposed:
> > 
> > - rte_movX are more like internal help functions that should be used only
> >   in corresponding rte_memcpy.h file.
> > 
> > - It's a good optimization, however, it will not benefit for other use
> >   cases, though vhost is the most typical case here.
> > 
> > - The optimization proves to be good for X86, but think there is no
> >   guarantee it may behave well for other platforms, say ARM.
> > 
> > I still would suggest you to go this way: move this function into x86's
> > rte_memcpy.h and call it when the data is well aligned.
> 
> 
> Do you mean to add something like rte_memcpy_aligned() in 
> lib/librte_eal/common/include/generic/rte_memcpy.h?
Yes, but this one is not supposed to be exported as a public API.
It should be called inside rte_memcpy (when data is well aligned).
In this way, only rte_memcpy is exposed, and nothing else should
be changed.
	--yliu
> 
> I thought of this way before, and didn't choose it because it requires
> changes in eal. But it would be a clean solution, I'd certainly like
> to implement it this way if people are okay with it.
> 
> 
> Thanks
> Zhihong
> 
> 
> > 
> > 	--yliu
^ permalink raw reply	[flat|nested] 10+ messages in thread
* [dpdk-dev] [PATCH v2] eal: optimize aligned rte_memcpy
  2016-12-02  1:19 [dpdk-dev] [PATCH] vhost: optimize vhost memcpy Zhihong Wang
  2016-12-02 10:30 ` Thomas Monjalon
  2016-12-05  8:27 ` Yuanhan Liu
@ 2016-12-07  1:31 ` Zhihong Wang
  2016-12-08  0:55   ` Yao, Lei A
  2016-12-08  2:18   ` Yuanhan Liu
  2 siblings, 2 replies; 10+ messages in thread
From: Zhihong Wang @ 2016-12-07  1:31 UTC (permalink / raw)
  To: dev; +Cc: yuanhan.liu, thomas.monjalon, lei.a.yao, Zhihong Wang
This patch optimizes rte_memcpy for well aligned cases, where both
dst and src addr are aligned to maximum MOV width. It introduces a
dedicated function called rte_memcpy_aligned to handle the aligned
cases with simplified instruction stream. The existing rte_memcpy
is renamed as rte_memcpy_generic. The selection between them 2 is
done at the entry of rte_memcpy.
The existing rte_memcpy is for generic cases, it handles unaligned
copies and make store aligned, it even makes load aligned for micro
architectures like Ivy Bridge. However alignment handling comes at
a price: It adds extra load/store instructions, which can cause
complications sometime.
DPDK Vhost memcpy with Mergeable Rx Buffer feature as an example:
The copy is aligned, and remote, and there is header write along
which is also remote. In this case the memcpy instruction stream
should be simplified, to reduce extra load/store, therefore reduce
the probability of load/store buffer full caused pipeline stall, to
let the actual memcpy instructions be issued and let H/W prefetcher
goes to work as early as possible.
This patch is tested on Ivy Bridge, Haswell and Skylake, it provides
up to 20% gain for Virtio Vhost PVP traffic, with packet size ranging
from 64 to 1500 bytes.
The test can also be conducted without NIC, by setting loopback
traffic between Virtio and Vhost. For example, modify the macro
TXONLY_DEF_PACKET_LEN to the requested packet size in testpmd.h,
rebuild and start testpmd in both host and guest, then "start" on
one side and "start tx_first 32" on the other.
Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
 .../common/include/arch/x86/rte_memcpy.h           | 81 +++++++++++++++++++++-
 1 file changed, 78 insertions(+), 3 deletions(-)
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index b3bfc23..b9785e8 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -69,6 +69,8 @@ rte_memcpy(void *dst, const void *src, size_t n) __attribute__((always_inline));
 
 #ifdef RTE_MACHINE_CPUFLAG_AVX512F
 
+#define ALIGNMENT_MASK 0x3F
+
 /**
  * AVX512 implementation below
  */
@@ -189,7 +191,7 @@ rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
 }
 
 static inline void *
-rte_memcpy(void *dst, const void *src, size_t n)
+rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
 	uintptr_t dstu = (uintptr_t)dst;
 	uintptr_t srcu = (uintptr_t)src;
@@ -308,6 +310,8 @@ COPY_BLOCK_128_BACK63:
 
 #elif defined RTE_MACHINE_CPUFLAG_AVX2
 
+#define ALIGNMENT_MASK 0x1F
+
 /**
  * AVX2 implementation below
  */
@@ -387,7 +391,7 @@ rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
 }
 
 static inline void *
-rte_memcpy(void *dst, const void *src, size_t n)
+rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
 	uintptr_t dstu = (uintptr_t)dst;
 	uintptr_t srcu = (uintptr_t)src;
@@ -499,6 +503,8 @@ COPY_BLOCK_128_BACK31:
 
 #else /* RTE_MACHINE_CPUFLAG */
 
+#define ALIGNMENT_MASK 0x0F
+
 /**
  * SSE & AVX implementation below
  */
@@ -677,7 +683,7 @@ __extension__ ({                                                      \
 })
 
 static inline void *
-rte_memcpy(void *dst, const void *src, size_t n)
+rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
 	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
 	uintptr_t dstu = (uintptr_t)dst;
@@ -821,6 +827,75 @@ COPY_BLOCK_64_BACK15:
 
 #endif /* RTE_MACHINE_CPUFLAG */
 
+static inline void *
+rte_memcpy_aligned(void *dst, const void *src, size_t n)
+{
+	void *ret = dst;
+
+	/* Copy size <= 16 bytes */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dst = *(const uint8_t *)src;
+			src = (const uint8_t *)src + 1;
+			dst = (uint8_t *)dst + 1;
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dst = *(const uint16_t *)src;
+			src = (const uint16_t *)src + 1;
+			dst = (uint16_t *)dst + 1;
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dst = *(const uint32_t *)src;
+			src = (const uint32_t *)src + 1;
+			dst = (uint32_t *)dst + 1;
+		}
+		if (n & 0x08)
+			*(uint64_t *)dst = *(const uint64_t *)src;
+
+		return ret;
+	}
+
+	/* Copy 16 <= size <= 32 bytes */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+
+		return ret;
+	}
+
+	/* Copy 32 < size <= 64 bytes */
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov32((uint8_t *)dst - 32 + n,
+				(const uint8_t *)src - 32 + n);
+
+		return ret;
+	}
+
+	/* Copy 64 bytes blocks */
+	for (; n >= 64; n -= 64) {
+		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+		dst = (uint8_t *)dst + 64;
+		src = (const uint8_t *)src + 64;
+	}
+
+	/* Copy whatever left */
+	rte_mov64((uint8_t *)dst - 64 + n,
+			(const uint8_t *)src - 64 + n);
+
+	return ret;
+}
+
+static inline void *
+rte_memcpy(void *dst, const void *src, size_t n)
+{
+	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
+		return rte_memcpy_aligned(dst, src, n);
+	else
+		return rte_memcpy_generic(dst, src, n);
+}
+
 #ifdef __cplusplus
 }
 #endif
-- 
2.7.4
^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [dpdk-dev] [PATCH] vhost: optimize vhost memcpy
  2016-12-05 10:37     ` Yuanhan Liu
@ 2016-12-07  6:11       ` Wang, Zhihong
  0 siblings, 0 replies; 10+ messages in thread
From: Wang, Zhihong @ 2016-12-07  6:11 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Thomas Monjalon
> -----Original Message-----
> From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com]
> Sent: Monday, December 5, 2016 6:37 PM
> To: Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org; Thomas Monjalon <thomas.monjalon@6wind.com>
> Subject: Re: [PATCH] vhost: optimize vhost memcpy
> 
> On Mon, Dec 05, 2016 at 10:27:00AM +0000, Wang, Zhihong wrote:
> > > I like this function a lot, since it's really simple and straightforward!
> > > Moreover, it performs better.
> > >
> > > But, I don't quite like how this function is proposed:
> > >
> > > - rte_movX are more like internal help functions that should be used only
> > >   in corresponding rte_memcpy.h file.
> > >
> > > - It's a good optimization, however, it will not benefit for other use
> > >   cases, though vhost is the most typical case here.
> > >
> > > - The optimization proves to be good for X86, but think there is no
> > >   guarantee it may behave well for other platforms, say ARM.
> > >
> > > I still would suggest you to go this way: move this function into x86's
> > > rte_memcpy.h and call it when the data is well aligned.
> >
> >
> > Do you mean to add something like rte_memcpy_aligned() in
> > lib/librte_eal/common/include/generic/rte_memcpy.h?
> 
> Yes, but this one is not supposed to be exported as a public API.
> It should be called inside rte_memcpy (when data is well aligned).
> In this way, only rte_memcpy is exposed, and nothing else should
> be changed.
Yes I agree this is a better way to introduce this patch, I'll send out v2.
> 
> 	--yliu
> >
> > I thought of this way before, and didn't choose it because it requires
> > changes in eal. But it would be a clean solution, I'd certainly like
> > to implement it this way if people are okay with it.
> >
> >
> > Thanks
> > Zhihong
> >
> >
> > >
> > > 	--yliu
^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [dpdk-dev] [PATCH v2] eal: optimize aligned rte_memcpy
  2016-12-07  1:31 ` [dpdk-dev] [PATCH v2] eal: optimize aligned rte_memcpy Zhihong Wang
@ 2016-12-08  0:55   ` Yao, Lei A
  2016-12-08  2:18   ` Yuanhan Liu
  1 sibling, 0 replies; 10+ messages in thread
From: Yao, Lei A @ 2016-12-08  0:55 UTC (permalink / raw)
  To: Wang, Zhihong, dev; +Cc: yuanhan.liu, thomas.monjalon
Tested-by: Lei Yao <lei.a.yao@intel.com>
- Apply patch to v16.11
I have tested the loopback performance for this patch on 3 following settings:
CPU: IVB
Ubutnu16.04
Kernal:  4.4.0
gcc : 5.4.0
CPU: HSW
Fedora 21
Kernal: 4.1.13
gcc: 4.9.2
CPU:BDW
Ubutnu16.04
Kernal:  4.4.0
gcc : 5.4.0
I can see 10%~20% performance gain for different packet size on mergeable path. Only on IVB + gcc5.4.0, slight performance drop(~4%) on vector path for packet size 128 ,260. It's may related to gcc version as this performance drop not see with gcc 6+.
-----Original Message-----
From: Wang, Zhihong 
Sent: Wednesday, December 7, 2016 9:31 AM
To: dev@dpdk.org
Cc: yuanhan.liu@linux.intel.com; thomas.monjalon@6wind.com; Yao, Lei A <lei.a.yao@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>
Subject: [PATCH v2] eal: optimize aligned rte_memcpy
This patch optimizes rte_memcpy for well aligned cases, where both
dst and src addr are aligned to maximum MOV width. It introduces a
dedicated function called rte_memcpy_aligned to handle the aligned
cases with simplified instruction stream. The existing rte_memcpy
is renamed as rte_memcpy_generic. The selection between them 2 is
done at the entry of rte_memcpy.
The existing rte_memcpy is for generic cases, it handles unaligned
copies and make store aligned, it even makes load aligned for micro
architectures like Ivy Bridge. However alignment handling comes at
a price: It adds extra load/store instructions, which can cause
complications sometime.
DPDK Vhost memcpy with Mergeable Rx Buffer feature as an example:
The copy is aligned, and remote, and there is header write along
which is also remote. In this case the memcpy instruction stream
should be simplified, to reduce extra load/store, therefore reduce
the probability of load/store buffer full caused pipeline stall, to
let the actual memcpy instructions be issued and let H/W prefetcher
goes to work as early as possible.
This patch is tested on Ivy Bridge, Haswell and Skylake, it provides
up to 20% gain for Virtio Vhost PVP traffic, with packet size ranging
from 64 to 1500 bytes.
The test can also be conducted without NIC, by setting loopback
traffic between Virtio and Vhost. For example, modify the macro
TXONLY_DEF_PACKET_LEN to the requested packet size in testpmd.h,
rebuild and start testpmd in both host and guest, then "start" on
one side and "start tx_first 32" on the other.
Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
 .../common/include/arch/x86/rte_memcpy.h           | 81 +++++++++++++++++++++-
 1 file changed, 78 insertions(+), 3 deletions(-)
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index b3bfc23..b9785e8 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -69,6 +69,8 @@ rte_memcpy(void *dst, const void *src, size_t n) __attribute__((always_inline));
 
 #ifdef RTE_MACHINE_CPUFLAG_AVX512F
 
+#define ALIGNMENT_MASK 0x3F
+
 /**
  * AVX512 implementation below
  */
@@ -189,7 +191,7 @@ rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
 }
 
 static inline void *
-rte_memcpy(void *dst, const void *src, size_t n)
+rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
 	uintptr_t dstu = (uintptr_t)dst;
 	uintptr_t srcu = (uintptr_t)src;
@@ -308,6 +310,8 @@ COPY_BLOCK_128_BACK63:
 
 #elif defined RTE_MACHINE_CPUFLAG_AVX2
 
+#define ALIGNMENT_MASK 0x1F
+
 /**
  * AVX2 implementation below
  */
@@ -387,7 +391,7 @@ rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
 }
 
 static inline void *
-rte_memcpy(void *dst, const void *src, size_t n)
+rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
 	uintptr_t dstu = (uintptr_t)dst;
 	uintptr_t srcu = (uintptr_t)src;
@@ -499,6 +503,8 @@ COPY_BLOCK_128_BACK31:
 
 #else /* RTE_MACHINE_CPUFLAG */
 
+#define ALIGNMENT_MASK 0x0F
+
 /**
  * SSE & AVX implementation below
  */
@@ -677,7 +683,7 @@ __extension__ ({                                                      \
 })
 
 static inline void *
-rte_memcpy(void *dst, const void *src, size_t n)
+rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
 	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
 	uintptr_t dstu = (uintptr_t)dst;
@@ -821,6 +827,75 @@ COPY_BLOCK_64_BACK15:
 
 #endif /* RTE_MACHINE_CPUFLAG */
 
+static inline void *
+rte_memcpy_aligned(void *dst, const void *src, size_t n)
+{
+	void *ret = dst;
+
+	/* Copy size <= 16 bytes */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dst = *(const uint8_t *)src;
+			src = (const uint8_t *)src + 1;
+			dst = (uint8_t *)dst + 1;
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dst = *(const uint16_t *)src;
+			src = (const uint16_t *)src + 1;
+			dst = (uint16_t *)dst + 1;
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dst = *(const uint32_t *)src;
+			src = (const uint32_t *)src + 1;
+			dst = (uint32_t *)dst + 1;
+		}
+		if (n & 0x08)
+			*(uint64_t *)dst = *(const uint64_t *)src;
+
+		return ret;
+	}
+
+	/* Copy 16 <= size <= 32 bytes */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+
+		return ret;
+	}
+
+	/* Copy 32 < size <= 64 bytes */
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov32((uint8_t *)dst - 32 + n,
+				(const uint8_t *)src - 32 + n);
+
+		return ret;
+	}
+
+	/* Copy 64 bytes blocks */
+	for (; n >= 64; n -= 64) {
+		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+		dst = (uint8_t *)dst + 64;
+		src = (const uint8_t *)src + 64;
+	}
+
+	/* Copy whatever left */
+	rte_mov64((uint8_t *)dst - 64 + n,
+			(const uint8_t *)src - 64 + n);
+
+	return ret;
+}
+
+static inline void *
+rte_memcpy(void *dst, const void *src, size_t n)
+{
+	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
+		return rte_memcpy_aligned(dst, src, n);
+	else
+		return rte_memcpy_generic(dst, src, n);
+}
+
 #ifdef __cplusplus
 }
 #endif
-- 
2.7.4
^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [dpdk-dev] [PATCH v2] eal: optimize aligned rte_memcpy
  2016-12-07  1:31 ` [dpdk-dev] [PATCH v2] eal: optimize aligned rte_memcpy Zhihong Wang
  2016-12-08  0:55   ` Yao, Lei A
@ 2016-12-08  2:18   ` Yuanhan Liu
  2017-01-17 15:08     ` Thomas Monjalon
  1 sibling, 1 reply; 10+ messages in thread
From: Yuanhan Liu @ 2016-12-08  2:18 UTC (permalink / raw)
  To: Zhihong Wang; +Cc: dev, thomas.monjalon, lei.a.yao
On Tue, Dec 06, 2016 at 08:31:06PM -0500, Zhihong Wang wrote:
> This patch optimizes rte_memcpy for well aligned cases, where both
> dst and src addr are aligned to maximum MOV width. It introduces a
> dedicated function called rte_memcpy_aligned to handle the aligned
> cases with simplified instruction stream. The existing rte_memcpy
> is renamed as rte_memcpy_generic. The selection between them 2 is
> done at the entry of rte_memcpy.
> 
> The existing rte_memcpy is for generic cases, it handles unaligned
> copies and make store aligned, it even makes load aligned for micro
> architectures like Ivy Bridge. However alignment handling comes at
> a price: It adds extra load/store instructions, which can cause
> complications sometime.
> 
> DPDK Vhost memcpy with Mergeable Rx Buffer feature as an example:
> The copy is aligned, and remote, and there is header write along
> which is also remote. In this case the memcpy instruction stream
> should be simplified, to reduce extra load/store, therefore reduce
> the probability of load/store buffer full caused pipeline stall, to
> let the actual memcpy instructions be issued and let H/W prefetcher
> goes to work as early as possible.
> 
> This patch is tested on Ivy Bridge, Haswell and Skylake, it provides
> up to 20% gain for Virtio Vhost PVP traffic, with packet size ranging
> from 64 to 1500 bytes.
> 
> The test can also be conducted without NIC, by setting loopback
> traffic between Virtio and Vhost. For example, modify the macro
> TXONLY_DEF_PACKET_LEN to the requested packet size in testpmd.h,
> rebuild and start testpmd in both host and guest, then "start" on
> one side and "start tx_first 32" on the other.
> 
> 
> Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
Reviewed-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
	--yliu
^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [dpdk-dev] [PATCH v2] eal: optimize aligned rte_memcpy
  2016-12-08  2:18   ` Yuanhan Liu
@ 2017-01-17 15:08     ` Thomas Monjalon
  0 siblings, 0 replies; 10+ messages in thread
From: Thomas Monjalon @ 2017-01-17 15:08 UTC (permalink / raw)
  To: Zhihong Wang; +Cc: Yuanhan Liu, dev, lei.a.yao
2016-12-08 10:18, Yuanhan Liu:
> On Tue, Dec 06, 2016 at 08:31:06PM -0500, Zhihong Wang wrote:
> > This patch optimizes rte_memcpy for well aligned cases, where both
> > dst and src addr are aligned to maximum MOV width. It introduces a
> > dedicated function called rte_memcpy_aligned to handle the aligned
> > cases with simplified instruction stream. The existing rte_memcpy
> > is renamed as rte_memcpy_generic. The selection between them 2 is
> > done at the entry of rte_memcpy.
> > 
> > The existing rte_memcpy is for generic cases, it handles unaligned
> > copies and make store aligned, it even makes load aligned for micro
> > architectures like Ivy Bridge. However alignment handling comes at
> > a price: It adds extra load/store instructions, which can cause
> > complications sometime.
> > 
> > DPDK Vhost memcpy with Mergeable Rx Buffer feature as an example:
> > The copy is aligned, and remote, and there is header write along
> > which is also remote. In this case the memcpy instruction stream
> > should be simplified, to reduce extra load/store, therefore reduce
> > the probability of load/store buffer full caused pipeline stall, to
> > let the actual memcpy instructions be issued and let H/W prefetcher
> > goes to work as early as possible.
> > 
> > This patch is tested on Ivy Bridge, Haswell and Skylake, it provides
> > up to 20% gain for Virtio Vhost PVP traffic, with packet size ranging
> > from 64 to 1500 bytes.
> > 
> > The test can also be conducted without NIC, by setting loopback
> > traffic between Virtio and Vhost. For example, modify the macro
> > TXONLY_DEF_PACKET_LEN to the requested packet size in testpmd.h,
> > rebuild and start testpmd in both host and guest, then "start" on
> > one side and "start tx_first 32" on the other.
> > 
> > 
> > Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
> 
> Reviewed-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Applied, thanks
^ permalink raw reply	[flat|nested] 10+ messages in thread
end of thread, other threads:[~2017-01-17 15:08 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-02  1:19 [dpdk-dev] [PATCH] vhost: optimize vhost memcpy Zhihong Wang
2016-12-02 10:30 ` Thomas Monjalon
2016-12-05  8:27 ` Yuanhan Liu
2016-12-05 10:27   ` Wang, Zhihong
2016-12-05 10:37     ` Yuanhan Liu
2016-12-07  6:11       ` Wang, Zhihong
2016-12-07  1:31 ` [dpdk-dev] [PATCH v2] eal: optimize aligned rte_memcpy Zhihong Wang
2016-12-08  0:55   ` Yao, Lei A
2016-12-08  2:18   ` Yuanhan Liu
2017-01-17 15:08     ` Thomas Monjalon
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).