From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <qian.q.xu@intel.com>
Received: from mga04.intel.com (mga04.intel.com [192.55.52.120])
 by dpdk.org (Postfix) with ESMTP id 2A9802BA7
 for <dev@dpdk.org>; Thu, 26 May 2016 07:19:21 +0200 (CEST)
Received: from orsmga003.jf.intel.com ([10.7.209.27])
 by fmsmga104.fm.intel.com with ESMTP; 25 May 2016 22:19:20 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.26,366,1459839600"; d="scan'208";a="815057282"
Received: from fmsmsx105.amr.corp.intel.com ([10.18.124.203])
 by orsmga003.jf.intel.com with ESMTP; 25 May 2016 22:19:20 -0700
Received: from fmsmsx154.amr.corp.intel.com (10.18.116.70) by
 FMSMSX105.amr.corp.intel.com (10.18.124.203) with Microsoft SMTP Server (TLS)
 id 14.3.248.2; Wed, 25 May 2016 22:19:19 -0700
Received: from shsmsx103.ccr.corp.intel.com (10.239.4.69) by
 FMSMSX154.amr.corp.intel.com (10.18.116.70) with Microsoft SMTP Server (TLS)
 id 14.3.248.2; Wed, 25 May 2016 22:19:18 -0700
Received: from shsmsx102.ccr.corp.intel.com ([169.254.2.104]) by
 SHSMSX103.ccr.corp.intel.com ([169.254.4.58]) with mapi id 14.03.0248.002;
 Thu, 26 May 2016 13:19:17 +0800
From: "Xu, Qian Q" <qian.q.xu@intel.com>
To: "Wang, Zhihong" <zhihong.wang@intel.com>, "dev@dpdk.org" <dev@dpdk.org>
CC: "Wang, Zhihong" <zhihong.wang@intel.com>
Thread-Topic: [dpdk-dev] [PATCH] eal: fix rte_memcpy perf in hsw/bdw
Thread-Index: AQHRtl+SH8kLLkaTc0+kRXOs7U07g5/KrXOw
Date: Thu, 26 May 2016 05:19:16 +0000
Message-ID: <82F45D86ADE5454A95A89742C8D1410E0328B4EE@shsmsx102.ccr.corp.intel.com>
References: <1464139383-132732-1-git-send-email-zhihong.wang@intel.com>
In-Reply-To: <1464139383-132732-1-git-send-email-zhihong.wang@intel.com>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-ctpclassification: CTP_IC
x-originating-ip: [10.239.127.40]
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Subject: Re: [dpdk-dev] [PATCH] eal: fix rte_memcpy perf in hsw/bdw
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: patches and discussions about DPDK <dev.dpdk.org>
List-Unsubscribe: <http://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <http://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Thu, 26 May 2016 05:19:21 -0000

Tested-by: Qian Xu <qian.q.xu@intel.com>

- Test Commit: 8f6f24342281f59de0df7bd976a32f714d39b9a9
- OS/Kernel: Fedora 21/4.1.13
- GCC: gcc (GCC) 4.9.2 20141101 (Red Hat 4.9.2-1)
- CPU: Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
- Total: 1 case, 1 passed, 0 failed.

1. The test scenario is Vhost-Virtio loopback without a NIC.
2. Update the packet size in the testpmd config file from 64 to 1518.
3. Run the vhost PMD in testpmd:
rm -rf vhost-net*
./x86_64-native-linuxapp-gcc/app/testpmd -c 0xc0000 -n 4 --socket-mem 1024,1024 --vdev 'eth_vhost0,iface=vhost-net,queues=1' -- -i --nb-cores=1
>start

4. Launch VM1 with one virtio device:
taskset -c 20-21 \
/root/qemu-versions/qemu-2.5.0/x86_64-softmmu/qemu-system-x86_64 -name vm1 \
-cpu host -enable-kvm -m 2048 -object memory-backend-file,id=mem,size=2048M,mem-path=/mnt/huge,share=on -numa node,memdev=mem -mem-prealloc \
-smp cores=2,sockets=1 -drive file=/home/img/vm1.img \
-chardev socket,id=char0,path=./vhost-net \
-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce,queues=1 \
-device virtio-net-pci,mac=52:54:00:00:00:01,netdev=mynet1,mrg_rxbuf=off,mq=on \
-netdev tap,id=ipvm1,ifname=tap3,script=/etc/qemu-ifup -device rtl8139,netdev=ipvm1,id=net0,mac=00:00:00:00:10:01 \
-vnc :2 -daemonize

5. In the VM, run testpmd with the virtio device:
./testpmd -c 0x3 -n 4 -- -i
>start
>show port stats all

6. Check the best vhost RX/TX rate and record it as #1.

7. Apply this patch, rerun the case, and record the rate as #2.

8. Comparing #1 with #2 shows a ~5% performance increase on the BDW-EP CPU server.
Thanks
Qian

-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhihong Wang
Sent: Wednesday, May 25, 2016 9:23 AM
To: dev@dpdk.org
Cc: Wang, Zhihong
Subject: [dpdk-dev] [PATCH] eal: fix rte_memcpy perf in hsw/bdw

This patch fixes rte_memcpy performance in Haswell and Broadwell for vhost when the copy size is larger than 256 bytes.

It is observed that for large copies like 1024/1518-byte ones, rte_memcpy suffers a high rate of store-buffer-full events, which stalls the pipeline in scenarios like vhost enqueue. This can be alleviated by adjusting the instruction layout. Note that this issue may not be visible in a micro benchmark.

How to reproduce?

PHY-VM-PHY using vhost/virtio, or a vhost/virtio loopback, with large packets like 1024/1518-byte ones. Make sure the packet generation rate is not the bottleneck if PHY-VM-PHY is used.

Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
 .../common/include/arch/x86/rte_memcpy.h           | 116 ++++++---------------
 1 file changed, 30 insertions(+), 86 deletions(-)

diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index f463ab3..413035e 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -363,71 +363,26 @@ rte_mov128(uint8_t *dst, const uint8_t *src)
 }
 
 /**
- * Copy 256 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov256(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
-	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
-	rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
-	rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
-	rte_mov32((uint8_t *)dst + 4 * 32, (const uint8_t *)src + 4 * 32);
-	rte_mov32((uint8_t *)dst + 5 * 32, (const uint8_t *)src + 5 * 32);
-	rte_mov32((uint8_t *)dst + 6 * 32, (const uint8_t *)src + 6 * 32);
-	rte_mov32((uint8_t *)dst + 7 * 32, (const uint8_t *)src + 7 * 32);
-}
-
-/**
- * Copy 64-byte blocks from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov64blocks(uint8_t *dst, const uint8_t *src, size_t n)
-{
-	__m256i ymm0, ymm1;
-
-	while (n >= 64) {
-		ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32));
-		n -= 64;
-		ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32));
-		src = (const uint8_t *)src + 64;
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
-		dst = (uint8_t *)dst + 64;
-	}
-}
-
-/**
- * Copy 256-byte blocks from one location to another,
+ * Copy 128-byte blocks from one location to another,
  * locations should not overlap.
  */
 static inline void
-rte_mov256blocks(uint8_t *dst, const uint8_t *src, size_t n)
+rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
 {
-	__m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7;
+	__m256i ymm0, ymm1, ymm2, ymm3;
 
-	while (n >= 256) {
+	while (n >= 128) {
 		ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32));
-		n -= 256;
+		n -= 128;
 		ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32));
 		ymm2 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 2 * 32));
 		ymm3 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 3 * 32));
-		ymm4 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 4 * 32));
-		ymm5 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 5 * 32));
-		ymm6 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 6 * 32));
-		ymm7 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 7 * 32));
-		src = (const uint8_t *)src + 256;
+		src = (const uint8_t *)src + 128;
 		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
 		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
 		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32), ymm2);
 		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32), ymm3);
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 4 * 32), ymm4);
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 5 * 32), ymm5);
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 6 * 32), ymm6);
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 7 * 32), ymm7);
-		dst = (uint8_t *)dst + 256;
+		dst = (uint8_t *)dst + 128;
 	}
 }
 
@@ -466,51 +421,56 @@ rte_memcpy(void *dst, const void *src, size_t n)
 	}
 
 	/**
-	 * Fast way when copy size doesn't exceed 512 bytes
+	 * Fast way when copy size doesn't exceed 256 bytes
 	 */
 	if (n <= 32) {
 		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 48) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst + 16, (const uint8_t *)src + 16);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
 		return ret;
 	}
 	if (n <= 64) {
 		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
+		rte_mov32((uint8_t *)dst - 32 + n,
+				(const uint8_t *)src - 32 + n);
 		return ret;
 	}
-	if (n <= 512) {
-		if (n >= 256) {
-			n -= 256;
-			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 256;
-			dst = (uint8_t *)dst + 256;
-		}
+	if (n <= 256) {
 		if (n >= 128) {
 			n -= 128;
 			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
 			src = (const uint8_t *)src + 128;
 			dst = (uint8_t *)dst + 128;
 		}
+COPY_BLOCK_128_BACK31:
 		if (n >= 64) {
 			n -= 64;
 			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
 			src = (const uint8_t *)src + 64;
 			dst = (uint8_t *)dst + 64;
 		}
-COPY_BLOCK_64_BACK31:
 		if (n > 32) {
 			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-			rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
+			rte_mov32((uint8_t *)dst - 32 + n,
+					(const uint8_t *)src - 32 + n);
 			return ret;
 		}
 		if (n > 0) {
-			rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
+			rte_mov32((uint8_t *)dst - 32 + n,
+					(const uint8_t *)src - 32 + n);
 		}
 		return ret;
 	}
 
 	/**
-	 * Make store aligned when copy size exceeds 512 bytes
+	 * Make store aligned when copy size exceeds 256 bytes
 	 */
 	dstofss = (uintptr_t)dst & 0x1F;
 	if (dstofss > 0) {
@@ -522,35 +482,19 @@ COPY_BLOCK_64_BACK31:
 	}
 
 	/**
-	 * Copy 256-byte blocks.
-	 * Use copy block function for better instruction order control,
-	 * which is important when load is unaligned.
+	 * Copy 128-byte blocks
 	 */
-	rte_mov256blocks((uint8_t *)dst, (const uint8_t *)src, n);
+	rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
 	bits = n;
-	n = n & 255;
+	n = n & 127;
 	bits -= n;
 	src = (const uint8_t *)src + bits;
 	dst = (uint8_t *)dst + bits;
 
 	/**
-	 * Copy 64-byte blocks.
-	 * Use copy block function for better instruction order control,
-	 * which is important when load is unaligned.
-	 */
-	if (n >= 64) {
-		rte_mov64blocks((uint8_t *)dst, (const uint8_t *)src, n);
-		bits = n;
-		n = n & 63;
-		bits -= n;
-		src = (const uint8_t *)src + bits;
-		dst = (uint8_t *)dst + bits;
-	}
-
-	/**
 	 * Copy whatever left
 	 */
-	goto COPY_BLOCK_64_BACK31;
+	goto COPY_BLOCK_128_BACK31;
 }
 
 #else /* RTE_MACHINE_CPUFLAG */
--
2.5.0