From: "Xu, Qian Q"
To: "Wang, Zhihong", dev@dpdk.org
Cc: "Wang, Zhihong"
Subject: Re: [dpdk-dev] [PATCH] eal: fix rte_memcpy perf in hsw/bdw
Date: Thu, 26 May 2016 05:19:16 +0000
Message-ID: <82F45D86ADE5454A95A89742C8D1410E0328B4EE@shsmsx102.ccr.corp.intel.com>
In-Reply-To: <1464139383-132732-1-git-send-email-zhihong.wang@intel.com>

Tested-by: Qian Xu

- Test Commit: 8f6f24342281f59de0df7bd976a32f714d39b9a9
- OS/Kernel: Fedora 21 / 4.1.13
- GCC: gcc (GCC) 4.9.2 20141101 (Red Hat 4.9.2-1)
- CPU: Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
- Total: 1 case, 1 passed, 0 failed.

Test steps:

1. The test scenario is vhost-virtio loopback, with no NIC involved.

2. Update the packet size in the testpmd config file from 64 to 1518.

3. Run the vhost PMD in testpmd:

rm -rf vhost-net*
./x86_64-native-linuxapp-gcc/app/testpmd -c 0xc0000 -n 4 \
    --socket-mem 1024,1024 \
    --vdev 'eth_vhost0,iface=vhost-net,queues=1' -- -i --nb-cores=1
>start
4. Launch VM1 with one virtio device:

taskset -c 20-21 \
/root/qemu-versions/qemu-2.5.0/x86_64-softmmu/qemu-system-x86_64 -name vm1 \
 -cpu host -enable-kvm -m 2048 \
 -object memory-backend-file,id=mem,size=2048M,mem-path=/mnt/huge,share=on \
 -numa node,memdev=mem -mem-prealloc \
 -smp cores=2,sockets=1 -drive file=/home/img/vm1.img \
 -chardev socket,id=char0,path=./vhost-net \
 -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce,queues=1 \
 -device virtio-net-pci,mac=52:54:00:00:00:01,netdev=mynet1,mrg_rxbuf=off,mq=on \
 -netdev tap,id=ipvm1,ifname=tap3,script=/etc/qemu-ifup \
 -device rtl8139,netdev=ipvm1,id=net0,mac=00:00:00:00:10:01 \
 -vnc :2 -daemonize

5. In the VM, run testpmd on the virtio device:

./testpmd -c 0x3 -n 4 -- -i
>start
>show port stats all

6. Check the best vhost RX/TX rate and record it as result #1.

7. Apply this patch, rerun the case, and record the new rate as result #2.

8. Compare #1 with #2: a ~5% performance increase is seen on the BDW-EP CPU server.

Thanks,
Qian

-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhihong Wang
Sent: Wednesday, May 25, 2016 9:23 AM
To: dev@dpdk.org
Cc: Wang, Zhihong
Subject: [dpdk-dev] [PATCH] eal: fix rte_memcpy perf in hsw/bdw

This patch fixes rte_memcpy performance in Haswell and Broadwell for vhost when the copy size is larger than 256 bytes.

It is observed that for large copies such as 1024/1518-byte ones, rte_memcpy suffers a high rate of store-buffer-full stalls, which stall the pipeline in scenarios like vhost enqueue. This can be alleviated by adjusting the instruction layout. Note that the issue may not be visible in micro tests.
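[Editor's note: for illustration, here is the core of the layout change in standalone, compilable form. It groups the four 32-byte AVX2 loads of a 128-byte chunk ahead of the matching stores, so at most four stores are issued back to back before the store buffer can drain. This is essentially the rte_mov128blocks helper added by the diff below, extracted as a sketch; the function name and the assumption of an AVX2 build (-mavx2) are ours.]

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy n bytes in 128-byte chunks: all loads first, then stores.
 * At most four stores are in flight per iteration, instead of the
 * eight back-to-back stores of the old 256-byte block loop. Any
 * remainder (n % 128) is left to the caller, as in rte_memcpy.
 */
static inline void
copy128blocks(uint8_t *dst, const uint8_t *src, size_t n)
{
	__m256i ymm0, ymm1, ymm2, ymm3;

	while (n >= 128) {
		/* Four loads up front... */
		ymm0 = _mm256_loadu_si256((const __m256i *)(src + 0 * 32));
		ymm1 = _mm256_loadu_si256((const __m256i *)(src + 1 * 32));
		ymm2 = _mm256_loadu_si256((const __m256i *)(src + 2 * 32));
		ymm3 = _mm256_loadu_si256((const __m256i *)(src + 3 * 32));
		src += 128;
		/* ...then a short burst of stores. */
		_mm256_storeu_si256((__m256i *)(dst + 0 * 32), ymm0);
		_mm256_storeu_si256((__m256i *)(dst + 1 * 32), ymm1);
		_mm256_storeu_si256((__m256i *)(dst + 2 * 32), ymm2);
		_mm256_storeu_si256((__m256i *)(dst + 3 * 32), ymm3);
		dst += 128;
		n -= 128;
	}
}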
How to reproduce?

PHY-VM-PHY using vhost/virtio, or a vhost/virtio loopback, with large packets such as 1024/1518-byte ones. Make sure the packet generation rate is not the bottleneck if PHY-VM-PHY is used.

Signed-off-by: Zhihong Wang

---
 .../common/include/arch/x86/rte_memcpy.h | 116 ++++++---------------
 1 file changed, 30 insertions(+), 86 deletions(-)

diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index f463ab3..413035e 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -363,71 +363,26 @@ rte_mov128(uint8_t *dst, const uint8_t *src)
 }
 
 /**
- * Copy 256 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov256(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
-	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
-	rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
-	rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
-	rte_mov32((uint8_t *)dst + 4 * 32, (const uint8_t *)src + 4 * 32);
-	rte_mov32((uint8_t *)dst + 5 * 32, (const uint8_t *)src + 5 * 32);
-	rte_mov32((uint8_t *)dst + 6 * 32, (const uint8_t *)src + 6 * 32);
-	rte_mov32((uint8_t *)dst + 7 * 32, (const uint8_t *)src + 7 * 32);
-}
-
-/**
- * Copy 64-byte blocks from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov64blocks(uint8_t *dst, const uint8_t *src, size_t n)
-{
-	__m256i ymm0, ymm1;
-
-	while (n >= 64) {
-		ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32));
-		n -= 64;
-		ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32));
-		src = (const uint8_t *)src + 64;
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
-		dst = (uint8_t *)dst + 64;
-	}
-}
-
-/**
- * Copy 256-byte blocks from one location to another,
+ * Copy 128-byte blocks from one location to another,
  * locations should not overlap.
  */
 static inline void
-rte_mov256blocks(uint8_t *dst, const uint8_t *src, size_t n)
+rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
 {
-	__m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7;
+	__m256i ymm0, ymm1, ymm2, ymm3;
 
-	while (n >= 256) {
+	while (n >= 128) {
 		ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32));
-		n -= 256;
+		n -= 128;
 		ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32));
 		ymm2 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 2 * 32));
 		ymm3 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 3 * 32));
-		ymm4 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 4 * 32));
-		ymm5 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 5 * 32));
-		ymm6 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 6 * 32));
-		ymm7 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 7 * 32));
-		src = (const uint8_t *)src + 256;
+		src = (const uint8_t *)src + 128;
 		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
 		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
 		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32), ymm2);
 		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32), ymm3);
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 4 * 32), ymm4);
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 5 * 32), ymm5);
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 6 * 32), ymm6);
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 7 * 32), ymm7);
-		dst = (uint8_t *)dst + 256;
+		dst = (uint8_t *)dst + 128;
 	}
 }
 
@@ -466,51 +421,56 @@ rte_memcpy(void *dst, const void *src, size_t n)
 	}
 
 	/**
-	 * Fast way when copy size doesn't exceed 512 bytes
+	 * Fast way when copy size doesn't exceed 256 bytes
 	 */
 	if (n <= 32) {
 		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 48) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst + 16, (const uint8_t *)src + 16);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
 		return ret;
 	}
 	if (n <= 64) {
 		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
+		rte_mov32((uint8_t *)dst - 32 + n,
+				(const uint8_t *)src - 32 + n);
 		return ret;
 	}
-	if (n <= 512) {
-		if (n >= 256) {
-			n -= 256;
-			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 256;
-			dst = (uint8_t *)dst + 256;
-		}
+	if (n <= 256) {
 		if (n >= 128) {
 			n -= 128;
 			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
 			src = (const uint8_t *)src + 128;
 			dst = (uint8_t *)dst + 128;
 		}
+COPY_BLOCK_128_BACK31:
 		if (n >= 64) {
 			n -= 64;
 			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
 			src = (const uint8_t *)src + 64;
 			dst = (uint8_t *)dst + 64;
 		}
-COPY_BLOCK_64_BACK31:
 		if (n > 32) {
 			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-			rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
+			rte_mov32((uint8_t *)dst - 32 + n,
+					(const uint8_t *)src - 32 + n);
 			return ret;
 		}
 		if (n > 0) {
-			rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
+			rte_mov32((uint8_t *)dst - 32 + n,
+					(const uint8_t *)src - 32 + n);
 		}
 		return ret;
 	}
 
 	/**
-	 * Make store aligned when copy size exceeds 512 bytes
+	 * Make store aligned when copy size exceeds 256 bytes
 	 */
 	dstofss = (uintptr_t)dst & 0x1F;
 	if (dstofss > 0) {
@@ -522,35 +482,19 @@ COPY_BLOCK_64_BACK31:
 	}
 
 	/**
-	 * Copy 256-byte blocks.
-	 * Use copy block function for better instruction order control,
-	 * which is important when load is unaligned.
+	 * Copy 128-byte blocks
 	 */
-	rte_mov256blocks((uint8_t *)dst, (const uint8_t *)src, n);
+	rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
 	bits = n;
-	n = n & 255;
+	n = n & 127;
 	bits -= n;
 	src = (const uint8_t *)src + bits;
 	dst = (uint8_t *)dst + bits;
 
 	/**
-	 * Copy 64-byte blocks.
-	 * Use copy block function for better instruction order control,
-	 * which is important when load is unaligned.
-	 */
-	if (n >= 64) {
-		rte_mov64blocks((uint8_t *)dst, (const uint8_t *)src, n);
-		bits = n;
-		n = n & 63;
-		bits -= n;
-		src = (const uint8_t *)src + bits;
-		dst = (uint8_t *)dst + bits;
-	}
-
-	/**
 	 * Copy whatever left
 	 */
-	goto COPY_BLOCK_64_BACK31;
+	goto COPY_BLOCK_128_BACK31;
 }
 
 #else /* RTE_MACHINE_CPUFLAG */
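[Editor's note: one detail of the diff worth spelling out is the new n <= 48 branch. Rather than branching on the exact size, the last rte_mov16 is anchored at the end of the buffer (offset n - 16) and may overlap the previous move. A standalone sketch of the same trick follows; copy16 is a hypothetical stand-in for rte_mov16, not DPDK code.]

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for rte_mov16: one unaligned 16-byte copy. */
static inline void
copy16(uint8_t *dst, const uint8_t *src)
{
	memcpy(dst, src, 16);
}

/*
 * Copy any 33..48 bytes with exactly three 16-byte moves, as in the
 * new n <= 48 branch: the first two moves cover bytes [0, 32), and
 * the third is anchored at the end, overlapping the second by up to
 * 15 bytes when n < 48. No per-size branching, no scalar tail loop.
 */
static inline void
copy_33_to_48(uint8_t *dst, const uint8_t *src, size_t n)
{
	copy16(dst, src);
	copy16(dst + 16, src + 16);
	/* e.g. n == 40: this copies [24, 40), re-copying [24, 32) */
	copy16(dst + n - 16, src + n - 16);
}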
-- 
2.5.0