From: "Yao, Lei A"
To: "Wang, Zhihong", "dev@dpdk.org"
CC: "yuanhan.liu@linux.intel.com", "thomas.monjalon@6wind.com"
Date: Thu, 8 Dec 2016 00:55:39 +0000
Message-ID: <2DBBFF226F7CF64BAFCA79B681719D9537F45981@shsmsx102.ccr.corp.intel.com>
References: <1480641582-56186-1-git-send-email-zhihong.wang@intel.com> <1481074266-4461-1-git-send-email-zhihong.wang@intel.com>
In-Reply-To: <1481074266-4461-1-git-send-email-zhihong.wang@intel.com>
Subject: Re: [dpdk-dev] [PATCH v2] eal: optimize aligned rte_memcpy
List-Id: DPDK patches and discussions

Tested-by: Lei Yao
- Apply patch to v16.11

I have tested the loopback performance of this patch on the following 3
settings:
CPU: IVB, Ubuntu 16.04, Kernel: 4.4.0, gcc: 5.4.0
CPU: HSW, Fedora 21, Kernel: 4.1.13, gcc: 4.9.2
CPU: BDW, Ubuntu 16.04, Kernel: 4.4.0, gcc: 5.4.0

I can see a 10%~20% performance gain across packet sizes on the mergeable
path. Only on IVB + gcc 5.4.0 is there a slight performance drop (~4%) on the
vector path for packet sizes 128 and 260. It may be related to the gcc
version, as this performance drop is not seen with gcc 6+.

-----Original Message-----
From: Wang, Zhihong
Sent: Wednesday, December 7, 2016 9:31 AM
To: dev@dpdk.org
Cc: yuanhan.liu@linux.intel.com; thomas.monjalon@6wind.com; Yao, Lei A; Wang, Zhihong
Subject: [PATCH v2] eal: optimize aligned rte_memcpy

This patch optimizes rte_memcpy for well aligned cases, where both
dst and src addr are aligned to maximum MOV width. It introduces a
dedicated function called rte_memcpy_aligned to handle the aligned
cases with a simplified instruction stream. The existing rte_memcpy
is renamed to rte_memcpy_generic. The selection between the two is
done at the entry of rte_memcpy.

The existing rte_memcpy is for generic cases: it handles unaligned
copies and makes stores aligned, and it even makes loads aligned for
microarchitectures like Ivy Bridge. However, alignment handling comes
at a price: it adds extra load/store instructions, which can sometimes
cause complications.

Take DPDK Vhost memcpy with the Mergeable Rx Buffer feature as an
example: the copy is aligned and remote, and there is a header write
alongside it which is also remote.
In this case the memcpy instruction stream should be simplified to
reduce extra loads/stores, and therefore reduce the probability of
pipeline stalls caused by full load/store buffers, so that the actual
memcpy instructions are issued and the H/W prefetcher goes to work as
early as possible.

This patch is tested on Ivy Bridge, Haswell and Skylake; it provides
up to 20% gain for Virtio Vhost PVP traffic, with packet sizes ranging
from 64 to 1500 bytes.

The test can also be conducted without a NIC, by setting up loopback
traffic between Virtio and Vhost. For example, modify the macro
TXONLY_DEF_PACKET_LEN to the requested packet size in testpmd.h,
rebuild and start testpmd in both host and guest, then "start" on
one side and "start tx_first 32" on the other.

Signed-off-by: Zhihong Wang
---
 .../common/include/arch/x86/rte_memcpy.h           | 81 +++++++++++++++++++++-
 1 file changed, 78 insertions(+), 3 deletions(-)

diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index b3bfc23..b9785e8 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -69,6 +69,8 @@ rte_memcpy(void *dst, const void *src, size_t n) __attribute__((always_inline));
 
 #ifdef RTE_MACHINE_CPUFLAG_AVX512F
 
+#define ALIGNMENT_MASK 0x3F
+
 /**
  * AVX512 implementation below
  */
@@ -189,7 +191,7 @@ rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
 }
 
 static inline void *
-rte_memcpy(void *dst, const void *src, size_t n)
+rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
 	uintptr_t dstu = (uintptr_t)dst;
 	uintptr_t srcu = (uintptr_t)src;
@@ -308,6 +310,8 @@ COPY_BLOCK_128_BACK63:
 
 #elif defined RTE_MACHINE_CPUFLAG_AVX2
 
+#define ALIGNMENT_MASK 0x1F
+
 /**
  * AVX2 implementation below
  */
@@ -387,7 +391,7 @@ rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
 }
 
 static inline void *
-rte_memcpy(void *dst, const void *src, size_t n)
+rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
 	uintptr_t dstu = (uintptr_t)dst;
 	uintptr_t srcu = (uintptr_t)src;
@@ -499,6 +503,8 @@ COPY_BLOCK_128_BACK31:
 
 #else /* RTE_MACHINE_CPUFLAG */
 
+#define ALIGNMENT_MASK 0x0F
+
 /**
  * SSE & AVX implementation below
  */
@@ -677,7 +683,7 @@ __extension__ ({                                                      \
 })
 
 static inline void *
-rte_memcpy(void *dst, const void *src, size_t n)
+rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
 	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
 	uintptr_t dstu = (uintptr_t)dst;
@@ -821,6 +827,75 @@ COPY_BLOCK_64_BACK15:
 
 #endif /* RTE_MACHINE_CPUFLAG */
 
+static inline void *
+rte_memcpy_aligned(void *dst, const void *src, size_t n)
+{
+	void *ret = dst;
+
+	/* Copy size <= 16 bytes */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dst = *(const uint8_t *)src;
+			src = (const uint8_t *)src + 1;
+			dst = (uint8_t *)dst + 1;
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dst = *(const uint16_t *)src;
+			src = (const uint16_t *)src + 1;
+			dst = (uint16_t *)dst + 1;
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dst = *(const uint32_t *)src;
+			src = (const uint32_t *)src + 1;
+			dst = (uint32_t *)dst + 1;
+		}
+		if (n & 0x08)
+			*(uint64_t *)dst = *(const uint64_t *)src;
+
+		return ret;
+	}
+
+	/* Copy 16 <= size <= 32 bytes */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+
+		return ret;
+	}
+
+	/* Copy 32 < size <= 64 bytes */
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov32((uint8_t *)dst - 32 + n,
+				(const uint8_t *)src - 32 + n);
+
+		return ret;
+	}
+
+	/* Copy 64 bytes blocks */
+	for (; n >= 64; n -= 64) {
+		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+		dst = (uint8_t *)dst + 64;
+		src = (const uint8_t *)src + 64;
+	}
+
+	/* Copy whatever left */
+	rte_mov64((uint8_t *)dst - 64 + n,
+			(const uint8_t *)src - 64 + n);
+
+	return ret;
+}
+
+static inline void *
+rte_memcpy(void *dst, const void *src, size_t n)
+{
+	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
+		return rte_memcpy_aligned(dst, src, n);
+	else
+		return rte_memcpy_generic(dst, src, n);
+}
+
 #ifdef __cplusplus
 }
 #endif
-- 
2.7.4
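
For readers who want to try the two ideas in this patch in isolation, here is a minimal portable sketch. It is not the DPDK code. It shows the alignment check done at the entry of rte_memcpy, and the overlapping head/tail copy used for sizes between 16 and 32 bytes. All names prefixed with sketch_ or SKETCH_ are hypothetical, the mask assumes the SSE/AVX case (0x0F), and plain memcpy() stands in for rte_mov16(). The tail offset is written as dst + (n - 16), which addresses the same bytes as the patch's dst - 16 + n.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 16-byte moves, matching the patch's SSE/AVX mask of 0x0F. */
#define SKETCH_ALIGNMENT_MASK 0x0F

/* Nonzero when both addresses have all mask bits clear, i.e. both are
 * 16-byte aligned. OR-ing the addresses tests both with a single AND. */
static inline int
sketch_both_aligned(const void *dst, const void *src)
{
	return !(((uintptr_t)dst | (uintptr_t)src) & SKETCH_ALIGNMENT_MASK);
}

/* Portable stand-in for rte_mov16(): copy exactly 16 bytes. */
static inline void
sketch_mov16(uint8_t *dst, const uint8_t *src)
{
	memcpy(dst, src, 16);
}

/* Copy 16 <= n <= 32 bytes with two fixed-size moves: one at the start and
 * one ending at byte n. The moves overlap when n < 32, which is harmless
 * because both write the same source bytes to the same destination. */
static inline void *
sketch_copy_16_to_32(void *dst, const void *src, size_t n)
{
	assert(n >= 16 && n <= 32);
	sketch_mov16((uint8_t *)dst, (const uint8_t *)src);
	sketch_mov16((uint8_t *)dst + (n - 16), (const uint8_t *)src + (n - 16));
	return dst;
}
```

Calling sketch_copy_16_to_32(dst, src, 20) writes bytes 0..15 and then bytes 4..19, so exactly 20 bytes are copied without a per-byte tail loop. Avoiding that extra branching and those extra load/store instructions is the kind of simplification the commit message argues for.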