Date: Mon, 5 Dec 2016 16:27:14 +0800
From: Yuanhan Liu
To: Zhihong Wang
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH] vhost: optimize vhost memcpy
Message-ID: <20161205082714.GJ24403@yliu-dev.sh.intel.com>
In-Reply-To: <1480641582-56186-1-git-send-email-zhihong.wang@intel.com>

On Thu, Dec 01, 2016 at 08:19:42PM -0500, Zhihong Wang wrote:
> This patch optimizes Vhost performance for large packets when the
> Mergeable Rx buffer feature is enabled. It introduces a dedicated
> memcpy function for vhost enqueue/dequeue to replace rte_memcpy.
>
> The reason is that rte_memcpy is for general cases: it handles
> unaligned copies and makes stores aligned, and it even makes loads
> aligned for microarchitectures like Ivy Bridge. However, alignment
> handling comes at a price: it introduces extra load/store
> instructions.
>
> Vhost memcpy is rather special: the copy is aligned, and remote,
> and there is a header write along with it which is also remote. In
> this case the memcpy instruction stream should be simplified to
> reduce extra load/store, and therefore reduce the probability of
> pipeline stalls caused by a full load/store buffer, so that the
> actual memcpy instructions can be issued and the H/W prefetcher can
> go to work as early as possible.
...
>
> +/**
> + * This function is used for vhost memcpy, to replace rte_memcpy.
> + * The reason is that rte_memcpy is for general cases, while vhost
> + * memcpy is a rather special case: the copy is aligned, and remote,
> + * and there is a header write along with it which is also remote.
> + * In this case the memcpy instruction stream should be simplified to
> + * reduce extra load/store, and therefore reduce the probability of
> + * pipeline stalls caused by a full load/store buffer, so that the
> + * actual memcpy instructions can be issued and the H/W prefetcher
> + * can go to work as early as possible.
> + */
> +static inline void __attribute__((always_inline))
> +vhost_memcpy(void *dst, const void *src, size_t n)

I like this function a lot, since it's really simple and
straightforward! Moreover, it performs better. But I don't quite like
how this function is proposed:

- rte_movX are more like internal helper functions that should only be
  used inside the corresponding rte_memcpy.h file.

- It's a good optimization, however it will not benefit other use
  cases, though vhost is the most typical case here.

- The optimization proves to be good for x86, but I think there is no
  guarantee it will behave well on other platforms, say ARM.
I would still suggest you go this way: move this function into x86's
rte_memcpy.h and call it when the data is well aligned (a rough sketch
of what I mean is appended below the quoted code).

	--yliu

> +{
> +	/* Copy size <= 16 bytes */
> +	if (n < 16) {
> +		if (n & 0x01) {
> +			*(uint8_t *)dst = *(const uint8_t *)src;
> +			src = (const uint8_t *)src + 1;
> +			dst = (uint8_t *)dst + 1;
> +		}
> +		if (n & 0x02) {
> +			*(uint16_t *)dst = *(const uint16_t *)src;
> +			src = (const uint16_t *)src + 1;
> +			dst = (uint16_t *)dst + 1;
> +		}
> +		if (n & 0x04) {
> +			*(uint32_t *)dst = *(const uint32_t *)src;
> +			src = (const uint32_t *)src + 1;
> +			dst = (uint32_t *)dst + 1;
> +		}
> +		if (n & 0x08)
> +			*(uint64_t *)dst = *(const uint64_t *)src;
> +
> +		return;
> +	}
> +
> +	/* Copy 16 <= size <= 32 bytes */
> +	if (n <= 32) {
> +		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> +		rte_mov16((uint8_t *)dst - 16 + n,
> +				(const uint8_t *)src - 16 + n);
> +
> +		return;
> +	}
> +
> +	/* Copy 32 < size <= 64 bytes */
> +	if (n <= 64) {
> +		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> +		rte_mov32((uint8_t *)dst - 32 + n,
> +				(const uint8_t *)src - 32 + n);
> +
> +		return;
> +	}
> +
> +	/* Copy 64-byte blocks */
> +	for (; n >= 64; n -= 64) {
> +		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> +		dst = (uint8_t *)dst + 64;
> +		src = (const uint8_t *)src + 64;
> +	}
> +
> +	/* Copy whatever is left */
> +	rte_mov64((uint8_t *)dst - 64 + n,
> +			(const uint8_t *)src - 64 + n);
> +}
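For reference, here is a rough, untested sketch of that suggestion: an
alignment check that takes a simplified copy stream only when both
pointers are 16-byte aligned, and falls back to the generic
rte_memcpy() otherwise. The names copy_aligned() and
rte_memcpy_dispatch() are made up for illustration and are not part of
this patch or of today's rte_memcpy.h; only rte_mov16/32/64 and
rte_memcpy() are existing helpers from that header.

	#include <stdint.h>
	#include <stddef.h>

	#include <rte_memcpy.h>	/* rte_mov16/32/64, rte_memcpy() */

	/*
	 * Simplified copy for the aligned case, same idea as the
	 * vhost_memcpy() quoted above: no extra instructions are spent
	 * on fixing up alignment.
	 */
	static inline void
	copy_aligned(uint8_t *dst, const uint8_t *src, size_t n)
	{
		if (n < 16) {
			/* Short copy: a plain byte loop is enough here. */
			while (n--)
				*dst++ = *src++;
			return;
		}
		if (n <= 32) {
			rte_mov16(dst, src);
			rte_mov16(dst - 16 + n, src - 16 + n);
			return;
		}
		if (n <= 64) {
			rte_mov32(dst, src);
			rte_mov32(dst - 32 + n, src - 32 + n);
			return;
		}
		for (; n >= 64; n -= 64, dst += 64, src += 64)
			rte_mov64(dst, src);
		/* Overlapping tail copy of the last (up to) 64 bytes. */
		rte_mov64(dst - 64 + n, src - 64 + n);
	}

	/*
	 * Entry point: dispatch on alignment, keep the generic copy
	 * for the unaligned case.
	 */
	static inline void *
	rte_memcpy_dispatch(void *dst, const void *src, size_t n)
	{
		/* Both pointers 16-byte aligned: take the fast path. */
		if (!(((uintptr_t)dst | (uintptr_t)src) & 0x0F)) {
			copy_aligned((uint8_t *)dst, (const uint8_t *)src, n);
			return dst;
		}
		return rte_memcpy(dst, src, n);
	}

That way vhost (and any other caller with aligned buffers) gets the
simplified instruction stream without exposing the rte_movX helpers
outside rte_memcpy.h.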