From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 8 Dec 2016 10:18:43 +0800
From: Yuanhan Liu
To: Zhihong Wang
Cc: dev@dpdk.org, thomas.monjalon@6wind.com, lei.a.yao@intel.com
Message-ID: <20161208021843.GM31182@yliu-dev.sh.intel.com>
References: <1480641582-56186-1-git-send-email-zhihong.wang@intel.com>
 <1481074266-4461-1-git-send-email-zhihong.wang@intel.com>
In-Reply-To: <1481074266-4461-1-git-send-email-zhihong.wang@intel.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
Subject: Re: [dpdk-dev] [PATCH v2] eal: optimize aligned rte_memcpy
List-Id: DPDK patches and discussions

On Tue, Dec 06, 2016 at 08:31:06PM -0500, Zhihong Wang wrote:
> This patch optimizes rte_memcpy for well-aligned cases, where both
> the dst and src addresses are aligned to the maximum MOV width. It
> introduces a dedicated function called rte_memcpy_aligned to handle
> the aligned cases with a simplified instruction stream. The existing
> rte_memcpy is renamed to rte_memcpy_generic, and the selection
> between the two is done at the entry of rte_memcpy.
>
> The existing rte_memcpy is for generic cases: it handles unaligned
> copies and makes stores aligned, and it even makes loads aligned for
> microarchitectures like Ivy Bridge. However, alignment handling comes
> at a price: it adds extra load/store instructions, which can sometimes
> cause complications.
>
> Take DPDK vhost memcpy with the Mergeable Rx Buffer feature as an
> example: the copy is aligned and remote, and there is a header write
> along with it which is also remote. In this case the memcpy
> instruction stream should be simplified to reduce the extra
> loads/stores, which lowers the probability of pipeline stalls caused
> by a full load/store buffer, so that the actual memcpy instructions
> can be issued and the hardware prefetcher can go to work as early as
> possible.
>
> This patch was tested on Ivy Bridge, Haswell and Skylake. It provides
> up to 20% gain for Virtio/Vhost PVP traffic, with packet sizes
> ranging from 64 to 1500 bytes.
>
> The test can also be conducted without a NIC, by setting up loopback
> traffic between Virtio and Vhost. For example, modify the macro
> TXONLY_DEF_PACKET_LEN to the requested packet size in testpmd.h,
> rebuild and start testpmd in both host and guest, then run "start" on
> one side and "start tx_first 32" on the other.
>
> Signed-off-by: Zhihong Wang

Reviewed-by: Yuanhan Liu

	--yliu
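P.S. For readers of the archive, the entry-point dispatch described in
the commit message can be sketched roughly as below. This is a minimal
illustration, not the patch itself: the function names `copy_aligned`,
`copy_generic` and `copy_dispatch`, and the 32-byte alignment mask, are
assumptions chosen for the sketch (the real mask depends on the maximum
MOV width DPDK is built for), and plain memcpy stands in for the two
vector copy paths.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative alignment mask: 0x1F for a 32-byte (e.g. AVX2) maximum
 * MOV width. The real value depends on the build's ISA level. */
#define ALIGNMENT_MASK 0x1F

/* Stand-in for the aligned fast path, which in the patch uses a
 * simplified instruction stream with no alignment fix-ups. */
static inline void *
copy_aligned(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

/* Stand-in for the generic path, which handles unaligned copies and
 * makes stores (and on some microarchitectures, loads) aligned. */
static inline void *
copy_generic(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

/* Entry point: take the aligned fast path only when neither pointer
 * has any of the low alignment bits set. OR-ing the two addresses
 * lets a single mask test check both at once. */
static inline void *
copy_dispatch(void *dst, const void *src, size_t n)
{
	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
		return copy_aligned(dst, src, n);
	return copy_generic(dst, src, n);
}
```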