From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 8 Dec 2016 10:18:43 +0800
From: Yuanhan Liu
To: Zhihong Wang
Cc: dev@dpdk.org, thomas.monjalon@6wind.com, lei.a.yao@intel.com
Message-ID: <20161208021843.GM31182@yliu-dev.sh.intel.com>
References: <1480641582-56186-1-git-send-email-zhihong.wang@intel.com>
 <1481074266-4461-1-git-send-email-zhihong.wang@intel.com>
In-Reply-To: <1481074266-4461-1-git-send-email-zhihong.wang@intel.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
Subject: Re: [dpdk-dev] [PATCH v2] eal: optimize aligned rte_memcpy
List-Id: DPDK patches and discussions

On Tue, Dec 06, 2016 at 08:31:06PM -0500, Zhihong Wang wrote:
> This patch optimizes rte_memcpy for well-aligned cases, where both
> the dst and src addresses are aligned to the maximum MOV width. It
> introduces a dedicated function called rte_memcpy_aligned to handle
> the aligned cases with a simplified instruction stream. The existing
> rte_memcpy is renamed to rte_memcpy_generic, and the selection
> between the two is done at the entry of rte_memcpy.
>
> The existing rte_memcpy is for generic cases: it handles unaligned
> copies and makes stores aligned, and it even makes loads aligned for
> microarchitectures like Ivy Bridge. However, alignment handling comes
> at a price: it adds extra load/store instructions, which can sometimes
> cause complications.
>
> Take DPDK vhost memcpy with the Mergeable Rx Buffer feature as an
> example: the copy is aligned and remote, and there is a header write
> along with it which is also remote. In this case the memcpy
> instruction stream should be simplified to reduce the extra
> loads/stores, which lowers the probability of pipeline stalls caused
> by a full load/store buffer, so that the actual memcpy instructions
> can be issued and the hardware prefetcher can go to work as early as
> possible.
>
> This patch was tested on Ivy Bridge, Haswell and Skylake. It provides
> up to 20% gain for Virtio/Vhost PVP traffic, with packet sizes
> ranging from 64 to 1500 bytes.
>
> The test can also be conducted without a NIC, by setting up loopback
> traffic between Virtio and Vhost. For example, modify the macro
> TXONLY_DEF_PACKET_LEN to the requested packet size in testpmd.h,
> rebuild and start testpmd in both host and guest, then run "start" on
> one side and "start tx_first 32" on the other.
>
> Signed-off-by: Zhihong Wang

Reviewed-by: Yuanhan Liu

	--yliu
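P.S. For readers of the archive, the entry-point dispatch described in
the commit message can be sketched roughly as below. This is a minimal
illustration, not the patch itself: the function names `copy_aligned`,
`copy_generic` and `copy_dispatch`, and the 32-byte alignment mask, are
assumptions chosen for the sketch (the real mask depends on the maximum
MOV width DPDK is built for), and plain memcpy stands in for the two
vector copy paths.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative alignment mask: 0x1F for a 32-byte (e.g. AVX2) maximum
 * MOV width. The real value depends on the build's ISA level. */
#define ALIGNMENT_MASK 0x1F

/* Stand-in for the aligned fast path, which in the patch uses a
 * simplified instruction stream with no alignment fix-ups. */
static inline void *
copy_aligned(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

/* Stand-in for the generic path, which handles unaligned copies and
 * makes stores (and on some microarchitectures, loads) aligned. */
static inline void *
copy_generic(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

/* Entry point: take the aligned fast path only when neither pointer
 * has any of the low alignment bits set. OR-ing the two addresses
 * lets a single mask test check both at once. */
static inline void *
copy_dispatch(void *dst, const void *src, size_t n)
{
	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
		return copy_aligned(dst, src, n);
	return copy_generic(dst, src, n);
}
```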