Date: Wed, 22 Feb 2017 09:37:34 +0800
From: Yuanhan Liu
To: Maxime Coquelin
Cc: cunming.liang@intel.com, jianfeng.tan@intel.com, dev@dpdk.org
Message-ID: <20170222013734.GJ18844@yliu-dev.sh.intel.com>
In-Reply-To: <20170221173243.20779-1-maxime.coquelin@redhat.com>
Subject: Re: [dpdk-dev] [RFC PATCH] net/virtio: Align Virtio-net header on cache line in receive path

On Tue, Feb 21, 2017 at 06:32:43PM +0100, Maxime Coquelin wrote:
> This patch aligns the Virtio-net header on a cache-line boundary to
> optimize cache utilization, as it puts the Virtio-net header (which
> is always accessed) on the same cache line as the packet header.
>
> For example, with an application that forwards packets at L2 level,
> a single cache line will be accessed with this patch, instead of
> two before.

I'm assuming you were testing pkt size <= (64 - hdr_size)?

> In case of multi-buffer packets, the next segments will be aligned on
> a cache-line boundary, instead of a cache-line boundary minus the size
> of the vnet header as before.

The other thing is that this patch always makes the pkt data
cache-unaligned for the first packet, which makes Zhihong's
optimization on memcpy (for big packets) useless:

commit f5472703c0bdfc29c46fc4b2ca445bce3dc08c9f
Author: Zhihong Wang
Date:   Tue Dec 6 20:31:06 2016 -0500

    eal: optimize aligned memcpy on x86

    This patch optimizes rte_memcpy for well aligned cases, where both
    dst and src addr are aligned to maximum MOV width. It introduces a
    dedicated function called rte_memcpy_aligned to handle the aligned
    cases with a simplified instruction stream. The existing rte_memcpy
    is renamed as rte_memcpy_generic. The selection between the two is
    done at the entry of rte_memcpy.

    The existing rte_memcpy is for generic cases: it handles unaligned
    copies and makes stores aligned; it even makes loads aligned for
    micro-architectures like Ivy Bridge. However, alignment handling
    comes at a price: it adds extra load/store instructions, which can
    sometimes cause complications.

    DPDK vhost memcpy with the Mergeable Rx Buffer feature is an
    example: the copy is aligned, and remote, and there is a header
    write along with it which is also remote. In this case the memcpy
    instruction stream should be simplified, to reduce the extra
    loads/stores and therefore the probability of pipeline stalls
    caused by full load/store buffers, to let the actual memcpy
    instructions be issued and let the H/W prefetcher go to work as
    early as possible.

    This patch is tested on Ivy Bridge, Haswell and Skylake; it
    provides up to 20% gain for Virtio Vhost PVP traffic, with packet
    sizes ranging from 64 to 1500 bytes.

    The test can also be conducted without a NIC, by setting up
    loopback traffic between Virtio and Vhost. For example, modify the
    macro TXONLY_DEF_PACKET_LEN to the requested packet size in
    testpmd.h, rebuild and start testpmd in both host and guest, then
    "start" on one side and "start tx_first 32" on the other.

    Signed-off-by: Zhihong Wang
    Reviewed-by: Yuanhan Liu
    Tested-by: Lei Yao
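To make that concrete, below is a minimal, self-contained sketch of the
kind of entry-point dispatch the commit message describes. The names,
the 32-byte mask and the stub copy paths are illustrative only (this is
not the actual rte_memcpy code); the point is the alignment check and
how the header placement feeds into it:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Illustrative only: with AVX2 the widest MOV is 32 bytes, so the
 * mask would be 31; the real mask depends on the build target. */
#define ALIGNMENT_MASK 31u

/* Stand-ins for the two copy paths the commit message describes. */
static void *copy_aligned(void *dst, const void *src, size_t n)
{
	printf("aligned path, n=%zu\n", n);
	return memcpy(dst, src, n);
}

static void *copy_generic(void *dst, const void *src, size_t n)
{
	printf("generic path, n=%zu\n", n);
	return memcpy(dst, src, n);
}

/* Selection at the entry point: only when both dst and src are
 * aligned to the maximum MOV width is the simplified path taken. */
static void *copy_dispatch(void *dst, const void *src, size_t n)
{
	if ((((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK) == 0)
		return copy_aligned(dst, src, n);
	return copy_generic(dst, src, n);
}

int main(void)
{
	static uint8_t src[256] __attribute__((aligned(64)));
	static uint8_t buf[256] __attribute__((aligned(64)));
	const size_t hdr_size = 12;	/* mergeable virtio-net header */

	/* Today: the vnet header sits just before a cache-line boundary,
	 * so the packet data (the copy destination) is aligned. */
	copy_dispatch(buf, src, 128);			/* -> aligned path */

	/* With the header itself cache-line aligned, the data starts at
	 * base + hdr_size, so the copy always falls back to the
	 * generic path. */
	copy_dispatch(buf + hdr_size, src, 128);	/* -> generic path */

	return 0;
}

So while the vnet header and the L2 header now share a cache line on
the virtio side, the data copy on the vhost side can no longer take
the aligned fast path for big packets.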
> 
> Signed-off-by: Maxime Coquelin
> ---
> 
> Hi,
> 
> I send this patch as RFC because I get strange results on SandyBridge.
> 
> For micro-benchmarks, I measure a +6% gain on Haswell, but I get a big
> performance drop on SandyBridge (~-18%).
> When running the PVP benchmark on SandyBridge, I measure a +4% performance
> gain though.
> 
> So I'd like to call for testing on this patch, especially PVP-like testing
> on newer architectures.
> 
> Regarding SandyBridge, I would be interested to know whether we should
> take the performance drop into account, as we for example had one patch
> in the last release that caused a performance drop on SB which we merged
> anyway.

Sorry, would you remind me which patch it is?

	--yliu