From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Fri, 13 Jan 2017 14:13:09 +0800
From: Yuanhan Liu
To: Jan Viktorin
Cc: Thomas Monjalon, Jianbo Liu, Jerin Jacob, Chao Zhu, dev@dpdk.org,
 Tan Jianfeng, Wang Zhihong, Olivier Matz, Maxime Coquelin,
 "Michael S. Tsirkin", Orsák Michal
Message-ID: <20170113061309.GF9770@yliu-dev.sh.intel.com>
References: <1484108832-19907-1-git-send-email-yuanhan.liu@linux.intel.com>
 <1484108832-19907-2-git-send-email-yuanhan.liu@linux.intel.com>
 <1610499.AMUobBPor6@xps13>
 <20170112023058.GF2402@yliu-dev.sh.intel.com>
 <20170112160256.6915ff12.viktorin@rehivetech.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170112160256.6915ff12.viktorin@rehivetech.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
Subject: Re: [dpdk-dev] [dpdk-stable] [PATCH 1/2] net/virtio: fix
 performance regression due to TSO enabling
List-Id: DPDK patches and discussions

On Thu, Jan 12, 2017 at 04:02:56PM +0100, Jan Viktorin wrote:
> On Thu, 12 Jan 2017 10:30:58 +0800
> Yuanhan Liu wrote:
>
> > On Wed, Jan 11, 2017 at 03:51:22PM +0100, Thomas Monjalon wrote:
> > > 2017-01-11 12:27, Yuanhan Liu:
> > > > The fact that virtio net header is initiated to zero in PMD driver
> > > > init stage means that
> > > > these costly writes are unnecessary and could
> > > > be avoided:
> > > >
> > > >     if (hdr->csum_start != 0)
> > > >         hdr->csum_start = 0;
> > > >
> > > > And that's what the macro ASSIGN_UNLESS_EQUAL does. With this, the
> > > > performance drop introduced by TSO enabling is recovered: it could
> > > > be up to 20% in micro benchmarking.
> > >
> > > This patch is adding a condition to assignments.
> > > We need a benchmark on other architectures like ARM. Please anyone?
> >
> > I think the cost of condition should be way lower than the cost from the
> > penalty introduced by the cache issue, that I don't see it would perform
> > bad on other platforms.
> >
> > But, of course, testing is always welcome!
> >
> > --yliu
>
> Hello,
>
> we've done a synthetic measurement, principle briefly:

Thanks!

> == Without condition check ==
>
> start = gettimeofday();
>
> for (i = 0; i < 1024*1024*128; ++i) {
>     hdr->csum_start = 0;
>     hdr->csum_offset = 0;
>     hdr->flags = 0;
> }
>
> end = gettimeofday();
>
>
> == With condition check ==
>
> start = gettimeofday();
>
> for (i = 0; i < 1024*1024*128; ++i) {
>     ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
>     ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
>     ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
> }
>
> end = gettimeofday();

But it's not the test methodology I'd expect. You are purely testing the
instruction cycles. The drop on ARM is something more like "the if
instruction takes more cycles than the simple assignment".

This macro is used in the case where one process is heavily writing the
same value (0 here) again and again while another process is heavily
reading it, also again and again. That means a cache violation always
happens. With this macro, however, the cache issue can be avoided, since
no write happens. For such a workload, I don't think it would behave
worse on ARM.

--yliu

> == Results ==
>
> Computed as total time of all threads:
>
> for i = 1..THREAD_COUNT:
>     result += end[i] - start[i]
>
> cpu                threads  without-check (ms)  with-check
> Xeon E5-2670       1        516                 529
> Xeon E5-2670       2        1155                953
> Xeon E5-2670       8        8947                5044
> Xeon E5-2670       16       23335               16836
> Zynq-7020 (armv7)  1        6735                7205
> Zynq-7020 (armv7)  2        13753               14418
>
> The advantage for Intel is evident when increasing the number
> of threads.
>
> However, on 32-bit ARMs we might expect some performance drop.
>
> Regards
> Jan
>
> >
> > >
> > > > [...]
> > > > +/* avoid write operation when necessary, to lessen cache issues */
> > > > +#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
> > > > +	if ((var) != (val))			\
> > > > +		(var) = (val);			\
> > > > +} while (0)
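
For reference, the contended workload described in the reply (one side
storing 0 over and over while another side keeps reading the same
header) might be measured with something like the sketch below. It is
not code from the patch or from DPDK: struct hdr_sketch, reader_main,
run_contended and ASSIGN_UNLESS_EQUAL_ATOMIC are hypothetical names, the
two "processes" are modelled as two pthreads, and relaxed C11 atomics
are assumed as a stand-in for the plain stores so that the concurrent
access is well defined.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the three header fields touched by the patch. */
struct hdr_sketch {
	_Atomic uint16_t csum_start;
	_Atomic uint16_t csum_offset;
	_Atomic uint8_t  flags;
};

/* Static storage, so zero-initialised, like the header after PMD init. */
static struct hdr_sketch shared_hdr;
static atomic_int stop_reader;

/* Atomic flavour of the macro: load first, store only on mismatch. */
#define ASSIGN_UNLESS_EQUAL_ATOMIC(var, val) do {                           \
	if (atomic_load_explicit(&(var), memory_order_relaxed) != (val))    \
		atomic_store_explicit(&(var), (val), memory_order_relaxed); \
} while (0)

/* Reader: keeps pulling the header's cache line towards its own core. */
static void *reader_main(void *arg)
{
	unsigned long *sum = arg;

	while (!atomic_load_explicit(&stop_reader, memory_order_relaxed)) {
		*sum += atomic_load_explicit(&shared_hdr.csum_start,
					     memory_order_relaxed);
		*sum += atomic_load_explicit(&shared_hdr.csum_offset,
					     memory_order_relaxed);
		*sum += atomic_load_explicit(&shared_hdr.flags,
					     memory_order_relaxed);
	}
	return NULL;
}

/*
 * Runs the writer loop while the reader is active.
 * Returns the writer's elapsed time in nanoseconds, or -1 if the
 * reader thread could not be created.
 */
static long long run_contended(int use_check, long iters)
{
	pthread_t reader;
	unsigned long sum = 0;
	struct timespec t0, t1;

	atomic_store(&stop_reader, 0);
	if (pthread_create(&reader, NULL, reader_main, &sum) != 0)
		return -1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < iters; i++) {
		if (use_check) {
			ASSIGN_UNLESS_EQUAL_ATOMIC(shared_hdr.csum_start, 0);
			ASSIGN_UNLESS_EQUAL_ATOMIC(shared_hdr.csum_offset, 0);
			ASSIGN_UNLESS_EQUAL_ATOMIC(shared_hdr.flags, 0);
		} else {
			atomic_store_explicit(&shared_hdr.csum_start, 0,
					      memory_order_relaxed);
			atomic_store_explicit(&shared_hdr.csum_offset, 0,
					      memory_order_relaxed);
			atomic_store_explicit(&shared_hdr.flags, 0,
					      memory_order_relaxed);
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	atomic_store(&stop_reader, 1);
	pthread_join(reader, NULL);
	assert(sum == 0); /* the header never left zero */

	return (t1.tv_sec - t0.tv_sec) * 1000000000LL +
	       (t1.tv_nsec - t0.tv_nsec);
}
```

Calling run_contended(0, n) and run_contended(1, n) with the same n and
comparing the two return values would give the without-check and
with-check figures for the contended case. Pinning the two threads to
separate cores is left out of the sketch, and no result is claimed here
for any platform.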