Date: Thu, 14 Apr 2016 14:50:24 +0100
From: Bruce Richardson
To: dev@dpdk.org
Cc: Helin Zhang, Jingjing Wu
Message-ID: <20160414135024.GA25840@bricha3-MOBL3>
In-Reply-To: <1460628921-25635-1-git-send-email-bruce.richardson@intel.com>
Organization: Intel Shannon Ltd.
Subject: Re: [dpdk-dev] [PATCH] i40e: improve performance of vector PMD

On Thu, Apr 14, 2016 at 11:15:21AM +0100, Bruce Richardson wrote:
> An analysis of the i40e code using Intel® VTune™ Amplifier 2016 showed
> that the code was unexpectedly causing stalls due to "Loads blocked by
> Store Forwards". This can occur when a load from memory has to wait
> due to the prior store being to the same address, but being of a smaller
> size i.e. the stored value cannot be directly returned to the loader.
> [See ref: https://software.intel.com/en-us/node/544454]
>
> These stalls are due to the way in which the data_len values are handled
> in the driver. The lengths are extracted using vector operations, but those
> 16-bit lengths are then assigned using scalar operations i.e. 16-bit
> stores.
>
> These regular 16-bit stores actually have two effects in the code:
> * they cause the "Loads blocked by Store Forwards" issues reported
> * they also cause the previous loads in the RX function to actually be a
>   load followed by a store to an address on the stack, because the 16-bit
>   assignment can't be done to an xmm register.
>
> By converting the 16-bit store operations into a sequence of SSE blend
> operations, we can ensure that the descriptor loads only occur once, and
> avoid both the additional store and loads from the stack, as well as the
> stalls due to the second loads being blocked.
>
> Signed-off-by: Bruce Richardson
>

Self-NAK on this version. The blend instruction used is SSE4.1, so it breaks
the "default" build. Two obvious options to fix this:

1. Keep the old code, with SSE4.1 #ifdefs separating old and new.
2. Update the vPMD requirement to SSE4.1, and factor that in during the
   runtime selection of the RX code path.

Personally, I prefer the second option. Any objections?

/Bruce
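
For illustration only, option 2 could look roughly like the sketch below: the
RX burst routine is chosen once, and the vector path is picked only when the
CPU reports SSE4.1. Every name here (rx_burst_fn, rx_burst_scalar,
rx_burst_vec, cpu_has_sse41, select_rx_burst) is hypothetical and is not
taken from the i40e driver. The CPU check uses the GCC/Clang builtin
__builtin_cpu_supports() as a stand-in; inside DPDK the equivalent check
would presumably go through rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1).

```c
/* Sketch of option 2: select the RX burst routine at runtime.
 * All names are hypothetical; this is not i40e driver code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical burst signature: process n length entries, return the count. */
typedef uint16_t (*rx_burst_fn)(uint16_t *lens, uint16_t n);

/* Stand-in for the scalar RX path, which is safe on any CPU. */
static uint16_t rx_burst_scalar(uint16_t *lens, uint16_t n)
{
	assert(lens != NULL);
	(void)lens;
	return n;
}

/* Stand-in for the vector RX path that would rely on SSE4.1 blends. */
static uint16_t rx_burst_vec(uint16_t *lens, uint16_t n)
{
	assert(lens != NULL);
	(void)lens;
	return n;
}

/* Runtime CPU check. Assumption: a GCC or Clang build on x86; on any
 * other target this sketch simply reports "no SSE4.1". */
static bool cpu_has_sse41(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
	return __builtin_cpu_supports("sse4.1") != 0;
#else
	return false;
#endif
}

/* Use the vector path only if it is allowed by configuration AND the
 * CPU supports SSE4.1; otherwise fall back to the scalar path. */
static rx_burst_fn select_rx_burst(bool vec_allowed, bool has_sse41)
{
	return (vec_allowed && has_sse41) ? rx_burst_vec : rx_burst_scalar;
}
```

A caller would do something like
`rx_burst_fn fn = select_rx_burst(true, cpu_has_sse41());` once at queue
setup and then call `fn` per burst. With this shape the SSE4.1 code can live
in a file built with SSE4.1 enabled while the "default" build still works,
since that code is never reached on CPUs that lack the instruction.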