From: Bruce Richardson
To: dev@dpdk.org
Cc: Helin Zhang, Jingjing Wu, Bruce Richardson
Date: Thu, 14 Apr 2016 17:02:36 +0100
Message-Id: <1460649757-11862-3-git-send-email-bruce.richardson@intel.com>
In-Reply-To: <1460649757-11862-1-git-send-email-bruce.richardson@intel.com>
References: <1460628921-25635-1-git-send-email-bruce.richardson@intel.com>
 <1460649757-11862-1-git-send-email-bruce.richardson@intel.com>
Subject: [dpdk-dev] [PATCH v2 2/3] i40e: improve performance of vector PMD

An analysis of the i40e code using Intel® VTune™ Amplifier 2016 showed
that the code was unexpectedly causing stalls due to "Loads blocked by
Store Forwards". This can occur when a load from memory has to wait
because the prior store to the same address was of a smaller size, i.e.
the stored value cannot be forwarded directly to the load.
[See ref: https://software.intel.com/en-us/node/544454]

These stalls are due to the way in which the data_len values are handled
in the driver. The lengths are extracted using vector operations, but
those 16-bit lengths are then assigned using scalar operations, i.e.
16-bit stores.

These regular 16-bit stores actually have two effects in the code:
* they cause the "Loads blocked by Store Forwards" issues reported
* they also cause the previous loads in the RX function to actually be
  a load followed by a store to an address on the stack, because the
  16-bit assignment can't be done to an xmm register.

By converting the 16-bit store operations into a sequence of SSE blend
operations, we can ensure that the descriptor loads only occur once,
and avoid both the additional stores and loads from the stack, as well
as the stalls due to the blocked loads.

Signed-off-by: Bruce Richardson
---
 drivers/net/i40e/i40e_rxtx_vec.c | 24 ++++++++++--------------
 1 file changed, 10 insertions(+), 14 deletions(-)

diff --git a/drivers/net/i40e/i40e_rxtx_vec.c b/drivers/net/i40e/i40e_rxtx_vec.c
index 1e2fadd..9f67f9d 100644
--- a/drivers/net/i40e/i40e_rxtx_vec.c
+++ b/drivers/net/i40e/i40e_rxtx_vec.c
@@ -192,11 +192,7 @@ desc_to_olflags_v(__m128i descs[4], struct rte_mbuf **rx_pkts)
 static inline void
 desc_pktlen_align(__m128i descs[4])
 {
-	__m128i pktlen0, pktlen1, zero;
-	union {
-		uint16_t e[4];
-		uint64_t dword;
-	} vol;
+	__m128i pktlen0, pktlen1;
 
 	/* mask everything except pktlen field*/
 	const __m128i pktlen_msk = _mm_set_epi32(PKTLEN_MASK, PKTLEN_MASK,
@@ -206,18 +202,18 @@ desc_pktlen_align(__m128i descs[4])
 	pktlen1 = _mm_unpackhi_epi32(descs[1], descs[3]);
 	pktlen0 = _mm_unpackhi_epi32(pktlen0, pktlen1);
 
-	zero = _mm_xor_si128(pktlen0, pktlen0);
-
 	pktlen0 = _mm_srli_epi32(pktlen0, PKTLEN_SHIFT);
 	pktlen0 = _mm_and_si128(pktlen0, pktlen_msk);
 
-	pktlen0 = _mm_packs_epi32(pktlen0, zero);
-	vol.dword = _mm_cvtsi128_si64(pktlen0);
-	/* let the descriptor byte 15-14 store the pkt len */
-	*((uint16_t *)&descs[0]+7) = vol.e[0];
-	*((uint16_t *)&descs[1]+7) = vol.e[1];
-	*((uint16_t *)&descs[2]+7) = vol.e[2];
-	*((uint16_t *)&descs[3]+7) = vol.e[3];
+	pktlen0 = _mm_packs_epi32(pktlen0, pktlen0);
+
+	descs[3] = _mm_blend_epi16(descs[3], pktlen0, 0x80);
+	pktlen0 = _mm_slli_epi64(pktlen0, 16);
+	descs[2] = _mm_blend_epi16(descs[2], pktlen0, 0x80);
+	pktlen0 = _mm_slli_epi64(pktlen0, 16);
+	descs[1] = _mm_blend_epi16(descs[1], pktlen0, 0x80);
+	pktlen0 = _mm_slli_epi64(pktlen0, 16);
+	descs[0] = _mm_blend_epi16(descs[0], pktlen0, 0x80);
 }
 
 /*
--
2.5.5
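
For readers who want to check the lane arithmetic in the blend sequence by hand, the following is a minimal portable sketch and not driver code. It models a 128-bit register as eight 16-bit words and re-implements the lane behaviour of _mm_blend_epi16 and _mm_slli_epi64 in plain C. The names vec128, blend_epi16, slli_epi64_16, store_lens_blend, store_lens_scalar and blend_matches_scalar are hypothetical and exist only for this illustration. It assumes the four lengths have already been shifted, masked and packed, so that the same four 16-bit values sit in both 64-bit halves of the register, and that they are small enough for the signed pack not to saturate.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A 128-bit register modelled as eight 16-bit words, word 0 lowest. */
typedef struct { uint16_t w[8]; } vec128;

/* Model of _mm_blend_epi16(a, b, mask): bit i of mask picks word i from b. */
static vec128 blend_epi16(vec128 a, vec128 b, unsigned mask)
{
	vec128 r;

	assert(mask <= 0xFF); /* the intrinsic takes an 8-bit immediate */
	for (int i = 0; i < 8; i++)
		r.w[i] = ((mask >> i) & 1u) ? b.w[i] : a.w[i];
	return r;
}

/* Model of _mm_slli_epi64(a, 16): each 64-bit half moves up by one word. */
static vec128 slli_epi64_16(vec128 a)
{
	vec128 r;

	for (int half = 0; half < 2; half++) {
		int base = half * 4;

		r.w[base] = 0;
		for (int i = 1; i < 4; i++)
			r.w[base + i] = a.w[base + i - 1];
	}
	return r;
}

/* Blend sequence from the patch: len[i] ends up in word 7 of descs[i]. */
static void store_lens_blend(vec128 descs[4], const uint16_t len[4])
{
	/* Assumed state after _mm_packs_epi32(pktlen0, pktlen0). */
	vec128 pktlen = { { len[0], len[1], len[2], len[3],
			    len[0], len[1], len[2], len[3] } };

	for (int i = 3; i >= 0; i--) {
		/* mask 0x80 takes only word 7 (bytes 15-14) from pktlen */
		descs[i] = blend_epi16(descs[i], pktlen, 0x80);
		if (i > 0)
			pktlen = slli_epi64_16(pktlen);
	}
}

/* The scalar 16-bit stores that the patch removes, on the same model. */
static void store_lens_scalar(vec128 descs[4], const uint16_t len[4])
{
	for (int i = 0; i < 4; i++)
		descs[i].w[7] = len[i];
}

/* Returns 1 if both approaches leave identical descriptors. */
static int blend_matches_scalar(const uint16_t len[4])
{
	vec128 a[4], b[4];

	memset(a, 0xAB, sizeof(a)); /* arbitrary non-zero descriptor contents */
	memcpy(b, a, sizeof(b));
	store_lens_blend(a, len);
	store_lens_scalar(b, len);
	return memcmp(a, b, sizeof(a)) == 0;
}
```

Calling blend_matches_scalar() with any four lengths should return 1: each blend touches only word 7 of its descriptor, and each 16-bit shift moves the next lower length into that word, which is why the sequence runs from descs[3] down to descs[0].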