* [dpdk-dev] [PATCH] i40e: improve performance of vector PMD
From: Bruce Richardson @ 2016-04-14 10:15 UTC
To: dev; +Cc: Helin Zhang, Jingjing Wu, Bruce Richardson

An analysis of the i40e code using Intel® VTune™ Amplifier 2016 showed
that the code was unexpectedly causing stalls due to "Loads blocked by
Store Forwards". This can occur when a load from memory has to wait
due to the prior store being to the same address, but being of a smaller
size i.e. the stored value cannot be directly returned to the loader.
[See ref: https://software.intel.com/en-us/node/544454]

These stalls are due to the way in which the data_len values are handled
in the driver. The lengths are extracted using vector operations, but those
16-bit lengths are then assigned using scalar operations i.e. 16-bit
stores.

These regular 16-bit stores actually have two effects in the code:
* they cause the "Loads blocked by Store Forwards" issues reported
* they also cause the previous loads in the RX function to actually be a
  load followed by a store to an address on the stack, because the 16-bit
  assignment can't be done to an xmm register.

By converting the 16-bit stores operations into a sequence of SSE blend
operations, we can ensure that the descriptor loads only occur once, and
avoid both the additional store and loads from the stack, as well as the
stalls due to the second loads being blocked.

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 drivers/net/i40e/i40e_rxtx_vec.c | 24 ++++++++++--------------
 1 file changed, 10 insertions(+), 14 deletions(-)

diff --git a/drivers/net/i40e/i40e_rxtx_vec.c b/drivers/net/i40e/i40e_rxtx_vec.c
index 047aff5..d0a0cc9 100644
--- a/drivers/net/i40e/i40e_rxtx_vec.c
+++ b/drivers/net/i40e/i40e_rxtx_vec.c
@@ -192,11 +192,7 @@ desc_to_olflags_v(__m128i descs[4], struct rte_mbuf **rx_pkts)
 static inline void
 desc_pktlen_align(__m128i descs[4])
 {
-	__m128i pktlen0, pktlen1, zero;
-	union {
-		uint16_t e[4];
-		uint64_t dword;
-	} vol;
+	__m128i pktlen0, pktlen1;
 
 	/* mask everything except pktlen field*/
 	const __m128i pktlen_msk = _mm_set_epi32(PKTLEN_MASK, PKTLEN_MASK,
@@ -206,18 +202,18 @@ desc_pktlen_align(__m128i descs[4])
 	pktlen1 = _mm_unpackhi_epi32(descs[1], descs[3]);
 	pktlen0 = _mm_unpackhi_epi32(pktlen0, pktlen1);
 
-	zero = _mm_xor_si128(pktlen0, pktlen0);
-
 	pktlen0 = _mm_srli_epi32(pktlen0, PKTLEN_SHIFT);
 	pktlen0 = _mm_and_si128(pktlen0, pktlen_msk);
 
-	pktlen0 = _mm_packs_epi32(pktlen0, zero);
-	vol.dword = _mm_cvtsi128_si64(pktlen0);
-	/* let the descriptor byte 15-14 store the pkt len */
-	*((uint16_t *)&descs[0]+7) = vol.e[0];
-	*((uint16_t *)&descs[1]+7) = vol.e[1];
-	*((uint16_t *)&descs[2]+7) = vol.e[2];
-	*((uint16_t *)&descs[3]+7) = vol.e[3];
+	pktlen0 = _mm_packs_epi32(pktlen0, pktlen0);
+
+	descs[3] = _mm_blend_epi16(descs[3], pktlen0, 0x80);
+	pktlen0 = _mm_slli_epi64(pktlen0, 16);
+	descs[2] = _mm_blend_epi16(descs[2], pktlen0, 0x80);
+	pktlen0 = _mm_slli_epi64(pktlen0, 16);
+	descs[1] = _mm_blend_epi16(descs[1], pktlen0, 0x80);
+	pktlen0 = _mm_slli_epi64(pktlen0, 16);
+	descs[0] = _mm_blend_epi16(descs[0], pktlen0, 0x80);
 }
 
 /*
-- 
2.5.5

* Re: [dpdk-dev] [PATCH] i40e: improve performance of vector PMD
From: Bruce Richardson @ 2016-04-14 13:50 UTC
To: dev; +Cc: Helin Zhang, Jingjing Wu

On Thu, Apr 14, 2016 at 11:15:21AM +0100, Bruce Richardson wrote:
> An analysis of the i40e code using Intel® VTune™ Amplifier 2016 showed
> that the code was unexpectedly causing stalls due to "Loads blocked by
> Store Forwards". This can occur when a load from memory has to wait
> due to the prior store being to the same address, but being of a smaller
> size i.e. the stored value cannot be directly returned to the loader.
> [See ref: https://software.intel.com/en-us/node/544454]
>
> These stalls are due to the way in which the data_len values are handled
> in the driver. The lengths are extracted using vector operations, but those
> 16-bit lengths are then assigned using scalar operations i.e. 16-bit
> stores.
>
> These regular 16-bit stores actually have two effects in the code:
> * they cause the "Loads blocked by Store Forwards" issues reported
> * they also cause the previous loads in the RX function to actually be a
>   load followed by a store to an address on the stack, because the 16-bit
>   assignment can't be done to an xmm register.
>
> By converting the 16-bit stores operations into a sequence of SSE blend
> operations, we can ensure that the descriptor loads only occur once, and
> avoid both the additional store and loads from the stack, as well as the
> stalls due to the second loads being blocked.
>
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
>
Self-NAK on this version. The blend instruction used is SSE4.1 so breaks the
"default" build.

Two obvious options to fix this:
1. Keep the old code with SSE4.1 #ifdefs separating old and new
2. Update the vpmd requirement to SSE4.1, and factor that in during runtime
   select of the RX code path.

Personally, I prefer the second option. Any objections?

/Bruce

* Re: [dpdk-dev] [PATCH] i40e: improve performance of vector PMD
From: Ananyev, Konstantin @ 2016-04-14 14:00 UTC
To: Richardson, Bruce, dev; +Cc: Zhang, Helin, Wu, Jingjing

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> Sent: Thursday, April 14, 2016 2:50 PM
> To: dev@dpdk.org
> Cc: Zhang, Helin; Wu, Jingjing
> Subject: Re: [dpdk-dev] [PATCH] i40e: improve performance of vector PMD
>
> On Thu, Apr 14, 2016 at 11:15:21AM +0100, Bruce Richardson wrote:
> > An analysis of the i40e code using Intel® VTune™ Amplifier 2016 showed
> > that the code was unexpectedly causing stalls due to "Loads blocked by
> > Store Forwards". This can occur when a load from memory has to wait
> > due to the prior store being to the same address, but being of a smaller
> > size i.e. the stored value cannot be directly returned to the loader.
> > [See ref: https://software.intel.com/en-us/node/544454]
> >
> > These stalls are due to the way in which the data_len values are handled
> > in the driver. The lengths are extracted using vector operations, but those
> > 16-bit lengths are then assigned using scalar operations i.e. 16-bit
> > stores.
> >
> > These regular 16-bit stores actually have two effects in the code:
> > * they cause the "Loads blocked by Store Forwards" issues reported
> > * they also cause the previous loads in the RX function to actually be a
> >   load followed by a store to an address on the stack, because the 16-bit
> >   assignment can't be done to an xmm register.
> >
> > By converting the 16-bit stores operations into a sequence of SSE blend
> > operations, we can ensure that the descriptor loads only occur once, and
> > avoid both the additional store and loads from the stack, as well as the
> > stalls due to the second loads being blocked.
> >
> > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> >
> Self-NAK on this version. The blend instruction used is SSE4.1 so breaks the
> "default" build.
>
> Two obvious options to fix this:
> 1. Keep the old code with SSE4.1 #ifdefs separating old and new
> 2. Update the vpmd requirement to SSE4.1, and factor that in during runtime
>    select of the RX code path.
>
> Personally, I prefer the second option. Any objections?

+1 for second one.

>
> /Bruce

* Re: [dpdk-dev] [PATCH] i40e: improve performance of vector PMD
From: Iremonger, Bernard @ 2016-04-14 15:33 UTC
To: Ananyev, Konstantin, Richardson, Bruce, dev; +Cc: Zhang, Helin, Wu, Jingjing

Hi Bruce,

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev,
> Konstantin
> Sent: Thursday, April 14, 2016 3:00 PM
> To: Richardson, Bruce <bruce.richardson@intel.com>; dev@dpdk.org
> Cc: Zhang, Helin <helin.zhang@intel.com>; Wu, Jingjing
> <jingjing.wu@intel.com>
> Subject: Re: [dpdk-dev] [PATCH] i40e: improve performance of vector PMD
>
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> > Sent: Thursday, April 14, 2016 2:50 PM
> > To: dev@dpdk.org
> > Cc: Zhang, Helin; Wu, Jingjing
> > Subject: Re: [dpdk-dev] [PATCH] i40e: improve performance of vector
> > PMD
> >
> > On Thu, Apr 14, 2016 at 11:15:21AM +0100, Bruce Richardson wrote:
> > > An analysis of the i40e code using Intel® VTune™ Amplifier 2016
> > > showed that the code was unexpectedly causing stalls due to "Loads
> > > blocked by Store Forwards". This can occur when a load from memory
> > > has to wait due to the prior store being to the same address, but
> > > being of a smaller size i.e. the stored value cannot be directly
> > > returned to the loader.
> > > [See ref: https://software.intel.com/en-us/node/544454]
> > >
> > > These stalls are due to the way in which the data_len values are
> > > handled in the driver. The lengths are extracted using vector
> > > operations, but those 16-bit lengths are then assigned using scalar
> > > operations i.e. 16-bit stores.
> > >
> > > These regular 16-bit stores actually have two effects in the code:
> > > * they cause the "Loads blocked by Store Forwards" issues reported
> > > * they also cause the previous loads in the RX function to actually
> > >   be a load followed by a store to an address on the stack, because
> > >   the 16-bit assignment can't be done to an xmm register.
> > >
> > > By converting the 16-bit stores operations into a sequence of SSE
> > > blend operations, we can ensure that the descriptor loads only occur
> > > once, and avoid both the additional store and loads from the stack,
> > > as well as the stalls due to the second loads being blocked.
> > >
> > > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > >
> > Self-NAK on this version. The blend instruction used is SSE4.1 so
> > breaks the "default" build.
> >
> > Two obvious options to fix this:
> > 1. Keep the old code with SSE4.1 #ifdefs separating old and new
> > 2. Update the vpmd requirement to SSE4.1, and factor that in during
> >    runtime select of the RX code path.
> >
> > Personally, I prefer the second option. Any objections?
>
> +1 for second one.
>
> >
> > /Bruce

I am using the "default" build when building in VMs. Will both options work for me?

Regards,

Bernard.

* [dpdk-dev] [PATCH v2 0/3] improve i40e vpmd
From: Bruce Richardson @ 2016-04-14 16:02 UTC
To: dev; +Cc: Helin Zhang, Jingjing Wu, Bruce Richardson

This patchset improves the performance of the i40e SSE pmd by removing
operations that triggered CPU stalls. It also shortens the code and
cleans it up a little.

The base requirement for using the SSE code path has been pushed up to
SSE4.1 from SSE3, due to the use of the blend instruction. The instruction
set level is now checked at runtime as part of the driver selection process

Bruce Richardson (3):
  i40e: require SSE4.1 support for vector driver
  i40e: improve performance of vector PMD
  i40e: simplify SSE packet length extraction code

 drivers/net/i40e/Makefile        |  6 ++++
 drivers/net/i40e/i40e_rxtx_vec.c | 59 ++++++++++++++--------------------------
 2 files changed, 27 insertions(+), 38 deletions(-)

-- 
2.5.5

* [dpdk-dev] [PATCH v2 1/3] i40e: require SSE4.1 support for vector driver
From: Bruce Richardson @ 2016-04-14 16:02 UTC
To: dev; +Cc: Helin Zhang, Jingjing Wu, Bruce Richardson

Later commits to improve the driver will make use of the SSE4.1
_mm_blend_epi16 intrinsic, so:
* set the compilation level to always have SSE4.1 support,
* and add in a runtime check for SSE4.1 as part of the condition checks
  for vector driver selection.

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 drivers/net/i40e/Makefile        | 6 ++++++
 drivers/net/i40e/i40e_rxtx_vec.c | 4 ++++
 2 files changed, 10 insertions(+)

diff --git a/drivers/net/i40e/Makefile b/drivers/net/i40e/Makefile
index 6dd6eaa..56b20d5 100644
--- a/drivers/net/i40e/Makefile
+++ b/drivers/net/i40e/Makefile
@@ -102,6 +102,12 @@ SRCS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += i40e_ethdev_vf.c
 SRCS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += i40e_pf.c
 SRCS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += i40e_fdir.c
 
+# vector PMD driver needs SSE4.1 support
+ifeq ($(findstring RTE_MACHINE_CPUFLAG_SSE4_1,$(CFLAGS)),)
+CFLAGS_i40e_rxtx_vec.o += -msse4.1
+endif
+
+
 # this lib depends upon:
 DEPDIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += lib/librte_eal lib/librte_ether
 DEPDIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += lib/librte_mempool lib/librte_mbuf
diff --git a/drivers/net/i40e/i40e_rxtx_vec.c b/drivers/net/i40e/i40e_rxtx_vec.c
index 047aff5..1e2fadd 100644
--- a/drivers/net/i40e/i40e_rxtx_vec.c
+++ b/drivers/net/i40e/i40e_rxtx_vec.c
@@ -751,6 +751,10 @@ i40e_rx_vec_dev_conf_condition_check(struct rte_eth_dev *dev)
 	struct rte_eth_rxmode *rxmode = &dev->data->dev_conf.rxmode;
 	struct rte_fdir_conf *fconf = &dev->data->dev_conf.fdir_conf;
 
+	/* need SSE4.1 support */
+	if (!rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
+		return -1;
+
 #ifndef RTE_LIBRTE_I40E_RX_OLFLAGS_ENABLE
 	/* whithout rx ol_flags, no VP flag report */
 	if (rxmode->hw_vlan_strip != 0 ||
-- 
2.5.5

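The runtime gate in the patch above can be sketched outside the DPDK tree
roughly as follows. This is only an illustration under stated assumptions:
rte_cpu_get_flag_enabled() is not available standalone, so the GCC/clang
builtin __builtin_cpu_supports() is used as a stand-in, and the names
cpu_has_sse41(), vec_path_condition_check() and select_rx_path() are
hypothetical, not taken from the driver.

```c
#include <stdio.h>

/*
 * Assumption: stand-in for rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1).
 * On non-x86 targets the sketch simply reports "not supported".
 */
static int cpu_has_sse41(void)
{
#if defined(__x86_64__) || defined(__i386__)
	return __builtin_cpu_supports("sse4.1") ? 1 : 0;
#else
	return 0;
#endif
}

/*
 * Hypothetical condition check, mirroring the shape of the driver's
 * check: 0 means the vector path may be used, -1 means it may not.
 */
static int vec_path_condition_check(int has_sse41)
{
	/* need SSE4.1 support */
	if (!has_sse41)
		return -1;
	return 0;
}

/* Hypothetical selector: pick the RX path once, at setup time. */
static const char *select_rx_path(void)
{
	if (vec_path_condition_check(cpu_has_sse41()) == 0)
		return "vector";
	return "scalar";
}
```

Calling select_rx_path() once during queue setup and caching the result
keeps the feature test out of the per-packet path; the file containing the
SSE4.1 intrinsics can then be built with -msse4.1 even when the rest of
the build targets an older baseline.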
* [dpdk-dev] [PATCH v2 2/3] i40e: improve performance of vector PMD
From: Bruce Richardson @ 2016-04-14 16:02 UTC
To: dev; +Cc: Helin Zhang, Jingjing Wu, Bruce Richardson

An analysis of the i40e code using Intel® VTune™ Amplifier 2016 showed
that the code was unexpectedly causing stalls due to "Loads blocked by
Store Forwards". This can occur when a load from memory has to wait
due to the prior store being to the same address, but being of a smaller
size i.e. the stored value cannot be directly returned to the loader.
[See ref: https://software.intel.com/en-us/node/544454]

These stalls are due to the way in which the data_len values are handled
in the driver. The lengths are extracted using vector operations, but those
16-bit lengths are then assigned using scalar operations i.e. 16-bit
stores.

These regular 16-bit stores actually have two effects in the code:
* they cause the "Loads blocked by Store Forwards" issues reported
* they also cause the previous loads in the RX function to actually be a
  load followed by a store to an address on the stack, because the 16-bit
  assignment can't be done to an xmm register.

By converting the 16-bit store operations into a sequence of SSE blend
operations, we can ensure that the descriptor loads only occur once, and
avoid both the additional stores and loads from the stack, as well as the
stalls due to the blocked loads.

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 drivers/net/i40e/i40e_rxtx_vec.c | 24 ++++++++++--------------
 1 file changed, 10 insertions(+), 14 deletions(-)

diff --git a/drivers/net/i40e/i40e_rxtx_vec.c b/drivers/net/i40e/i40e_rxtx_vec.c
index 1e2fadd..9f67f9d 100644
--- a/drivers/net/i40e/i40e_rxtx_vec.c
+++ b/drivers/net/i40e/i40e_rxtx_vec.c
@@ -192,11 +192,7 @@ desc_to_olflags_v(__m128i descs[4], struct rte_mbuf **rx_pkts)
 static inline void
 desc_pktlen_align(__m128i descs[4])
 {
-	__m128i pktlen0, pktlen1, zero;
-	union {
-		uint16_t e[4];
-		uint64_t dword;
-	} vol;
+	__m128i pktlen0, pktlen1;
 
 	/* mask everything except pktlen field*/
 	const __m128i pktlen_msk = _mm_set_epi32(PKTLEN_MASK, PKTLEN_MASK,
@@ -206,18 +202,18 @@ desc_pktlen_align(__m128i descs[4])
 	pktlen1 = _mm_unpackhi_epi32(descs[1], descs[3]);
 	pktlen0 = _mm_unpackhi_epi32(pktlen0, pktlen1);
 
-	zero = _mm_xor_si128(pktlen0, pktlen0);
-
 	pktlen0 = _mm_srli_epi32(pktlen0, PKTLEN_SHIFT);
 	pktlen0 = _mm_and_si128(pktlen0, pktlen_msk);
 
-	pktlen0 = _mm_packs_epi32(pktlen0, zero);
-	vol.dword = _mm_cvtsi128_si64(pktlen0);
-	/* let the descriptor byte 15-14 store the pkt len */
-	*((uint16_t *)&descs[0]+7) = vol.e[0];
-	*((uint16_t *)&descs[1]+7) = vol.e[1];
-	*((uint16_t *)&descs[2]+7) = vol.e[2];
-	*((uint16_t *)&descs[3]+7) = vol.e[3];
+	pktlen0 = _mm_packs_epi32(pktlen0, pktlen0);
+
+	descs[3] = _mm_blend_epi16(descs[3], pktlen0, 0x80);
+	pktlen0 = _mm_slli_epi64(pktlen0, 16);
+	descs[2] = _mm_blend_epi16(descs[2], pktlen0, 0x80);
+	pktlen0 = _mm_slli_epi64(pktlen0, 16);
+	descs[1] = _mm_blend_epi16(descs[1], pktlen0, 0x80);
+	pktlen0 = _mm_slli_epi64(pktlen0, 16);
+	descs[0] = _mm_blend_epi16(descs[0], pktlen0, 0x80);
 }
 
 /*
-- 
2.5.5

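To see why the pack/shift/blend sequence in the patch above lands each
length in the top 16-bit word of its own descriptor, the steps can be
traced in a small standalone sketch. It assumes an x86 target and a
GCC/clang-style target attribute so that it builds without -msse4.1; the
names scatter_lengths() and top_word() are hypothetical and not part of
the driver. The input lens32 plays the role of pktlen0 after the shift
and mask, i.e. four already-extracted lengths, one per 32-bit lane.

```c
#include <immintrin.h>
#include <stdint.h>

/*
 * Hypothetical helper: write four lengths (one per 32-bit lane of
 * lens32, lane i belonging to descs[i]) into word 7 of each descriptor
 * using register-only operations, as the patch does.
 */
__attribute__((target("sse4.1")))
static void scatter_lengths(__m128i descs[4], __m128i lens32)
{
	/* words become [l0 l1 l2 l3 l0 l1 l2 l3]; word 7 holds l3 */
	__m128i l = _mm_packs_epi32(lens32, lens32);

	/* mask 0x80 selects only word 7 from the second operand */
	descs[3] = _mm_blend_epi16(descs[3], l, 0x80);
	/* shift each 64-bit half up by one word; word 7 now holds l2 */
	l = _mm_slli_epi64(l, 16);
	descs[2] = _mm_blend_epi16(descs[2], l, 0x80);
	l = _mm_slli_epi64(l, 16);	/* word 7 now holds l1 */
	descs[1] = _mm_blend_epi16(descs[1], l, 0x80);
	l = _mm_slli_epi64(l, 16);	/* word 7 now holds l0 */
	descs[0] = _mm_blend_epi16(descs[0], l, 0x80);
}

/* Hypothetical helper: read back the top 16-bit word (bytes 15-14). */
static uint16_t top_word(__m128i v)
{
	return (uint16_t)_mm_extract_epi16(v, 7);
}
```

Because every step takes and returns a vector value, the compiler is free
to keep descs[] in xmm registers; there is no narrow store to memory
followed by a wider load of the same location, which is the pattern the
commit message identifies as the source of the stalls.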
* [dpdk-dev] [PATCH v2 3/3] i40e: simplify SSE packet length extraction code
From: Bruce Richardson @ 2016-04-14 16:02 UTC
To: dev; +Cc: Helin Zhang, Jingjing Wu, Bruce Richardson

In Table 8-16 of the "Intel® Ethernet Controller XL710 Datasheet" it is
stated that when the whole packet is written to a single buffer, the
header length field in the descriptor will be 0. This means that when
extracting the packet/data_len field from the descriptor in the driver
we do not need to mask out the extra header-length bits.

Inside the vector driver, this reduces the need to pull all four pktlen
fields into a single register to work on. Instead of a shift and mask,
we now need to only do a shift. Therefore, we can work on each descriptor
independently, processing each using one shift intrinsic and a blend.

This change makes the code shorter and easier to read, so we can pull it
into the main descriptor processing loop instead of needing its own
function. This in turn makes the descriptor processing in the loop as a
whole slightly easier to read as it's more linear.

In terms of performance, in testing this change shows little effect, with
single-core perf tests showing a very slight improvement.

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 drivers/net/i40e/i40e_rxtx_vec.c | 51 ++++++++++++++--------------------------
 1 file changed, 17 insertions(+), 34 deletions(-)

diff --git a/drivers/net/i40e/i40e_rxtx_vec.c b/drivers/net/i40e/i40e_rxtx_vec.c
index 9f67f9d..f7a62a8 100644
--- a/drivers/net/i40e/i40e_rxtx_vec.c
+++ b/drivers/net/i40e/i40e_rxtx_vec.c
@@ -184,37 +184,7 @@ desc_to_olflags_v(__m128i descs[4], struct rte_mbuf **rx_pkts)
 #define desc_to_olflags_v(desc, rx_pkts) do {} while (0)
 #endif
 
-#define PKTLEN_SHIFT (6)
-#define PKTLEN_MASK (0x3FFF)
-/* Handling the pkt len field is not aligned with 1byte, so shift is
- * needed to let it align
- */
-static inline void
-desc_pktlen_align(__m128i descs[4])
-{
-	__m128i pktlen0, pktlen1;
-
-	/* mask everything except pktlen field*/
-	const __m128i pktlen_msk = _mm_set_epi32(PKTLEN_MASK, PKTLEN_MASK,
-			PKTLEN_MASK, PKTLEN_MASK);
-
-	pktlen0 = _mm_unpackhi_epi32(descs[0], descs[2]);
-	pktlen1 = _mm_unpackhi_epi32(descs[1], descs[3]);
-	pktlen0 = _mm_unpackhi_epi32(pktlen0, pktlen1);
-
-	pktlen0 = _mm_srli_epi32(pktlen0, PKTLEN_SHIFT);
-	pktlen0 = _mm_and_si128(pktlen0, pktlen_msk);
-
-	pktlen0 = _mm_packs_epi32(pktlen0, pktlen0);
-
-	descs[3] = _mm_blend_epi16(descs[3], pktlen0, 0x80);
-	pktlen0 = _mm_slli_epi64(pktlen0, 16);
-	descs[2] = _mm_blend_epi16(descs[2], pktlen0, 0x80);
-	pktlen0 = _mm_slli_epi64(pktlen0, 16);
-	descs[1] = _mm_blend_epi16(descs[1], pktlen0, 0x80);
-	pktlen0 = _mm_slli_epi64(pktlen0, 16);
-	descs[0] = _mm_blend_epi16(descs[0], pktlen0, 0x80);
-}
+#define PKTLEN_SHIFT 10
 
 /*
  * Notice:
@@ -333,12 +303,17 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 			rte_prefetch0(&rx_pkts[pos + 3]->cacheline1);
 		}
 
-		/*shift the pktlen field*/
-		desc_pktlen_align(descs);
-
 		/* avoid compiler reorder optimization */
 		rte_compiler_barrier();
 
+		/* pkt 3,4 shift the pktlen field to be 16-bit aligned*/
+		const __m128i len3 = _mm_slli_epi32(descs[3], PKTLEN_SHIFT);
+		const __m128i len2 = _mm_slli_epi32(descs[2], PKTLEN_SHIFT);
+
+		/* merge the now-aligned packet length fields back in */
+		descs[3] = _mm_blend_epi16(descs[3], len3, 0x80);
+		descs[2] = _mm_blend_epi16(descs[2], len2, 0x80);
+
 		/* D.1 pkt 3,4 convert format from desc to pktmbuf */
 		pkt_mb4 = _mm_shuffle_epi8(descs[3], shuf_msk);
 		pkt_mb3 = _mm_shuffle_epi8(descs[2], shuf_msk);
@@ -354,6 +329,14 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 		pkt_mb4 = _mm_add_epi16(pkt_mb4, crc_adjust);
 		pkt_mb3 = _mm_add_epi16(pkt_mb3, crc_adjust);
 
+		/* pkt 1,2 shift the pktlen field to be 16-bit aligned*/
+		const __m128i len1 = _mm_slli_epi32(descs[1], PKTLEN_SHIFT);
+		const __m128i len0 = _mm_slli_epi32(descs[0], PKTLEN_SHIFT);
+
+		/* merge the now-aligned packet length fields back in */
+		descs[1] = _mm_blend_epi16(descs[1], len1, 0x80);
+		descs[0] = _mm_blend_epi16(descs[0], len0, 0x80);
+
 		/* D.1 pkt 1,2 convert format from desc to pktmbuf */
 		pkt_mb2 = _mm_shuffle_epi8(descs[1], shuf_msk);
 		pkt_mb1 = _mm_shuffle_epi8(descs[0], shuf_msk);
-- 
2.5.5

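The equivalence this patch relies on can be checked with a scalar model of
a single 32-bit lane. The sketch below is an illustration only: it models
the top 32-bit dword of one descriptor, takes the old constants (shift 6,
mask 0x3FFF) and the new shift of 10 from the diff above, and uses
hypothetical function names. It assumes, as the commit message states,
that the bits just above the length field are zero when the whole packet
is written to a single buffer.

```c
#include <stdint.h>

#define OLD_PKTLEN_SHIFT 6	/* from the removed PKTLEN_SHIFT */
#define OLD_PKTLEN_MASK  0x3FFF	/* from the removed PKTLEN_MASK */
#define NEW_PKTLEN_SHIFT 10	/* from the new PKTLEN_SHIFT */

/* Old approach: shift the field down, then mask off everything else. */
static uint16_t pktlen_shift_and_mask(uint32_t top_dword)
{
	return (uint16_t)((top_dword >> OLD_PKTLEN_SHIFT) & OLD_PKTLEN_MASK);
}

/*
 * New approach: shift the field up so it starts at bit 16, then take
 * the upper 16-bit word, which is what the blend with mask 0x80 picks.
 * Bits below the field are shifted out of the upper word; the two bits
 * above it land in bits 14-15 of the result and are NOT masked.
 */
static uint16_t pktlen_shift_only(uint32_t top_dword)
{
	return (uint16_t)((uint32_t)(top_dword << NEW_PKTLEN_SHIFT) >> 16);
}
```

For a dword such as (1500u << 6) | 0x3F both functions give 1500, since
the low six bits never reach the upper word. If bit 20 were set, the
shift-only version would return 1500 + 16384 while the masked version
would still return 1500, which is why the simplification depends on the
header-length field being zero.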
* Re: [dpdk-dev] [PATCH v2 0/3] improve i40e vpmd
From: Zhe Tao @ 2016-04-17 8:32 UTC
To: Bruce Richardson; +Cc: dev, Helin Zhang, Jingjing Wu

On Thu, Apr 14, 2016 at 05:02:34PM +0100, Bruce Richardson wrote:
> This patchset improves the performance of the i40e SSE pmd by removing
> operations that triggered CPU stalls. It also shortens the code and
> cleans it up a little.
>
> The base requirement for using the SSE code path has been pushed up to
> SSE4.1 from SSE3, due to the use of the blend instruction. The instruction
> set level is now checked at runtime as part of the driver selection process
>
> Bruce Richardson (3):
>   i40e: require SSE4.1 support for vector driver
>   i40e: improve performance of vector PMD
>   i40e: simplify SSE packet length extraction code
>
>  drivers/net/i40e/Makefile        |  6 ++++
>  drivers/net/i40e/i40e_rxtx_vec.c | 59 ++++++++++++++--------------------------
>  2 files changed, 27 insertions(+), 38 deletions(-)
>
> --
> 2.5.5
Acked-by: Zhe Tao <zhe.tao@intel.com>

* Re: [dpdk-dev] [PATCH v2 0/3] improve i40e vpmd
From: Bruce Richardson @ 2016-04-27 16:30 UTC
To: Zhe Tao; +Cc: dev, Helin Zhang, Jingjing Wu

On Sun, Apr 17, 2016 at 04:32:10PM +0800, Zhe Tao wrote:
> On Thu, Apr 14, 2016 at 05:02:34PM +0100, Bruce Richardson wrote:
> > This patchset improves the performance of the i40e SSE pmd by removing
> > operations that triggered CPU stalls. It also shortens the code and
> > cleans it up a little.
> >
> > The base requirement for using the SSE code path has been pushed up to
> > SSE4.1 from SSE3, due to the use of the blend instruction. The instruction
> > set level is now checked at runtime as part of the driver selection process
> >
> > Bruce Richardson (3):
> >   i40e: require SSE4.1 support for vector driver
> >   i40e: improve performance of vector PMD
> >   i40e: simplify SSE packet length extraction code
> >
> >  drivers/net/i40e/Makefile        |  6 ++++
> >  drivers/net/i40e/i40e_rxtx_vec.c | 59 ++++++++++++++--------------------------
> >  2 files changed, 27 insertions(+), 38 deletions(-)
> >
> > --
> > 2.5.5
> Acked-by: Zhe Tao <zhe.tao@intel.com>

Applied to dpdk-next-net/rel_16_07

/Bruce