* [dpdk-dev] [PATCH] ixgbe: convert sse intrinsics to use __builtin variants @ 2014-07-24 18:23 Neil Horman 2014-07-24 20:41 ` Venkatesan, Venky 2014-07-26 11:15 ` Ananyev, Konstantin 0 siblings, 2 replies; 5+ messages in thread From: Neil Horman @ 2014-07-24 18:23 UTC (permalink / raw) To: dev The ixgbe pmd currently can't be built without enabling sse instructions at compile time. While sse extensions provide better performance, theres no reason that we can't still create builds to run on systems that don't support sse. If we modify the ixgbe code to use the __builtin_shuffle and __builtin_popcountll functions, I've confirmed that the gcc compiler emits the appropriate sse instructions when the provided -march parameter indicates a machine that includes sse support, and emits generic code when see isn't available. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Thomas Monjalon <thomas.monjalon@6wind.com> --- lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c index 09e19a3..5747072 100644 --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c @@ -38,8 +38,6 @@ #include "ixgbe_ethdev.h" #include "ixgbe_rxtx.h" -#include <nmmintrin.h> - #ifndef __INTEL_COMPILER #pragma GCC diagnostic ignored "-Wcast-qual" #endif @@ -294,8 +292,8 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, rte_compiler_barrier(); /* D.1 pkt 3,4 convert format from desc to pktmbuf */ - pkt_mb4 = _mm_shuffle_epi8(descs[3], shuf_msk); - pkt_mb3 = _mm_shuffle_epi8(descs[2], shuf_msk); + pkt_mb4 = __builtin_shuffle(descs[3], shuf_msk); + pkt_mb3 = __builtin_shuffle(descs[2], shuf_msk); /* C.1 4=>2 filter staterr info only */ sterr_tmp2 = _mm_unpackhi_epi32(descs[3], descs[2]); @@ -310,8 +308,8 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, pkt_mb3 = _mm_add_epi16(pkt_mb3, in_port); /* D.1 pkt 1,2 convert format from desc to pktmbuf */ - pkt_mb2 = _mm_shuffle_epi8(descs[1], shuf_msk); - pkt_mb1 = _mm_shuffle_epi8(descs[0], shuf_msk); + pkt_mb2 = __builtin_shuffle(descs[1], shuf_msk); + pkt_mb1 = __builtin_shuffle(descs[0], shuf_msk); /* C.2 get 4 pkts staterr value */ zero = _mm_xor_si128(dd_check, dd_check); @@ -338,7 +336,7 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, pkt_mb1); /* C.4 calc avaialbe number of desc */ - var = _mm_popcnt_u64(_mm_cvtsi128_si64(staterr)); + var = __builtin_popcountll(_mm_cvtsi128_si64(staterr)); nb_pkts_recd += var; if (likely(var != RTE_IXGBE_DESCS_PER_LOOP)) break; -- 1.8.3.1 ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [dpdk-dev] [PATCH] ixgbe: convert sse intrinsics to use __builtin variants 2014-07-24 18:23 [dpdk-dev] [PATCH] ixgbe: convert sse intrinsics to use __builtin variants Neil Horman @ 2014-07-24 20:41 ` Venkatesan, Venky 2014-07-25 2:20 ` Neil Horman 2014-07-26 11:15 ` Ananyev, Konstantin 1 sibling, 1 reply; 5+ messages in thread From: Venkatesan, Venky @ 2014-07-24 20:41 UTC (permalink / raw) To: Neil Horman, dev Neil, Nice patch! One question - what gcc versions did you try this out on? We'll round out with checking the other versions. Regards, -Venky -----Original Message----- From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Neil Horman Sent: Thursday, July 24, 2014 11:24 AM To: dev@dpdk.org Subject: [dpdk-dev] [PATCH] ixgbe: convert sse intrinsics to use __builtin variants The ixgbe pmd currently can't be built without enabling sse instructions at compile time. While sse extensions provide better performance, theres no reason that we can't still create builds to run on systems that don't support sse. If we modify the ixgbe code to use the __builtin_shuffle and __builtin_popcountll functions, I've confirmed that the gcc compiler emits the appropriate sse instructions when the provided -march parameter indicates a machine that includes sse support, and emits generic code when see isn't available. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Thomas Monjalon <thomas.monjalon@6wind.com> --- lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c index 09e19a3..5747072 100644 --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c @@ -38,8 +38,6 @@ #include "ixgbe_ethdev.h" #include "ixgbe_rxtx.h" -#include <nmmintrin.h> - #ifndef __INTEL_COMPILER #pragma GCC diagnostic ignored "-Wcast-qual" #endif @@ -294,8 +292,8 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, rte_compiler_barrier(); /* D.1 pkt 3,4 convert format from desc to pktmbuf */ - pkt_mb4 = _mm_shuffle_epi8(descs[3], shuf_msk); - pkt_mb3 = _mm_shuffle_epi8(descs[2], shuf_msk); + pkt_mb4 = __builtin_shuffle(descs[3], shuf_msk); + pkt_mb3 = __builtin_shuffle(descs[2], shuf_msk); /* C.1 4=>2 filter staterr info only */ sterr_tmp2 = _mm_unpackhi_epi32(descs[3], descs[2]); @@ -310,8 +308,8 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, pkt_mb3 = _mm_add_epi16(pkt_mb3, in_port); /* D.1 pkt 1,2 convert format from desc to pktmbuf */ - pkt_mb2 = _mm_shuffle_epi8(descs[1], shuf_msk); - pkt_mb1 = _mm_shuffle_epi8(descs[0], shuf_msk); + pkt_mb2 = __builtin_shuffle(descs[1], shuf_msk); + pkt_mb1 = __builtin_shuffle(descs[0], shuf_msk); /* C.2 get 4 pkts staterr value */ zero = _mm_xor_si128(dd_check, dd_check); @@ -338,7 +336,7 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, pkt_mb1); /* C.4 calc avaialbe number of desc */ - var = _mm_popcnt_u64(_mm_cvtsi128_si64(staterr)); + var = __builtin_popcountll(_mm_cvtsi128_si64(staterr)); nb_pkts_recd += var; if (likely(var != RTE_IXGBE_DESCS_PER_LOOP)) break; -- 1.8.3.1 ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [dpdk-dev] [PATCH] ixgbe: convert sse intrinsics to use __builtin variants 2014-07-24 20:41 ` Venkatesan, Venky @ 2014-07-25 2:20 ` Neil Horman 0 siblings, 0 replies; 5+ messages in thread From: Neil Horman @ 2014-07-25 2:20 UTC (permalink / raw) To: Venkatesan, Venky; +Cc: dev On Thu, Jul 24, 2014 at 08:41:33PM +0000, Venkatesan, Venky wrote: > Neil, > > Nice patch! One question - what gcc versions did you try this out on? We'll round out with checking the other versions. > Thanks! This was built using gcc 4.8.3. I had hoped to do something simmilar for the ACL library, but the sse instructions used there only emitted code when sse was enabled (IOW there was no genereic non-sse variant)? Neil ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [dpdk-dev] [PATCH] ixgbe: convert sse intrinsics to use __builtin variants 2014-07-24 18:23 [dpdk-dev] [PATCH] ixgbe: convert sse intrinsics to use __builtin variants Neil Horman 2014-07-24 20:41 ` Venkatesan, Venky @ 2014-07-26 11:15 ` Ananyev, Konstantin 2014-07-28 13:32 ` Neil Horman 1 sibling, 1 reply; 5+ messages in thread From: Ananyev, Konstantin @ 2014-07-26 11:15 UTC (permalink / raw) To: Neil Horman, dev Hi Neil, > The ixgbe pmd currently can't be built without enabling sse instructions at > compile time. Actually it can, all you have to do is set RTE_IXGBE_INC_VECTOR=n in your config. > While sse extensions provide better performance, theres no reason > that we can't still create builds to run on systems that don't support sse. If > we modify the ixgbe code to use the __builtin_shuffle and __builtin_popcountll > functions, I've confirmed that the gcc compiler emits the appropriate sse > instructions when the provided -march parameter indicates a machine that > includes sse support, and emits generic code when see isn't available. I don't think it is ok to blindly replace _mm_shuffle_epi8 with __builtin_shuffle. They are not identical. I tried your patch on IVB box (gcc 4.8.3, CONFIG_RTE_MACHINE="native"). The result is - ixgbe_recv_pkts_vec() functionality is broken. See below for more details. So my vote is NACK. Konstantin 1. Code changes: uint16_t ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts) { ... __m128i shuf_msk; ... /* mask to shuffle from desc. to mbuf */ shuf_msk = _mm_set_epi8( 7, 6, 5, 4, /* octet 4~7, 32bits rss */ 0xFF, 0xFF, /* skip high 16 bits vlan_macip, zero out */ 15, 14, /* octet 14~15, low 16 bits vlan_macip */ 0xFF, 0xFF, /* skip high 16 bits pkt_len, zero out */ 13, 12, /* octet 12~13, low 16 bits pkt_len */ 0xFF, 0xFF, /* skip nb_segs and in_port, zero out */ 13, 12 /* octet 12~13, 16 bits data_len */ ); ... for (...) { __m128i descs[RTE_IXGBE_DESCS_PER_LOOP]; __m128i pkt_mb1, pkt_mb2, pkt_mb3, pkt_mb4; ... descs[3] = _mm_loadu_si128((__m128i *)(rxdp + 3)); ... - pkt_mb4 = _mm_shuffle_epi8(descs[3], shuf_msk); + pkt_mb4 = __builtin_shuffle(descs[3], shuf_msk); ... 2. Code generated before the patch (valid one): ... vmovdqa 0x4978d(%rip),%xmm0 /* load shuf_msk */ ... vmovdqu 0x30(%rdx),%xmm4 /* load desc[3] */ .... vpshufb %xmm0,%xmm4,%xmm8 .... 3. Code generated after the patch applied (broken one): ... vmovdqu 0x30(%rdx),%xmm3 ... vpunpcklqdq %xmm3,%xmm3,%xmm3 /* !!! ERROR - should be vpshufb !!!! */ 4. What happens here? My understanding: GCC treats __m128i as vector of two 64bit integers: /lib/gcc/x86_64-redhat-linux/4.8.3/include/emmintrin.h:typedef long long __m128i __attribute__ ((__vector_size__ (16), __may_alias__)); From https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html: "... Vector shuffling is available using functions __builtin_shuffle (vec, mask) and __builtin_shuffle (vec0, vec1, mask). Both functions construct a permutation of elements from one or two vectors and return a vector of the same type as the input vector(s). The mask is an integral vector with the same width (W) and element count (N) as the output vector. The elements of the input vectors are numbered in memory ordering of vec0 beginning at 0 and vec1 beginning at N. The elements of mask are considered modulo N in the single-operand case and modulo 2*N in the two-operand case." For m128i N = 2, so: __m128i x, __m128i msk; x = __builtin_shuffle(x, msk); means: index0 = msk[0..63] % 2; index1 = msk[64..127] % 2; x[0..63] = x[index0 * 64..index0*64+63]; x[64..127] = x[index1 * 64..index1*64+63]; In ixgbe_recv_pkts_vec() shuf_msk[0..63] % 2 == 0 and shuf_msk[64..127] % 2. So compiler makes optimisation: pkt_mb4[0..63] = descs[3] [0..63]; pkt_mb4[64..127] = descs[3] [0..63]; i.e: vpunpcklqdq %xmm3,%xmm3,%xmm3 BTW, changing to __builtin_shuffle((__v16qi) descs[3], (__v16qi)shuf_msk); wouldn't help either. In that case __builtin_shuffle will consider elements of mas modulo 16. While _mm_shuffle_epi8 (PSHUFB) is expected to zero destination byte if upper bit in the corresponding mask byte is 1. ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [dpdk-dev] [PATCH] ixgbe: convert sse intrinsics to use __builtin variants 2014-07-26 11:15 ` Ananyev, Konstantin @ 2014-07-28 13:32 ` Neil Horman 0 siblings, 0 replies; 5+ messages in thread From: Neil Horman @ 2014-07-28 13:32 UTC (permalink / raw) To: Ananyev, Konstantin; +Cc: dev On Sat, Jul 26, 2014 at 11:15:01AM +0000, Ananyev, Konstantin wrote: > > Hi Neil, > > > The ixgbe pmd currently can't be built without enabling sse instructions at > > compile time. > > Actually it can, all you have to do is set RTE_IXGBE_INC_VECTOR=n in your config. > > > While sse extensions provide better performance, theres no reason > > that we can't still create builds to run on systems that don't support sse. If > > we modify the ixgbe code to use the __builtin_shuffle and __builtin_popcountll > > functions, I've confirmed that the gcc compiler emits the appropriate sse > > instructions when the provided -march parameter indicates a machine that > > includes sse support, and emits generic code when see isn't available. > > I don't think it is ok to blindly replace _mm_shuffle_epi8 with __builtin_shuffle. > They are not identical. > I tried your patch on IVB box (gcc 4.8.3, CONFIG_RTE_MACHINE="native"). > The result is - ixgbe_recv_pkts_vec() functionality is broken. > See below for more details. > So my vote is NACK. > Konstantin > > 1. Code changes: > uint16_t > ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, > uint16_t nb_pkts) > { > ... > __m128i shuf_msk; > ... > /* mask to shuffle from desc. to mbuf */ > shuf_msk = _mm_set_epi8( > 7, 6, 5, 4, /* octet 4~7, 32bits rss */ > 0xFF, 0xFF, /* skip high 16 bits vlan_macip, zero out */ > 15, 14, /* octet 14~15, low 16 bits vlan_macip */ > 0xFF, 0xFF, /* skip high 16 bits pkt_len, zero out */ > 13, 12, /* octet 12~13, low 16 bits pkt_len */ > 0xFF, 0xFF, /* skip nb_segs and in_port, zero out */ > 13, 12 /* octet 12~13, 16 bits data_len */ > ); > ... > for (...) { > __m128i descs[RTE_IXGBE_DESCS_PER_LOOP]; > __m128i pkt_mb1, pkt_mb2, pkt_mb3, pkt_mb4; > ... > descs[3] = _mm_loadu_si128((__m128i *)(rxdp + 3)); > ... > - pkt_mb4 = _mm_shuffle_epi8(descs[3], shuf_msk); > + pkt_mb4 = __builtin_shuffle(descs[3], shuf_msk); > ... > > 2. Code generated before the patch (valid one): > ... > vmovdqa 0x4978d(%rip),%xmm0 /* load shuf_msk */ > ... > vmovdqu 0x30(%rdx),%xmm4 /* load desc[3] */ > .... > vpshufb %xmm0,%xmm4,%xmm8 > .... > > 3. Code generated after the patch applied (broken one): > ... > vmovdqu 0x30(%rdx),%xmm3 > ... > vpunpcklqdq %xmm3,%xmm3,%xmm3 /* !!! ERROR - should be vpshufb !!!! */ > > 4. What happens here? > My understanding: > > GCC treats __m128i as vector of two 64bit integers: > /lib/gcc/x86_64-redhat-linux/4.8.3/include/emmintrin.h:typedef long long __m128i __attribute__ ((__vector_size__ (16), __may_alias__)); > > From https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html: > "... Vector shuffling is available using functions __builtin_shuffle (vec, mask) and __builtin_shuffle (vec0, vec1, mask). Both functions construct a permutation of elements from one or two vectors and return a vector of the same type as the input vector(s). The mask is an integral vector with the same width (W) and element count (N) as the output vector. > > The elements of the input vectors are numbered in memory ordering of vec0 beginning at 0 and vec1 beginning at N. The elements of mask are considered modulo N in the single-operand case and modulo 2*N in the two-operand case." > > For m128i N = 2, so: > > __m128i x, __m128i msk; > x = __builtin_shuffle(x, msk); > > means: > > index0 = msk[0..63] % 2; > index1 = msk[64..127] % 2; > x[0..63] = x[index0 * 64..index0*64+63]; > x[64..127] = x[index1 * 64..index1*64+63]; > > In ixgbe_recv_pkts_vec() shuf_msk[0..63] % 2 == 0 and shuf_msk[64..127] % 2. > So compiler makes optimisation: > pkt_mb4[0..63] = descs[3] [0..63]; > pkt_mb4[64..127] = descs[3] [0..63]; > i.e: > vpunpcklqdq %xmm3,%xmm3,%xmm3 > > BTW, changing to > __builtin_shuffle((__v16qi) descs[3], (__v16qi)shuf_msk); > wouldn't help either. > In that case __builtin_shuffle will consider elements of mas modulo 16. > While _mm_shuffle_epi8 (PSHUFB) is expected to zero destination byte if upper bit in the corresponding mask byte is 1. > Ok, Sorry for the delayed response, I spend the weekend reading up on the differences between these instructions, and you're right, I thought those operations were equivalent, but the zeroing operation differentiates them. Sorry about that. That said, it would be really helpful for distribution packaging to be able to enable vectorized reception at run time. A compile time build variable really just isn't very helpful. I'll see if I can rework the patch to allow optional vectorized patch reception at run time. Neil > > ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2014-07-28 13:30 UTC | newest] Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2014-07-24 18:23 [dpdk-dev] [PATCH] ixgbe: convert sse intrinsics to use __builtin variants Neil Horman 2014-07-24 20:41 ` Venkatesan, Venky 2014-07-25 2:20 ` Neil Horman 2014-07-26 11:15 ` Ananyev, Konstantin 2014-07-28 13:32 ` Neil Horman
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).