DPDK patches and discussions
 help / color / mirror / Atom feed
* [dpdk-dev] [PATCH] ixgbe: convert sse intrinsics to use __builtin variants
@ 2014-07-24 18:23 Neil Horman
  2014-07-24 20:41 ` Venkatesan, Venky
  2014-07-26 11:15 ` Ananyev, Konstantin
  0 siblings, 2 replies; 5+ messages in thread
From: Neil Horman @ 2014-07-24 18:23 UTC (permalink / raw)
  To: dev

The ixgbe pmd currently can't be built without enabling sse instructions at
compile time.  While sse extensions provide better performance, theres no reason
that we can't still create builds to run on systems that don't support sse.  If
we modify the ixgbe code to use the __builtin_shuffle and __builtin_popcountll
functions, I've confirmed that the gcc compiler emits the appropriate sse
instructions when the provided -march parameter indicates a machine that
includes sse support, and emits generic code when see isn't available.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Thomas Monjalon <thomas.monjalon@6wind.com>
---
 lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
index 09e19a3..5747072 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
@@ -38,8 +38,6 @@
 #include "ixgbe_ethdev.h"
 #include "ixgbe_rxtx.h"
 
-#include <nmmintrin.h>
-
 #ifndef __INTEL_COMPILER
 #pragma GCC diagnostic ignored "-Wcast-qual"
 #endif
@@ -294,8 +292,8 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		rte_compiler_barrier();
 
 		/* D.1 pkt 3,4 convert format from desc to pktmbuf */
-		pkt_mb4 = _mm_shuffle_epi8(descs[3], shuf_msk);
-		pkt_mb3 = _mm_shuffle_epi8(descs[2], shuf_msk);
+		pkt_mb4 = __builtin_shuffle(descs[3], shuf_msk);
+		pkt_mb3 = __builtin_shuffle(descs[2], shuf_msk);
 
 		/* C.1 4=>2 filter staterr info only */
 		sterr_tmp2 = _mm_unpackhi_epi32(descs[3], descs[2]);
@@ -310,8 +308,8 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		pkt_mb3 = _mm_add_epi16(pkt_mb3, in_port);
 
 		/* D.1 pkt 1,2 convert format from desc to pktmbuf */
-		pkt_mb2 = _mm_shuffle_epi8(descs[1], shuf_msk);
-		pkt_mb1 = _mm_shuffle_epi8(descs[0], shuf_msk);
+		pkt_mb2 = __builtin_shuffle(descs[1], shuf_msk);
+		pkt_mb1 = __builtin_shuffle(descs[0], shuf_msk);
 
 		/* C.2 get 4 pkts staterr value  */
 		zero = _mm_xor_si128(dd_check, dd_check);
@@ -338,7 +336,7 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 				pkt_mb1);
 
 		/* C.4 calc avaialbe number of desc */
-		var = _mm_popcnt_u64(_mm_cvtsi128_si64(staterr));
+		var = __builtin_popcountll(_mm_cvtsi128_si64(staterr));
 		nb_pkts_recd += var;
 		if (likely(var != RTE_IXGBE_DESCS_PER_LOOP))
 			break;
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [dpdk-dev] [PATCH] ixgbe: convert sse intrinsics to use __builtin variants
  2014-07-24 18:23 [dpdk-dev] [PATCH] ixgbe: convert sse intrinsics to use __builtin variants Neil Horman
@ 2014-07-24 20:41 ` Venkatesan, Venky
  2014-07-25  2:20   ` Neil Horman
  2014-07-26 11:15 ` Ananyev, Konstantin
  1 sibling, 1 reply; 5+ messages in thread
From: Venkatesan, Venky @ 2014-07-24 20:41 UTC (permalink / raw)
  To: Neil Horman, dev

Neil, 

Nice patch! One question - what gcc versions did you try this out on? We'll round out with checking the other versions. 

Regards, 
-Venky

-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Neil Horman
Sent: Thursday, July 24, 2014 11:24 AM
To: dev@dpdk.org
Subject: [dpdk-dev] [PATCH] ixgbe: convert sse intrinsics to use __builtin variants

The ixgbe pmd currently can't be built without enabling sse instructions at compile time.  While sse extensions provide better performance, theres no reason that we can't still create builds to run on systems that don't support sse.  If we modify the ixgbe code to use the __builtin_shuffle and __builtin_popcountll functions, I've confirmed that the gcc compiler emits the appropriate sse instructions when the provided -march parameter indicates a machine that includes sse support, and emits generic code when see isn't available.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Thomas Monjalon <thomas.monjalon@6wind.com>
---
 lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
index 09e19a3..5747072 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
@@ -38,8 +38,6 @@
 #include "ixgbe_ethdev.h"
 #include "ixgbe_rxtx.h"
 
-#include <nmmintrin.h>
-
 #ifndef __INTEL_COMPILER
 #pragma GCC diagnostic ignored "-Wcast-qual"
 #endif
@@ -294,8 +292,8 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		rte_compiler_barrier();
 
 		/* D.1 pkt 3,4 convert format from desc to pktmbuf */
-		pkt_mb4 = _mm_shuffle_epi8(descs[3], shuf_msk);
-		pkt_mb3 = _mm_shuffle_epi8(descs[2], shuf_msk);
+		pkt_mb4 = __builtin_shuffle(descs[3], shuf_msk);
+		pkt_mb3 = __builtin_shuffle(descs[2], shuf_msk);
 
 		/* C.1 4=>2 filter staterr info only */
 		sterr_tmp2 = _mm_unpackhi_epi32(descs[3], descs[2]); @@ -310,8 +308,8 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		pkt_mb3 = _mm_add_epi16(pkt_mb3, in_port);
 
 		/* D.1 pkt 1,2 convert format from desc to pktmbuf */
-		pkt_mb2 = _mm_shuffle_epi8(descs[1], shuf_msk);
-		pkt_mb1 = _mm_shuffle_epi8(descs[0], shuf_msk);
+		pkt_mb2 = __builtin_shuffle(descs[1], shuf_msk);
+		pkt_mb1 = __builtin_shuffle(descs[0], shuf_msk);
 
 		/* C.2 get 4 pkts staterr value  */
 		zero = _mm_xor_si128(dd_check, dd_check); @@ -338,7 +336,7 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 				pkt_mb1);
 
 		/* C.4 calc avaialbe number of desc */
-		var = _mm_popcnt_u64(_mm_cvtsi128_si64(staterr));
+		var = __builtin_popcountll(_mm_cvtsi128_si64(staterr));
 		nb_pkts_recd += var;
 		if (likely(var != RTE_IXGBE_DESCS_PER_LOOP))
 			break;
--
1.8.3.1

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [dpdk-dev] [PATCH] ixgbe: convert sse intrinsics to use __builtin variants
  2014-07-24 20:41 ` Venkatesan, Venky
@ 2014-07-25  2:20   ` Neil Horman
  0 siblings, 0 replies; 5+ messages in thread
From: Neil Horman @ 2014-07-25  2:20 UTC (permalink / raw)
  To: Venkatesan, Venky; +Cc: dev

On Thu, Jul 24, 2014 at 08:41:33PM +0000, Venkatesan, Venky wrote:
> Neil, 
> 
> Nice patch! One question - what gcc versions did you try this out on? We'll round out with checking the other versions. 
> 
Thanks!  This was built using gcc 4.8.3.  I had hoped to do something simmilar
for the ACL library, but the sse instructions used there only emitted code when
sse was enabled (IOW there was no genereic non-sse variant)?
Neil

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [dpdk-dev] [PATCH] ixgbe: convert sse intrinsics to use __builtin variants
  2014-07-24 18:23 [dpdk-dev] [PATCH] ixgbe: convert sse intrinsics to use __builtin variants Neil Horman
  2014-07-24 20:41 ` Venkatesan, Venky
@ 2014-07-26 11:15 ` Ananyev, Konstantin
  2014-07-28 13:32   ` Neil Horman
  1 sibling, 1 reply; 5+ messages in thread
From: Ananyev, Konstantin @ 2014-07-26 11:15 UTC (permalink / raw)
  To: Neil Horman, dev


Hi Neil,
 
> The ixgbe pmd currently can't be built without enabling sse instructions at
> compile time.

Actually it can, all you have to do is set RTE_IXGBE_INC_VECTOR=n in your config.

>  While sse extensions provide better performance, theres no reason
> that we can't still create builds to run on systems that don't support sse.  If
> we modify the ixgbe code to use the __builtin_shuffle and __builtin_popcountll
> functions, I've confirmed that the gcc compiler emits the appropriate sse
> instructions when the provided -march parameter indicates a machine that
> includes sse support, and emits generic code when see isn't available.

I don't think it is ok to blindly replace _mm_shuffle_epi8 with __builtin_shuffle.
They are not identical.
I tried your patch on IVB box (gcc 4.8.3, CONFIG_RTE_MACHINE="native").
The result is - ixgbe_recv_pkts_vec() functionality is broken.
See below for more details.
So my vote is NACK.
Konstantin 
 
1. Code changes:
uint16_t
ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
                uint16_t nb_pkts)
{
        ...
        __m128i shuf_msk;
        ...
        /* mask to shuffle from desc. to mbuf */
        shuf_msk = _mm_set_epi8(
                7, 6, 5, 4,  /* octet 4~7, 32bits rss */
                0xFF, 0xFF,  /* skip high 16 bits vlan_macip, zero out */
                15, 14,      /* octet 14~15, low 16 bits vlan_macip */
                0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
                13, 12,      /* octet 12~13, low 16 bits pkt_len */
                0xFF, 0xFF,  /* skip nb_segs and in_port, zero out */
                13, 12       /* octet 12~13, 16 bits data_len */
                );
         ...
         for (...) {
                __m128i descs[RTE_IXGBE_DESCS_PER_LOOP];
                __m128i pkt_mb1, pkt_mb2, pkt_mb3, pkt_mb4;
                ...
                descs[3] = _mm_loadu_si128((__m128i *)(rxdp + 3)); 
                ...
-               pkt_mb4 = _mm_shuffle_epi8(descs[3], shuf_msk);
+              pkt_mb4 = __builtin_shuffle(descs[3], shuf_msk);
...

2. Code generated before the patch (valid one):
...
vmovdqa 0x4978d(%rip),%xmm0           /* load shuf_msk */
...
vmovdqu 0x30(%rdx),%xmm4                /* load desc[3] */
....
vpshufb %xmm0,%xmm4,%xmm8
....

3. Code generated after the patch applied (broken one):
...
vmovdqu 0x30(%rdx),%xmm3
...
vpunpcklqdq %xmm3,%xmm3,%xmm3       /* !!! ERROR - should be vpshufb !!!! */

4. What happens here?
My understanding:

GCC treats __m128i as vector of two 64bit integers:
/lib/gcc/x86_64-redhat-linux/4.8.3/include/emmintrin.h:typedef long long __m128i __attribute__ ((__vector_size__ (16), __may_alias__));

From https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html:
"... Vector shuffling is available using functions __builtin_shuffle (vec, mask) and __builtin_shuffle (vec0, vec1, mask). Both functions construct a permutation of elements from one or two vectors and return a vector of the same type as the input vector(s). The mask is an integral vector with the same width (W) and element count (N) as the output vector.

The elements of the input vectors are numbered in memory ordering of vec0 beginning at 0 and vec1 beginning at N. The elements of mask are considered modulo N in the single-operand case and modulo 2*N in the two-operand case."

For m128i N = 2, so:

__m128i x, __m128i msk;
x = __builtin_shuffle(x, msk);

means:

index0 = msk[0..63] % 2;
index1 = msk[64..127] % 2;
x[0..63] =  x[index0 * 64..index0*64+63];
x[64..127] =  x[index1 * 64..index1*64+63];

In ixgbe_recv_pkts_vec() shuf_msk[0..63] % 2 == 0 and  shuf_msk[64..127] % 2.
So compiler makes optimisation:
pkt_mb4[0..63]  = descs[3] [0..63];
pkt_mb4[64..127]  = descs[3] [0..63];
i.e:
vpunpcklqdq %xmm3,%xmm3,%xmm3

BTW, changing to
__builtin_shuffle((__v16qi) descs[3], (__v16qi)shuf_msk);
wouldn't help either.
 In that case __builtin_shuffle will consider elements of mas modulo 16.
While _mm_shuffle_epi8 (PSHUFB) is expected to zero destination byte if upper bit in the corresponding mask byte is 1.

 


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [dpdk-dev] [PATCH] ixgbe: convert sse intrinsics to use __builtin variants
  2014-07-26 11:15 ` Ananyev, Konstantin
@ 2014-07-28 13:32   ` Neil Horman
  0 siblings, 0 replies; 5+ messages in thread
From: Neil Horman @ 2014-07-28 13:32 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev

On Sat, Jul 26, 2014 at 11:15:01AM +0000, Ananyev, Konstantin wrote:
> 
> Hi Neil,
>  
> > The ixgbe pmd currently can't be built without enabling sse instructions at
> > compile time.
> 
> Actually it can, all you have to do is set RTE_IXGBE_INC_VECTOR=n in your config.
> 
> >  While sse extensions provide better performance, theres no reason
> > that we can't still create builds to run on systems that don't support sse.  If
> > we modify the ixgbe code to use the __builtin_shuffle and __builtin_popcountll
> > functions, I've confirmed that the gcc compiler emits the appropriate sse
> > instructions when the provided -march parameter indicates a machine that
> > includes sse support, and emits generic code when see isn't available.
> 
> I don't think it is ok to blindly replace _mm_shuffle_epi8 with __builtin_shuffle.
> They are not identical.
> I tried your patch on IVB box (gcc 4.8.3, CONFIG_RTE_MACHINE="native").
> The result is - ixgbe_recv_pkts_vec() functionality is broken.
> See below for more details.
> So my vote is NACK.
> Konstantin 
>  
> 1. Code changes:
> uint16_t
> ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
>                 uint16_t nb_pkts)
> {
>         ...
>         __m128i shuf_msk;
>         ...
>         /* mask to shuffle from desc. to mbuf */
>         shuf_msk = _mm_set_epi8(
>                 7, 6, 5, 4,  /* octet 4~7, 32bits rss */
>                 0xFF, 0xFF,  /* skip high 16 bits vlan_macip, zero out */
>                 15, 14,      /* octet 14~15, low 16 bits vlan_macip */
>                 0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
>                 13, 12,      /* octet 12~13, low 16 bits pkt_len */
>                 0xFF, 0xFF,  /* skip nb_segs and in_port, zero out */
>                 13, 12       /* octet 12~13, 16 bits data_len */
>                 );
>          ...
>          for (...) {
>                 __m128i descs[RTE_IXGBE_DESCS_PER_LOOP];
>                 __m128i pkt_mb1, pkt_mb2, pkt_mb3, pkt_mb4;
>                 ...
>                 descs[3] = _mm_loadu_si128((__m128i *)(rxdp + 3)); 
>                 ...
> -               pkt_mb4 = _mm_shuffle_epi8(descs[3], shuf_msk);
> +              pkt_mb4 = __builtin_shuffle(descs[3], shuf_msk);
> ...
> 
> 2. Code generated before the patch (valid one):
> ...
> vmovdqa 0x4978d(%rip),%xmm0           /* load shuf_msk */
> ...
> vmovdqu 0x30(%rdx),%xmm4                /* load desc[3] */
> ....
> vpshufb %xmm0,%xmm4,%xmm8
> ....
> 
> 3. Code generated after the patch applied (broken one):
> ...
> vmovdqu 0x30(%rdx),%xmm3
> ...
> vpunpcklqdq %xmm3,%xmm3,%xmm3       /* !!! ERROR - should be vpshufb !!!! */
> 
> 4. What happens here?
> My understanding:
> 
> GCC treats __m128i as vector of two 64bit integers:
> /lib/gcc/x86_64-redhat-linux/4.8.3/include/emmintrin.h:typedef long long __m128i __attribute__ ((__vector_size__ (16), __may_alias__));
> 
> From https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html:
> "... Vector shuffling is available using functions __builtin_shuffle (vec, mask) and __builtin_shuffle (vec0, vec1, mask). Both functions construct a permutation of elements from one or two vectors and return a vector of the same type as the input vector(s). The mask is an integral vector with the same width (W) and element count (N) as the output vector.
> 
> The elements of the input vectors are numbered in memory ordering of vec0 beginning at 0 and vec1 beginning at N. The elements of mask are considered modulo N in the single-operand case and modulo 2*N in the two-operand case."
> 
> For m128i N = 2, so:
> 
> __m128i x, __m128i msk;
> x = __builtin_shuffle(x, msk);
> 
> means:
> 
> index0 = msk[0..63] % 2;
> index1 = msk[64..127] % 2;
> x[0..63] =  x[index0 * 64..index0*64+63];
> x[64..127] =  x[index1 * 64..index1*64+63];
> 
> In ixgbe_recv_pkts_vec() shuf_msk[0..63] % 2 == 0 and  shuf_msk[64..127] % 2.
> So compiler makes optimisation:
> pkt_mb4[0..63]  = descs[3] [0..63];
> pkt_mb4[64..127]  = descs[3] [0..63];
> i.e:
> vpunpcklqdq %xmm3,%xmm3,%xmm3
> 
> BTW, changing to
> __builtin_shuffle((__v16qi) descs[3], (__v16qi)shuf_msk);
> wouldn't help either.
>  In that case __builtin_shuffle will consider elements of mas modulo 16.
> While _mm_shuffle_epi8 (PSHUFB) is expected to zero destination byte if upper bit in the corresponding mask byte is 1.
> 
Ok, Sorry for the delayed response, I spend the weekend reading up on the
differences between these instructions, and you're right, I thought those
operations were equivalent, but the zeroing operation differentiates them.
Sorry about that.

That said, it would be really helpful for distribution packaging to be able to
enable vectorized reception at run time.  A compile time build variable really
just isn't very helpful.  I'll see if I can rework the patch to allow optional
vectorized patch reception at run time.

Neil

>  
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2014-07-28 13:30 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-24 18:23 [dpdk-dev] [PATCH] ixgbe: convert sse intrinsics to use __builtin variants Neil Horman
2014-07-24 20:41 ` Venkatesan, Venky
2014-07-25  2:20   ` Neil Horman
2014-07-26 11:15 ` Ananyev, Konstantin
2014-07-28 13:32   ` Neil Horman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).