From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp.tuxdriver.com (charlotte.tuxdriver.com [70.61.120.58]) by dpdk.org (Postfix) with ESMTP id 598932E8A for ; Tue, 5 Aug 2014 20:18:31 +0200 (CEST) Received: from hmsreliant.think-freely.org ([2001:470:8:a08:7aac:c0ff:fec2:933b] helo=localhost) by smtp.tuxdriver.com with esmtpsa (TLSv1:AES128-SHA:128) (Exim 4.63) (envelope-from ) id 1XEjLp-0008Jv-6h; Tue, 05 Aug 2014 14:20:52 -0400 Date: Tue, 5 Aug 2014 14:20:35 -0400 From: Neil Horman To: "Ananyev, Konstantin" Message-ID: <20140805182035.GB20550@hmsreliant.think-freely.org> References: <1407166558-9532-1-git-send-email-nhorman@tuxdriver.com> <2601191342CEEE43887BDE71AB9772582134F98D@IRSMSX105.ger.corp.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <2601191342CEEE43887BDE71AB9772582134F98D@IRSMSX105.ger.corp.intel.com> User-Agent: Mutt/1.5.23 (2014-03-12) X-Spam-Score: -2.9 (--) X-Spam-Status: No Cc: "dev@dpdk.org" Subject: Re: [dpdk-dev] [PATCH] acl: If build does not support sse4.2, emulate missing instructions with C code X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: patches and discussions about DPDK List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 05 Aug 2014 18:18:32 -0000 On Tue, Aug 05, 2014 at 03:26:27PM +0000, Ananyev, Konstantin wrote: > Hi Neil, > > > From: Neil Horman [mailto:nhorman@tuxdriver.com] > > Sent: Monday, August 04, 2014 4:36 PM > > To: dev@dpdk.org > > Cc: Neil Horman; Thomas Monjalon; Ananyev, Konstantin; Richardson, Bruce > > Subject: [PATCH] acl: If build does not support sse4.2, emulate missing instructions with C code > > > > The ACL library makes extensive use of some SSE4.2 instructions, which means the > > default build can't compile this library. Work around the problem by testing > > the __SSE42__ definition in the acl_vects.h file and defining the macros there > > as intrinsics or c-level equivalants. Note this is a minimal patch, adjusting > > only the definitions that are currently used in the ACL library. > > > > My comments about actual implementations of c-level equivalents below. > None of them look correct to me. > Of course it could be fixed. > Though I am not sure that is a right way to proceed: > At first I really doubt that these equivalents will provide similar performance. > As you probably note - we do have a scalar version of rte_acl_classify(): rte_acl_classify_scalar(). > So I think it might be faster than vector one with 'emulated' instrincts. > Unfortunately it is all mixed right now into one file and 'scalar' version could use few sse4 instrincts through resolve_priority(). > Another thing - we consider to add another version of rte_acl_classify() that will use avx2 instrincts. > If we go the way you suggest - I am afraid will soon have to provide scalar equivalents for several AVX2 instrincts too. > So in summary that way (providing our own scalar equivalents of SIMD instrincts) seems to me slow, hard to maintain and error prone. > > What porbably can be done instead: rework acl_run.c a bit, so it provide: > rte_acl_classify_scalar() - could be build and used on all systems. > rte_acl_classify_sse() - could be build and used only on systems with sse4.2 and upper, return ENOTSUP on lower arch. > In future: rte_acl_classify_avx2 - could be build and used only on systems with avx2 and upper, return ENOTSUP on lower arch. > > I am looking at rte_acl right now anyway. > So will try to come up with something workable. > So, this is exactly the opposite of what Bruce and I just spent several days and a huge email thread that you clearly are aware of discussing run time versus compile time selection of paths. At this point I'm done ping ponging between your opposing viewpoints. If you want to implement something that does run time checking, I'm fine with it, but I'm not going back and forth until you two come to an agreement on this. > > Only compile tested so far, but I wanted to post it for early review so that > > others could aid in unit testing. > > Could you please stop submitting untested patches. What part of "Compile tested only" was hard to understand? I posted it early specficially because I expected comments like you've provided above. After I posted my first idea, you naked it, and after I started fixing it up to make run-time selections, Bruce came down in favor of common compile time functionality. Given our previous conversation, I figured you would be opposed to this approach, so whats the point in testing this if you're just going to kill the idea? > It is common (and good) practice to do at least some minimal testing before submitting your code changes. Never intended it to be accepted just tested/reviewed. Hence my comment "compile tested only". I admit I was unaware of the unit test below > At the end, it will save your own and other people time. > For ACL there is a simple UT, that could be run as: > ./x86_64-native-linuxapp-gcc/app/test ... > RTE>>acl_autotest > It takes just few seconds to run. It doesn't seem to work properly, at least not on any of my systems. With a system allocation 100 pages to hugepage memory: [root@hmsreliant app]# ./test -c 0x3 -n 2 EAL: Detected lcore 0 as core 0 on socket 0 EAL: Detected lcore 1 as core 1 on socket 0 EAL: Detected lcore 2 as core 2 on socket 0 EAL: Detected lcore 3 as core 3 on socket 0 EAL: Detected lcore 4 as core 0 on socket 0 EAL: Detected lcore 5 as core 1 on socket 0 EAL: Detected lcore 6 as core 2 on socket 0 EAL: Detected lcore 7 as core 3 on socket 0 EAL: Support maximum 64 logical core(s) by configuration. EAL: Detected 8 lcore(s) EAL: cannot open VFIO container, error 2 (No such file or directory) EAL: VFIO support could not be initialized EAL: Setting up memory... EAL: Ask a virtual area of 0x200000 bytes EAL: Virtual area found at 0x7fef07000000 (size = 0x200000) EAL: Ask a virtual area of 0x200000 bytes EAL: Virtual area found at 0x7fef06c00000 (size = 0x200000) EAL: Ask a virtual area of 0x400000 bytes EAL: Virtual area found at 0x7fef06600000 (size = 0x400000) EAL: Ask a virtual area of 0x200000 bytes EAL: Virtual area found at 0x7fef06200000 (size = 0x200000) EAL: Ask a virtual area of 0x200000 bytes EAL: Virtual area found at 0x7fef05e00000 (size = 0x200000) EAL: Ask a virtual area of 0x600000 bytes EAL: Virtual area found at 0x7fef05600000 (size = 0x600000) EAL: Ask a virtual area of 0x600000 bytes EAL: Virtual area found at 0x7fef04e00000 (size = 0x600000) EAL: Ask a virtual area of 0x800000 bytes EAL: Virtual area found at 0x7fef04400000 (size = 0x800000) EAL: Ask a virtual area of 0x400000 bytes EAL: Virtual area found at 0x7fef03e00000 (size = 0x400000) EAL: Ask a virtual area of 0xa00000 bytes EAL: Virtual area found at 0x7fef03200000 (size = 0xa00000) EAL: Ask a virtual area of 0x400000 bytes EAL: Virtual area found at 0x7fef02c00000 (size = 0x400000) EAL: Ask a virtual area of 0x400000 bytes EAL: Virtual area found at 0x7fef02600000 (size = 0x400000) EAL: Ask a virtual area of 0x1800000 bytes EAL: Virtual area found at 0x7fef00c00000 (size = 0x1800000) EAL: Ask a virtual area of 0x1200000 bytes EAL: Virtual area found at 0x7feeff800000 (size = 0x1200000) EAL: Ask a virtual area of 0x2600000 bytes EAL: Virtual area found at 0x7feefd000000 (size = 0x2600000) EAL: Ask a virtual area of 0xc00000 bytes EAL: Virtual area found at 0x7feefc200000 (size = 0xc00000) EAL: Ask a virtual area of 0x2200000 bytes EAL: Virtual area found at 0x7feef9e00000 (size = 0x2200000) EAL: Ask a virtual area of 0x200000 bytes EAL: Virtual area found at 0x7feef9a00000 (size = 0x200000) EAL: Ask a virtual area of 0x200000 bytes EAL: Virtual area found at 0x7feef9600000 (size = 0x200000) EAL: Ask a virtual area of 0x200000 bytes EAL: Virtual area found at 0x7feef9200000 (size = 0x200000) EAL: Ask a virtual area of 0x200000 bytes EAL: Virtual area found at 0x7feef8e00000 (size = 0x200000) EAL: Ask a virtual area of 0x200000 bytes EAL: Virtual area found at 0x7feef8a00000 (size = 0x200000) EAL: Ask a virtual area of 0x200000 bytes EAL: Virtual area found at 0x7feef8600000 (size = 0x200000) EAL: Ask a virtual area of 0x200000 bytes EAL: Virtual area found at 0x7feef8200000 (size = 0x200000) EAL: Ask a virtual area of 0x600000 bytes EAL: Virtual area found at 0x7feef7a00000 (size = 0x600000) EAL: Requesting 100 pages of size 2MB from socket 0 EAL: TSC frequency is ~3392297 KHz EAL: Master core 0 is ready (tid=73cf880) EAL: Core 1 is ready (tid=f71fe700) APP: HPET is not enabled, using TSC as default timer RTE>>acl_autotest ACL: allocation of 25166720 bytes on socket 9 for ACL_acl_ctx failed This hangs forever (well, at least 30 minutes, but thats sufficiently close to forever for me to wait). > Also if you plan further development for ACL, we probably can provide you with extra test-cases that we use for functional testing. That would be good please. Thanks Neil > Thanks > Konstantin > > > > > Signed-off-by: Neil Horman > > CC: Thomas Monjalon > > CC: "Konstantin Ananyev" > > CC: Bruce Richardson > > --- > > lib/librte_acl/acl_bld.c | 3 +- > > lib/librte_acl/acl_vect.h | 102 ++++++++++++++++++++++++++++++++++++++++++++-- > > 2 files changed, 100 insertions(+), 5 deletions(-) > > > > diff --git a/lib/librte_acl/acl_bld.c b/lib/librte_acl/acl_bld.c > > index 873447b..de974a4 100644 > > --- a/lib/librte_acl/acl_bld.c > > +++ b/lib/librte_acl/acl_bld.c > > @@ -31,7 +31,6 @@ > > * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. > > */ > > > > -#include > > #include > > #include "tb_mem.h" > > #include "acl.h" > > @@ -1481,7 +1480,7 @@ acl_calc_wildness(struct rte_acl_build_rule *head, > > switch (rule->config->defs[n].type) { > > case RTE_ACL_FIELD_TYPE_BITMASK: > > wild = (size - > > - _mm_popcnt_u32(fld->mask_range.u8)) / > > + __builtin_popcountl(fld->mask_range.u8)) / > > size; > > break; > > > > diff --git a/lib/librte_acl/acl_vect.h b/lib/librte_acl/acl_vect.h > > index d813600..e5f391b 100644 > > --- a/lib/librte_acl/acl_vect.h > > +++ b/lib/librte_acl/acl_vect.h > > @@ -34,6 +34,10 @@ > > #ifndef _RTE_ACL_VECT_H_ > > #define _RTE_ACL_VECT_H_ > > > > +#ifdef __SSE4_1__ > > +#include > > +#endif > > + > > /** > > * @file > > * > > @@ -44,12 +48,12 @@ > > extern "C" { > > #endif > > > > + > > #define MM_ADD16(a, b) _mm_add_epi16(a, b) > > #define MM_ADD32(a, b) _mm_add_epi32(a, b) > > #define MM_ALIGNR8(a, b, c) _mm_alignr_epi8(a, b, c) > > #define MM_AND(a, b) _mm_and_si128(a, b) > > #define MM_ANDNOT(a, b) _mm_andnot_si128(a, b) > > -#define MM_BLENDV8(a, b, c) _mm_blendv_epi8(a, b, c) > > #define MM_CMPEQ16(a, b) _mm_cmpeq_epi16(a, b) > > #define MM_CMPEQ32(a, b) _mm_cmpeq_epi32(a, b) > > #define MM_CMPEQ8(a, b) _mm_cmpeq_epi8(a, b) > > @@ -59,7 +63,6 @@ extern "C" { > > #define MM_CVT32(a) _mm_cvtsi128_si32(a) > > #define MM_CVTU32(a) _mm_cvtsi32_si128(a) > > #define MM_INSERT16(a, c, b) _mm_insert_epi16(a, c, b) > > -#define MM_INSERT32(a, c, b) _mm_insert_epi32(a, c, b) > > #define MM_LOAD(a) _mm_load_si128(a) > > #define MM_LOADH_PI(a, b) _mm_loadh_pi(a, b) > > #define MM_LOADU(a) _mm_loadu_si128(a) > > @@ -82,7 +85,6 @@ extern "C" { > > #define MM_SRL32(a, b) _mm_srli_epi32(a, b) > > #define MM_STORE(a, b) _mm_store_si128(a, b) > > #define MM_STOREU(a, b) _mm_storeu_si128(a, b) > > -#define MM_TESTZ(a, b) _mm_testz_si128(a, b) > > #define MM_XOR(a, b) _mm_xor_si128(a, b) > > > > #define MM_SET16(a, b, c, d, e, f, g, h) \ > > @@ -93,6 +95,100 @@ extern "C" { > > _mm_set_epi8(c0, c1, c2, c3, c4, c5, c6, c7, \ > > c8, c9, cA, cB, cC, cD, cE, cF) > > > > + > > +#ifndef __SSE4_1__ > > +static inline xmm_t pblendvb(xmm_t dst, xmm_t src, xmm_t mask) > > +{ > > + unsigned char tmpd[16], tmps[16], tmpm[16]; > > + int i; > > + > > + MM_STOREU((xmm_t *)&tmpd, dst); > > + MM_STOREU((xmm_t *)&tmps, src); > > + MM_STOREU((xmm_t *)&tmpm, mask); > > + > > Generic comment: > You don't need to use char temp[16] or other stuff each time you need to swap between __m128i and char[16]; > Specially for that purpose inside rte_common_vect.h we have: > typedef union rte_xmm { > xmm_t m; > uint8_t u8[XMM_SIZE / sizeof(uint8_t)]; > uint16_t u16[XMM_SIZE / sizeof(uint16_t)]; > uint32_t u32[XMM_SIZE / sizeof(uint32_t)]; > uint64_t u64[XMM_SIZE / sizeof(uint64_t)]; > double pd[XMM_SIZE / sizeof(double)]; > } rte_xmm_t; > > > > + for (i = 0; i < 16; i++) > > + if (mask[i] & 0x8) > > Should be tmpm, not mask. > Wonder how did it compile without warnings? > > > + dst[i] = src[i]; > > + > > > Reading Intel Architecture Manual: > MASK <-- SRC3 > IF (MASK[7] = 1) THEN DEST[7:0]  SRC2[7:0]; > ELSE DEST[7:0] <-- SRC1[7:0]; > IF (MASK[15] = 1) THEN DEST[15:8]  SRC2[15:8]; > ELSE DEST[15:8] <-- SRC1[15:8]; > ... > So it should be 'tmpm[i] & 0x80' > > > + dst = MM_LOADU((xmm_t *)&tmpd); > > + > > + return dst; > > +} > > Taking into the account all of the above, I think it should be something like that: > > static inline xmm_t > pblendvb(xmm_t dst, xmm_t src, xmm_t mask) > { > rte_xmm_t d, m, s; > int32_t i; > > d.m = dst; > s.m = src; > m.m = mask; > > for (i = 0; i != RTE_DIM(m.u8); i++) > if ((m.u8[i] & INT8_MIN) != 0) > d.u8[i] = s.u8[i]; > > return (d.m); > } > > > + > > +#define MM_BLENDV8(a, b, c) pblendvb(a, b, c) > > + > > + > > +static inline int ptestz(xmm_t a, xmm_t b) > > +{ > > + unsigned long long tmpa[2], tmpb[2]; > > + > > + MM_STOREU((xmm_t *)&tmpa, a); > > + MM_STOREU((xmm_t *)&tmpb, b); > > + > > + if (tmpa[0] & tmpb[0]) > > + return 1; > > + if (tmpa[1] & tmpb[1]) > > + return 1; > > + > > + return 0; > > +} > > Again from IA manual: > (V)PTEST (128-bit version) > IF (SRC[127:0] BITWISE AND DEST[127:0] = 0) > THEN ZF <-- 1; > ELSE ZF <-- 0; > IF (SRC[127:0] BITWISE AND NOT DEST[127:0] = 0) > THEN CF <-- 1; > ELSE CF <-- 0; > > So _mm_testz_si128 (__m128i s1, __m128i s2) returns 1 if the bitwise AND operation on s1 and s2 results in all zeros, else returns 0. > Yours one doing opposite stuff. > Should be: > > static inline int > ptestz(xmm_t a, xmm_t b) > { > rte_xmm_t x, y; > > x.m = a; > y.m = b; > > return ((x.u64[0] & y.u64[0]) == 0 && (x.u64[1] & y.u64[1]) == 0); > } > > > + > > +#define MM_TESTZ(a, b) ptestz(a, b) > > + > > +static inline xmm_t pinsrd(xmm_t dst, int32_t val, char off) > > +{ > > + unsigned long long tmpa[2]; > > + unsigned long long mask; > > + int32_t tmp; > > + > > + MM_STOREU((xmm_t *)&tmpa, dst); > > + > > + /* > > + * Inserting a dword is a bit odd as it can cross a word boundary > > + */ > > + > > + if (off > 32) { > > + /* > > + * If the offset is more than 32, then part of the > > + * inserted word will appear in the upper half of the xmm > > + * register. Grab the part of the value that crosses the 64 bit > > + * boundary. > > + */ > > + tmp = val >> (off - 32); > > + > > + /* > > + * Mask off the least significant bits of the upper longword > > + */ > > + mask = ~((1 << (off - 32)) - 1); > > + tmpa[1] &= mask; > > + > > + /* > > + * and insert the new value > > + */ > > + tmpa[1] |= tmp; > > + } > > + if (off < 64) { > > + /* > > + * If the offset is less than 64 bits, we also need to mask and > > + * assign the lower longword > > + */ > > + mask = (1 << off) - 1; > > + tmpa[0] &= mask; > > + tmpa[0] |= (val << off); > > + } > > + > > + dst = MM_LOADU((xmm_t *)&tmpa); > > + return dst; > > +} > > Again from IA manual: > PINSRD: SEL <-- COUNT[1:0]; > MASK <-- (0FFFFFFFFH << (SEL * 32)); > TEMP <-- (((SRC << (SEL *32)) AND MASK) ; > DEST <-- ((DEST AND NOT MASK) OR TEMP); > > So basically, all we need is to treat dst as array of 4 integers. > Off - is just an index [0-3] where to insert val: > > Static inline xmm_t > pinsrd(xmm_t dst, int32_t val, uint32_t off) > { > rte_xmm_t x; > int32_t idx; > > x.m = dst; > idx = off % RTE_DIM(x.u32); > x.u32[idx] = val; > return x.m; > } > > > + > > +#define MM_INSERT32(a, c, b) pinsrd(a, c, b) > > + > > +#else > > +#define MM_BLENDV8(a, b, c) _mm_blendv_epi8(a, b, c) > > +#define MM_TESTZ(a, b) _mm_testz_si128(a, b) > > +#define MM_INSERT32(a, c, b) _mm_insert_epi32(a, c, b) > > +#endif > > + > > #ifdef RTE_ARCH_X86_64 > > > > #define MM_CVT64(a) _mm_cvtsi128_si64(a) > > -- > > 1.8.3.1 >