From: Ravi Kerur <rkerur@gmail.com>
To: "Ondřej Bílka" <neleai@seznam.cz>
Cc: "dev@dpdk.org" <dev@dpdk.org>
Subject: Re: [dpdk-dev] [PATCH v3] Implement memcmp using SIMD intrinsics
Date: Mon, 15 Jun 2015 13:47:38 -0700 [thread overview]
Message-ID: <CAFb4SLDaWiaGiES1tgW7y3r4AJzAAKzO-ecZv6M0p-=o4dET_w@mail.gmail.com> (raw)
In-Reply-To: <20150612083056.GA18090@domone>
On Fri, Jun 12, 2015 at 1:30 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Mon, May 18, 2015 at 01:01:42PM -0700, Ravi Kerur wrote:
> > Background:
> > After preliminary discussion with John (Zhihong) and Tim from Intel it
> was
> > decided that it would be beneficial to use AVX/SSE intrinsics for memcmp
> > similar to memcpy that had been implemeneted. In addition, we decided to
> use
> > librte_hash as a test candidate to test both functionality and
> performance.
> >
> > Further discussions lead to complete functionality implementation of
> memory
> > comparison and v3 code reflects that.
> >
> > Test was conducted on Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, Ubuntu
> 14.04,
> > x86_64, 16GB DDR3 system.
> >
> > Ravi Kerur (1):
> > Implement memcmp using Intel SIMD instrinsics.
>
> As my previous mail got lost I am resending it.
>
> In short you shouldn't
> use sse2/avx2 for memcmp at all. In 95% of calls you find inequality in
> first 8 bytes so sse2 adds just unnecessary overhead versus checking
> these with.
>
>
Can you provide more details on how you found out 95% of the time
inequality results within first 8 bytes and how it applies to network
applications. Was any study or experiment done to understand from network
applications point of view? If yes, please share it.
Secondly, we (Intel engr and I) started off with non-avx and we have
slightly different version of what you have posted below for non-avx and at
that time we had focussed on 128 bytes comparison only and it couldn't beat
avx at all. No assumption on inequality i.e. byte difference can be
anywhere from 0th to 127th byte.
snippets of code below
__inline uint16_t bswap_16(uint16_t a)
{ return __builtin_bswap16(a);}
__inline uint32_t bswap_32(uint32_t a)
{ return __builtin_bswap32(a);}
__inline uint64_t bswap_64(uint64_t a)
{ return __builtin_bswap64(a);}
#define RTE_CMP_1(a, b) { \
uint8_t x = *(uint8_t *)(a); \
uint8_t y = *(uint8_t *)(b); \
if (x != y) return x - y; }
#define _RTE_CMP_1(a, b) \
return *(uint8_t *)(a) - *(uint8_t *)(b);
//****************************************
#define RTE_CMP_2(a, b) { \
uint16_t x = bswap_16(*(uint16_t *)(a)); \
uint16_t y = bswap_16(*(uint16_t *)(b)); \
if (x != y) return x - y; }
#define _RTE_CMP_2(a, b) { \
uint16_t x = bswap_16(*(uint16_t *)(a)); \
uint16_t y = bswap_16(*(uint16_t *)(b)); \
return x - y; }
//****************************************
#define RTE_CMP_4(a, b) { \
uint32_t x = bswap_32(*(uint32_t *)(a)); \
uint32_t y = bswap_32(*(uint32_t *)(b)); \
if (x != y) return (x < y) ? -1 : 1; }
#define _RTE_CMP_4(a, b) { \
uint32_t x = bswap_32(*(uint32_t *)(a)); \
uint32_t y = bswap_32(*(uint32_t *)(b)); \
return (x < y) ? -1 : (x > y) ? 1 : 0; }
//****************************************
#define RTE_CMP_8(a, b) { \
uint64_t x = bswap_64(*(uint64_t *)(a)); \
uint64_t y = bswap_64(*(uint64_t *)(b)); \
if (x != y) return (x < y) ? -1 : 1; }
#define _RTE_CMP_8(a, b) { \
uint64_t x = bswap_64(*(uint64_t *)(a)); \
uint64_t y = bswap_64(*(uint64_t *)(b)); \
return (x < y) ? -1 : (x > y) ? 1 : 0; }
static inline int_32 rte_memcmp(const void *_a, const void *_b, size_t
_size)
//*************************************************************************
{
uint8_t *a = (uint8_t *)_a;
uint8_t *b = (uint8_t *)_b;
ptrdiff_t size = _size;
uint64_t x, y;
ptrdiff_t i;
if (!size)
return 0;
RTE_CMP_1(a, b)
if (size >= 32)
goto cmp_long;
for (i = 0; i <= size - 16; i += 16, a += 16, b += 16)
{
RTE_CMP_8(a + 0, b + 0)
RTE_CMP_8(a + 8, b + 8)
}
...
}
Thanks.
> 190: 48 8b 4e 08 mov 0x8(%rsi),%rcx
> 194: 48 39 4f 08 cmp %rcx,0x8(%rdi)
> 198: 75 f3 jne 18d <memeq30+0xd>
>
> Also as you have full memcmp does in your gcc optimize out
> if (memcmp(x,y))
> like in mine?
>
> So run also implementation below in your benchmark, my guess is it will
> be faster.
>
> Original mail follows:
>
>
>
> Hi,
>
> I as glibc developer that wrote current strcmp code have some comments.
>
> First is that gcc builtins for *cmp are garbage that produce rep cmpsb
> which is slower than byte-by-byte loop. So compile your test again with
> -fno-builtin-memcmp and your performance gain will probably disappear.
>
> Then there is inlining. Its correct to do that for first 32 bytes and I
> plan to add header that does that check to improve performance. However
> not for bytes after 32'th. Thats very cold code, Only 5.6% calls reach
> 17th byte and 1.7% of calls read 33'th byte, so just do libcall to save
> size.
>
> That also makes avx2 pointless, for most string funtions avx2 doesn't
> give you gains as xmm for first 64 bytes has better latency and while
> loop is faster its also relatively cold as its almost never reached.
>
> For memcmp I posted on gcc list a sample implementation how it should do
> inlining. I found that gcc optimizes that better than expected and
> produces probably optimal header (see below and feel free to use it).
>
> When you care about sign then its better to load first 8 bytes, convert
> them to big endian where can you compare directly. When you don't gcc
> managed to optimize away bswap so you check 8 bytes with three
> instructions below. Now I think that in header we shouldn't use sse at
> all.
>
> 190: 48 8b 4e 08 mov 0x8(%rsi),%rcx
> 194: 48 39 4f 08 cmp %rcx,0x8(%rdi)
> 198: 75 f3 jne 18d <memeq30+0xd>
>
> As I mentioned statistics on my computer memcmp has following:
>
> calls 1430827
> average n: 7.4 n <= 0: 0.1% n <= 4: 36.3% n <= 8: 78.4% n <=
> 16: 94.4% n <= 24: 97.3% n <= 32: 98.3% n <= 48: 98.6% n <= 64:
> 99.9%
> s aligned to 4 bytes: 99.8% 8 bytes: 97.5% 16 bytes: 59.5%
> average *s access cache latency 3.6 l <= 8: 92.0% l <= 16: 96.1%
> l <= 32: 98.9% l <= 64: 99.4% l <= 128: 99.5%
> s2 aligned to 4 bytes: 24.1% 8 bytes: 13.1% 16 bytes: 8.2%
> s-s2 aligned to 4 bytes: 24.1% 8 bytes: 15.4% 16 bytes: 10.3%
> average *s2 access cache latency 1.5 l <= 8: 98.0% l <= 16:
> 99.6% l <= 32: 99.9% l <= 64: 100.0% l <= 128: 100.0%
> average capacity: 8.5 c <= 0: 0.0% c <= 4: 36.0% c <= 8: 78.3%
> c <= 16: 91.8% c <= 24: 94.8% c <= 32: 95.7% c <= 48: 96.1% c <= 64:
> 99.9%
>
> #include <string.h>
> #include <stdint.h>
>
> #undef memcmp
> #define memcmp(x, y, n) (__builtin_constant_p (n) && n < 64 ?
> __memcmp_inline (x, y, n) \
> : memcmp (x, y, n))
>
> #define LOAD8(x) (*((uint8_t *) (x)))
> #define LOAD32(x) (*((uint32_t *) (x)))
> #define LOAD64(x) (*((uint64_t *) (x)))
>
> #define CHECK(tp, n)
> #if __BYTE_ORDER == __LITTLE_ENDIAN
> # define SWAP32(x) __builtin_bswap32 (LOAD32 (x))
> # define SWAP64(x) __builtin_bswap64 (LOAD64 (x))
> #else
> # define SWAP32(x) LOAD32 (x)
> # define SWAP64(x) LOAD64 (x)
> #endif
>
> #define __ARCH_64BIT 1
>
> static __always_inline
> int
> check (uint64_t x, uint64_t y)
> {
> if (x == y)
> return 0;
> if (x > y)
> return 1;
>
> return -1;
> }
>
> static __always_inline
> int
> check_nonzero (uint64_t x, uint64_t y)
> {
> if (x > y)
> return 1;
>
> return -1;
> }
>
>
> static __always_inline
> int
> __memcmp_inline (void *x, void *y, size_t n)
> {
> #define CHECK1 if (LOAD8 (x + i) - LOAD8 (y + i)) \
> return check_nonzero (LOAD8 (x + i), LOAD8 (y + i)); i = i + 1;
> #define CHECK4 if (i == 0 ? SWAP32 (x + i) - SWAP32 (y + i)\
> : LOAD32 (x + i) - LOAD32 (y + i)) \
> return check_nonzero (SWAP32 (x + i), SWAP32 (y + i)); i = i + 4;
> #define CHECK8 if (i == 0 ? SWAP64 (x + i) - SWAP64 (y + i)\
> : LOAD64 (x + i) - LOAD64 (y + i)) \
> return check_nonzero (SWAP64 (x + i), SWAP64 (y + i)); i = i + 8;
>
> #define CHECK1FINAL(o) return check (LOAD8 (x + i + o), LOAD8 (y + i + o));
> #define CHECK4FINAL(o) return check (SWAP32 (x + i + o), SWAP32 (y + i +
> o));
> #define CHECK8FINAL(o) return check (SWAP64 (x + i + o), SWAP64 (y + i +
> o));
>
> #if __ARCH_64BIT == 0
> # undef CHECK8
> # undef CHECK8FINAL
> # define CHECK8 CHECK4 CHECK4
> # define CHECK8FINAL(o) CHECK4 CHECK4FINAL (o)
> #endif
>
> #define LOOP if (i + 8 < n) { CHECK8 } \
> if (i + 8 < n) { CHECK8 } \
> if (i + 8 < n) { CHECK8 } \
> if (i + 8 < n) { CHECK8 } \
> if (i + 8 < n) { CHECK8 } \
> if (i + 8 < n) { CHECK8 } \
> if (i + 8 < n) { CHECK8 } \
> if (i + 8 < n) { CHECK8 }
>
>
> long i = 0;
>
> switch (n % 8)
> {
> case 0:
> if (n == 0)
> return 0;
>
> LOOP; CHECK8FINAL (0);
> case 1:
> LOOP CHECK1FINAL (0);
> case 2:
> if (n == 2)
> {
> CHECK1 CHECK1FINAL (0);
> }
> LOOP CHECK4FINAL (-2);
> case 3:
> if (n == 3)
> {
> CHECK1 CHECK1 CHECK1FINAL (0);
> }
> LOOP CHECK4FINAL (-1);
> case 4:
> LOOP CHECK4FINAL (0);
> case 5:
> if (n == 5)
> {
> CHECK4 CHECK1FINAL (0);
> }
> #if __ARCH_64BIT
> LOOP CHECK8FINAL (-3);
> #else
> LOOP CHECK4 CHECK1FINAL (0);
> #endif
> case 6:
> if (n == 6)
> {
> CHECK4 CHECK4FINAL (-2);
> }
> LOOP CHECK8FINAL (-2);
> case 7:
> if (n == 7)
> {
> CHECK4 CHECK4FINAL (-1);
> }
> LOOP CHECK8FINAL (-1);
> }
> }
>
> int
> memcmp1 (char *x, char *y)
> {
> return memcmp (x, y, 1);
> }
> int
> memcmp10 (char *x, char *y)
> {
> return memcmp (x, y, 10);
> }
> int
> memcmp20 (char *x, char *y)
> {
> return memcmp (x, y, 20);
> }
> int
> memcmp30 (char *x, char *y)
> {
> return memcmp (x, y, 30);
> }
>
> int
> memeq1 (char *x, char *y)
> {
> return memcmp (x, y, 1) != 0;
> }
> int
> memeq10 (char *x, char *y)
> {
> return memcmp (x, y, 10) != 0;
> }
> int
> memeq20 (char *x, char *y)
> {
> return memcmp (x, y, 20) != 0;
> }
> int
> memeq30 (char *x, char *y)
> {
> return memcmp (x, y, 30) != 0;
> }
>
>
prev parent reply other threads:[~2015-06-15 20:47 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-05-18 20:01 Ravi Kerur
2015-05-18 20:01 ` [dpdk-dev] [PATCH v3] Implement memcmp using Intel SIMD instrinsics Ravi Kerur
2015-10-14 0:32 ` Stephen Hemminger
2016-01-28 3:08 ` [dpdk-dev] [dpdk-dev, " Zhihong Wang
2016-02-19 17:50 ` Ravi Kerur
2016-02-23 12:22 ` Wang, Zhihong
2016-02-24 4:00 ` Ravi Kerur
2015-06-12 8:30 ` [dpdk-dev] [PATCH v3] Implement memcmp using SIMD intrinsics Ondřej Bílka
2015-06-12 9:03 ` Bruce Richardson
2015-06-15 20:47 ` Ravi Kerur [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CAFb4SLDaWiaGiES1tgW7y3r4AJzAAKzO-ecZv6M0p-=o4dET_w@mail.gmail.com' \
--to=rkerur@gmail.com \
--cc=dev@dpdk.org \
--cc=neleai@seznam.cz \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).