DPDK patches and discussions
* [dpdk-dev] [PATCH v1 0/2] rte_memcmp functions
@ 2016-03-07 22:59 Ravi Kerur
  2016-03-07 23:00 ` [dpdk-dev] [PATCH v1 1/2] rte_memcmp functions using Intel AVX and SSE intrinsics Ravi Kerur
  0 siblings, 1 reply; 13+ messages in thread
From: Ravi Kerur @ 2016-03-07 22:59 UTC (permalink / raw)
  To: dev

This patch set provides an AVX/SSE-based memcmp implementation for x86.
For the other architectures supported by DPDK, rte_memcmp simply falls
back to the standard memcmp() function.
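
For reference, a minimal usage sketch (illustrative only, not part of the
patch; the helper name and buffer roles are assumptions). Since rte_memcmp()
keeps memcmp() semantics, existing call sites can switch over directly:

#include <rte_memcmp.h>

/* Hypothetical helper: returns 1 when two lookup keys match, 0 otherwise. */
static int
key_equal(const void *key_1, const void *key_2, size_t key_len)
{
	/* rte_memcmp() returns 0 on equality, <0/>0 otherwise, like memcmp(). */
	return rte_memcmp(key_1, key_2, key_len) == 0;
}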

The following are preliminary performance numbers measured on an Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz:

RTE>>memcmp_perf_autotest 
 *** RTE memcmp equal performance test results ***
 *** Length (bytes), Ticks/Op. ***
 ***          2,     4.8526 ***
 ***          5,     5.4023 ***
 ***          8,     4.5067 ***
 ***          9,     5.4024 ***
 ***         15,     7.2069 ***
 ***         16,     4.5027 ***
 ***         17,     4.5020 ***
 ***         31,     4.5020 ***
 ***         32,     4.5033 ***
 ***         33,     5.1377 ***
 ***         63,     6.9069 ***
 ***         64,     6.9472 ***
 ***         65,     9.6301 ***
 ***        127,    13.5122 ***
 ***        128,    10.8028 ***
 ***        129,    11.7058 ***
 ***        191,    14.4105 ***
 ***        192,    14.4251 ***
 ***        193,    16.2139 ***
 ***        255,    18.0125 ***
 ***        256,    17.1150 ***
 ***        257,    18.9129 ***
 ***        319,    20.7148 ***
 ***        320,    20.7161 ***
 ***        321,    22.5198 ***
 ***        383,    24.3169 ***
 ***        384,    22.5195 ***
 ***        385,    24.3197 ***
 ***        447,    26.1171 ***
 ***        448,    26.1289 ***
 ***        449,    27.9168 ***
 ***        511,    29.7252 ***
 ***        512,    29.7202 ***
 ***        513,    27.9253 ***
 ***        767,    38.7506 ***
 ***        768,    36.9327 ***
 ***        769,    38.7259 ***
 ***       1023,    49.5368 ***
 ***       1024,    49.5347 ***
 ***       1025,    46.8414 ***
 ***       1522,    68.4517 ***
 ***       1536,    68.4522 ***
 ***       1600,    67.5478 ***
 ***       2048,    87.3674 ***
 ***       2560,   106.2776 ***
 ***       3072,   125.1937 ***
 ***       3584,   144.1503 ***
 ***       4096,   163.0243 ***
 ***       4608,   181.9367 ***
 ***       5632,   219.7613 ***
 ***       6144,   238.6745 ***
 ***       6656,   257.6009 ***
 ***       7168,   276.5084 ***
 ***       7680,   295.4162 ***
 ***       8192,   314.3726 ***
 ***      16834,   746.1065 ***
 *** memcmp equal performance test results ***
 *** Length (bytes), Ticks/Op. ***
 ***          2,     9.0100 ***
 ***          5,     8.1065 ***
 ***          8,     9.1944 ***
 ***          9,     9.0044 ***
 ***         15,     9.0084 ***
 ***         16,    10.0695 ***
 ***         17,     9.0109 ***
 ***         31,     9.9111 ***
 ***         32,     9.9085 ***
 ***         33,     9.9112 ***
 ***         63,    12.6098 ***
 ***         64,    12.6106 ***
 ***         65,    12.6060 ***
 ***        127,    19.8160 ***
 ***        128,    19.8145 ***
 ***        129,    20.7260 ***
 ***        191,    26.1214 ***
 ***        192,    26.1195 ***
 ***        193,    26.1158 ***
 ***        255,    30.6222 ***
 ***        256,    30.6267 ***
 ***        257,    31.5270 ***
 ***        319,    36.0264 ***
 ***        320,    36.0497 ***
 ***        321,    36.9247 ***
 ***        383,    40.5290 ***
 ***        384,    40.5265 ***
 ***        385,    41.4331 ***
 ***        447,    45.9317 ***
 ***        448,    45.9324 ***
 ***        449,    45.9302 ***
 ***        511,    50.4652 ***
 ***        512,    50.4379 ***
 ***        513,    51.3361 ***
 ***        767,    67.5552 ***
 ***        768,    67.5464 ***
 ***        769,    67.5462 ***
 ***       1023,    85.5579 ***
 ***       1024,    85.5610 ***
 ***       1025,    85.5582 ***
 ***       1522,   120.6860 ***
 ***       1536,   121.6064 ***
 ***       1600,   126.1075 ***
 ***       2048,   157.6208 ***
 ***       2560,   208.8309 ***
 ***       3072,   241.7587 ***
 ***       3584,   276.1556 ***
 ***       4096,   310.5865 ***
 ***       4608,   343.8918 ***
 ***       5632,   411.2264 ***
 ***       6144,   445.3057 ***
 ***       6656,   480.4620 ***
 ***       7168,   512.5769 ***
 ***       7680,   547.9394 ***
 ***       8192,   582.7687 ***
 ***      16834,  1456.4280 ***
 *** RTE memcmp greater than performance test results ***
 *** Length (bytes), Ticks/Op. ***
 ***          1,    22.5862 ***
 ***          8,    24.9140 ***
 ***         15,    25.3942 ***
 ***         16,    22.1721 ***
 ***         32,    24.1650 ***
 ***         64,    25.0849 ***
 ***        128,    26.5515 ***
 ***        256,    28.7055 ***
 ***        512,    35.2811 ***
 ***       1024,    44.4520 ***
 ***       2048,    64.1331 ***
 ***       4096,   103.9949 ***
 ***       8192,   184.8077 ***
 ***      16384,   345.6785 ***
 *** memcmp greater than performance test results ***
 *** Length (bytes), Ticks/Op. ***
 ***          1,    22.6340 ***
 ***          8,    25.5552 ***
 ***         15,    25.4223 ***
 ***         16,    25.1371 ***
 ***         32,    26.7381 ***
 ***         64,    27.4521 ***
 ***        128,    29.7323 ***
 ***        256,    35.8891 ***
 ***        512,    46.0419 ***
 ***       1024,   101.1564 ***
 ***       2048,   159.8415 ***
 ***       4096,   230.2136 ***
 ***       8192,   366.2912 ***
 ***      16384,   647.0217 ***
 *** RTE memcmp less than performance test results ***
 *** Length (bytes), Ticks/Op. ***
 ***          1,    22.6627 ***
 ***          8,    26.2665 ***
 ***         15,    26.8192 ***
 ***         16,    21.7960 ***
 ***         32,    23.9878 ***
 ***         64,    24.2074 ***
 ***        128,    26.8111 ***
 ***        256,    28.3444 ***
 ***        512,    34.7882 ***
 ***       1024,    44.4824 ***
 ***       2048,    63.4154 ***
 ***       4096,   101.4360 ***
 ***       8192,   179.1029 ***
 ***      16384,   333.9357 ***
 *** memcmp less than performance test results ***
 *** Length (bytes), Ticks/Op. ***
 ***          1,    22.2894 ***
 ***          8,    24.9805 ***
 ***         15,    24.8632 ***
 ***         16,    24.3448 ***
 ***         32,    24.8554 ***
 ***         64,    25.7541 ***
 ***        128,    29.1831 ***
 ***        256,    36.2345 ***
 ***        512,    45.8233 ***
 ***       1024,   103.4597 ***
 ***       2048,   163.5588 ***
 ***       4096,   232.7368 ***
 ***       8192,   368.1143 ***
 ***      16384,   649.0326 ***
Test OK
RTE>>quit


Ravi Kerur (2):
  rte_memcmp functions using Intel AVX and SSE intrinsics
  Test cases for rte_memcmp functions

 app/test/Makefile                                  |  31 +-
 app/test/autotest_data.py                          |  19 +
 app/test/test_memcmp.c                             | 250 +++++++
 app/test/test_memcmp_perf.c                        | 396 +++++++++++
 .../common/include/arch/arm/rte_memcmp.h           |  60 ++
 .../common/include/arch/ppc_64/rte_memcmp.h        |  62 ++
 .../common/include/arch/tile/rte_memcmp.h          |  60 ++
 .../common/include/arch/x86/rte_memcmp.h           | 786 +++++++++++++++++++++
 lib/librte_eal/common/include/generic/rte_memcmp.h | 175 +++++
 9 files changed, 1838 insertions(+), 1 deletion(-)
 create mode 100644 app/test/test_memcmp.c
 create mode 100644 app/test/test_memcmp_perf.c
 create mode 100644 lib/librte_eal/common/include/arch/arm/rte_memcmp.h
 create mode 100644 lib/librte_eal/common/include/arch/ppc_64/rte_memcmp.h
 create mode 100644 lib/librte_eal/common/include/arch/tile/rte_memcmp.h
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcmp.h
 create mode 100644 lib/librte_eal/common/include/generic/rte_memcmp.h

-- 
1.9.1


* [dpdk-dev] [PATCH v1 1/2] rte_memcmp functions using Intel AVX and SSE intrinsics
  2016-03-07 22:59 [dpdk-dev] [PATCH v1 0/2] rte_memcmp functions Ravi Kerur
@ 2016-03-07 23:00 ` Ravi Kerur
  2016-03-07 23:00   ` [dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp functions Ravi Kerur
                     ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Ravi Kerur @ 2016-03-07 23:00 UTC (permalink / raw)
  To: dev

v1:
        This patch adds memcmp functionality using AVX and SSE
        intrinsics provided by Intel. For the other architectures
        supported by DPDK, the regular memcmp function is used.

        Compiled and tested on Ubuntu 14.04 (non-NUMA) and 15.10 (NUMA)
        systems.
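
To illustrate the approach, here is a simplified sketch of the SSE equality
fast path (a sketch only, not the code added by this patch; the helper name
cmp16_equal_only is illustrative, and the real rte_cmp16() below also derives
the -ve/+ve ordering when the two blocks differ):

#include <immintrin.h>

/* XOR the two 16-byte blocks; the result is all-zero only if they are equal. */
static inline int
cmp16_equal_only(const void *a, const void *b)
{
	__m128i x = _mm_lddqu_si128((const __m128i *)a);
	__m128i y = _mm_lddqu_si128((const __m128i *)b);
	__m128i d = _mm_xor_si128(x, y);

	/* _mm_testz_si128() (SSE4.1) returns 1 when all bits of d are zero. */
	return _mm_testz_si128(d, d) ? 0 : 1;
}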

Signed-off-by: Ravi Kerur <rkerur@gmail.com>
---
 .../common/include/arch/arm/rte_memcmp.h           |  60 ++
 .../common/include/arch/ppc_64/rte_memcmp.h        |  62 ++
 .../common/include/arch/tile/rte_memcmp.h          |  60 ++
 .../common/include/arch/x86/rte_memcmp.h           | 786 +++++++++++++++++++++
 lib/librte_eal/common/include/generic/rte_memcmp.h | 175 +++++
 5 files changed, 1143 insertions(+)
 create mode 100644 lib/librte_eal/common/include/arch/arm/rte_memcmp.h
 create mode 100644 lib/librte_eal/common/include/arch/ppc_64/rte_memcmp.h
 create mode 100644 lib/librte_eal/common/include/arch/tile/rte_memcmp.h
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcmp.h
 create mode 100644 lib/librte_eal/common/include/generic/rte_memcmp.h

diff --git a/lib/librte_eal/common/include/arch/arm/rte_memcmp.h b/lib/librte_eal/common/include/arch/arm/rte_memcmp.h
new file mode 100644
index 0000000..fcbacb4
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/arm/rte_memcmp.h
@@ -0,0 +1,60 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2016 RehiveTech. All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of RehiveTech nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+*/
+
+#ifndef _RTE_MEMCMP_ARM_H_
+#define _RTE_MEMCMP_ARM_H_
+
+#include <stdint.h>
+#include <string.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include "generic/rte_memcmp.h"
+
+#define rte_memcmp(dst, src, n)              \
+	({ (__builtin_constant_p(n)) ?       \
+	memcmp((dst), (src), (n)) :          \
+	rte_memcmp_func((dst), (src), (n)); })
+
+static inline int
+rte_memcmp_func(void *dst, const void *src, size_t n)
+{
+	return memcmp(dst, src, n);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_MEMCMP_ARM_H_ */
diff --git a/lib/librte_eal/common/include/arch/ppc_64/rte_memcmp.h b/lib/librte_eal/common/include/arch/ppc_64/rte_memcmp.h
new file mode 100644
index 0000000..5839a2d
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/ppc_64/rte_memcmp.h
@@ -0,0 +1,62 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) IBM Corporation 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of IBM Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+*/
+
+#ifndef _RTE_MEMCMP_PPC_64_H_
+#define _RTE_MEMCMP_PPC_64_H_
+
+#include <stdint.h>
+#include <string.h>
+/* To include altivec.h, GCC version must be >= 4.8 */
+#include <altivec.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include "generic/rte_memcmp.h"
+
+#define rte_memcmp(dst, src, n)              \
+	({ (__builtin_constant_p(n)) ?       \
+	memcmp((dst), (src), (n)) :          \
+	rte_memcmp_func((dst), (src), (n)); })
+
+static inline int
+rte_memcmp_func(void *dst, const void *src, size_t n)
+{
+	return memcmp(dst, src, n);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_MEMCMP_PPC_64_H_ */
diff --git a/lib/librte_eal/common/include/arch/tile/rte_memcmp.h b/lib/librte_eal/common/include/arch/tile/rte_memcmp.h
new file mode 100644
index 0000000..de35ac5
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/tile/rte_memcmp.h
@@ -0,0 +1,60 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) EZchip Semiconductor Ltd. 2016.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of EZchip Semiconductor nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+*/
+
+#ifndef _RTE_MEMCMP_TILE_H_
+#define _RTE_MEMCMP_TILE_H_
+
+#include <stdint.h>
+#include <string.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include "generic/rte_memcmp.h"
+
+#define rte_memcmp(dst, src, n)              \
+	({ (__builtin_constant_p(n)) ?       \
+	memcmp((dst), (src), (n)) :          \
+	rte_memcmp_func((dst), (src), (n)); })
+
+static inline int
+rte_memcmp_func(void *dst, const void *src, size_t n)
+{
+	return memcmp(dst, src, n);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_MEMCMP_TILE_H_ */
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcmp.h b/lib/librte_eal/common/include/arch/x86/rte_memcmp.h
new file mode 100644
index 0000000..00d0d31
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcmp.h
@@ -0,0 +1,786 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2016 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_MEMCMP_X86_64_H_
+#define _RTE_MEMCMP_X86_64_H_
+
+/**
+ * @file
+ *
+ * Functions for SSE/AVX/AVX2 implementation of memcmp().
+ */
+
+#include <stdio.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include <stdlib.h>
+
+#include <rte_vect.h>
+#include <rte_branch_prediction.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Compare bytes between two locations. The locations must not overlap.
+ *
+ * @param src_1
+ *   Pointer to the first source of the data.
+ * @param src_2
+ *   Pointer to the second source of the data.
+ * @param n
+ *   Number of bytes to compare.
+ * @return
+ *   zero if src_1 equal src_2
+ *   -ve if src_1 less than src_2
+ *   +ve if src_1 greater than src_2
+ */
+static inline int
+rte_memcmp(const void *src_1, const void *src_2,
+		size_t n) __attribute__((always_inline));
+
+/**
+ * Find the first different byte for comparison.
+ */
+static inline int
+rte_cmpffdb(const uint8_t *x, const uint8_t *y, size_t n)
+{
+	size_t i;
+
+	for (i = 0; i < n; i++)
+		if (x[i] != y[i])
+			return x[i] - y[i];
+	return 0;
+}
+
+/**
+ * Compare 0 to 15 bytes between two locations.
+ * Locations should not overlap.
+ */
+static inline int
+rte_memcmp_regular(const uint8_t *src_1u, const uint8_t *src_2u, size_t n)
+{
+	int ret = 1;
+
+	/**
+	 * Compare less than 16 bytes
+	 */
+	if (n & 0x01) {
+		ret = (*(const uint8_t *)src_1u ==
+			*(const uint8_t *)src_2u);
+
+		if ((ret != 1))
+			goto exit_1;
+
+		n -= 0x1;
+		src_1u += 0x1;
+		src_2u += 0x1;
+	}
+
+	if (n & 0x02) {
+		ret = (*(const uint16_t *)src_1u ==
+			*(const uint16_t *)src_2u);
+
+		if ((ret != 1))
+			goto exit_2;
+
+		n -= 0x2;
+		src_1u += 0x2;
+		src_2u += 0x2;
+	}
+
+	if (n & 0x04) {
+		ret = (*(const uint32_t *)src_1u ==
+			*(const uint32_t *)src_2u);
+
+		if ((ret != 1))
+			goto exit_4;
+
+		n -= 0x4;
+		src_1u += 0x4;
+		src_2u += 0x4;
+	}
+
+	if (n & 0x08) {
+		ret = (*(const uint64_t *)src_1u ==
+			*(const uint64_t *)src_2u);
+
+		if ((ret != 1))
+			goto exit_8;
+
+		n -= 0x8;
+		src_1u += 0x8;
+		src_2u += 0x8;
+	}
+
+	return !ret;
+
+exit_1:
+	return rte_cmpffdb(src_1u, src_2u, 1);
+exit_2:
+	return rte_cmpffdb(src_1u, src_2u, 2);
+exit_4:
+	return rte_cmpffdb(src_1u, src_2u, 4);
+exit_8:
+	return rte_cmpffdb(src_1u, src_2u, 8);
+}
+
+/**
+ * Compare 16 bytes between two locations.
+ * Locations should not overlap.
+ */
+static inline int
+rte_cmp16(const void *src_1, const void *src_2)
+{
+	__m128i xmm0, xmm1, xmm2;
+
+	xmm0 = _mm_lddqu_si128((const __m128i *)src_1);
+	xmm1 = _mm_lddqu_si128((const __m128i *)src_2);
+
+	xmm2 = _mm_xor_si128(xmm0, xmm1);
+
+	if (unlikely(!_mm_testz_si128(xmm2, xmm2))) {
+		__m128i idx =
+			_mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
+
+		/*
+		 * Reverse byte order
+		 */
+		xmm0 = _mm_shuffle_epi8(xmm0, idx);
+		xmm1 = _mm_shuffle_epi8(xmm1, idx);
+
+		/*
+		 * Compare unsigned bytes with instructions for signed bytes
+		 */
+		xmm0 = _mm_xor_si128(xmm0, _mm_set1_epi8(0x80));
+		xmm1 = _mm_xor_si128(xmm1, _mm_set1_epi8(0x80));
+
+		return _mm_movemask_epi8(xmm0 > xmm1) - _mm_movemask_epi8(xmm1 > xmm0);
+	}
+
+	return 0;
+}
+
+/**
+ * AVX2 implementation below
+ */
+#ifdef RTE_MACHINE_CPUFLAG_AVX2
+
+static inline int
+rte_cmp32(const void *src_1, const void *src_2)
+{
+	__m256i    ff = _mm256_set1_epi32(-1);
+	__m256i    idx = _mm256_setr_epi8(
+			15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0,
+			15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
+	__m256i    sign = _mm256_set1_epi32(0x80000000);
+	__m256i    mm11, mm21;
+	__m256i    eq, gt0, gt1;
+
+	mm11 = _mm256_lddqu_si256((const __m256i *)src_1);
+	mm21 = _mm256_lddqu_si256((const __m256i *)src_2);
+
+	eq = _mm256_cmpeq_epi32(mm11, mm21);
+	/* Not equal */
+	if (!_mm256_testc_si256(eq, ff)) {
+		mm11 = _mm256_shuffle_epi8(mm11, idx);
+		mm21 = _mm256_shuffle_epi8(mm21, idx);
+
+		mm11 = _mm256_xor_si256(mm11, sign);
+		mm21 = _mm256_xor_si256(mm21, sign);
+		mm11 = _mm256_permute2f128_si256(mm11, mm11, 0x01);
+		mm21 = _mm256_permute2f128_si256(mm21, mm21, 0x01);
+
+		gt0 = _mm256_cmpgt_epi32(mm11, mm21);
+		gt1 = _mm256_cmpgt_epi32(mm21, mm11);
+		return _mm256_movemask_ps(_mm256_castsi256_ps(gt0)) - _mm256_movemask_ps(_mm256_castsi256_ps(gt1));
+	}
+
+	return 0;
+}
+
+/**
+ * Compare 48 bytes between two locations.
+ * Locations should not overlap.
+ */
+static inline int
+rte_cmp48(const void *src_1, const void *src_2)
+{
+	int ret;
+
+	ret = rte_cmp32((const uint8_t *)src_1 + 0 * 32,
+			(const uint8_t *)src_2 + 0 * 32);
+
+	if (unlikely(ret != 0))
+		return ret;
+
+	ret = rte_cmp16((const uint8_t *)src_1 + 1 * 32,
+			(const uint8_t *)src_2 + 1 * 32);
+	return ret;
+}
+
+/**
+ * Compare 64 bytes between two locations.
+ * Locations should not overlap.
+ */
+static inline int
+rte_cmp64(const void *src_1, const void *src_2)
+{
+	const __m256i *src1 = (const __m256i *)src_1;
+	const __m256i *src2 = (const __m256i *)src_2;
+
+	__m256i mm11 = _mm256_lddqu_si256(src1);
+	__m256i mm12 = _mm256_lddqu_si256(src1 + 1);
+	__m256i mm21 = _mm256_lddqu_si256(src2);
+	__m256i mm22 = _mm256_lddqu_si256(src2 + 1);
+
+	__m256i mm1 = _mm256_xor_si256(mm11, mm21);
+	__m256i mm2 = _mm256_xor_si256(mm12, mm22);
+	__m256i mm = _mm256_or_si256(mm1, mm2);
+
+	if (unlikely(!_mm256_testz_si256(mm, mm))) {
+
+		__m256i idx = _mm256_setr_epi8(
+				15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0,
+				15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
+		__m256i sign = _mm256_set1_epi32(0x80000000);
+		__m256i gt0, gt1;
+
+		/*
+		 * Find out which of the two 32-byte blocks
+		 * are different.
+		 */
+		if (_mm256_testz_si256(mm1, mm1)) {
+			mm11 = mm12;
+			mm21 = mm22;
+			mm1 = mm2;
+		}
+
+		mm11 = _mm256_shuffle_epi8(mm11, idx);
+		mm21 = _mm256_shuffle_epi8(mm21, idx);
+
+		mm11 = _mm256_xor_si256(mm11, sign);
+		mm21 = _mm256_xor_si256(mm21, sign);
+		mm11 = _mm256_permute2f128_si256(mm11, mm11, 0x01);
+		mm21 = _mm256_permute2f128_si256(mm21, mm21, 0x01);
+
+		gt0 = _mm256_cmpgt_epi32(mm11, mm21);
+		gt1 = _mm256_cmpgt_epi32(mm21, mm11);
+		return _mm256_movemask_ps(_mm256_castsi256_ps(gt0)) - _mm256_movemask_ps(_mm256_castsi256_ps(gt1));
+	}
+
+	return 0;
+}
+
+/**
+ * Compare 128 bytes between two locations.
+ * Locations should not overlap.
+ */
+static inline int
+rte_cmp128(const void *src_1, const void *src_2)
+{
+	int ret;
+
+	ret = rte_cmp64((const uint8_t *)src_1 + 0 * 64,
+			(const uint8_t *)src_2 + 0 * 64);
+
+	if (unlikely(ret != 0))
+		return ret;
+
+	return rte_cmp64((const uint8_t *)src_1 + 1 * 64,
+			(const uint8_t *)src_2 + 1 * 64);
+}
+
+/**
+ * Compare 256 bytes between two locations.
+ * Locations should not overlap.
+ */
+static inline int
+rte_cmp256(const void *src_1, const void *src_2)
+{
+	int ret;
+
+	ret = rte_cmp64((const uint8_t *)src_1 + 0 * 64,
+			(const uint8_t *)src_2 + 0 * 64);
+
+	if (unlikely(ret != 0))
+		return ret;
+
+	ret = rte_cmp64((const uint8_t *)src_1 + 1 * 64,
+			(const uint8_t *)src_2 + 1 * 64);
+
+	if (unlikely(ret != 0))
+		return ret;
+
+	ret = rte_cmp64((const uint8_t *)src_1 + 2 * 64,
+			(const uint8_t *)src_2 + 2 * 64);
+
+	if (unlikely(ret != 0))
+		return ret;
+
+	return rte_cmp64((const uint8_t *)src_1 + 3 * 64,
+			(const uint8_t *)src_2 + 3 * 64);
+}
+
+/**
+ * Compare bytes between two locations. The locations must not overlap.
+ *
+ * @param src_1
+ *   Pointer to the first source of the data.
+ * @param src_2
+ *   Pointer to the second source of the data.
+ * @param n
+ *   Number of bytes to compare.
+ * @return
+ *   zero if src_1 equal src_2
+ *   -ve if src_1 less than src_2
+ *   +ve if src_1 greater than src_2
+ */
+static inline int
+rte_memcmp(const void *_src_1, const void *_src_2, size_t n)
+{
+	const uint8_t *src_1 = (const uint8_t *)_src_1;
+	const uint8_t *src_2 = (const uint8_t *)_src_2;
+	int ret = 0;
+
+	if (n < 16)
+		return rte_memcmp_regular(src_1, src_2, n);
+
+	if (n <= 32) {
+		ret = rte_cmp16(src_1, src_2);
+		if (unlikely(ret != 0))
+			return ret;
+
+		return rte_cmp16(src_1 - 16 + n, src_2 - 16 + n);
+	}
+
+	if (n <= 48) {
+		ret = rte_cmp32(src_1, src_2);
+		if (unlikely(ret != 0))
+			return ret;
+
+		return rte_cmp16(src_1 - 16 + n, src_2 - 16 + n);
+	}
+
+	if (n <= 64) {
+		ret = rte_cmp32(src_1, src_2);
+		if (unlikely(ret != 0))
+			return ret;
+
+		ret = rte_cmp16(src_1 + 32, src_2 + 32);
+
+		if (unlikely(ret != 0))
+			return ret;
+
+		return rte_cmp16(src_1 - 16 + n, src_2 - 16 + n);
+	}
+
+CMP_BLOCK_LESS_THAN_512:
+	if (n <= 512) {
+		if (n >= 256) {
+			ret = rte_cmp256(src_1, src_2);
+			if (unlikely(ret != 0))
+				return ret;
+			src_1 = src_1 + 256;
+			src_2 = src_2 + 256;
+			n -= 256;
+		}
+		if (n >= 128) {
+			ret = rte_cmp128(src_1, src_2);
+			if (unlikely(ret != 0))
+				return ret;
+			src_1 = src_1 + 128;
+			src_2 = src_2 + 128;
+			n -= 128;
+		}
+		if (n >= 64) {
+			n -= 64;
+			ret = rte_cmp64(src_1, src_2);
+			if (unlikely(ret != 0))
+				return ret;
+			src_1 = src_1 + 64;
+			src_2 = src_2 + 64;
+		}
+		if (n > 32) {
+			ret = rte_cmp32(src_1, src_2);
+			if (unlikely(ret != 0))
+				return ret;
+			ret = rte_cmp32(src_1 - 32 + n, src_2 - 32 + n);
+			return ret;
+		}
+		if (n > 0)
+			ret = rte_cmp32(src_1 - 32 + n, src_2 - 32 + n);
+
+		return ret;
+	}
+
+	while (n > 512) {
+		ret = rte_cmp256(src_1 + 0 * 256, src_2 + 0 * 256);
+		if (unlikely(ret != 0))
+			return ret;
+
+		ret = rte_cmp256(src_1 + 1 * 256, src_2 + 1 * 256);
+		if (unlikely(ret != 0))
+			return ret;
+
+		src_1 = src_1 + 512;
+		src_2 = src_2 + 512;
+		n -= 512;
+	}
+	goto CMP_BLOCK_LESS_THAN_512;
+}
+
+#else /* RTE_MACHINE_CPUFLAG_AVX2 */
+
+/**
+ * Compare 32 bytes between two locations.
+ * Locations should not overlap.
+ */
+static inline int
+rte_cmp32(const void *src_1, const void *src_2)
+{
+	const __m128i *src1 = (const __m128i *)src_1;
+	const __m128i *src2 = (const __m128i *)src_2;
+
+	__m128i mm11 = _mm_lddqu_si128(src1);
+	__m128i mm12 = _mm_lddqu_si128(src1 + 1);
+	__m128i mm21 = _mm_lddqu_si128(src2);
+	__m128i mm22 = _mm_lddqu_si128(src2 + 1);
+
+	__m128i mm1 = _mm_xor_si128(mm11, mm21);
+	__m128i mm2 = _mm_xor_si128(mm12, mm22);
+	__m128i mm = _mm_or_si128(mm1, mm2);
+
+	if (unlikely(!_mm_testz_si128(mm, mm))) {
+
+		__m128i idx =
+			_mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
+		/*
+		 * Find out which of the two 16-byte blocks
+		 * are different.
+		 */
+		if (_mm_testz_si128(mm1, mm1)) {
+			mm11 = mm12;
+			mm21 = mm22;
+			mm1 = mm2;
+		}
+
+		/*
+		 * Reverse byte order.
+		 */
+		mm11 = _mm_shuffle_epi8(mm11, idx);
+		mm21 = _mm_shuffle_epi8(mm21, idx);
+
+		/*
+		 * Compare unsigned bytes with instructions for
+		 * signed bytes.
+		 */
+		mm11 = _mm_xor_si128(mm11, _mm_set1_epi8(0x80));
+		mm21 = _mm_xor_si128(mm21, _mm_set1_epi8(0x80));
+
+		return _mm_movemask_epi8(mm11 > mm21) -
+				_mm_movemask_epi8(mm21 > mm11);
+	}
+
+	return 0;
+}
+
+/**
+ * Compare 48 bytes between two locations.
+ * Locations should not overlap.
+ */
+static inline int
+rte_cmp48(const void *src_1, const void *src_2)
+{
+	int ret;
+
+	ret = rte_cmp16((const uint8_t *)src_1 + 0 * 16,
+			(const uint8_t *)src_2 + 0 * 16);
+
+	if (unlikely(ret != 0))
+		return ret;
+
+	ret = rte_cmp16((const uint8_t *)src_1 + 1 * 16,
+			(const uint8_t *)src_2 + 1 * 16);
+
+	if (unlikely(ret != 0))
+		return ret;
+
+	return rte_cmp16((const uint8_t *)src_1 + 2 * 16,
+			(const uint8_t *)src_2 + 2 * 16);
+}
+
+/**
+ * Compare 64 bytes between two locations.
+ * Locations should not overlap.
+ */
+static inline int
+rte_cmp64(const void *src_1, const void *src_2)
+{
+	int ret;
+
+	ret = rte_cmp16((const uint8_t *)src_1 + 0 * 16,
+			(const uint8_t *)src_2 + 0 * 16);
+
+	if (unlikely(ret != 0))
+		return ret;
+
+	ret = rte_cmp16((const uint8_t *)src_1 + 1 * 16,
+			(const uint8_t *)src_2 + 1 * 16);
+
+	if (unlikely(ret != 0))
+		return ret;
+
+	ret = rte_cmp16((const uint8_t *)src_1 + 2 * 16,
+			(const uint8_t *)src_2 + 2 * 16);
+
+	if (unlikely(ret != 0))
+		return ret;
+
+	return rte_cmp16((const uint8_t *)src_1 + 3 * 16,
+			(const uint8_t *)src_2 + 3 * 16);
+}
+
+/**
+ * Compare 128 bytes or its multiple between two locations.
+ * Locations should not overlap.
+ */
+static inline int
+rte_cmp128(const void *src_1, const void *src_2)
+{
+	int ret;
+
+	ret = rte_cmp32((const uint8_t *)src_1 + 0 * 32,
+			(const uint8_t *)src_2 + 0 * 32);
+
+	if (unlikely(ret != 0))
+		return ret;
+
+	ret = rte_cmp32((const uint8_t *)src_1 + 1 * 32,
+			(const uint8_t *)src_2 + 1 * 32);
+
+	if (unlikely(ret != 0))
+		return ret;
+
+	ret = rte_cmp32((const uint8_t *)src_1 + 2 * 32,
+			(const uint8_t *)src_2 + 2 * 32);
+
+	if (unlikely(ret != 0))
+		return ret;
+
+	return rte_cmp32((const uint8_t *)src_1 + 3 * 32,
+			(const uint8_t *)src_2 + 3 * 32);
+}
+
+/**
+ * Compare 256 bytes between two locations.
+ * Locations should not overlap.
+ */
+static inline int
+rte_cmp256(const void *src_1, const void *src_2)
+{
+	int ret;
+
+	ret = rte_cmp32((const uint8_t *)src_1 + 0 * 32,
+			(const uint8_t *)src_2 + 0 * 32);
+
+	if (unlikely(ret != 0))
+		return ret;
+
+	ret = rte_cmp32((const uint8_t *)src_1 + 1 * 32,
+			(const uint8_t *)src_2 + 1 * 32);
+
+	if (unlikely(ret != 0))
+		return ret;
+
+	ret = rte_cmp32((const uint8_t *)src_1 + 2 * 32,
+			(const uint8_t *)src_2 + 2 * 32);
+
+	if (unlikely(ret != 0))
+		return ret;
+
+	ret = rte_cmp32((const uint8_t *)src_1 + 3 * 32,
+			(const uint8_t *)src_2 + 3 * 32);
+
+	if (unlikely(ret != 0))
+		return ret;
+
+	ret = rte_cmp32((const uint8_t *)src_1 + 4 * 32,
+			(const uint8_t *)src_2 + 4 * 32);
+
+	if (unlikely(ret != 0))
+		return ret;
+
+	ret = rte_cmp32((const uint8_t *)src_1 + 5 * 32,
+			(const uint8_t *)src_2 + 5 * 32);
+
+	if (unlikely(ret != 0))
+		return ret;
+
+	ret = rte_cmp32((const uint8_t *)src_1 + 6 * 32,
+			(const uint8_t *)src_2 + 6 * 32);
+
+	if (unlikely(ret != 0))
+		return ret;
+
+	return rte_cmp32((const uint8_t *)src_1 + 7 * 32,
+			(const uint8_t *)src_2 + 7 * 32);
+}
+
+/**
+ * Compare bytes between two locations. The locations must not overlap.
+ *
+ * @param src_1
+ *   Pointer to the first source of the data.
+ * @param src_2
+ *   Pointer to the second source of the data.
+ * @param n
+ *   Number of bytes to compare.
+ * @return
+ *   zero if src_1 equal src_2
+ *   -ve if src_1 less than src_2
+ *   +ve if src_1 greater than src_2
+ */
+static inline int
+rte_memcmp(const void *_src_1, const void *_src_2, size_t n)
+{
+	const uint8_t *src_1 = (const uint8_t *)_src_1;
+	const uint8_t *src_2 = (const uint8_t *)_src_2;
+	int ret = 0;
+
+	if (n < 16)
+		return rte_memcmp_regular(src_1, src_2, n);
+
+	if (n <= 32) {
+		ret = rte_cmp16(src_1, src_2);
+		if (unlikely(ret != 0))
+			return ret;
+
+		return rte_cmp16(src_1 - 16 + n, src_2 - 16 + n);
+	}
+
+	if (n <= 48) {
+		ret = rte_cmp32(src_1, src_2);
+		if (unlikely(ret != 0))
+			return ret;
+
+		return rte_cmp16(src_1 - 16 + n, src_2 - 16 + n);
+	}
+
+	if (n <= 64) {
+		ret = rte_cmp32(src_1, src_2);
+		if (unlikely(ret != 0))
+			return ret;
+
+		ret = rte_cmp16(src_1 + 32, src_2 + 32);
+
+		if (unlikely(ret != 0))
+			return ret;
+
+		return rte_cmp16(src_1 - 16 + n, src_2 - 16 + n);
+	}
+
+	if (n <= 512) {
+		if (n >= 256) {
+			ret = rte_cmp256(src_1, src_2);
+			if (unlikely(ret != 0))
+				return ret;
+
+			src_1 = src_1 + 256;
+			src_2 = src_2 + 256;
+			n -= 256;
+		}
+
+CMP_BLOCK_LESS_THAN_256:
+		if (n >= 128) {
+			ret = rte_cmp128(src_1, src_2);
+			if (unlikely(ret != 0))
+				return ret;
+
+			src_1 = src_1 + 128;
+			src_2 = src_2 + 128;
+			n -= 128;
+		}
+
+		if (n >= 64) {
+			ret = rte_cmp64(src_1, src_2);
+			if (unlikely(ret != 0))
+				return ret;
+
+			src_1 = src_1 + 64;
+			src_2 = src_2 + 64;
+			n -= 64;
+		}
+
+		if (n >= 32) {
+			ret = rte_cmp32(src_1, src_2);
+			if (unlikely(ret != 0))
+				return ret;
+			src_1 = src_1 + 32;
+			src_2 = src_2 + 32;
+			n -= 32;
+		}
+		if (n > 16) {
+			ret = rte_cmp16(src_1, src_2);
+			if (unlikely(ret != 0))
+				return ret;
+			ret = rte_cmp16(src_1 - 16 + n, src_2 - 16 + n);
+			return ret;
+		}
+		if (n > 0)
+			ret = rte_cmp16(src_1 - 16 + n, src_2 - 16 + n);
+
+		return ret;
+	}
+
+	for (; n >= 256; n -= 256) {
+		ret = rte_cmp256(src_1, src_2);
+		if (unlikely(ret != 0))
+			return ret;
+
+		src_1 = src_1 + 256;
+		src_2 = src_2 + 256;
+	}
+
+	goto CMP_BLOCK_LESS_THAN_256;
+}
+
+#endif /* RTE_MACHINE_CPUFLAG_AVX2 */
+
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_MEMCMP_X86_64_H_ */
diff --git a/lib/librte_eal/common/include/generic/rte_memcmp.h b/lib/librte_eal/common/include/generic/rte_memcmp.h
new file mode 100644
index 0000000..1f8f2bd
--- /dev/null
+++ b/lib/librte_eal/common/include/generic/rte_memcmp.h
@@ -0,0 +1,175 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2016 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_MEMCMP_H_
+#define _RTE_MEMCMP_H_
+
+/**
+ * @file
+ *
+ * Functions for vectorised implementation of memcmp().
+ */
+
+/**
+ * Find the first different bit for comparison.
+ */
+static inline int
+rte_cmpffd(uint32_t x, uint32_t y);
+
+/**
+ * Find the first different byte for comparison.
+ */
+static inline int
+rte_cmpffdb(const uint8_t *x, const uint8_t *y, size_t n);
+
+/**
+ * Compare 16 bytes between two locations using optimised
+ * instructions. The locations should not overlap.
+ *
+ * @param src_1
+ *   Pointer to the first source of the data.
+ * @param src_2
+ *   Pointer to the second source of the data.
+ *   zero if src_1 equal src_2
+ *   -ve if src_1 less than src_2
+ *   +ve if src_1 greater than src_2
+ */
+static inline int
+rte_cmp16(const void *src_1, const void *src_2);
+
+/**
+ * Compare 32 bytes between two locations using optimised
+ * instructions. The locations should not overlap.
+ *
+ * @param src_1
+ *   Pointer to the first source of the data.
+ * @param src_2
+ *   Pointer to the second source of the data.
+ *   zero if src_1 equal src_2
+ *   -ve if src_1 less than src_2
+ *   +ve if src_1 greater than src_2
+ */
+static inline int
+rte_cmp32(const void *src_1, const void *src_2);
+
+/**
+ * Compare 64 bytes between two locations using optimised
+ * instructions. The locations should not overlap.
+ *
+ * @param src_1
+ *   Pointer to the first source of the data.
+ * @param src_2
+ *   Pointer to the second source of the data.
+ *   zero if src_1 equal src_2
+ *   -ve if src_1 less than src_2
+ *   +ve if src_1 greater than src_2
+ */
+static inline int
+rte_cmp64(const void *src_1, const void *src_2);
+
+/**
+ * Compare 48 bytes between two locations using optimised
+ * instructions. The locations should not overlap.
+ *
+ * @param src_1
+ *   Pointer to the first source of the data.
+ * @param src_2
+ *   Pointer to the second source of the data.
+ *   zero if src_1 equal src_2
+ *   -ve if src_1 less than src_2
+ *   +ve if src_1 greater than src_2
+ */
+static inline int
+rte_cmp48(const void *src_1, const void *src_2);
+
+/**
+ * Compare 128 bytes between two locations using
+ * optimised instructions. The locations should not overlap.
+ *
+ * @param src_1
+ *   Pointer to the first source of the data.
+ * @param src_2
+ *   Pointer to the second source of the data.
+ *   zero if src_1 equal src_2
+ *   -ve if src_1 less than src_2
+ *   +ve if src_1 greater than src_2
+ */
+static inline int
+rte_cmp128(const void *src_1, const void *src_2);
+
+/**
+ * Compare 256 bytes or greater between two locations using
+ * optimised instructions. The locations should not overlap.
+ *
+ * @param src_1
+ *   Pointer to the first source of the data.
+ * @param src_2
+ *   Pointer to the second source of the data.
+ *   zero if src_1 equal src_2
+ *   -ve if src_1 less than src_2
+ *   +ve if src_1 greater than src_2
+ */
+static inline int
+rte_cmp256(const void *src_1, const void *src_2);
+
+#ifdef __DOXYGEN__
+
+/**
+ * Compare bytes between two locations. The locations must not overlap.
+ *
+ * @note This is implemented as a macro, so its address should not be taken
+ * and care is needed as parameter expressions may be evaluated multiple times.
+ *
+ * @param src_1
+ *   Pointer to the first source of the data.
+ * @param src_2
+ *   Pointer to the second source of the data.
+ * @param n
+ *   Number of bytes to compare.
+ * @return
+ *   zero if src_1 equal src_2
+ *   -ve if src_1 less than src_2
+ *   +ve if src_1 greater than src_2
+ */
+static int
+rte_memcmp(const void *src_1, const void *src_2, size_t n);
+
+#endif /* __DOXYGEN__ */
+
+/*
+ * memcmp() function used by rte_memcmp macro
+ */
+static inline int
+rte_memcmp_func(void *dst, const void *src, size_t n) __attribute__((always_inline));
+
+#endif /* _RTE_MEMCMP_H_ */
-- 
1.9.1


* [dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp functions
  2016-03-07 23:00 ` [dpdk-dev] [PATCH v1 1/2] rte_memcmp functions using Intel AVX and SSE intrinsics Ravi Kerur
@ 2016-03-07 23:00   ` Ravi Kerur
  2016-05-26  9:05     ` Wang, Zhihong
  2016-05-25  8:56   ` [dpdk-dev] [PATCH v1 1/2] rte_memcmp functions using Intel AVX and SSE intrinsics Thomas Monjalon
  2016-05-26  8:57   ` Wang, Zhihong
  2 siblings, 1 reply; 13+ messages in thread
From: Ravi Kerur @ 2016-03-07 23:00 UTC (permalink / raw)
  To: dev

v1:
        This patch adds test cases for the rte_memcmp functions.
        The new rte_memcmp functions can be tested via 'make test'
        and the 'testpmd' utility (example invocation shown below).

        Compiled and tested on Ubuntu 14.04 (non-NUMA) and
        15.10 (NUMA) systems.
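
        For example, once the test application is built, the new cases can
        be run from its interactive prompt using the command names this
        patch registers:

        RTE>>memcmp_autotest
        RTE>>memcmp_perf_autotest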

Signed-off-by: Ravi Kerur <rkerur@gmail.com>
---
 app/test/Makefile           |  31 +++-
 app/test/autotest_data.py   |  19 +++
 app/test/test_memcmp.c      | 250 ++++++++++++++++++++++++++++
 app/test/test_memcmp_perf.c | 396 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 695 insertions(+), 1 deletion(-)
 create mode 100644 app/test/test_memcmp.c
 create mode 100644 app/test/test_memcmp_perf.c

diff --git a/app/test/Makefile b/app/test/Makefile
index ec33e1a..f6ecaa9 100644
--- a/app/test/Makefile
+++ b/app/test/Makefile
@@ -82,6 +82,9 @@ SRCS-y += test_logs.c
 SRCS-y += test_memcpy.c
 SRCS-y += test_memcpy_perf.c
 
+SRCS-y += test_memcmp.c
+SRCS-y += test_memcmp_perf.c
+
 SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_hash.c
 SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_thash.c
 SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_hash_perf.c
@@ -160,14 +163,40 @@ CFLAGS += $(WERROR_FLAGS)
 
 CFLAGS += -D_GNU_SOURCE
 
-# Disable VTA for memcpy test
+# Disable VTA for memcpy and memcmp tests
 ifeq ($(CC), gcc)
 ifeq ($(shell test $(GCC_VERSION) -ge 44 && echo 1), 1)
 CFLAGS_test_memcpy.o += -fno-var-tracking-assignments
 CFLAGS_test_memcpy_perf.o += -fno-var-tracking-assignments
+
+CFLAGS_test_memcmp.o += -fno-var-tracking-assignments
+CFLAGS_test_memcmp_perf.o += -fno-var-tracking-assignments
+
 endif
 endif
 
+CMP_AVX2_SUPPORT=$(shell $(CC) -march=core-avx2 -dM -E - </dev/null 2>&1 | \
+	grep -q AVX2 && echo 1)
+
+ifeq ($(CMP_AVX2_SUPPORT), 1)
+	ifeq ($(CC), icc)
+		CFLAGS_test_memcmp.o += -march=core-avx2
+		CFLAGS_test_memcmp_perf.o += -march=core-avx2
+	else
+		CFLAGS_test_memcmp.o += -mavx2
+		CFLAGS_test_memcmp_perf.o += -mavx2
+	endif
+else
+	ifeq ($(CC), icc)
+		CFLAGS_test_memcmp.o += -march=core-sse4.1
+		CFLAGS_test_memcmp_perf.o += -march=core-sse4.1
+	else
+		CFLAGS_test_memcmp.o += -msse4.1
+		CFLAGS_test_memcmp_perf.o += -msse4.1
+	endif
+endif
+
+
 # this application needs libraries first
 DEPDIRS-y += lib drivers
 
diff --git a/app/test/autotest_data.py b/app/test/autotest_data.py
index 6f34d6b..5113327 100644
--- a/app/test/autotest_data.py
+++ b/app/test/autotest_data.py
@@ -186,6 +186,12 @@ parallel_test_group_list = [
 		 "Report" :	None,
 		},
 		{
+		 "Name" :	"Memcmp autotest",
+		 "Command" : 	"memcmp_autotest",
+		 "Func" :	default_autotest,
+		 "Report" :	None,
+		},
+		{
 		 "Name" :	"Memzone autotest",
 		 "Command" : 	"memzone_autotest",
 		 "Func" :	default_autotest,
@@ -398,6 +404,19 @@ non_parallel_test_group_list = [
 	]
 },
 {
+	"Prefix":	"memcmp_perf",
+	"Memory" :	per_sockets(512),
+	"Tests" :
+	[
+		{
+		 "Name" :	"Memcmp performance autotest",
+		 "Command" : 	"memcmp_perf_autotest",
+		 "Func" :	default_autotest,
+		 "Report" :	None,
+		},
+	]
+},
+{
 	"Prefix":	"hash_perf",
 	"Memory" :	per_sockets(512),
 	"Tests" :
diff --git a/app/test/test_memcmp.c b/app/test/test_memcmp.c
new file mode 100644
index 0000000..e3b0bf7
--- /dev/null
+++ b/app/test/test_memcmp.c
@@ -0,0 +1,250 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2016 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdio.h>
+#include <stdint.h>
+#include <string.h>
+#include <stdlib.h>
+#include <stdarg.h>
+#include <errno.h>
+#include <sys/queue.h>
+
+#include <rte_common.h>
+#include <rte_malloc.h>
+#include <rte_cycles.h>
+#include <rte_random.h>
+#include <rte_memory.h>
+#include <rte_eal.h>
+#include <rte_memcmp.h>
+
+#include "test.h"
+
+/*******************************************************************************
+ * Memcmp functional test configuration section.
+ *
+ * The memcmp_sizes[] array below controls which buffer lengths are tested.
+ * Each length is exercised for the equal, less-than and greater-than
+ * cases.
+ */
+static size_t memcmp_sizes[] = {
+	1, 7, 8, 9, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128, 129, 255,
+	256, 257, 320, 384, 511, 512, 513, 1023, 1024, 1025, 1518, 1522, 1600,
+	2048, 3072, 4096, 5120, 6144, 7168, 8192, 16384
+};
+
+/******************************************************************************/
+
+#define RTE_MEMCMP_LENGTH_MAX 16384
+
+/*
+ * Test a memcmp equal function.
+ */
+static int run_memcmp_eq_func_test(uint32_t len)
+{
+	uint32_t i, rc;
+
+	uint8_t *volatile key_1 = NULL;
+	uint8_t *volatile key_2 = NULL;
+
+	key_1 = rte_zmalloc("memcmp key_1", len * sizeof(uint8_t), 16);
+	if (key_1 == NULL) {
+		printf("\nkey_1 is null\n");
+		return -1;
+	}
+
+	key_2 = rte_zmalloc("memcmp key_2", len * sizeof(uint8_t), 16);
+	if (key_2 == NULL) {
+		rte_free(key_1);
+		printf("\nkey_2 is null\n");
+		return -1;
+	}
+
+	for (i = 0; i < len; i++)
+		key_1[i] = 1;
+
+	for (i = 0; i < len; i++)
+		key_2[i] = 1;
+
+	rc = rte_memcmp(key_1, key_2, len);
+	rte_free(key_1);
+	rte_free(key_2);
+
+	return rc;
+}
+
+/*
+ * Test memcmp equal functions.
+ */
+static int run_memcmp_eq_func_tests(void)
+{
+	unsigned i;
+
+	for (i = 0;
+	     i < sizeof(memcmp_sizes) / sizeof(memcmp_sizes[0]);
+	     i++) {
+		if (run_memcmp_eq_func_test(memcmp_sizes[i])) {
+			printf("Comparing equal %zd bytes failed\n", memcmp_sizes[i]);
+			return 1;
+		}
+	}
+	printf("RTE memcmp for equality successful\n");
+	return 0;
+}
+
+/*
+ * Test a memcmp less than function.
+ */
+static int run_memcmp_lt_func_test(uint32_t len)
+{
+	uint32_t i;
+	int rc;
+	uint8_t *volatile key_1 = NULL;
+	uint8_t *volatile key_2 = NULL;
+
+	key_1 = rte_zmalloc("memcmp key_1", len * sizeof(uint8_t), 16);
+	if (key_1 == NULL)
+		return -1;
+
+	key_2 = rte_zmalloc("memcmp key_2", len * sizeof(uint8_t), 16);
+	if (key_2 == NULL) {
+		rte_free(key_1);
+		return -1;
+	}
+
+	for (i = 0; i < len; i++)
+		key_1[i] = 1;
+
+	for (i = 0; i < len; i++)
+		key_2[i] = 2;
+
+	rc = rte_memcmp(key_1, key_2, len);
+	rte_free(key_1);
+	rte_free(key_2);
+
+	return rc;
+}
+
+/*
+ * Test memcmp less than functions.
+ */
+static int run_memcmp_lt_func_tests(void)
+{
+	unsigned i;
+
+	for (i = 0;
+	     i < sizeof(memcmp_sizes) / sizeof(memcmp_sizes[0]);
+	     i++) {
+		if (!(run_memcmp_lt_func_test(memcmp_sizes[i]) < 0)) {
+			printf("Comparing less than for %zd bytes failed\n", memcmp_sizes[i]);
+			return 1;
+		}
+	}
+	printf("RTE memcmp for less than successful\n");
+	return 0;
+}
+
+/*
+ * Test a memcmp greater than function.
+ */
+static int run_memcmp_gt_func_test(uint32_t len)
+{
+	uint32_t i;
+	int rc;
+	uint8_t *volatile key_1 = NULL;
+	uint8_t *volatile key_2 = NULL;
+
+	key_1 = rte_zmalloc("memcmp key_1", len * sizeof(uint8_t), 16);
+	if (key_1 == NULL)
+		return -1;
+
+	key_2 = rte_zmalloc("memcmp key_2", len * sizeof(uint8_t), 16);
+	if (key_2 == NULL) {
+		rte_free(key_1);
+		return -1;
+	}
+
+	for (i = 0; i < len; i++)
+		key_1[i] = 2;
+
+	for (i = 0; i < len; i++)
+		key_2[i] = 1;
+
+	rc = rte_memcmp(key_1, key_2, len);
+	rte_free(key_1);
+	rte_free(key_2);
+
+	return rc;
+}
+
+/*
+ * Test memcmp greater than functions.
+ */
+static int run_memcmp_gt_func_tests(void)
+{
+	unsigned i;
+
+	for (i = 0;
+	     i < sizeof(memcmp_sizes) / sizeof(memcmp_sizes[0]);
+	     i++) {
+		if (!(run_memcmp_gt_func_test(memcmp_sizes[i]) > 0)) {
+			printf("Comparing greater than for %zd bytes failed\n", memcmp_sizes[i]);
+			return 1;
+		}
+	}
+	printf("RTE memcmp for greater than successful\n");
+	return 0;
+}
+
+/*
+ * Do all unit and performance tests.
+ */
+static int
+test_memcmp(void)
+{
+	if (run_memcmp_eq_func_tests())
+		return -1;
+
+	if (run_memcmp_gt_func_tests())
+		return -1;
+
+	if (run_memcmp_lt_func_tests())
+		return -1;
+
+	return 0;
+}
+
+static struct test_command memcmp_cmd = {
+	.command = "memcmp_autotest",
+	.callback = test_memcmp,
+};
+REGISTER_TEST_COMMAND(memcmp_cmd);
diff --git a/app/test/test_memcmp_perf.c b/app/test/test_memcmp_perf.c
new file mode 100644
index 0000000..4c0f4d9
--- /dev/null
+++ b/app/test/test_memcmp_perf.c
@@ -0,0 +1,396 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2016 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdio.h>
+#include <stdint.h>
+#include <string.h>
+#include <stdlib.h>
+#include <stdarg.h>
+#include <errno.h>
+#include <sys/queue.h>
+#include <sys/times.h>
+
+#include <rte_common.h>
+#include <rte_malloc.h>
+#include <rte_cycles.h>
+#include <rte_random.h>
+#include <rte_memory.h>
+#include <rte_memcmp.h>
+
+#include "test.h"
+
+/*******************************************************************************
+ * Memcmp function performance test configuration section. Each performance
+ * test will be performed MEMCMP_ITERATIONS times.
+ *
+ * The two size arrays below control which buffer lengths are tested: one
+ * for the equality tests and one for the less-than/greater-than tests.
+ */
+#define MEMCMP_ITERATIONS (500 * 500 * 500)
+
+static size_t memcmp_sizes[] = {
+	2, 5, 8, 9, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128,
+	129, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384,
+	385, 447, 448, 449, 511, 512, 513, 767, 768, 769, 1023, 1024,
+	1025, 1522, 1536, 1600, 2048, 2560, 3072, 3584, 4096, 4608,
+	5632, 6144, 6656, 7168, 7680, 8192, 16834
+};
+
+static size_t memcmp_lt_gt_sizes[] = {
+	1, 8, 15, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384
+};
+
+/******************************************************************************/
+
+static int
+run_single_memcmp_eq_perf_test(uint32_t len, int func_type, uint64_t iterations)
+{
+	uint32_t i, j;
+
+	double begin = 0, end = 0;
+
+	uint8_t *volatile key_1 = NULL;
+	uint8_t *volatile key_2 = NULL;
+	int rc = 0;
+
+	key_1 = rte_zmalloc("memcmp key_1", len * sizeof(uint8_t), 16);
+	if (key_1 == NULL) {
+		printf("\nkey_1 mem alloc failure\n");
+		return -1;
+	}
+
+	key_2 = rte_zmalloc("memcmp key_2", len * sizeof(uint8_t), 16);
+	if (key_2 == NULL) {
+		printf("\nkey_2 mem alloc failure\n");
+		rte_free(key_1);
+		return -1;
+	}
+
+	/* Prepare inputs for the current iteration */
+	for (j = 0; j < len; j++)
+		key_1[j] = key_2[j] = j / 64;
+
+	begin = rte_rdtsc();
+
+	/* Perform operation, and measure time it takes */
+	for (i = 0; i < iterations; i++) {
+
+		switch (func_type) {
+		case 1:
+			rc += rte_memcmp(key_1, key_2, len);
+			break;
+		case 2:
+			rc += memcmp(key_1, key_2, len);
+			break;
+		default:
+			break;
+		}
+
+	}
+
+	end = rte_rdtsc() - begin;
+
+	printf(" *** %10i, %10.4f ***\n", len, (double)(end/iterations));
+
+	rte_free(key_1);
+	rte_free(key_2);
+
+	return rc;
+}
+
+/*
+ * Run all memcmp table performance tests.
+ */
+static int run_all_memcmp_eq_perf_tests(void)
+{
+	unsigned i;
+
+	printf(" *** RTE memcmp equal performance test results ***\n");
+	printf(" *** Length (bytes), Ticks/Op. ***\n");
+
+	/* Loop through every combination of test parameters */
+	for (i = 0;
+	     i < sizeof(memcmp_sizes) / sizeof(memcmp_sizes[0]);
+	     i++) {
+		/* Perform test */
+		if (run_single_memcmp_eq_perf_test(memcmp_sizes[i], 1,
+						MEMCMP_ITERATIONS) != 0)
+			return -1;
+	}
+
+	printf(" *** memcmp equal performance test results ***\n");
+	printf(" *** Length (bytes), Ticks/Op. ***\n");
+
+	/* Loop through every combination of test parameters */
+	for (i = 0;
+	     i < sizeof(memcmp_sizes) / sizeof(memcmp_sizes[0]);
+	     i++) {
+		/* Perform test */
+		if (run_single_memcmp_eq_perf_test(memcmp_sizes[i], 2,
+						MEMCMP_ITERATIONS) != 0)
+			return -1;
+	}
+	return 0;
+}
+
+static int
+run_single_memcmp_lt_perf_test(uint32_t len, int func_type,
+					uint64_t iterations)
+{
+	uint32_t i, j;
+
+	double begin = 0, end = 0;
+
+	uint8_t *volatile key_1 = NULL;
+	uint8_t *volatile key_2 = NULL;
+	int rc;
+
+	key_1 = rte_zmalloc("memcmp key_1", len * sizeof(uint8_t), 16);
+	if (key_1 == NULL) {
+		printf("\nKey_1 lt mem alloc failure\n");
+		return -1;
+	}
+
+	key_2 = rte_zmalloc("memcmp key_2", len * sizeof(uint8_t), 16);
+	if (key_2 == NULL) {
+		printf("\nKey_2 lt mem alloc failure\n");
+		rte_free(key_1);
+		return -1;
+	}
+
+	/* Initialize both keys equal; key_2 is perturbed per iteration */
+	for (j = 0; j < len; j++)
+		key_1[j] = 1;
+
+	for (j = 0; j < len; j++)
+		key_2[j] = 1;
+
+	/* Perform operation, and measure time it takes */
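+	/*
+	 * Unlike the equal test, each call is timed individually here:
+	 * key_2 is modified before and restored after every comparison,
+	 * so only the comparison itself is accumulated into 'end'.
+	 */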
+	for (i = 0; i < iterations; i++) {
+
+		key_2[i % len] = 2;
+
+		switch (func_type) {
+		case 1:
+			begin = rte_rdtsc();
+			rc = rte_memcmp(key_1, key_2, len);
+			end += rte_rdtsc() - begin;
+			break;
+		case 2:
+			begin = rte_rdtsc();
+			rc = memcmp(key_1, key_2, len);
+			end += rte_rdtsc() - begin;
+			break;
+		default:
+			break;
+		}
+
+		key_2[i % len] = 1;
+
+		if (!(rc < 0)) {
+			printf("\nrc %d i %d\n", rc, i);
+			rte_free(key_1);
+			rte_free(key_2);
+			return -1;
+		}
+	}
+
+	printf(" *** %10i, %10.4f ***\n", len, (double)(end/iterations));
+
+	rte_free(key_1);
+	rte_free(key_2);
+
+	return 0;
+}
+
+/*
+ * Run all memcmp table performance tests.
+ */
+static int run_all_memcmp_lt_perf_tests(void)
+{
+	unsigned i;
+
+	printf(" *** RTE memcmp less than performance test results ***\n");
+	printf(" *** Length (bytes), Ticks/Op. ***\n");
+
+	/* Loop through every combination of test parameters */
+	for (i = 0;
+	     i < sizeof(memcmp_lt_gt_sizes) / sizeof(memcmp_lt_gt_sizes[0]);
+	     i++) {
+		/* Perform test */
+		if (run_single_memcmp_lt_perf_test(memcmp_lt_gt_sizes[i], 1,
+						MEMCMP_ITERATIONS) != 0)
+			return -1;
+	}
+
+	printf(" *** memcmp less than performance test results ***\n");
+	printf(" *** Length (bytes), Ticks/Op. ***\n");
+
+	/* Loop through every combination of test parameters */
+	for (i = 0;
+	     i < sizeof(memcmp_lt_gt_sizes) / sizeof(memcmp_lt_gt_sizes[0]);
+	     i++) {
+		/* Perform test */
+		if (run_single_memcmp_lt_perf_test(memcmp_lt_gt_sizes[i], 2,
+						MEMCMP_ITERATIONS) != 0)
+			return -1;
+	}
+	return 0;
+}
+
+static int
+run_single_memcmp_gt_perf_test(uint32_t len, int func_type,
+					uint64_t iterations)
+{
+	uint32_t i, j;
+
+	double begin = 0, end = 0;
+
+	uint8_t *volatile key_1 = NULL;
+	uint8_t *volatile key_2 = NULL;
+	int rc = 0;
+
+	key_1 = rte_zmalloc("memcmp key_1", len * sizeof(uint8_t), 16);
+	if (key_1 == NULL) {
+		printf("\nkey_1 gt mem alloc failure\n");
+		return -1;
+	}
+
+	key_2 = rte_zmalloc("memcmp key_2", len * sizeof(uint8_t), 16);
+	if (key_2 == NULL) {
+		printf("\nkey_2 gt mem alloc failure\n");
+		rte_free(key_1);
+		return -1;
+	}
+
+	/* Initialize both keys equal; key_1 is perturbed per iteration */
+	for (j = 0; j < len; j++)
+		key_1[j] = 1;
+
+	for (j = 0; j < len; j++)
+		key_2[j] = 1;
+
+	/* Perform operation, and measure time it takes */
+	for (i = 0; i < iterations; i++) {
+		key_1[i % len] = 2;
+
+		switch (func_type) {
+		case 1:
+			begin = rte_rdtsc();
+			rc = rte_memcmp(key_1, key_2, len);
+			end += rte_rdtsc() - begin;
+			break;
+		case 2:
+			begin = rte_rdtsc();
+			rc = memcmp(key_1, key_2, len);
+			end += rte_rdtsc() - begin;
+			break;
+		default:
+			break;
+		}
+
+		key_1[i % len] = 1;
+
+		if (!(rc > 0)) {
+			printf("\nrc %d i %d\n", rc, i);
+			for (j = 0; j < len; j++)
+				printf("\nkey_1 %d key_2 %d mod %d\n",
+					key_1[j], key_2[j], (i % len));
+			rte_free(key_1);
+			rte_free(key_2);
+			return -1;
+		}
+	}
+
+	printf(" *** %10i, %10.4f ***\n", len, (double)(end/iterations));
+
+	rte_free(key_1);
+	rte_free(key_2);
+
+	return 0;
+}
+
+/*
+ * Run all memcmp table performance tests.
+ */
+static int run_all_memcmp_gt_perf_tests(void)
+{
+	unsigned i;
+
+	printf(" *** RTE memcmp greater than performance test results ***\n");
+	printf(" *** Length (bytes), Ticks/Op. ***\n");
+
+	/* Loop through every combination of test parameters */
+	for (i = 0;
+	     i < sizeof(memcmp_lt_gt_sizes) / sizeof(memcmp_lt_gt_sizes[0]);
+	     i++) {
+		/* Perform test */
+		if (run_single_memcmp_gt_perf_test(memcmp_lt_gt_sizes[i], 1,
+						MEMCMP_ITERATIONS) != 0)
+			return -1;
+	}
+
+	printf(" *** memcmp greater than performance test results ***\n");
+	printf(" *** Length (bytes), Ticks/Op. ***\n");
+
+	/* Loop through every combination of test parameters */
+	for (i = 0;
+	     i < sizeof(memcmp_lt_gt_sizes) / sizeof(memcmp_lt_gt_sizes[0]);
+	     i++) {
+		/* Perform test */
+		if (run_single_memcmp_gt_perf_test(memcmp_lt_gt_sizes[i], 2,
+						MEMCMP_ITERATIONS) != 0)
+			return -1;
+	}
+	return 0;
+}
+
+/*
+ * Do all performance tests.
+ */
+static int
+test_memcmp_perf(void)
+{
+	if (run_all_memcmp_eq_perf_tests() != 0)
+		return -1;
+
+	if (run_all_memcmp_gt_perf_tests() != 0)
+		return -1;
+
+	if (run_all_memcmp_lt_perf_tests() != 0)
+		return -1;
+
+	return 0;
+}
+
+static struct test_command memcmp_perf_cmd = {
+	.command = "memcmp_perf_autotest",
+	.callback = test_memcmp_perf,
+};
+REGISTER_TEST_COMMAND(memcmp_perf_cmd);
-- 
1.9.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [dpdk-dev] [PATCH v1 1/2] rte_memcmp functions using Intel AVX and SSE intrinsics
  2016-03-07 23:00 ` [dpdk-dev] [PATCH v1 1/2] rte_memcmp functions using Intel AVX and SSE intrinsics Ravi Kerur
  2016-03-07 23:00   ` [dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp functions Ravi Kerur
@ 2016-05-25  8:56   ` Thomas Monjalon
  2016-05-26  8:57   ` Wang, Zhihong
  2 siblings, 0 replies; 13+ messages in thread
From: Thomas Monjalon @ 2016-05-25  8:56 UTC (permalink / raw)
  To: dev, Zhihong Wang; +Cc: Ravi Kerur

2016-03-07 15:00, Ravi Kerur:
> v1:
>         This patch adds memcmp functionality using AVX and SSE
>         intrinsics provided by Intel. For other architectures
>         supported by DPDK regular memcmp function is used.

Anyone to review this patch please? Zhihong?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [dpdk-dev] [PATCH v1 1/2] rte_memcmp functions using Intel AVX and SSE intrinsics
  2016-03-07 23:00 ` [dpdk-dev] [PATCH v1 1/2] rte_memcmp functions using Intel AVX and SSE intrinsics Ravi Kerur
  2016-03-07 23:00   ` [dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp functions Ravi Kerur
  2016-05-25  8:56   ` [dpdk-dev] [PATCH v1 1/2] rte_memcmp functions using Intel AVX and SSE intrinsics Thomas Monjalon
@ 2016-05-26  8:57   ` Wang, Zhihong
  2018-12-20 23:30     ` Ferruh Yigit
  2 siblings, 1 reply; 13+ messages in thread
From: Wang, Zhihong @ 2016-05-26  8:57 UTC (permalink / raw)
  To: Ravi Kerur, dev



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ravi Kerur
> Sent: Tuesday, March 8, 2016 7:01 AM
> To: dev@dpdk.org
> Subject: [dpdk-dev] [PATCH v1 1/2] rte_memcmp functions using Intel AVX and
> SSE intrinsics
> 
> v1:
>         This patch adds memcmp functionality using AVX and SSE
>         intrinsics provided by Intel. For other architectures
>         supported by DPDK regular memcmp function is used.
> 
>         Compiled and tested on Ubuntu 14.04(non-NUMA) and 15.10(NUMA)
>         systems.
> 
[...]

> +	if (unlikely(!_mm_testz_si128(xmm2, xmm2))) {
> +		__m128i idx =
> +			_mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);

line over 80 characters ;)

> +
> +		/*
> +		 * Reverse byte order
> +		 */
> +		xmm0 = _mm_shuffle_epi8(xmm0, idx);
> +		xmm1 = _mm_shuffle_epi8(xmm1, idx);
> +
> +		/*
> +		* Compare unsigned bytes with instructions for signed bytes
> +		*/
> +		xmm0 = _mm_xor_si128(xmm0, _mm_set1_epi8(0x80));
> +		xmm1 = _mm_xor_si128(xmm1, _mm_set1_epi8(0x80));
> +
> +		return _mm_movemask_epi8(xmm0 > xmm1) -
> _mm_movemask_epi8(xmm1 > xmm0);
> +	}
> +
> +	return 0;
> +}

[...]

> +static inline int
> +rte_memcmp(const void *_src_1, const void *_src_2, size_t n)
> +{
> +	const uint8_t *src_1 = (const uint8_t *)_src_1;
> +	const uint8_t *src_2 = (const uint8_t *)_src_2;
> +	int ret = 0;
> +
> +	if (n < 16)
> +		return rte_memcmp_regular(src_1, src_2, n);
[...]
> +
> +	while (n > 512) {
> +		ret = rte_cmp256(src_1 + 0 * 256, src_2 + 0 * 256);

Thanks for the great work!

Seems to me there's a big improvement area before going into detailed
instruction layout tuning: there is no unalignment handling here for
large-size memcmp.

So almost without a doubt the performance will be low on micro-architectures
like Sandy Bridge if the start address is unaligned, which might be a
common case. A rough sketch of one possible head-alignment peel follows
the quoted code below.

> +		if (unlikely(ret != 0))
> +			return ret;
> +
> +		ret = rte_cmp256(src_1 + 1 * 256, src_2 + 1 * 256);
> +		if (unlikely(ret != 0))
> +			return ret;
> +
> +		src_1 = src_1 + 512;
> +		src_2 = src_2 + 512;
> +		n -= 512;
> +	}
> +	goto CMP_BLOCK_LESS_THAN_512;
> +}
> +
> +#else /* RTE_MACHINE_CPUFLAG_AVX2 */
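
For illustration only, a minimal sketch of the kind of head peel-off meant
above, assuming n >= 16 (the patch already branches smaller sizes off
earlier) and fixed-size helpers with the semantics of rte_cmp256() in the
quoted code (0 when equal, a signed ordering result otherwise).
rte_cmp16() and handle_tail() are stand-ins for the corresponding 16-byte
and sub-256-byte paths, not functions the patch is known to provide:

#include <stdint.h>
#include <stddef.h>

static inline int
memcmp_large_aligned_sketch(const uint8_t *src_1, const uint8_t *src_2,
				size_t n)
{
	int ret;

	if (((uintptr_t)src_1 & 15) != 0) {
		size_t peel = 16 - ((uintptr_t)src_1 & 15);

		/* Compare a full 16-byte head first; if it differs, done. */
		ret = rte_cmp16(src_1, src_2);
		if (ret != 0)
			return ret;
		/*
		 * Advance by 'peel' (< 16) bytes so src_1 becomes 16-byte
		 * aligned.  The overlapped bytes were just proven equal,
		 * so re-comparing them cannot change the result of the
		 * following blocks.
		 */
		src_1 += peel;
		src_2 += peel;
		n -= peel;
	}

	while (n >= 256) {
		ret = rte_cmp256(src_1, src_2);
		if (ret != 0)
			return ret;
		src_1 += 256;
		src_2 += 256;
		n -= 256;
	}

	/* Remaining n < 256 bytes: hand off to the existing smaller paths. */
	return handle_tail(src_1, src_2, n);
}

Only src_1 is aligned here; whether that is sufficient on a given
micro-architecture would need the same kind of measurements as above.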

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp functions
  2016-03-07 23:00   ` [dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp functions Ravi Kerur
@ 2016-05-26  9:05     ` Wang, Zhihong
  2016-06-06 18:31       ` Ravi Kerur
  0 siblings, 1 reply; 13+ messages in thread
From: Wang, Zhihong @ 2016-05-26  9:05 UTC (permalink / raw)
  To: Ravi Kerur, dev



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ravi Kerur
> Sent: Tuesday, March 8, 2016 7:01 AM
> To: dev@dpdk.org
> Subject: [dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp functions
> 
> v1:
>         This patch adds test cases for rte_memcmp functions.
>         New rte_memcmp functions can be tested via 'make test'
>         and 'testpmd' utility.
> 
>         Compiled and tested on Ubuntu 14.04(non-NUMA) and
>         15.10(NUMA) systems.
[...]

> +/************************************************************
> *******************
> + * Memcmp function performance test configuration section. Each performance
> test
> + * will be performed MEMCMP_ITERATIONS times.
> + *
> + * The five arrays below control what tests are performed. Every combination
> + * from the array entries is tested.
> + */
> +#define MEMCMP_ITERATIONS (500 * 500 * 500)


Maybe fewer iterations would make the test faster without compromising precision?
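
As a sketch, one option is to scale the iteration count inversely with the
length so the total bytes compared stays roughly constant; the base constant
and helper below are assumptions for illustration, not part of the patch
(uint64_t/size_t as already used in the test file):

#define MEMCMP_BASE_ITERATIONS (20 * 1000 * 1000)

static inline uint64_t
memcmp_iterations_for(size_t len)
{
	/* Roughly constant work per size: fewer iterations for longer keys. */
	uint64_t iterations = MEMCMP_BASE_ITERATIONS / (len / 64 + 1);

	return iterations > 0 ? iterations : 1;
}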


> +
> +static size_t memcmp_sizes[] = {
> +	2, 5, 8, 9, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128,
> +	129, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384,
> +	385, 447, 448, 449, 511, 512, 513, 767, 768, 769, 1023, 1024,
> +	1025, 1522, 1536, 1600, 2048, 2560, 3072, 3584, 4096, 4608,
> +	5632, 6144, 6656, 7168, 7680, 8192, 16834
> +};
> +
[...]
> +/*
> + * Do all performance tests.
> + */
> +static int
> +test_memcmp_perf(void)
> +{
> +	if (run_all_memcmp_eq_perf_tests() != 0)
> +		return -1;
> +
> +	if (run_all_memcmp_gt_perf_tests() != 0)
> +		return -1;
> +
> +	if (run_all_memcmp_lt_perf_tests() != 0)
> +		return -1;
> +


Perhaps unaligned test cases are needed here as well.
What do you think?
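
For example, a sketch of an unaligned variant of the equal-compare test:
allocate with some slack and offset both keys by one byte so the start
addresses are deliberately misaligned. The +1 offset and the function name
are assumptions for illustration, not part of the patch; the helpers and
headers are the ones the test file already uses:

static int
run_single_memcmp_eq_unaligned_perf_test(uint32_t len, int func_type,
					uint64_t iterations)
{
	/* Allocate with 16 bytes of slack so a +1 offset stays in bounds. */
	uint8_t *base_1 = rte_zmalloc("memcmp ua key_1", len + 16, 16);
	uint8_t *base_2 = rte_zmalloc("memcmp ua key_2", len + 16, 16);
	uint8_t *volatile key_1;
	uint8_t *volatile key_2;
	uint64_t begin, i;
	uint32_t j;
	int rc = 0;

	if (base_1 == NULL || base_2 == NULL) {
		rte_free(base_1);
		rte_free(base_2);
		return -1;
	}

	/* Force a 1-byte misalignment relative to the 16-byte allocations. */
	key_1 = base_1 + 1;
	key_2 = base_2 + 1;

	for (j = 0; j < len; j++)
		key_1[j] = key_2[j] = j / 64;

	begin = rte_rdtsc();
	for (i = 0; i < iterations; i++)
		rc += (func_type == 1) ? rte_memcmp(key_1, key_2, len) :
					 memcmp(key_1, key_2, len);

	printf(" *** %10u, %10.4f ***\n", len,
		(double)(rte_rdtsc() - begin) / iterations);

	rte_free(base_1);
	rte_free(base_2);
	return rc;
}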


> +
> +	return 0;
> +}
> +
> +static struct test_command memcmp_perf_cmd = {
> +	.command = "memcmp_perf_autotest",
> +	.callback = test_memcmp_perf,
> +};
> +REGISTER_TEST_COMMAND(memcmp_perf_cmd);
> --
> 1.9.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp functions
  2016-05-26  9:05     ` Wang, Zhihong
@ 2016-06-06 18:31       ` Ravi Kerur
  2016-06-07 11:09         ` Wang, Zhihong
  0 siblings, 1 reply; 13+ messages in thread
From: Ravi Kerur @ 2016-06-06 18:31 UTC (permalink / raw)
  To: Wang, Zhihong, Thomas Monjalon; +Cc: dev

Zhihong, Thomas,

If there is enough interest within the DPDK community I can work on adding
support for 'unaligned access' and 'test cases' for it. Please let me know
either way.

Thanks,
Ravi


On Thu, May 26, 2016 at 2:05 AM, Wang, Zhihong <zhihong.wang@intel.com>
wrote:

>
>
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ravi Kerur
> > Sent: Tuesday, March 8, 2016 7:01 AM
> > To: dev@dpdk.org
> > Subject: [dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp functions
> >
> > v1:
> >         This patch adds test cases for rte_memcmp functions.
> >         New rte_memcmp functions can be tested via 'make test'
> >         and 'testpmd' utility.
> >
> >         Compiled and tested on Ubuntu 14.04(non-NUMA) and
> >         15.10(NUMA) systems.
> [...]
>
> > +/************************************************************
> > *******************
> > + * Memcmp function performance test configuration section. Each
> performance
> > test
> > + * will be performed MEMCMP_ITERATIONS times.
> > + *
> > + * The five arrays below control what tests are performed. Every
> combination
> > + * from the array entries is tested.
> > + */
> > +#define MEMCMP_ITERATIONS (500 * 500 * 500)
>
>
> Maybe less iteration will make the test faster without compromise precison?
>
>
> > +
> > +static size_t memcmp_sizes[] = {
> > +     2, 5, 8, 9, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128,
> > +     129, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384,
> > +     385, 447, 448, 449, 511, 512, 513, 767, 768, 769, 1023, 1024,
> > +     1025, 1522, 1536, 1600, 2048, 2560, 3072, 3584, 4096, 4608,
> > +     5632, 6144, 6656, 7168, 7680, 8192, 16834
> > +};
> > +
> [...]
> > +/*
> > + * Do all performance tests.
> > + */
> > +static int
> > +test_memcmp_perf(void)
> > +{
> > +     if (run_all_memcmp_eq_perf_tests() != 0)
> > +             return -1;
> > +
> > +     if (run_all_memcmp_gt_perf_tests() != 0)
> > +             return -1;
> > +
> > +     if (run_all_memcmp_lt_perf_tests() != 0)
> > +             return -1;
> > +
>
>
> Perhaps unaligned test cases are needed here.
> How do you think?
>
>
> > +
> > +     return 0;
> > +}
> > +
> > +static struct test_command memcmp_perf_cmd = {
> > +     .command = "memcmp_perf_autotest",
> > +     .callback = test_memcmp_perf,
> > +};
> > +REGISTER_TEST_COMMAND(memcmp_perf_cmd);
> > --
> > 1.9.1
>
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp functions
  2016-06-06 18:31       ` Ravi Kerur
@ 2016-06-07 11:09         ` Wang, Zhihong
  2017-01-02 20:41           ` Thomas Monjalon
  0 siblings, 1 reply; 13+ messages in thread
From: Wang, Zhihong @ 2016-06-07 11:09 UTC (permalink / raw)
  To: Ravi Kerur, Thomas Monjalon; +Cc: dev



> -----Original Message-----
> From: Ravi Kerur [mailto:rkerur@gmail.com]
> Sent: Tuesday, June 7, 2016 2:32 AM
> To: Wang, Zhihong <zhihong.wang@intel.com>; Thomas Monjalon
> <thomas.monjalon@6wind.com>
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp functions
> 
> Zhilong, Thomas,
> 
> If there is enough interest within DPDK community I can work on adding support
> for 'unaligned access' and 'test cases' for it. Please let me know either way.
> 


Hi Ravi,

This rte_memcmp has shown better performance than glibc's in the aligned
cases, so I think it has good value for the DPDK lib.

Though we don't have memcmp in the critical PMD data path, it offers a
better choice for applications that do.


Thanks
Zhihong


> Thanks,
> Ravi
> 
> 
> On Thu, May 26, 2016 at 2:05 AM, Wang, Zhihong <zhihong.wang@intel.com>
> wrote:
> 
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ravi Kerur
> > Sent: Tuesday, March 8, 2016 7:01 AM
> > To: dev@dpdk.org
> > Subject: [dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp functions
> >
> > v1:
> >         This patch adds test cases for rte_memcmp functions.
> >         New rte_memcmp functions can be tested via 'make test'
> >         and 'testpmd' utility.
> >
> >         Compiled and tested on Ubuntu 14.04(non-NUMA) and
> >         15.10(NUMA) systems.
> [...]
> 
> > +/************************************************************
> > *******************
> > + * Memcmp function performance test configuration section. Each performance
> > test
> > + * will be performed MEMCMP_ITERATIONS times.
> > + *
> > + * The five arrays below control what tests are performed. Every combination
> > + * from the array entries is tested.
> > + */
> > +#define MEMCMP_ITERATIONS (500 * 500 * 500)
> 
> 
> Maybe less iteration will make the test faster without compromise precison?
> 
> 
> > +
> > +static size_t memcmp_sizes[] = {
> > +     2, 5, 8, 9, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128,
> > +     129, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384,
> > +     385, 447, 448, 449, 511, 512, 513, 767, 768, 769, 1023, 1024,
> > +     1025, 1522, 1536, 1600, 2048, 2560, 3072, 3584, 4096, 4608,
> > +     5632, 6144, 6656, 7168, 7680, 8192, 16834
> > +};
> > +
> [...]
> > +/*
> > + * Do all performance tests.
> > + */
> > +static int
> > +test_memcmp_perf(void)
> > +{
> > +     if (run_all_memcmp_eq_perf_tests() != 0)
> > +             return -1;
> > +
> > +     if (run_all_memcmp_gt_perf_tests() != 0)
> > +             return -1;
> > +
> > +     if (run_all_memcmp_lt_perf_tests() != 0)
> > +             return -1;
> > +
> 
> 
> Perhaps unaligned test cases are needed here.
> How do you think?
> 
> 
> > +
> > +     return 0;
> > +}
> > +
> > +static struct test_command memcmp_perf_cmd = {
> > +     .command = "memcmp_perf_autotest",
> > +     .callback = test_memcmp_perf,
> > +};
> > +REGISTER_TEST_COMMAND(memcmp_perf_cmd);
> > --
> > 1.9.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp functions
  2016-06-07 11:09         ` Wang, Zhihong
@ 2017-01-02 20:41           ` Thomas Monjalon
  2017-01-09  5:29             ` Wang, Zhihong
  0 siblings, 1 reply; 13+ messages in thread
From: Thomas Monjalon @ 2017-01-02 20:41 UTC (permalink / raw)
  To: Wang, Zhihong, Ravi Kerur; +Cc: dev

2016-06-07 11:09, Wang, Zhihong:
> From: Ravi Kerur [mailto:rkerur@gmail.com]
> > Zhilong, Thomas,
> > 
> > If there is enough interest within DPDK community I can work on adding support
> > for 'unaligned access' and 'test cases' for it. Please let me know either way.
> 
> Hi Ravi,
> 
> This rte_memcmp is proved with better performance than glibc's in aligned
> cases, I think it has good value to DPDK lib.
> 
> Though we don't have memcmp in critical pmd data path, it offers a better
> choice for applications who do.

Re-thinking about this series, would there be some value in having a
rte_memcmp implementation?
What is the value compared to the glibc one? Why not work on glibc?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp functions
  2017-01-02 20:41           ` Thomas Monjalon
@ 2017-01-09  5:29             ` Wang, Zhihong
  2017-01-09 11:08               ` Thomas Monjalon
  0 siblings, 1 reply; 13+ messages in thread
From: Wang, Zhihong @ 2017-01-09  5:29 UTC (permalink / raw)
  To: Thomas Monjalon, Ravi Kerur; +Cc: dev



> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Tuesday, January 3, 2017 4:41 AM
> To: Wang, Zhihong <zhihong.wang@intel.com>; Ravi Kerur
> <rkerur@gmail.com>
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp
> functions
> 
> 2016-06-07 11:09, Wang, Zhihong:
> > From: Ravi Kerur [mailto:rkerur@gmail.com]
> > > Zhilong, Thomas,
> > >
> > > If there is enough interest within DPDK community I can work on adding
> support
> > > for 'unaligned access' and 'test cases' for it. Please let me know either
> way.
> >
> > Hi Ravi,
> >
> > This rte_memcmp is proved with better performance than glibc's in aligned
> > cases, I think it has good value to DPDK lib.
> >
> > Though we don't have memcmp in critical pmd data path, it offers a better
> > choice for applications who do.
> 
> Re-thinking about this series, could it be some values to have a rte_memcmp
> implementation?

I think this series (rte_memcmp included) could help:

 1. Potentially better performance in hot paths.

 2. Agile for tuning.

 3. Avoid performance complications -- unusual but possible,
    like the glibc memset issue I met while working on vhost
    enqueue.

> What is the value compared to glibc one? Why not working on glibc?

As to working on glibc, wider design considerations and broader test
coverage might be needed, and we would face different release
cycles; could we have the same agility? Also, working with old
glibc versions could be a problem.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp functions
  2017-01-09  5:29             ` Wang, Zhihong
@ 2017-01-09 11:08               ` Thomas Monjalon
  2017-01-11  1:28                 ` Wang, Zhihong
  0 siblings, 1 reply; 13+ messages in thread
From: Thomas Monjalon @ 2017-01-09 11:08 UTC (permalink / raw)
  To: Wang, Zhihong; +Cc: Ravi Kerur, dev

2017-01-09 05:29, Wang, Zhihong:
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> > 2016-06-07 11:09, Wang, Zhihong:
> > > From: Ravi Kerur [mailto:rkerur@gmail.com]
> > > > Zhilong, Thomas,
> > > >
> > > > If there is enough interest within DPDK community I can work on adding
> > support
> > > > for 'unaligned access' and 'test cases' for it. Please let me know either
> > way.
> > >
> > > Hi Ravi,
> > >
> > > This rte_memcmp is proved with better performance than glibc's in aligned
> > > cases, I think it has good value to DPDK lib.
> > >
> > > Though we don't have memcmp in critical pmd data path, it offers a better
> > > choice for applications who do.
> > 
> > Re-thinking about this series, could it be some values to have a rte_memcmp
> > implementation?
> 
> I think this series (rte_memcmp included) could help:
> 
>  1. Potentially better performance in hot paths.
> 
>  2. Agile for tuning.
> 
>  3. Avoid performance complications -- unusual but possible,
>     like the glibc memset issue I met while working on vhost
>     enqueue.
> 
> > What is the value compared to glibc one? Why not working on glibc?
> 
> As to working on glibc, wider design consideration and test
> coverage might be needed, and we'll face different release
> cycles, can we have the same agility? Also working with old
> glibc could be a problem.

Probably we need both: add the optimized version in DPDK while working
on a glibc optimization.
This strategy could be applicable to memcpy, memcmp and memset.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp functions
  2017-01-09 11:08               ` Thomas Monjalon
@ 2017-01-11  1:28                 ` Wang, Zhihong
  0 siblings, 0 replies; 13+ messages in thread
From: Wang, Zhihong @ 2017-01-11  1:28 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: Ravi Kerur, dev



> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Monday, January 9, 2017 7:09 PM
> To: Wang, Zhihong <zhihong.wang@intel.com>
> Cc: Ravi Kerur <rkerur@gmail.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp
> functions
> 
> 2017-01-09 05:29, Wang, Zhihong:
> > From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> > > 2016-06-07 11:09, Wang, Zhihong:
> > > > From: Ravi Kerur [mailto:rkerur@gmail.com]
> > > > > Zhilong, Thomas,
> > > > >
> > > > > If there is enough interest within DPDK community I can work on
> adding
> > > support
> > > > > for 'unaligned access' and 'test cases' for it. Please let me know either
> > > way.
> > > >
> > > > Hi Ravi,
> > > >
> > > > This rte_memcmp is proved with better performance than glibc's in
> aligned
> > > > cases, I think it has good value to DPDK lib.
> > > >
> > > > Though we don't have memcmp in critical pmd data path, it offers a
> better
> > > > choice for applications who do.
> > >
> > > Re-thinking about this series, could it be some values to have a
> rte_memcmp
> > > implementation?
> >
> > I think this series (rte_memcmp included) could help:
> >
> >  1. Potentially better performance in hot paths.
> >
> >  2. Agile for tuning.
> >
> >  3. Avoid performance complications -- unusual but possible,
> >     like the glibc memset issue I met while working on vhost
> >     enqueue.
> >
> > > What is the value compared to glibc one? Why not working on glibc?
> >
> > As to working on glibc, wider design consideration and test
> > coverage might be needed, and we'll face different release
> > cycles, can we have the same agility? Also working with old
> > glibc could be a problem.
> 
> Probably we need both: add the optimized version in DPDK while working
> on a glibc optimization.
> This strategy could be applicable to memcpy, memcmp and memset.

This would help in the long run if it turns out to be feasible.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [dpdk-dev] [PATCH v1 1/2] rte_memcmp functions using Intel AVX and SSE intrinsics
  2016-05-26  8:57   ` Wang, Zhihong
@ 2018-12-20 23:30     ` Ferruh Yigit
  0 siblings, 0 replies; 13+ messages in thread
From: Ferruh Yigit @ 2018-12-20 23:30 UTC (permalink / raw)
  To: Wang, Zhihong, Ravi Kerur; +Cc: dpdk-dev, Thomas Monjalon

On 5/26/2016 9:57 AM, zhihong.wang@intel.com (Wang, Zhihong) wrote:
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ravi Kerur
>> Sent: Tuesday, March 8, 2016 7:01 AM
>> To: dev@dpdk.org
>> Subject: [dpdk-dev] [PATCH v1 1/2] rte_memcmp functions using Intel AVX and
>> SSE intrinsics
>>
>> v1:
>>         This patch adds memcmp functionality using AVX and SSE
>>         intrinsics provided by Intel. For other architectures
>>         supported by DPDK regular memcmp function is used.
>>
>>         Compiled and tested on Ubuntu 14.04(non-NUMA) and 15.10(NUMA)
>>         systems.
>>
> [...]
> 
>> +	if (unlikely(!_mm_testz_si128(xmm2, xmm2))) {
>> +		__m128i idx =
>> +			_mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
> 
> line over 80 characters ;)
> 
>> +
>> +		/*
>> +		 * Reverse byte order
>> +		 */
>> +		xmm0 = _mm_shuffle_epi8(xmm0, idx);
>> +		xmm1 = _mm_shuffle_epi8(xmm1, idx);
>> +
>> +		/*
>> +		* Compare unsigned bytes with instructions for signed bytes
>> +		*/
>> +		xmm0 = _mm_xor_si128(xmm0, _mm_set1_epi8(0x80));
>> +		xmm1 = _mm_xor_si128(xmm1, _mm_set1_epi8(0x80));
>> +
>> +		return _mm_movemask_epi8(xmm0 > xmm1) -
>> _mm_movemask_epi8(xmm1 > xmm0);
>> +	}
>> +
>> +	return 0;
>> +}
> 
> [...]
> 
>> +static inline int
>> +rte_memcmp(const void *_src_1, const void *_src_2, size_t n)
>> +{
>> +	const uint8_t *src_1 = (const uint8_t *)_src_1;
>> +	const uint8_t *src_2 = (const uint8_t *)_src_2;
>> +	int ret = 0;
>> +
>> +	if (n < 16)
>> +		return rte_memcmp_regular(src_1, src_2, n);
> [...]
>> +
>> +	while (n > 512) {
>> +		ret = rte_cmp256(src_1 + 0 * 256, src_2 + 0 * 256);
> 
> Thanks for the great work!
> 
> Seems to me there's a big improvement area before going into detailed
> instruction layout tuning that -- No unalignment handling here for large
> size memcmp.
> 
> So almost without a doubt the performance will be low in micro-architectures
> like Sandy Bridge if the start address is unaligned, which might be a
> common case.

The patch has been waiting for comments for a long time, since May 2016.
Updating its status to rejected.

Anyone planning to work on a vectorized version of rte_memcmp() can benefit
from this patch:
https://patches.dpdk.org/patch/11156/
https://patches.dpdk.org/patch/11157/

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2018-12-20 23:30 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-07 22:59 [dpdk-dev] [PATCH v1 0/2] rte_memcmp functions Ravi Kerur
2016-03-07 23:00 ` [dpdk-dev] [PATCH v1 1/2] rte_memcmp functions using Intel AVX and SSE intrinsics Ravi Kerur
2016-03-07 23:00   ` [dpdk-dev] [PATCH v1 2/2] Test cases for rte_memcmp functions Ravi Kerur
2016-05-26  9:05     ` Wang, Zhihong
2016-06-06 18:31       ` Ravi Kerur
2016-06-07 11:09         ` Wang, Zhihong
2017-01-02 20:41           ` Thomas Monjalon
2017-01-09  5:29             ` Wang, Zhihong
2017-01-09 11:08               ` Thomas Monjalon
2017-01-11  1:28                 ` Wang, Zhihong
2016-05-25  8:56   ` [dpdk-dev] [PATCH v1 1/2] rte_memcmp functions using Intel AVX and SSE intrinsics Thomas Monjalon
2016-05-26  8:57   ` Wang, Zhihong
2018-12-20 23:30     ` Ferruh Yigit

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).