DPDK patches and discussions
 help / color / mirror / Atom feed
* [dpdk-dev] [PATCH v2] Implement rte_memcmp with AVX/SSE instructions.
@ 2015-05-08 21:19 Ravi Kerur
  2015-05-08 21:19 ` [dpdk-dev] [PATCH v2] Implement memcmp using " Ravi Kerur
  0 siblings, 1 reply; 21+ messages in thread
From: Ravi Kerur @ 2015-05-08 21:19 UTC (permalink / raw)
  To: dev

Background:
After preliminary discussion with John (Zhihong) and Tim from Intel it was
decided that it would be beneficial to use AVX/SSE instructions for memcmp
similar to memcpy being implemeneted. In addition, we decided to use
librte_hash as a test candidate to test both functionality and performance.

Currently memcmp in librte_hash is used for key comparisons whose length
can vary and max key length is defined to 64 bytes. Preliminary tests on
memory comparison alone shows using AVX/SSE instructions takes 1/3rd
CPU ticks compared with regular memcmp function. Furthermore,
hash_perf_autotest shows better results in all categories. Please note
that memory comparison is a small portion in hash functionality and CPU
Ticks/Op is for hash operations (Add on Empty, Add update, Lookup). Only
hash lookup results are shown below. I can send complete results if
interested.

Test was conducted on Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, Ubuntu
14.04, x86_64, 16GB DDR3 system.

PS: I would like to keep "rte_memcmp" simple with return codes

0 - match
1 - no-match

since usage in DPDK is for equality or inequality and I have not seen
any instance where less-than/greater-than comparison is needed. Hence
"if (unlikely(...))" portion in the code will probably be removed and it
will be made specific to DPDK rather than being generic.

/*************Existing code**********************************/
 *** Hash table performance test results ***
Hash Func.  , Operation      , Key size (bytes), Entries, Entries per bucket, Errors  , Avg. bucket entries, Ticks/Op.
rte_hash_crc, Lookup         , 16              , 1024   , 1                 , 10000   , 0.00               , 88.55
rte_hash_crc, Lookup         , 16              , 1024   , 2                 , 10000   , 0.00               , 99.28
rte_hash_crc, Lookup         , 16              , 1024   , 4                 , 10000   , 0.00               , 106.73
rte_hash_crc, Lookup         , 16              , 1024   , 8                 , 10000   , 0.00               , 126.99
rte_hash_crc, Lookup         , 16              , 1024   , 16                , 10000   , 0.00               , 159.80

rte_hash_crc, Lookup         , 16              , 1048576, 1                 , 51      , 0.01               , 175.23
rte_hash_crc, Lookup         , 16              , 1048576, 2                 , 2       , 0.02               , 171.24
rte_hash_crc, Lookup         , 16              , 1048576, 4                 , 0       , 0.04               , 145.48
rte_hash_crc, Lookup         , 16              , 1048576, 8                 , 0       , 0.08               , 162.35
rte_hash_crc, Lookup         , 16              , 1048576, 16                , 0       , 0.15               , 182.42

jhash       , Lookup         , 16              , 1048576, 1                 , 33      , 0.01               , 219.71
jhash       , Lookup         , 16              , 1048576, 2                 , 1       , 0.02               , 216.44
jhash       , Lookup         , 16              , 1048576, 4                 , 0       , 0.04               , 188.29
jhash       , Lookup         , 16              , 1048576, 8                 , 0       , 0.08               , 203.70
jhash       , Lookup         , 16              , 1048576, 16                , 0       , 0.15               , 229.50

/**************New AVX/SSE code******************************/
Hash Func.  , Operation      , Key size (bytes), Entries, Entries per bucket, Errors  , Avg. bucket entries, Ticks/Op.
rte_hash_crc, Lookup         , 16              , 1024   , 1                 , 10000   , 0.00               , 85.69
rte_hash_crc, Lookup         , 16              , 1024   , 2                 , 10000   , 0.00               , 93.95
rte_hash_crc, Lookup         , 16              , 1024   , 4                 , 10000   , 0.00               , 102.80
rte_hash_crc, Lookup         , 16              , 1024   , 8                 , 10000   , 0.00               , 122.60
rte_hash_crc, Lookup         , 16              , 1024   , 16                , 10000   , 0.00               , 156.58

rte_hash_crc, Lookup         , 16              , 1048576, 1                 , 41      , 0.01               , 156.84
rte_hash_crc, Lookup         , 16              , 1048576, 2                 , 0       , 0.02               , 157.90
rte_hash_crc, Lookup         , 16              , 1048576, 4                 , 0       , 0.04               , 134.92
rte_hash_crc, Lookup         , 16              , 1048576, 8                 , 0       , 0.08               , 150.99
rte_hash_crc, Lookup         , 16              , 1048576, 16                , 0       , 0.15               , 174.08

jhash       , Lookup         , 16              , 1048576, 1                 , 45      , 0.01               , 212.03
jhash       , Lookup         , 16              , 1048576, 2                 , 2       , 0.02               , 210.65
jhash       , Lookup         , 16              , 1048576, 4                 , 0       , 0.04               , 185.90
jhash       , Lookup         , 16              , 1048576, 8                 , 0       , 0.08               , 201.35
jhash       , Lookup         , 16              , 1048576, 16                , 0       , 0.15               , 223.54

Ravi Kerur (1):
  Implement memcmp using AVX/SSE instructions.

 app/test/test_hash_perf.c                          |  36 +-
 .../common/include/arch/ppc_64/rte_memcmp.h        |  62 +++
 .../common/include/arch/x86/rte_memcmp.h           | 421 +++++++++++++++++++++
 lib/librte_eal/common/include/generic/rte_memcmp.h | 131 +++++++
 lib/librte_hash/rte_hash.c                         |  59 ++-
 5 files changed, 675 insertions(+), 34 deletions(-)
 create mode 100644 lib/librte_eal/common/include/arch/ppc_64/rte_memcmp.h
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcmp.h
 create mode 100644 lib/librte_eal/common/include/generic/rte_memcmp.h

-- 
1.9.1

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
  2015-05-08 21:19 [dpdk-dev] [PATCH v2] Implement rte_memcmp with AVX/SSE instructions Ravi Kerur
@ 2015-05-08 21:19 ` Ravi Kerur
  2015-05-08 22:29   ` Matt Laswell
  2015-05-12  8:13   ` Linhaifeng
  0 siblings, 2 replies; 21+ messages in thread
From: Ravi Kerur @ 2015-05-08 21:19 UTC (permalink / raw)
  To: dev

This patch replaces memcmp in librte_hash with rte_memcmp which is
implemented with AVX/SSE instructions.

Preliminary results on Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, Ubuntu
14.04 x86_64 shows comparisons using AVX/SSE instructions taking 1/3rd
CPU ticks for 16, 32, 48 and 64 bytes comparison. In addition,
hash_perf_autotest results shows using new comparison function results in
faster completion of hash operations than existing memcmp in all categories.

Signed-off-by: Ravi Kerur <rkerur@gmail.com>
---
 app/test/test_hash_perf.c                          |  36 +-
 .../common/include/arch/ppc_64/rte_memcmp.h        |  62 +++
 .../common/include/arch/x86/rte_memcmp.h           | 421 +++++++++++++++++++++
 lib/librte_eal/common/include/generic/rte_memcmp.h | 131 +++++++
 lib/librte_hash/rte_hash.c                         |  59 ++-
 5 files changed, 675 insertions(+), 34 deletions(-)
 create mode 100644 lib/librte_eal/common/include/arch/ppc_64/rte_memcmp.h
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcmp.h
 create mode 100644 lib/librte_eal/common/include/generic/rte_memcmp.h

diff --git a/app/test/test_hash_perf.c b/app/test/test_hash_perf.c
index 6eabb21..6887629 100644
--- a/app/test/test_hash_perf.c
+++ b/app/test/test_hash_perf.c
@@ -440,7 +440,7 @@ run_single_tbl_perf_test(const struct rte_hash *h, hash_operation func,
 		uint32_t *invalid_pos_count)
 {
 	uint64_t begin, end, ticks = 0;
-	uint8_t *key = NULL;
+	uint8_t * volatile key = NULL;
 	uint32_t *bucket_occupancies = NULL;
 	uint32_t num_buckets, i, j;
 	int32_t pos;
@@ -547,30 +547,30 @@ run_tbl_perf_test(struct tbl_perf_test_params *params)
 	case ADD_UPDATE:
 		num_iterations = params->num_iterations;
 		params->num_iterations = params->entries;
-		run_single_tbl_perf_test(handle, rte_hash_add_key, params,
-				&avg_occupancy, &invalid_pos);
-		params->num_iterations = num_iterations;
 		ticks = run_single_tbl_perf_test(handle, rte_hash_add_key,
 				params, &avg_occupancy, &invalid_pos);
+		params->num_iterations = num_iterations;
+		ticks += run_single_tbl_perf_test(handle, rte_hash_add_key,
+				params, &avg_occupancy, &invalid_pos);
 		break;
 	case DELETE:
 		num_iterations = params->num_iterations;
 		params->num_iterations = params->entries;
-		run_single_tbl_perf_test(handle, rte_hash_add_key, params,
-				&avg_occupancy, &invalid_pos);
+		ticks = run_single_tbl_perf_test(handle, rte_hash_add_key,
+				params, &avg_occupancy, &invalid_pos);
 
 		params->num_iterations = num_iterations;
-		ticks = run_single_tbl_perf_test(handle, rte_hash_del_key,
+		ticks += run_single_tbl_perf_test(handle, rte_hash_del_key,
 				params, &avg_occupancy, &invalid_pos);
 		break;
 	case LOOKUP:
 		num_iterations = params->num_iterations;
 		params->num_iterations = params->entries;
-		run_single_tbl_perf_test(handle, rte_hash_add_key, params,
-				&avg_occupancy, &invalid_pos);
+		ticks = run_single_tbl_perf_test(handle, rte_hash_add_key,
+				params, &avg_occupancy, &invalid_pos);
 
 		params->num_iterations = num_iterations;
-		ticks = run_single_tbl_perf_test(handle, rte_hash_lookup,
+		ticks += run_single_tbl_perf_test(handle, rte_hash_lookup,
 				params, &avg_occupancy, &invalid_pos);
 		break;
 	default: return -1;
@@ -623,10 +623,15 @@ static int run_all_tbl_perf_tests(void)
 static void run_hash_func_test(rte_hash_function f, uint32_t init_val,
 		uint32_t key_len)
 {
-	static uint8_t key[RTE_HASH_KEY_LENGTH_MAX];
+	static uint8_t * volatile key;
 	uint64_t ticks = 0, start, end;
 	unsigned i, j;
 
+	key = rte_zmalloc("func hash key",
+			  key_len * sizeof(uint8_t), 16);
+	if (key == NULL)
+		return;
+
 	for (i = 0; i < HASHTEST_ITERATIONS; i++) {
 
 		for (j = 0; j < key_len; j++)
@@ -638,8 +643,11 @@ static void run_hash_func_test(rte_hash_function f, uint32_t init_val,
 		ticks += end - start;
 	}
 
-	printf("%-12s, %-18u, %-13u, %.02f\n", get_hash_name(f), (unsigned) key_len,
-			(unsigned) init_val, (double)ticks / HASHTEST_ITERATIONS);
+	rte_free(key);
+
+	printf("%-12s, %-18u, %-13u, %.02f\n",
+		get_hash_name(f), (unsigned) key_len, (unsigned) init_val,
+		(double)ticks / HASHTEST_ITERATIONS);
 }
 
 /*
@@ -687,7 +695,7 @@ fbk_hash_perf_test(void)
 		.socket_id = rte_socket_id(),
 	};
 	struct rte_fbk_hash_table *handle = NULL;
-	uint32_t *keys = NULL;
+	uint32_t * volatile keys = NULL;
 	unsigned indexes[TEST_SIZE];
 	uint64_t lookup_time = 0;
 	unsigned added = 0;
diff --git a/lib/librte_eal/common/include/arch/ppc_64/rte_memcmp.h b/lib/librte_eal/common/include/arch/ppc_64/rte_memcmp.h
new file mode 100644
index 0000000..7f99ee1
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/ppc_64/rte_memcmp.h
@@ -0,0 +1,62 @@
+/*
+ *   BSD LICENSE
+ *
+ *   Copyright (C) IBM Corporation 2014.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of IBM Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+*/
+
+#ifndef _RTE_MEMCMP_PPC_64_H_
+#define _RTE_MEMCMP_PPC_64_H_
+
+#include <stdint.h>
+#include <string.h>
+/*To include altivec.h, GCC version must  >= 4.8 */
+#include <altivec.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include "generic/rte_memcmp.h"
+
+#define rte_memcmp(dst, src, n)              \
+	({ (__builtin_constant_p(n)) ?       \
+	memcmp((dst), (src), (n)) :          \
+	rte_memcmp_func((dst), (src), (n)); })
+
+static inline bool
+rte_memcmp_func(void *dst, const void *src, size_t n)
+{
+	return memcmp(dst, src, n);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_MEMCMP_PPC_64_H_ */
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcmp.h b/lib/librte_eal/common/include/arch/x86/rte_memcmp.h
new file mode 100644
index 0000000..b2bdeec
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcmp.h
@@ -0,0 +1,421 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_MEMCMP_X86_64_H_
+#define _RTE_MEMCMP_X86_64_H_
+
+/**
+ * @file
+ *
+ * Functions for SSE/AVX/AVX2 implementation of memcmp().
+ */
+
+#include <stdio.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include <string.h>
+#include <rte_vect.h>
+#include <rte_branch_prediction.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Compare bytes between two locations. The locations must not overlap.
+ *
+ * @note This is implemented as a macro, so it's address should not be taken
+ * and care is needed as parameter expressions may be evaluated multiple times.
+ *
+ * @param src_1
+ *   Pointer to the first source of the data.
+ * @param src_2
+ *   Pointer to the second source of the data.
+ * @param n
+ *   Number of bytes to compare.
+ * @return
+ *   true if equal otherwise false.
+ */
+static inline int
+rte_memcmp(const void *src_1, const void *src,
+		size_t n) __attribute__((always_inline));
+
+/**
+ * Compare 16 bytes between two locations.
+ * locations should not overlap.
+ */
+static inline int
+rte_cmp16(const void *src_1, const void *src_2)
+{
+	__m128i xmm0, xmm1, xmm2;
+	int ret = 0;
+
+	xmm0 = _mm_lddqu_si128((const __m128i *)src_1);
+	xmm1 = _mm_lddqu_si128((const __m128i *)src_2);
+	xmm2 = _mm_xor_si128(xmm0, xmm1);
+
+	if (unlikely(!_mm_testz_si128(xmm2, xmm2))) {
+
+		const uint64_t mm11 = *(const uint64_t *)src_1;
+		const uint64_t mm12 = *((const uint64_t *)src_1 + 1);
+
+		const uint64_t mm21 = *(const uint64_t *)src_2;
+		const uint64_t mm22 = *((const uint64_t *)src_2 + 1);
+
+		if (mm11 == mm21)
+			(mm12 < mm22) ? (ret = -1) : (ret = 1);
+		else
+			(mm11 < mm21) ? (ret = -1) : (ret = 1);
+	}
+
+	return ret;
+}
+
+/**
+ * AVX2 implementation below
+ */
+#ifdef RTE_MACHINE_CPUFLAG_AVX2
+
+/**
+ * Compare 32 bytes between two locations.
+ * Locations should not overlap.
+ */
+static inline int
+rte_cmp32(const void *src_1, const void *src_2)
+{
+	const __m128i* src1 = (const __m128i*)src_1;
+	const __m128i* src2 = (const __m128i*)src_2;
+
+	__m128i mm11 = _mm_lddqu_si128(src1);
+	__m128i mm12 = _mm_lddqu_si128(src1 + 1);
+	__m128i mm21 = _mm_lddqu_si128(src2);
+	__m128i mm22 = _mm_lddqu_si128(src2 + 1);
+
+	__m128i mm1 = _mm_xor_si128(mm11, mm21);
+	__m128i mm2 = _mm_xor_si128(mm12, mm22);
+	__m128i mm = _mm_or_si128(mm1, mm2);
+
+	if (unlikely(!_mm_testz_si128(mm, mm))) {
+
+		/*
+		 * Find out which of the two 16-byte blocks
+		 * are different.
+		 */
+		if (_mm_testz_si128(mm1, mm1)) {
+			mm11 = mm12;
+			mm21 = mm22;
+			mm1 = mm2;
+		}
+
+		// Produce the comparison result
+		__m128i mm_cmp = _mm_cmpgt_epi8(mm21, mm11);
+		__m128i mm_rcmp = _mm_cmpgt_epi8(mm11, mm21);
+		mm_cmp = _mm_xor_si128(mm1, mm_cmp);
+		mm_rcmp = _mm_xor_si128(mm1, mm_rcmp);
+
+		uint32_t cmp = _mm_movemask_epi8(mm_cmp);
+		uint32_t rcmp = _mm_movemask_epi8(mm_rcmp);
+		cmp = (cmp - 1u) ^ cmp;
+		rcmp = (rcmp - 1u) ^ rcmp;
+		return (int32_t)cmp - (int32_t)rcmp;
+	}
+
+	return 0;
+}
+
+static inline int
+rte_cmp64 (const void* src_1, const void* src_2)
+{
+	const __m256i* src1 = (const __m256i*)src_1;
+	const __m256i* src2 = (const __m256i*)src_2;
+
+	__m256i mm11 = _mm256_lddqu_si256(src1);
+	__m256i mm12 = _mm256_lddqu_si256(src1 + 1);
+	__m256i mm21 = _mm256_lddqu_si256(src2);
+	__m256i mm22 = _mm256_lddqu_si256(src2 + 1);
+
+	__m256i mm1 = _mm256_xor_si256(mm11, mm21);
+	__m256i mm2 = _mm256_xor_si256(mm12, mm22);
+	__m256i mm = _mm256_or_si256(mm1, mm2);
+
+	if (unlikely(!_mm256_testz_si256(mm, mm))) {
+		/*
+		 * Find out which of the two 32-byte blocks
+		 * are different.
+		 */
+		if (_mm256_testz_si256(mm1, mm1)) {
+			mm11 = mm12;
+			mm21 = mm22;
+			mm1 = mm2;
+		}
+
+		// Produce the comparison result
+		__m256i mm_cmp = _mm256_cmpgt_epi8(mm21, mm11);
+		__m256i mm_rcmp = _mm256_cmpgt_epi8(mm11, mm21);
+		mm_cmp = _mm256_xor_si256(mm1, mm_cmp);
+		mm_rcmp = _mm256_xor_si256(mm1, mm_rcmp);
+
+		uint32_t cmp = _mm256_movemask_epi8(mm_cmp);
+		uint32_t rcmp = _mm256_movemask_epi8(mm_rcmp);
+		cmp = (cmp - 1u) ^ cmp;
+		rcmp = (rcmp - 1u) ^ rcmp;
+		return (int32_t)cmp - (int32_t)rcmp;
+	}
+
+	return 0;
+}
+
+static inline int
+rte_cmp128 (const void* src_1, const void* src_2)
+{
+	const __m256i* src1 = (const __m256i*)src_1;
+	const __m256i* src2 = (const __m256i*)src_2;
+	const size_t n = 2;
+	size_t i;
+
+	for (i = 0; i < n; ++i, src1 += 2, src2 += 2) {
+		__m256i mm11 = _mm256_lddqu_si256(src1);
+		__m256i mm12 = _mm256_lddqu_si256(src1 + 1);
+		__m256i mm21 = _mm256_lddqu_si256(src2);
+		__m256i mm22 = _mm256_lddqu_si256(src2 + 1);
+
+		__m256i mm1 = _mm256_xor_si256(mm11, mm21);
+		__m256i mm2 = _mm256_xor_si256(mm12, mm22);
+		__m256i mm = _mm256_or_si256(mm1, mm2);
+
+		if (unlikely(!_mm256_testz_si256(mm, mm))) {
+			/*
+			 * Find out which of the two 32-byte blocks
+			 * are different.
+			 */
+			if (_mm256_testz_si256(mm1, mm1)) {
+				mm11 = mm12;
+				mm21 = mm22;
+				mm1 = mm2;
+			}
+
+			// Produce the comparison result
+			__m256i mm_cmp = _mm256_cmpgt_epi8(mm21, mm11);
+			__m256i mm_rcmp = _mm256_cmpgt_epi8(mm11, mm21);
+			mm_cmp = _mm256_xor_si256(mm1, mm_cmp);
+			mm_rcmp = _mm256_xor_si256(mm1, mm_rcmp);
+
+			uint32_t cmp = _mm256_movemask_epi8(mm_cmp);
+			uint32_t rcmp = _mm256_movemask_epi8(mm_rcmp);
+			cmp = (cmp - 1u) ^ cmp;
+			rcmp = (rcmp - 1u) ^ rcmp;
+			return (int32_t)cmp - (int32_t)rcmp;
+		}
+	}
+
+	return 0;
+}
+#else /* RTE_MACHINE_CPUFLAG_AVX2 */
+
+/**
+ * Compare 32 bytes between two locations.
+ * Locations should not overlap.
+ */
+static inline int
+rte_cmp32(const void *src_1, const void *src_2)
+{
+	int ret;
+
+	ret = rte_cmp16((const uint8_t *)src_1 + 0 * 16,
+				(const uint8_t *)src_2 + 0 * 16);
+
+	if (likely(ret == 0))
+		return rte_cmp16((const uint8_t *)src_1 + 1 * 16,
+				(const uint8_t *)src_2 + 1 * 16);
+
+	return ret;
+}
+
+/**
+ * Compare 64 bytes between two locations.
+ * Locations should not overlap.
+ */
+static inline int
+rte_cmp64(const void *src_1, const void *src_2)
+{
+	int ret;
+
+	ret = rte_cmp32((const uint8_t *)src_1 + 0 * 32,
+				(const uint8_t *)src_2 + 0 * 32);
+
+	if (likely(ret == 0))
+		return rte_cmp32((const uint8_t *)src_1 + 1 * 32,
+				(const uint8_t *)src_2 + 1 * 32);
+
+	return ret;
+}
+
+/**
+ * Compare 128 bytes between two locations.
+ * Locations should not overlap.
+ */
+static inline int
+rte_cmp128(const void *src_1, const void *src_2)
+{
+	int ret;
+
+	ret = rte_cmp64((const uint8_t *)src_1 + 0 * 64,
+			(const uint8_t *)src_2 + 0 * 64);
+
+	if (likely(ret == 0))
+		return rte_cmp64((const uint8_t *)src_1 + 1 * 64,
+				(const uint8_t *)src_2 + 1 * 64);
+
+	return ret;
+}
+
+#endif /* RTE_MACHINE_CPUFLAG_AVX2 */
+
+
+/**
+ * Compare 48 bytes between two locations.
+ * Locations should not overlap.
+ */
+static inline int
+rte_cmp48(const void *src_1, const void *src_2)
+{
+	int ret;
+
+	ret = rte_cmp32((const uint8_t *)src_1 + 0 * 32,
+			(const uint8_t *)src_2 + 0 * 32);
+
+	if (likely(ret == 0))
+		return rte_cmp16((const uint8_t *)src_1 + 1 * 32,
+				(const uint8_t *)src_2 + 1 * 32);
+	return ret;
+}
+
+static inline int
+rte_memcmp_remainder(const uint8_t *src_1u, const uint8_t *src_2u, size_t n)
+{
+	int ret = 1;
+
+	/**
+	 * Compare less than 16 bytes
+	 */
+	if (n & 0x08) {
+		ret = (*(const uint64_t *)src_1u ==
+				*(const uint64_t *)src_2u);
+		if (likely(ret == 1)) {
+			n -= 0x8;
+			src_1u += 0x8;
+			src_2u += 0x8;
+		} else {
+			goto exit;
+		}
+	}
+
+	if (n & 0x04) {
+		ret = (*(const uint32_t *)src_1u ==
+				*(const uint32_t *)src_2u);
+		if (likely(ret == 1)) {
+			n -= 0x4;
+			src_1u += 0x4;
+			src_2u += 0x4;
+		} else {
+			goto exit;
+		}
+	}
+
+	if (n & 0x02) {
+		ret = (*(const uint16_t *)src_1u ==
+				*(const const uint16_t *)src_2u);
+
+		if (likely(ret == 1)) {
+			n -= 0x2;
+			src_1u += 0x2;
+			src_2u += 0x2;
+		} else {
+			goto exit;
+		}
+	}
+
+	if (n & 0x01) {
+		ret = (*(const uint8_t *)src_1u ==
+				*(const uint8_t *)src_2u);
+		if (likely(ret == 1)) {
+			return 0;
+		} else {
+			goto exit;
+		}
+	}
+
+	return !ret;
+exit:
+
+	return src_1u < src_2u ? -1 : 1;
+}
+
+static inline int
+rte_memcmp(const void *_src_1, const void *_src_2, size_t n)
+{
+	const uint8_t *src_1 = (const uint8_t *)_src_1;
+	const uint8_t *src_2 = (const uint8_t *)_src_2;
+	int ret = 0;
+
+	if (n & 0x80)
+		return rte_cmp128(src_1, src_2);
+
+	if (n & 0x40)
+		return rte_cmp64(src_1, src_2);
+
+	if (n & 0x20) {
+		ret = rte_cmp32(src_1, src_2);
+		n -= 0x20;
+		src_1 += 0x20;
+		src_2 += 0x20;
+	}
+
+	if ((n & 0x10) && likely(ret == 0)) {
+		ret = rte_cmp16(src_1, src_2);
+		n -= 0x10;
+		src_1 += 0x10;
+		src_2 += 0x10;
+	}
+
+	if (n && likely(ret == 0))
+		ret = rte_memcmp_remainder(src_1, src_2, n);
+
+	return ret;
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_MEMCMP_X86_64_H_ */
diff --git a/lib/librte_eal/common/include/generic/rte_memcmp.h b/lib/librte_eal/common/include/generic/rte_memcmp.h
new file mode 100644
index 0000000..db9626b
--- /dev/null
+++ b/lib/librte_eal/common/include/generic/rte_memcmp.h
@@ -0,0 +1,131 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_MEMCMP_H_
+#define _RTE_MEMCMP_H_
+
+/**
+ * @file
+ *
+ * Functions for vectorised implementation of memcmp().
+ */
+
+/**
+ * Compare 16 bytes between two locations using optimised
+ * instructions. The locations should not overlap.
+ *
+ * @param src_1
+ *   Pointer to the first source of the data.
+ * @param src
+ *   Pointer to the second source of the data.
+ */
+static inline int
+rte_cmp16(const void *src_1, const void *src_2);
+
+/**
+ * Compare 32 bytes between two locations using optimised
+ * instructions. The locations should not overlap.
+ *
+ * @param src_1
+ *   Pointer to the first source of the data.
+ * @param src_2
+ *   Pointer to the second source of the data.
+ */
+static inline int
+rte_cmp32(const void *src_1, const void *src_2);
+
+/**
+ * Compare 64 bytes between two locations using optimised
+ * instructions. The locations should not overlap.
+ *
+ * @param src_1
+ *   Pointer to the first source of the data.
+ * @param src
+ *   Pointer to the second source of the data.
+ */
+static inline int
+rte_cmp64(const void *src_1, const void *src_2);
+
+/**
+ * Compare 48 bytes between two locations using optimised
+ * instructions. The locations should not overlap.
+ *
+ * @param src_1
+ *   Pointer to the first source of the data.
+ * @param src
+ *   Pointer to the second source of the data.
+ */
+static inline int
+rte_cmp48(const void *src_1, const void *src_2);
+
+/**
+ * Compare 128 bytes between two locations using optimised
+ * instructions. The locations should not overlap.
+ *
+ * @param src_1
+ *   Pointer to the first source of the data.
+ * @param src_2
+ *   Pointer to the second source of the data.
+ */
+static inline int
+rte_cmp128(const void *src_1, const void *src_2);
+
+#ifdef __DOXYGEN__
+
+/**
+ * Compare bytes between two locations. The locations must not overlap.
+ *
+ * @note This is implemented as a macro, so it's address should not be taken
+ * and care is needed as parameter expressions may be evaluated multiple times.
+ *
+ * @param src_1
+ *   Pointer to the first source of the data.
+ * @param src_2
+ *   Pointer to the second source of the data.
+ * @param n
+ *   Number of bytes to copy.
+ * @return
+ *   true if match otherwise false.
+ */
+static int
+rte_memcmp(const void *dst, const void *src, size_t n);
+
+#endif /* __DOXYGEN__ */
+
+/*
+ * memcmp() function used by rte_memcmp macro
+ */
+static inline int
+rte_memcmp_func(void *dst, const void *src, size_t n) __attribute__((always_inline));
+
+#endif /* _RTE_MEMCMP_H_ */
diff --git a/lib/librte_hash/rte_hash.c b/lib/librte_hash/rte_hash.c
index 9245716..075da62 100644
--- a/lib/librte_hash/rte_hash.c
+++ b/lib/librte_hash/rte_hash.c
@@ -42,6 +42,7 @@
 #include <rte_memory.h>         /* for definition of RTE_CACHE_LINE_SIZE */
 #include <rte_log.h>
 #include <rte_memcpy.h>
+#include <rte_memcmp.h>
 #include <rte_prefetch.h>
 #include <rte_branch_prediction.h>
 #include <rte_memzone.h>
@@ -299,6 +300,7 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h,
 	uint8_t *key_bucket;
 	uint32_t bucket_index, i;
 	int32_t pos;
+	const void * volatile key_1 = key;
 
 	/* Get the hash signature and bucket index */
 	sig |= h->sig_msb;
@@ -308,10 +310,13 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h,
 
 	/* Check if key is already present in the hash */
 	for (i = 0; i < h->bucket_entries; i++) {
-		if ((sig == sig_bucket[i]) &&
-		    likely(memcmp(key, get_key_from_bucket(h, key_bucket, i),
-				  h->key_len) == 0)) {
-			return bucket_index * h->bucket_entries + i;
+		if (sig == sig_bucket[i]) {
+
+			const void * volatile key_2 =
+				get_key_from_bucket(h, key_bucket, i);
+
+			if (likely(rte_memcmp(key_1, key_2, h->key_len) == 0))
+				return bucket_index * h->bucket_entries + i;
 		}
 	}
 
@@ -350,6 +355,8 @@ __rte_hash_del_key_with_hash(const struct rte_hash *h,
 	uint8_t *key_bucket;
 	uint32_t bucket_index, i;
 
+	const void * volatile key_1 = key;
+
 	/* Get the hash signature and bucket index */
 	sig = sig | h->sig_msb;
 	bucket_index = sig & h->bucket_bitmask;
@@ -358,11 +365,14 @@ __rte_hash_del_key_with_hash(const struct rte_hash *h,
 
 	/* Check if key is already present in the hash */
 	for (i = 0; i < h->bucket_entries; i++) {
-		if ((sig == sig_bucket[i]) &&
-		    likely(memcmp(key, get_key_from_bucket(h, key_bucket, i),
-				  h->key_len) == 0)) {
-			sig_bucket[i] = NULL_SIGNATURE;
-			return bucket_index * h->bucket_entries + i;
+		if (sig == sig_bucket[i]) {
+			const void * volatile key_2 =
+				get_key_from_bucket(h, key_bucket, i);
+
+			if (likely(rte_memcmp(key_1, key_2, h->key_len) == 0)) {
+				sig_bucket[i] = NULL_SIGNATURE;
+				return bucket_index * h->bucket_entries + i;
+			}
 		}
 	}
 
@@ -392,6 +402,8 @@ __rte_hash_lookup_with_hash(const struct rte_hash *h,
 	uint8_t *key_bucket;
 	uint32_t bucket_index, i;
 
+	const void * volatile key_1 = key;
+
 	/* Get the hash signature and bucket index */
 	sig |= h->sig_msb;
 	bucket_index = sig & h->bucket_bitmask;
@@ -400,10 +412,13 @@ __rte_hash_lookup_with_hash(const struct rte_hash *h,
 
 	/* Check if key is already present in the hash */
 	for (i = 0; i < h->bucket_entries; i++) {
-		if ((sig == sig_bucket[i]) &&
-		    likely(memcmp(key, get_key_from_bucket(h, key_bucket, i),
-				  h->key_len) == 0)) {
-			return bucket_index * h->bucket_entries + i;
+		if (sig == sig_bucket[i]) {
+
+			const void * volatile key_2 =
+				get_key_from_bucket(h, key_bucket, i);
+
+			if (likely(rte_memcmp(key_1, key_2, h->key_len) == 0))
+				return bucket_index * h->bucket_entries + i;
 		}
 	}
 
@@ -456,13 +471,17 @@ rte_hash_lookup_bulk(const struct rte_hash *h, const void **keys,
 		positions[i] = -ENOENT;
 
 		for (j = 0; j < h->bucket_entries; j++) {
-			if ((sigs[i] == sig_bucket[j]) &&
-			    likely(memcmp(keys[i],
-					  get_key_from_bucket(h, key_bucket, j),
-					  h->key_len) == 0)) {
-				positions[i] = bucket_index *
-					h->bucket_entries + j;
-				break;
+			if (sigs[i] == sig_bucket[j]) {
+
+				const void * volatile key_1 = keys[i];
+				const void * volatile key_2 =
+					get_key_from_bucket(h, key_bucket, j);
+				if (likely(rte_memcmp(key_1, key_2,
+							h->key_len) == 0)) {
+					positions[i] = bucket_index *
+							h->bucket_entries + j;
+					break;
+				}
 			}
 		}
 	}
-- 
1.9.1

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
  2015-05-08 21:19 ` [dpdk-dev] [PATCH v2] Implement memcmp using " Ravi Kerur
@ 2015-05-08 22:29   ` Matt Laswell
  2015-05-08 22:54     ` Ravi Kerur
  2015-05-12  8:13   ` Linhaifeng
  1 sibling, 1 reply; 21+ messages in thread
From: Matt Laswell @ 2015-05-08 22:29 UTC (permalink / raw)
  To: Ravi Kerur; +Cc: dev

On Fri, May 8, 2015 at 4:19 PM, Ravi Kerur <rkerur@gmail.com> wrote:

> This patch replaces memcmp in librte_hash with rte_memcmp which is
> implemented with AVX/SSE instructions.
>
> +static inline int
> +rte_memcmp(const void *_src_1, const void *_src_2, size_t n)
> +{
> +       const uint8_t *src_1 = (const uint8_t *)_src_1;
> +       const uint8_t *src_2 = (const uint8_t *)_src_2;
> +       int ret = 0;
> +
> +       if (n & 0x80)
> +               return rte_cmp128(src_1, src_2);
> +
> +       if (n & 0x40)
> +               return rte_cmp64(src_1, src_2);
> +
> +       if (n & 0x20) {
> +               ret = rte_cmp32(src_1, src_2);
> +               n -= 0x20;
> +               src_1 += 0x20;
> +               src_2 += 0x20;
> +       }
>
>
Pardon me for butting in, but this seems incorrect for the first two cases
listed above, as the function as written will only compare the first 128 or
64 bytes of each source and return the result.  The pattern expressed in
the 32 byte case appears more correct, as it compares the first 32 bytes
and then lets later pieces of the function handle the smaller remaining
bits of the sources. Also, if this function is to handle arbitrarily large
source data, the 128 byte case needs to be in a loop.

What am I missing?

--
Matt Laswell
infinite io, inc.
laswell@infiniteio.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
  2015-05-08 22:29   ` Matt Laswell
@ 2015-05-08 22:54     ` Ravi Kerur
  2015-05-08 23:25       ` Matt Laswell
  2015-05-11  9:51       ` Ananyev, Konstantin
  0 siblings, 2 replies; 21+ messages in thread
From: Ravi Kerur @ 2015-05-08 22:54 UTC (permalink / raw)
  To: Matt Laswell; +Cc: dev

On Fri, May 8, 2015 at 3:29 PM, Matt Laswell <laswell@infiniteio.com> wrote:

>
>
> On Fri, May 8, 2015 at 4:19 PM, Ravi Kerur <rkerur@gmail.com> wrote:
>
>> This patch replaces memcmp in librte_hash with rte_memcmp which is
>> implemented with AVX/SSE instructions.
>>
>> +static inline int
>> +rte_memcmp(const void *_src_1, const void *_src_2, size_t n)
>> +{
>> +       const uint8_t *src_1 = (const uint8_t *)_src_1;
>> +       const uint8_t *src_2 = (const uint8_t *)_src_2;
>> +       int ret = 0;
>> +
>> +       if (n & 0x80)
>> +               return rte_cmp128(src_1, src_2);
>> +
>> +       if (n & 0x40)
>> +               return rte_cmp64(src_1, src_2);
>> +
>> +       if (n & 0x20) {
>> +               ret = rte_cmp32(src_1, src_2);
>> +               n -= 0x20;
>> +               src_1 += 0x20;
>> +               src_2 += 0x20;
>> +       }
>>
>>
> Pardon me for butting in, but this seems incorrect for the first two cases
> listed above, as the function as written will only compare the first 128 or
> 64 bytes of each source and return the result.  The pattern expressed in
> the 32 byte case appears more correct, as it compares the first 32 bytes
> and then lets later pieces of the function handle the smaller remaining
> bits of the sources. Also, if this function is to handle arbitrarily large
> source data, the 128 byte case needs to be in a loop.
>
> What am I missing?
>

Current max hash key length supported is 64 bytes, hence no comparison is
done after 64 bytes. 128 bytes comparison is added to measure performance
only and there is no use-case as of now. With the current use-cases its not
required but if there is a need to handle large arbitrary data upto 128
bytes it can be modified.

>
> --
> Matt Laswell
> infinite io, inc.
> laswell@infiniteio.com
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
  2015-05-08 22:54     ` Ravi Kerur
@ 2015-05-08 23:25       ` Matt Laswell
  2015-05-11  9:51       ` Ananyev, Konstantin
  1 sibling, 0 replies; 21+ messages in thread
From: Matt Laswell @ 2015-05-08 23:25 UTC (permalink / raw)
  To: Ravi Kerur; +Cc: dev

On Fri, May 8, 2015 at 5:54 PM, Ravi Kerur <rkerur@gmail.com> wrote:

>
>
> On Fri, May 8, 2015 at 3:29 PM, Matt Laswell <laswell@infiniteio.com>
> wrote:
>
>>
>>
>> On Fri, May 8, 2015 at 4:19 PM, Ravi Kerur <rkerur@gmail.com> wrote:
>>
>>> This patch replaces memcmp in librte_hash with rte_memcmp which is
>>> implemented with AVX/SSE instructions.
>>>
>>> +static inline int
>>> +rte_memcmp(const void *_src_1, const void *_src_2, size_t n)
>>> +{
>>> +       const uint8_t *src_1 = (const uint8_t *)_src_1;
>>> +       const uint8_t *src_2 = (const uint8_t *)_src_2;
>>> +       int ret = 0;
>>> +
>>> +       if (n & 0x80)
>>> +               return rte_cmp128(src_1, src_2);
>>> +
>>> +       if (n & 0x40)
>>> +               return rte_cmp64(src_1, src_2);
>>> +
>>> +       if (n & 0x20) {
>>> +               ret = rte_cmp32(src_1, src_2);
>>> +               n -= 0x20;
>>> +               src_1 += 0x20;
>>> +               src_2 += 0x20;
>>> +       }
>>>
>>>
>> Pardon me for butting in, but this seems incorrect for the first two
>> cases listed above, as the function as written will only compare the first
>> 128 or 64 bytes of each source and return the result.  The pattern
>> expressed in the 32 byte case appears more correct, as it compares the
>> first 32 bytes and then lets later pieces of the function handle the
>> smaller remaining bits of the sources. Also, if this function is to handle
>> arbitrarily large source data, the 128 byte case needs to be in a loop.
>>
>> What am I missing?
>>
>
> Current max hash key length supported is 64 bytes, hence no comparison is
> done after 64 bytes. 128 bytes comparison is added to measure performance
> only and there is no use-case as of now. With the current use-cases its not
> required but if there is a need to handle large arbitrary data upto 128
> bytes it can be modified.
>

Ah, gotcha.  I misunderstood and thought that this was meant to be a
generic AVX/SSE enabled memcmp() replacement, and that the use of it in
rte_hash was meant merely as a test case.   If it's more limited than that,
carry on, though you might want to make a note of it in the documentation.
I suspect others will misinterpret the name as I did.

--
Matt Laswell
infinite io, inc.
laswell@infiniteio.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
  2015-05-08 22:54     ` Ravi Kerur
  2015-05-08 23:25       ` Matt Laswell
@ 2015-05-11  9:51       ` Ananyev, Konstantin
  2015-05-11 17:42         ` Ravi Kerur
  1 sibling, 1 reply; 21+ messages in thread
From: Ananyev, Konstantin @ 2015-05-11  9:51 UTC (permalink / raw)
  To: Ravi Kerur, Matt Laswell; +Cc: dev

Hi Ravi,

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ravi Kerur
> Sent: Friday, May 08, 2015 11:55 PM
> To: Matt Laswell
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
> 
> On Fri, May 8, 2015 at 3:29 PM, Matt Laswell <laswell@infiniteio.com> wrote:
> 
> >
> >
> > On Fri, May 8, 2015 at 4:19 PM, Ravi Kerur <rkerur@gmail.com> wrote:
> >
> >> This patch replaces memcmp in librte_hash with rte_memcmp which is
> >> implemented with AVX/SSE instructions.
> >>
> >> +static inline int
> >> +rte_memcmp(const void *_src_1, const void *_src_2, size_t n)
> >> +{
> >> +       const uint8_t *src_1 = (const uint8_t *)_src_1;
> >> +       const uint8_t *src_2 = (const uint8_t *)_src_2;
> >> +       int ret = 0;
> >> +
> >> +       if (n & 0x80)
> >> +               return rte_cmp128(src_1, src_2);
> >> +
> >> +       if (n & 0x40)
> >> +               return rte_cmp64(src_1, src_2);
> >> +
> >> +       if (n & 0x20) {
> >> +               ret = rte_cmp32(src_1, src_2);
> >> +               n -= 0x20;
> >> +               src_1 += 0x20;
> >> +               src_2 += 0x20;
> >> +       }
> >>
> >>
> > Pardon me for butting in, but this seems incorrect for the first two cases
> > listed above, as the function as written will only compare the first 128 or
> > 64 bytes of each source and return the result.  The pattern expressed in
> > the 32 byte case appears more correct, as it compares the first 32 bytes
> > and then lets later pieces of the function handle the smaller remaining
> > bits of the sources. Also, if this function is to handle arbitrarily large
> > source data, the 128 byte case needs to be in a loop.
> >
> > What am I missing?
> >
> 
> Current max hash key length supported is 64 bytes, hence no comparison is
> done after 64 bytes. 128 bytes comparison is added to measure performance
> only and there is no use-case as of now. With the current use-cases its not
> required but if there is a need to handle large arbitrary data upto 128
> bytes it can be modified.

So on x86 let say rte_memcmp(k1, k2, 65) might produce invalid results, right?
While on PPC will work as expected (as it calls memcpu underneath)? 
That looks really weird to me.
If you plan to use rte_memcmp only for hash comparisons, then probably
you should put it somewhere into librte_hash and name it accordingly: rte_hash_key_cmp() or something. 
And put a big comment around it, that it only works with particular lengths.
If you want it to be a generic function inside EAL, then it probably need to handle different lengths properly
on all supported architectures. 
Konstantin

> 
> >
> > --
> > Matt Laswell
> > infinite io, inc.
> > laswell@infiniteio.com
> >
> >

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
  2015-05-11  9:51       ` Ananyev, Konstantin
@ 2015-05-11 17:42         ` Ravi Kerur
       [not found]           ` <2601191342CEEE43887BDE71AB9772582142E44A@irsmsx105.ger.corp.intel.com>
  0 siblings, 1 reply; 21+ messages in thread
From: Ravi Kerur @ 2015-05-11 17:42 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev

Hi Konstantin,


On Mon, May 11, 2015 at 2:51 AM, Ananyev, Konstantin <
konstantin.ananyev@intel.com> wrote:

> Hi Ravi,
>
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ravi Kerur
> > Sent: Friday, May 08, 2015 11:55 PM
> > To: Matt Laswell
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE
> instructions.
> >
> > On Fri, May 8, 2015 at 3:29 PM, Matt Laswell <laswell@infiniteio.com>
> wrote:
> >
> > >
> > >
> > > On Fri, May 8, 2015 at 4:19 PM, Ravi Kerur <rkerur@gmail.com> wrote:
> > >
> > >> This patch replaces memcmp in librte_hash with rte_memcmp which is
> > >> implemented with AVX/SSE instructions.
> > >>
> > >> +static inline int
> > >> +rte_memcmp(const void *_src_1, const void *_src_2, size_t n)
> > >> +{
> > >> +       const uint8_t *src_1 = (const uint8_t *)_src_1;
> > >> +       const uint8_t *src_2 = (const uint8_t *)_src_2;
> > >> +       int ret = 0;
> > >> +
> > >> +       if (n & 0x80)
> > >> +               return rte_cmp128(src_1, src_2);
> > >> +
> > >> +       if (n & 0x40)
> > >> +               return rte_cmp64(src_1, src_2);
> > >> +
> > >> +       if (n & 0x20) {
> > >> +               ret = rte_cmp32(src_1, src_2);
> > >> +               n -= 0x20;
> > >> +               src_1 += 0x20;
> > >> +               src_2 += 0x20;
> > >> +       }
> > >>
> > >>
> > > Pardon me for butting in, but this seems incorrect for the first two
> cases
> > > listed above, as the function as written will only compare the first
> 128 or
> > > 64 bytes of each source and return the result.  The pattern expressed
> in
> > > the 32 byte case appears more correct, as it compares the first 32
> bytes
> > > and then lets later pieces of the function handle the smaller remaining
> > > bits of the sources. Also, if this function is to handle arbitrarily
> large
> > > source data, the 128 byte case needs to be in a loop.
> > >
> > > What am I missing?
> > >
> >
> > Current max hash key length supported is 64 bytes, hence no comparison is
> > done after 64 bytes. 128 bytes comparison is added to measure performance
> > only and there is no use-case as of now. With the current use-cases its
> not
> > required but if there is a need to handle large arbitrary data upto 128
> > bytes it can be modified.
>
> So on x86 let say rte_memcmp(k1, k2, 65) might produce invalid results,
> right?
> While on PPC will work as expected (as it calls memcpu underneath)?
> That looks really weird to me.
> If you plan to use rte_memcmp only for hash comparisons, then probably
> you should put it somewhere into librte_hash and name it accordingly:
> rte_hash_key_cmp() or something.
> And put a big comment around it, that it only works with particular
> lengths.
> If you want it to be a generic function inside EAL, then it probably need
> to handle different lengths properly
> on all supported architectures.
> Konstantin
>
>
Let me just explain it here and probably add it to document as well.

rte_memcmp is not

1. a replacement to memcmp

2.  restricted to hash key comparison

rte_memcmp is

1. optimized comparison for 16 to 128 bytes, v1 patch series had this
support. Changed some of the logic in v2 due to concerns raised for
unavailable use-cases beyond 64 bytes comparison. With minor tuning over
the weekend I am able to get better performance for anything between 16 to
128 bytes comparison.

2. will be specific to DPDK  i.e. currently all memcmp usage in DPDK are
for equality or inequality hence "less than" or "greater than"
implementation in rte_memcmp doesn't make sense and will be removed in
subsequent patches, it will return 0 or 1 for equal/unequal cases.

rte_hash will be the first candidate to move to rte_memcmp and subsequently
rte_lpm6 which uses 16 bytes comparison will be moved

Later on RING_SIZE which uses large size for comparison will be moved. I am
currently studying/understanding that logic and will make changes to
rte_memcmp to support that.

I don't want to make lot of changes in one shot and see that patch series
die a slow death with no takers.

Thanks,
Ravi

>
> > >
> > > --
> > > Matt Laswell
> > > infinite io, inc.
> > > laswell@infiniteio.com
> > >
> > >
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
       [not found]           ` <2601191342CEEE43887BDE71AB9772582142E44A@irsmsx105.ger.corp.intel.com>
@ 2015-05-11 19:35             ` Ananyev, Konstantin
  2015-05-11 20:46               ` Ravi Kerur
  0 siblings, 1 reply; 21+ messages in thread
From: Ananyev, Konstantin @ 2015-05-11 19:35 UTC (permalink / raw)
  To: Ravi Kerur (rkerur@gmail.com); +Cc: dev


Hi Ravi,

> 
> From: Ravi Kerur [mailto:rkerur@gmail.com]
> Sent: Monday, May 11, 2015 6:43 PM
> To: Ananyev, Konstantin
> Cc: Matt Laswell; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
> 
> Hi Konstantin,
> 
> 
> On Mon, May 11, 2015 at 2:51 AM, Ananyev, Konstantin <konstantin.ananyev@intel.com> wrote:
> Hi Ravi,
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ravi Kerur
> > Sent: Friday, May 08, 2015 11:55 PM
> > To: Matt Laswell
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
> >
> > On Fri, May 8, 2015 at 3:29 PM, Matt Laswell <laswell@infiniteio.com> wrote:
> >
> > >
> > >
> > > On Fri, May 8, 2015 at 4:19 PM, Ravi Kerur <rkerur@gmail.com> wrote:
> > >
> > >> This patch replaces memcmp in librte_hash with rte_memcmp which is
> > >> implemented with AVX/SSE instructions.
> > >>
> > >> +static inline int
> > >> +rte_memcmp(const void *_src_1, const void *_src_2, size_t n)
> > >> +{
> > >> +       const uint8_t *src_1 = (const uint8_t *)_src_1;
> > >> +       const uint8_t *src_2 = (const uint8_t *)_src_2;
> > >> +       int ret = 0;
> > >> +
> > >> +       if (n & 0x80)
> > >> +               return rte_cmp128(src_1, src_2);
> > >> +
> > >> +       if (n & 0x40)
> > >> +               return rte_cmp64(src_1, src_2);
> > >> +
> > >> +       if (n & 0x20) {
> > >> +               ret = rte_cmp32(src_1, src_2);
> > >> +               n -= 0x20;
> > >> +               src_1 += 0x20;
> > >> +               src_2 += 0x20;
> > >> +       }
> > >>
> > >>
> > > Pardon me for butting in, but this seems incorrect for the first two cases
> > > listed above, as the function as written will only compare the first 128 or
> > > 64 bytes of each source and return the result.  The pattern expressed in
> > > the 32 byte case appears more correct, as it compares the first 32 bytes
> > > and then lets later pieces of the function handle the smaller remaining
> > > bits of the sources. Also, if this function is to handle arbitrarily large
> > > source data, the 128 byte case needs to be in a loop.
> > >
> > > What am I missing?
> > >
> >
> > Current max hash key length supported is 64 bytes, hence no comparison is
> > done after 64 bytes. 128 bytes comparison is added to measure performance
> > only and there is no use-case as of now. With the current use-cases its not
> > required but if there is a need to handle large arbitrary data upto 128
> > bytes it can be modified.
> So on x86 let say rte_memcmp(k1, k2, 65) might produce invalid results, right?
> While on PPC will work as expected (as it calls memcpu underneath)?
> That looks really weird to me.
> If you plan to use rte_memcmp only for hash comparisons, then probably
> you should put it somewhere into librte_hash and name it accordingly: rte_hash_key_cmp() or something.
> And put a big comment around it, that it only works with particular lengths.
> If you want it to be a generic function inside EAL, then it probably need to handle different lengths properly
> on all supported architectures.
> Konstantin
> 
> 
> Let me just explain it here and probably add it to document as well.
> 
> rte_memcmp is not
> 
> 1. a replacement to memcmp
> 
> 2.  restricted to hash key comparison
> 
> rte_memcmp is
> 
> 1. optimized comparison for 16 to 128 bytes, v1 patch series had this support. Changed some of the logic in v2 due to concerns raised
> for unavailable use-cases beyond 64 bytes comparison.

>From what I see in v2 it supposed to work correctly for len in [0,64] and  len=128, right?
Not sure I get it: so for v1 it was able to handle any length correctly, but then you removed it?
If so, I wonder what was the reason? Make it faster?

Another thing that looks strange to me:
While all rte_cmp*() uses actual data values for comparison results,
rte_memcmp_remainder() return value depends not only on data values but also on data locations:

+static inline int
+rte_memcmp_remainder(const uint8_t *src_1u, const uint8_t *src_2u, size_t n)
+{
...
exit:
+
+	return src_1u < src_2u ? -1 : 1;
+}

If you just test for equal/not equal that doesn't really matter.
If this is supposed to be a 'proper' comparison function, then the result is sort of unpredictable.

> With minor tuning over the weekend I am able to get better performance for
> anything between 16 to 128 bytes comparison.
> 
> 2. will be specific to DPDK  i.e. currently all memcmp usage in DPDK are for equality or inequality hence "less than" or "greater than"
> implementation in rte_memcmp doesn't make sense and will be removed in subsequent patches, it will return 0 or 1 for
> equal/unequal cases.

If you don't plan your function to follow memcmp() semantics and syntax, why to name it rte_memcmp()?
I  think that will make a lot of confusion around.
Why not to name it differently(and put a clear comment in the declaration of course)?

> 
> rte_hash will be the first candidate to move to rte_memcmp and subsequently rte_lpm6 which uses 16 bytes comparison will be
> moved
> 
> Later on RING_SIZE which uses large size for comparison will be moved. I am currently studying/understanding that logic and will make
> changes to rte_memcmp to support that.

Sorry, didn't get you here.
Konstantin

> 
> I don't want to make lot of changes in one shot and see that patch series die a slow death with no takers.
> 
> Thanks,
> Ravi
> 
> >
> > >
> > > --
> > > Matt Laswell
> > > infinite io, inc.
> > > laswell@infiniteio.com
> > >
> > >

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
  2015-05-11 19:35             ` Ananyev, Konstantin
@ 2015-05-11 20:46               ` Ravi Kerur
  2015-05-11 22:29                 ` Don Provan
       [not found]                 ` <2601191342CEEE43887BDE71AB9772582142EBB5@irsmsx105.ger.corp.intel.com>
  0 siblings, 2 replies; 21+ messages in thread
From: Ravi Kerur @ 2015-05-11 20:46 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev

Hi Konstantin,


On Mon, May 11, 2015 at 12:35 PM, Ananyev, Konstantin <
konstantin.ananyev@intel.com> wrote:

>
> Hi Ravi,
>
> >
> > From: Ravi Kerur [mailto:rkerur@gmail.com]
> > Sent: Monday, May 11, 2015 6:43 PM
> > To: Ananyev, Konstantin
> > Cc: Matt Laswell; dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE
> instructions.
> >
> > Hi Konstantin,
> >
> >
> > On Mon, May 11, 2015 at 2:51 AM, Ananyev, Konstantin <
> konstantin.ananyev@intel.com> wrote:
> > Hi Ravi,
> >
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ravi Kerur
> > > Sent: Friday, May 08, 2015 11:55 PM
> > > To: Matt Laswell
> > > Cc: dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE
> instructions.
> > >
> > > On Fri, May 8, 2015 at 3:29 PM, Matt Laswell <laswell@infiniteio.com>
> wrote:
> > >
> > > >
> > > >
> > > > On Fri, May 8, 2015 at 4:19 PM, Ravi Kerur <rkerur@gmail.com> wrote:
> > > >
> > > >> This patch replaces memcmp in librte_hash with rte_memcmp which is
> > > >> implemented with AVX/SSE instructions.
> > > >>
> > > >> +static inline int
> > > >> +rte_memcmp(const void *_src_1, const void *_src_2, size_t n)
> > > >> +{
> > > >> +       const uint8_t *src_1 = (const uint8_t *)_src_1;
> > > >> +       const uint8_t *src_2 = (const uint8_t *)_src_2;
> > > >> +       int ret = 0;
> > > >> +
> > > >> +       if (n & 0x80)
> > > >> +               return rte_cmp128(src_1, src_2);
> > > >> +
> > > >> +       if (n & 0x40)
> > > >> +               return rte_cmp64(src_1, src_2);
> > > >> +
> > > >> +       if (n & 0x20) {
> > > >> +               ret = rte_cmp32(src_1, src_2);
> > > >> +               n -= 0x20;
> > > >> +               src_1 += 0x20;
> > > >> +               src_2 += 0x20;
> > > >> +       }
> > > >>
> > > >>
> > > > Pardon me for butting in, but this seems incorrect for the first two
> cases
> > > > listed above, as the function as written will only compare the first
> 128 or
> > > > 64 bytes of each source and return the result.  The pattern
> expressed in
> > > > the 32 byte case appears more correct, as it compares the first 32
> bytes
> > > > and then lets later pieces of the function handle the smaller
> remaining
> > > > bits of the sources. Also, if this function is to handle arbitrarily
> large
> > > > source data, the 128 byte case needs to be in a loop.
> > > >
> > > > What am I missing?
> > > >
> > >
> > > Current max hash key length supported is 64 bytes, hence no comparison
> is
> > > done after 64 bytes. 128 bytes comparison is added to measure
> performance
> > > only and there is no use-case as of now. With the current use-cases
> its not
> > > required but if there is a need to handle large arbitrary data upto 128
> > > bytes it can be modified.
> > So on x86 let say rte_memcmp(k1, k2, 65) might produce invalid results,
> right?
> > While on PPC will work as expected (as it calls memcpu underneath)?
> > That looks really weird to me.
> > If you plan to use rte_memcmp only for hash comparisons, then probably
> > you should put it somewhere into librte_hash and name it accordingly:
> rte_hash_key_cmp() or something.
> > And put a big comment around it, that it only works with particular
> lengths.
> > If you want it to be a generic function inside EAL, then it probably
> need to handle different lengths properly
> > on all supported architectures.
> > Konstantin
> >
> >
> > Let me just explain it here and probably add it to document as well.
> >
> > rte_memcmp is not
> >
> > 1. a replacement to memcmp
> >
> > 2.  restricted to hash key comparison
> >
> > rte_memcmp is
> >
> > 1. optimized comparison for 16 to 128 bytes, v1 patch series had this
> support. Changed some of the logic in v2 due to concerns raised
> > for unavailable use-cases beyond 64 bytes comparison.
>
> From what I see in v2 it supposed to work correctly for len in [0,64] and
> len=128, right?
> Not sure I get it: so for v1 it was able to handle any length correctly,
> but then you removed it?
> If so, I wonder what was the reason? Make it faster?
>

My initial discussion was with Zhilong(John) from Intel and we decided to
implement up to 128 bytes comparison and use rte_hash and rte_lpm6 as a
candidate for testing. When I sent out v1 patch, Bruce comments were on
use-case for 128 bytes comparison and was it really required? Hence I
decided in v2 to support only up to 64 bytes and added 128 bytes only for
performance measurement.

Personally I think support for up to 128 bytes comparison is required,
there might not be use-cases today but it will definitely be useful.


> Another thing that looks strange to me:
> While all rte_cmp*() uses actual data values for comparison results,
> rte_memcmp_remainder() return value depends not only on data values but
> also on data locations:
>
> +static inline int
> +rte_memcmp_remainder(const uint8_t *src_1u, const uint8_t *src_2u, size_t
> n)
> +{
> ...
> exit:
> +
> +       return src_1u < src_2u ? -1 : 1;
> +}
>
>
This is a bug and its not supposed to be there. I will fix it. Thanks for
catching it.


> If you just test for equal/not equal that doesn't really matter.
> If this is supposed to be a 'proper' comparison function, then the result
> is sort of unpredictable.
>
> > With minor tuning over the weekend I am able to get better performance
> for
> > anything between 16 to 128 bytes comparison.
> >
> > 2. will be specific to DPDK  i.e. currently all memcmp usage in DPDK are
> for equality or inequality hence "less than" or "greater than"
> > implementation in rte_memcmp doesn't make sense and will be removed in
> subsequent patches, it will return 0 or 1 for
> > equal/unequal cases.
>
> If you don't plan your function to follow memcmp() semantics and syntax,
> why to name it rte_memcmp()?
> I  think that will make a lot of confusion around.
> Why not to name it differently(and put a clear comment in the declaration
> of course)?
>

Following memcmp semantics is not hard but there are no use-cases for it in
DPDK currently. Keeping it specific to DPDK usage simplifies code as well.
I can change the name to "rte_compare" and add comments to the function.
Will it work?


>
> >
> > rte_hash will be the first candidate to move to rte_memcmp and
> subsequently rte_lpm6 which uses 16 bytes comparison will be
> > moved
> >
> > Later on RING_SIZE which uses large size for comparison will be moved. I
> am currently studying/understanding that logic and will make
> > changes to rte_memcmp to support that.
>
> Sorry, didn't get you here.
>

Once rte_hash, rte_lpm6 changes and new compare function code are reviewed
and accepted I plan to move to different components (RING_SIZE is currently
defined to be from 256 to 16384 bytes) and memcmp function being used in
test_ring, test_pmd_ring and other functions. I did not want to add all
component changes into one patch series as it causes high review latency or
patch series just dies down silently. Instead make patches small and
incremental in every series, hope this clarifies.

Thanks,
Ravi

Konstantin
>
> >
> > I don't want to make lot of changes in one shot and see that patch
> series die a slow death with no takers.
> >
> > Thanks,
> > Ravi
> >
> > >
> > > >
> > > > --
> > > > Matt Laswell
> > > > infinite io, inc.
> > > > laswell@infiniteio.com
> > > >
> > > >
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
  2015-05-11 20:46               ` Ravi Kerur
@ 2015-05-11 22:29                 ` Don Provan
  2015-05-13  1:16                   ` Ravi Kerur
       [not found]                 ` <2601191342CEEE43887BDE71AB9772582142EBB5@irsmsx105.ger.corp.intel.com>
  1 sibling, 1 reply; 21+ messages in thread
From: Don Provan @ 2015-05-11 22:29 UTC (permalink / raw)
  To: Ravi Kerur, Ananyev, Konstantin; +Cc: dev

I probably shouldn't stick my nose into this, but I can't help myself.

An experienced programmer will tend to ignore the documentation for
a routine named "blahblah_memcmp" and just assume it functions like
memcmp. Whether or not there's currently a use case in DPDK is
completely irrelevant because as soon as there *is* a use case, some
poor DPDK developer will try to use rte_memcmp for that and may or
may not have a test case that reveals their mistake.

The term "compare" suggests checking for larger or smaller.
If you want to check for equality, use "equal" or "eq" in the name
and return true if they're equal. But personally, I'd compare unless
there was a good reason not to. Indeed, I would just implement
full memcmp functionality and be done with it, even if that meant
using my fancy new assembly code for the cases I handle and then
calling memcmp itself for the cases I didn't.

If a routine that appears to take an arbitrary size doesn't, the name
should in some manner reflect what sizes it takes. Better would be
for a routine that only handles specific sizes to be split into versions
that only take fixed sizes, but I don't know enough about your use
cases to say whether that makes sense here.

-don provan
dprovan@bivio.net

-----Original Message-----
From: Ravi Kerur [mailto:rkerur@gmail.com] 
Sent: Monday, May 11, 2015 1:47 PM
To: Ananyev, Konstantin
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.

...
Following memcmp semantics is not hard but there are no use-cases for it in DPDK currently. Keeping it specific to DPDK usage simplifies code as well.
I can change the name to "rte_compare" and add comments to the function.
Will it work?
...

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
  2015-05-08 21:19 ` [dpdk-dev] [PATCH v2] Implement memcmp using " Ravi Kerur
  2015-05-08 22:29   ` Matt Laswell
@ 2015-05-12  8:13   ` Linhaifeng
  2015-05-13  1:18     ` Ravi Kerur
  1 sibling, 1 reply; 21+ messages in thread
From: Linhaifeng @ 2015-05-12  8:13 UTC (permalink / raw)
  To: Ravi Kerur, dev

Hi, Ravi Kerur

On 2015/5/9 5:19, Ravi Kerur wrote:
> Preliminary results on Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, Ubuntu
> 14.04 x86_64 shows comparisons using AVX/SSE instructions taking 1/3rd
> CPU ticks for 16, 32, 48 and 64 bytes comparison. In addition,

I had write a program to test rte_memcmp and I have a question about the result.
Why cost same CPU ticks for 128 256 512 1024 1500 bytes? Is there any problem in
my test?


[root@localhost test]# gcc avx_test.c -O3  -I /data/linhf/v2r2c00/open-source/dpdk/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/ -mavx2 -DRTE_MACHINE_CPUFLAG_AVX2
[root@localhost test]# ./a.out 0
each test run 100000000 times
copy 16 bytes costs average 7(rte_memcmp) 10(memcmp) ticks
copy 32 bytes costs average 9(rte_memcmp) 11(memcmp) ticks
copy 64 bytes costs average 6(rte_memcmp) 13(memcmp) ticks
copy 128 bytes costs average 11(rte_memcmp) 14(memcmp) ticks
copy 256 bytes costs average 9(rte_memcmp) 14(memcmp) ticks
copy 512 bytes costs average 9(rte_memcmp) 14(memcmp) ticks
copy 1024 bytes costs average 9(rte_memcmp) 14(memcmp) ticks
copy 1500 bytes costs average 11(rte_memcmp) 14(memcmp) ticks
[root@localhost test]# ./a.out 1
each test run 100000000 times
copy 16 bytes costs average 2(rte_memcpy) 10(memcpy) ticks
copy 32 bytes costs average 2(rte_memcpy) 10(memcpy) ticks
copy 64 bytes costs average 3(rte_memcpy) 10(memcpy) ticks
copy 128 bytes costs average 7(rte_memcpy) 12(memcpy) ticks
copy 256 bytes costs average 9(rte_memcpy) 23(memcpy) ticks
copy 512 bytes costs average 14(rte_memcpy) 34(memcpy) ticks
copy 1024 bytes costs average 37(rte_memcpy) 61(memcpy) ticks
copy 1500 bytes costs average 62(rte_memcpy) 87(memcpy) ticks


Here is my program:

#include <stdio.h>
#include <rte_cycles.h>
#include <smmintrin.h>
#include <rte_memcpy.h>
#include <rte_memcmp.h>

#define TIMES 100000000L

void test_memcpy(size_t n)
{
        uint64_t start, end, i, start2, end2;
        uint8_t *src, *dst;

        src = (uint8_t*)malloc(n * sizeof(uint8_t));
        dst = (uint8_t*)malloc(n * sizeof(uint8_t));

        start = rte_rdtsc();
        for (i = 0; i < TIMES; i++) {
                rte_memcpy(dst, src, n);
        }
        end = rte_rdtsc();

        start2 = rte_rdtsc();
        for (i = 0; i < TIMES; i++) {
                memcpy(dst, src, n);
        }
        end2 = rte_rdtsc();


        free(src);
        free(dst);

        printf("copy %u bytes costs average %llu(rte_memcpy) %llu(memcpy) ticks\n", n, (end - start)/TIMES, (end2 - start2)/TIMES);
}

int test_memcmp(size_t n)
{
        uint64_t start, end, i, start2, end2, j;
        uint8_t *src, *dst;
        int *ret;

        src = (uint8_t*)malloc(n * sizeof(uint8_t));
        dst = (uint8_t*)malloc(n * sizeof(uint8_t));
        ret = (int*)malloc(TIMES * sizeof(int));

        start = rte_rdtsc();
        for (i = 0; i < TIMES; i++) {
                ret[i] = rte_memcmp(dst, src, n);
        }
        end = rte_rdtsc();

        start2 = rte_rdtsc();
        for (i = 0; i < TIMES; i++) {
                ret[i] = memcmp(dst, src, n);
        }
        end2 = rte_rdtsc();

	// avoid gcc to optimize memcmp
        for (i = 0; i < TIMES; i++) {
                t += ret[i];
        }

        free(src);
        free(dst);

        printf("copy %u bytes costs average %llu(rte_memcmp) %llu(memcmp) ticks\n", n, (end - start)/TIMES, (end2 - start2)/TIMES);
        return t;
}




int main(int narg, char** args)
{
        printf("each test run %llu times\n", TIMES);

        if (narg < 2) {
                printf("usage:./avx_test 0/1 1:test memcpy 0:test memcmp\n");
                return -1;
        }

        if (atoi(args[1])) {
                test_memcpy(16);
                test_memcpy(32);
                test_memcpy(64);
                test_memcpy(128);
                test_memcpy(256);
                test_memcpy(512);
                test_memcpy(1024);
                test_memcpy(1500);
        } else {
                test_memcmp(16);
                test_memcmp(32);
                test_memcmp(64);
                test_memcmp(128);
                test_memcmp(256);
                test_memcmp(512);
                test_memcmp(1024);
                test_memcmp(1500);
        }
}

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
  2015-05-11 22:29                 ` Don Provan
@ 2015-05-13  1:16                   ` Ravi Kerur
  2015-05-13  9:03                     ` Bruce Richardson
  2015-05-13 12:21                     ` Jay Rolette
  0 siblings, 2 replies; 21+ messages in thread
From: Ravi Kerur @ 2015-05-13  1:16 UTC (permalink / raw)
  To: Don Provan; +Cc: dev

On Mon, May 11, 2015 at 3:29 PM, Don Provan <dprovan@bivio.net> wrote:

> I probably shouldn't stick my nose into this, but I can't help myself.
>
> An experienced programmer will tend to ignore the documentation for
> a routine named "blahblah_memcmp" and just assume it functions like
> memcmp. Whether or not there's currently a use case in DPDK is
> completely irrelevant because as soon as there *is* a use case, some
> poor DPDK developer will try to use rte_memcmp for that and may or
> may not have a test case that reveals their mistake.
>

In general I agree with you. However, comparison is a hit(equal) or
miss(unequal) is generally the case in networking. I haven't seen cases
where "less than" or "greater than" has mattered.


>
> The term "compare" suggests checking for larger or smaller.
> If you want to check for equality, use "equal" or "eq" in the name
> and return true if they're equal. But personally, I'd compare unless
> there was a good reason not to. Indeed, I would just implement
> full memcmp functionality and be done with it, even if that meant
> using my fancy new assembly code for the cases I handle and then
> calling memcmp itself for the cases I didn't.
>
> Agreed, I will look into implementing full functionality.


> If a routine that appears to take an arbitrary size doesn't, the name
> should in some manner reflect what sizes it takes. Better would be
> for a routine that only handles specific sizes to be split into versions
> that only take fixed sizes, but I don't know enough about your use
> cases to say whether that makes sense here.
>
>
Users of rte_memcmp will be existing dpdk test and library code.

-don provan
> dprovan@bivio.net
>
> -----Original Message-----
> From: Ravi Kerur [mailto:rkerur@gmail.com]
> Sent: Monday, May 11, 2015 1:47 PM
> To: Ananyev, Konstantin
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE
> instructions.
>
> ...
> Following memcmp semantics is not hard but there are no use-cases for it
> in DPDK currently. Keeping it specific to DPDK usage simplifies code as
> well.
> I can change the name to "rte_compare" and add comments to the function.
> Will it work?
> ...
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
  2015-05-12  8:13   ` Linhaifeng
@ 2015-05-13  1:18     ` Ravi Kerur
  2015-05-13  7:22       ` Linhaifeng
  0 siblings, 1 reply; 21+ messages in thread
From: Ravi Kerur @ 2015-05-13  1:18 UTC (permalink / raw)
  To: Linhaifeng; +Cc: dev

Hi Linhaifeng,


On Tue, May 12, 2015 at 1:13 AM, Linhaifeng <haifeng.lin@huawei.com> wrote:

> Hi, Ravi Kerur
>
> On 2015/5/9 5:19, Ravi Kerur wrote:
> > Preliminary results on Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, Ubuntu
> > 14.04 x86_64 shows comparisons using AVX/SSE instructions taking 1/3rd
> > CPU ticks for 16, 32, 48 and 64 bytes comparison. In addition,
>
> I had write a program to test rte_memcmp and I have a question about the
> result.
> Why cost same CPU ticks for 128 256 512 1024 1500 bytes? Is there any
> problem in
> my test?
>
>
If you can wait until Thursday I will probably send v3 patch which will
have full memcmp support.

In your program try with volatile pointer and see if it helps.

>
> [root@localhost test]# gcc avx_test.c -O3  -I
> /data/linhf/v2r2c00/open-source/dpdk/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/
> -mavx2 -DRTE_MACHINE_CPUFLAG_AVX2
> [root@localhost test]# ./a.out 0
> each test run 100000000 times
> copy 16 bytes costs average 7(rte_memcmp) 10(memcmp) ticks
> copy 32 bytes costs average 9(rte_memcmp) 11(memcmp) ticks
> copy 64 bytes costs average 6(rte_memcmp) 13(memcmp) ticks
> copy 128 bytes costs average 11(rte_memcmp) 14(memcmp) ticks
> copy 256 bytes costs average 9(rte_memcmp) 14(memcmp) ticks
> copy 512 bytes costs average 9(rte_memcmp) 14(memcmp) ticks
> copy 1024 bytes costs average 9(rte_memcmp) 14(memcmp) ticks
> copy 1500 bytes costs average 11(rte_memcmp) 14(memcmp) ticks
> [root@localhost test]# ./a.out 1
> each test run 100000000 times
> copy 16 bytes costs average 2(rte_memcpy) 10(memcpy) ticks
> copy 32 bytes costs average 2(rte_memcpy) 10(memcpy) ticks
> copy 64 bytes costs average 3(rte_memcpy) 10(memcpy) ticks
> copy 128 bytes costs average 7(rte_memcpy) 12(memcpy) ticks
> copy 256 bytes costs average 9(rte_memcpy) 23(memcpy) ticks
> copy 512 bytes costs average 14(rte_memcpy) 34(memcpy) ticks
> copy 1024 bytes costs average 37(rte_memcpy) 61(memcpy) ticks
> copy 1500 bytes costs average 62(rte_memcpy) 87(memcpy) ticks
>
>
> Here is my program:
>
> #include <stdio.h>
> #include <rte_cycles.h>
> #include <smmintrin.h>
> #include <rte_memcpy.h>
> #include <rte_memcmp.h>
>
> #define TIMES 100000000L
>
> void test_memcpy(size_t n)
> {
>         uint64_t start, end, i, start2, end2;
>         uint8_t *src, *dst;
>
>         src = (uint8_t*)malloc(n * sizeof(uint8_t));
>         dst = (uint8_t*)malloc(n * sizeof(uint8_t));
>
>         start = rte_rdtsc();
>         for (i = 0; i < TIMES; i++) {
>                 rte_memcpy(dst, src, n);
>         }
>         end = rte_rdtsc();
>
>         start2 = rte_rdtsc();
>         for (i = 0; i < TIMES; i++) {
>                 memcpy(dst, src, n);
>         }
>         end2 = rte_rdtsc();
>
>
>         free(src);
>         free(dst);
>
>         printf("copy %u bytes costs average %llu(rte_memcpy) %llu(memcpy)
> ticks\n", n, (end - start)/TIMES, (end2 - start2)/TIMES);
> }
>
> int test_memcmp(size_t n)
> {
>         uint64_t start, end, i, start2, end2, j;
>         uint8_t *src, *dst;
>         int *ret;
>
>         src = (uint8_t*)malloc(n * sizeof(uint8_t));
>         dst = (uint8_t*)malloc(n * sizeof(uint8_t));
>         ret = (int*)malloc(TIMES * sizeof(int));
>
>         start = rte_rdtsc();
>         for (i = 0; i < TIMES; i++) {
>                 ret[i] = rte_memcmp(dst, src, n);
>         }
>         end = rte_rdtsc();
>
>         start2 = rte_rdtsc();
>         for (i = 0; i < TIMES; i++) {
>                 ret[i] = memcmp(dst, src, n);
>         }
>         end2 = rte_rdtsc();
>
>         // avoid gcc to optimize memcmp
>         for (i = 0; i < TIMES; i++) {
>                 t += ret[i];
>         }
>
>         free(src);
>         free(dst);
>
>         printf("copy %u bytes costs average %llu(rte_memcmp) %llu(memcmp)
> ticks\n", n, (end - start)/TIMES, (end2 - start2)/TIMES);
>         return t;
> }
>
>
>
>
> int main(int narg, char** args)
> {
>         printf("each test run %llu times\n", TIMES);
>
>         if (narg < 2) {
>                 printf("usage:./avx_test 0/1 1:test memcpy 0:test
> memcmp\n");
>                 return -1;
>         }
>
>         if (atoi(args[1])) {
>                 test_memcpy(16);
>                 test_memcpy(32);
>                 test_memcpy(64);
>                 test_memcpy(128);
>                 test_memcpy(256);
>                 test_memcpy(512);
>                 test_memcpy(1024);
>                 test_memcpy(1500);
>         } else {
>                 test_memcmp(16);
>                 test_memcmp(32);
>                 test_memcmp(64);
>                 test_memcmp(128);
>                 test_memcmp(256);
>                 test_memcmp(512);
>                 test_memcmp(1024);
>                 test_memcmp(1500);
>         }
> }
>
>
>
>
>
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
  2015-05-13  1:18     ` Ravi Kerur
@ 2015-05-13  7:22       ` Linhaifeng
  2015-05-13 20:00         ` Ravi Kerur
  0 siblings, 1 reply; 21+ messages in thread
From: Linhaifeng @ 2015-05-13  7:22 UTC (permalink / raw)
  To: Ravi Kerur; +Cc: dev



On 2015/5/13 9:18, Ravi Kerur wrote:
> If you can wait until Thursday I will probably send v3 patch which will
> have full memcmp support.

Ok, I'd like to test it:)

> 
> In your program try with volatile pointer and see if it helps.

like "volatile uint8_t *src, *dst" ?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
  2015-05-13  1:16                   ` Ravi Kerur
@ 2015-05-13  9:03                     ` Bruce Richardson
  2015-05-13 20:08                       ` Ravi Kerur
  2015-05-13 12:21                     ` Jay Rolette
  1 sibling, 1 reply; 21+ messages in thread
From: Bruce Richardson @ 2015-05-13  9:03 UTC (permalink / raw)
  To: Ravi Kerur; +Cc: dev, Don Provan

On Tue, May 12, 2015 at 06:16:20PM -0700, Ravi Kerur wrote:
> On Mon, May 11, 2015 at 3:29 PM, Don Provan <dprovan@bivio.net> wrote:
> 
> > I probably shouldn't stick my nose into this, but I can't help myself.
> >
> > An experienced programmer will tend to ignore the documentation for
> > a routine named "blahblah_memcmp" and just assume it functions like
> > memcmp. Whether or not there's currently a use case in DPDK is
> > completely irrelevant because as soon as there *is* a use case, some
> > poor DPDK developer will try to use rte_memcmp for that and may or
> > may not have a test case that reveals their mistake.
> >
> 
> In general I agree with you. However, comparison is a hit(equal) or
> miss(unequal) is generally the case in networking. I haven't seen cases
> where "less than" or "greater than" has mattered.
> 
> 
Agreed that == and != are the common operations. However, if that is what
is returned from the function - and given other limitations on parameter sizes -
I agree with previous posters that this function needs to have a different name
to rte_memcmp so as to avoid confusion.

/Bruce

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
       [not found]                 ` <2601191342CEEE43887BDE71AB9772582142EBB5@irsmsx105.ger.corp.intel.com>
@ 2015-05-13 10:12                   ` Ananyev, Konstantin
  2015-05-13 20:06                     ` Ravi Kerur
  0 siblings, 1 reply; 21+ messages in thread
From: Ananyev, Konstantin @ 2015-05-13 10:12 UTC (permalink / raw)
  To: Ravi Kerur (rkerur@gmail.com); +Cc: dev

Hi Ravi,

> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Wednesday, May 13, 2015 11:02 AM
> To: Ananyev, Konstantin
> Subject: FW: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
> 
> 
> 
> From: Ravi Kerur [mailto:rkerur@gmail.com]
> Sent: Monday, May 11, 2015 9:47 PM
> To: Ananyev, Konstantin
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
> 
> Hi Konstantin,
> 
> On Mon, May 11, 2015 at 12:35 PM, Ananyev, Konstantin <konstantin.ananyev@intel.com> wrote:
> 
> Hi Ravi,
> 
> >
> > From: Ravi Kerur [mailto:rkerur@gmail.com]
> > Sent: Monday, May 11, 2015 6:43 PM
> > To: Ananyev, Konstantin
> > Cc: Matt Laswell; dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
> >
> > Hi Konstantin,
> >
> >
> > On Mon, May 11, 2015 at 2:51 AM, Ananyev, Konstantin <konstantin.ananyev@intel.com> wrote:
> > Hi Ravi,
> >
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ravi Kerur
> > > Sent: Friday, May 08, 2015 11:55 PM
> > > To: Matt Laswell
> > > Cc: dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
> > >
> > > On Fri, May 8, 2015 at 3:29 PM, Matt Laswell <laswell@infiniteio.com> wrote:
> > >
> > > >
> > > >
> > > > On Fri, May 8, 2015 at 4:19 PM, Ravi Kerur <rkerur@gmail.com> wrote:
> > > >
> > > >> This patch replaces memcmp in librte_hash with rte_memcmp which is
> > > >> implemented with AVX/SSE instructions.
> > > >>
> > > >> +static inline int
> > > >> +rte_memcmp(const void *_src_1, const void *_src_2, size_t n)
> > > >> +{
> > > >> +       const uint8_t *src_1 = (const uint8_t *)_src_1;
> > > >> +       const uint8_t *src_2 = (const uint8_t *)_src_2;
> > > >> +       int ret = 0;
> > > >> +
> > > >> +       if (n & 0x80)
> > > >> +               return rte_cmp128(src_1, src_2);
> > > >> +
> > > >> +       if (n & 0x40)
> > > >> +               return rte_cmp64(src_1, src_2);
> > > >> +
> > > >> +       if (n & 0x20) {
> > > >> +               ret = rte_cmp32(src_1, src_2);
> > > >> +               n -= 0x20;
> > > >> +               src_1 += 0x20;
> > > >> +               src_2 += 0x20;
> > > >> +       }
> > > >>
> > > >>
> > > > Pardon me for butting in, but this seems incorrect for the first two cases
> > > > listed above, as the function as written will only compare the first 128 or
> > > > 64 bytes of each source and return the result.  The pattern expressed in
> > > > the 32 byte case appears more correct, as it compares the first 32 bytes
> > > > and then lets later pieces of the function handle the smaller remaining
> > > > bits of the sources. Also, if this function is to handle arbitrarily large
> > > > source data, the 128 byte case needs to be in a loop.
> > > >
> > > > What am I missing?
> > > >
> > >
> > > Current max hash key length supported is 64 bytes, hence no comparison is
> > > done after 64 bytes. 128 bytes comparison is added to measure performance
> > > only and there is no use-case as of now. With the current use-cases its not
> > > required but if there is a need to handle large arbitrary data upto 128
> > > bytes it can be modified.
> > So on x86 let say rte_memcmp(k1, k2, 65) might produce invalid results, right?
> > While on PPC will work as expected (as it calls memcpu underneath)?
> > That looks really weird to me.
> > If you plan to use rte_memcmp only for hash comparisons, then probably
> > you should put it somewhere into librte_hash and name it accordingly: rte_hash_key_cmp() or something.
> > And put a big comment around it, that it only works with particular lengths.
> > If you want it to be a generic function inside EAL, then it probably need to handle different lengths properly
> > on all supported architectures.
> > Konstantin
> >
> >
> > Let me just explain it here and probably add it to document as well.
> >
> > rte_memcmp is not
> >
> > 1. a replacement to memcmp
> >
> > 2.  restricted to hash key comparison
> >
> > rte_memcmp is
> >
> > 1. optimized comparison for 16 to 128 bytes, v1 patch series had this support. Changed some of the logic in v2 due to concerns raised
> > for unavailable use-cases beyond 64 bytes comparison.
> From what I see in v2 it supposed to work correctly for len in [0,64] and  len=128, right?
> Not sure I get it: so for v1 it was able to handle any length correctly, but then you removed it?
> If so, I wonder what was the reason? Make it faster?
> 
> My initial discussion was with Zhilong(John) from Intel and we decided to implement up to 128 bytes comparison and use rte_hash
> and rte_lpm6 as a candidate for testing. When I sent out v1 patch, Bruce comments were on use-case for 128 bytes comparison and
> was it really required? Hence I decided in v2 to support only up to 64 bytes and added 128 bytes only for performance measurement.
> Personally I think support for up to 128 bytes comparison is required, there might not be use-cases today but it will definitely be
> useful.

Ok, we don't have a real usage case for it now, but it still probably good to have it work with arbitrary key-length.
Again, as Don suggested in another mail, we can have an optimised implementation
for particular sizes and fall back to slow-path (memcp) for all other cases.
Even if you'll decide to limit len to particular value (64/128), it is probably not very good to have a gap in between,
as it exists now [65-127].

> 
> Another thing that looks strange to me:
> While all rte_cmp*() uses actual data values for comparison results,
> rte_memcmp_remainder() return value depends not only on data values but also on data locations:
> 
> +static inline int
> +rte_memcmp_remainder(const uint8_t *src_1u, const uint8_t *src_2u, size_t n)
> +{
> ...
> exit:
> +
> +       return src_1u < src_2u ? -1 : 1;
> +}
> 
> This is a bug and its not supposed to be there. I will fix it. Thanks for catching it.
> 
> If you just test for equal/not equal that doesn't really matter.
> If this is supposed to be a 'proper' comparison function, then the result is sort of unpredictable.
> > With minor tuning over the weekend I am able to get better performance for
> > anything between 16 to 128 bytes comparison.
> >
> > 2. will be specific to DPDK  i.e. currently all memcmp usage in DPDK are for equality or inequality hence "less than" or "greater than"
> > implementation in rte_memcmp doesn't make sense and will be removed in subsequent patches, it will return 0 or 1 for
> > equal/unequal cases.
> 
> If you don't plan your function to follow memcmp() semantics and syntax, why to name it rte_memcmp()?
> I  think that will make a lot of confusion around.
> Why not to name it differently(and put a clear comment in the declaration of course)?
> 
> Following memcmp semantics is not hard but there are no use-cases for it in DPDK currently. Keeping it specific to DPDK usage
> simplifies code as well. I can change the name to "rte_compare" and add comments to the function. Will it work?

Yep, either rte_compare(), or as Don suggested rte_testequal() - both seems good to me.

Konstantin

> 
> 
> >
> > rte_hash will be the first candidate to move to rte_memcmp and subsequently rte_lpm6 which uses 16 bytes comparison will be
> > moved
> >
> > Later on RING_SIZE which uses large size for comparison will be moved. I am currently studying/understanding that logic and will
> make
> > changes to rte_memcmp to support that.
> 
> Sorry, didn't get you here.
> 
> Once rte_hash, rte_lpm6 changes and new compare function code are reviewed and accepted I plan to move to different
> components (RING_SIZE is currently defined to be from 256 to 16384 bytes) and memcmp function being used in test_ring,
> test_pmd_ring and other functions. I did not want to add all component changes into one patch series as it causes high review latency
> or patch series just dies down silently. Instead make patches small and incremental in every series, hope this clarifies.
> Thanks,
> Ravi
> Konstantin
> 
> >
> > I don't want to make lot of changes in one shot and see that patch series die a slow death with no takers.
> >
> > Thanks,
> > Ravi
> >
> > >
> > > >
> > > > --
> > > > Matt Laswell
> > > > infinite io, inc.
> > > > laswell@infiniteio.com
> > > >
> > > >

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
  2015-05-13  1:16                   ` Ravi Kerur
  2015-05-13  9:03                     ` Bruce Richardson
@ 2015-05-13 12:21                     ` Jay Rolette
  2015-05-13 20:07                       ` Ravi Kerur
  1 sibling, 1 reply; 21+ messages in thread
From: Jay Rolette @ 2015-05-13 12:21 UTC (permalink / raw)
  To: Ravi Kerur; +Cc: dev, Don Provan

On Tue, May 12, 2015 at 8:16 PM, Ravi Kerur <rkerur@gmail.com> wrote:

> On Mon, May 11, 2015 at 3:29 PM, Don Provan <dprovan@bivio.net> wrote:
>
> > I probably shouldn't stick my nose into this, but I can't help myself.
> >
> > An experienced programmer will tend to ignore the documentation for
> > a routine named "blahblah_memcmp" and just assume it functions like
> > memcmp. Whether or not there's currently a use case in DPDK is
> > completely irrelevant because as soon as there *is* a use case, some
> > poor DPDK developer will try to use rte_memcmp for that and may or
> > may not have a test case that reveals their mistake.
> >
>
> In general I agree with you. However, comparison is a hit(equal) or
> miss(unequal) is generally the case in networking. I haven't seen cases
> where "less than" or "greater than" has mattered.
>

It's useful when you need to make sure packets from both sides of a
conversation go to the same processing queue/thread. Instead of hashing the
5-tuple from the packet as src.ip, dst.ip, src.dport, dst.dport, etc., you
can use lesser.ip, higher.ip, lesser.sport, higher.dport, etc.

Very common when you are doing deep packet inspection.

Jay

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
  2015-05-13  7:22       ` Linhaifeng
@ 2015-05-13 20:00         ` Ravi Kerur
  0 siblings, 0 replies; 21+ messages in thread
From: Ravi Kerur @ 2015-05-13 20:00 UTC (permalink / raw)
  To: Linhaifeng; +Cc: dev

On Wed, May 13, 2015 at 12:22 AM, Linhaifeng <haifeng.lin@huawei.com> wrote:

>
>
> On 2015/5/13 9:18, Ravi Kerur wrote:
> > If you can wait until Thursday I will probably send v3 patch which will
> > have full memcmp support.
>
> Ok, I'd like to test it:)
>
> >
> > In your program try with volatile pointer and see if it helps.
>
> like "volatile uint8_t *src, *dst" ?
>

uint8_t * volatile src

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
  2015-05-13 10:12                   ` Ananyev, Konstantin
@ 2015-05-13 20:06                     ` Ravi Kerur
  0 siblings, 0 replies; 21+ messages in thread
From: Ravi Kerur @ 2015-05-13 20:06 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev

Hi Konstanin,


On Wed, May 13, 2015 at 3:12 AM, Ananyev, Konstantin <
konstantin.ananyev@intel.com> wrote:

> Hi Ravi,
>
> > -----Original Message-----
> > From: Ananyev, Konstantin
> > Sent: Wednesday, May 13, 2015 11:02 AM
> > To: Ananyev, Konstantin
> > Subject: FW: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE
> instructions.
> >
> >
> >
> > From: Ravi Kerur [mailto:rkerur@gmail.com]
> > Sent: Monday, May 11, 2015 9:47 PM
> > To: Ananyev, Konstantin
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE
> instructions.
> >
> > Hi Konstantin,
> >
> > On Mon, May 11, 2015 at 12:35 PM, Ananyev, Konstantin <
> konstantin.ananyev@intel.com> wrote:
> >
> > Hi Ravi,
> >
> > >
> > > From: Ravi Kerur [mailto:rkerur@gmail.com]
> > > Sent: Monday, May 11, 2015 6:43 PM
> > > To: Ananyev, Konstantin
> > > Cc: Matt Laswell; dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE
> instructions.
> > >
> > > Hi Konstantin,
> > >
> > >
> > > On Mon, May 11, 2015 at 2:51 AM, Ananyev, Konstantin <
> konstantin.ananyev@intel.com> wrote:
> > > Hi Ravi,
> > >
> > > > -----Original Message-----
> > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ravi Kerur
> > > > Sent: Friday, May 08, 2015 11:55 PM
> > > > To: Matt Laswell
> > > > Cc: dev@dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE
> instructions.
> > > >
> > > > On Fri, May 8, 2015 at 3:29 PM, Matt Laswell <laswell@infiniteio.com>
> wrote:
> > > >
> > > > >
> > > > >
> > > > > On Fri, May 8, 2015 at 4:19 PM, Ravi Kerur <rkerur@gmail.com>
> wrote:
> > > > >
> > > > >> This patch replaces memcmp in librte_hash with rte_memcmp which is
> > > > >> implemented with AVX/SSE instructions.
> > > > >>
> > > > >> +static inline int
> > > > >> +rte_memcmp(const void *_src_1, const void *_src_2, size_t n)
> > > > >> +{
> > > > >> +       const uint8_t *src_1 = (const uint8_t *)_src_1;
> > > > >> +       const uint8_t *src_2 = (const uint8_t *)_src_2;
> > > > >> +       int ret = 0;
> > > > >> +
> > > > >> +       if (n & 0x80)
> > > > >> +               return rte_cmp128(src_1, src_2);
> > > > >> +
> > > > >> +       if (n & 0x40)
> > > > >> +               return rte_cmp64(src_1, src_2);
> > > > >> +
> > > > >> +       if (n & 0x20) {
> > > > >> +               ret = rte_cmp32(src_1, src_2);
> > > > >> +               n -= 0x20;
> > > > >> +               src_1 += 0x20;
> > > > >> +               src_2 += 0x20;
> > > > >> +       }
> > > > >>
> > > > >>
> > > > > Pardon me for butting in, but this seems incorrect for the first
> two cases
> > > > > listed above, as the function as written will only compare the
> first 128 or
> > > > > 64 bytes of each source and return the result.  The pattern
> expressed in
> > > > > the 32 byte case appears more correct, as it compares the first 32
> bytes
> > > > > and then lets later pieces of the function handle the smaller
> remaining
> > > > > bits of the sources. Also, if this function is to handle
> arbitrarily large
> > > > > source data, the 128 byte case needs to be in a loop.
> > > > >
> > > > > What am I missing?
> > > > >
> > > >
> > > > Current max hash key length supported is 64 bytes, hence no
> comparison is
> > > > done after 64 bytes. 128 bytes comparison is added to measure
> performance
> > > > only and there is no use-case as of now. With the current use-cases
> its not
> > > > required but if there is a need to handle large arbitrary data upto
> 128
> > > > bytes it can be modified.
> > > So on x86 let say rte_memcmp(k1, k2, 65) might produce invalid
> results, right?
> > > While on PPC will work as expected (as it calls memcpu underneath)?
> > > That looks really weird to me.
> > > If you plan to use rte_memcmp only for hash comparisons, then probably
> > > you should put it somewhere into librte_hash and name it accordingly:
> rte_hash_key_cmp() or something.
> > > And put a big comment around it, that it only works with particular
> lengths.
> > > If you want it to be a generic function inside EAL, then it probably
> need to handle different lengths properly
> > > on all supported architectures.
> > > Konstantin
> > >
> > >
> > > Let me just explain it here and probably add it to document as well.
> > >
> > > rte_memcmp is not
> > >
> > > 1. a replacement to memcmp
> > >
> > > 2.  restricted to hash key comparison
> > >
> > > rte_memcmp is
> > >
> > > 1. optimized comparison for 16 to 128 bytes, v1 patch series had this
> support. Changed some of the logic in v2 due to concerns raised
> > > for unavailable use-cases beyond 64 bytes comparison.
> > From what I see in v2 it supposed to work correctly for len in [0,64]
> and  len=128, right?
> > Not sure I get it: so for v1 it was able to handle any length correctly,
> but then you removed it?
> > If so, I wonder what was the reason? Make it faster?
> >
> > My initial discussion was with Zhilong(John) from Intel and we decided
> to implement up to 128 bytes comparison and use rte_hash
> > and rte_lpm6 as a candidate for testing. When I sent out v1 patch, Bruce
> comments were on use-case for 128 bytes comparison and
> > was it really required? Hence I decided in v2 to support only up to 64
> bytes and added 128 bytes only for performance measurement.
> > Personally I think support for up to 128 bytes comparison is required,
> there might not be use-cases today but it will definitely be
> > useful.
>
> Ok, we don't have a real usage case for it now, but it still probably good
> to have it work with arbitrary key-length.
> Again, as Don suggested in another mail, we can have an optimised
> implementation
> for particular sizes and fall back to slow-path (memcp) for all other
> cases.
> Even if you'll decide to limit len to particular value (64/128), it is
> probably not very good to have a gap in between,
> as it exists now [65-127].
>
> Agreed. I am almost done with rte_memcmp mimicing memcmp and results look
ok to me. I am testing out on bsd and linux and will send out updated patch
once i am done with testing.

> >
> > Another thing that looks strange to me:
> > While all rte_cmp*() uses actual data values for comparison results,
> > rte_memcmp_remainder() return value depends not only on data values but
> also on data locations:
> >
> > +static inline int
> > +rte_memcmp_remainder(const uint8_t *src_1u, const uint8_t *src_2u,
> size_t n)
> > +{
> > ...
> > exit:
> > +
> > +       return src_1u < src_2u ? -1 : 1;
> > +}
> >
> > This is a bug and its not supposed to be there. I will fix it. Thanks
> for catching it.
> >
> > If you just test for equal/not equal that doesn't really matter.
> > If this is supposed to be a 'proper' comparison function, then the
> result is sort of unpredictable.
> > > With minor tuning over the weekend I am able to get better performance
> for
> > > anything between 16 to 128 bytes comparison.
> > >
> > > 2. will be specific to DPDK  i.e. currently all memcmp usage in DPDK
> are for equality or inequality hence "less than" or "greater than"
> > > implementation in rte_memcmp doesn't make sense and will be removed in
> subsequent patches, it will return 0 or 1 for
> > > equal/unequal cases.
> >
> > If you don't plan your function to follow memcmp() semantics and syntax,
> why to name it rte_memcmp()?
> > I  think that will make a lot of confusion around.
> > Why not to name it differently(and put a clear comment in the
> declaration of course)?
> >
> > Following memcmp semantics is not hard but there are no use-cases for it
> in DPDK currently. Keeping it specific to DPDK usage
> > simplifies code as well. I can change the name to "rte_compare" and add
> comments to the function. Will it work?
>
> Yep, either rte_compare(), or as Don suggested rte_testequal() - both
> seems good to me.
>
> Konstantin
>
> >
> >
> > >
> > > rte_hash will be the first candidate to move to rte_memcmp and
> subsequently rte_lpm6 which uses 16 bytes comparison will be
> > > moved
> > >
> > > Later on RING_SIZE which uses large size for comparison will be moved.
> I am currently studying/understanding that logic and will
> > make
> > > changes to rte_memcmp to support that.
> >
> > Sorry, didn't get you here.
> >
> > Once rte_hash, rte_lpm6 changes and new compare function code are
> reviewed and accepted I plan to move to different
> > components (RING_SIZE is currently defined to be from 256 to 16384
> bytes) and memcmp function being used in test_ring,
> > test_pmd_ring and other functions. I did not want to add all component
> changes into one patch series as it causes high review latency
> > or patch series just dies down silently. Instead make patches small and
> incremental in every series, hope this clarifies.
> > Thanks,
> > Ravi
> > Konstantin
> >
> > >
> > > I don't want to make lot of changes in one shot and see that patch
> series die a slow death with no takers.
> > >
> > > Thanks,
> > > Ravi
> > >
> > > >
> > > > >
> > > > > --
> > > > > Matt Laswell
> > > > > infinite io, inc.
> > > > > laswell@infiniteio.com
> > > > >
> > > > >
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
  2015-05-13 12:21                     ` Jay Rolette
@ 2015-05-13 20:07                       ` Ravi Kerur
  0 siblings, 0 replies; 21+ messages in thread
From: Ravi Kerur @ 2015-05-13 20:07 UTC (permalink / raw)
  To: Jay Rolette; +Cc: dev, Don Provan

On Wed, May 13, 2015 at 5:21 AM, Jay Rolette <rolette@infiniteio.com> wrote:

> On Tue, May 12, 2015 at 8:16 PM, Ravi Kerur <rkerur@gmail.com> wrote:
>
>> On Mon, May 11, 2015 at 3:29 PM, Don Provan <dprovan@bivio.net> wrote:
>>
>> > I probably shouldn't stick my nose into this, but I can't help myself.
>> >
>> > An experienced programmer will tend to ignore the documentation for
>> > a routine named "blahblah_memcmp" and just assume it functions like
>> > memcmp. Whether or not there's currently a use case in DPDK is
>> > completely irrelevant because as soon as there *is* a use case, some
>> > poor DPDK developer will try to use rte_memcmp for that and may or
>> > may not have a test case that reveals their mistake.
>> >
>>
>> In general I agree with you. However, comparison is a hit(equal) or
>> miss(unequal) is generally the case in networking. I haven't seen cases
>> where "less than" or "greater than" has mattered.
>>
>
> It's useful when you need to make sure packets from both sides of a
> conversation go to the same processing queue/thread. Instead of hashing the
> 5-tuple from the packet as src.ip, dst.ip, src.dport, dst.dport, etc., you
> can use lesser.ip, higher.ip, lesser.sport, higher.dport, etc.
>
> Very common when you are doing deep packet inspection.
>

Thanks for sharing this information.

>
> Jay
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
  2015-05-13  9:03                     ` Bruce Richardson
@ 2015-05-13 20:08                       ` Ravi Kerur
  0 siblings, 0 replies; 21+ messages in thread
From: Ravi Kerur @ 2015-05-13 20:08 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev, Don Provan

On Wed, May 13, 2015 at 2:03 AM, Bruce Richardson <
bruce.richardson@intel.com> wrote:

> On Tue, May 12, 2015 at 06:16:20PM -0700, Ravi Kerur wrote:
> > On Mon, May 11, 2015 at 3:29 PM, Don Provan <dprovan@bivio.net> wrote:
> >
> > > I probably shouldn't stick my nose into this, but I can't help myself.
> > >
> > > An experienced programmer will tend to ignore the documentation for
> > > a routine named "blahblah_memcmp" and just assume it functions like
> > > memcmp. Whether or not there's currently a use case in DPDK is
> > > completely irrelevant because as soon as there *is* a use case, some
> > > poor DPDK developer will try to use rte_memcmp for that and may or
> > > may not have a test case that reveals their mistake.
> > >
> >
> > In general I agree with you. However, comparison is a hit(equal) or
> > miss(unequal) is generally the case in networking. I haven't seen cases
> > where "less than" or "greater than" has mattered.
> >
> >
> Agreed that == and != are the common operations. However, if that is what
> is returned from the function - and given other limitations on parameter
> sizes -
> I agree with previous posters that this function needs to have a different
> name
> to rte_memcmp so as to avoid confusion.
>

I will be implementing complete memcmp itself, so probably I will retain
same name.

>
> /Bruce
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2015-05-13 20:08 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-08 21:19 [dpdk-dev] [PATCH v2] Implement rte_memcmp with AVX/SSE instructions Ravi Kerur
2015-05-08 21:19 ` [dpdk-dev] [PATCH v2] Implement memcmp using " Ravi Kerur
2015-05-08 22:29   ` Matt Laswell
2015-05-08 22:54     ` Ravi Kerur
2015-05-08 23:25       ` Matt Laswell
2015-05-11  9:51       ` Ananyev, Konstantin
2015-05-11 17:42         ` Ravi Kerur
     [not found]           ` <2601191342CEEE43887BDE71AB9772582142E44A@irsmsx105.ger.corp.intel.com>
2015-05-11 19:35             ` Ananyev, Konstantin
2015-05-11 20:46               ` Ravi Kerur
2015-05-11 22:29                 ` Don Provan
2015-05-13  1:16                   ` Ravi Kerur
2015-05-13  9:03                     ` Bruce Richardson
2015-05-13 20:08                       ` Ravi Kerur
2015-05-13 12:21                     ` Jay Rolette
2015-05-13 20:07                       ` Ravi Kerur
     [not found]                 ` <2601191342CEEE43887BDE71AB9772582142EBB5@irsmsx105.ger.corp.intel.com>
2015-05-13 10:12                   ` Ananyev, Konstantin
2015-05-13 20:06                     ` Ravi Kerur
2015-05-12  8:13   ` Linhaifeng
2015-05-13  1:18     ` Ravi Kerur
2015-05-13  7:22       ` Linhaifeng
2015-05-13 20:00         ` Ravi Kerur

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).