From: Linhaifeng
Date: Tue, 12 May 2015 16:13:09 +0800
To: Ravi Kerur
Subject: Re: [dpdk-dev] [PATCH v2] Implement memcmp using AVX/SSE instructions.
Message-ID: <5551B615.5060405@huawei.com>
In-Reply-To: <1431119989-32124-1-git-send-email-rkerur@gmail.com>
References: <1431119946-32078-1-git-send-email-rkerur@gmail.com> <1431119989-32124-1-git-send-email-rkerur@gmail.com>

Hi, Ravi Kerur

On 2015/5/9 5:19, Ravi Kerur wrote:
> Preliminary results on Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, Ubuntu
> 14.04 x86_64 shows comparisons using AVX/SSE instructions taking 1/3rd
> CPU ticks for 16, 32, 48 and 64 bytes comparison.

In addition, I wrote a program to test rte_memcmp and I have a question about the result. Why does the comparison cost the same number of CPU ticks for 128, 256, 512, 1024 and 1500 bytes? Is there a problem in my test?

[root@localhost test]# gcc avx_test.c -O3 -I /data/linhf/v2r2c00/open-source/dpdk/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/ -mavx2 -DRTE_MACHINE_CPUFLAG_AVX2
[root@localhost test]# ./a.out 0
each test run 100000000 times
copy 16 bytes costs average 7(rte_memcmp) 10(memcmp) ticks
copy 32 bytes costs average 9(rte_memcmp) 11(memcmp) ticks
copy 64 bytes costs average 6(rte_memcmp) 13(memcmp) ticks
copy 128 bytes costs average 11(rte_memcmp) 14(memcmp) ticks
copy 256 bytes costs average 9(rte_memcmp) 14(memcmp) ticks
copy 512 bytes costs average 9(rte_memcmp) 14(memcmp) ticks
copy 1024 bytes costs average 9(rte_memcmp) 14(memcmp) ticks
copy 1500 bytes costs average 11(rte_memcmp) 14(memcmp) ticks
[root@localhost test]# ./a.out 1
each test run 100000000 times
copy 16 bytes costs average 2(rte_memcpy) 10(memcpy) ticks
copy 32 bytes costs average 2(rte_memcpy) 10(memcpy) ticks
copy 64 bytes costs average 3(rte_memcpy) 10(memcpy) ticks
copy 128 bytes costs average 7(rte_memcpy) 12(memcpy) ticks
copy 256 bytes costs average 9(rte_memcpy) 23(memcpy) ticks
copy 512 bytes costs average 14(rte_memcpy) 34(memcpy) ticks
copy 1024 bytes costs average 37(rte_memcpy) 61(memcpy) ticks
copy 1500 bytes costs average 62(rte_memcpy) 87(memcpy) ticks

Here is my program:

// The header names were stripped from the archived message. These are the
// headers the code below relies on; rte_memcmp.h is assumed to be the
// header added by the patch under review.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <rte_cycles.h>
#include <rte_memcpy.h>
#include <rte_memcmp.h>

#define TIMES 100000000L

void test_memcpy(size_t n)
{
    uint64_t start, end, i, start2, end2;
    uint8_t *src, *dst;

    src = (uint8_t*)malloc(n * sizeof(uint8_t));
    dst = (uint8_t*)malloc(n * sizeof(uint8_t));
    // give the source defined contents before copying from it
    memset(src, 0, n);

    start = rte_rdtsc();
    for (i = 0; i < TIMES; i++) {
        rte_memcpy(dst, src, n);
    }
    end = rte_rdtsc();

    start2 = rte_rdtsc();
    for (i = 0; i < TIMES; i++) {
        memcpy(dst, src, n);
    }
    end2 = rte_rdtsc();

    free(src);
    free(dst);

    printf("copy %zu bytes costs average %llu(rte_memcpy) %llu(memcpy) ticks\n",
           n, (unsigned long long)((end - start)/TIMES),
           (unsigned long long)((end2 - start2)/TIMES));
}

int test_memcmp(size_t n)
{
    uint64_t start, end, i, start2, end2;
    uint8_t *src, *dst;
    int *ret;
    int t = 0;

    src = (uint8_t*)malloc(n * sizeof(uint8_t));
    dst = (uint8_t*)malloc(n * sizeof(uint8_t));
    ret = (int*)malloc(TIMES * sizeof(int));
    // the result array is about 400 MB, so the allocation can fail
    if (src == NULL || dst == NULL || ret == NULL) {
        free(src);
        free(dst);
        free(ret);
        return -1;
    }
    // make both buffers equal so every comparison reads defined data
    memset(src, 0, n);
    memset(dst, 0, n);

    start = rte_rdtsc();
    for (i = 0; i < TIMES; i++) {
        ret[i] = rte_memcmp(dst, src, n);
    }
    end = rte_rdtsc();

    start2 = rte_rdtsc();
    for (i = 0; i < TIMES; i++) {
        ret[i] = memcmp(dst, src, n);
    }
    end2 = rte_rdtsc();

    // use the results so that gcc does not optimize memcmp away
    for (i = 0; i < TIMES; i++) {
        t += ret[i];
    }

    free(src);
    free(dst);
    free(ret);

    printf("copy %zu bytes costs average %llu(rte_memcmp) %llu(memcmp) ticks\n",
           n, (unsigned long long)((end - start)/TIMES),
           (unsigned long long)((end2 - start2)/TIMES));
    return t;
}

int main(int narg, char** args)
{
    printf("each test run %ld times\n", TIMES);

    if (narg < 2) {
        printf("usage:./avx_test 0/1 1:test memcpy 0:test memcmp\n");
        return -1;
    }

    if (atoi(args[1])) {
        test_memcpy(16);
        test_memcpy(32);
        test_memcpy(64);
        test_memcpy(128);
        test_memcpy(256);
        test_memcpy(512);
        test_memcpy(1024);
        test_memcpy(1500);
    } else {
        test_memcmp(16);
        test_memcmp(32);
        test_memcmp(64);
        test_memcmp(128);
        test_memcmp(256);
        test_memcmp(512);
        test_memcmp(1024);
        test_memcmp(1500);
    }

    return 0;
}
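
One thing that may be worth checking in a test like the one above (a guess, not verified against this output): the buffers and the length do not change inside the timing loop, so at -O3 the compiler may be free to evaluate the comparison once and move it out of the loop. If that happens, the measured cost per iteration would no longer depend on the number of bytes. Below is a minimal sketch of a loop that guards against this. It does not use DPDK: it times plain memcmp with clock_gettime (nanoseconds, not TSC ticks), and the names escape_ptr, now_ns and bench_memcmp_ns are hypothetical helpers made up for this illustration. The barrier uses GCC/Clang inline assembly syntax, which is an assumption about the toolchain.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: hands the pointer to an empty asm statement with a
 * "memory" clobber. The optimizer must then assume the pointed-to bytes may
 * have changed, so a call with loop-invariant arguments cannot be hoisted.
 * GCC/Clang syntax (assumption about the compiler in use). */
static inline void escape_ptr(const void *p)
{
    __asm__ volatile("" : : "r"(p) : "memory");
}

/* Hypothetical helper: monotonic clock in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Average nanoseconds per memcmp() over two equal n-byte buffers.
 * The OR of all results is stored in *sink so the calls stay live.
 * Returns -1.0 if an allocation fails. */
double bench_memcmp_ns(size_t n, uint64_t iters, int *sink)
{
    assert(n > 0 && iters > 0 && sink != NULL);

    uint8_t *a = malloc(n);
    uint8_t *b = malloc(n);
    if (a == NULL || b == NULL) {
        free(a);
        free(b);
        return -1.0;
    }

    /* Equal contents force the comparison to walk all n bytes. */
    memset(a, 0xA5, n);
    memset(b, 0xA5, n);

    int acc = 0;
    uint64_t t0 = now_ns();
    for (uint64_t i = 0; i < iters; i++) {
        escape_ptr(a);
        escape_ptr(b);
        acc |= memcmp(a, b, n);
    }
    uint64_t t1 = now_ns();

    *sink = acc;
    free(a);
    free(b);
    return (double)(t1 - t0) / (double)iters;
}
```

Calling bench_memcmp_ns(1500, 1000000, &sink) and printing the returned value gives an average per call; with equal buffers, sink should end up as 0. The sketch accumulates into a single int instead of a large result array, so stores to that array do not add to the measured time. The same barrier could be placed in a loop around rte_memcmp and rte_rdtsc, but that variant is not shown here.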