From: Luke Gorrie <lukego@gmail.com>
To: zhihong.wang@intel.com
Cc: dev@dpdk.org, snabb-devel@googlegroups.com
Date: Sun, 25 Jan 2015 15:50:27 +0100
Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

Hi John,

On 19 January 2015 at 02:53, zhihong.wang@intel.com wrote:

> This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
> It also extends memcpy test coverage with unaligned cases and more test
> points.

I am really interested in this work you are doing on memory copies
optimized for packet data, and I would like to understand it in more
depth. I have a lot of questions and ideas, but let me try to keep it
simple for now :-)

How do you benchmark? Where does the "factor of 2-8" cited elsewhere in
the thread come from? How can I reproduce it? What results are you seeing
compared with libc?

I did a quick benchmark this weekend based on cachebench (part of
LLCbench). This seems like a fairly weak benchmark: always in L1 cache,
always the same alignment, always predictable branches. Do you think it is
relevant? How does it compare with your results?
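To be concrete about why I call it weak: the heart of a cachebench-style
copy test is essentially the loop below. This is my own paraphrase, not
the actual LLCbench source, but it shows the problem: the same two fixed,
identically aligned buffers are copied over and over, so after the first
pass everything is hot in L1 and every branch inside the copy routine is
perfectly predicted.

    #include <stddef.h>
    #include <string.h>

    static char src[16384];
    static char dst[16384];
    static volatile char sink;  /* keep the copies from being optimized away */

    void copy_loop(size_t size, long iterations)
    {
        for (long i = 0; i < iterations; i++) {
            memcpy(dst, src, size);  /* same size and alignment every time */
            sink = dst[size - 1];
        }
    }

The harness just times a loop like this and reports
size * iterations / elapsed, which is roughly how I read the MB/sec
column in the tables below.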
I compared:

  rte_memcpy (the new optimized one, compiled with gcc-4.9, -march=native
  and -O3)
  memcpy from glibc 2.19 (ubuntu 14.04)
  memcpy from glibc 2.20 (arch linux)

on this hardware:

  E5-2620v3 (Haswell)
  E5-2650 (Sandy Bridge)

running cachebench like this:

  ./cachebench -p -e1 -x1 -m14

rte_memcpy on Haswell:

Memory Copy Library Cache Test
C Size    Nanosec   MB/sec       % Chnge
-------   -------   -------      -------
256       0.01      89191.88     1.00
384       0.01      96505.43     0.92
512       0.01      96509.19     1.00
768       0.01      91475.72     1.06
1024      0.01      96293.82     0.95
1536      0.01      96521.66     1.00
2048      0.01      96522.87     1.00
3072      0.01      96525.53     1.00
4096      0.01      96522.79     1.00
6144      0.01      96507.71     1.00
8192      0.01      94584.41     1.02
12288     0.01      95062.80     0.99
16384     0.01      80493.46     1.18

libc 2.20 on Haswell:

Memory Copy Library Cache Test
C Size    Nanosec   MB/sec       % Chnge
-------   -------   -------      -------
256       0.01      65978.64     1.00
384       0.01      100249.01    0.66
512       0.01      123476.55    0.81
768       0.01      144699.86    0.85
1024      0.01      159459.88    0.91
1536      0.01      168001.92    0.95
2048      0.01      80738.31     2.08
3072      0.01      80270.02     1.01
4096      0.01      84239.84     0.95
6144      0.01      90600.13     0.93
8192      0.01      89767.94     1.01
12288     0.01      92085.98     0.97
16384     0.01      92719.95     0.99

libc 2.19 on Haswell:

Memory Copy Library Cache Test
C Size    Nanosec   MB/sec       % Chnge
-------   -------   -------      -------
256       0.02      59871.69     1.00
384       0.01      68545.94     0.87
512       0.01      72674.23     0.94
768       0.01      79257.47     0.92
1024      0.01      79740.43     0.99
1536      0.01      85483.67     0.93
2048      0.01      87703.68     0.97
3072      0.01      86685.71     1.01
4096      0.01      87147.84     0.99
6144      0.01      68622.96     1.27
8192      0.01      70591.25     0.97
12288     0.01      72621.28     0.97
16384     0.01      67713.63     1.07

rte_memcpy on Sandy Bridge:

Memory Copy Library Cache Test
C Size    Nanosec   MB/sec       % Chnge
-------   -------   -------      -------
256       0.02      62158.19     1.00
384       0.01      73256.41     0.85
512       0.01      82032.16     0.89
768       0.01      73919.92     1.11
1024      0.01      75937.51     0.97
1536      0.01      78280.20     0.97
2048      0.01      79562.54     0.98
3072      0.01      80800.93     0.98
4096      0.01      81453.71     0.99
6144      0.01      81915.84     0.99
8192      0.01      82427.98     0.99
12288     0.01      82789.82     1.00
16384     0.01      67519.66     1.23

libc 2.20 on Sandy Bridge:

Memory Copy Library Cache Test
C Size    Nanosec   MB/sec       % Chnge
-------   -------   -------      -------
256       0.02      48651.20     1.00
384       0.02      57653.91     0.84
512       0.01      67909.77     0.85
768       0.01      71177.75     0.95
1024      0.01      72519.48     0.98
1536      0.01      76686.24     0.95
2048      0.19      4975.55      15.41
3072      0.19      5091.97      0.98
4096      0.19      5152.38      0.99
6144      0.18      5211.26      0.99
8192      0.18      5245.27      0.99
12288     0.18      5276.50      0.99
16384     0.18      5209.80      1.01

libc 2.19 on Sandy Bridge:

Memory Copy Library Cache Test
C Size    Nanosec   MB/sec       % Chnge
-------   -------   -------      -------
256       0.02      44970.51     1.00
384       0.02      51922.46     0.87
512       0.02      57230.56     0.91
768       0.02      63438.96     0.90
1024      0.01      67506.58     0.94
1536      0.01      72579.25     0.93
2048      0.01      75722.25     0.96
3072      0.01      71039.19     1.07
4096      0.01      73946.17     0.96
6144      0.02      40969.79     1.80
8192      0.02      41396.05     0.99
12288     0.02      41830.01     0.99
16384     0.02      42032.40     1.00

Last question: why is rte_memcpy inline? (Would making it a library
function give you smaller code, comparable performance, and fast
compiles?)

Cheers!
-Luke
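P.S. To make that last question concrete, here is roughly the shape I
have in mind. This is only a sketch under my own names and file layout,
not the actual DPDK patch:

    /* rte_memcpy.h -- declaration only. Call sites compile to a plain
     * call instruction, so user code stays small and rebuilds fast. */
    #include <stddef.h>

    void *rte_memcpy(void *dst, const void *src, size_t n);

    /* rte_memcpy.c -- one out-of-line definition, compiled once into
     * the DPDK library with whatever vector flags the library picks. */
    #include <string.h>

    void *rte_memcpy(void *dst, const void *src, size_t n)
    {
        /* Placeholder body: the real routine would keep the SSE/AVX
         * size-dispatch logic from this patch set. */
        return memcpy(dst, src, n);
    }

The trade-off I am asking about is the call overhead on small packet
copies versus the I-cache footprint of inlining the whole dispatch body
at every call site.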