From: Luke Gorrie
To: "snabb-devel@googlegroups.com"
Cc: "dev@dpdk.org"
Date: Tue, 27 Jan 2015 14:57:44 +0100
Subject: Re: [dpdk-dev] [snabb-devel] RE: [PATCH 0/4] DPDK memcpy optimization
References: <1421632414-10027-1-git-send-email-zhihong.wang@intel.com>

Hi again John,

Thank you for the patient answers :-)

Thank you for pointing this out: I was mistakenly testing your Sandy Bridge code on Haswell (lacking -DRTE_MACHINE_CPUFLAG_AVX2).
Correcting that, your code is both the fastest and the smallest in my humble micro-benchmarking tests. Looks like you have done great work! You probably knew that already :-) but thank you for walking me through it.

The code compiles to 745 bytes of object code (smaller than glibc 2.20 memcpy) and cachebenches like this:

        Memory Copy Library Cache Test

C Size      Nanosec     MB/sec        % Chnge
-------     -------     -------       -------
256         0.01        97587.60      1.00
384         0.01        97628.83      1.00
512         0.01        97613.95      1.00
768         0.01        147811.44     0.66
1024        0.01        158938.68     0.93
1536        0.01        168487.49     0.94
2048        0.01        174278.83     0.97
3072        0.01        156922.58     1.11
4096        0.01        145811.59     1.08
6144        0.01        157388.27     0.93
8192        0.01        149616.95     1.05
12288       0.01        149064.26     1.00
16384       0.01        107895.06     1.38

The key difference from my perspective is that glibc 2.20 memcpy performance goes way down for >= 2048 bytes when they switch from vector moves to string moves, while your code stays consistent.

I will take it for a spin in a real application.

Cheers,
-Luke
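
For anyone who wants to try a per-size comparison like the table above without cachebench, here is a minimal sketch in C. It is not cachebench and not the DPDK patch. The helper names `copy_mb_per_sec` and `print_copy_table` are hypothetical, it times the C library `memcpy` only, and it uses the coarse standard `clock()` timer, so treat the output as a rough indication. Another copy routine with the same signature could be assigned to `copy_fn` to compare implementations.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Buffer sizes in bytes, matching the rows of the table in this mail. */
static const size_t copy_sizes[] = {
    256, 384, 512, 768, 1024, 1536, 2048,
    3072, 4096, 6144, 8192, 12288, 16384
};

/* Hypothetical helper: copy `size` bytes `iterations` times and return
 * the throughput in MB/sec (1 MB = 1e6 bytes). Returns a negative value
 * if allocation fails, size is zero, or the run was too short for
 * clock() to measure. */
static double copy_mb_per_sec(size_t size, size_t iterations)
{
    /* Calling through a volatile function pointer keeps the compiler
     * from inlining or eliding the copies, so the library routine is
     * what actually gets timed. */
    void *(*volatile copy_fn)(void *, const void *, size_t) = memcpy;
    unsigned char *src = malloc(size);
    unsigned char *dst = malloc(size);
    double seconds, result = -1.0;
    clock_t start, stop;
    size_t i;

    if (src != NULL && dst != NULL && size > 0) {
        /* Touch both buffers so they are mapped before timing starts. */
        memset(src, 0xA5, size);
        memset(dst, 0, size);

        start = clock();
        for (i = 0; i < iterations; i++)
            copy_fn(dst, src, size);
        stop = clock();

        seconds = (double)(stop - start) / CLOCKS_PER_SEC;
        if (seconds > 0.0)
            result = (double)size * (double)iterations / seconds / 1e6;
    }
    free(src);
    free(dst);
    return result;
}

/* Hypothetical helper: print one row per buffer size and return the
 * number of rows printed. */
static size_t print_copy_table(size_t iterations)
{
    size_t n = sizeof(copy_sizes) / sizeof(copy_sizes[0]);
    size_t i;

    printf("%-8s %12s\n", "C Size", "MB/sec");
    for (i = 0; i < n; i++) {
        double mb = copy_mb_per_sec(copy_sizes[i], iterations);
        if (mb > 0.0)
            printf("%-8zu %12.2f\n", copy_sizes[i], mb);
        else
            printf("%-8zu %12s\n", copy_sizes[i], "n/a");
    }
    return n;
}
```

Calling `print_copy_table(1000000)` from a `main` function prints one row per size. The absolute numbers depend on the compiler flags, CPU and libc in use, so they are not directly comparable with the cachebench figures; the interesting part is whether throughput drops as the size crosses 2048 bytes.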