From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <lukego@gmail.com>
Received: from mail-wi0-f179.google.com (mail-wi0-f179.google.com
 [209.85.212.179]) by dpdk.org (Postfix) with ESMTP id 2D20C5F1B
 for <dev@dpdk.org>; Tue, 27 Jan 2015 14:57:45 +0100 (CET)
Received: by mail-wi0-f179.google.com with SMTP id l15so5176958wiw.0
 for <dev@dpdk.org>; Tue, 27 Jan 2015 05:57:45 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:sender:in-reply-to:references:date:message-id:subject
 :from:to:cc:content-type;
 bh=pbJZLCXB/2p87lCKofpguRJQQsA9nLemj4OnQGMY0TU=;
 b=ckFuRBF/Nux+JJuhQUz3R6sW2tL9uWS60aScqgLSkqWnrIQ6vQcmnZydnJlJQebyxJ
 VUedVm6bi4bERfDyDCpZGooVpmpHEdRNTUuPVINAGHJAaFjBgQSZGlSIlz4OVwSTb6D2
 kbSnsUR8aABxDxzaPKNbnviQayssRIjk6sdi39/nmbf0mo/bzKNPuFGDjggw4+Y7zAdC
 aH/DqSTgWErQFDhWyoZcT5wZG13G/C8trsAknM5njPjBhtNAzzWzTccBBsFR8FBX+dRI
 LmcM+fCDn98Fra/xElv8pAvPjeWlfcgWr5K88JhFohkm77aSt0PO/JzaAF5zc/ojQqAz
 hwAg==
MIME-Version: 1.0
X-Received: by 10.194.62.235 with SMTP id b11mr203336wjs.73.1422367064293;
 Tue, 27 Jan 2015 05:57:44 -0800 (PST)
Sender: lukego@gmail.com
Received: by 10.27.6.134 with HTTP; Tue, 27 Jan 2015 05:57:44 -0800 (PST)
In-Reply-To: <F60F360A2500CD45ACDB1D700268892D0E76129D@SHSMSX101.ccr.corp.intel.com>
References: <1421632414-10027-1-git-send-email-zhihong.wang@intel.com>
 <CAA2XHbfxoc9DDgbNUQJJT4TRfhHc5FbXWTnTfwDO7wEjF3y-Qw@mail.gmail.com>
 <F60F360A2500CD45ACDB1D700268892D0E760AF5@SHSMSX101.ccr.corp.intel.com>
 <CAA2XHbdOUBJwgMMJF7xG2Rh+sPkDfxKp8JkTe6+3zgn-WC7TdQ@mail.gmail.com>
 <F60F360A2500CD45ACDB1D700268892D0E76129D@SHSMSX101.ccr.corp.intel.com>
Date: Tue, 27 Jan 2015 14:57:44 +0100
X-Google-Sender-Auth: xd7v2hWOQvXYTZXVK0Ft9IOR9dY
Message-ID: <CAA2XHbeqZyK-RZEVh-+afwMWoL1ORg1aFvsmDaUeDNCTjmupcA@mail.gmail.com>
From: Luke Gorrie <luke@snabb.co>
To: "snabb-devel@googlegroups.com" <snabb-devel@googlegroups.com>
Content-Type: text/plain; charset=UTF-8
X-Content-Filtered-By: Mailman/MimeDel 2.1.15
Cc: "dev@dpdk.org" <dev@dpdk.org>
Subject: Re: [dpdk-dev] [snabb-devel] RE: [PATCH 0/4] DPDK memcpy
	optimization
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: patches and discussions about DPDK <dev.dpdk.org>
List-Unsubscribe: <http://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <http://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Tue, 27 Jan 2015 13:57:45 -0000

Hi again John,

Thank you for the patient answers :-)

Thank you for pointing this out: I was mistakenly testing your Sandy Bridge
code on Haswell (lacking -DRTE_MACHINE_CPUFLAG_AVX2).
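
For anyone repeating the test, a minimal sketch of a guard against the same mistake. It assumes the AVX2 copy path is selected purely by the RTE_MACHINE_CPUFLAG_AVX2 define mentioned above. The two helper names are hypothetical, and __AVX2__ is the macro GCC and Clang predefine when AVX2 code generation is enabled (for example with -mavx2 or -march=haswell).

```c
/* Hypothetical helper: which copy path this translation unit would pick,
 * assuming the choice depends only on RTE_MACHINE_CPUFLAG_AVX2. */
static const char *selected_copy_path(void)
{
#ifdef RTE_MACHINE_CPUFLAG_AVX2
    return "AVX2";
#else
    return "non-AVX2";
#endif
}

/* Hypothetical helper: returns 1 when the compiler targets AVX2
 * (__AVX2__ is predefined) but the RTE flag was not passed, which is
 * the situation described above; returns 0 otherwise. */
static int avx2_flag_missing(void)
{
#if defined(__AVX2__) && !defined(RTE_MACHINE_CPUFLAG_AVX2)
    return 1;
#else
    return 0;
#endif
}
```

Printing both values at the start of a benchmark run makes it obvious when a Haswell build is still exercising the non-AVX2 path.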

Correcting that, your code is both the fastest and the smallest in my
humble microbenchmarks.

Looks like you have done great work! You probably knew that already :-) but
thank you for walking me through it.

The code compiles to 745 bytes of object code (smaller than glibc 2.20
memcpy) and cachebenches like this:

                Memory Copy Library Cache Test

C Size          Nanosec         MB/sec          % Chnge
-------         -------         -------         -------
256             0.01            97587.60        1.00
384             0.01            97628.83        1.00
512             0.01            97613.95        1.00
768             0.01            147811.44       0.66
1024            0.01            158938.68       0.93
1536            0.01            168487.49       0.94
2048            0.01            174278.83       0.97
3072            0.01            156922.58       1.11
4096            0.01            145811.59       1.08
6144            0.01            157388.27       0.93
8192            0.01            149616.95       1.05
12288           0.01            149064.26       1.00
16384           0.01            107895.06       1.38

The key difference from my perspective is that glibc 2.20 memcpy
performance goes way down for >= 2048 bytes, when it switches from vector
moves to string moves, while your code stays consistent.
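
To compare copy routines across sizes outside of cachebench, something like the following could be used. This is only a rough sketch and not how cachebench itself measures: copy_fn and copy_mb_per_sec are hypothetical names, timing uses the standard clock(), and 1 MB is taken as 1e6 bytes.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Signature shared by memcpy and drop-in replacements. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/*
 * Hypothetical helper: copy `size` bytes `iters` times with `fn` and
 * return the throughput in MB/sec. Returns 0.0 if allocation fails or
 * the elapsed time is too short for clock() to measure.
 */
static double copy_mb_per_sec(copy_fn fn, size_t size, size_t iters)
{
    /* Calling through a volatile pointer keeps the compiler from
     * inlining or eliding the repeated copies. */
    copy_fn volatile call = fn;
    unsigned char *src = malloc(size);
    unsigned char *dst = malloc(size);
    double mbps = 0.0;

    if (src != NULL && dst != NULL) {
        memset(src, 0xA5, size);
        call(dst, src, size);   /* warm-up: touch both buffers once */

        clock_t start = clock();
        for (size_t i = 0; i < iters; i++)
            call(dst, src, size);
        clock_t end = clock();

        double secs = (double)(end - start) / CLOCKS_PER_SEC;
        if (secs > 0.0)
            mbps = ((double)size * (double)iters / 1e6) / secs;
    }
    free(src);
    free(dst);
    return mbps;
}
```

Calling copy_mb_per_sec(memcpy, 1024, 2000000) and copy_mb_per_sec(memcpy, 4096, 500000), and then the same with the routine under test, gives a crude view of whether throughput drops off around 2048 bytes. Both buffers stay cache-resident at these sizes, so the numbers reflect in-cache copy speed only.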

I will take it for a spin in a real application.

Cheers,
-Luke