Date: Tue, 5 May 2015 14:56:16 -0700
From: Ravi Kerur
To: Bruce Richardson
Cc: "dev@dpdk.org"
Subject: Re: [dpdk-dev] [PATCH] Implement memcmp using AVX/SSE instructio
List-Id: patches and discussions about DPDK

On Thu, Apr 23, 2015 at 3:26 PM, Ravi Kerur wrote:

>
>
> On Thu, Apr 23, 2015 at
7:00 AM, Bruce Richardson <bruce.richardson@intel.com> wrote:
>
>> On Thu, Apr 23, 2015 at 06:53:44AM -0700, Ravi Kerur wrote:
>> > On Thu, Apr 23, 2015 at 2:23 AM, Ananyev, Konstantin <
>> > konstantin.ananyev@intel.com> wrote:
>> >
>> > >
>> > > > -----Original Message-----
>> > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
>> > > > Sent: Thursday, April 23, 2015 9:12 AM
>> > > > To: Wodkowski, PawelX
>> > > > Cc: dev@dpdk.org
>> > > > Subject: Re: [dpdk-dev] [PATCH] Implement memcmp using AVX/SSE
>> > > > instructio
>> > > >
>> > > > On Thu, Apr 23, 2015 at 09:24:52AM +0200, Pawel Wodkowski wrote:
>> > > > > On 2015-04-22 17:33, Ravi Kerur wrote:
>> > > > > >+/**
>> > > > > >+ * Compare bytes between two locations. The locations must not
>> > > > > >+ * overlap.
>> > > > > >+ *
>> > > > > >+ * @note This is implemented as a macro, so its address should
>> > > > > >+ * not be taken, and care is needed as parameter expressions may
>> > > > > >+ * be evaluated multiple times.
>> > > > > >+ *
>> > > > > >+ * @param src_1
>> > > > > >+ *   Pointer to the first source of the data.
>> > > > > >+ * @param src_2
>> > > > > >+ *   Pointer to the second source of the data.
>> > > > > >+ * @param n
>> > > > > >+ *   Number of bytes to compare.
>> > > > > >+ * @return
>> > > > > >+ *   true if equal, otherwise false.
>> > > > > >+ */
>> > > > > >+static inline bool
>> > > > > >+rte_memcmp(const void *src_1, const void *src_2,
>> > > > > >+           size_t n) __attribute__((always_inline));
>> > > > >
>> > > > > You are exposing this as a public API, so I think you should
>> > > > > follow the description below, or not call this _memcmp_:
>> > > > >
>> > > > >     int memcmp(const void *s1, const void *s2, size_t n);
>> > > > >
>> > > > >     The memcmp() function returns an integer less than, equal
>> > > > >     to, or greater than zero if the first n bytes of s1 is
>> > > > >     found, respectively, to be less than, to match, or be
>> > > > >     greater than the first n bytes of s2.
>> > > > >
>> > > >
>> > > > +1 to this point.
>> > > >
>> > > > Also, if I read the performance numbers quoted in your earlier
>> > > > mail correctly, we are only looking at a 1-4% performance
>> > > > increase. Is the additional code to maintain worth the benefit?
>> > >
>> > > Yep, same thought here, is it really worth it?
>> > > Konstantin
>> > >
>> > > >
>> > > > /Bruce
>> > > >
>> > > > > --
>> > > > > Pawel
>> >
>> > I think I haven't exploited everything x86 has to offer to improve
>> > performance, and I am looking for inputs. Until we have exhausted all
>> > avenues I don't want to drop it. One thing I have noticed is that
>> > bigger key sizes get better performance numbers. I plan to re-run the
>> > perf tests with 64- and 128-byte key sizes and will report back. If
>> > there are any other avenues to try, please let me know and I will
>> > give them a shot.
>> >
>> > Thanks,
>> > Ravi
>>
>> Hi Ravi,
>>
>> Are 128-byte comparisons realistic? An IPv6 5-tuple with double vlan
>> tags is still only 41 bytes, or 48 with some padding added.
>> While for a memcpy function you can see cases where you are going to
>> copy a whole packet, meaning that sizes of 128B+ (up to multiple kB)
>> are realistic, it's harder to see that for a compare function.
>>
>> In any case, we await the results of your further optimization work to
>> see how that goes.
>>
>
Actually I was looking at the wrong numbers. I wrote a couple of sample
programs and found that memory comparison with AVX/SSE takes roughly one
third of the CPU ticks of regular memcmp.

For 16 bytes,

regular memcmp
Time: 276 ticks (3623188 memcmp/tick)
Time: 276 ticks (3623188 memcmp/tick)

memcmp with AVX/SSE
Time: 86 ticks (11627906 memcmp/tick)
Time: 87 ticks (11494252 memcmp/tick)

For 32 bytes,

regular memcmp
Time: 301 ticks (3322259 memcmp/tick)
Time: 302 ticks (3311258 memcmp/tick)

memcmp with AVX/SSE
Time: 87 ticks (11494252 memcmp/tick)
Time: 88 ticks (11363636 memcmp/tick)

For 64 bytes,

regular memcmp
Time: 376 ticks (2855696 memcmp/tick) 0
Time: 377 ticks (2848121 memcmp/tick) 0

memcmp with AVX/SSE
Time: 110 ticks (9761289 memcmp/tick) 0
Time: 110 ticks (9761289 memcmp/tick) 0

With some modifications to the original patch, and looking through
test_hash_perf, which has statistics for every test it performs (Add on
empty, Add update, Lookup), AVX/SSE beats regular memcmp in almost all
categories (16, 32, 48 and 64 bytes). Please note that the time measured
in test_hash_perf is for the hash functions (jhash and hash_crc); memcmp
is just a small part of the hash functionality. I will send the modified
patch later on.

Thanks,
Ravi

> Hi Bruce,
>
> A couple of things I am planning to try:
>
> 1. Use _xor_ and _testz_ instructions for the comparison instead of
>    _cmpeq_ and _mask_.
> 2. I am using unaligned loads and am not sure about the penalty; I plan
>    to try aligned loads when the address is aligned and compare the
>    results.
>
> Agreed that with just L3, or even if we go with L2 + L3 + L4 tuples, it
> will not exceed 64 bytes; 128 bytes is just a stretch for some weird
> MPLSoGRE header formats.
>
> My focus is currently on improving performance for < 64 byte and < 128
> byte key lengths only.
>
> Thanks,
> Ravi
>
> Regards,
>> /Bruce
>>
>