Date: Fri, 1 Dec 2017 10:04:18 -0800
From: Stephen Hemminger
To: Konstantin Ananyev
Cc: dev@dpdk.org
Message-ID: <20171201100418.3491bff0@xeon-e3>
In-Reply-To: <1512126771-27503-2-git-send-email-konstantin.ananyev@intel.com>
References: <1512126771-27503-1-git-send-email-konstantin.ananyev@intel.com> <1512126771-27503-2-git-send-email-konstantin.ananyev@intel.com>
Subject: Re: [dpdk-dev] [PATCH 2/2] eal/x86: Use lock-prefixed instructions to reduce cost of rte_smp_mb()
List-Id: DPDK patches and discussions

On Fri, 1 Dec 2017 11:12:51 +0000
Konstantin Ananyev wrote:

> On x86 it is possible to use lock-prefixed instructions to get
> the similar effect as mfence.
> As pointed by Java guys, on most modern HW that gives a better
> performance than using mfence:
> https://shipilev.net/blog/2014/on-the-fence-with-dependencies/
> That patch adopts that technique for rte_smp_mb() implementation.
> On BDW 2.2 mb_autotest on single lcore reports 2X cycle reduction,
> i.e. from ~110 to ~55 cycles per operation.
>
> Signed-off-by: Konstantin Ananyev
> ---
>  .../common/include/arch/x86/rte_atomic.h | 45 +++++++++++++++++++++-
>  1 file changed, 43 insertions(+), 2 deletions(-)
>
> diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic.h b/lib/librte_eal/common/include/arch/x86/rte_atomic.h
> index 4eac66631..07b7fa7f7 100644
> --- a/lib/librte_eal/common/include/arch/x86/rte_atomic.h
> +++ b/lib/librte_eal/common/include/arch/x86/rte_atomic.h
> @@ -55,12 +55,53 @@ extern "C" {
>
>  #define rte_rmb() _mm_lfence()
>
> -#define rte_smp_mb() rte_mb()
> -
>  #define rte_smp_wmb() rte_compiler_barrier()
>
>  #define rte_smp_rmb() rte_compiler_barrier()
>
> +/*
> + * From Intel Software Development Manual; Vol 3;
> + * 8.2.2 Memory Ordering in P6 and More Recent Processor Families:
> + * ...
> + * . Reads are not reordered with other reads.
> + * . Writes are not reordered with older reads.
> + * . Writes to memory are not reordered with other writes,
> + *   with the following exceptions:
> + *   . streaming stores (writes) executed with the non-temporal move
> + *     instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and
> + *   . string operations (see Section 8.2.4.1).
> + * ...
> + * . Reads may be reordered with older writes to different locations but not
> + *   with older writes to the same location.
> + * . Reads or writes cannot be reordered with I/O instructions,
> + *   locked instructions, or serializing instructions.
> + * . Reads cannot pass earlier LFENCE and MFENCE instructions.
> + * . Writes ... cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.
> + * . LFENCE instructions cannot pass earlier reads.
> + * . SFENCE instructions cannot pass earlier writes ...
> + * . MFENCE instructions cannot pass earlier reads, writes ...
> + *
> + * As pointed by Java guys, that makes possible to use lock-prefixed
> + * instructions to get the same effect as mfence and on most modern HW
> + * that gives a better performance than using mfence:
> + * https://shipilev.net/blog/2014/on-the-fence-with-dependencies/
> + * So below we use that technique for rte_smp_mb() implementation.
> + */
> +
> +#ifdef RTE_ARCH_I686
> +#define RTE_SP RTE_STR(esp)
> +#else
> +#define RTE_SP RTE_STR(rsp)
> +#endif
> +
> +#define RTE_MB_DUMMY_MEMP "-128(%%" RTE_SP ")"
> +
> +static __rte_always_inline void
> +rte_smp_mb(void)
> +{
> +	asm volatile("lock addl $0," RTE_MB_DUMMY_MEMP "; " ::: "memory");
> +}
> +
>  #define rte_io_mb() rte_mb()
>
>  #define rte_io_wmb() rte_compiler_barrier()

The lock instruction is a stronger barrier than the compiler barrier
and has a worse performance impact. Are you sure it is necessary to use
it in DPDK? The Linux kernel has successfully used a simple compiler
reordering barrier for years.

Don't confuse rte_smp_mb with the barrier required for talking to I/O devices.
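
For reference, a minimal sketch of the three barrier strengths being compared here. The sketch_* names are hypothetical and not DPDK API. The MFENCE variant assumes rte_mb() maps to MFENCE on x86, the lock-prefixed variant mirrors the quoted patch, and the non-x86 fallback to a GCC/Clang builtin fence is my own addition so the file builds elsewhere.

```c
#include <assert.h>

/* Compiler-only barrier: emits no instruction, it only stops the
 * compiler from moving memory accesses across this point. */
static inline void sketch_compiler_barrier(void)
{
	__asm__ volatile("" ::: "memory");
}

/* Full barrier via MFENCE (assumed equivalent of rte_mb() on x86). */
static inline void sketch_mb_mfence(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__asm__ volatile("mfence" ::: "memory");
#else
	__atomic_thread_fence(__ATOMIC_SEQ_CST);
#endif
}

/* Full barrier via a lock-prefixed add of zero below the stack pointer,
 * mirroring the quoted patch. The add leaves the memory unchanged. */
static inline void sketch_mb_lock_add(void)
{
#if defined(__x86_64__)
	__asm__ volatile("lock addl $0, -128(%%rsp)" ::: "memory", "cc");
#elif defined(__i386__)
	__asm__ volatile("lock addl $0, -128(%%esp)" ::: "memory", "cc");
#else
	__atomic_thread_fence(__ATOMIC_SEQ_CST);
#endif
}

/* Store-then-load pattern (Dekker style): publish our flag, then read
 * the peer's flag. Per the SDM text quoted above, x86 may reorder the
 * read ahead of the older write to a different location, so this is the
 * one case where a compiler barrier alone does not give the ordering. */
static inline int sketch_try_enter(volatile int *mine, const volatile int *peer)
{
	assert(mine != peer);
	*mine = 1;
	sketch_mb_lock_add();
	return *peer == 0;
}
```

Store-store and load-load ordering need no instruction on x86, which matches the rte_smp_wmb() and rte_smp_rmb() definitions in the quoted diff; only the store-then-load case above is affected by the choice between sketch_mb_mfence() and sketch_mb_lock_add().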