From: "Ananyev, Konstantin"
To: Stephen Hemminger
Cc: "dev@dpdk.org"
Thread-Topic: [dpdk-dev] [PATCH 2/2] eal/x86: Use lock-prefixed instructions to reduce cost of rte_smp_mb()
Date: Fri, 1 Dec 2017 23:08:39 +0000
Message-ID: <2601191342CEEE43887BDE71AB9772585FAC39B0@irsmsx105.ger.corp.intel.com>
References: <1512126771-27503-1-git-send-email-konstantin.ananyev@intel.com> <1512126771-27503-2-git-send-email-konstantin.ananyev@intel.com> <20171201100418.3491bff0@xeon-e3>
In-Reply-To: <20171201100418.3491bff0@xeon-e3>
Subject: Re: [dpdk-dev] [PATCH 2/2] eal/x86: Use lock-prefixed instructions to reduce cost of rte_smp_mb()
List-Id: DPDK patches and discussions

Hi Stephen,

> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Friday, December 1, 2017 6:04 PM
> To: Ananyev, Konstantin
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 2/2] eal/x86: Use lock-prefixed instructions to reduce cost of rte_smp_mb()
>
> On Fri, 1 Dec 2017 11:12:51 +0000
> Konstantin Ananyev wrote:
>
> > On x86 it is possible to use lock-prefixed instructions to get
> > the same effect as mfence.
> > As pointed out by the Java guys, on most modern HW that gives better
> > performance than using mfence:
> > https://shipilev.net/blog/2014/on-the-fence-with-dependencies/
> > This patch adopts that technique for the rte_smp_mb() implementation.
> > On BDW 2.2, mb_autotest on a single lcore reports a 2X cycle reduction,
> > i.e. from ~110 to ~55 cycles per operation.
> >
> > Signed-off-by: Konstantin Ananyev
> > ---
> >  .../common/include/arch/x86/rte_atomic.h | 45 +++++++++++++++++++++-
> >  1 file changed, 43 insertions(+), 2 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic.h b/lib/librte_eal/common/include/arch/x86/rte_atomic.h
> > index 4eac66631..07b7fa7f7 100644
> > --- a/lib/librte_eal/common/include/arch/x86/rte_atomic.h
> > +++ b/lib/librte_eal/common/include/arch/x86/rte_atomic.h
> > @@ -55,12 +55,53 @@ extern "C" {
> >
> >  #define rte_rmb() _mm_lfence()
> >
> > -#define rte_smp_mb() rte_mb()
> > -
> >  #define rte_smp_wmb() rte_compiler_barrier()
> >
> >  #define rte_smp_rmb() rte_compiler_barrier()
> >
> > +/*
> > + * From Intel Software Development Manual; Vol 3;
> > + * 8.2.2 Memory Ordering in P6 and More Recent Processor Families:
> > + * ...
> > + * . Reads are not reordered with other reads.
> > + * . Writes are not reordered with older reads.
> > + * . Writes to memory are not reordered with other writes,
> > + *   with the following exceptions:
> > + *   . streaming stores (writes) executed with the non-temporal move
> > + *     instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and
> > + *   . string operations (see Section 8.2.4.1).
> > + * ...
> > + * . Reads may be reordered with older writes to different locations but not
> > + *   with older writes to the same location.
> > + * . Reads or writes cannot be reordered with I/O instructions,
> > + *   locked instructions, or serializing instructions.
> > + * . Reads cannot pass earlier LFENCE and MFENCE instructions.
> > + * . Writes ... cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.
> > + * . LFENCE instructions cannot pass earlier reads.
> > + * . SFENCE instructions cannot pass earlier writes ...
> > + * . MFENCE instructions cannot pass earlier reads, writes ...
> > + *
> > + * As pointed out by the Java guys, that makes it possible to use lock-prefixed
> > + * instructions to get the same effect as mfence, and on most modern HW
> > + * that gives better performance than using mfence:
> > + * https://shipilev.net/blog/2014/on-the-fence-with-dependencies/
> > + * So below we use that technique for the rte_smp_mb() implementation.
> > + */
> > +
> > +#ifdef RTE_ARCH_I686
> > +#define RTE_SP RTE_STR(esp)
> > +#else
> > +#define RTE_SP RTE_STR(rsp)
> > +#endif
> > +
> > +#define RTE_MB_DUMMY_MEMP "-128(%%" RTE_SP ")"
> > +
> > +static __rte_always_inline void
> > +rte_smp_mb(void)
> > +{
> > +	asm volatile("lock addl $0," RTE_MB_DUMMY_MEMP "; " ::: "memory");
> > +}
> > +
> >  #define rte_io_mb() rte_mb()
> >
> >  #define rte_io_wmb() rte_compiler_barrier()
>
> The lock instruction is a stronger barrier than the compiler barrier
> and has a worse performance impact. Are you sure it is necessary to use it in DPDK?
> The Linux kernel has successfully used a simple compiler reordering barrier for years.

Where do you see a compiler barrier?
Right now for x86, rte_smp_mb() == rte_mb() == mfence.
So I am replacing mfence with 'lock add'.
As the comment above says, on most modern x86 systems it is faster,
while still preserving memory ordering.

Konstantin

>
> Don't confuse rte_smp_mb with the required barrier for talking to I/O devices.