From: Jia He
To: Jerin Jacob
Cc: "Ananyev, Konstantin", "Zhao, Bing", Olivier MATZ, "dev@dpdk.org", "jia.he@hxt-semitech.com", "jie2.liu@hxt-semitech.com", "bing.zhao@hxt-semitech.com", "Richardson, Bruce", jianbo.liu@arm.com, hemant.agrawal@nxp.com
Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
Date: Wed, 1 Nov 2017 10:53:12 +0800
Message-ID: <69adfb00-4582-b362-0540-d1d9d6bcf6aa@gmail.com>
In-Reply-To: <20171031111433.GA21742@jerin>

Hi Jerin,

Thanks for your suggestions. I will try to use a config macro so that the implementation can be chosen by the user.

I need to point out one possible issue in your load_acq/store_rel patch at
https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch

@@ -516,8 +541,13 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
         /* Restore n as it may change every loop */
         n = max;
+#if 0
         *old_head = r->cons.head;
         const uint32_t prod_tail = r->prod.tail;
+#else
+        *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_RELAXED);                --[1]
+        const uint32_t prod_tail = __atomic_load_n(&r->prod.tail, __ATOMIC_ACQUIRE); --[2]
+#endif

At line [1], __ATOMIC_RELAXED is not enough for this case (tested on our ARM64 server).
The __ATOMIC_ACQUIRE at line [2] guarantees that the 2nd load is not reordered before
the 1st load, but it does not guarantee that the 1st load is not reordered after the
2nd load. Please also refer to the FreeBSD implementation you mentioned: it uses
__ATOMIC_ACQUIRE at line [1]. Should it instead be:

+#else
+        *old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
+        const uint32_t prod_tail = __atomic_load_n(&r->prod.tail, __ATOMIC_ACQUIRE);

(See also the sketch appended at the end of this mail.)

Cheers,
Jia

On 10/31/2017 7:14 PM, Jerin Jacob wrote:
> -----Original Message-----
>> Date: Tue, 31 Oct 2017 10:55:15 +0800
>> From: Jia He
>> To: Jerin Jacob
>> Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
>> loading when doing enqueue/dequeue
>>
>> Hi Jerin
> Hi Jia,
>
>> For the next step, do you think I need to implement the load_acquire half
>> barrier as per freebsd
> I did a quick prototype using the C11 memory model (ACQUIRE/RELEASE) semantics
> and tested it on two arm64 platforms in Cavium (Platform A: non-OOO arm64 machine,
> Platform B: OOO arm64 machine).
>
> smp_rmb() performs better on Platform A;
> acquire/release semantics perform better on Platform B.
>
> Here is the patch:
> https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch
>
> In terms of next steps:
> - I am not sure about the cost associated with acquire/release semantics on x86 or ppc.
>   IMO, we need to have both options under conditional compilation
>   flags and let the target platform choose the best one.
>
> Thoughts?
>
> Here are the performance numbers:
> - Both platforms run at different frequencies, so the absolute numbers do not
>   matter; just check the relative numbers.
>
> Platform A: Performance numbers:
> ================================
> # no patch (non-OOO arm64 machine)
> ----------------------------------
>
> SP/SC single enq/dequeue: 40
> MP/MC single enq/dequeue: 282
> SP/SC burst enq/dequeue (size: 8): 11
> MP/MC burst enq/dequeue (size: 8): 42
> SP/SC burst enq/dequeue (size: 32): 8
> MP/MC burst enq/dequeue (size: 32): 16
>
> ### Testing empty dequeue ###
> SC empty dequeue: 8.01
> MC empty dequeue: 11.01
>
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 11.30
> MP/MC bulk enq/dequeue (size: 8): 42.85
> SP/SC bulk enq/dequeue (size: 32): 8.25
> MP/MC bulk enq/dequeue (size: 32): 16.46
>
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 20.62
> MP/MC bulk enq/dequeue (size: 8): 56.30
> SP/SC bulk enq/dequeue (size: 32): 10.94
> MP/MC bulk enq/dequeue (size: 32): 18.66
> Test OK
>
> # smp_rmb() patch (non-OOO arm64 machine)
> http://dpdk.org/dev/patchwork/patch/30029/
> ------------------------------------------
>
> SP/SC single enq/dequeue: 42
> MP/MC single enq/dequeue: 291
> SP/SC burst enq/dequeue (size: 8): 12
> MP/MC burst enq/dequeue (size: 8): 44
> SP/SC burst enq/dequeue (size: 32): 8
> MP/MC burst enq/dequeue (size: 32): 16
>
> ### Testing empty dequeue ###
> SC empty dequeue: 13.01
> MC empty dequeue: 15.01
>
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 11.60
> MP/MC bulk enq/dequeue (size: 8): 44.32
> SP/SC bulk enq/dequeue (size: 32): 8.60
> MP/MC bulk enq/dequeue (size: 32): 16.50
>
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 20.95
> MP/MC bulk enq/dequeue (size: 8): 56.90
> SP/SC bulk enq/dequeue (size: 32): 10.90
> MP/MC bulk enq/dequeue (size: 32): 18.78
> Test OK
> RTE>>
>
> # c11 memory model patch (non-OOO arm64 machine)
> https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch
> -----------------------------------------------------------------------------------------------
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 197
> MP/MC single enq/dequeue: 328
> SP/SC burst enq/dequeue (size: 8): 31
> MP/MC burst enq/dequeue (size: 8): 50
> SP/SC burst enq/dequeue (size: 32): 13
> MP/MC burst enq/dequeue (size: 32): 18
>
> ### Testing empty dequeue ###
> SC empty dequeue: 13.01
> MC empty dequeue: 18.02
>
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 30.95
> MP/MC bulk enq/dequeue (size: 8): 50.30
> SP/SC bulk enq/dequeue (size: 32): 13.27
> MP/MC bulk enq/dequeue (size: 32): 18.11
>
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 43.38
> MP/MC bulk enq/dequeue (size: 8): 64.42
> SP/SC bulk enq/dequeue (size: 32): 16.71
> MP/MC bulk enq/dequeue (size: 32): 22.21
>
>
> Platform B: Performance numbers:
> ================================
> # no patch (OOO arm64 machine)
> ------------------------------
>
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 81
> MP/MC single enq/dequeue: 207
> SP/SC burst enq/dequeue (size: 8): 15
> MP/MC burst enq/dequeue (size: 8): 31
> SP/SC burst enq/dequeue (size: 32): 7
> MP/MC burst enq/dequeue (size: 32): 11
>
> ### Testing empty dequeue ###
> SC empty dequeue: 3.00
> MC empty dequeue: 5.00
>
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 15.38
> MP/MC bulk enq/dequeue (size: 8): 30.64
> SP/SC bulk enq/dequeue (size: 32): 7.25
> MP/MC bulk enq/dequeue (size: 32): 11.06
>
> ### Testing using two hyperthreads ###
> SP/SC bulk enq/dequeue (size: 8): 31.51
> MP/MC bulk enq/dequeue (size: 8): 49.38
> SP/SC bulk enq/dequeue (size: 32): 14.32
> MP/MC bulk enq/dequeue (size: 32): 15.89
>
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 72.66
> MP/MC bulk enq/dequeue (size: 8): 121.89
> SP/SC bulk enq/dequeue (size: 32): 16.88
> MP/MC bulk enq/dequeue (size: 32): 24.23
> Test OK
> RTE>>
>
>
> # smp_rmb() patch (OOO arm64 machine)
> http://dpdk.org/dev/patchwork/patch/30029/
> ------------------------------------------
>
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 152
> MP/MC single enq/dequeue: 265
> SP/SC burst enq/dequeue (size: 8): 24
> MP/MC burst enq/dequeue (size: 8): 39
> SP/SC burst enq/dequeue (size: 32): 9
> MP/MC burst enq/dequeue (size: 32): 13
>
> ### Testing empty dequeue ###
> SC empty dequeue: 31.01
> MC empty dequeue: 32.01
>
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 24.26
> MP/MC bulk enq/dequeue (size: 8): 39.52
> SP/SC bulk enq/dequeue (size: 32): 9.47
> MP/MC bulk enq/dequeue (size: 32): 13.31
>
> ### Testing using two hyperthreads ###
> SP/SC bulk enq/dequeue (size: 8): 40.29
> MP/MC bulk enq/dequeue (size: 8): 59.57
> SP/SC bulk enq/dequeue (size: 32): 17.34
> MP/MC bulk enq/dequeue (size: 32): 21.58
>
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 79.05
> MP/MC bulk enq/dequeue (size: 8): 153.46
> SP/SC bulk enq/dequeue (size: 32): 26.41
> MP/MC bulk enq/dequeue (size: 32): 38.37
> Test OK
> RTE>>
>
>
> # c11 memory model patch (OOO arm64 machine)
> https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch
> -----------------------------------------------------------------------------------------------
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 98
> MP/MC single enq/dequeue: 130
> SP/SC burst enq/dequeue (size: 8): 18
> MP/MC burst enq/dequeue (size: 8): 22
> SP/SC burst enq/dequeue (size: 32): 7
> MP/MC burst enq/dequeue (size: 32): 9
>
> ### Testing empty dequeue ###
> SC empty dequeue: 4.00
> MC empty dequeue: 5.00
>
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 17.40
> MP/MC bulk enq/dequeue (size: 8): 22.88
> SP/SC bulk enq/dequeue (size: 32): 7.62
> MP/MC bulk enq/dequeue (size: 32): 8.96
>
> ### Testing using two hyperthreads ###
> SP/SC bulk enq/dequeue (size: 8): 20.24
> MP/MC bulk enq/dequeue (size: 8): 25.83
> SP/SC bulk enq/dequeue (size: 32): 12.21
> MP/MC bulk enq/dequeue (size: 32): 13.20
>
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 67.54
> MP/MC bulk enq/dequeue (size: 8): 124.63
> SP/SC bulk enq/dequeue (size: 32): 21.13
> MP/MC bulk enq/dequeue (size: 32): 28.44
> Test OK
> RTE>>quit
>
>
>> or find any other performance test case to compare the performance impact?
> As far as I know, ring_perf_autotest is the best performance test.
> If you have trouble using the high-resolution cycle counter on your platform,
> you can still use ring_perf_autotest to compare performance (since the
> relative numbers are what matter).
>
> Jerin
>
>> Thanks for any suggestions.
>>
>> Cheers,
>> Jia
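
P.S. Below is a minimal, self-contained sketch (not the actual patch, and not real
DPDK code) of the ordering I am arguing for above: both the cons.head load and the
prod.tail load use __ATOMIC_ACQUIRE. The struct, field, and function names
(toy_ring, toy_move_cons_head, cons_head, prod_tail) are simplified stand-ins made
up for illustration, and the relaxed ordering on the CAS is also only my assumption
for this sketch, since the retry loop re-issues the acquire loads anyway.

    #include <stdint.h>

    struct toy_ring {
        volatile uint32_t prod_tail;  /* advanced by producers with a release store */
        volatile uint32_t cons_head;  /* advanced by consumers via CAS */
        uint32_t mask;                /* ring size - 1 (size is a power of two) */
    };

    /* Try to reserve n entries for dequeue; returns how many were reserved. */
    static inline uint32_t
    toy_move_cons_head(struct toy_ring *r, uint32_t n,
                       uint32_t *old_head, uint32_t *new_head)
    {
        uint32_t max = n;
        uint32_t entries;

        do {
            n = max; /* restore n, it may have been clipped in a previous loop */

            /* [1] acquire: later loads (including [2] and the ring slots)
             * cannot be hoisted before this load. */
            *old_head = __atomic_load_n(&r->cons_head, __ATOMIC_ACQUIRE);

            /* [2] acquire: pairs with the producer's release store to
             * prod_tail, making the enqueued objects visible. */
            uint32_t prod_tail = __atomic_load_n(&r->prod_tail,
                                                 __ATOMIC_ACQUIRE);

            entries = prod_tail - *old_head;  /* entries available to dequeue */
            if (n > entries)
                n = entries;
            if (n == 0)
                return 0;

            *new_head = *old_head + n;
            /* multi-consumer: move the head forward with a CAS, retry on failure */
        } while (!__atomic_compare_exchange_n(&r->cons_head, old_head, *new_head,
                                              0 /* strong */,
                                              __ATOMIC_RELAXED, __ATOMIC_RELAXED));
        return n;
    }

On the config macro point: the acquire/release path above could sit behind a
build-time option (for example a hypothetical RTE_RING_USE_C11_MEM_MODEL flag;
the name is only for illustration), keeping the existing rte_smp_rmb() path as
the default so each target can pick whichever performs better.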