Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue
From: Jia He
To: Jerin Jacob
Cc: "Ananyev, Konstantin", "Zhao, Bing", Olivier MATZ, "dev@dpdk.org",
 "jia.he@hxt-semitech.com", "jie2.liu@hxt-semitech.com",
 "bing.zhao@hxt-semitech.com", "Richardson, Bruce"
Date: Mon, 23 Oct 2017 16:49:01 +0800

Hi Jerin,

On 10/20/2017 1:43 PM, Jerin Jacob wrote:
> -----Original Message-----
>> [...]
>> dependent on each other.
>> Thus a memory barrier is necessary.
>
> Yes. The barrier is necessary.
> In fact, upstream FreeBSD fixed this issue for arm64. The DPDK ring
> implementation is derived from FreeBSD's buf_ring.h.
> https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
>
> I think the only outstanding issue is how to reduce the performance
> impact for arm64. I believe using acquire/release semantics instead
> of rte_smp_rmb() will reduce the performance overhead, as in the similar ring implementations below:
> freebsd: https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
> odp: https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c
>
> Jia,
> 1) Can you verify that the use of acquire/release semantics fixes the problem on your
> platform? i.e. the use of atomic_load_acq* as in the reference code.
> 2) If so, what is the overhead between acquire/release and plain rte_smp_rmb()
> barriers? Based on that we need to decide which path to take.

I've tested 3 cases. The new 3rd case uses the load_acquire barrier
(half barrier) you mentioned at the above link. The patch looks like:

@@ -408,8 +466,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,

                /* Reset n to the initial burst count */
                n = max;

-               *old_head = r->prod.head;
-               const uint32_t cons_tail = r->cons.tail;
+               *old_head = atomic_load_acq_32(&r->prod.head);
+               const uint32_t cons_tail = atomic_load_acq_32(&r->cons.tail);

@@ -516,14 +576,15 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_s

                /* Restore n as it may change every loop */
                n = max;

-               *old_head = r->cons.head;
-               const uint32_t prod_tail = r->prod.tail;
+               *old_head = atomic_load_acq_32(&r->cons.head);
+               const uint32_t prod_tail = atomic_load_acq_32(&r->prod.tail);

                /* The subtraction is done between two unsigned 32bits value
                 * (the result is always modulo 32 bits even if we have
                 * cons_head > prod_tail). So 'entries' is always between 0
                 * and size(ring)-1. */

The half barrier patch passed the functional test.
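As a side note, below is a minimal, self-contained sketch of the same
half-barrier idea written with the GCC/C11 __atomic builtins rather than
the FreeBSD-style atomic_load_acq_32(). The names ring_hdr, load_acq_32
and free_entries_sketch are mine, for illustration only, and are not
part of the patch:

#include <stdint.h>

struct ring_hdr {
        volatile uint32_t head;
        volatile uint32_t tail;
};

/* Acquire load: maps to a load-acquire instruction (LDAR) on arm64,
 * so the later load of the opposite index cannot be hoisted above it. */
static inline uint32_t
load_acq_32(const volatile uint32_t *p)
{
        return __atomic_load_n(p, __ATOMIC_ACQUIRE);
}

/* Sketch of the free-entries computation with the half barrier. */
static inline uint32_t
free_entries_sketch(const struct ring_hdr *prod,
                    const struct ring_hdr *cons, uint32_t capacity)
{
        uint32_t old_head  = load_acq_32(&prod->head);
        uint32_t cons_tail = load_acq_32(&cons->tail);

        /* Unsigned subtraction is modulo 2^32, so the result stays in
         * [0, capacity] even after the 32-bit indices wrap. */
        return capacity + cons_tail - old_head;
}

The matching tail update on the writer side would need a release store,
e.g. __atomic_store_n(&tail, v, __ATOMIC_RELEASE), so that the pair
provides the ordering that the full rte_smp_rmb() gives today.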
As for the performance comparison on *arm64* (the debug patch is at
http://dpdk.org/ml/archives/dev/2017-October/079012.html), please see
the test results below:

[case 1] old code, no barrier
============================================
 Performance counter stats for './test --no-huge -l 1-10':

     689275.001200      task-clock (msec)         #    9.771 CPUs utilized
              6223      context-switches          #    0.009 K/sec
                10      cpu-migrations            #    0.000 K/sec
               653      page-faults               #    0.001 K/sec
     1721190914583      cycles                    #    2.497 GHz
     3363238266430      instructions              #    1.95  insn per cycle
                        branches
          27804740      branch-misses             #    0.00% of all branches

      70.540618825 seconds time elapsed

[case 2] full barrier with rte_smp_rmb()
============================================
 Performance counter stats for './test --no-huge -l 1-10':

     582557.895850      task-clock (msec)         #    9.752 CPUs utilized
              5242      context-switches          #    0.009 K/sec
                10      cpu-migrations            #    0.000 K/sec
               665      page-faults               #    0.001 K/sec
     1454360730055      cycles                    #    2.497 GHz
      587197839907      instructions              #    0.40  insn per cycle
                        branches
          27799687      branch-misses             #    0.00% of all branches

      59.735582356 seconds time elapsed

[case 3] half barrier with load_acquire
============================================
 Performance counter stats for './test --no-huge -l 1-10':

     660758.877050      task-clock (msec)         #    9.764 CPUs utilized
              5982      context-switches          #    0.009 K/sec
                11      cpu-migrations            #    0.000 K/sec
               657      page-faults               #    0.001 K/sec
     1649875318044      cycles                    #    2.497 GHz
      591583257765      instructions              #    0.36  insn per cycle
                        branches
          27994903      branch-misses             #    0.00% of all branches

      67.672855107 seconds time elapsed

Please note the context-switch counts in the perf results. The test
results sorted by running time are: full barrier < half barrier < no
barrier. AFAICT, in this case the CPU reordering increases the chance
of context switches and thus the running time. Any ideas?

Cheers,
Jia

>
> Note:
> This issue won't show up on all arm64 implementations; it shows up on
> arm64 implementations with OOO (out-of-order) execution.
>
>
>> Cheers,
>> Jia
>>
>>> Konstantin
>>>
>>>>> . In another
>>>>> mail of this thread, we've made a simple test based on this and captured
>>>>> some information and I pasted it there. (I pasted the patch there :-))
>>>> Are you talking about this one:
>>>> http://dpdk.org/dev/patchwork/patch/30405/
>>>> ?
>>>> It still uses test/test/test_mbuf.c...,
>>>> but anyway I don't really understand how mbuf_autotest is supposed
>>>> to work with these changes:
>>>> @@ -730,7 +739,7 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter,
>>>>                          rte_ring_enqueue(refcnt_mbuf_ring, m);
>>>>                  }
>>>>          }
>>>> -        rte_pktmbuf_free(m);
>>>> +        // rte_pktmbuf_free(m);
>>>>  }
>>>> @@ -741,6 +750,12 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter,
>>>>          while (!rte_ring_empty(refcnt_mbuf_ring))
>>>>                  ;
>>>>
>>>> +        if (NULL != m) {
>>>> +                if (1 != rte_mbuf_refcnt_read(m))
>>>> +                        printf("m ref is %u\n", rte_mbuf_refcnt_read(m));
>>>> +                rte_pktmbuf_free(m);
>>>> +        }
>>>> +
>>>>          /* check that all mbufs are back into mempool by now */
>>>>          for (wn = 0; wn != REFCNT_MAX_TIMEOUT; wn++) {
>>>>                  if ((i = rte_mempool_avail_count(refcnt_pool)) == n) {
>>>>
>>>> That means all your mbufs (except the last one) will still be allocated.
>>>> So the test would fail - as it should, I think.
>>>>
>>>>> And
>>>>> it seems that Juhamatti & Jacod found some reverting action several
>>>>> months ago.
>>>> Didn't get that one either.
>>>> Konstantin

--
Cheers,
Jia