Date: Fri, 6 May 2022 08:41:12 -0700
From: Stephen Hemminger
To: Tyler Retzlaff
Cc: Honnappa Nagarahalli, "dev@dpdk.org", nd
Subject: Re: [RFC] rte_ring: don't use always inline
Message-ID: <20220506084112.5bcc3000@hermes.local>
In-Reply-To: <20220506072434.GA19777@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>
References: <20220505224547.394253-1-stephen@networkplumber.org>
 <20220506072434.GA19777@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>
List-Id: DPDK patches and discussions

On Fri, 6 May 2022 00:24:34 -0700
Tyler Retzlaff wrote:

> On Thu, May 05, 2022 at 10:59:32PM +0000, Honnappa Nagarahalli wrote:
> > Thanks Stephen. Do you see any performance difference with this change?
>
> as a matter of due diligence i think a comparison should be made, just
> to be confident nothing is regressing.
>
> i support this change in principle, since it is generally accepted best
> practice not to force inlining: forcing it can rule out more valuable
> optimizations that the compiler may make but the human can't see.
> the optimizations may vary depending on compiler implementation.
> force inlining should be used as a targeted measure rather than applied
> blanket to every function, and when in use it probably needs to be
> periodically reviewed and potentially removed as the code / compiler
> evolves.
>
> also, one other consideration with a particular compiler's force-inline
> intrinsic/builtin is that it may permit inlining of functions that are
> not declared in a header, i.e. a function from one library may be
> inlined into another binary as a link-time optimization. although
> everything here is in a header, so it's a bit moot.
>
> i'd like to see this change go in if possible.
>
> thanks

Some quick numbers from GCC 10.3 on a 2.7 GHz AMD CPU, running
ring_perf_autotest.

Looks like always_inline is faster on the second run, but plain inline is
slightly faster on the first run. Maybe the icache gets loaded for the
second run, while on the first pass the smaller code size helps.

With always_inline:

### Testing single element enq/deq ###
legacy APIs: SP/SC: single: 6.36
legacy APIs: MP/MC: single: 15.38
### Testing burst enq/deq ###
legacy APIs: SP/SC: burst (size: 8): 14.21
legacy APIs: SP/SC: burst (size: 32): 34.97
legacy APIs: MP/MC: burst (size: 8): 19.33
legacy APIs: MP/MC: burst (size: 32): 40.01
### Testing bulk enq/deq ###
legacy APIs: SP/SC: bulk (size: 8): 13.48
legacy APIs: SP/SC: bulk (size: 32): 34.07
legacy APIs: MP/MC: bulk (size: 8): 18.18
legacy APIs: MP/MC: bulk (size: 32): 37.32
### Testing empty bulk deq ###
legacy APIs: SP/SC: bulk (size: 8): 1.73
legacy APIs: MP/MC: bulk (size: 8): 1.74
### Testing using two hyperthreads ###
legacy APIs: SP/SC: bulk (size: 8): 4.57
legacy APIs: MP/MC: bulk (size: 8): 7.14
legacy APIs: SP/SC: bulk (size: 32): 2.14
legacy APIs: MP/MC: bulk (size: 32): 2.14
### Testing using two physical cores ###
legacy APIs: SP/SC: bulk (size: 8): 13.50
legacy APIs: MP/MC: bulk (size: 8): 41.68
legacy APIs: SP/SC: bulk (size: 32): 6.63
legacy APIs: MP/MC: bulk (size: 32): 8.75
### Testing using all worker nodes ###
Bulk enq/dequeue count on size 8
Core [0] count = 22792
Core [1] count = 22830
Core [2] count = 22896
Core [3] count = 22850
Core [4] count = 22688
Core [5] count = 22457
Core [6] count = 22815
Core [7] count = 22837
Core [8] count = 23045
Core [9] count = 23087
Core [10] count = 23066
Core [11] count = 23018
Core [12] count = 23132
Core [13] count = 23084
Core [14] count = 23216
Core [15] count = 23183
Total count (size: 8): 366996
Bulk enq/dequeue count on size 32
Core [0] count = 24069
Core [1] count = 24171
Core [2] count = 24101
Core [3] count = 24062
Core [4] count = 24078
Core [5] count = 24092
Core [6] count = 23837
Core [7] count = 24114
Core [8] count = 24189
Core [9] count = 24182
Core [10] count = 24118
Core [11] count = 24086
Core [12] count = 24182
Core [13] count = 24177
Core [14] count = 24224
Core [15] count = 24205
Total count (size: 32): 385887
### Testing single element enq/deq ###
elem APIs: element size 16B: SP/SC: single: 7.78
elem APIs: element size 16B: MP/MC: single: 18.00
### Testing burst enq/deq ###
elem APIs: element size 16B: SP/SC: burst (size: 8): 15.16
elem APIs: element size 16B: SP/SC: burst (size: 32): 46.38
elem APIs: element size 16B: MP/MC: burst (size: 8): 23.59
elem APIs: element size 16B: MP/MC: burst (size: 32): 41.65
### Testing bulk enq/deq ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 13.48
elem APIs: element size 16B: SP/SC: bulk (size: 32): 35.57
elem APIs: element size 16B: MP/MC: bulk (size: 8): 23.61
elem APIs: element size 16B: MP/MC: bulk (size: 32): 41.10
### Testing empty bulk deq ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 1.72
elem APIs: element size 16B: MP/MC: bulk (size: 8): 1.72
### Testing using two hyperthreads ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 4.51
elem APIs: element size 16B: MP/MC: bulk (size: 8): 7.16
elem APIs: element size 16B: SP/SC: bulk (size: 32): 2.91
elem APIs: element size 16B: MP/MC: bulk (size: 32): 2.98
### Testing using two physical cores ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 26.27
elem APIs: element size 16B: MP/MC: bulk (size: 8): 43.94
elem APIs: element size 16B: SP/SC: bulk (size: 32): 7.09
elem APIs: element size 16B: MP/MC: bulk (size: 32): 8.31
### Testing using all worker nodes ###
Bulk enq/dequeue count on size 8
Core [0] count = 22970
Core [1] count = 23068
Core [2] count = 22807
Core [3] count = 22823
Core [4] count = 22361
Core [5] count = 22732
Core [6] count = 22788
Core [7] count = 23005
Core [8] count = 22826
Core [9] count = 22882
Core [10] count = 22936
Core [11] count = 22971
Core [12] count = 23095
Core [13] count = 23087
Core [14] count = 23160
Core [15] count = 23155
Total count (size: 8): 366666
Bulk enq/dequeue count on size 32
Core [0] count = 22940
Core [1] count = 22964
Core [2] count = 22957
Core [3] count = 22934
Core [4] count = 22938
Core [5] count = 22826
Core [6] count = 22922
Core [7] count = 22927
Core [8] count = 23090
Core [9] count = 23042
Core [10] count = 23093
Core [11] count = 23004
Core [12] count = 22973
Core [13] count = 22947
Core [14] count = 23075
Core [15] count = 23021
Total count (size: 32): 367653
Test OK

With just inline:

### Testing single element enq/deq ###
legacy APIs: SP/SC: single: 4.61
legacy APIs: MP/MC: single: 15.15
### Testing burst enq/deq ###
legacy APIs: SP/SC: burst (size: 8): 13.20
legacy APIs: SP/SC: burst (size: 32): 33.10
legacy APIs: MP/MC: burst (size: 8): 18.06
legacy APIs: MP/MC: burst (size: 32): 37.53
### Testing bulk enq/deq ###
legacy APIs: SP/SC: bulk (size: 8): 11.62
legacy APIs: SP/SC: bulk (size: 32): 32.36
legacy APIs: MP/MC: bulk (size: 8): 18.07
legacy APIs: MP/MC: bulk (size: 32): 37.10
### Testing empty bulk deq ###
legacy APIs: SP/SC: bulk (size: 8): 1.69
legacy APIs: MP/MC: bulk (size: 8): 1.28
### Testing using two hyperthreads ###
legacy APIs: SP/SC: bulk (size: 8): 4.42
legacy APIs: MP/MC: bulk (size: 8): 7.15
legacy APIs: SP/SC: bulk (size: 32): 2.13
legacy APIs: MP/MC: bulk (size: 32): 2.12
### Testing using two physical cores ###
legacy APIs: SP/SC: bulk (size: 8): 13.59
legacy APIs: MP/MC: bulk (size: 8): 40.95
legacy APIs: SP/SC: bulk (size: 32): 6.53
legacy APIs: MP/MC: bulk (size: 32): 8.67
### Testing using all worker nodes ###
Bulk enq/dequeue count on size 8
Core [0] count = 21666
Core [1] count = 21693
Core [2] count = 21790
Core [3] count = 21706
Core [4] count = 21600
Core [5] count = 21575
Core [6] count = 21583
Core [7] count = 21600
Core [8] count = 21862
Core [9] count = 21872
Core [10] count = 21906
Core [11] count = 21938
Core [12] count = 22036
Core [13] count = 21965
Core [14] count = 21992
Core [15] count = 22052
Total count (size: 8): 348836
Bulk enq/dequeue count on size 32
Core [0] count = 23307
Core [1] count = 23352
Core [2] count = 23314
Core [3] count = 23304
Core [4] count = 23232
Core [5] count = 23244
Core [6] count = 23398
Core [7] count = 23308
Core [8] count = 23245
Core [9] count = 23278
Core [10] count = 22568
Core [11] count = 23308
Core [12] count = 23288
Core [13] count = 23262
Core [14] count = 23357
Core [15] count = 23366
Total count (size: 32): 372131
### Testing single element enq/deq ###
elem APIs: element size 16B: SP/SC: single: 9.93
elem APIs: element size 16B: MP/MC: single: 20.81
### Testing burst enq/deq ###
elem APIs: element size 16B: SP/SC: burst (size: 8): 15.50
elem APIs: element size 16B: SP/SC: burst (size: 32): 37.11
elem APIs: element size 16B: MP/MC: burst (size: 8): 25.13
elem APIs: element size 16B: MP/MC: burst (size: 32): 43.84
### Testing bulk enq/deq ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 17.50
elem APIs: element size 16B: SP/SC: bulk (size: 32): 38.51
elem APIs: element size 16B: MP/MC: bulk (size: 8): 24.98
elem APIs: element size 16B: MP/MC: bulk (size: 32): 45.97
### Testing empty bulk deq ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 2.55
elem APIs: element size 16B: MP/MC: bulk (size: 8): 2.55
### Testing using two hyperthreads ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 4.43
elem APIs: element size 16B: MP/MC: bulk (size: 8): 6.92
elem APIs: element size 16B: SP/SC: bulk (size: 32): 2.82
elem APIs: element size 16B: MP/MC: bulk (size: 32): 2.93
### Testing using two physical cores ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 26.51
elem APIs: element size 16B: MP/MC: bulk (size: 8): 42.32
elem APIs: element size 16B: SP/SC: bulk (size: 32): 6.94
elem APIs: element size 16B: MP/MC: bulk (size: 32): 8.15
### Testing using all worker nodes ###
Bulk enq/dequeue count on size 8
Core [0] count = 22850
Core [1] count = 22907
Core [2] count = 22799
Core [3] count = 22843
Core [4] count = 22293
Core [5] count = 22671
Core [6] count = 22294
Core [7] count = 22753
Core [8] count = 22878
Core [9] count = 22894
Core [10] count = 22886
Core [11] count = 22939
Core [12] count = 23076
Core [13] count = 22999
Core [14] count = 22910
Core [15] count = 22904
Total count (size: 8): 364896
Bulk enq/dequeue count on size 32
Core [0] count = 22279
Core [1] count = 22564
Core [2] count = 22659
Core [3] count = 22645
Core [4] count = 22629
Core [5] count = 22671
Core [6] count = 22721
Core [7] count = 22732
Core [8] count = 22668
Core [9] count = 22710
Core [10] count = 22691
Core [11] count = 22606
Core [12] count = 22699
Core [13] count = 22776
Core [14] count = 22792
Core [15] count = 22756
Total count (size: 32): 362598
Test OK