DPDK patches and discussions
From: Stephen Hemminger <stephen@networkplumber.org>
To: Tyler Retzlaff <roretzla@linux.microsoft.com>
Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>,
	"dev@dpdk.org" <dev@dpdk.org>, nd <nd@arm.com>
Subject: Re: [RFC] rte_ring: don't use always inline
Date: Fri, 6 May 2022 08:41:12 -0700	[thread overview]
Message-ID: <20220506084112.5bcc3000@hermes.local> (raw)
In-Reply-To: <20220506072434.GA19777@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On Fri, 6 May 2022 00:24:34 -0700
Tyler Retzlaff <roretzla@linux.microsoft.com> wrote:

> On Thu, May 05, 2022 at 10:59:32PM +0000, Honnappa Nagarahalli wrote:
> > Thanks Stephen. Do you see any performance difference with this change?  
> 
> As a matter of due diligence, I think a comparison should be made just
> to be confident nothing is regressing.
> 
> I support this change in principle, since it is generally accepted best
> practice not to force inlining: doing so can suppress more valuable
> optimizations that the compiler could make but that a human can't see,
> and those optimizations vary by compiler implementation.
> 
> Force inlining should be used as a targeted measure rather than applied
> blanket across every function, and where it is used it probably needs
> to be periodically reviewed and potentially removed as the code and
> compilers evolve.
> 
> One other consideration: a particular compiler's force-inlining
> intrinsic/builtin may permit inlining of functions that are not
> declared in a header, i.e. a function from one library may be inlined
> into another binary as a link-time optimization. Although since
> everything here is in a header, that point is moot.
> 
> I'd like to see this change go in if possible.
> 
> thanks
> 
> 

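For context, the trade-off under discussion can be sketched roughly as below. This is a hypothetical illustration, not the actual rte_ring code: my_always_inline, ring_mask, and ring_mask_hint are made-up names, and the attribute shown is what DPDK's __rte_always_inline macro expands to on GCC/Clang.

```c
#include <stdint.h>

/* Hypothetical stand-in for __rte_always_inline on GCC/Clang: the
 * attribute tells the compiler to inline at every call site regardless
 * of its cost model, growing code size at each caller. */
#define my_always_inline static inline __attribute__((always_inline))

/* Force-inlined variant. */
my_always_inline uint32_t
ring_mask(uint32_t head, uint32_t size)
{
	return head & (size - 1); /* size assumed to be a power of two */
}

/* Plain inline: only a hint; the compiler's inliner decides per call
 * site, which is the behavior the RFC argues is usually the better
 * default. */
static inline uint32_t
ring_mask_hint(uint32_t head, uint32_t size)
{
	return head & (size - 1);
}
```

Both variants compute the same result; the attribute only changes how the compiler is allowed to place the code, which is why the measurable difference shows up as code size and icache behavior rather than changed semantics.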

Some quick numbers from GCC 10.3 on a 2.7 GHz AMD CPU, running ring_perf_autotest.

It looks like always_inline is faster on the second run, but plain inline is
slightly faster on the first run. Perhaps the icache is already warm on the
second run, while on the first pass the smaller code size helps.

With always_inline:
### Testing single element enq/deq ###
legacy APIs: SP/SC: single: 6.36
legacy APIs: MP/MC: single: 15.38

### Testing burst enq/deq ###
legacy APIs: SP/SC: burst (size: 8): 14.21
legacy APIs: SP/SC: burst (size: 32): 34.97
legacy APIs: MP/MC: burst (size: 8): 19.33
legacy APIs: MP/MC: burst (size: 32): 40.01

### Testing bulk enq/deq ###
legacy APIs: SP/SC: bulk (size: 8): 13.48
legacy APIs: SP/SC: bulk (size: 32): 34.07
legacy APIs: MP/MC: bulk (size: 8): 18.18
legacy APIs: MP/MC: bulk (size: 32): 37.32

### Testing empty bulk deq ###
legacy APIs: SP/SC: bulk (size: 8): 1.73
legacy APIs: MP/MC: bulk (size: 8): 1.74

### Testing using two hyperthreads ###
legacy APIs: SP/SC: bulk (size: 8): 4.57
legacy APIs: MP/MC: bulk (size: 8): 7.14
legacy APIs: SP/SC: bulk (size: 32): 2.14
legacy APIs: MP/MC: bulk (size: 32): 2.14

### Testing using two physical cores ###
legacy APIs: SP/SC: bulk (size: 8): 13.50
legacy APIs: MP/MC: bulk (size: 8): 41.68
legacy APIs: SP/SC: bulk (size: 32): 6.63
legacy APIs: MP/MC: bulk (size: 32): 8.75

### Testing using all worker nodes ###

Bulk enq/dequeue count on size 8
Core [0] count = 22792
Core [1] count = 22830
Core [2] count = 22896
Core [3] count = 22850
Core [4] count = 22688
Core [5] count = 22457
Core [6] count = 22815
Core [7] count = 22837
Core [8] count = 23045
Core [9] count = 23087
Core [10] count = 23066
Core [11] count = 23018
Core [12] count = 23132
Core [13] count = 23084
Core [14] count = 23216
Core [15] count = 23183
Total count (size: 8): 366996

Bulk enq/dequeue count on size 32
Core [0] count = 24069
Core [1] count = 24171
Core [2] count = 24101
Core [3] count = 24062
Core [4] count = 24078
Core [5] count = 24092
Core [6] count = 23837
Core [7] count = 24114
Core [8] count = 24189
Core [9] count = 24182
Core [10] count = 24118
Core [11] count = 24086
Core [12] count = 24182
Core [13] count = 24177
Core [14] count = 24224
Core [15] count = 24205
Total count (size: 32): 385887

### Testing single element enq/deq ###
elem APIs: element size 16B: SP/SC: single: 7.78
elem APIs: element size 16B: MP/MC: single: 18.00

### Testing burst enq/deq ###
elem APIs: element size 16B: SP/SC: burst (size: 8): 15.16
elem APIs: element size 16B: SP/SC: burst (size: 32): 46.38
elem APIs: element size 16B: MP/MC: burst (size: 8): 23.59
elem APIs: element size 16B: MP/MC: burst (size: 32): 41.65

### Testing bulk enq/deq ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 13.48
elem APIs: element size 16B: SP/SC: bulk (size: 32): 35.57
elem APIs: element size 16B: MP/MC: bulk (size: 8): 23.61
elem APIs: element size 16B: MP/MC: bulk (size: 32): 41.10

### Testing empty bulk deq ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 1.72
elem APIs: element size 16B: MP/MC: bulk (size: 8): 1.72

### Testing using two hyperthreads ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 4.51
elem APIs: element size 16B: MP/MC: bulk (size: 8): 7.16
elem APIs: element size 16B: SP/SC: bulk (size: 32): 2.91
elem APIs: element size 16B: MP/MC: bulk (size: 32): 2.98

### Testing using two physical cores ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 26.27
elem APIs: element size 16B: MP/MC: bulk (size: 8): 43.94
elem APIs: element size 16B: SP/SC: bulk (size: 32): 7.09
elem APIs: element size 16B: MP/MC: bulk (size: 32): 8.31

### Testing using all worker nodes ###

Bulk enq/dequeue count on size 8
Core [0] count = 22970
Core [1] count = 23068
Core [2] count = 22807
Core [3] count = 22823
Core [4] count = 22361
Core [5] count = 22732
Core [6] count = 22788
Core [7] count = 23005
Core [8] count = 22826
Core [9] count = 22882
Core [10] count = 22936
Core [11] count = 22971
Core [12] count = 23095
Core [13] count = 23087
Core [14] count = 23160
Core [15] count = 23155
Total count (size: 8): 366666

Bulk enq/dequeue count on size 32
Core [0] count = 22940
Core [1] count = 22964
Core [2] count = 22957
Core [3] count = 22934
Core [4] count = 22938
Core [5] count = 22826
Core [6] count = 22922
Core [7] count = 22927
Core [8] count = 23090
Core [9] count = 23042
Core [10] count = 23093
Core [11] count = 23004
Core [12] count = 22973
Core [13] count = 22947
Core [14] count = 23075
Core [15] count = 23021
Total count (size: 32): 367653
Test OK

With just inline:
### Testing single element enq/deq ###
legacy APIs: SP/SC: single: 4.61
legacy APIs: MP/MC: single: 15.15

### Testing burst enq/deq ###
legacy APIs: SP/SC: burst (size: 8): 13.20
legacy APIs: SP/SC: burst (size: 32): 33.10
legacy APIs: MP/MC: burst (size: 8): 18.06
legacy APIs: MP/MC: burst (size: 32): 37.53

### Testing bulk enq/deq ###
legacy APIs: SP/SC: bulk (size: 8): 11.62
legacy APIs: SP/SC: bulk (size: 32): 32.36
legacy APIs: MP/MC: bulk (size: 8): 18.07
legacy APIs: MP/MC: bulk (size: 32): 37.10

### Testing empty bulk deq ###
legacy APIs: SP/SC: bulk (size: 8): 1.69
legacy APIs: MP/MC: bulk (size: 8): 1.28

### Testing using two hyperthreads ###
legacy APIs: SP/SC: bulk (size: 8): 4.42
legacy APIs: MP/MC: bulk (size: 8): 7.15
legacy APIs: SP/SC: bulk (size: 32): 2.13
legacy APIs: MP/MC: bulk (size: 32): 2.12

### Testing using two physical cores ###
legacy APIs: SP/SC: bulk (size: 8): 13.59
legacy APIs: MP/MC: bulk (size: 8): 40.95
legacy APIs: SP/SC: bulk (size: 32): 6.53
legacy APIs: MP/MC: bulk (size: 32): 8.67

### Testing using all worker nodes ###

Bulk enq/dequeue count on size 8
Core [0] count = 21666
Core [1] count = 21693
Core [2] count = 21790
Core [3] count = 21706
Core [4] count = 21600
Core [5] count = 21575
Core [6] count = 21583
Core [7] count = 21600
Core [8] count = 21862
Core [9] count = 21872
Core [10] count = 21906
Core [11] count = 21938
Core [12] count = 22036
Core [13] count = 21965
Core [14] count = 21992
Core [15] count = 22052
Total count (size: 8): 348836

Bulk enq/dequeue count on size 32
Core [0] count = 23307
Core [1] count = 23352
Core [2] count = 23314
Core [3] count = 23304
Core [4] count = 23232
Core [5] count = 23244
Core [6] count = 23398
Core [7] count = 23308
Core [8] count = 23245
Core [9] count = 23278
Core [10] count = 22568
Core [11] count = 23308
Core [12] count = 23288
Core [13] count = 23262
Core [14] count = 23357
Core [15] count = 23366
Total count (size: 32): 372131

### Testing single element enq/deq ###
elem APIs: element size 16B: SP/SC: single: 9.93
elem APIs: element size 16B: MP/MC: single: 20.81

### Testing burst enq/deq ###
elem APIs: element size 16B: SP/SC: burst (size: 8): 15.50
elem APIs: element size 16B: SP/SC: burst (size: 32): 37.11
elem APIs: element size 16B: MP/MC: burst (size: 8): 25.13
elem APIs: element size 16B: MP/MC: burst (size: 32): 43.84

### Testing bulk enq/deq ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 17.50
elem APIs: element size 16B: SP/SC: bulk (size: 32): 38.51
elem APIs: element size 16B: MP/MC: bulk (size: 8): 24.98
elem APIs: element size 16B: MP/MC: bulk (size: 32): 45.97

### Testing empty bulk deq ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 2.55
elem APIs: element size 16B: MP/MC: bulk (size: 8): 2.55

### Testing using two hyperthreads ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 4.43
elem APIs: element size 16B: MP/MC: bulk (size: 8): 6.92
elem APIs: element size 16B: SP/SC: bulk (size: 32): 2.82
elem APIs: element size 16B: MP/MC: bulk (size: 32): 2.93

### Testing using two physical cores ###
elem APIs: element size 16B: SP/SC: bulk (size: 8): 26.51
elem APIs: element size 16B: MP/MC: bulk (size: 8): 42.32
elem APIs: element size 16B: SP/SC: bulk (size: 32): 6.94
elem APIs: element size 16B: MP/MC: bulk (size: 32): 8.15

### Testing using all worker nodes ###

Bulk enq/dequeue count on size 8
Core [0] count = 22850
Core [1] count = 22907
Core [2] count = 22799
Core [3] count = 22843
Core [4] count = 22293
Core [5] count = 22671
Core [6] count = 22294
Core [7] count = 22753
Core [8] count = 22878
Core [9] count = 22894
Core [10] count = 22886
Core [11] count = 22939
Core [12] count = 23076
Core [13] count = 22999
Core [14] count = 22910
Core [15] count = 22904
Total count (size: 8): 364896

Bulk enq/dequeue count on size 32
Core [0] count = 22279
Core [1] count = 22564
Core [2] count = 22659
Core [3] count = 22645
Core [4] count = 22629
Core [5] count = 22671
Core [6] count = 22721
Core [7] count = 22732
Core [8] count = 22668
Core [9] count = 22710
Core [10] count = 22691
Core [11] count = 22606
Core [12] count = 22699
Core [13] count = 22776
Core [14] count = 22792
Core [15] count = 22756
Total count (size: 32): 362598
Test OK



Thread overview: 13+ messages
2022-05-05 22:45 Stephen Hemminger
2022-05-05 22:59 ` Honnappa Nagarahalli
2022-05-05 23:10   ` Stephen Hemminger
2022-05-05 23:16     ` Stephen Hemminger
2022-05-06  1:37     ` Honnappa Nagarahalli
2022-05-06  7:24   ` Tyler Retzlaff
2022-05-06 15:12     ` Honnappa Nagarahalli
2022-05-06 15:28       ` Bruce Richardson
2022-05-06 16:33         ` Stephen Hemminger
2022-05-06 16:39           ` Bruce Richardson
2022-05-06 17:48             ` Konstantin Ananyev
2022-05-06 15:41     ` Stephen Hemminger [this message]
2022-05-06 16:38       ` Bruce Richardson
