From: Pierre Laurent <pierre@emutex.com>
To: users@dpdk.org, webguru2688@gmail.com
Subject: Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
Date: Tue, 11 Sep 2018 20:36:42 +0100
Message-ID: <0eb1c465-4639-571e-adcc-c9125ec3c27a@emutex.com>
In-Reply-To: <20180911110744.7ef55fc2@xeon-e3>
Can I suggest a few steps for investigating further?

First, verify that the L1 cache really is the suspect. This can be done
with the perf utility and the L1-dcache-load-misses counter; perf is
part of the linux-tools packages:
$ apt-get install linux-tools-common linux-tools-generic linux-tools-`uname -r`

$ sudo perf stat -d -d -d ./build/rxtx
EAL: Detected 12 lcore(s)
....
^C

 Performance counter stats for './build/rxtx':

       1413.787490      task-clock (msec)          #    0.923 CPUs utilized
                18      context-switches           #    0.013 K/sec
                 4      cpu-migrations             #    0.003 K/sec
               238      page-faults                #    0.168 K/sec
     4,436,904,124      cycles                     #    3.138 GHz                       (32.67%)
     3,888,094,815      stalled-cycles-frontend    #   87.63% frontend cycles idle      (32.94%)
       237,378,065      instructions               #    0.05  insn per cycle
                                                   #   16.38  stalled cycles per insn   (39.73%)
        76,863,834      branches                   #   54.367 M/sec                     (40.01%)
           101,550      branch-misses              #    0.13% of all branches           (40.30%)
        94,805,298      L1-dcache-loads            #   67.058 M/sec                     (39.77%)
       263,530,291      L1-dcache-load-misses      #  277.97% of all L1-dcache hits     (13.77%)
           425,934      LLC-loads                  #    0.301 M/sec                     (13.60%)
           181,295      LLC-load-misses            #   42.56% of all LL-cache hits      (20.21%)
   <not supported>      L1-icache-loads
           775,365      L1-icache-load-misses                                           (26.71%)
        70,580,827      dTLB-loads                 #   49.923 M/sec                     (25.46%)
             2,474      dTLB-load-misses           #    0.00% of all dTLB cache hits    (13.01%)
               277      iTLB-loads                 #    0.196 K/sec                     (13.01%)
               994      iTLB-load-misses           #  358.84% of all iTLB cache hits    (19.52%)
   <not supported>      L1-dcache-prefetches
             7,204      L1-dcache-prefetch-misses  #    0.005 M/sec                     (26.03%)

       1.531809863 seconds time elapsed
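If you want to keep an eye on just the suspect counters while you
experiment (for example while you vary burst or queue sizes), perf can
also be restricted to the cache events. The exact event names available
depend on your kernel and PMU:

$ sudo perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./build/rxtx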
One common mistake is excessively large tx and rx queues, which in turn
invite excessively large bursts. Your L1 cache is 32K, that is, 512
cache lines. L1 cache is not elastic, and 512 cache lines is not
much..... If the bursts you process happen to be more than roughly 128
buffers, you will be thrashing the cache while running your loop. I
notice that you use a pool of 8192 of your buffers; if you use them
round-robin, that is a perfect recipe for cache thrashing. If so, then
prefetch would help.
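If you do go the prefetch route, the usual pattern is to prefetch a few
buffers ahead of the one you are currently processing, so the load has
time to complete before you touch the data. A rough sketch, assuming an
array of pointers my_packet[] to your structures (adapted from your
snippet, together with rx_table, val[] and pkt_count from the thread)
and a PREFETCH_OFFSET you would tune experimentally:

```
#include <rte_prefetch.h>
#include <rte_hash.h>

#define PREFETCH_OFFSET 4   /* tune experimentally; too large just evicts other data */

/* warm up: issue the first few prefetches before entering the main loop */
for (i = 0; i < PREFETCH_OFFSET && i < pkt_count; i++)
        rte_prefetch0(&my_packet[i]->tag1);

for (i = 0; i < pkt_count; i++) {
        /* prefetch the buffer we will need a few iterations from now */
        if (i + PREFETCH_OFFSET < pkt_count)
                rte_prefetch0(&my_packet[i + PREFETCH_OFFSET]->tag1);

        if (rte_hash_lookup_data(rx_table, &my_packet[i]->tag1,
                                 (void **)&val[i]) < 0) {
                /* lookup miss */
        }
}
```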
rte_hash_lookup touches cache lines too (at least 3 per successful
lookup). If you always use the same key, rte_hash_lookup keeps hitting
the same cache lines; if your keys are randomly distributed, that is
another recipe for cache thrashing.
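One mitigation worth trying is the bulk lookup API, which lets the hash
library pipeline its own internal prefetches across the whole burst
instead of taking the misses one lookup at a time. A sketch, assuming
your burst size stays at or below RTE_HASH_LOOKUP_BULK_MAX (64) and that
val[] is the array of data pointers from your snippet:

```
#include <rte_hash.h>

const void *keys[RTE_HASH_LOOKUP_BULK_MAX];
uint64_t hit_mask = 0;
int hits;

for (i = 0; i < pkt_count; i++)
        keys[i] = &my_packet[i]->tag1;

/* one call per burst; the library batches and prefetches internally */
hits = rte_hash_lookup_bulk_data(rx_table, keys, pkt_count,
                                 &hit_mask, (void **)val);
if (hits < 0) {
        /* parameter error */
}

for (i = 0; i < pkt_count; i++) {
        if (!(hit_mask & (1ULL << i))) {
                /* key for packet i was not found */
        }
}
```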
It is not clear from your description whether the core that reads the
bursts from the DPDK PMD is the same core that does the processing. If
one core touches your buffers (e.g. tag1) and you then pass the buffers
to another core, you incur LLC coherency overhead, which would also show
up as LLC-load-misses (visible in the perf output above).
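If rx and processing currently sit on different cores, it may be worth
testing a run-to-completion variant, where the lcore that polls the PMD
also does the lookup so the buffers stay warm in that core's private
caches. A minimal sketch with purely illustrative port/queue ids and
burst size (real initialization omitted):

```
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_lcore.h>

/* run-to-completion: receive and process on the same lcore */
static int
lcore_rx_and_process(__rte_unused void *arg)
{
        struct rte_mbuf *bufs[32];
        uint16_t i, nb;

        for (;;) {
                nb = rte_eth_rx_burst(0 /* port */, 0 /* queue */, bufs, 32);
                for (i = 0; i < nb; i++) {
                        /* ... tag extraction and rte_hash lookup here ... */
                        rte_pktmbuf_free(bufs[i]);
                }
        }
        return 0;
}

/* launched from the main lcore, e.g.:
 *     rte_eal_remote_launch(lcore_rx_and_process, NULL, 2);
 */
```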
It seems you have this type of processor (codename Sandy Bridge, 6
cores, hyperthreading enabled):
https://ark.intel.com/products/64594/Intel-Xeon-Processor-E5-2620-15M-Cache-2_00-GHz-7_20-GTs-Intel-QPI
Can you double-check that your application runs with the right core
assignment? Since hyperthreading is enabled, you should not use CPU 0
(plenty of Linux kernel functions run on core 0) nor CPU 6 (which is the
same physical core as CPU 0), and make sure the hyperthread sibling of
the core you are running on is not used either. You can get the
CPU <-> core assignment with the lscpu tool:
$ lscpu -p
# The following is the parsable format, which can be fed to other
# programs. Each different item in every column has an unique ID
# starting from zero.
# CPU,Core,Socket,Node,,L1d,L1i,L2,L3
0,0,0,0,,0,0,0,0
1,1,0,0,,1,1,1,0
2,2,0,0,,2,2,2,0
3,3,0,0,,3,3,3,0
4,4,0,0,,4,4,4,0
5,5,0,0,,5,5,5,0
6,0,0,0,,0,0,0,0
7,1,0,0,,1,1,1,0
8,2,0,0,,2,2,2,0
9,3,0,0,,3,3,3,0
10,4,0,0,,4,4,4,0
11,5,0,0,,5,5,5,0
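With the layout above, for example, a single-core test could be pinned
to physical core 2, leaving CPU 0 and its sibling CPU 6 to the kernel
and making sure nothing else is scheduled on CPU 8 (the sibling of CPU
2). The EAL core-list option would then look like this; your
application's own options are a placeholder here and go after the "--":

$ sudo ./build/rxtx -l 2 -- <application options>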
If you do not need hyperthreading, and L1 cache is your bottleneck, you
might disable hyperthreading so that each core's L1 cache serves a
single hardware thread instead of being shared between two. If you
really need hyperthreading, then use less cache in your code by tuning
the buffer pool and queue sizes down.
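By "tuning down" I mean smaller rx/tx descriptor rings and a buffer pool
sized to what is actually in flight, rather than a large pool consumed
round-robin. The numbers below are only illustrative, and the function
name and port id are assumptions to adapt to your NIC and traffic:

```
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define NB_MBUF    2048   /* instead of 8192, if your in-flight count allows it */
#define MBUF_CACHE  256   /* per-lcore mempool cache */
#define RX_DESC     256   /* smaller rings discourage very large bursts */
#define TX_DESC     256

static struct rte_mempool *
setup_small_footprint(uint16_t port)
{
        struct rte_mempool *pool;

        pool = rte_pktmbuf_pool_create("mbuf_pool", NB_MBUF, MBUF_CACHE, 0,
                                       RTE_MBUF_DEFAULT_BUF_SIZE,
                                       rte_socket_id());
        if (pool == NULL)
                return NULL;

        /* one rx and one tx queue with the smaller ring sizes */
        rte_eth_rx_queue_setup(port, 0, RX_DESC,
                               rte_eth_dev_socket_id(port), NULL, pool);
        rte_eth_tx_queue_setup(port, 0, TX_DESC,
                               rte_eth_dev_socket_id(port), NULL);
        return pool;
}
```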
SW prefetch is quite difficult to use efficiently. There are 4 different
hardware prefetchers with different algorithms (adjacent cache lines,
stride access, ...) that make prefetch instructions unnecessary for many
access patterns, and there is a hardware limit of about 8 pending L1
data cache misses (sometimes documented as 5, sometimes as 10). This
puts a serious software-complexity burden on code that tries to abide by
the hardware's rules.
https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors
Just verify that the hardware prefetchers are all enabled through MSR
0x1A4; some BIOSes set them up differently.
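One way to check this is with the msr-tools package (rdmsr reads the MSR
on every CPU with -a). Per the article above, a value of 0 means all
four prefetchers are enabled; each set bit in the low nibble disables
one of them:

$ sudo modprobe msr
$ sudo rdmsr -a 0x1a4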
On 11/09/18 19:07, Stephen Hemminger wrote:
> On Tue, 11 Sep 2018 12:18:42 -0500
> Arvind Narayanan <webguru2688@gmail.com> wrote:
>
>> If I don't do any processing, I easily get 10G. It is only when I access
>> the tag when the throughput drops.
>> What confuses me is if I use the following snippet, it works at line rate.
>>
>> ```
>> int temp_key = 1; // declared outside of the for loop
>>
>> for (i = 0; i < pkt_count; i++) {
>>     if (rte_hash_lookup_data(rx_table, &(temp_key), (void **)&val[i]) < 0) {
>>     }
>> }
>> ```
>>
>> But as soon as I replace `temp_key` with `my_packet->tag1`, I experience
>> fall in throughput (which in a way confirms the issue is due to cache
>> misses).
> Your packet data is not in cache.
> Doing prefetch can help but it is very timing sensitive. If prefetch is done
> before data is available it won't help. And if prefetch is done just before
> data is used then there isn't enough cycles to get it from memory to the cache.
>
>