From: Pierre Laurent <pierre@emutex.com>
To: users@dpdk.org, webguru2688@gmail.com
Subject: Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path
Date: Tue, 11 Sep 2018 20:36:42 +0100
Message-ID: <0eb1c465-4639-571e-adcc-c9125ec3c27a@emutex.com>
In-Reply-To: <20180911110744.7ef55fc2@xeon-e3>
Can I suggest a few steps for investigating further?

First, verify that the L1 cache really is the suspect. This can be done
with the perf utility and the L1-dcache-load-misses counter; perf is
part of the linux-tools packages:
$ apt-get install linux-tools-common linux-tools-generic linux-tools-`uname -r`

$ sudo perf stat -d -d -d ./build/rxtx
EAL: Detected 12 lcore(s)
....
^C

 Performance counter stats for './build/rxtx':

       1413.787490      task-clock (msec)          #    0.923 CPUs utilized
                18      context-switches           #    0.013 K/sec
                 4      cpu-migrations             #    0.003 K/sec
               238      page-faults                #    0.168 K/sec
     4,436,904,124      cycles                     #    3.138 GHz                       (32.67%)
     3,888,094,815      stalled-cycles-frontend    #   87.63% frontend cycles idle      (32.94%)
       237,378,065      instructions               #    0.05  insn per cycle
                                                   #   16.38  stalled cycles per insn   (39.73%)
        76,863,834      branches                   #   54.367 M/sec                     (40.01%)
           101,550      branch-misses              #    0.13% of all branches           (40.30%)
        94,805,298      L1-dcache-loads            #   67.058 M/sec                     (39.77%)
       263,530,291      L1-dcache-load-misses      #  277.97% of all L1-dcache hits     (13.77%)
           425,934      LLC-loads                  #    0.301 M/sec                     (13.60%)
           181,295      LLC-load-misses            #   42.56% of all LL-cache hits      (20.21%)
   <not supported>      L1-icache-loads
           775,365      L1-icache-load-misses                                           (26.71%)
        70,580,827      dTLB-loads                 #   49.923 M/sec                     (25.46%)
             2,474      dTLB-load-misses           #    0.00% of all dTLB cache hits    (13.01%)
               277      iTLB-loads                 #    0.196 K/sec                     (13.01%)
               994      iTLB-load-misses           #  358.84% of all iTLB cache hits    (19.52%)
   <not supported>      L1-dcache-prefetches
             7,204      L1-dcache-prefetch-misses  #    0.005 M/sec                     (26.03%)

       1.531809863 seconds time elapsed
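If you want to keep an eye on just the suspect counters while you
experiment (for example while you vary burst or queue sizes), perf can
also be restricted to the cache events. The exact event names available
depend on your kernel and PMU:

$ sudo perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./build/rxtx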
One common mistake is excessively large tx and rx queues, which in turn
invite excessively large bursts. Your L1 cache is 32K, that is, 512
cache lines. L1 cache is not elastic, and 512 cache lines is not
much..... If the bursts you process happen to be more than roughly 128
buffers, you will be thrashing the cache while running your loop. I
notice that you use a pool of 8192 of your buffers; if you use them
round-robin, that is a perfect recipe for cache thrashing. If so, then
prefetch would help.
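If you do go the prefetch route, the usual pattern is to prefetch a few
buffers ahead of the one you are currently processing, so the load has
time to complete before you touch the data. A rough sketch, assuming an
array of pointers my_packet[] to your structures (adapted from your
snippet, together with rx_table, val[] and pkt_count from the thread)
and a PREFETCH_OFFSET you would tune experimentally:

```
#include <rte_prefetch.h>
#include <rte_hash.h>

#define PREFETCH_OFFSET 4   /* tune experimentally; too large just evicts other data */

/* warm up: issue the first few prefetches before entering the main loop */
for (i = 0; i < PREFETCH_OFFSET && i < pkt_count; i++)
        rte_prefetch0(&my_packet[i]->tag1);

for (i = 0; i < pkt_count; i++) {
        /* prefetch the buffer we will need a few iterations from now */
        if (i + PREFETCH_OFFSET < pkt_count)
                rte_prefetch0(&my_packet[i + PREFETCH_OFFSET]->tag1);

        if (rte_hash_lookup_data(rx_table, &my_packet[i]->tag1,
                                 (void **)&val[i]) < 0) {
                /* lookup miss */
        }
}
```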
rte_hash_lookup touches cache lines too (at least 3 per successful
lookup). If you always use the same key, rte_hash_lookup keeps hitting
the same cache lines; if your keys are randomly distributed, that is
another recipe for cache thrashing.
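One mitigation worth trying is the bulk lookup API, which lets the hash
library pipeline its own internal prefetches across the whole burst
instead of taking the misses one lookup at a time. A sketch, assuming
your burst size stays at or below RTE_HASH_LOOKUP_BULK_MAX (64) and that
val[] is the array of data pointers from your snippet:

```
#include <rte_hash.h>

const void *keys[RTE_HASH_LOOKUP_BULK_MAX];
uint64_t hit_mask = 0;
int hits;

for (i = 0; i < pkt_count; i++)
        keys[i] = &my_packet[i]->tag1;

/* one call per burst; the library batches and prefetches internally */
hits = rte_hash_lookup_bulk_data(rx_table, keys, pkt_count,
                                 &hit_mask, (void **)val);
if (hits < 0) {
        /* parameter error */
}

for (i = 0; i < pkt_count; i++) {
        if (!(hit_mask & (1ULL << i))) {
                /* key for packet i was not found */
        }
}
```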
It is not clear from your description whether the core that reads the
bursts from the DPDK PMD is the same core that does the processing. If
one core touches your buffers (e.g. tag1) and you then pass the buffers
to another core, you incur LLC coherency overhead, which would also show
up as LLC-load-misses (visible in the perf output above).
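If rx and processing currently sit on different cores, it may be worth
testing a run-to-completion variant, where the lcore that polls the PMD
also does the lookup so the buffers stay warm in that core's private
caches. A minimal sketch with purely illustrative port/queue ids and
burst size (real initialization omitted):

```
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_lcore.h>

/* run-to-completion: receive and process on the same lcore */
static int
lcore_rx_and_process(__rte_unused void *arg)
{
        struct rte_mbuf *bufs[32];
        uint16_t i, nb;

        for (;;) {
                nb = rte_eth_rx_burst(0 /* port */, 0 /* queue */, bufs, 32);
                for (i = 0; i < nb; i++) {
                        /* ... tag extraction and rte_hash lookup here ... */
                        rte_pktmbuf_free(bufs[i]);
                }
        }
        return 0;
}

/* launched from the main lcore, e.g.:
 *     rte_eal_remote_launch(lcore_rx_and_process, NULL, 2);
 */
```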
It seems you have this type of processor (codename Sandy Bridge, 6
cores, hyperthreading enabled):
https://ark.intel.com/products/64594/Intel-Xeon-Processor-E5-2620-15M-Cache-2_00-GHz-7_20-GTs-Intel-QPI
Can you double-check that your application runs with the right core
assignment? Since hyperthreading is enabled, you should not use CPU 0
(plenty of Linux kernel functions run on core 0) nor CPU 6 (which is the
same physical core as CPU 0), and make sure the hyperthread sibling of
the core you are running on is not used either. You can get the
CPU <-> core assignment with the lscpu tool:
$ lscpu -p
# The following is the parsable format, which can be fed to other
# programs. Each different item in every column has an unique ID
# starting from zero.
# CPU,Core,Socket,Node,,L1d,L1i,L2,L3
0,0,0,0,,0,0,0,0
1,1,0,0,,1,1,1,0
2,2,0,0,,2,2,2,0
3,3,0,0,,3,3,3,0
4,4,0,0,,4,4,4,0
5,5,0,0,,5,5,5,0
6,0,0,0,,0,0,0,0
7,1,0,0,,1,1,1,0
8,2,0,0,,2,2,2,0
9,3,0,0,,3,3,3,0
10,4,0,0,,4,4,4,0
11,5,0,0,,5,5,5,0
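With the layout above, for example, a single-core test could be pinned
to physical core 2, leaving CPU 0 and its sibling CPU 6 to the kernel
and making sure nothing else is scheduled on CPU 8 (the sibling of CPU
2). The EAL core-list option would then look like this; your
application's own options are a placeholder here and go after the "--":

$ sudo ./build/rxtx -l 2 -- <application options>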
If you do not need hyperthreading, and L1 cache is your bottleneck, you
might disable hyperthreading so that each core's L1 cache serves a
single hardware thread instead of being shared between two. If you
really need hyperthreading, then use less cache in your code by tuning
the buffer pool and queue sizes down.
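By "tuning down" I mean smaller rx/tx descriptor rings and a buffer pool
sized to what is actually in flight, rather than a large pool consumed
round-robin. The numbers below are only illustrative, and the function
name and port id are assumptions to adapt to your NIC and traffic:

```
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define NB_MBUF    2048   /* instead of 8192, if your in-flight count allows it */
#define MBUF_CACHE  256   /* per-lcore mempool cache */
#define RX_DESC     256   /* smaller rings discourage very large bursts */
#define TX_DESC     256

static struct rte_mempool *
setup_small_footprint(uint16_t port)
{
        struct rte_mempool *pool;

        pool = rte_pktmbuf_pool_create("mbuf_pool", NB_MBUF, MBUF_CACHE, 0,
                                       RTE_MBUF_DEFAULT_BUF_SIZE,
                                       rte_socket_id());
        if (pool == NULL)
                return NULL;

        /* one rx and one tx queue with the smaller ring sizes */
        rte_eth_rx_queue_setup(port, 0, RX_DESC,
                               rte_eth_dev_socket_id(port), NULL, pool);
        rte_eth_tx_queue_setup(port, 0, TX_DESC,
                               rte_eth_dev_socket_id(port), NULL);
        return pool;
}
```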
SW prefetch is quite difficult to use efficiently. There are 4 different
hardware prefetchers with different algorithms (adjacent cache lines,
stride access, ...) that make prefetch instructions unnecessary for many
access patterns, and there is a hardware limit of about 8 pending L1
data cache misses (sometimes documented as 5, sometimes as 10). This
puts a serious software-complexity burden on code that tries to abide by
the hardware's rules.
https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors
Just verify that the hardware prefetchers are all enabled through MSR
0x1A4; some BIOSes set them up differently.
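One way to check this is with the msr-tools package (rdmsr reads the MSR
on every CPU with -a). Per the article above, a value of 0 means all
four prefetchers are enabled; each set bit in the low nibble disables
one of them:

$ sudo modprobe msr
$ sudo rdmsr -a 0x1a4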
On 11/09/18 19:07, Stephen Hemminger wrote:
> On Tue, 11 Sep 2018 12:18:42 -0500
> Arvind Narayanan <webguru2688@gmail.com> wrote:
>
>> If I don't do any processing, I easily get 10G. It is only when I access
>> the tag when the throughput drops.
>> What confuses me is if I use the following snippet, it works at line rate.
>>
>> ```
>> int temp_key = 1; // declared outside of the for loop
>>
>> for (i = 0; i < pkt_count; i++) {
>>     if (rte_hash_lookup_data(rx_table, &(temp_key), (void **)&val[i]) < 0) {
>>     }
>> }
>> ```
>>
>> But as soon as I replace `temp_key` with `my_packet->tag1`, I experience
>> fall in throughput (which in a way confirms the issue is due to cache
>> misses).
> Your packet data is not in cache.
> Doing prefetch can help but it is very timing sensitive. If prefetch is done
> before data is available it won't help. And if prefetch is done just before
> data is used then there isn't enough cycles to get it from memory to the cache.
>
>