From: Pierre Laurent
To: users@dpdk.org, webguru2688@gmail.com
Date: Tue, 11 Sep 2018 20:36:42 +0100
Subject: Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path

Can I suggest a few steps for investigating more?

First, verify that the L1 cache really is the suspect. This can be done simply with the perf utility and the L1-dcache-load-misses counter. The simplest tool is "perf", which is part of the linux-tools packages:

$ apt-get install linux-tools-common linux-tools-generic linux-tools-`uname -r`

$ sudo perf stat -d -d -d ./build/rxtx
EAL: Detected 12 lcore(s)
....
^C
 Performance counter stats for './build/rxtx':

       1413.787490      task-clock (msec)         #    0.923 CPUs utilized
                18      context-switches          #    0.013 K/sec
                 4      cpu-migrations            #    0.003 K/sec
               238      page-faults               #    0.168 K/sec
     4,436,904,124      cycles                    #    3.138 GHz                      (32.67%)
     3,888,094,815      stalled-cycles-frontend   #   87.63% frontend cycles idle     (32.94%)
       237,378,065      instructions              #    0.05  insn per cycle
                                                  #   16.38  stalled cycles per insn  (39.73%)
        76,863,834      branches                  #   54.367 M/sec                    (40.01%)
           101,550      branch-misses             #    0.13% of all branches          (40.30%)
        94,805,298      L1-dcache-loads           #   67.058 M/sec                    (39.77%)
       263,530,291      L1-dcache-load-misses     #  277.97% of all L1-dcache hits    (13.77%)
           425,934      LLC-loads                 #    0.301 M/sec                    (13.60%)
           181,295      LLC-load-misses           #   42.56% of all LL-cache hits     (20.21%)
   <not supported>      L1-icache-loads
           775,365      L1-icache-load-misses                                         (26.71%)
        70,580,827      dTLB-loads                #   49.923 M/sec                    (25.46%)
             2,474      dTLB-load-misses          #    0.00% of all dTLB cache hits   (13.01%)
               277      iTLB-loads                #    0.196 K/sec                    (13.01%)
               994      iTLB-load-misses          #  358.84% of all iTLB cache hits   (19.52%)
   <not supported>      L1-dcache-prefetches
             7,204      L1-dcache-prefetch-misses #    0.005 M/sec                    (26.03%)

       1.531809863 seconds time elapsed

One of the common mistakes is to use excessively large tx and rx queues, which in turn encourages excessively large bursts. Your L1 cache is 32K, that is, 512 cache lines. The L1 cache is not elastic, and 512 cache lines is not much. If the bursts you process happen to be more than approximately 128 buffers, you will be thrashing the cache when running your loop. I notice that you use a pool of 8192 of your buffers; if you use them round-robin, you have a perfect recipe for cache thrashing. If so, then prefetch would help (a rough sketch of a pipelined prefetch loop is further below).

rte_hash_lookup touches cache lines too (at least 3 per successful call). If you always look up the same key, rte_hash_lookup keeps hitting the same cache lines; if your keys are randomly distributed, that is another recipe for cache thrashing.

It is not clear from your description whether the core which reads the bursts from the DPDK PMD is the same as the core which does the processing. If one core touches your buffers (e.g. tag1) and the buffer is then passed to another core, you get cache-coherency overheads, which would also show up as LLC-load-misses (visible in the perf output above).

It seems you have this type of processor (codename Sandy Bridge, 6 cores, hyperthreading enabled):
https://ark.intel.com/products/64594/Intel-Xeon-Processor-E5-2620-15M-Cache-2_00-GHz-7_20-GTs-Intel-QPI

Can you double check that your application runs with the right core assignment? Since hyperthreading is enabled, you should not use core 0 (plenty of Linux kernel functions run on core 0) nor core 6 (which is the same physical core as core 0), and make sure the hyperthread sibling of each core you are running on is not used either.
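To make the prefetch suggestion concrete, here is a rough sketch of the usual pipelined pattern. It is only an illustration: struct my_packet, tag1, rx_table and val follow the snippet quoted at the end of this mail, and PREFETCH_OFFSET is a placeholder value you would have to tune yourself.

#include <stdint.h>
#include <rte_hash.h>
#include <rte_prefetch.h>

/* Illustrative layout only -- field names follow the quoted snippet. */
struct my_packet {
        uint32_t tag1;
        /* ... your other metadata ... */
};

#define PREFETCH_OFFSET 8    /* prefetch distance in packets; tune it */

static void
process_burst(const struct rte_hash *rx_table, struct my_packet **pkts,
              uint16_t pkt_count, void **val)
{
        uint16_t i;

        /* Prime the pipeline: start loading the cache lines holding
         * tag1 for the first few buffers of the burst. */
        for (i = 0; i < PREFETCH_OFFSET && i < pkt_count; i++)
                rte_prefetch0(&pkts[i]->tag1);

        for (i = 0; i < pkt_count; i++) {
                /* Prefetch the buffer we will need PREFETCH_OFFSET
                 * iterations from now, so the load has time to finish
                 * before the data is used. */
                if (i + PREFETCH_OFFSET < pkt_count)
                        rte_prefetch0(&pkts[i + PREFETCH_OFFSET]->tag1);

                if (rte_hash_lookup_data(rx_table, &pkts[i]->tag1,
                                         (void **)&val[i]) < 0) {
                        /* handle lookup miss */
                }
        }
}

Whether this gains anything depends on how much independent work each iteration has to hide the latency behind, and on how many outstanding misses the hardware allows, which is exactly the timing issue Stephen raises below.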
Coming back to core assignment, you can get the CPU <--> core mapping with the lscpu tool:

$ lscpu -p
# The following is the parsable format, which can be fed to other
# programs. Each different item in every column has an unique ID
# starting from zero.
# CPU,Core,Socket,Node,,L1d,L1i,L2,L3
0,0,0,0,,0,0,0,0
1,1,0,0,,1,1,1,0
2,2,0,0,,2,2,2,0
3,3,0,0,,3,3,3,0
4,4,0,0,,4,4,4,0
5,5,0,0,,5,5,5,0
6,0,0,0,,0,0,0,0
7,1,0,0,,1,1,1,0
8,2,0,0,,2,2,2,0
9,3,0,0,,3,3,3,0
10,4,0,0,,4,4,4,0
11,5,0,0,,5,5,5,0

If you do not need hyperthreading, and if the L1 cache is your bottleneck, you might want to disable hyperthreading, so that the 64K bytes of L1 cache per core (32K data + 32K instruction) are no longer shared between two hardware threads. If you really need hyperthreading, then use less cache in your code by tuning the buffer pool sizes more tightly.

SW prefetch is quite difficult to use efficiently. There are 4 different hardware prefetchers with different algorithms (adjacent cache lines, stride access, ...) which often make the prefetch instruction unnecessary, and there is a hardware limit of about 8 pending L1 data cache misses (sometimes documented as 5, sometimes as 10). This puts a serious software-complexity burden on code that tries to abide by the hardware rules.
https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors

Just verify that the hardware prefetchers are all enabled through MSR 0x1A4; some BIOSes set them up differently. (A rough sketch for doing this check from C is at the end of this mail.)

On 11/09/18 19:07, Stephen Hemminger wrote:
> On Tue, 11 Sep 2018 12:18:42 -0500
> Arvind Narayanan wrote:
>
>> If I don't do any processing, I easily get 10G. It is only when I access
>> the tag when the throughput drops.
>> What confuses me is if I use the following snippet, it works at line rate.
>>
>> ```
>> int temp_key = 1; // declared outside of the for loop
>>
>> for (i = 0; i < pkt_count; i++) {
>> if (rte_hash_lookup_data(rx_table, &(temp_key), (void **)&val[i]) < 0) {
>> }
>> }
>> ```
>>
>> But as soon as I replace `temp_key` with `my_packet->tag1`, I experience
>> fall in throughput (which in a way confirms the issue is due to cache
>> misses).
> Your packet data is not in cache.
> Doing prefetch can help but it is very timing sensitive. If prefetch is done
> before data is available it won't help. And if prefetch is done just before
> data is used then there isn't enough cycles to get it from memory to the cache.
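P.S. On the MSR 0x1A4 check: the quickest way is the rdmsr utility from msr-tools. If you would rather check it from your own code, the following rough sketch reads the register for cpu0 through the msr kernel module (needs "modprobe msr" and root). The bit meanings are taken from the Intel article linked above and may not apply to every CPU model.

/* Rough sketch, not production code: read MSR 0x1A4 on cpu0.
 * Bit layout per the Intel hw-prefetcher disclosure article:
 *   bit 0: L2 hardware prefetcher disabled
 *   bit 1: L2 adjacent-cache-line prefetcher disabled
 *   bit 2: DCU (L1 data) prefetcher disabled
 *   bit 3: DCU IP prefetcher disabled
 * A value of 0 in the low 4 bits means all four prefetchers are enabled. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        uint64_t v;
        int fd = open("/dev/cpu/0/msr", O_RDONLY);

        if (fd < 0 || pread(fd, &v, sizeof(v), 0x1A4) != sizeof(v)) {
                perror("read MSR 0x1A4");
                return 1;
        }
        printf("MSR 0x1A4 = 0x%llx -> %s\n", (unsigned long long)v,
               (v & 0xf) ? "some hw prefetchers disabled"
                         : "all hw prefetchers enabled");
        close(fd);
        return 0;
}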