From: Arvind Narayanan
Date: Tue, 11 Sep 2018 16:49:07 -0500
To: pierre@emutex.com
Cc: users@dpdk.org
Subject: Re: [dpdk-users] How to use software prefetching for custom structures to increase throughput on the fast path

Stephen and Pierre, thanks!

Pierre, all points noted. As per Pierre's suggestions, I performed perf stat
on the application. Here are the results. Using the pktgen default
configuration, I send 100M packets on a 10G line.
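For context, the two reports below come from an invocation along these lines,
reconstructed from the command string in the perf headers and the perf flags
Pierre suggests later in the thread (the exact combination is an assumption):

$ sudo perf stat -d -d -d ./build/rxtx -l 1-5 --master-lcore=1 -n 4 -- -p 3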
This is when I use my_packet->tag1 for the lookup, where throughput drops to
8.4G out of 10G:

 Performance counter stats for './build/rxtx -l 1-5 --master-lcore=1 -n 4 -- -p 3':

      47453.031698      task-clock (msec)         #    1.830 CPUs utilized
                77      context-switches          #    0.002 K/sec
                 6      cpu-migrations            #    0.000 K/sec
               868      page-faults               #    0.018 K/sec
   113,357,285,372      cycles                    #    2.389 GHz                      (49.95%)
    53,324,793,523      stalled-cycles-frontend   #   47.04% frontend cycles idle     (49.95%)
    27,161,539,189      stalled-cycles-backend    #   23.96% backend cycles idle      (49.96%)
   191,560,395,309      instructions              #    1.69  insn per cycle
                                                  #    0.28  stalled cycles per insn  (56.22%)
    36,872,293,868      branches                  #  777.027 M/sec                    (56.23%)
        13,801,124      branch-misses             #    0.04% of all branches          (56.24%)
    67,524,214,383      L1-dcache-loads           # 1422.969 M/sec                    (56.24%)
     1,015,922,260      L1-dcache-load-misses     #    1.50% of all L1-dcache hits    (56.26%)
       619,670,574      LLC-loads                 #   13.059 M/sec                    (56.29%)
            82,917      LLC-load-misses           #    0.01% of all LL-cache hits     (56.31%)
                        L1-icache-loads
         2,059,915      L1-icache-load-misses                                         (56.30%)
    67,641,851,208      dTLB-loads                # 1425.448 M/sec                    (56.29%)
           151,760      dTLB-load-misses          #    0.00% of all dTLB cache hits   (50.01%)
               904      iTLB-loads                #    0.019 K/sec                    (50.01%)
            10,309      iTLB-load-misses          # 1140.38% of all iTLB cache hits   (50.00%)
                        L1-dcache-prefetches
       528,633,571      L1-dcache-prefetch-misses #   11.140 M/sec                    (49.97%)

      25.929843368 seconds time elapsed

This is when I use the temp_key approach:

 Performance counter stats for './build/rxtx -l 1-5 --master-lcore=1 -n 4 -- -p 3':

      42614.775381      task-clock (msec)         #    1.729 CPUs utilized
                71      context-switches          #    0.002 K/sec
                 6      cpu-migrations            #    0.000 K/sec
               869      page-faults               #    0.020 K/sec
    99,422,031,536      cycles                    #    2.333 GHz                      (49.89%)
    43,615,501,744      stalled-cycles-frontend   #   43.87% frontend cycles idle     (49.91%)
    21,325,495,955      stalled-cycles-backend    #   21.45% backend cycles idle      (49.95%)
   170,398,414,529      instructions              #    1.71  insn per cycle
                                                  #    0.26  stalled cycles per insn  (56.22%)
    32,543,342,205      branches                  #  763.663 M/sec                    (56.26%)
        52,276,245      branch-misses             #    0.16% of all branches          (56.30%)
    58,855,845,003      L1-dcache-loads           # 1381.114 M/sec                    (56.33%)
     1,046,059,603      L1-dcache-load-misses     #    1.78% of all L1-dcache hits    (56.34%)
       598,557,493      LLC-loads                 #   14.046 M/sec                    (56.35%)
            84,048      LLC-load-misses           #    0.01% of all LL-cache hits     (56.35%)
                        L1-icache-loads
         2,150,306      L1-icache-load-misses                                         (56.33%)
    58,942,694,476      dTLB-loads                # 1383.152 M/sec                    (56.29%)
           147,013      dTLB-load-misses          #    0.00% of all dTLB cache hits   (49.97%)
            22,392      iTLB-loads                #    0.525 K/sec                    (49.93%)
             5,839      iTLB-load-misses          #   26.08% of all iTLB cache hits   (49.90%)
                        L1-dcache-prefetches
       533,602,543      L1-dcache-prefetch-misses #   12.522 M/sec                    (49.89%)

      24.647230934 seconds time elapsed

Not sure if I am understanding it correctly, but there are a lot of
iTLB-load-misses in the lower-throughput perf stat output.

> One of the common mistakes is to have excessively large tx and rx queues,
> which in turn helps trigger excessively large bursts. Your L1 cache is 32K,
> that is, 512 cache lines. L1 cache is not elastic, and 512 cache lines is
> not much. If the bursts you are processing happen to be more than
> approximately 128 buffers, then you will be thrashing the cache when running
> your loop. I notice that you use a pool of 8192 of your buffers, and if you
> use them round-robin, then you have a perfect recipe for cache thrashing.
> If so, then prefetch would help.
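A minimal sketch of the prefetch-ahead pattern under discussion, assuming a
my_packet structure carrying tag1 and the rx_table hash from the snippet quoted
at the bottom of the thread; the PREFETCH_OFFSET distance, function name, and
struct layout are illustrative, not the poster's actual code:

```
#include <stdint.h>
#include <rte_hash.h>
#include <rte_prefetch.h>

#define PREFETCH_OFFSET 8  /* how many buffers to stay ahead; needs tuning */

struct my_packet {
    uint32_t tag1;
    /* ... rest of the custom structure ... */
};

static inline void
lookup_burst(const struct rte_hash *rx_table, struct my_packet **pkts,
             uint16_t n, void **val)
{
    uint16_t i;

    /* Warm up the first few buffers before touching tag1. */
    for (i = 0; i < PREFETCH_OFFSET && i < n; i++)
        rte_prefetch0(pkts[i]);

    for (i = 0; i < n; i++) {
        /* Issue the prefetch several iterations ahead so the cache line
         * holding tag1 has a chance to arrive before it is dereferenced. */
        if (i + PREFETCH_OFFSET < n)
            rte_prefetch0(pkts[i + PREFETCH_OFFSET]);

        if (rte_hash_lookup_data(rx_table, &pkts[i]->tag1,
                                 (void **)&val[i]) < 0) {
            /* lookup miss handling omitted */
        }
    }
}
```

As Stephen points out at the end of the thread, the prefetch distance is the
sensitive part: issued too late, there are not enough cycles to hide the miss;
issued too early, the line can be evicted again before it is used.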
You raised a very good point here about queue and burst sizes. I think DPDK's
Writing Efficient Code page could have a section on this topic to help with
understanding how these size choices matter, or maybe I just missed it if DPDK
already documents how to assign RX and TX ring sizes. Without knowing the
compute load of each part of the data path, people like me just assign
arbitrary 2^n values (I blame myself here, though). In my application:

rte_mbuf pool size is 4096
rx_ring and tx_ring sizes are 1024
rings used to communicate between cores are 8192
my_packet mempool is 8192
MAX_BURST_SIZE for all the loops in the DPDK application is set to 32

> It is not clear from your descriptions if the core which reads the bursts
> from the dpdk PMD is the same as the core which does the processing. If a
> core touches your buffers (e.g. tag1), and then you pass the buffer to
> another core, then you get LLC coherency overheads, which would also
> trigger LLC-load-misses (which you can detect through the perf output above).

I isolate CPUs 1,2,3,4,5 from the kernel, thus leaving 0 for kernel
operations. Core 2 (which runs an infinite RX/TX loop) reads the packets from
the DPDK PMD and sets the tag1 values, while Core 4 looks up the rte_hash
table using tag1 as the key and proceeds further.

> It seems you have this type of processor (codename Sandy Bridge, 6 cores,
> hyperthreading is enabled):
>
> https://ark.intel.com/products/64594/Intel-Xeon-Processor-E5-2620-15M-Cache-2_00-GHz-7_20-GTs-Intel-QPI
>
> Can you double check that your application runs with the right core
> assignment? Since hyperthreading is enabled, you should not use core 0
> (plenty of functions for the Linux kernel run on core 0) nor core 6 (which
> is the same hardware as core 0), and make sure the hyperthread corresponding
> to the core you are running is not used either. You can get the CPU<-->Core
> assignment with the lscpu tool.

I had HT disabled for all the experiments. Here is the output of lscpu -p:

# The following is the parsable format, which can be fed to other
# programs. Each different item in every column has an unique ID
# starting from zero.
# CPU,Core,Socket,Node,,L1d,L1i,L2,L3
0,0,0,0,,0,0,0,0
1,1,0,0,,1,1,1,0
2,2,0,0,,2,2,2,0
3,3,0,0,,3,3,3,0
4,4,0,0,,4,4,4,0
5,5,0,0,,5,5,5,0

Thanks,
Arvind
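For reference, a rough sketch of the setup calls that the sizes listed above
would map to; the port id, queue count, cache sizes, object layout, names, and
error handling below are illustrative assumptions, not the poster's actual code:

```
#include <stdint.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>
#include <rte_ring.h>

#define NB_MBUF             4096   /* rte_mbuf pool size */
#define RX_RING_SIZE        1024   /* NIC RX descriptor ring */
#define TX_RING_SIZE        1024   /* NIC TX descriptor ring */
#define CORE_RING_SIZE      8192   /* ring between the RX/TX core and the lookup core */
#define MY_PACKET_POOL_SIZE 8192   /* pool of my_packet structures */
#define MAX_BURST_SIZE        32   /* burst size used by all loops */

struct my_packet {
    uint32_t tag1;
    /* ... rest of the custom structure ... */
};

static int
setup_resources(uint16_t port_id)
{
    /* Mbuf pool backing the NIC descriptors. rte_eth_dev_configure() and
     * rte_eth_dev_start() are omitted for brevity. */
    struct rte_mempool *mbuf_pool = rte_pktmbuf_pool_create("MBUF_POOL",
            NB_MBUF, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());

    /* Pool of the custom my_packet structures. */
    struct rte_mempool *pkt_pool = rte_mempool_create("MY_PACKET_POOL",
            MY_PACKET_POOL_SIZE, sizeof(struct my_packet), 256, 0,
            NULL, NULL, NULL, NULL, rte_socket_id(), 0);

    /* Ring used to hand buffers from the RX/TX core to the lookup core. */
    struct rte_ring *core_ring = rte_ring_create("RXTX_TO_LOOKUP",
            CORE_RING_SIZE, rte_socket_id(), RING_F_SP_ENQ | RING_F_SC_DEQ);

    if (mbuf_pool == NULL || pkt_pool == NULL || core_ring == NULL)
        return -1;

    /* One RX queue and one TX queue, sized as listed above. */
    if (rte_eth_rx_queue_setup(port_id, 0, RX_RING_SIZE,
            rte_eth_dev_socket_id(port_id), NULL, mbuf_pool) < 0)
        return -1;
    if (rte_eth_tx_queue_setup(port_id, 0, TX_RING_SIZE,
            rte_eth_dev_socket_id(port_id), NULL) < 0)
        return -1;

    return 0;
}
```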
On Tue, Sep 11, 2018 at 2:36 PM Pierre Laurent wrote:

> Can I suggest a few steps for investigating more?
>
> First, verify that the L1 cache is really the suspect. This can be done
> simply with the perf utility and the L1-dcache-load-misses counter. The
> simplest tool is "perf", which is part of the linux-tools packages.
>
> $ apt-get install linux-tools-common linux-tools-generic linux-tools-`uname -r`
>
> $ sudo perf stat -d -d -d ./build/rxtx
> EAL: Detected 12 lcore(s)
> ....
> ^C
>
>  Performance counter stats for './build/rxtx':
>
>        1413.787490      task-clock (msec)         #    0.923 CPUs utilized
>                 18      context-switches          #    0.013 K/sec
>                  4      cpu-migrations            #    0.003 K/sec
>                238      page-faults               #    0.168 K/sec
>      4,436,904,124      cycles                    #    3.138 GHz                      (32.67%)
>      3,888,094,815      stalled-cycles-frontend   #   87.63% frontend cycles idle     (32.94%)
>        237,378,065      instructions              #    0.05  insn per cycle
>                                                   #   16.38  stalled cycles per insn  (39.73%)
>         76,863,834      branches                  #   54.367 M/sec                    (40.01%)
>            101,550      branch-misses             #    0.13% of all branches          (40.30%)
>         94,805,298      L1-dcache-loads           #   67.058 M/sec                    (39.77%)
>        263,530,291      L1-dcache-load-misses     #  277.97% of all L1-dcache hits    (13.77%)
>            425,934      LLC-loads                 #    0.301 M/sec                    (13.60%)
>            181,295      LLC-load-misses           #   42.56% of all LL-cache hits     (20.21%)
>                         L1-icache-loads
>            775,365      L1-icache-load-misses                                         (26.71%)
>         70,580,827      dTLB-loads                #   49.923 M/sec                    (25.46%)
>              2,474      dTLB-load-misses          #    0.00% of all dTLB cache hits   (13.01%)
>                277      iTLB-loads                #    0.196 K/sec                    (13.01%)
>                994      iTLB-load-misses          #  358.84% of all iTLB cache hits   (19.52%)
>                         L1-dcache-prefetches
>              7,204      L1-dcache-prefetch-misses #    0.005 M/sec                    (26.03%)
>
>        1.531809863 seconds time elapsed
>
> One of the common mistakes is to have excessively large tx and rx queues,
> which in turn helps trigger excessively large bursts. Your L1 cache is 32K,
> that is, 512 cache lines. L1 cache is not elastic, and 512 cache lines is
> not much. If the bursts you are processing happen to be more than
> approximately 128 buffers, then you will be thrashing the cache when running
> your loop. I notice that you use a pool of 8192 of your buffers, and if you
> use them round-robin, then you have a perfect recipe for cache thrashing.
> If so, then prefetch would help.
>
> rte_hash_lookup looks into cache lines too (at least 3 per successful
> invocation). If you use the same key, then rte_hash_lookup will look into
> the same cache lines. If your keys are randomly distributed, then it is
> another recipe for cache thrashing.
>
> It is not clear from your descriptions if the core which reads the bursts
> from the dpdk PMD is the same as the core which does the processing. If a
> core touches your buffers (e.g. tag1), and then you pass the buffer to
> another core, then you get LLC coherency overheads, which would also
> trigger LLC-load-misses (which you can detect through the perf output above).
>
> It seems you have this type of processor (codename Sandy Bridge, 6 cores,
> hyperthreading is enabled):
>
> https://ark.intel.com/products/64594/Intel-Xeon-Processor-E5-2620-15M-Cache-2_00-GHz-7_20-GTs-Intel-QPI
>
> Can you double check that your application runs with the right core
> assignment? Since hyperthreading is enabled, you should not use core 0
> (plenty of functions for the Linux kernel run on core 0) nor core 6 (which
> is the same hardware as core 0), and make sure the hyperthread corresponding
> to the core you are running is not used either. You can get the CPU<-->Core
> assignment with the lscpu tool.
>
> $ lscpu -p
> # The following is the parsable format, which can be fed to other
> # programs. Each different item in every column has an unique ID
> # starting from zero.
> # CPU,Core,Socket,Node,,L1d,L1i,L2,L3
> 0,0,0,0,,0,0,0,0
> 1,1,0,0,,1,1,1,0
> 2,2,0,0,,2,2,2,0
> 3,3,0,0,,3,3,3,0
> 4,4,0,0,,4,4,4,0
> 5,5,0,0,,5,5,5,0
> 6,0,0,0,,0,0,0,0
> 7,1,0,0,,1,1,1,0
> 8,2,0,0,,2,2,2,0
> 9,3,0,0,,3,3,3,0
> 10,4,0,0,,4,4,4,0
> 11,5,0,0,,5,5,5,0
>
> If you do not need hyperthreading, and if L1 cache is your bottleneck, you
> might need to disable hyperthreading and get 64K bytes of L1 cache per core.
> If you really need hyperthreading, then use less cache in your code by
> better tuning the buffer pool sizes.
>
> SW prefetch is quite difficult to use efficiently. There are 4 different
> hardware prefetchers with different algorithms (adjacent cache lines, stride
> access, ...) for which the use of the prefetch instruction is unnecessary,
> and there is a hw limit of about 8 pending L1 data cache misses (sometimes
> documented as 5, sometimes as 10). This creates a serious software-complexity
> burden to abide by the hw rules.
>
> https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors
>
> Just verify that the hardware prefetchers are all enabled through MSR 0x1A4.
> Some BIOSes might have created a different setup.
>
> On 11/09/18 19:07, Stephen Hemminger wrote:
>> On Tue, 11 Sep 2018 12:18:42 -0500
>> Arvind Narayanan wrote:
>>
>>> If I don't do any processing, I easily get 10G. It is only when I access
>>> the tag that the throughput drops.
>>> What confuses me is that if I use the following snippet, it works at line
>>> rate.
>>>
>>> ```
>>> int temp_key = 1; // declared outside of the for loop
>>>
>>> for (i = 0; i < pkt_count; i++) {
>>>     if (rte_hash_lookup_data(rx_table, &(temp_key), (void **)&val[i]) < 0) {
>>>     }
>>> }
>>> ```
>>>
>>> But as soon as I replace `temp_key` with `my_packet->tag1`, I experience a
>>> fall in throughput (which in a way confirms the issue is due to cache
>>> misses).
>>
>> Your packet data is not in cache.
>> Doing prefetch can help, but it is very timing sensitive. If the prefetch is
>> done before the data is available, it won't help. And if the prefetch is
>> done just before the data is used, then there aren't enough cycles to get it
>> from memory into the cache.
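Following up on the MSR 0x1A4 point above: one way to read the prefetcher
control register is the rdmsr utility from msr-tools (the package name and the
CPU number below are illustrative; per the Intel article linked above, bits 0-3
of MSR 0x1A4 disable the four hardware prefetchers when set, so a value of 0
means they are all enabled):

$ sudo apt-get install msr-tools
$ sudo modprobe msr
$ sudo rdmsr -p 2 0x1a4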