From: jigsaw
To: Bruce Richardson, dev@dpdk.org
Date: Tue, 11 Nov 2014 17:37:52 +0200
Subject: [dpdk-dev] LLC miss in librte_distributor

Hi Bruce,

I noticed that librte_distributor has quite a severe LLC miss problem when running on 16 cores, while on 8 cores there is no such problem.

The test runs on an Intel(R) Xeon(R) CPU E5-2670, a Sandy Bridge with 32 cores on 2 sockets. The test case is distributor_perf_autotest, i.e. app/test/test_distributor_perf.c. The results are collected with this command:

perf stat -e LLC-load-misses,LLC-loads,LLC-store-misses,LLC-stores ./test -cff -n2 --no-huge

Note that the results show the LLC miss rate remains the same with or without hugepages, so I will only show the --no-huge configuration.

With 8 cores, the LLC miss rate is OK:

LLC-load-misses    26750
LLC-loads          93979233
LLC-store-misses   432263
LLC-stores         69954746

That is a 0.028% load miss rate and a 0.62% store miss rate.

With 16 cores, the LLC miss rate is very high:

LLC-load-misses    70263520
LLC-loads          143807657
LLC-store-misses   23115990
LLC-stores         63692854

That is a 48.9% load miss rate and a 36.3% store miss rate.

Most of the load misses happen at the first line of rte_distributor_poll_pkt. Where most of the store misses happen I don't know, because perf record on LLC-store-misses brings down my machine.

It is not obvious to me how this can happen: 8 cores are fine, but 16 cores are very bad. My guess is that 16 cores bring in more QPI transactions between the sockets, or that 16 cores produce a different LLC access pattern.

So I tried to reduce the padding inside union rte_distributor_buffer from 3 cache lines to 1 cache line:

- char pad[CACHE_LINE_SIZE*3];
+ char pad[CACHE_LINE_SIZE];

And it does have an obvious effect:

LLC-load-misses    53159968
LLC-loads          167756282
LLC-store-misses   29012799
LLC-stores         63352541

Now it is a 31.69% load miss rate and a 45.79% store miss rate. It lowers the load miss rate but raises the store miss rate. Both numbers are still very high, sadly. But the bright side is that it decreases the time per burst and the time per packet.
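For reference, the buffer in question is declared roughly like this in librte_distributor (a simplified sketch from memory, so the exact definition in the tree may differ; the CACHE_LINE_SIZE and __rte_cache_aligned defines below are only stand-ins so the snippet is self-contained):

#include <stdint.h>

/* Stand-ins so the sketch compiles outside of DPDK; the real macros come
 * from the DPDK headers. */
#define CACHE_LINE_SIZE 64
#define __rte_cache_aligned __attribute__((aligned(CACHE_LINE_SIZE)))

/* One message word per worker: the distributor core writes bufptr64 to hand
 * over a packet, and the worker core polls it. The pad is there to keep
 * neighbouring workers' buffers on separate cache lines. */
union rte_distributor_buffer {
        volatile int64_t bufptr64;    /* packet pointer plus low-order flag bits */
        char pad[CACHE_LINE_SIZE*3];  /* the array I shrank to CACHE_LINE_SIZE */
} __rte_cache_aligned;

With the smaller pad the per-worker buffers end up on adjacent cache lines instead of three lines apart, which presumably explains why the miss pattern changes rather than simply improving.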
The original version has:

=== Performance test of distributor ===
Time per burst: 8013
Time per packet: 250

And the patched version has:

=== Performance test of distributor ===
Time per burst: 6834
Time per packet: 213

I tried a couple of other tricks, such as adding more idle loops in rte_distributor_get_pkt and making the rte_distributor_buffer thread-local to each worker core, but none of them had any noticeable effect.

These failures make me tend to believe that the high LLC miss rate is related to QPI or NUMA. But my machine is not able to perf the uncore QPI events, so this cannot be proved.

I cannot draw any conclusion or reveal the root cause after all. But I suggest a further study of this performance bottleneck so as to find a good solution.

thx & rgds,
-qinglai
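P.S. For anyone who wants to look at the load-miss hotspot: the first thing rte_distributor_poll_pkt does is read bufptr64 from the per-worker buffer. From memory the worker side looks roughly like the sketch below (the flag constant, shift width and function shape are my recollection only, so treat it as an illustration rather than the real source):

#include <stdint.h>

#define RTE_DISTRIB_GET_BUF   (1)     /* illustrative flag value */
#define RTE_DISTRIB_FLAG_BITS 4       /* illustrative: low bits hold flags */

struct rte_mbuf;                      /* opaque here */

/* Worker-side poll, roughly: check the 64-bit message word that the
 * distributor core writes when it hands over a packet. */
struct rte_mbuf *
poll_pkt_sketch(volatile int64_t *bufptr64)
{
        /* This load is where the LLC load misses show up: the same cache
         * line is written by the distributor core, so when distributor and
         * worker sit on different sockets the line bounces over QPI. */
        if (*bufptr64 & RTE_DISTRIB_GET_BUF)
                return NULL;          /* distributor has not answered yet */

        int64_t ret = *bufptr64 >> RTE_DISTRIB_FLAG_BITS;
        return (struct rte_mbuf *)(uintptr_t)ret;
}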