To: dev@dpdk.org
From: Tom Barbette <barbette@kth.se>
Date: Thu, 24 Oct 2019 19:32:18 +0200
Subject: [dpdk-dev] Performance impact of "declaring" more CPU cores

Hi all,

We are experiencing a very strange problem. The code of our application has been modified to always use 8 cores. However, when running with "-l 0-MAXCPU", with MAXCPU varying between 7 and 15 (therefore "declaring" more *unused* cores), the performance of the application changes. The difference can be multiple Gbps at very high speed (100G) with a large number of cores, and it is not linear: declaring more cores does not consistently improve or degrade performance.

An example can be seen here: https://kth.box.com/s/v5v1hyidd51ebd7b8ixmcw513lfqwj4a . Note that the error bars (10 runs per point) show it is not "global" variance that creates the differences between core counts. Once a run ends up in a "class" of performance, it stays there even when other parameters change (such as the number of cores actually used).

From a research perspective this is problematic, as we cannot trust that a result with X cores is really a certain percentage better than one with Y cores: the difference could be due to this problem rather than to what we "improved" in the application.

The application is a NAT and a FW, but we have observed the same behaviour with other VNFs, at different scales. The link is saturated at 100G with real packets (avg size 1040 B). We use Mellanox ConnectX-5 NICs and DPDK 19.02. This shows up on both Skylake and Cascade Lake machines. The example above is a single-socket machine with 8 cores and HT.
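To make the setup concrete, here is a minimal sketch (not our actual application; NB_WORKERS and the empty worker_main() are just placeholders for the NAT/FW loop) of how the cores are used: the EAL core list grows with MAXCPU, but only 8 worker lcores are ever launched.

/* Minimal sketch: "-l 0-MAXCPU" may declare 8 to 16 lcores,
 * but the application only ever uses NB_WORKERS of them. */
#include <stdio.h>
#include <stdlib.h>
#include <rte_eal.h>
#include <rte_lcore.h>
#include <rte_debug.h>

#define NB_WORKERS 8 /* hardcoded, independent of the -l core list */

/* Placeholder for the real NAT/FW rx-process-tx loop. */
static int
worker_main(void *arg)
{
    (void)arg;
    return 0;
}

int
main(int argc, char **argv)
{
    unsigned int lcore_id, launched = 0;

    if (rte_eal_init(argc, argv) < 0)
        rte_exit(EXIT_FAILURE, "EAL init failed\n");

    printf("EAL declares %u lcores, application uses %u\n",
           rte_lcore_count(), NB_WORKERS);

    /* Launch NB_WORKERS - 1 slave workers; the master lcore is the last one. */
    RTE_LCORE_FOREACH_SLAVE(lcore_id) {
        if (launched == NB_WORKERS - 1)
            break;
        rte_eal_remote_launch(worker_main, NULL, lcore_id);
        launched++;
    }
    worker_main(NULL);

    rte_eal_mp_wait_lcore();
    return 0;
}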
The same thing also happens on a NUMA machine, even when varying only within the 18 cores of one socket.

The only useful observation we have made so far is that in a "bad" case the LLC shows more cache misses. We could not really reproduce the problem with an XL710, but at 40G it is barely visible with the Mellanox NICs either, so that path is not very conclusive. We do not have other 100G NICs (hurry up, Intel! :p). Reproducing with testpmd is also hard, as the problem really shows with something like 8 cores, and the load needs to be below the NIC wire speed but high enough to observe the effect...

To rule that out, we hardcoded the number of cores used everywhere inside the application, so except for DPDK internals all resources are allocated for 8 cores. That is still our best lead: could it be that we simply get lucky/unlucky buffer allocations for "something per lcore", so that packets evict each other because of the limited associativity of the cache? At those speeds, losing DDIO is death...

Would there be a way to fix or at least verify the memory allocation? base-virtaddr did not help, though. ASLR is disabled (enabling it does not change anything).

Thanks for the help,
Tom
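P.S. To clarify what we mean by "verifying" the allocation, below is a rough sketch (not our actual code, to be called once after EAL init and mbuf pool creation) of the kind of check we have in mind: dump the physmem/memzone layout and histogram which LLC cache sets the mbuf buffers of a pool fall into, then compare a "good" and a "bad" run. The pool name "mbuf_pool" and the LLC geometry constants are assumptions, and the plain modulo set index ignores the Skylake LLC slice hashing, so this is only an approximation.

/* Rough sketch: dump DPDK's memory layout and approximate the LLC
 * cache-set distribution of the mbuf data buffers of one pktmbuf pool. */
#include <stdio.h>
#include <stdint.h>
#include <rte_memory.h>
#include <rte_memzone.h>
#include <rte_mempool.h>
#include <rte_mbuf.h>

/* Assumed LLC geometry (8-core Skylake-SP-like: 11 MB, 11-way); adjust. */
#define CACHE_LINE 64
#define LLC_WAYS   11
#define LLC_SIZE   (11u * 1024 * 1024)
#define LLC_SETS   (LLC_SIZE / (CACHE_LINE * LLC_WAYS))

static unsigned long set_histo[LLC_SETS];

static void
count_buf_set(struct rte_mempool *mp, void *opaque, void *obj, unsigned idx)
{
    struct rte_mbuf *m = obj; /* assumes a pktmbuf pool */
    /* Naive set index of the first line of the data buffer; the real LLC
     * hashes addresses across slices, so treat this as a fingerprint only. */
    uint64_t set = (m->buf_iova / CACHE_LINE) % LLC_SETS;

    (void)mp; (void)opaque; (void)idx;
    set_histo[set]++;
}

static void
dump_layout(void)
{
    struct rte_mempool *mp = rte_mempool_lookup("mbuf_pool"); /* assumed name */
    unsigned int i;

    rte_dump_physmem_layout(stdout); /* hugepage segments */
    rte_memzone_dump(stdout);        /* memzones (rings, per-port data, ...) */

    if (mp == NULL)
        return;
    rte_mempool_obj_iter(mp, count_buf_set, NULL);
    for (i = 0; i < LLC_SETS; i++)
        if (set_histo[i] != 0)
            printf("set %u: %lu buffers\n", i, set_histo[i]);
}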