From: Take Ceara
Date: Thu, 12 May 2016 16:57:47 +0200
To: users@dpdk.org
Subject: [dpdk-users] Performance degradation in dpdk 2.2 when using multiple NICs on different CPU sockets

Hello,

We're working on a project where we use DPDK with our own TCP implementation on top in order to set up a high number of TCP sessions/s between DPDK-controlled Ethernet ports.

Our reference hardware platform is:
- Super X10DRX dual-socket motherboard
- 2x Intel E5-2660 v3 processors (10 cores x 2 HW threads each)
- 128 GB RAM, 16x 8 GB DDR4 2133 MHz DIMMs to fill all the memory slots
- 2x 40G Intel XL710-QDA1 adapters

The CPU layout is:

$ $RTE_SDK/tools/cpu_layout.py
============================================================
Core and Socket Information (as reported by '/proc/cpuinfo')
============================================================

cores = [0, 1, 2, 3, 4, 8, 9, 10, 11, 12]
sockets = [0, 1]

         Socket 0        Socket 1
         --------        --------
Core 0   [0, 20]         [10, 30]
Core 1   [1, 21]         [11, 31]
Core 2   [2, 22]         [12, 32]
Core 3   [3, 23]         [13, 33]
Core 4   [4, 24]         [14, 34]
Core 8   [5, 25]         [15, 35]
Core 9   [6, 26]         [16, 36]
Core 10  [7, 27]         [17, 37]
Core 11  [8, 28]         [18, 38]
Core 12  [9, 29]         [19, 39]

Following the DPDK performance guidelines [1], section 7.2, point 3:

"Note: To get the best performance, ensure that the core and NICs are in the same socket. In the example above 85:00.0 is on socket 1 and should be used by cores on socket 1 for the best performance."
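For reference, checking this kind of placement from inside a DPDK application can be done roughly as in the snippet below (a simplified, illustrative sketch; the helper itself is made up, only the DPDK API calls are real):

#include <stdio.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>

/* Illustrative helper: warn if a worker lcore sits on a different NUMA
 * socket than the port it is supposed to poll. A real version would only
 * look at the lcores actually assigned to port_id. */
static void
check_port_numa(uint8_t port_id)
{
    int port_socket = rte_eth_dev_socket_id(port_id);
    unsigned lcore;

    RTE_LCORE_FOREACH_SLAVE(lcore) {
        unsigned lcore_socket = rte_lcore_to_socket_id(lcore);

        if ((int)lcore_socket != port_socket)
            printf("lcore %u (socket %u) is remote to port %u (socket %d)\n",
                   lcore, lcore_socket, port_id, port_socket);
    }
}

The result of rte_eth_dev_socket_id() can also be cross-checked against /sys/bus/pci/devices/<PCI addr>/numa_node.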
For testing we connected the two 40G ports back to back and decided to install them on different sockets. We made sure that each NIC is controlled only by cores on the same socket as the NIC itself:

NIC0: PCI 02:00.0 (socket 0) -> cores 2-9   (all on socket 0)
NIC1: PCI 82:00.0 (socket 1) -> cores 12-19 (all on socket 1)

With this configuration our implementation could achieve a TCP setup rate (with NIC0 running the clients and NIC1 running the servers) of ~3.2M sessions/s. Compared to our previous benchmarks on single-socket servers this is really low: we were expecting around 12M sessions/s. At least in theory, our implementation should scale almost linearly:

- we use RSS hashing to distribute traffic between queues and completely split the TCP stack into independent per core/queue stacks;
- there's no locking between any of the cores handling the queues;
- there's virtually no atomic variable usage while generating the traffic;
- all the memory used for generating the TCP sessions is allocated from the socket local to the core;
- we use per-socket mbuf pools (see the sketch at the end of this mail).

I then found this note in [1] (section 7.1.1):

"Care should be take with NUMA. If you are using 2 or more ports from different NICs, it is best to ensure that these NICs are on the same CPU socket. An example of how to determine this is shown further below."

We then moved both NICs to socket 1 and used the following configuration:

NIC0: 0000:83:00.0 (socket 1) -> cores 12-15 (socket 1)
NIC1: 0000:84:00.0 (socket 1) -> cores 16-19 (socket 1)

In this case the setup rate scaled almost linearly to ~6M sessions/s, as we originally expected.

I initially thought the performance drop was due to the way the driver allocates and polls the queues. However, going through the i40e driver code, as far as I can see all the memory allocations are also done based on the socket_id that is passed when setting up the RX queues (i40e_dev_rx_queue_setup()), which contradicts that guess.

We'd like to eventually use both sockets at the same time, and this performance degradation is a problem for us. What are the alternatives to overcome this limitation? Would running two DPDK instances (e.g., in VMs) hit the same limitation?

Thanks,
Dumitru Ceara

[1] http://dpdk.org/doc/guides/linux_gsg/nic_perf_intel_platform.html
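P.S. For clarity, the per-socket allocation mentioned above boils down to something like the sketch below. This is illustrative only, not our actual code: the helper, pool names and ring/pool sizes are made up and error handling is stripped; only the DPDK calls themselves are real.

#include <stdio.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* One mbuf pool per NUMA socket; every RX/TX queue of a port uses the
 * pool and descriptor memory local to the socket the NIC is attached to. */
static struct rte_mempool *pktmbuf_pool[RTE_MAX_NUMA_NODES];

static int
setup_port_numa_local(uint8_t port_id, uint16_t nb_queues)
{
    int socket = rte_eth_dev_socket_id(port_id);
    uint16_t q;

    if (socket < 0)
        socket = 0; /* socket unknown, fall back to 0 */

    if (pktmbuf_pool[socket] == NULL) {
        char name[32];

        snprintf(name, sizeof(name), "pktmbuf_pool_s%d", socket);
        /* Pool and cache sizes are placeholders. */
        pktmbuf_pool[socket] = rte_pktmbuf_pool_create(name, 65535, 256, 0,
                RTE_MBUF_DEFAULT_BUF_SIZE, socket);
        if (pktmbuf_pool[socket] == NULL)
            return -1;
    }

    for (q = 0; q < nb_queues; q++) {
        /* i40e_dev_rx_queue_setup() allocates its rings on 'socket'. */
        if (rte_eth_rx_queue_setup(port_id, q, 512, socket, NULL,
                                   pktmbuf_pool[socket]) < 0)
            return -1;
        if (rte_eth_tx_queue_setup(port_id, q, 512, socket, NULL) < 0)
            return -1;
    }

    return 0;
}

The per-core TCP session state follows the same idea, e.g. allocated with rte_malloc_socket() pinned to the lcore's own socket.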