From: Take Ceara
Date: Thu, 12 May 2016 16:57:47 +0200
To: users@dpdk.org
Subject: [dpdk-users] Performance degradation in dpdk 2.2 when using multiple NICs on different CPU sockets

Hello,

We're working on a project where we use DPDK with our own TCP implementation on top in order to set up a high number of TCP sessions/s between DPDK-controlled Ethernet ports.

Our reference hardware platform is:
- Super X10DRX dual-socket motherboard
- 2x Intel E5-2660 v3 processors (10 cores x 2 HW threads each)
- 128 GB RAM, 16x 8 GB DDR4 2133 MHz DIMMs to fill all the memory slots
- 2x 40G Intel XL710-QDA1 adapters

The CPU layout is:

$ $RTE_SDK/tools/cpu_layout.py
============================================================
Core and Socket Information (as reported by '/proc/cpuinfo')
============================================================

cores = [0, 1, 2, 3, 4, 8, 9, 10, 11, 12]
sockets = [0, 1]

         Socket 0        Socket 1
         --------        --------
Core 0   [0, 20]         [10, 30]
Core 1   [1, 21]         [11, 31]
Core 2   [2, 22]         [12, 32]
Core 3   [3, 23]         [13, 33]
Core 4   [4, 24]         [14, 34]
Core 8   [5, 25]         [15, 35]
Core 9   [6, 26]         [16, 36]
Core 10  [7, 27]         [17, 37]
Core 11  [8, 28]         [18, 38]
Core 12  [9, 29]         [19, 39]

Following the DPDK performance guidelines [1], section 7.2, point 3:

"Note: To get the best performance, ensure that the core and NICs are in the same socket. In the example above 85:00.0 is on socket 1 and should be used by cores on socket 1 for the best performance."
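For reference, checking this kind of placement from inside a DPDK application can be done roughly as in the snippet below (a simplified, illustrative sketch; the helper itself is made up, only the DPDK API calls are real):

#include <stdio.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>

/* Illustrative helper: warn if a worker lcore sits on a different NUMA
 * socket than the port it is supposed to poll. A real version would only
 * look at the lcores actually assigned to port_id. */
static void
check_port_numa(uint8_t port_id)
{
    int port_socket = rte_eth_dev_socket_id(port_id);
    unsigned lcore;

    RTE_LCORE_FOREACH_SLAVE(lcore) {
        unsigned lcore_socket = rte_lcore_to_socket_id(lcore);

        if ((int)lcore_socket != port_socket)
            printf("lcore %u (socket %u) is remote to port %u (socket %d)\n",
                   lcore, lcore_socket, port_id, port_socket);
    }
}

The result of rte_eth_dev_socket_id() can also be cross-checked against /sys/bus/pci/devices/<PCI addr>/numa_node.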
For testing we connected the two 40G ports back to back and decided to install them on different sockets. We made sure that each NIC is controlled only by cores on the same socket as the NIC itself:

NIC0: PCI 02:00.0 (socket 0) -> cores 2-9   (all on socket 0)
NIC1: PCI 82:00.0 (socket 1) -> cores 12-19 (all on socket 1)

With this configuration our implementation could achieve a TCP setup rate (with NIC0 running the clients and NIC1 running the servers) of ~3.2M sessions/s. Compared to our previous benchmarks on single-socket servers this is really low: we were expecting around 12M sessions/s. At least in theory, our implementation should scale almost linearly:

- we use RSS hashing to distribute traffic between queues and completely split the TCP stack into independent per core/queue stacks;
- there's no locking between any of the cores handling the queues;
- there's virtually no atomic variable usage while generating the traffic;
- all the memory used for generating the TCP sessions is allocated from the socket local to the core;
- we use per-socket mbuf pools (see the sketch at the end of this mail).

I then found this note in [1] (section 7.1.1):

"Care should be take with NUMA. If you are using 2 or more ports from different NICs, it is best to ensure that these NICs are on the same CPU socket. An example of how to determine this is shown further below."

We then moved both NICs to socket 1 and used the following configuration:

NIC0: 0000:83:00.0 (socket 1) -> cores 12-15 (socket 1)
NIC1: 0000:84:00.0 (socket 1) -> cores 16-19 (socket 1)

In this case the setup rate scaled almost linearly to ~6M sessions/s, as we originally expected.

I initially thought the performance drop was due to the way the driver allocates and polls the queues. However, going through the i40e driver code, as far as I can see all the memory allocations are also done based on the socket_id that is passed when setting up the RX queues (i40e_dev_rx_queue_setup()), which contradicts that guess.

We'd like to eventually use both sockets at the same time, and this performance degradation is a problem for us. What are the alternatives to overcome this limitation? Would running two DPDK instances (e.g., in VMs) hit the same limitation?

Thanks,
Dumitru Ceara

[1] http://dpdk.org/doc/guides/linux_gsg/nic_perf_intel_platform.html
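P.S. For clarity, the per-socket allocation mentioned above boils down to something like the sketch below. This is illustrative only, not our actual code: the helper, pool names and ring/pool sizes are made up and error handling is stripped; only the DPDK calls themselves are real.

#include <stdio.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* One mbuf pool per NUMA socket; every RX/TX queue of a port uses the
 * pool and descriptor memory local to the socket the NIC is attached to. */
static struct rte_mempool *pktmbuf_pool[RTE_MAX_NUMA_NODES];

static int
setup_port_numa_local(uint8_t port_id, uint16_t nb_queues)
{
    int socket = rte_eth_dev_socket_id(port_id);
    uint16_t q;

    if (socket < 0)
        socket = 0; /* socket unknown, fall back to 0 */

    if (pktmbuf_pool[socket] == NULL) {
        char name[32];

        snprintf(name, sizeof(name), "pktmbuf_pool_s%d", socket);
        /* Pool and cache sizes are placeholders. */
        pktmbuf_pool[socket] = rte_pktmbuf_pool_create(name, 65535, 256, 0,
                RTE_MBUF_DEFAULT_BUF_SIZE, socket);
        if (pktmbuf_pool[socket] == NULL)
            return -1;
    }

    for (q = 0; q < nb_queues; q++) {
        /* i40e_dev_rx_queue_setup() allocates its rings on 'socket'. */
        if (rte_eth_rx_queue_setup(port_id, q, 512, socket, NULL,
                                   pktmbuf_pool[socket]) < 0)
            return -1;
        if (rte_eth_tx_queue_setup(port_id, q, 512, socket, NULL) < 0)
            return -1;
    }

    return 0;
}

The per-core TCP session state follows the same idea, e.g. allocated with rte_malloc_socket() pinned to the lcore's own socket.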