From: Take Ceara
Date: Thu, 16 Jun 2016 20:20:34 +0200
To: "Wiles, Keith"
Cc: "dev@dpdk.org"
Subject: Re: [dpdk-dev] Performance hit - NICs on different CPU sockets

On Thu, Jun 16, 2016 at 6:59 PM, Wiles, Keith wrote:
>
> On 6/16/16, 11:56 AM, "dev on behalf of Wiles, Keith" wrote:
>
>>
>>On 6/16/16, 11:20 AM, "Take Ceara" wrote:
>>
>>>On Thu, Jun 16, 2016 at 5:29 PM, Wiles, Keith wrote:
>>>
>>>>
>>>> Right now I do not know what the issue is with the system. It could be too many Rx/Tx ring pairs per port limiting the memory in the NICs, which is why you get better performance with 8 cores per port. I am not really seeing the whole picture of how DPDK is configured, so it is hard to help more. Sorry.
>>>
>>>I doubt that there is a limitation in running 16 cores per port vs. 8
>>>cores per port: I tried with two machines connected back to back, each
>>>with one X710 port and 16 cores running on that port, and in that case
>>>our performance doubled as expected.
>>>
>>>>
>>>> Maybe seeing the DPDK command line would help.
>>>
>>>The command line I use with ports 01:00.3 and 81:00.3 is:
>>>./warp17 -c 0xFFFFFFFFF3 -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>>>--qmap 0.0x003FF003F0 --qmap 1.0x0FC00FFC00
>>>
>>>Our own qmap args allow the user to control exactly how cores are
>>>split between ports. In this case we end up with:
>>>
>>>warp17> show port map
>>>Port 0[socket: 0]:
>>>   Core 4[socket:0] (Tx: 0, Rx: 0)
>>>   Core 5[socket:0] (Tx: 1, Rx: 1)
>>>   Core 6[socket:0] (Tx: 2, Rx: 2)
>>>   Core 7[socket:0] (Tx: 3, Rx: 3)
>>>   Core 8[socket:0] (Tx: 4, Rx: 4)
>>>   Core 9[socket:0] (Tx: 5, Rx: 5)
>>>   Core 20[socket:0] (Tx: 6, Rx: 6)
>>>   Core 21[socket:0] (Tx: 7, Rx: 7)
>>>   Core 22[socket:0] (Tx: 8, Rx: 8)
>>>   Core 23[socket:0] (Tx: 9, Rx: 9)
>>>   Core 24[socket:0] (Tx: 10, Rx: 10)
>>>   Core 25[socket:0] (Tx: 11, Rx: 11)
>>>   Core 26[socket:0] (Tx: 12, Rx: 12)
>>>   Core 27[socket:0] (Tx: 13, Rx: 13)
>>>   Core 28[socket:0] (Tx: 14, Rx: 14)
>>>   Core 29[socket:0] (Tx: 15, Rx: 15)
>>>
>>>Port 1[socket: 1]:
>>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>>   Core 30[socket:1] (Tx: 10, Rx: 10)
>>>   Core 31[socket:1] (Tx: 11, Rx: 11)
>>>   Core 32[socket:1] (Tx: 12, Rx: 12)
>>>   Core 33[socket:1] (Tx: 13, Rx: 13)
>>>   Core 34[socket:1] (Tx: 14, Rx: 14)
>>>   Core 35[socket:1] (Tx: 15, Rx: 15)
>>
>>On each socket you have 10 physical cores, i.e. 20 lcores per socket and 40 lcores total.
>>
>>The above is listing the LCORES (or hyper-threads) and not COREs, which I understand some like to think are interchangeable. The hyper-threads are logically interchangeable, but not performance-wise. If you have two run-to-completion threads on a single physical core, each on a different hyper-thread of that core [0,1], then the second lcore or thread (1) on that physical core will only get at most about 20-30% of the CPU cycles. Normally it is much less, unless you tune the code to make sure each thread is not trying to share the internal execution units, but some internal execution units are always shared.
>>
>>To get the best performance when hyper-threading is enabled, do not run both threads on a single physical core; run only hyper-thread 0.
>>
>>The line below, taken from your cpu_layout output, lists a physical core id and its lcore ids on each socket. Use the first lcore of each physical core for the best performance:
>>Core 1    [1, 21]    [11, 31]
>>Use lcore 1 or 11, depending on the socket you are on.
>>
>>The config below is most likely the best performance and utilization of your system. If I got the values right ☺
>>
>>./warp17 -c 0x00000FFFe0 -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>>--qmap 0.0x00000003FE --qmap 1.0x00000FFE00
>>
>>Port 0[socket: 0]:
>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>   Core 4[socket:0] (Tx: 2, Rx: 2)
>>   Core 5[socket:0] (Tx: 3, Rx: 3)
>>   Core 6[socket:0] (Tx: 4, Rx: 4)
>>   Core 7[socket:0] (Tx: 5, Rx: 5)
>>   Core 8[socket:0] (Tx: 6, Rx: 6)
>>   Core 9[socket:0] (Tx: 7, Rx: 7)
>>
>>8 cores on the first socket, leaving lcores 0-1 for Linux.
>
> 9 cores, rather, leaving the first core (i.e. two lcores) for Linux.
>>
>>Port 1[socket: 1]:
>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>
>>All 10 cores on the second socket.
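Side note for readers following along: the -c and --qmap arguments are plain hexadecimal bit masks over lcore ids, bit N meaning lcore N. A toy snippet, purely illustrative and not part of WARP17 or DPDK, that builds such a mask:

/*
 * Toy illustration: build a -c / --qmap style hex mask from a list of
 * lcore ids (bit N set means lcore N is used).
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t lcore_mask(const unsigned int *lcores, size_t n)
{
    uint64_t mask = 0;
    size_t i;

    for (i = 0; i < n; i++)
        mask |= UINT64_C(1) << lcores[i];
    return mask;
}

int main(void)
{
    /* lcores 1-9 on socket 0, as in the suggested "--qmap 0.0x00000003FE". */
    const unsigned int port0[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9 };

    printf("0x%" PRIx64 "\n",
           lcore_mask(port0, sizeof(port0) / sizeof(port0[0])));
    return 0;
}

This prints 0x3fe, matching the suggested port 0 qmap above.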
The values were almost right :) But that's because we reserve the first
two lcores that are passed to DPDK for our own management part.

I was aware that lcores are not physical cores, so we don't expect
performance to scale linearly with the number of lcores. However, if
there's a chance that one hyper-thread can run while its paired one is
stalling, we'd like to take advantage of those cycles if possible.

Leaving that aside, I just ran two more tests using only one of the two
hardware threads per core.

a. 2 ports on different sockets, 8 cores/port:

./build/warp17 -c 0xFF3FF -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
--qmap 0.0x3FC --qmap 1.0xFF000

warp17> show port map
Port 0[socket: 0]:
   Core 2[socket:0] (Tx: 0, Rx: 0)
   Core 3[socket:0] (Tx: 1, Rx: 1)
   Core 4[socket:0] (Tx: 2, Rx: 2)
   Core 5[socket:0] (Tx: 3, Rx: 3)
   Core 6[socket:0] (Tx: 4, Rx: 4)
   Core 7[socket:0] (Tx: 5, Rx: 5)
   Core 8[socket:0] (Tx: 6, Rx: 6)
   Core 9[socket:0] (Tx: 7, Rx: 7)
Port 1[socket: 1]:
   Core 12[socket:1] (Tx: 0, Rx: 0)
   Core 13[socket:1] (Tx: 1, Rx: 1)
   Core 14[socket:1] (Tx: 2, Rx: 2)
   Core 15[socket:1] (Tx: 3, Rx: 3)
   Core 16[socket:1] (Tx: 4, Rx: 4)
   Core 17[socket:1] (Tx: 5, Rx: 5)
   Core 18[socket:1] (Tx: 6, Rx: 6)
   Core 19[socket:1] (Tx: 7, Rx: 7)

This gives a session setup rate of only 2M sessions/s.

b. 2 ports on socket 0, 4 cores/port:

./build/warp17 -c 0x3FF -m 32768 -w 0000:02:00.0 -w 0000:03:00.0 --
--qmap 0.0x3C0 --qmap 1.0x03C

warp17> show port map
Port 0[socket: 0]:
   Core 6[socket:0] (Tx: 0, Rx: 0)
   Core 7[socket:0] (Tx: 1, Rx: 1)
   Core 8[socket:0] (Tx: 2, Rx: 2)
   Core 9[socket:0] (Tx: 3, Rx: 3)
Port 1[socket: 0]:
   Core 2[socket:0] (Tx: 0, Rx: 0)
   Core 3[socket:0] (Tx: 1, Rx: 1)
   Core 4[socket:0] (Tx: 2, Rx: 2)
   Core 5[socket:0] (Tx: 3, Rx: 3)

Surprisingly, this gives a session setup rate of 3M sessions/s!

The packet processing cores are totally independent and only access
local-socket memory and ports. There is no locking or atomic variable
access in the fast path in our code. The mbuf pools are not shared
between cores handling the same port, so there should be no contention
when allocating/freeing mbufs. In this specific test scenario all the
cores handling port 0 are essentially executing the same code (TCP
clients), and the cores on port 1 as well (TCP servers).

Do you have any tips about what other things to check for?
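For reference, a rough sketch of the kind of NUMA-local, per-queue setup described above (names and sizes are illustrative only; this is not WARP17's actual code):

/*
 * Illustrative sketch only: each queue of a port gets its own mempool,
 * created on the NUMA socket the NIC is attached to, so the fast path
 * never shares pools between cores or touches remote-socket memory.
 */
#include <stdio.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

static int setup_port_queues(uint8_t port, uint16_t nb_queues)
{
    int socket = rte_eth_dev_socket_id(port);
    uint16_t q;

    if (socket < 0)
        socket = 0; /* NUMA node unknown, fall back to socket 0 */

    for (q = 0; q < nb_queues; q++) {
        char name[RTE_MEMPOOL_NAMESIZE];
        struct rte_mempool *mp;

        snprintf(name, sizeof(name), "mbuf_p%u_q%u",
                 (unsigned)port, (unsigned)q);
        /* Private pool per queue: no alloc/free contention between cores. */
        mp = rte_pktmbuf_pool_create(name, 16384, 256, 0,
                                     RTE_MBUF_DEFAULT_BUF_SIZE, socket);
        if (mp == NULL)
            return -1;

        if (rte_eth_rx_queue_setup(port, q, 512, socket, NULL, mp) < 0)
            return -1;
        if (rte_eth_tx_queue_setup(port, q, 512, socket, NULL) < 0)
            return -1;
    }
    return 0;
}

The point is just that every pool and queue lives on the socket reported by rte_eth_dev_socket_id(), so nothing in the fast path crosses sockets.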
Thanks,
Dumitru

>>
>>++Keith
>>
>>>
>>>Just for reference, the cpu_layout script shows:
>>>$ $RTE_SDK/tools/cpu_layout.py
>>>============================================================
>>>Core and Socket Information (as reported by '/proc/cpuinfo')
>>>============================================================
>>>
>>>cores = [0, 1, 2, 3, 4, 8, 9, 10, 11, 12]
>>>sockets = [0, 1]
>>>
>>>           Socket 0        Socket 1
>>>           --------        --------
>>>Core 0     [0, 20]         [10, 30]
>>>Core 1     [1, 21]         [11, 31]
>>>Core 2     [2, 22]         [12, 32]
>>>Core 3     [3, 23]         [13, 33]
>>>Core 4     [4, 24]         [14, 34]
>>>Core 8     [5, 25]         [15, 35]
>>>Core 9     [6, 26]         [16, 36]
>>>Core 10    [7, 27]         [17, 37]
>>>Core 11    [8, 28]         [18, 38]
>>>Core 12    [9, 29]         [19, 39]
>>>
>>>I know it might be complicated to figure out exactly what's happening
>>>in our setup with our own code, so please let me know if you need
>>>additional information.
>>>
>>>I appreciate the help!
>>>
>>>Thanks,
>>>Dumitru
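A closing illustration tied to the cpu_layout output above: a small sketch (purely illustrative, not code from this thread, and assuming DPDK's default 1:1 lcore-id-to-CPU-id mapping) that warns when two enabled lcores are hyper-threads of the same physical core, the situation Keith cautions against:

/*
 * Illustrative sketch: warn when two enabled DPDK lcores are
 * hyper-threads of the same physical core. Assumes the default 1:1
 * lcore-id-to-CPU-id mapping, so sysfs can be indexed by lcore id.
 */
#include <stdio.h>
#include <rte_lcore.h>

static int phys_core_id(unsigned int cpu)
{
    char path[128];
    int core = -1;
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%u/topology/core_id", cpu);
    f = fopen(path, "r");
    if (f != NULL) {
        if (fscanf(f, "%d", &core) != 1)
            core = -1;
        fclose(f);
    }
    return core;
}

static void warn_shared_physical_cores(void)
{
    unsigned int i, j;

    RTE_LCORE_FOREACH(i) {
        RTE_LCORE_FOREACH(j) {
            /* core_id is only unique within a socket, so check both. */
            if (i < j &&
                rte_lcore_to_socket_id(i) == rte_lcore_to_socket_id(j) &&
                phys_core_id(i) >= 0 &&
                phys_core_id(i) == phys_core_id(j))
                printf("lcores %u and %u share a physical core\n", i, j);
        }
    }
}

With the layout above, the original 16-queue qmap (0.0x003FF003F0) would trigger this warning for pairs such as lcores 4 and 24, which are the two hyper-threads of physical core 4 on socket 0.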