From: Take Ceara
Date: Thu, 16 Jun 2016 22:27:15 +0200
To: "Wiles, Keith"
Cc: "dev@dpdk.org"
Subject: Re: [dpdk-dev] Performance hit - NICs on different CPU sockets

On Thu, Jun 16, 2016 at 10:19 PM, Wiles, Keith wrote:
>
> On 6/16/16, 3:16 PM, "dev on behalf of Wiles, Keith" wrote:
>
>>
>>On 6/16/16, 3:00 PM, "Take Ceara" wrote:
>>
>>>On Thu, Jun 16, 2016 at 9:33 PM, Wiles, Keith wrote:
>>>> On 6/16/16, 1:20 PM, "Take Ceara" wrote:
>>>>
>>>>>On Thu, Jun 16, 2016 at 6:59 PM, Wiles, Keith wrote:
>>>>>>
>>>>>> On 6/16/16, 11:56 AM, "dev on behalf of Wiles, Keith" wrote:
>>>>>>
>>>>>>>
>>>>>>>On 6/16/16, 11:20 AM, "Take Ceara" wrote:
>>>>>>>
>>>>>>>>On Thu, Jun 16, 2016 at 5:29 PM, Wiles, Keith wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Right now I do not know what the issue is with the system. It could be
>>>>>>>>> too many Rx/Tx ring pairs per port limiting the memory in the NICs,
>>>>>>>>> which is why you get better performance when you have 8 cores per port.
>>>>>>>>> I am not really seeing the whole picture of how DPDK is configured,
>>>>>>>>> so I cannot help more. Sorry.
>>>>>>>>
>>>>>>>>I doubt that there is a limitation wrt running 16 cores per port vs 8
>>>>>>>>cores per port, as I've tried with two different machines connected
>>>>>>>>back to back, each with one X710 port and 16 cores running on that
>>>>>>>>port. In that case our performance doubled as expected.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Maybe seeing the DPDK command line would help.
>>>>>>>>
>>>>>>>>The command line I use with ports 01:00.3 and 81:00.3 is:
>>>>>>>>./warp17 -c 0xFFFFFFFFF3 -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>>>>>>>>--qmap 0.0x003FF003F0 --qmap 1.0x0FC00FFC00
>>>>>>>>
>>>>>>>>Our own qmap args allow the user to control exactly how cores are
>>>>>>>>split between ports. In this case we end up with:
>>>>>>>>
>>>>>>>>warp17> show port map
>>>>>>>>Port 0[socket: 0]:
>>>>>>>>   Core 4[socket:0] (Tx: 0, Rx: 0)
>>>>>>>>   Core 5[socket:0] (Tx: 1, Rx: 1)
>>>>>>>>   Core 6[socket:0] (Tx: 2, Rx: 2)
>>>>>>>>   Core 7[socket:0] (Tx: 3, Rx: 3)
>>>>>>>>   Core 8[socket:0] (Tx: 4, Rx: 4)
>>>>>>>>   Core 9[socket:0] (Tx: 5, Rx: 5)
>>>>>>>>   Core 20[socket:0] (Tx: 6, Rx: 6)
>>>>>>>>   Core 21[socket:0] (Tx: 7, Rx: 7)
>>>>>>>>   Core 22[socket:0] (Tx: 8, Rx: 8)
>>>>>>>>   Core 23[socket:0] (Tx: 9, Rx: 9)
>>>>>>>>   Core 24[socket:0] (Tx: 10, Rx: 10)
>>>>>>>>   Core 25[socket:0] (Tx: 11, Rx: 11)
>>>>>>>>   Core 26[socket:0] (Tx: 12, Rx: 12)
>>>>>>>>   Core 27[socket:0] (Tx: 13, Rx: 13)
>>>>>>>>   Core 28[socket:0] (Tx: 14, Rx: 14)
>>>>>>>>   Core 29[socket:0] (Tx: 15, Rx: 15)
>>>>>>>>
>>>>>>>>Port 1[socket: 1]:
>>>>>>>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>>>>>>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>>>>>>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>>>>>>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>>>>>>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>>>>>>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>>>>>>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>>>>>>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>>>>>>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>>>>>>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>>>>>>>   Core 30[socket:1] (Tx: 10, Rx: 10)
>>>>>>>>   Core 31[socket:1] (Tx: 11, Rx: 11)
>>>>>>>>   Core 32[socket:1] (Tx: 12, Rx: 12)
>>>>>>>>   Core 33[socket:1] (Tx: 13, Rx: 13)
>>>>>>>>   Core 34[socket:1] (Tx: 14, Rx: 14)
>>>>>>>>   Core 35[socket:1] (Tx: 15, Rx: 15)
>>>>>>>
>>>>>>>On each socket you have 10 physical cores, or 20 lcores per socket for
>>>>>>>40 lcores total.
>>>>>>>
>>>>>>>The above is listing the LCORES (or hyper-threads) and not COREs, which
>>>>>>>I understand some like to think are interchangeable. The problem is that
>>>>>>>hyper-threads are logically interchangeable, but not performance-wise.
>>>>>>>If you have two run-to-completion threads on a single physical core,
>>>>>>>each on a different hyper-thread of that core [0,1], then the second
>>>>>>>lcore or thread (1) on that physical core will only get at most about
>>>>>>>20-30% of the CPU cycles. Normally it is much less, unless you tune the
>>>>>>>code to make sure each thread is not trying to share the internal
>>>>>>>execution units, but some internal execution units are always shared.
>>>>>>>
>>>>>>>To get the best performance when hyper-threading is enabled, do not run
>>>>>>>both threads on a single physical core; run only hyper-thread 0.
>>>>>>>
>>>>>>>The table below lists the physical core id and the corresponding lcore
>>>>>>>ids per socket. Use the first lcore of each physical core for the best
>>>>>>>performance:
>>>>>>>Core 1 [1, 21] [11, 31]
>>>>>>>Use lcore 1 or 11 depending on the socket you are on.
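
(For reference: one way to build such a coremask automatically is to keep
only the first sibling of each physical core. Below is a minimal sketch in
plain C that reads the Linux sysfs topology; it is illustrative only, not
part of warp17 or DPDK, and it assumes at most 64 CPUs on the host.)

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    uint64_t mask = 0;

    for (int cpu = 0; cpu < 64; cpu++) {
        char path[128];
        int first_sibling;
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
                 cpu);
        f = fopen(path, "r");
        if (f == NULL)
            break;                       /* no more CPUs on this host */
        /* the list starts with the lowest-numbered sibling, e.g. "4,24" */
        if (fscanf(f, "%d", &first_sibling) == 1 && first_sibling == cpu)
            mask |= UINT64_C(1) << cpu;  /* keep only hyper-thread 0 */
        fclose(f);
    }
    printf("-c 0x%" PRIx64 "\n", mask);
    return 0;
}

(Compiled and run on the target host, this prints an EAL-style '-c' mask
that covers only hyper-thread 0 of every physical core.)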
>>>>>>>
>>>>>>>The info below is most likely the best performance and utilization of
>>>>>>>your system. If I got the values right ☺
>>>>>>>
>>>>>>>./warp17 -c 0x00000FFFe0 -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>>>>>>>--qmap 0.0x00000003FE --qmap 1.0x00000FFE00
>>>>>>>
>>>>>>>Port 0[socket: 0]:
>>>>>>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>>>>>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>>>>>>   Core 4[socket:0] (Tx: 2, Rx: 2)
>>>>>>>   Core 5[socket:0] (Tx: 3, Rx: 3)
>>>>>>>   Core 6[socket:0] (Tx: 4, Rx: 4)
>>>>>>>   Core 7[socket:0] (Tx: 5, Rx: 5)
>>>>>>>   Core 8[socket:0] (Tx: 6, Rx: 6)
>>>>>>>   Core 9[socket:0] (Tx: 7, Rx: 7)
>>>>>>>
>>>>>>>8 cores on the first socket, leaving lcores 0-1 for Linux.
>>>>>>
>>>>>> 9 cores, leaving the first core (or two lcores) for Linux.
>>>>>>>
>>>>>>>Port 1[socket: 1]:
>>>>>>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>>>>>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>>>>>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>>>>>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>>>>>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>>>>>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>>>>>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>>>>>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>>>>>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>>>>>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>>>>>>
>>>>>>>All 10 cores on the second socket.
>>>>>
>>>>>The values were almost right :) But that's because we reserve the
>>>>>first two lcores that are passed to DPDK for our own management part.
>>>>>I was aware that lcores are not physical cores, so we don't expect
>>>>>performance to scale linearly with the number of lcores. However, if
>>>>>there's a chance that another hyperthread can run while the paired one
>>>>>is stalling, we'd like to take advantage of those cycles if possible.
>>>>>
>>>>>Leaving that aside, I just ran two more tests while using only one of
>>>>>the two hwthreads in a core.
>>>>>
>>>>>a. 2 ports on different sockets with 8 cores/port:
>>>>>./build/warp17 -c 0xFF3FF -m 32768 -w 0000:81:00.3 -w 0000:01:00.3
>>>>>-- --qmap 0.0x3FC --qmap 1.0xFF000
>>>>>warp17> show port map
>>>>>Port 0[socket: 0]:
>>>>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>>>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>>>>   Core 4[socket:0] (Tx: 2, Rx: 2)
>>>>>   Core 5[socket:0] (Tx: 3, Rx: 3)
>>>>>   Core 6[socket:0] (Tx: 4, Rx: 4)
>>>>>   Core 7[socket:0] (Tx: 5, Rx: 5)
>>>>>   Core 8[socket:0] (Tx: 6, Rx: 6)
>>>>>   Core 9[socket:0] (Tx: 7, Rx: 7)
>>>>>
>>>>>Port 1[socket: 1]:
>>>>>   Core 12[socket:1] (Tx: 0, Rx: 0)
>>>>>   Core 13[socket:1] (Tx: 1, Rx: 1)
>>>>>   Core 14[socket:1] (Tx: 2, Rx: 2)
>>>>>   Core 15[socket:1] (Tx: 3, Rx: 3)
>>>>>   Core 16[socket:1] (Tx: 4, Rx: 4)
>>>>>   Core 17[socket:1] (Tx: 5, Rx: 5)
>>>>>   Core 18[socket:1] (Tx: 6, Rx: 6)
>>>>>   Core 19[socket:1] (Tx: 7, Rx: 7)
>>>>>
>>>>>This gives a session setup rate of only 2M sessions/s.
>>>>>
>>>>>b. 2 ports on socket 0 with 4 cores/port:
>>>>>./build/warp17 -c 0x3FF -m 32768 -w 0000:02:00.0 -w 0000:03:00.0 --
>>>>>--qmap 0.0x3C0 --qmap 1.0x03C
>>>>
>>>> One more thing to try: change the '-m 32768' to '--socket-mem 16384,16384'
>>>> to make sure the memory is split between the sockets. You may need to
>>>> remove the /dev/hugepages/* files or wherever you put them.
>>>>
>>>> What is the DPDK '-n' option set to on your system? Mine is set to '-n 4'.
>>>>
>>>
>>>I tried with '--socket-mem 16384,16384' but it doesn't make any
>>>difference. We call rte_malloc_socket anyway for everything that might
>>>be accessed in the fast path, and the mempools are per-core and created
>>>with the correct socket-id. Even when starting with '-m 32768' I see
>>>that 16 hugepages get allocated on each of the sockets.
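
(To make the per-core, socket-local pattern above concrete, here is a
minimal sketch using DPDK. The struct, pool name and sizes are illustrative
and are not taken from warp17; only the rte_* calls are real DPDK APIs.)

#include <stdio.h>
#include <stdint.h>
#include <rte_lcore.h>
#include <rte_malloc.h>
#include <rte_memory.h>
#include <rte_mempool.h>
#include <rte_mbuf.h>

/* Hypothetical per-core fast-path state; the fields are placeholders. */
struct core_state {
    uint64_t pkts_rx;
    uint64_t pkts_tx;
};

/* Allocate the lcore's state and a private mbuf pool on its own socket. */
static struct rte_mempool *
setup_core(unsigned lcore, struct core_state **state)
{
    int socket = rte_lcore_to_socket_id(lcore);
    char name[RTE_MEMPOOL_NAMESIZE];

    /* fast-path state lives on the lcore's local socket */
    *state = rte_malloc_socket("core_state", sizeof(**state),
                               RTE_CACHE_LINE_SIZE, socket);

    /* one private mbuf pool per lcore, also on the local socket */
    snprintf(name, sizeof(name), "mbuf_pool_%u", lcore);
    return rte_pktmbuf_pool_create(name, 16383, 256, 0,
                                   RTE_MBUF_DEFAULT_BUF_SIZE, socket);
}

(With this layout each lcore allocates and frees mbufs only from its own
pool on its local socket, matching the no-contention setup described
elsewhere in this thread.)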
>>>
>>>On the test server I have 4 memory channels, so '-n 4'.
>>>
>>>>>warp17> show port map
>>>>>Port 0[socket: 0]:
>>>>>   Core 6[socket:0] (Tx: 0, Rx: 0)
>>>>>   Core 7[socket:0] (Tx: 1, Rx: 1)
>>>>>   Core 8[socket:0] (Tx: 2, Rx: 2)
>>>>>   Core 9[socket:0] (Tx: 3, Rx: 3)
>>>>>
>>>>>Port 1[socket: 0]:
>>>>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>>>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>>>>   Core 4[socket:0] (Tx: 2, Rx: 2)
>>>>>   Core 5[socket:0] (Tx: 3, Rx: 3)
>>
>>I do not know now. It seems like something else is going on here that we
>>have not identified.
>
> Maybe VTune or some other type of performance debugging tool would be the
> next step here.
>

Thanks for the patience, Keith. I'll try some profiling and see where it
takes us from there. I'll update this thread when I have some new info.

Regards,
Dumitru

>>>>>
>>>>>Surprisingly, this gives a session setup rate of 3M sess/s!!
>>>>>
>>>>>The packet processing cores are totally independent and only access
>>>>>local socket memory/ports.
>>>>>There is no locking or atomic variable access in the fast path in our code.
>>>>>The mbuf pools are not shared between cores handling the same port, so
>>>>>there should be no contention when allocating/freeing mbufs.
>>>>>In this specific test scenario all the cores handling port 0 are
>>>>>essentially executing the same code (TCP clients) and the cores on
>>>>>port 1 as well (TCP servers).
>>>>>
>>>>>Do you have any tips about what other things to check for?
>>>>>
>>>>>Thanks,
>>>>>Dumitru
>>>>>
>>>>>>>
>>>>>>>++Keith
>>>>>>>
>>>>>>>>
>>>>>>>>Just for reference, the cpu_layout script shows:
>>>>>>>>$ $RTE_SDK/tools/cpu_layout.py
>>>>>>>>============================================================
>>>>>>>>Core and Socket Information (as reported by '/proc/cpuinfo')
>>>>>>>>============================================================
>>>>>>>>
>>>>>>>>cores = [0, 1, 2, 3, 4, 8, 9, 10, 11, 12]
>>>>>>>>sockets = [0, 1]
>>>>>>>>
>>>>>>>>        Socket 0       Socket 1
>>>>>>>>        --------       --------
>>>>>>>>Core 0  [0, 20]        [10, 30]
>>>>>>>>Core 1  [1, 21]        [11, 31]
>>>>>>>>Core 2  [2, 22]        [12, 32]
>>>>>>>>Core 3  [3, 23]        [13, 33]
>>>>>>>>Core 4  [4, 24]        [14, 34]
>>>>>>>>Core 8  [5, 25]        [15, 35]
>>>>>>>>Core 9  [6, 26]        [16, 36]
>>>>>>>>Core 10 [7, 27]        [17, 37]
>>>>>>>>Core 11 [8, 28]        [18, 38]
>>>>>>>>Core 12 [9, 29]        [19, 39]
>>>>>>>>
>>>>>>>>I know it might be complicated to figure out exactly what's happening
>>>>>>>>in our setup with our own code, so please let me know if you need
>>>>>>>>additional information.
>>>>>>>>
>>>>>>>>I appreciate the help!
>>>>>>>>
>>>>>>>>Thanks,
>>>>>>>>Dumitru
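
(As a follow-up reference: one quick way to confirm how the EAL hugepage
memory actually ended up split across the two sockets is to read the
per-node counters from sysfs. Below is a standalone C sketch; the 1 GB
hugepage size and the two-node layout are assumptions about this host.)

#include <stdio.h>

int main(void)
{
    for (int node = 0; node < 2; node++) {
        const char *files[] = { "nr_hugepages", "free_hugepages" };

        for (int i = 0; i < 2; i++) {
            char path[160];
            unsigned long pages = 0;
            FILE *f;

            snprintf(path, sizeof(path),
                     "/sys/devices/system/node/node%d/hugepages/"
                     "hugepages-1048576kB/%s", node, files[i]);
            f = fopen(path, "r");
            if (f == NULL)
                continue;            /* node or page size not present */
            if (fscanf(f, "%lu", &pages) == 1)
                printf("node%d %s: %lu x 1GB\n", node, files[i], pages);
            fclose(f);
        }
    }
    return 0;
}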