From: Take Ceara
Date: Thu, 16 Jun 2016 22:00:22 +0200
To: "Wiles, Keith"
Cc: "dev@dpdk.org"
Subject: Re: [dpdk-dev] Performance hit - NICs on different CPU sockets

On Thu, Jun 16, 2016 at 9:33 PM, Wiles, Keith wrote:
> On 6/16/16, 1:20 PM, "Take Ceara" wrote:
>
>>On Thu, Jun 16, 2016 at 6:59 PM, Wiles, Keith wrote:
>>>
>>> On 6/16/16, 11:56 AM, "dev on behalf of Wiles, Keith" wrote:
>>>
>>>>
>>>>On 6/16/16, 11:20 AM, "Take Ceara" wrote:
>>>>
>>>>>On Thu, Jun 16, 2016 at 5:29 PM, Wiles, Keith wrote:
>>>>>
>>>>>>
>>>>>> Right now I do not know what the issue is with the system. It could be too many Rx/Tx ring pairs per port limiting the memory in the NICs, which is why you get better performance when you have 8 cores per port. I am not really seeing the whole picture of how DPDK is configured, so I cannot help more. Sorry.
>>>>>
>>>>>I doubt that there is a limitation with running 16 cores per port vs 8
>>>>>cores per port: I tried with two different machines connected back to
>>>>>back, each with one X710 port and 16 cores running on that port, and in
>>>>>that case our performance doubled as expected.
>>>>>
>>>>>>
>>>>>> Maybe seeing the DPDK command line would help.
>>>>>
>>>>>The command line I use with ports 01:00.3 and 81:00.3 is:
>>>>>./warp17 -c 0xFFFFFFFFF3 -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>>>>>--qmap 0.0x003FF003F0 --qmap 1.0x0FC00FFC00
>>>>>
>>>>>Our own qmap args allow the user to control exactly how cores are
>>>>>split between ports. In this case we end up with:
>>>>>
>>>>>warp17> show port map
>>>>>Port 0[socket: 0]:
>>>>>   Core 4[socket:0] (Tx: 0, Rx: 0)
>>>>>   Core 5[socket:0] (Tx: 1, Rx: 1)
>>>>>   Core 6[socket:0] (Tx: 2, Rx: 2)
>>>>>   Core 7[socket:0] (Tx: 3, Rx: 3)
>>>>>   Core 8[socket:0] (Tx: 4, Rx: 4)
>>>>>   Core 9[socket:0] (Tx: 5, Rx: 5)
>>>>>   Core 20[socket:0] (Tx: 6, Rx: 6)
>>>>>   Core 21[socket:0] (Tx: 7, Rx: 7)
>>>>>   Core 22[socket:0] (Tx: 8, Rx: 8)
>>>>>   Core 23[socket:0] (Tx: 9, Rx: 9)
>>>>>   Core 24[socket:0] (Tx: 10, Rx: 10)
>>>>>   Core 25[socket:0] (Tx: 11, Rx: 11)
>>>>>   Core 26[socket:0] (Tx: 12, Rx: 12)
>>>>>   Core 27[socket:0] (Tx: 13, Rx: 13)
>>>>>   Core 28[socket:0] (Tx: 14, Rx: 14)
>>>>>   Core 29[socket:0] (Tx: 15, Rx: 15)
>>>>>
>>>>>Port 1[socket: 1]:
>>>>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>>>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>>>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>>>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>>>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>>>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>>>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>>>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>>>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>>>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>>>>   Core 30[socket:1] (Tx: 10, Rx: 10)
>>>>>   Core 31[socket:1] (Tx: 11, Rx: 11)
>>>>>   Core 32[socket:1] (Tx: 12, Rx: 12)
>>>>>   Core 33[socket:1] (Tx: 13, Rx: 13)
>>>>>   Core 34[socket:1] (Tx: 14, Rx: 14)
>>>>>   Core 35[socket:1] (Tx: 15, Rx: 15)
>>>>
>>>>On each socket you have 10 physical cores, i.e. 20 lcores per socket and 40 lcores total.
>>>>
>>>>The listing above shows LCORES (hyper-threads), not COREs, which I understand some like to think of as interchangeable. The problem is that hyper-threads are logically interchangeable, but not performance-wise. If you have two run-to-completion threads on a single physical core, each on a different hyper-thread of that core [0,1], then the second lcore or thread (1) on that physical core will only get at most about 20-30% of the CPU cycles. Normally it is much less, unless you tune the code to make sure the two threads are not trying to share the internal execution units, but some internal execution units are always shared.
>>>>
>>>>To get the best performance when hyper-threading is enabled, do not run both threads on a single physical core; run only the first hyper-thread of each core.
>>>>
>>>>The table below lists the physical core id and its lcore ids on each socket. Use the first lcore of each core for the best performance:
>>>>Core 1    [1, 21]    [11, 31]
>>>>Use lcore 1 or 11 depending on the socket you are on.
>>>>
>>>>The info below is most likely the best performance and utilization of your system.
>>>>If I got the values right ☺
>>>>
>>>>./warp17 -c 0x00000FFFe0 -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>>>>--qmap 0.0x00000003FE --qmap 1.0x00000FFE00
>>>>
>>>>Port 0[socket: 0]:
>>>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>>>   Core 4[socket:0] (Tx: 2, Rx: 2)
>>>>   Core 5[socket:0] (Tx: 3, Rx: 3)
>>>>   Core 6[socket:0] (Tx: 4, Rx: 4)
>>>>   Core 7[socket:0] (Tx: 5, Rx: 5)
>>>>   Core 8[socket:0] (Tx: 6, Rx: 6)
>>>>   Core 9[socket:0] (Tx: 7, Rx: 7)
>>>>
>>>>8 cores on the first socket, leaving 0-1 lcores for Linux.
>>>
>>> 9 cores, leaving the first core (two lcores) for Linux.
>>>>
>>>>Port 1[socket: 1]:
>>>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>>>
>>>>All 10 cores on the second socket.
>>
>>The values were almost right :) But that's because we reserve the
>>first two lcores that are passed to DPDK for our own management part.
>>I was aware that lcores are not physical cores, so we don't expect
>>performance to scale linearly with the number of lcores. However, if
>>there's a chance that another hyper-thread can run while the paired one
>>is stalling, we'd like to take advantage of those cycles if possible.
>>
>>Leaving that aside, I just ran two more tests using only one of the
>>two hyper-threads per core.
>>
>>a. 2 ports on different sockets with 8 cores/port:
>>./build/warp17 -c 0xFF3FF -m 32768 -w 0000:81:00.3 -w 0000:01:00.3
>>-- --qmap 0.0x3FC --qmap 1.0xFF000
>>warp17> show port map
>>Port 0[socket: 0]:
>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>   Core 4[socket:0] (Tx: 2, Rx: 2)
>>   Core 5[socket:0] (Tx: 3, Rx: 3)
>>   Core 6[socket:0] (Tx: 4, Rx: 4)
>>   Core 7[socket:0] (Tx: 5, Rx: 5)
>>   Core 8[socket:0] (Tx: 6, Rx: 6)
>>   Core 9[socket:0] (Tx: 7, Rx: 7)
>>
>>Port 1[socket: 1]:
>>   Core 12[socket:1] (Tx: 0, Rx: 0)
>>   Core 13[socket:1] (Tx: 1, Rx: 1)
>>   Core 14[socket:1] (Tx: 2, Rx: 2)
>>   Core 15[socket:1] (Tx: 3, Rx: 3)
>>   Core 16[socket:1] (Tx: 4, Rx: 4)
>>   Core 17[socket:1] (Tx: 5, Rx: 5)
>>   Core 18[socket:1] (Tx: 6, Rx: 6)
>>   Core 19[socket:1] (Tx: 7, Rx: 7)
>>
>>This gives a session setup rate of only 2M sessions/s.
>>
>>b. 2 ports on socket 0 with 4 cores/port:
>>./build/warp17 -c 0x3FF -m 32768 -w 0000:02:00.0 -w 0000:03:00.0 --
>>--qmap 0.0x3C0 --qmap 1.0x03C
>
> One more thing to try: change the -m 32768 to --socket-mem 16384,16384 to make sure the memory is split between the sockets. You may need to remove the /dev/hugepages/* files, or wherever you put them.
>
> What is the DPDK -n option set to on your system? Mine is set to '-n 4'.
>

I tried with --socket-mem 16384,16384 but it doesn't make any difference.
In any case, we call rte_malloc_socket for everything that might be
accessed in the fast path, and the mempools are per core and created with
the correct socket id. Even when starting with '-m 32768' I see that 16
hugepages get allocated on each of the sockets.

On the test server I have 4 memory channels, so '-n 4'.
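
For reference, below is a simplified sketch of how a per-core mbuf pool
gets its socket id; the function name, pool size and cache size are made
up for illustration and are not our exact code:

#include <stdint.h>
#include <stdio.h>

#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Illustrative only: name, pool size and cache size are not the real
 * warp17 values. */
static struct rte_mempool *
pkt_pool_create(uint8_t port, unsigned int lcore)
{
        char name[64];
        /* Create the pool on the socket the NIC is attached to, so mbuf
         * alloc/free in the fast path stays NUMA-local. */
        int socket = rte_eth_dev_socket_id(port);

        snprintf(name, sizeof(name), "pkt-pool-p%u-c%u",
                 (unsigned int)port, lcore);
        return rte_pktmbuf_pool_create(name, 32768, 256, 0,
                                       RTE_MBUF_DEFAULT_BUF_SIZE, socket);
}

The rest of the fast-path state is allocated in the same spirit, just
with rte_malloc_socket instead.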
>>warp17> show port map
>>Port 0[socket: 0]:
>>   Core 6[socket:0] (Tx: 0, Rx: 0)
>>   Core 7[socket:0] (Tx: 1, Rx: 1)
>>   Core 8[socket:0] (Tx: 2, Rx: 2)
>>   Core 9[socket:0] (Tx: 3, Rx: 3)
>>
>>Port 1[socket: 0]:
>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>   Core 4[socket:0] (Tx: 2, Rx: 2)
>>   Core 5[socket:0] (Tx: 3, Rx: 3)
>>
>>Surprisingly, this gives a session setup rate of 3M sessions/s!
>>
>>The packet processing cores are totally independent and only access
>>local socket memory/ports. There is no locking or atomic variable
>>access in the fast path in our code. The mbuf pools are not shared
>>between cores handling the same port, so there should be no contention
>>when allocating/freeing mbufs. In this specific test scenario all the
>>cores handling port 0 are essentially executing the same code (TCP
>>clients), and the cores on port 1 as well (TCP servers).
>>
>>Do you have any tips about what other things to check for?
>>
>>Thanks,
>>Dumitru
>>
>>>>
>>>>++Keith
>>>>
>>>>>Just for reference, the cpu_layout script shows:
>>>>>$ $RTE_SDK/tools/cpu_layout.py
>>>>>============================================================
>>>>>Core and Socket Information (as reported by '/proc/cpuinfo')
>>>>>============================================================
>>>>>
>>>>>cores = [0, 1, 2, 3, 4, 8, 9, 10, 11, 12]
>>>>>sockets = [0, 1]
>>>>>
>>>>>         Socket 0    Socket 1
>>>>>         --------    --------
>>>>>Core 0   [0, 20]     [10, 30]
>>>>>Core 1   [1, 21]     [11, 31]
>>>>>Core 2   [2, 22]     [12, 32]
>>>>>Core 3   [3, 23]     [13, 33]
>>>>>Core 4   [4, 24]     [14, 34]
>>>>>Core 8   [5, 25]     [15, 35]
>>>>>Core 9   [6, 26]     [16, 36]
>>>>>Core 10  [7, 27]     [17, 37]
>>>>>Core 11  [8, 28]     [18, 38]
>>>>>Core 12  [9, 29]     [19, 39]
>>>>>
>>>>>I know it might be complicated to figure out exactly what's happening
>>>>>in our setup with our own code, so please let me know if you need
>>>>>additional information.
>>>>>
>>>>>I appreciate the help!
>>>>>
>>>>>Thanks,
>>>>>Dumitru
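
P.S. For anyone cross-checking the qmap/coremask values against the
cpu_layout output quoted above, a throwaway helper along these lines
expands a hex mask into lcore ids (purely illustrative, not warp17 code):

#include <stdint.h>
#include <stdio.h>

/* Print the lcore ids selected by a coremask/qmap value,
 * e.g. 0x3FC selects lcores 2..9. */
static void print_lcores(uint64_t mask)
{
        unsigned int lcore;

        for (lcore = 0; lcore < 64; lcore++)
                if (mask & (1ULL << lcore))
                        printf("lcore %u\n", lcore);
}

int main(void)
{
        print_lcores(0x3FC);    /* --qmap 0.0x3FC from test a. above */
        return 0;
}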