From: Jesper Wramberg
To: Olga Shern
Cc: users@dpdk.org
Date: Mon, 2 Nov 2015 14:57:12 +0100
Subject: Re: [dpdk-users] Low TX performance on Mellanox ConnectX-3 NIC

Hi again,

Sorry I missed your first email. Wow, I can't believe I missed that: I read
the output from raw_ethernet_bw as Mbit/s :-( That's kind of embarrassing.
You are right, my calculations are wrong. Sorry for bothering you with my
bad math.

For what it's worth, I have spent quite some time wondering what was wrong.
I still have some way to go, though, since my original problems started in
a much larger, more complicated setup. But I'm glad this basic Tx/Rx setup
works as expected.

Thank you, best regards
Jesper
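The arithmetic behind that correction is easy to reproduce. A quick sketch in
plain Python; the packet count, packet size and duration are the figures from
the first test quoted further down the thread, nothing else is assumed:

    # 640 M packets of 1480 bytes took 192.52 s in the first test.
    packets = 640e6          # packets sent
    size    = 1480           # bytes per packet
    secs    = 192.52         # measured duration

    pps  = packets / secs            # ~3.32 Mpps
    gBps = pps * size / 1e9          # ~4.92 GB/s -- the "4.9" was bytes, not bits
    gbps = gBps * 8                  # ~39.4 Gbit/s of frame data

    print("%.2f Mpps  %.2f GB/s  %.1f Gbit/s" % (pps / 1e6, gBps, gbps))

So the "4.9" in the original report was gigabytes per second rather than
gigabits, which would put the link close to saturation, assuming it runs at
40GbE.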
2015-11-02 13:57 GMT+01:00 Jesper Wramberg:

> Hey,
>
> As a follow-up, I tried changing interrupts around, with no change in the
> achieved speed. Also, after some iperf testing with 10 threads it would
> seem impossible to get above 10G of bandwidth.
>
> I did get an interesting output from "perf top -p <pid>" while running
> the raw_ethernet_bw script, however:
>
>  37.22%  libpthread-2.17.so  [.] pthread_spin_lock
>  10.00%  libmlx4-rdmav2.so   [.] 0x000000000000b05a
>   1.20%  libmlx4-rdmav2.so   [.] 0x000000000000b3ec
>   1.07%  libmlx4-rdmav2.so   [.] 0x000000000000b06c
>   1.07%  libmlx4-rdmav2.so   [.] 0x000000000000afc0
>   1.06%  raw_ethernet_bw     [.] 0x000000000001484f
>   1.06%  raw_ethernet_bw     [.] 0x0000000000014869
>   1.06%  raw_ethernet_bw     [.] 0x00000000000142ec
>   1.05%  libmlx4-rdmav2.so   [.] 0x000000000000b41c
>   1.05%  libmlx4-rdmav2.so   [.] 0x000000000000aff6
>   1.05%  raw_ethernet_bw     [.] 0x0000000000014f09
>   1.03%  libmlx4-rdmav2.so   [.] 0x0000000000005a60
>   1.03%  libmlx4-rdmav2.so   [.] 0x000000000000be51
>   1.03%  libpthread-2.17.so  [.] pthread_spin_unlock
>   1.01%  libmlx4-rdmav2.so   [.] 0x000000000000afdc
>   1.00%  libmlx4-rdmav2.so   [.] 0x000000000000b042
>   1.00%  raw_ethernet_bw     [.] 0x0000000000014314
>   0.98%  libmlx4-rdmav2.so   [.] 0x000000000000bf38
>   0.97%  libmlx4-rdmav2.so   [.] 0x000000000000b3d2
>   0.97%  raw_ethernet_bw     [.] 0x00000000000142a4
>   0.96%  raw_ethernet_bw     [.] 0x0000000000014282
>   0.96%  libmlx4-rdmav2.so   [.] 0x000000000000b415
>   0.96%  raw_ethernet_bw     [.] 0x000000000001425e
>
> I wonder whether the tool is supposed to spend that much time in
> pthread_spin_lock.
>
> Best regards,
> Jesper
>
> 2015-11-02 11:59 GMT+01:00 Jesper Wramberg:
>
>> Hi again,
>>
>> Thank you for your input. I have now switched to using the
>> raw_ethernet_bw script as transmitter and testpmd as receiver. An
>> immediate result is that the raw_ethernet_bw tool achieves very similar
>> TX performance to my DPDK transmitter.
>>
>> (Note: both CPU 10 and mlx4_0 are on the same NUMA node, as intended.)
>>
>> taskset -c 10 raw_ethernet_bw --client -d mlx4_0 -i 2 -l 3 --duration 20
>> -s 1480 --dest_mac F4:52:14:7A:59:80
>>
>> ---------------------------------------------------------------------------------------
>>  Post List requested - CQ moderation will be the size of the post list
>> ---------------------------------------------------------------------------------------
>>                      Send Post List BW Test
>>  Dual-port       : OFF          Device         : mlx4_0
>>  Number of qps   : 1            Transport type : IB
>>  Connection type : RawEth       Using SRQ      : OFF
>>  TX depth        : 128
>>  Post List       : 3
>>  CQ Moderation   : 3
>>  Mtu             : 1518[B]
>>  Link type       : Ethernet
>>  Gid index       : 0
>>  Max inline data : 0[B]
>>  rdma_cm QPs     : OFF
>>  Data ex. method : Ethernet
>> ---------------------------------------------------------------------------------------
>>  **raw ethernet header****************************************
>>  --------------------------------------------------------------
>>  | Dest MAC         | Src MAC          | Packet Type |
>>  |------------------------------------------------------------|
>>  | F4:52:14:7A:59:80| E6:1D:2D:11:FF:41| DEFAULT     |
>>  |------------------------------------------------------------|
>> ---------------------------------------------------------------------------------------
>>  #bytes   #iterations   BW peak[MB/sec]   BW average[MB/sec]   MsgRate[Mpps]
>>  1480     33242748      0.00              4691.58              3.323974
>> ---------------------------------------------------------------------------------------
>>
>> Running it with the 64-byte packets Olga specified gives the following
>> result:
>>
>> ---------------------------------------------------------------------------------------
>>  #bytes   #iterations   BW peak[MB/sec]   BW average[MB/sec]   MsgRate[Mpps]
>>  64       166585650     0.00              1016.67              16.657163
>> ---------------------------------------------------------------------------------------
>>
>> The results are the same with and without flow control. I have followed
>> the Mellanox DPDK QSG and done everything in the performance section
>> (except the parts regarding interrupts).
>>
>> So, to answer Olga's questions :-)
>>
>> 1: Unfortunately I can't. If I try, the FW update complains because the
>> cards came with a Dell configuration (PSID: DEL0A70000023).
>>
>> 2: In my final setup I need jumbo frames, but just for the sake of
>> testing I tried changing CONFIG_RTE_LIBRTE_MLX4_SGE_WR_N to 1 in the
>> DPDK config. This did not really change anything, neither in my initial
>> setup nor in the one described above.
>>
>> 3: In the final setup I plan to share the NICs between multiple
>> independent processes. For this reason I wanted to use SR-IOV and
>> whitelist a single VF to each process. For the tests above, however, I
>> have used the PFs for simplicity.
>> (Side note: I discovered that multiple DPDK instances can use the same
>> PCI address, which might eliminate the need for SR-IOV. I wonder how
>> that works :-))
>>
>> So, conclusively: isn't the raw_ethernet_bw tool supposed to show a
>> larger output BW with 1480-byte packets?
>>
>> I have a sysinfo dump made with the Mellanox sysinfo-snapshot.py script.
>> I can mail it to anyone who has the time to look further into it.
>>
>> Thank you for your help, best regards
>> Jesper
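Checked against line rate, the raw_ethernet_bw figures quoted above look
correct rather than low. A small sketch, assuming a 40GbE link and roughly
20 bytes of preamble/SFD/inter-frame gap per packet on the wire; the Mpps
values are the ones reported above, and perftest's "MB/sec" appears to be
2^20 bytes per second, which makes 4691.58 MB/sec about 39.4 Gbit/s:

    # Compare the measured message rates with theoretical 40GbE line rate.
    LINK_BPS = 40e9      # assumed link speed, bits per second
    OVERHEAD = 20        # assumed per-packet wire overhead in bytes

    for size, measured_mpps in ((1480, 3.323974), (64, 16.657163)):
        line_rate_mpps = LINK_BPS / 8 / (size + OVERHEAD) / 1e6
        print("%4d B: line rate %6.2f Mpps, measured %6.2f Mpps"
              % (size, line_rate_mpps, measured_mpps))

Under those assumptions the 1480-byte run is essentially at line rate, and
the 64-byte run falling well short of the theoretical ~59.5 Mpps is not
unusual for a single queue.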
>> 2015-11-01 11:05 GMT+01:00 Olga Shern:
>>
>>> Hi Jesper,
>>>
>>> Several suggestions:
>>>
>>> 1. Any chance you can install the latest FW from the Mellanox web site,
>>>    or the one included in the OFED 3.1 version you downloaded? The
>>>    latest version is 2.35.5100.
>>> 2. Please configure SGE_NUM=1 in the DPDK config file if you don't need
>>>    jumbo frames. This will improve performance.
>>> 3. It is not clear from your description whether you are running DPDK
>>>    in a VM. Are you using SR-IOV?
>>> 4. I suggest you first run the testpmd application. The traffic
>>>    generator can be the raw_ethernet_bw application that comes with
>>>    MLNX_OFED; it can generate L2, IPv4 and TCP/UDP packets.
>>>    For example:
>>>        taskset -c 10 raw_ethernet_bw --client -d mlx4_0 -i 1 -l 3 --duration 10 -s 64 --dest_mac F4:52:14:7A:59:80 &
>>>    This will send L2 packets via mlx4_0 NIC port 1, packet size = 64,
>>>    for 10 seconds, batch = 3 (-l). You can then check the performance
>>>    from the testpmd counters.
>>>
>>> Please check the Mellanox community posts, I think they can help you:
>>> https://community.mellanox.com/docs/DOC-1502
>>>
>>> We also have performance suggestions in our QSG:
>>> http://www.mellanox.com/related-docs/prod_software/MLNX_DPDK_Quick_Start_Guide_v2%201_1%201.pdf
>>>
>>> Best Regards,
>>> Olga
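One item from the QSG performance checklist that is easy to script is the
NUMA-locality check behind the note above about CPU 10 and mlx4_0 being on
the same node. A minimal sketch, assuming a standard sysfs layout; the
device name mlx4_0 and core 10 are simply the values used in this thread:

    # Check that the core passed to taskset sits on the same NUMA node as
    # the Mellanox device, using standard sysfs attributes.
    import glob
    import os

    IB_DEV = "mlx4_0"
    CORE = 10

    def expand_cpulist(text):
        # Expand a sysfs cpulist such as "0-13,28-41" into a set of ints.
        cpus = set()
        for chunk in text.strip().split(","):
            if "-" in chunk:
                lo, hi = chunk.split("-")
                cpus.update(range(int(lo), int(hi) + 1))
            elif chunk:
                cpus.add(int(chunk))
        return cpus

    with open("/sys/class/infiniband/%s/device/numa_node" % IB_DEV) as f:
        dev_node = int(f.read())

    core_node = None
    for path in glob.glob("/sys/devices/system/node/node*/cpulist"):
        node = int(os.path.basename(os.path.dirname(path))[len("node"):])
        with open(path) as f:
            if CORE in expand_cpulist(f.read()):
                core_node = node

    print("%s is on NUMA node %d, core %d is on node %s"
          % (IB_DEV, dev_node, CORE, core_node))

If the two node numbers differ, cross-socket traffic alone can cost
noticeable throughput on a dual-socket box like the R630 described below.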
>>> Subject: [dpdk-users] Low TX performance on Mellanox ConnectX-3 NIC
>>> Date: Saturday, 31 October 2015, 09:54:04
>>> From: Jesper Wramberg <jesper.wramberg@gmail.com>
>>> To: users@dpdk.org
>>>
>>> Hi all,
>>>
>>> I am experiencing some performance issues in a somewhat custom setup
>>> with two Mellanox ConnectX-3 NICs. I realize these issues might be due
>>> to the setup, but I was hoping someone might be able to pinpoint some
>>> possible problems/bottlenecks.
>>>
>>> The server:
>>> I have a Dell PowerEdge R630 with two Mellanox ConnectX-3 NICs (one on
>>> each socket) and a minimal CentOS 7.1.1503 install with kernel
>>> 3.10.0-229. Note that this kernel is rebuilt with most things disabled
>>> to minimize size, etc. It does have InfiniBand enabled, however, and
>>> mlx4_core as a module (since nothing works otherwise). Finally, I have
>>> connected the two NICs from port 2 to port 2.
>>>
>>> The firmware:
>>> I have installed the latest firmware for the NICs from Dell, which is
>>> 2.34.5060.
>>>
>>> The drivers, modules, etc.:
>>> I have downloaded the Mellanox OFED 3.1 package for CentOS 7.1 and used
>>> its rebuild feature to build it against the custom kernel. I installed
>>> it with the --basic option, since I just want libibverbs, libmlx4, the
>>> kernel modules and the openibd service. mlx4_core.conf is set for
>>> Ethernet on all ports. Moreover, it is configured for flow steering
>>> mode -7 and a few VFs. I can restart the openibd service successfully
>>> and everything seems to be working; ibdev2netdev reports the NICs and
>>> their VFs, etc. The only problem I have encountered at this stage is
>>> that the links don't always come up unless I unplug and re-plug the
>>> cables.
>>>
>>> DPDK setup:
>>> I have built DPDK with the mlx4 PMD using the .h/.a files from the OFED
>>> package, with default values for everything. Running the simple hello
>>> world example, I can see that everything is initialized correctly, etc.
>>>
>>> Test setup:
>>> To test the performance of the NICs I use the following setup: two
>>> processes, P1 and P2, running on NIC A, and two other processes, P3 and
>>> P4, running on NIC B. All processes use virtual functions on their
>>> respective NICs. Depending on the test, the processes either transmit
>>> or receive data. To transmit, I use a simple DPDK program which
>>> generates 32000 packets and transmits them over and over until it has
>>> sent 640 million packets. Similarly, I use a simple DPDK program to
>>> receive, which is basically the layer 2 forwarding example without
>>> re-transmission.
>>>
>>> First test:
>>> In my first test, P1 transmits data to P3 while the other processes
>>> are idle.
>>> Packet size: 1480-byte packets
>>> Flow control: on/off, I get the same result either way.
>>> Result: P3 receives all packets, but it takes 192.52 seconds ~ 3.32
>>> Mpps ~ 4.9 Gbit/s.
>>>
>>> Second test:
>>> In my second test, I attempt to increase the amount of data transmitted
>>> over NIC A, so P1 transmits data to P3 while P2 transmits data to P4.
>>> Packet size: 1480-byte packets
>>> Flow control: on/off, I get the same result either way.
>>> Result: P3 and P4 receive all packets, but it takes 364.40 seconds ~
>>> 1.75 Mpps ~ 2.6 Gbit/s for a single process to get its data
>>> transmitted.
>>>
>>> Does anyone have any idea what I am doing wrong here? In the second
>>> test I would expect P1 to transmit at the same speed as in the first
>>> test; it seems there is a bottleneck somewhere, however. I have left
>>> most things at their default values but have also tried tweaking queue
>>> sizes, number of queues, interrupts, etc., with no luck.
>>>
>>> Best Regards,
>>> Jesper
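Read together with the correction at the top of the thread, the second test
also fits the picture. A quick sketch of the same arithmetic; the durations
and packet counts are the ones quoted above, and the interpretation assumes
a single sender already saturates the port:

    # Second test: two senders share NIC A, each sending 640 M packets of
    # 1480 bytes, and each takes 364.40 s instead of 192.52 s.
    packets = 640e6
    secs_single = 192.52      # first test, one sender
    secs_dual   = 364.40      # second test, per sender

    single_mpps   = packets / secs_single / 1e6   # ~3.32 Mpps
    per_proc_mpps = packets / secs_dual / 1e6     # ~1.76 Mpps per process
    aggregate     = 2 * per_proc_mpps             # ~3.51 Mpps over the port

    print("single sender %.2f Mpps, per process %.2f Mpps, aggregate %.2f Mpps"
          % (single_mpps, per_proc_mpps, aggregate))

Each sender dropping to roughly half the single-sender rate is what you
would expect when two flows share a port that one flow can already fill;
the aggregate stays in the same ~3.3-3.5 Mpps range (the small excess over
the single-sender figure presumably reflects the two runs not overlapping
for their full duration).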