DPDK usage discussions
* Re: [dpdk-users] Low TX performance on Mellanox ConnectX-3 NIC
       [not found] <36170571.QMO0L8HZgB@xps13>
@ 2015-11-01 10:05 ` Olga Shern
  2015-11-02 10:59   ` Jesper Wramberg
  0 siblings, 1 reply; 7+ messages in thread
From: Olga Shern @ 2015-11-01 10:05 UTC (permalink / raw)
  To: jesper.wramberg, users

Hi Jesper, 

Several suggestions:
1.	Any chance you can install the latest FW from the Mellanox web site, or the one included in the OFED 3.1 version you downloaded? The latest version is 2.35.5100.
2.	Please configure SGE_NUM=1 in the DPDK config file if you don't need jumbo frames. This will improve performance.
3.	It is not clear from your description whether you are running DPDK in a VM. Are you using SR-IOV?
4.	I suggest you first run the testpmd application. The traffic generator can be the raw_ethernet_bw application that comes with MLNX_OFED; it can generate L2, IPv4 and TCP/UDP packets.
	For example:  taskset -c 10 raw_ethernet_bw --client -d mlx4_0 -i 1 -l 3 --duration 10 -s 64 --dest_mac F4:52:14:7A:59:80 &
	This will send L2 packets via mlx4_0 NIC port 1, packet size = 64, for 10 sec, batch = 3 (-l).
	You can then see the performance from the testpmd counters (a receiver-side sketch follows below).
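	As a minimal receiver-side sketch, assuming testpmd from your DPDK build (the core mask and memory-channel count here are only illustrative placeholders, not a required configuration):

	testpmd -c 0x3 -n 4 -- -i        # start testpmd interactively on the receiving host
	testpmd> set fwd rxonly          # count received packets without forwarding them back
	testpmd> start
	testpmd> show port stats all     # check the RX counters once the generator has finished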

Please check the Mellanox community posts, I think they can help you.
https://community.mellanox.com/docs/DOC-1502

We also have performance suggestions in our QSG: 
http://www.mellanox.com/related-docs/prod_software/MLNX_DPDK_Quick_Start_Guide_v2%201_1%201.pdf

Best Regards,
Olga 


Subject: [dpdk-users] Low TX performance on Mellanox ConnectX-3 NIC  Date: Saturday, 31 October 2015, 09:54:04  From: Jesper Wramberg <jesper.wramberg@gmail.com>  To: users@dpdk.org

Hi all,



I am experiencing some performance issues in a somewhat custom setup with two Mellanox ConnectX-3 NICs. I realize these issues might be due to the setup, but I was hoping someone might be able to pinpoint some possible problems/bottlenecks.




The server:

I have a Dell PowerEdge R630 with two Mellanox ConnectX-3 NICs (one on each socket). I have a minimal CentOS 7.1.1503 installed with kernel-3.10.0-229.
Note that this kernel has been rebuilt with most things disabled to minimize size, etc. It does have InfiniBand enabled, however, and mlx4_core as a module (since nothing works otherwise). Finally, I have connected the two NICs from port 2 to port 2.



The firmware:

I have installed the latest firmware for the NICs from Dell, which is 2.34.5060.



The drivers, modules, etc.:

I have downloaded the Mellanox OFED 3.1 package for CentOS 7.1 and used its rebuild feature to build it against the custom kernel. I have installed it using the --basic option since I just want libibverbs, libmlx4, the kernel modules and the openibd service. The mlx4_core.conf is set for Ethernet on all ports. Moreover, it is configured for flow steering mode -7 and a few VFs. I can restart the openibd service successfully and everything seems to be working; ibdev2netdev reports the NICs and their VFs, etc. The only problem I have encountered at this stage is that the links don't always come up unless I unplug and re-plug the cables.
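
For reference, the kind of mlx4_core.conf I am describing looks roughly like this (the VF count is only an illustrative placeholder, not my exact file):

# /etc/modprobe.d/mlx4_core.conf
# port_type_array=2,2        -> both ports in Ethernet mode
# log_num_mgm_entry_size=-7  -> flow steering mode -7
# num_vfs / probe_vf         -> create and probe a few virtual functions
options mlx4_core port_type_array=2,2 log_num_mgm_entry_size=-7 num_vfs=4 probe_vf=4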



DPDK setup:

I have built DPDK with the mlx4 PMD using the .h/.a files from the OFED package. I built it using the default values for everything. Running the simple hello world example, I can see that everything is initialized correctly, etc.
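
Roughly, the build looked like the sketch below (the OFED include/library paths are placeholders for wherever MLNX_OFED put its files):

# Illustrative DPDK 2.x build with the mlx4 PMD enabled
make config T=x86_64-native-linuxapp-gcc
sed -i 's/CONFIG_RTE_LIBRTE_MLX4_PMD=n/CONFIG_RTE_LIBRTE_MLX4_PMD=y/' build/.config
make EXTRA_CFLAGS="-I/path/to/mlnx-ofed/include" EXTRA_LDFLAGS="-L/path/to/mlnx-ofed/lib"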



Test setup:

To test the performance of the NICs I have the following setup. Two processes, P1 and P2, running on NIC A. Two other processes, P3 and P4, running on NIC B. All processes use virtual functions on their respective NICs. Depending on the test, the processes can either transmit or receive data. To transmit, I use a simple DPDK program which generates 32000 packets and transmits them over and over until it has sent 640 million packets. Similarly, I use a simple DPDK program to receive which is basically the layer 2 forwarding example without re-transmission.



First test:

In my first test, P1 transmits data to P3 while the other processes are idle.

Packet size: 1480 byte packets

Flow control: On/Off, doesn’t matter, I get the same result.

Result: P3 receives all packets, but it takes 192.52 seconds ~ 3.32 Mpps ~ 4.9 Gbit/s



Second test:

In my second test, I attempt to increase the amount of data transmitted over NIC A. As such, P1 transmits data to P3 while P2 transmits data to P4.

Packet size: 1480 byte packets

Flow control: On/Off, doesn’t matter, I get the same result.

Results: P3 and P4 receive all packets, but it takes 364.40 seconds ~ 1.75 Mpps ~ 2.6 Gbit/s for a single process to get its data transmitted.





Does anyone have any idea what I am doing wrong here? In the second test I would expect P1 to transmit at the same speed as in the first test. It seems that there is a bottleneck somewhere, however. I have left most things at their default values, but have also tried tweaking queue sizes, number of queues, interrupts, etc., with no luck.





Best Regards,

Jesper

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-users] Low TX performance on Mellanox ConnectX-3 NIC
  2015-11-01 10:05 ` [dpdk-users] Low TX performance on Mellanox ConnectX-3 NIC Olga Shern
@ 2015-11-02 10:59   ` Jesper Wramberg
  2015-11-02 12:31     ` Olga Shern
  2015-11-02 12:57     ` Jesper Wramberg
  0 siblings, 2 replies; 7+ messages in thread
From: Jesper Wramberg @ 2015-11-02 10:59 UTC (permalink / raw)
  To: Olga Shern; +Cc: users

Hi again,

Thank you for your input. I have now switched to using the raw_ethernet_bw
tool as the transmitter and testpmd as the receiver. An immediate result I
discovered is that the raw_ethernet_bw tool achieves very similar TX
performance to my DPDK transmitter.


(note that both CPU 10 and mlx4_0 are on the same NUMA node, as intended)
taskset -c 10 raw_ethernet_bw --client -d mlx4_0 -i 2 -l 3 --duration 20 -s
1480 --dest_mac F4:52:14:7A:59:80
---------------------------------------------------------------------------------------
Post List requested - CQ moderation will be the size of the post list
---------------------------------------------------------------------------------------
                    Send Post List BW Test
 Dual-port       : OFF          Device         : mlx4_0
 Number of qps   : 1            Transport type : IB
 Connection type : RawEth               Using SRQ      : OFF
 TX depth        : 128
 Post List       : 3
 CQ Moderation   : 3
 Mtu             : 1518[B]
 Link type       : Ethernet
 Gid index       : 0
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
**raw ethernet header****************************************

--------------------------------------------------------------
| Dest MAC         | Src MAC          | Packet Type          |
|------------------------------------------------------------|
| F4:52:14:7A:59:80| E6:1D:2D:11:FF:41|DEFAULT               |
|------------------------------------------------------------|

---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 1480       33242748         0.00               4691.58            3.323974
---------------------------------------------------------------------------------------


Running it with the 64 byte packets Olga specified gives me the following
result:

---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 64         166585650        0.00               1016.67            16.657163
---------------------------------------------------------------------------------------


The results are the same with and without flow control. I have followed the
Mellanox DPDK QSG and done everything in the performance section (except
the things regarding interrupts).

So to answer Olga's questions :-)

1: Unfortunately, I can't. If I try, the FW update complains because the
cards came with a Dell configuration (PSID: DEL0A70000023).

2: In my final setup I need jumbo frames, but just for the sake of testing I
tried changing CONFIG_RTE_LIBRTE_MLX4_SGE_WR_N to 1 in the DPDK config.
This did not really change anything, neither in my initial setup nor in the
one described above.

3: In the final setup, I plan to share the NICs between multiple
independent processes. For this reason, I wanted to use SR-IOV and
whitelist a single VF for each process. Anyway, for the tests above I have
used the PFs for simplicity.
(Side note: I discovered that multiple DPDK instances can use the same PCI
address, which might eliminate the need for SR-IOV. I wonder how that works
:-))
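
Concretely, the idea is to start each process with its own VF whitelisted on the EAL command line, along these lines (the PCI addresses, core masks and application name are placeholders):

./my_dpdk_app -c 0x06 -n 4 -w 0000:04:00.1 --file-prefix p1 -- <app args>
./my_dpdk_app -c 0x18 -n 4 -w 0000:04:00.2 --file-prefix p2 -- <app args>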

So, to conclude, isn't the raw_ethernet_bw tool supposed to show a larger
output BW with 1480-byte packets?

I have a sysinfo dump made with the Mellanox sysinfo-snapshot.py script. I can
mail this to anyone who has the time to look further into it.

Thank you for your help, best regards
Jesper

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-users] Low TX performance on Mellanox ConnectX-3 NIC
  2015-11-02 10:59   ` Jesper Wramberg
@ 2015-11-02 12:31     ` Olga Shern
  2015-11-02 12:57     ` Jesper Wramberg
  1 sibling, 0 replies; 7+ messages in thread
From: Olga Shern @ 2015-11-02 12:31 UTC (permalink / raw)
  To: Jesper Wramberg; +Cc: users

Hi Jesper,

I think your calculation is wrong ☹

3.32 Mpps with 1480 B messages = 3.32M * 1480 B * 8 bits = ~39 Gbit/s,
so this is almost line rate for one 40G port.
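
A quick check with the exact numbers from your output (the BW column appears to use MB = 2^20 bytes):

echo "3.323974 * 1480 * 8 / 1000"         | bc -l   # ~39.4 Gbit/s from the packet rate
echo "4691.58 * 1048576 * 8 / 1000000000" | bc -l   # ~39.4 Gbit/s from the reported bandwidth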

Maybe I am missing something, correct me if I am wrong

Best Regards,
Olga


P.S. The output of raw_ethernet_bw is

---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 1480       33242748         0.00               4691.58 MB/s           3.323974
---------------------------------------------------------------------------------------


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-users] Low TX performance on Mellanox ConnectX-3 NIC
  2015-11-02 10:59   ` Jesper Wramberg
  2015-11-02 12:31     ` Olga Shern
@ 2015-11-02 12:57     ` Jesper Wramberg
  2015-11-02 13:57       ` Jesper Wramberg
  1 sibling, 1 reply; 7+ messages in thread
From: Jesper Wramberg @ 2015-11-02 12:57 UTC (permalink / raw)
  To: Olga Shern; +Cc: users

Hey,

As a follow-up, I tried moving interrupts around, with no change in the
achieved speed.
Also, after some iperf testing with 10 threads, it seems impossible to get
above 10G of bandwidth.

I did get an interesting output from "perf top -p <pid>", however, while
running the raw_ethernet_bw tool.

  37.22%  libpthread-2.17.so  [.] pthread_spin_lock
  10.00%  libmlx4-rdmav2.so   [.] 0x000000000000b05a
   1.20%  libmlx4-rdmav2.so   [.] 0x000000000000b3ec
   1.07%  libmlx4-rdmav2.so   [.] 0x000000000000b06c
   1.07%  libmlx4-rdmav2.so   [.] 0x000000000000afc0
   1.06%  raw_ethernet_bw     [.] 0x000000000001484f
   1.06%  raw_ethernet_bw     [.] 0x0000000000014869
   1.06%  raw_ethernet_bw     [.] 0x00000000000142ec
   1.05%  libmlx4-rdmav2.so   [.] 0x000000000000b41c
   1.05%  libmlx4-rdmav2.so   [.] 0x000000000000aff6
   1.05%  raw_ethernet_bw     [.] 0x0000000000014f09
   1.03%  libmlx4-rdmav2.so   [.] 0x0000000000005a60
   1.03%  libmlx4-rdmav2.so   [.] 0x000000000000be51
   1.03%  libpthread-2.17.so  [.] pthread_spin_unlock
   1.01%  libmlx4-rdmav2.so   [.] 0x000000000000afdc
   1.00%  libmlx4-rdmav2.so   [.] 0x000000000000b042
   1.00%  raw_ethernet_bw     [.] 0x0000000000014314
   0.98%  libmlx4-rdmav2.so   [.] 0x000000000000bf38
   0.97%  libmlx4-rdmav2.so   [.] 0x000000000000b3d2
   0.97%  raw_ethernet_bw     [.] 0x00000000000142a4
   0.96%  raw_ethernet_bw     [.] 0x0000000000014282
   0.96%  libmlx4-rdmav2.so   [.] 0x000000000000b415
   0.96%  raw_ethernet_bw     [.] 0x000000000001425e

I wonder if the tool is supposed to spend so much time in
pthread_spin_lock..
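
To make the raw libmlx4 addresses above readable, debug symbols could probably be pulled in with something like the following (assuming debuginfo packages exist for this OFED build):

yum install -y yum-utils
debuginfo-install libmlx4 perftest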

Best regards,
Jesper

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-users] Low TX performance on Mellanox ConnectX-3 NIC
  2015-11-02 12:57     ` Jesper Wramberg
@ 2015-11-02 13:57       ` Jesper Wramberg
  0 siblings, 0 replies; 7+ messages in thread
From: Jesper Wramberg @ 2015-11-02 13:57 UTC (permalink / raw)
  To: Olga Shern; +Cc: users

Hi again,

Sorry, I missed your first email. Wow, I can't believe I missed that. I read
the output from raw_ethernet_bw as Mbit/s :-( That's kind of embarrassing.
You are right, my calculations are wrong. Sorry for bothering you with my
bad math. For what it's worth, I have spent quite some time wondering what
was wrong.
I still have some way to go though, since my original problems started in a
much larger, more complicated setup. But I'm glad this basic TX/RX setup
works as expected.

Thank you, best regards
Jesper


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-users] Low TX performance on Mellanox ConnectX-3 NIC
  2015-10-31  8:54 Jesper Wramberg
@ 2015-10-31 16:26 ` Wiles, Keith
  0 siblings, 0 replies; 7+ messages in thread
From: Wiles, Keith @ 2015-10-31 16:26 UTC (permalink / raw)
  To: Jesper Wramberg, users

On 10/31/15, 3:54 AM, "users on behalf of Jesper Wramberg" <users-bounces@dpdk.org on behalf of jesper.wramberg@gmail.com> wrote:

>Second test:
>
>In my second test, I attempt to increase the amount of data transmitted over
>NIC A. As such, P1 transmits data to P3 while P2 transmits data to P4.
>
>Packet size: 1480 byte packets
>
>Flow control: On/Off, doesn’t matter, I get the same result.
>
>Results: P3 and P4 receive all packets, but it takes 364.40 seconds ~ 1.75
>Mpps ~ 2.6 Gbit/s for a single process to get its data transmitted.
>

One suggestion I have would be to split the problem into two parts by looping the cable or packets back to the machine sending the packets and seeing what the performance is in that case. The other possible suggestion is to try Pktgen-dpdk on the two machines with the cables looped back to themselves and see what the Pktgen performance is in that case. I do not know what the problem is; I am only suggesting you try some applications we know work and simplify the configuration. I hope this helps.
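
For example, with the port looped back on the sending host, a Pktgen-dpdk run along these lines would show the raw TX/RX capability of that machine in isolation (the core mask, memory-channel count and core-to-port mapping are only placeholders):

./pktgen -c 0x1f -n 4 -- -P -m "[2:3].0"   # cores 2 and 3 handle RX/TX for port 0
# then at the Pktgen prompt:  start 0   ...   stop 0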


Regards,
Keith





^ permalink raw reply	[flat|nested] 7+ messages in thread

* [dpdk-users] Low TX performance on Mellanox ConnectX-3 NIC
@ 2015-10-31  8:54 Jesper Wramberg
  2015-10-31 16:26 ` Wiles, Keith
  0 siblings, 1 reply; 7+ messages in thread
From: Jesper Wramberg @ 2015-10-31  8:54 UTC (permalink / raw)
  To: users

Hi all,



I am experiencing some performance issues in a somewhat custom setup with
two Mellanox ConnectX-3 NICs. I realize these issues might be due to the
setup, but I was hoping someone might be able to pinpoint some possible
problems/bottlenecks.




The server:

I have a Dell PowerEdge R630 with two Mellanox ConnectX-3 NICs (one on each
socket). I have a minimal CentOS 7.1.1503 installed with kernel-3.10.0-229.
Note that this kernel has been rebuilt with most things disabled to minimize
size, etc. It does have InfiniBand enabled, however, and mlx4_core as a module
(since nothing works otherwise). Finally, I have connected the two NICs
from port 2 to port 2.



The firmware:

I have installed the latest firmware for the NICs from Dell, which is
2.34.5060.



The drivers, modules, etc.:

I have downloaded the Mellanox OFED 3.1 package for CentOS 7.1 and used its
rebuild feature to build it against the custom kernel. I have installed it
using the --basic option since I just want libibverbs, libmlx4, the kernel
modules and the openibd service. The mlx4_core.conf is set for Ethernet
on all ports. Moreover, it is configured for flow steering mode -7 and a
few VFs. I can restart the openibd service successfully and everything
seems to be working; ibdev2netdev reports the NICs and their VFs, etc. The
only problem I have encountered at this stage is that the links don't
always come up unless I unplug and re-plug the cables.



DPDK setup:

I have built DPDK with the mlx4 PMD using the .h/.a files from the OFED
package. I built it using the default values for everything. Running the
simple hello world example, I can see that everything is initialized
correctly, etc.



Test setup:

To test the performance of the NICs I have the following setup. Two
processes, P1 and P2, running on NIC A. Two other processes, P3 and P4,
running on NIC B. All processes use virtual functions on their respective
NICs. Depending on the test, the processes can either transmit or receive
data. To transmit, I use a simple DPDK program which generates 32000
packets and transmits them over and over until it has sent 640 million
packets. Similarly, I use a simple DPDK program to receive which is
basically the layer 2 forwarding example without re-transmission.



First test:

In my first test, P1 transmits data to P3 while the other processes are
idle.

Packet size: 1480 byte packets

Flow control: On/Off, doesn’t matter, I get the same result.

Result: P3 receives all packets, but it takes 192.52 seconds ~ 3.32 Mpps ~
4.9Gbit/s



Second test:

In my second test, I attempt to increase the amount of data transmitted over
NIC A. As such, P1 transmits data to P3 while P2 transmits data to P4.

Packet size: 1480 byte packets

Flow control: On/Off, doesn’t matter, I get the same result.

Results: P3 and P4 receive all packets, but it takes 364.40 seconds ~ 1.75
Mpps ~ 2.6 Gbit/s for a single process to get its data transmitted.





Does anyone have any idea what I am doing wrong here? In the second test I
would expect P1 to transmit at the same speed as in the first test. It
seems that there is a bottleneck somewhere, however. I have left most
things at their default values, but have also tried tweaking queue sizes,
number of queues, interrupts, etc., with no luck.





Best Regards,

Jesper

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2015-11-02 13:57 UTC | newest]

Thread overview: 7+ messages
-- links below jump to the message on this page --
     [not found] <36170571.QMO0L8HZgB@xps13>
2015-11-01 10:05 ` [dpdk-users] Low TX performance on Mellanox ConnectX-3 NIC Olga Shern
2015-11-02 10:59   ` Jesper Wramberg
2015-11-02 12:31     ` Olga Shern
2015-11-02 12:57     ` Jesper Wramberg
2015-11-02 13:57       ` Jesper Wramberg
2015-10-31  8:54 Jesper Wramberg
2015-10-31 16:26 ` Wiles, Keith
