* [dpdk-dev] Performance degradation with multiple ports
From: SwamZ @ 2016-02-23 3:24 UTC (permalink / raw)
To: dev
[-- Attachment #1: Type: text/plain, Size: 1393 bytes --]
Hi,
I am trying to find the maximum IO-core performance of the DPDK 2.2 code
using the l2fwd application, and got the following numbers in comparison
with the DPDK 1.7 code:
           One port              Two ports
DPDK 2.2   14.86 Mpps per port   11.8 Mpps per port
DPDK 1.7   11.8 Mpps per port    11.8 Mpps per port
Traffic rate from the router tester: 64-byte packets at 100% line rate
(14.86 Mpps per port)
CPU Speed : 3.3GHz
NIC : 82599ES 10-Gigabit
IO Virtualization: SR-IOV
Command used: ./l2fwd -c 3 -w 0000:02:00.1 -w 0000:02:00.0 -- -p 3 -T 1
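For reference, the 64-byte line-rate arithmetic behind the figure above: on the wire each 64-byte frame also carries 20 bytes of preamble, SFD and inter-frame gap, so 10 Gbit/s / ((64 + 20) bytes * 8 bits/byte) ≈ 14.88 Mpps per 10G port, i.e. the per-port line rate quoted above.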
Note:
- Both ports are on the same NUMA node. I got the same results with a full
physical core as well as with a hyper-threaded core.
- PCIe speed is the same for both ports. The lspci and other relevant
output is attached.
- In the two-port case, each core was receiving only 11.8 Mpps, which
indicates that RX is the bottleneck (a way to confirm this is sketched below).
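A minimal sketch (the helper below is my own, not part of l2fwd) of how the RX-bottleneck guess could be confirmed with the DPDK 2.2 ethdev stats API: if imissed keeps growing while ipackets stays around 11.8 Mpps, the drops are happening at the NIC RX ring rather than further down in software.

    #include <stdio.h>
    #include <inttypes.h>
    #include <rte_ethdev.h>

    /* Hypothetical helper: print the counters that tell NIC-level RX drops
     * apart from mbuf-pool exhaustion. */
    static void
    dump_port_stats(uint8_t port_id)
    {
            struct rte_eth_stats st;

            rte_eth_stats_get(port_id, &st);

            /* imissed: packets the NIC dropped because the RX ring was not
             * drained in time; rx_nombuf: RX mbuf allocation failures. */
            printf("port %u: ipackets=%" PRIu64 " imissed=%" PRIu64
                   " ierrors=%" PRIu64 " rx_nombuf=%" PRIu64 "\n",
                   (unsigned)port_id, st.ipackets, st.imissed, st.ierrors,
                   st.rx_nombuf);
    }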
Questions:
1) In the two-port case I am getting only 11.8 Mpps per port, whereas in
the single-port case I got line rate. What could be the reason for this
performance degradation? I looked through the DPDK mail archive and found
the following thread, which looks similar, but could not draw a conclusion from it:
http://dpdk.org/ml/archives/dev/2013-May/000115.html
2) Has anybody tried this kind of performance test with an i40e NIC?
Thanks,
Swamy
[-- Attachment #2: lspci_lscpu.txt --]
[-- Type: text/plain, Size: 9906 bytes --]
lspci output for the two NICs:
02:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
Subsystem: Intel Corporation Ethernet Server Adapter X520-2
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 32
Region 0: Memory at dd880000 (64-bit, prefetchable) [size=512K]
Region 2: I/O ports at 8020 [size=32]
Region 4: Memory at dd904000 (64-bit, prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
Address: 0000000000000000 Data: 0000
Masking: 00000000 Pending: 00000000
Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
Vector table: BAR=4 offset=00000000
PBA: BAR=4 offset=00002000
Capabilities: [a0] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s, Exit Latency L0s <1us, L1 <8us
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [140 v1] Device Serial Number 90-e2-ba-ff-ff-74-6b-c8
Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 1
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration-, Interrupt Message Number: 000
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
IOVSta: Migration-
Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function Dependency Link: 00
VF offset: 128, stride: 2, Device ID: 10ed
Supported Page Size: 00000553, System Page Size: 00000001
Region 0: Memory at 00000000df800000 (64-bit, non-prefetchable)
Region 3: Memory at 00000000df700000 (64-bit, non-prefetchable)
VF Migration: offset: 00000000, BIR: 0
Kernel driver in use: ixgbe
02:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
Subsystem: Intel Corporation Ethernet Server Adapter X520-2
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin B routed to IRQ 36
Region 0: Memory at dd800000 (64-bit, prefetchable) [size=512K]
Region 2: I/O ports at 8000 [size=32]
Region 4: Memory at dd900000 (64-bit, prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
Address: 0000000000000000 Data: 0000
Masking: 00000000 Pending: 00000000
Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
Vector table: BAR=4 offset=00000000
PBA: BAR=4 offset=00002000
Capabilities: [a0] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend+
LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s, Exit Latency L0s <1us, L1 <8us
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [140 v1] Device Serial Number 90-e2-ba-ff-ff-74-6b-c8
Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration-, Interrupt Message Number: 000
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
IOVSta: Migration-
Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function Dependency Link: 01
VF offset: 128, stride: 2, Device ID: 10ed
Supported Page Size: 00000553, System Page Size: 00000001
Region 0: Memory at 00000000df600000 (64-bit, non-prefetchable)
Region 3: Memory at 00000000df500000 (64-bit, non-prefetchable)
VF Migration: offset: 00000000, BIR: 0
Kernel driver in use: ixgbe
root@BOX:~# uname -a
Linux BOX 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
root@BOX:~# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Stepping: 4
CPU MHz: 1200.000
BogoMIPS: 6601.55
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31
root@BOX:~# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz
stepping : 4
microcode : 0x416
cpu MHz : 1200.000
cache size : 25600 KB
physical id : 0
siblings : 16
core id : 1
cpu cores : 8
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips : 6599.86
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz
stepping : 4
microcode : 0x416
cpu MHz : 3301.000
cache size : 25600 KB
physical id : 0
siblings : 16
core id : 2
cpu cores : 8
apicid : 4
initial apicid : 4
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips : 6599.86
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
<SNIP> Remaining core details are removed
* Re: [dpdk-dev] Performance degradation with multiple ports
From: Arnon Warshavsky @ 2016-02-23 6:27 UTC (permalink / raw)
To: SwamZ; +Cc: dev
Hi Swamy
We experienced a somewhat similar degradation (though not with l2fwd),
described here:
http://dev.dpdk.narkive.com/OL0KiHns/dpdk-dev-missing-prefetch-in-non-vector-rx-function
In our case it surfaced when we were not using the default configuration
and were running in non-vector mode, and it behaved the same for both
ixgbe and i40e (see the sketch below).
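For reference, a minimal sketch (the descriptor count and threshold value below are illustrative assumptions, not taken from this thread) of the configuration difference involved: with DPDK 2.2, passing a NULL rte_eth_rxconf to rte_eth_rx_queue_setup() keeps the PMD defaults, which normally leave the ixgbe vector RX path enabled, whereas overriding the RX thresholds (for example with a small rx_free_thresh) can silently push the PMD onto the slower scalar RX path.

    #include <string.h>
    #include <rte_ethdev.h>
    #include <rte_mempool.h>

    /* Default configuration: a NULL rxconf lets the PMD pick its own
     * thresholds, which normally satisfy the vector-RX prerequisites. */
    static int
    setup_rx_default(uint8_t port_id, struct rte_mempool *mp)
    {
            return rte_eth_rx_queue_setup(port_id, 0, 512,
                                          rte_eth_dev_socket_id(port_id),
                                          NULL, mp);
    }

    /* Custom configuration: an rx_free_thresh this small can disqualify the
     * vector/bulk-alloc RX path and drop the PMD back to scalar RX. */
    static int
    setup_rx_custom(uint8_t port_id, struct rte_mempool *mp)
    {
            struct rte_eth_rxconf rxconf;

            memset(&rxconf, 0, sizeof(rxconf));
            rxconf.rx_free_thresh = 4;
            return rte_eth_rx_queue_setup(port_id, 0, 512,
                                          rte_eth_dev_socket_id(port_id),
                                          &rxconf, mp);
    }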
/Arnon
On Tue, Feb 23, 2016 at 5:24 AM, SwamZ <swamssrk@gmail.com> wrote:
> Hi,
>
> I am trying to find the maximum IO-core performance of the DPDK 2.2 code
> using the l2fwd application, and got the following numbers in comparison
> with the DPDK 1.7 code:
>
>
>            One port              Two ports
> DPDK 2.2   14.86 Mpps per port   11.8 Mpps per port
> DPDK 1.7   11.8 Mpps per port    11.8 Mpps per port
>
>
>
> Traffic rate from the router tester: 64-byte packets at 100% line rate
> (14.86 Mpps per port)
>
> CPU Speed : 3.3GHz
>
> NIC : 82599ES 10-Gigabit
>
> IO Virtualization: SR-IOV
>
> Command used: ./l2fwd -c 3 -w 0000:02:00.1 -w 0000:02:00.0 -- -p 3 -T 1
>
>
> Note:
>
> - Both ports are on the same NUMA node. I got the same results with a full
> physical core as well as with a hyper-threaded core.
>
> - PCIe speed is the same for both ports. The lspci and other relevant
> output is attached.
>
> - In the two-port case, each core was receiving only 11.8 Mpps, which
> indicates that RX is the bottleneck.
>
>
> Questions:
>
> 1) In the two-port case I am getting only 11.8 Mpps per port, whereas in
> the single-port case I got line rate. What could be the reason for this
> performance degradation? I looked through the DPDK mail archive and found
> the following thread, which looks similar, but could not draw a conclusion from it:
>
> http://dpdk.org/ml/archives/dev/2013-May/000115.html
>
>
> 2) Has anybody tried this kind of performance test with an i40e NIC?
>
>
> Thanks,
>
> Swamy
>