DPDK patches and discussions
* Re: [dpdk-dev] OVS-DPDK performance problem on ixgbe vector PMD
       [not found] <55D76854.5010306@linaro.org>
@ 2015-08-23  6:46 ` Gray, Mark D
  2015-08-24 11:43 ` [dpdk-dev] [ovs-dev] " Traynor, Kevin
  2015-08-26 18:49 ` [dpdk-dev] " Zoltan Kiss
  2 siblings, 0 replies; 4+ messages in thread
From: Gray, Mark D @ 2015-08-23  6:46 UTC (permalink / raw)
  To: Zoltan Kiss, dev, dev


> 
> Hi,
> 
> I've set up a simple packet forwarding perf test on a dual-port 10G
> 82599ES: one port receives 64-byte UDP packets, the other sends them
> out, using one core. I've used the latest OVS with DPDK 2.1, and the
> first result was only 13.2 Mpps, which was a bit short of the 13.9 I saw
> last year with the same test. The first thing I changed was to revert to
> the old behaviour regarding this issue:
> 
> http://permalink.gmane.org/gmane.comp.networking.dpdk.devel/22731
> 
> So instead of the new default I've passed 2048 + RTE_PKTMBUF_HEADROOM.

We'll post a patch this week that should resolve this (if it is the same issue).
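
For context, the setting being discussed is the mbuf data room size passed
when the pool is created. Below is a minimal sketch of such a call; the pool
name and counts are made up for illustration, and this is neither the OVS
code nor the patch mentioned above:

    #include <rte_lcore.h>
    #include <rte_mbuf.h>

    /* Illustrative only: each mbuf's data room is sized so a full
     * 2048-byte frame still fits after the headroom, which is what
     * "passing 2048 + RTE_PKTMBUF_HEADROOM" refers to above. */
    static struct rte_mempool *
    example_create_mbuf_pool(void)
    {
        return rte_pktmbuf_pool_create("example_mbuf_pool", /* made-up name */
                                       16384,  /* mbufs in the pool */
                                       256,    /* per-lcore cache size */
                                       0,      /* app private area size */
                                       2048 + RTE_PKTMBUF_HEADROOM,
                                       rte_socket_id());
    }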

> That increased the performance to 13.5, but to figure out what's wrong I
> started to play with the receive functions. First I disabled the vector
> PMD, but ixgbe_recv_pkts_bulk_alloc() was even worse, only 12.5 Mpps. So
> then I enabled scattered RX, and with ixgbe_recv_pkts_lro_bulk_alloc() I
> managed to get 13.98 Mpps, which is, I guess, as close as possible to
> the 14.2 Mpps line rate (on my HW at least, with one core).
> Does anyone have a good explanation for why the vector PMD performs so
> significantly worse? I would expect that on a 3.2 GHz i5-4570 one core
> should be able to reach ~14 Mpps; SG and vector PMD shouldn't make a
> difference.
> I've tried to look into it with oprofile, but the results were quite
> strange: 35% of the samples were from miniflow_extract, the part where
> parse_vlan calls data_pull to jump past the MAC addresses. The oprofile
> snippet (1M samples):
> 
>    511454 19        0.0037  flow.c:511
>    511458 149       0.0292  dp-packet.h:266
>    51145f 4264      0.8357  dp-packet.h:267
>    511466 18        0.0035  dp-packet.h:268
>    51146d 43        0.0084  dp-packet.h:269
>    511474 172       0.0337  flow.c:511
>    51147a 4320      0.8467  string3.h:51
>    51147e 358763   70.3176  flow.c:99
>    511482 2        3.9e-04  string3.h:51
>    511485 3060      0.5998  string3.h:51
>    511488 1693      0.3318  string3.h:51
>    51148c 2933      0.5749  flow.c:326
>    511491 47        0.0092  flow.c:326
> 
> And the corresponding disassembled code:
> 
>    511454:       49 83 f9 0d             cmp    r9,0xd
>    511458:       c6 83 81 00 00 00 00    mov    BYTE PTR [rbx+0x81],0x0
>    51145f:       66 89 83 82 00 00 00    mov    WORD PTR [rbx+0x82],ax
>    511466:       66 89 93 84 00 00 00    mov    WORD PTR [rbx+0x84],dx
>    51146d:       66 89 8b 86 00 00 00    mov    WORD PTR [rbx+0x86],cx
>    511474:       0f 86 af 01 00 00       jbe    511629
> <miniflow_extract+0x279>
>    51147a:       48 8b 45 00             mov    rax,QWORD PTR [rbp+0x0]
>    51147e:       4c 8d 5d 0c             lea    r11,[rbp+0xc]
>    511482:       49 89 00                mov    QWORD PTR [r8],rax
>    511485:       8b 45 08                mov    eax,DWORD PTR [rbp+0x8]
>    511488:       41 89 40 08             mov    DWORD PTR [r8+0x8],eax
>    51148c:       44 0f b7 55 0c          movzx  r10d,WORD PTR [rbp+0xc]
>    511491:       66 41 81 fa 81 00       cmp    r10w,0x81
> 
> My only explanation for this so far is that I'm misunderstanding
> something about the oprofile results.
> 
> Regards,
> 
> Zoltan


* Re: [dpdk-dev] [ovs-dev] OVS-DPDK performance problem on ixgbe vector PMD
       [not found] <55D76854.5010306@linaro.org>
  2015-08-23  6:46 ` [dpdk-dev] OVS-DPDK performance problem on ixgbe vector PMD Gray, Mark D
@ 2015-08-24 11:43 ` Traynor, Kevin
  2015-08-26 17:07   ` Zoltan Kiss
  2015-08-26 18:49 ` [dpdk-dev] " Zoltan Kiss
  2 siblings, 1 reply; 4+ messages in thread
From: Traynor, Kevin @ 2015-08-24 11:43 UTC (permalink / raw)
  To: Zoltan Kiss, dev, dev


> -----Original Message-----
> From: dev [mailto:dev-bounces@openvswitch.org] On Behalf Of Zoltan Kiss
> Sent: Friday, August 21, 2015 7:05 PM
> To: dev@dpdk.org; dev@openvswitch.org
> Cc: Richardson, Bruce; Ananyev, Konstantin
> Subject: [ovs-dev] OVS-DPDK performance problem on ixgbe vector PMD
> 
> Hi,
> 
> I've set up a simple packet forwarding perf test on a dual-port 10G
> 82599ES: one port receives 64-byte UDP packets, the other sends them
> out, using one core. I've used the latest OVS with DPDK 2.1, and the
> first result was only 13.2 Mpps, which was a bit short of the 13.9 I saw
> last year with the same test. The first thing I changed was to revert to
> the old behaviour regarding this issue:
> 
> http://permalink.gmane.org/gmane.comp.networking.dpdk.devel/22731
> 
> So instead of the new default I've passed 2048 + RTE_PKTMBUF_HEADROOM.
> That increased the performance to 13.5, but to figure out what's wrong I
> started to play with the receive functions. First I disabled the vector
> PMD, but ixgbe_recv_pkts_bulk_alloc() was even worse, only 12.5 Mpps. So
> then I enabled scattered RX, and with ixgbe_recv_pkts_lro_bulk_alloc() I
> managed to get 13.98 Mpps, which is, I guess, as close as possible to
> the 14.2 Mpps line rate (on my HW at least, with one core).
> Does anyone have a good explanation for why the vector PMD performs so
> significantly worse? I would expect that on a 3.2 GHz i5-4570 one core
> should be able to reach ~14 Mpps; SG and vector PMD shouldn't make a
> difference.

I've previously turned vectorisation on/off and found that for Tx it makes
a significant difference. For Rx it didn't make much of a difference, but
Rx bulk allocation, which gets enabled with it, did improve performance.
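
For reference, here is a minimal sketch of the kind of single-core forward
loop this test exercises (not the OVS pmd loop; the port ids and burst size
are made up). The point is that rte_eth_rx_burst()/rte_eth_tx_burst()
dispatch to whichever RX/TX path the driver selected at setup time, so the
loop itself stays the same whether the vector, bulk-alloc or scattered path
is in use:

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define RX_PORT  0   /* illustrative port ids */
    #define TX_PORT  1
    #define BURST    32

    /* Bare-bones forwarding: receive a burst on one port, transmit it
     * on the other, free whatever the TX queue could not take. */
    static void
    example_forward_loop(void)
    {
        struct rte_mbuf *pkts[BURST];

        for (;;) {
            uint16_t nb_rx = rte_eth_rx_burst(RX_PORT, 0, pkts, BURST);
            uint16_t nb_tx = rte_eth_tx_burst(TX_PORT, 0, pkts, nb_rx);

            while (nb_tx < nb_rx) {
                rte_pktmbuf_free(pkts[nb_tx++]);
            }
        }
    }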

Is there something else also running on the current pmd core? Did you
try moving it to another? Also, did you compile OVS with -O3/-Ofast? They
tend to give a performance boost.

Are you hitting 3.2 GHz for the core with the pmd? I think that is only
with turbo boost, so it may not be achievable all the time.

> I've tried to look into it with oprofile, but the results were quite
> strange: 35% of the samples were from miniflow_extract, the part where
> parse_vlan calls data_pull to jump past the MAC addresses. The oprofile
> snippet (1M samples):
> 
>    511454 19        0.0037  flow.c:511
>    511458 149       0.0292  dp-packet.h:266
>    51145f 4264      0.8357  dp-packet.h:267
>    511466 18        0.0035  dp-packet.h:268
>    51146d 43        0.0084  dp-packet.h:269
>    511474 172       0.0337  flow.c:511
>    51147a 4320      0.8467  string3.h:51
>    51147e 358763   70.3176  flow.c:99
>    511482 2        3.9e-04  string3.h:51
>    511485 3060      0.5998  string3.h:51
>    511488 1693      0.3318  string3.h:51
>    51148c 2933      0.5749  flow.c:326
>    511491 47        0.0092  flow.c:326
> 
> And the corresponding disassembled code:
> 
>    511454:       49 83 f9 0d             cmp    r9,0xd
>    511458:       c6 83 81 00 00 00 00    mov    BYTE PTR [rbx+0x81],0x0
>    51145f:       66 89 83 82 00 00 00    mov    WORD PTR [rbx+0x82],ax
>    511466:       66 89 93 84 00 00 00    mov    WORD PTR [rbx+0x84],dx
>    51146d:       66 89 8b 86 00 00 00    mov    WORD PTR [rbx+0x86],cx
>    511474:       0f 86 af 01 00 00       jbe    511629
> <miniflow_extract+0x279>
>    51147a:       48 8b 45 00             mov    rax,QWORD PTR [rbp+0x0]
>    51147e:       4c 8d 5d 0c             lea    r11,[rbp+0xc]
>    511482:       49 89 00                mov    QWORD PTR [r8],rax
>    511485:       8b 45 08                mov    eax,DWORD PTR [rbp+0x8]
>    511488:       41 89 40 08             mov    DWORD PTR [r8+0x8],eax
>    51148c:       44 0f b7 55 0c          movzx  r10d,WORD PTR [rbp+0xc]
>    511491:       66 41 81 fa 81 00       cmp    r10w,0x81
> 
> My only explanation for this so far is that I'm misunderstanding
> something about the oprofile results.
> 
> Regards,
> 
> Zoltan
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> http://openvswitch.org/mailman/listinfo/dev


* Re: [dpdk-dev] [ovs-dev] OVS-DPDK performance problem on ixgbe vector PMD
  2015-08-24 11:43 ` [dpdk-dev] [ovs-dev] " Traynor, Kevin
@ 2015-08-26 17:07   ` Zoltan Kiss
  0 siblings, 0 replies; 4+ messages in thread
From: Zoltan Kiss @ 2015-08-26 17:07 UTC (permalink / raw)
  To: Traynor, Kevin, dev, dev

Hi,

On 24/08/15 12:43, Traynor, Kevin wrote:
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@openvswitch.org] On Behalf Of Zoltan Kiss
>> Sent: Friday, August 21, 2015 7:05 PM
>> To: dev@dpdk.org; dev@openvswitch.org
>> Cc: Richardson, Bruce; Ananyev, Konstantin
>> Subject: [ovs-dev] OVS-DPDK performance problem on ixgbe vector PMD
>>
>> Hi,
>>
>> I've set up a simple packet forwarding perf test on a dual-port 10G
>> 82599ES: one port receives 64-byte UDP packets, the other sends them
>> out, using one core. I've used the latest OVS with DPDK 2.1, and the
>> first result was only 13.2 Mpps, which was a bit short of the 13.9 I saw
>> last year with the same test. The first thing I changed was to revert to
>> the old behaviour regarding this issue:
>>
>> http://permalink.gmane.org/gmane.comp.networking.dpdk.devel/22731
>>
>> So instead of the new default I've passed 2048 + RTE_PKTMBUF_HEADROOM.
>> That increased the performance to 13.5, but to figure out what's wrong I
>> started to play with the receive functions. First I disabled the vector
>> PMD, but ixgbe_recv_pkts_bulk_alloc() was even worse, only 12.5 Mpps. So
>> then I enabled scattered RX, and with ixgbe_recv_pkts_lro_bulk_alloc() I
>> managed to get 13.98 Mpps, which is, I guess, as close as possible to
>> the 14.2 Mpps line rate (on my HW at least, with one core).
>> Does anyone have a good explanation for why the vector PMD performs so
>> significantly worse? I would expect that on a 3.2 GHz i5-4570 one core
>> should be able to reach ~14 Mpps; SG and vector PMD shouldn't make a
>> difference.
>
> I've previously turned vectorisation on/off and found that for Tx it makes
> a significant difference. For Rx it didn't make much of a difference, but
> Rx bulk allocation, which gets enabled with it, did improve performance.
>
> Is there something else also running on the current pmd core? Did you
> try moving it to another?
I've tied the pmd to the second core; as far as I can see from top and
profiling output, hardly anything else runs there.

> Also, did you compile OVS with -O3/-Ofast? They
> tend to give a performance boost.
Yes.

>
> Are you hitting 3.2 GHz for the core with the pmd? I think that is only
> with turbo boost, so it may not be achievable all the time.
The turbo boost freq is 3.6 GHz.

>
>> I've tried to look into it with oprofile, but the results were quite
>> strange: 35% of the samples were from miniflow_extract, the part where
>> parse_vlan calls data_pull to jump past the MAC addresses. The oprofile
>> snippet (1M samples):
>>
>>     511454 19        0.0037  flow.c:511
>>     511458 149       0.0292  dp-packet.h:266
>>     51145f 4264      0.8357  dp-packet.h:267
>>     511466 18        0.0035  dp-packet.h:268
>>     51146d 43        0.0084  dp-packet.h:269
>>     511474 172       0.0337  flow.c:511
>>     51147a 4320      0.8467  string3.h:51
>>     51147e 358763   70.3176  flow.c:99
>>     511482 2        3.9e-04  string3.h:51
>>     511485 3060      0.5998  string3.h:51
>>     511488 1693      0.3318  string3.h:51
>>     51148c 2933      0.5749  flow.c:326
>>     511491 47        0.0092  flow.c:326
>>
>> And the corresponding disassembled code:
>>
>>     511454:       49 83 f9 0d             cmp    r9,0xd
>>     511458:       c6 83 81 00 00 00 00    mov    BYTE PTR [rbx+0x81],0x0
>>     51145f:       66 89 83 82 00 00 00    mov    WORD PTR [rbx+0x82],ax
>>     511466:       66 89 93 84 00 00 00    mov    WORD PTR [rbx+0x84],dx
>>     51146d:       66 89 8b 86 00 00 00    mov    WORD PTR [rbx+0x86],cx
>>     511474:       0f 86 af 01 00 00       jbe    511629
>> <miniflow_extract+0x279>
>>     51147a:       48 8b 45 00             mov    rax,QWORD PTR [rbp+0x0]
>>     51147e:       4c 8d 5d 0c             lea    r11,[rbp+0xc]
>>     511482:       49 89 00                mov    QWORD PTR [r8],rax
>>     511485:       8b 45 08                mov    eax,DWORD PTR [rbp+0x8]
>>     511488:       41 89 40 08             mov    DWORD PTR [r8+0x8],eax
>>     51148c:       44 0f b7 55 0c          movzx  r10d,WORD PTR [rbp+0xc]
>>     511491:       66 41 81 fa 81 00       cmp    r10w,0x81
>>
>> My only explanation for this so far is that I'm misunderstanding
>> something about the oprofile results.
>>
>> Regards,
>>
>> Zoltan
>> _______________________________________________
>> dev mailing list
>> dev@openvswitch.org
>> http://openvswitch.org/mailman/listinfo/dev


* Re: [dpdk-dev] OVS-DPDK performance problem on ixgbe vector PMD
       [not found] <55D76854.5010306@linaro.org>
  2015-08-23  6:46 ` [dpdk-dev] OVS-DPDK performance problem on ixgbe vector PMD Gray, Mark D
  2015-08-24 11:43 ` [dpdk-dev] [ovs-dev] " Traynor, Kevin
@ 2015-08-26 18:49 ` Zoltan Kiss
  2 siblings, 0 replies; 4+ messages in thread
From: Zoltan Kiss @ 2015-08-26 18:49 UTC (permalink / raw)
  To: dev, dev

Hi,

I've checked it further; based on Stephen's suggestion I've tried perf
top as well. The results were the same: it spends a lot of time in that
part of the code, and there is a high number of mispredicted branches
(BR_MISS_PRED_RETIRED) around there too.
I've also started to strip down miniflow_extract() to remove parts which
are not relevant to this very simple testcase. I've removed the metadata
checking branches and the "size < sizeof(struct eth_header)" check. I've
removed the size check from emc_processing, and placed log messages in
flow_extract and netdev_flow_key_from_flow, to make sure the excessive
time spent in miniflow_extract is not because these two are somehow
calling it.
That way I've ruled out all of the branches preceding this instruction.
Oddly, the high sample count has now moved down a few instructions:
...
dp_packet_reset_offsets
   5113eb:       b8 ff ff ff ff          mov    $0xffffffff,%eax
   5113f0:       66 89 8f 86 00 00 00    mov    %cx,0x86(%rdi)
   5113f7:       c6 87 81 00 00 00 00    movb   $0x0,0x81(%rdi)
   5113fe:       66 89 87 82 00 00 00    mov    %ax,0x82(%rdi)
data_pull
   511405:       48 8d 4d 0c             lea    0xc(%rbp),%rcx
dp_packet_reset_offsets
   511409:       66 89 97 84 00 00 00    mov    %dx,0x84(%rdi)
memcpy
   511410:       48 8b 45 00             mov    0x0(%rbp),%rax
   511414:       48 89 46 18             mov    %rax,0x18(%rsi)

This last instruction moves the first 8 bytes of the MAC addresses (coming
from 0x0(%rbp)) to 0x18(%rsi), which is basically the memory pointed to by
the "struct miniflow *dst" parameter. It is allocated on the stack by
emc_processing.
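
For illustration, a simplified sketch (not the actual flow.c code) of what
these instructions amount to: copy the 12 bytes of destination and source
MAC into the stack-allocated miniflow buffer and advance the parse pointer
past them. The structure and function names here are made up:

    #include <stdint.h>
    #include <string.h>

    /* Stand-in for the miniflow data area filled by miniflow_extract();
     * the real structure is different, this just mirrors the 8-byte +
     * 4-byte stores seen in the disassembly. */
    struct example_miniflow_buf {
        uint64_t words[8];
    };

    static const void *
    example_pull_eth_addrs(const void *data, struct example_miniflow_buf *dst)
    {
        memcpy(dst->words, data, 12);        /* dst MAC (6) + src MAC (6) */
        return (const uint8_t *) data + 12;  /* jump past the MAC addresses */
    }
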
I couldn't find any branch which could cause this miss, but then I checked
the PMD stats:

pmd thread numa_id 0 core_id 1:
	emc hits:4395834176
	megaflow hits:1
	miss:1
	lost:0
	polling cycles:166083129380 (16.65%)
	processing cycles:831536059972 (83.35%)
	avg cycles per packet: 226.95 (997619189352/4395834178)
	avg processing cycles per packet: 189.16 (831536059972/4395834178)

So everything hits the EMC; when I measured the change of that counter
over 10 seconds, the result was around ~13.3 Mpps too. The cycle
statistics show that it should be able to handle more than 15M packets
per second (at the 3.6 GHz turbo frequency, 3.6e9 / 226.95 cycles per
packet comes to roughly 15.9 Mpps), yet it doesn't receive that much,
while with the non-vector PMD it can max out the link.
Any more suggestions?

Regards,

Zoltan


On 21/08/15 19:05, Zoltan Kiss wrote:
> Hi,
>
> I've set up a simple packet forwarding perf test on a dual-port 10G
> 82599ES: one port receives 64-byte UDP packets, the other sends them
> out, using one core. I've used the latest OVS with DPDK 2.1, and the
> first result was only 13.2 Mpps, which was a bit short of the 13.9 I saw
> last year with the same test. The first thing I changed was to revert to
> the old behaviour regarding this issue:
>
> http://permalink.gmane.org/gmane.comp.networking.dpdk.devel/22731
>
> So instead of the new default I've passed 2048 + RTE_PKTMBUF_HEADROOM.
> That increased the performance to 13.5, but to figure out what's wrong I
> started to play with the receive functions. First I disabled the vector
> PMD, but ixgbe_recv_pkts_bulk_alloc() was even worse, only 12.5 Mpps. So
> then I enabled scattered RX, and with ixgbe_recv_pkts_lro_bulk_alloc() I
> managed to get 13.98 Mpps, which is, I guess, as close as possible to
> the 14.2 Mpps line rate (on my HW at least, with one core).
> Does anyone have a good explanation for why the vector PMD performs so
> significantly worse? I would expect that on a 3.2 GHz i5-4570 one core
> should be able to reach ~14 Mpps; SG and vector PMD shouldn't make a
> difference.
> I've tried to look into it with oprofile, but the results were quite
> strange: 35% of the samples were from miniflow_extract, the part where
> parse_vlan calls data_pull to jump past the MAC addresses. The oprofile
> snippet (1M samples):
>
>    511454 19        0.0037  flow.c:511
>    511458 149       0.0292  dp-packet.h:266
>    51145f 4264      0.8357  dp-packet.h:267
>    511466 18        0.0035  dp-packet.h:268
>    51146d 43        0.0084  dp-packet.h:269
>    511474 172       0.0337  flow.c:511
>    51147a 4320      0.8467  string3.h:51
>    51147e 358763   70.3176  flow.c:99
>    511482 2        3.9e-04  string3.h:51
>    511485 3060      0.5998  string3.h:51
>    511488 1693      0.3318  string3.h:51
>    51148c 2933      0.5749  flow.c:326
>    511491 47        0.0092  flow.c:326
>
> And the corresponding disassembled code:
>
>    511454:       49 83 f9 0d             cmp    r9,0xd
>    511458:       c6 83 81 00 00 00 00    mov    BYTE PTR [rbx+0x81],0x0
>    51145f:       66 89 83 82 00 00 00    mov    WORD PTR [rbx+0x82],ax
>    511466:       66 89 93 84 00 00 00    mov    WORD PTR [rbx+0x84],dx
>    51146d:       66 89 8b 86 00 00 00    mov    WORD PTR [rbx+0x86],cx
>    511474:       0f 86 af 01 00 00       jbe    511629
> <miniflow_extract+0x279>
>    51147a:       48 8b 45 00             mov    rax,QWORD PTR [rbp+0x0]
>    51147e:       4c 8d 5d 0c             lea    r11,[rbp+0xc]
>    511482:       49 89 00                mov    QWORD PTR [r8],rax
>    511485:       8b 45 08                mov    eax,DWORD PTR [rbp+0x8]
>    511488:       41 89 40 08             mov    DWORD PTR [r8+0x8],eax
>    51148c:       44 0f b7 55 0c          movzx  r10d,WORD PTR [rbp+0xc]
>    511491:       66 41 81 fa 81 00       cmp    r10w,0x81
>
> My only explanation for this so far is that I'm misunderstanding
> something about the oprofile results.
>
> Regards,
>
> Zoltan


Thread overview: 4+ messages
     [not found] <55D76854.5010306@linaro.org>
2015-08-23  6:46 ` [dpdk-dev] OVS-DPDK performance problem on ixgbe vector PMD Gray, Mark D
2015-08-24 11:43 ` [dpdk-dev] [ovs-dev] " Traynor, Kevin
2015-08-26 17:07   ` Zoltan Kiss
2015-08-26 18:49 ` [dpdk-dev] " Zoltan Kiss
