* cache miss increases when change rx descriptor from 512 to 2048
@ 2023-02-09 3:58 Xiaoping Yan (NSB)
From: Xiaoping Yan (NSB) @ 2023-02-09 3:58 UTC (permalink / raw)
To: users
Hi experts,
I ran a traffic throughput test for my DPDK application with the same software and test case; the only difference was the number of rx/tx descriptors:
Rx/tx descriptors 512: test result 3.2 Mpps
Rx/tx descriptors 2048: test result 3.0 Mpps
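The descriptor counts are the nb_rx_desc / nb_tx_desc arguments to the standard queue-setup calls; a minimal sketch of where the knob lives (the port id, queue id, and mempool are illustrative, not our actual code):

    #include <rte_ethdev.h>
    #include <rte_mempool.h>

    /* Ring sizes under test: 512 vs 2048. */
    #define NB_RX_DESC 2048
    #define NB_TX_DESC 2048

    /* mp is an already-created mbuf pool; queue 0 is illustrative. */
    static int setup_queues(uint16_t port_id, struct rte_mempool *mp)
    {
        int ret = rte_eth_rx_queue_setup(port_id, 0, NB_RX_DESC,
                                         rte_eth_dev_socket_id(port_id),
                                         NULL /* default rxconf */, mp);
        if (ret < 0)
            return ret;

        return rte_eth_tx_queue_setup(port_id, 0, NB_TX_DESC,
                                      rte_eth_dev_socket_id(port_id),
                                      NULL /* default txconf */);
    }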
From the perf data, the 2048-descriptor case has more cache misses and lower instructions per cycle.
Perf for 512 rx descriptors:
    114,289,237,792  cpu-cycles
    365,408,402,395  instructions       # 3.20 insn per cycle
     74,186,289,932  branches
         36,020,793  branch-misses      # 0.05% of all branches
      1,298,741,388  bus-cycles
          3,413,460  cache-misses       # 0.723% of all cache refs
        472,363,654  cache-references
Perf for 2048 rx descriptors:
     57,038,451,185  cpu-cycles
    173,805,485,573  instructions       # 3.05 insn per cycle
     35,289,607,389  branches
         15,418,885  branch-misses      # 0.04% of all branches
        648,164,239  bus-cycles
         13,170,596  cache-misses       # 1.702% of all cache refs
        773,765,263  cache-references
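The exact perf command is not shown above; counters like these are typically collected per polling core with something along these lines (the core number is illustrative):

    perf stat -e cpu-cycles,instructions,branches,branch-misses \
              -e bus-cycles,cache-misses,cache-references \
              -C 2 -- sleep 10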
I understand this to mean that more rx descriptors somehow cause more cache misses, hence fewer instructions per cycle and lower performance.
Has anyone observed similar results?
Any idea how to mitigate (or investigate further) the impact? (We want to use 2048 to better tolerate jitter/bursts.)
Any comments?
Thank you.
Br, Xiaoping
* Re: cache miss increases when change rx descriptor from 512 to 2048
From: Stephen Hemminger @ 2023-02-09 16:38 UTC (permalink / raw)
To: Xiaoping Yan (NSB); +Cc: users
On Thu, 9 Feb 2023 03:58:56 +0000
"Xiaoping Yan (NSB)" <xiaoping.yan@nokia-sbell.com> wrote:
> Hi experts,
>
> I ran a traffic throughput test for my DPDK application with the same software and test case; the only difference was the number of rx/tx descriptors:
> Rx/tx descriptors 512: test result 3.2 Mpps
> Rx/tx descriptors 2048: test result 3.0 Mpps
> [...]
> Has anyone observed similar results?
> Any idea how to mitigate (or investigate further) the impact? (We want to use 2048 to better tolerate jitter/bursts.)
If the number of RX descriptors is small, there is a higher chance that when the device driver
walks the descriptor ring, or uses the resulting mbuf, the data is still in cache.
With a large number of descriptors, since descriptors are used LIFO, the rx descriptor and mbuf
will not be in cache.
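As a back-of-envelope illustration (the per-descriptor and per-mbuf byte counts below are assumptions for a typical NIC, not measured values):

    /* Rough hot-working-set estimate for an RX ring.
     * Assumes 16 B per descriptor and about two 64 B cache lines
     * (128 B) touched per mbuf header on the receive path. */
    #include <stdio.h>

    int main(void)
    {
        const unsigned desc_bytes = 16;   /* assumed descriptor size */
        const unsigned mbuf_bytes = 128;  /* assumed hot bytes per mbuf */
        const unsigned rings[] = { 512, 2048 };

        for (unsigned i = 0; i < sizeof(rings) / sizeof(rings[0]); i++) {
            unsigned n = rings[i];
            printf("%4u descriptors -> ~%u KiB touched per ring pass\n",
                   n, n * (desc_bytes + mbuf_bytes) / 1024);
        }
        return 0;
    }

Under those assumptions, 512 descriptors touch roughly 72 KiB per pass around the ring, which can largely survive in L2 between passes; 2048 descriptors touch roughly 288 KiB, which plausibly cannot, consistent with the higher cache-miss rate in the second perf run.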
* RE: cache miss increases when change rx descriptor from 512 to 2048
From: Xiaoping Yan (NSB) @ 2023-02-10 1:59 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: users
Hi Stephen,
Could you elaborate a bit on "since descriptors are used LIFO the rx descriptor and mbuf will not be in cache"?
Does "last in" mean the mbuf that was just used and freed? And does "first out" mean such an mbuf will be reused first, which would give it a good chance of still being in cache?
Thank you a lot.
Br, Xiaoping
-----Original Message-----
From: Stephen Hemminger <stephen@networkplumber.org>
Sent: 10 February 2023 0:38
To: Xiaoping Yan (NSB) <xiaoping.yan@nokia-sbell.com>
Cc: users@dpdk.org
Subject: Re: cache miss increases when change rx descriptor from 512 to 2048
[...]
If the number of RX descriptors is small, there is a higher chance that when the device driver walks the descriptor ring, or uses the resulting mbuf, the data is still in cache.
With a large number of descriptors, since descriptors are used LIFO, the rx descriptor and mbuf will not be in cache.
* Re: cache miss increases when change rx descriptor from 512 to 2048
From: Stephen Hemminger @ 2023-02-10 2:38 UTC (permalink / raw)
To: Xiaoping Yan (NSB); +Cc: users
On Fri, 10 Feb 2023 01:59:02 +0000
"Xiaoping Yan (NSB)" <xiaoping.yan@nokia-sbell.com> wrote:
> Hi Stephen,
>
> Could you elaborate a bit on "since descriptors are used LIFO the rx descriptor and mbuf will not be in cache"?
> Does "last in" mean the mbuf that was just used and freed? And does "first out" mean such an mbuf will be reused first, which would give it a good chance of still being in cache?
> [...]
The receive descriptors form a ring, so the mbuf returned by the hardware is the oldest one
that was queued. My mistake, that would be FIFO.
So the oldest mbuf (and its associated ring entry) is the one the driver uses.
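Schematically (a simplified sketch, not any particular PMD's data structure):

    #include <stdint.h>

    struct rte_mbuf;                      /* from <rte_mbuf.h> */

    /* The driver refills fresh mbufs at `tail`; the NIC completes
     * descriptors in order, so the mbuf harvested at `head` is always
     * the oldest one queued. With 2048 slots, by the time the ring
     * wraps back to a slot, its cache lines have likely been evicted. */
    struct rx_ring {
        struct rte_mbuf *mbufs[2048];     /* shadow ring of queued mbufs */
        uint16_t head;                    /* next completion to harvest (oldest) */
        uint16_t tail;                    /* next free slot to refill (newest) */
    };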
* RE: cache miss increases when change rx descriptor from 512 to 2048
From: Xiaoping Yan (NSB) @ 2023-02-10 2:50 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: users
Hi Stephen,
OK, I get the point.
Thank you.
Br, Xiaoping
-----Original Message-----
From: Stephen Hemminger <stephen@networkplumber.org>
Sent: 10 February 2023 10:39
To: Xiaoping Yan (NSB) <xiaoping.yan@nokia-sbell.com>
Cc: users@dpdk.org
Subject: Re: cache miss increases when change rx descriptor from 512 to 2048
On Fri, 10 Feb 2023 01:59:02 +0000
"Xiaoping Yan (NSB)" <xiaoping.yan@nokia-sbell.com> wrote:
> [...]
The receive descriptors form a ring, so the mbuf returned by the hardware is the oldest one that was queued. My mistake, that would be FIFO.
So the oldest mbuf (and its associated ring entry) is the one the driver uses.