DPDK usage discussions
* cache miss increases when change rx descriptor from 512 to 2048
@ 2023-02-09  3:58 Xiaoping Yan (NSB)
  2023-02-09 16:38 ` Stephen Hemminger
  0 siblings, 1 reply; 5+ messages in thread
From: Xiaoping Yan (NSB) @ 2023-02-09  3:58 UTC (permalink / raw)
  To: users

Hi experts,

I ran a traffic throughput test for my DPDK application with the same software and test case; the only difference is the number of rx/tx descriptors:
Rx/tx descriptors 512: test result 3.2 Mpps
Rx/tx descriptors 2048: test result 3 Mpps
From the perf data, the 2048-descriptor case has more cache misses and lower instructions per cycle.
Perf for 512 rx descriptors:
      114289237792      cpu-cycles
      365408402395      instructions              #    3.20  insn per cycle
       74186289932      branches
          36020793      branch-misses             #    0.05% of all branches
        1298741388      bus-cycles
           3413460      cache-misses              #    0.723 % of all cache refs
         472363654      cache-references
Perf for 2048 rx descriptors:
       57038451185      cpu-cycles
      173805485573      instructions              #    3.05  insn per cycle
       35289607389      branches
          15418885      branch-misses             #    0.04% of all branches
         648164239      bus-cycles
          13170596      cache-misses              #    1.702 % of all cache refs
         773765263      cache-references

I understand this to mean that more rx descriptors somehow cause more cache misses, hence fewer instructions per cycle and lower performance.

Has anyone observed similar results?
Any ideas on how to mitigate (or investigate further) the impact? (We want to use 2048 to better tolerate jitter/bursts.)
Any comments?
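
For reference, a minimal sketch of how the descriptor count is set up with the ethdev API; only NB_DESC differs between the two runs. The port/queue ids, the mempool and the helper name are placeholders rather than our actual application code, and rte_eth_dev_configure() is assumed to have been called already:

#include <rte_ethdev.h>
#include <rte_mempool.h>

#define NB_DESC 2048    /* 512 in the faster run */

static int
setup_queues(uint16_t port_id, struct rte_mempool *mb_pool)
{
        uint16_t nb_rxd = NB_DESC, nb_txd = NB_DESC;
        int ret;

        /* Let the driver clamp the requested ring sizes to supported limits. */
        ret = rte_eth_dev_adjust_nb_rx_tx_desc(port_id, &nb_rxd, &nb_txd);
        if (ret != 0)
                return ret;

        /* Queue 0 only; the descriptor ring size is the third argument. */
        ret = rte_eth_rx_queue_setup(port_id, 0, nb_rxd,
                        rte_eth_dev_socket_id(port_id), NULL, mb_pool);
        if (ret != 0)
                return ret;

        return rte_eth_tx_queue_setup(port_id, 0, nb_txd,
                        rte_eth_dev_socket_id(port_id), NULL);
}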

Thank you.

Br, Xiaoping



* Re: cache miss increases when change rx descriptor from 512 to 2048
  2023-02-09  3:58 cache miss increases when change rx descriptor from 512 to 2048 Xiaoping Yan (NSB)
@ 2023-02-09 16:38 ` Stephen Hemminger
  2023-02-10  1:59   ` Xiaoping Yan (NSB)
  0 siblings, 1 reply; 5+ messages in thread
From: Stephen Hemminger @ 2023-02-09 16:38 UTC (permalink / raw)
  To: Xiaoping Yan (NSB); +Cc: users

On Thu, 9 Feb 2023 03:58:56 +0000
"Xiaoping Yan (NSB)" <xiaoping.yan@nokia-sbell.com> wrote:

> Hi experts,
> 
> I ran a traffic throughput test for my DPDK application with the same software and test case; the only difference is the number of rx/tx descriptors:
> Rx/tx descriptors 512: test result 3.2 Mpps
> Rx/tx descriptors 2048: test result 3 Mpps
> From the perf data, the 2048-descriptor case has more cache misses and lower instructions per cycle.
> Perf for 512 rx descriptors:
>       114289237792      cpu-cycles
>       365408402395      instructions              #    3.20  insn per cycle
>        74186289932      branches
>           36020793      branch-misses             #    0.05% of all branches
>         1298741388      bus-cycles
>            3413460      cache-misses              #    0.723 % of all cache refs
>          472363654      cache-references
> Perf for 2048 rx descriptors:
>        57038451185      cpu-cycles
>       173805485573      instructions              #    3.05  insn per cycle
>        35289607389      branches
>           15418885      branch-misses             #    0.04% of all branches
>          648164239      bus-cycles
>           13170596      cache-misses              #    1.702 % of all cache refs
>          773765263      cache-references
> 
> I understand this to mean that more rx descriptors somehow cause more cache misses, hence fewer instructions per cycle and lower performance.
> 
> Has anyone observed similar results?
> Any ideas on how to mitigate (or investigate further) the impact? (We want to use 2048 to better tolerate jitter/bursts.)
> Any comments?
> 
> Thank you.
> 
> Br, Xiaoping
> 

If the number of RX descriptors is small, there is a higher chance that, when the device driver
walks the descriptor table or uses the resulting mbuf, the data is still in cache.

With a large number of descriptors, since descriptors are used LIFO the rx descriptor and mbuf
will not be in cache.
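
A rough back-of-envelope sketch of why the bigger ring can hurt, assuming a 16-byte rx descriptor and an mbuf of roughly a 128-byte header plus 2 KB data room (real sizes depend on the NIC and the mempool configuration):

#include <stdio.h>

/* Assumed per-packet footprint: descriptor + mbuf header + data buffer. */
#define DESC_SIZE      16
#define MBUF_HDR_SIZE  128
#define DATA_ROOM      2048

static void
ring_footprint(unsigned int nb_desc)
{
        unsigned long bytes =
                (unsigned long)nb_desc * (DESC_SIZE + MBUF_HDR_SIZE + DATA_ROOM);
        printf("%u descriptors -> ~%lu KB touched per full trip around the ring\n",
               nb_desc, bytes / 1024);
}

int
main(void)
{
        ring_footprint(512);    /* ~1.1 MB: has a chance to stay cache-resident */
        ring_footprint(2048);   /* ~4.3 MB: likely evicted before reuse */
        return 0;
}

By the time a given descriptor and its mbuf come around again, the core has walked the whole working set above, so with 2048 entries they are much more likely to have been evicted.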


* RE: cache miss increases when change rx descriptor from 512 to 2048
  2023-02-09 16:38 ` Stephen Hemminger
@ 2023-02-10  1:59   ` Xiaoping Yan (NSB)
  2023-02-10  2:38     ` Stephen Hemminger
  0 siblings, 1 reply; 5+ messages in thread
From: Xiaoping Yan (NSB) @ 2023-02-10  1:59 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: users

Hi Stephen,

Could you elaborate a bit on "since descriptors are used LIFO the rx descriptor and mbuf will not be in cache"?
Does "last in" mean the mbuf that was just used and freed? And does "first out" mean that this last-in mbuf will be reused first, which would imply it has a good chance of still being in cache?

Thank you a lot.

Br, Xiaoping

-----Original Message-----
From: Stephen Hemminger <stephen@networkplumber.org> 
Sent: 10 February 2023 0:38
To: Xiaoping Yan (NSB) <xiaoping.yan@nokia-sbell.com>
Cc: users@dpdk.org
Subject: Re: cache miss increases when change rx descriptor from 512 to 2048

On Thu, 9 Feb 2023 03:58:56 +0000
"Xiaoping Yan (NSB)" <xiaoping.yan@nokia-sbell.com> wrote:

> Hi experts,
> 
> I ran a traffic throughput test for my DPDK application with the same software and test case; the only difference is the number of rx/tx descriptors:
> Rx/tx descriptors 512: test result 3.2 Mpps
> Rx/tx descriptors 2048: test result 3 Mpps
> From the perf data, the 2048-descriptor case has more cache misses and lower instructions per cycle.
> Perf for 512 rx descriptors:
>       114289237792      cpu-cycles
>       365408402395      instructions              #    3.20  insn per cycle
>        74186289932      branches
>           36020793      branch-misses             #    0.05% of all branches
>         1298741388      bus-cycles
>            3413460      cache-misses              #    0.723 % of all cache refs
>          472363654      cache-references
> Perf for 2048 rx descriptors:
>        57038451185      cpu-cycles
>       173805485573      instructions              #    3.05  insn per cycle
>        35289607389      branches
>           15418885      branch-misses             #    0.04% of all branches
>          648164239      bus-cycles
>           13170596      cache-misses              #    1.702 % of all cache refs
>          773765263      cache-references
> 
> I understand this to mean that more rx descriptors somehow cause more cache misses, hence fewer instructions per cycle and lower performance.
> 
> Has anyone observed similar results?
> Any ideas on how to mitigate (or investigate further) the impact? (We want to use 2048 to better tolerate jitter/bursts.)
> Any comments?
> 
> Thank you.
> 
> Br, Xiaoping
> 

If the number of RX descriptors is small, there is a higher chance that, when the device driver walks the descriptor table or uses the resulting mbuf, the data is still in cache.

With a large number of descriptors, since descriptors are used LIFO the rx descriptor and mbuf will not be in cache.


* Re: cache miss increases when change rx descriptor from 512 to 2048
  2023-02-10  1:59   ` Xiaoping Yan (NSB)
@ 2023-02-10  2:38     ` Stephen Hemminger
  2023-02-10  2:50       ` Xiaoping Yan (NSB)
  0 siblings, 1 reply; 5+ messages in thread
From: Stephen Hemminger @ 2023-02-10  2:38 UTC (permalink / raw)
  To: Xiaoping Yan (NSB); +Cc: users

On Fri, 10 Feb 2023 01:59:02 +0000
"Xiaoping Yan (NSB)" <xiaoping.yan@nokia-sbell.com> wrote:

> Hi Stephen,
> 
> Could you elaborate a bit on "since descriptors are used LIFO the rx descriptor and mbuf will not be in cache"?
> Does "last in" mean the mbuf that was just used and freed? And does "first out" mean that this last-in mbuf will be reused first, which would imply it has a good chance of still being in cache?
> 
> Thank you a lot.
> 
> Br, Xiaoping
> 
> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org> 
Sent: 10 February 2023 0:38
> To: Xiaoping Yan (NSB) <xiaoping.yan@nokia-sbell.com>
> Cc: users@dpdk.org
> Subject: Re: cache miss increases when change rx descriptor from 512 to 2048
> 
> On Thu, 9 Feb 2023 03:58:56 +0000
> "Xiaoping Yan (NSB)" <xiaoping.yan@nokia-sbell.com> wrote:
> 
> > Hi experts,
> > 
> > I ran a traffic throughput test for my DPDK application with the same software and test case; the only difference is the number of rx/tx descriptors:
> > Rx/tx descriptors 512: test result 3.2 Mpps
> > Rx/tx descriptors 2048: test result 3 Mpps
> > From the perf data, the 2048-descriptor case has more cache misses and lower instructions per cycle.
> > Perf for 512 rx descriptors:
> >       114289237792      cpu-cycles
> >       365408402395      instructions              #    3.20  insn per cycle
> >        74186289932      branches
> >           36020793      branch-misses             #    0.05% of all branches
> >         1298741388      bus-cycles
> >            3413460      cache-misses              #    0.723 % of all cache refs
> >          472363654      cache-references
> > Perf for 2048 rx descriptors:
> >        57038451185      cpu-cycles
> >       173805485573      instructions              #    3.05  insn per cycle
> >        35289607389      branches
> >           15418885      branch-misses             #    0.04% of all branches
> >          648164239      bus-cycles
> >           13170596      cache-misses              #    1.702 % of all cache refs
> >          773765263      cache-references
> > 
> > I understand this to mean that more rx descriptors somehow cause more cache misses, hence fewer instructions per cycle and lower performance.
> > 
> > Has anyone observed similar results?
> > Any ideas on how to mitigate (or investigate further) the impact? (We want to use 2048 to better tolerate jitter/bursts.)
> > Any comments?
> > 
> > Thank you.
> > 
> > Br, Xiaoping
> >   
> 
> If the number of RX descriptors is small, there is a higher chance that, when the device driver walks the descriptor table or uses the resulting mbuf, the data is still in cache.
> 
> With a large number of descriptors, since descriptors are used LIFO the rx descriptor and mbuf will not be in cache.

The receive descriptors are a ring, so the mbuf returned by the hardware is the oldest one
that was queued. My mistake, that would be FIFO.

So the oldest mbuf (and associated ring data) will be the one the driver uses.
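
To illustrate the FIFO behaviour, a simplified sketch of rx ring indexing; it is generic and not taken from any particular PMD:

#include <stdint.h>

#define RING_SIZE 2048                  /* assumed; power of two */

struct rx_ring {
        void     *slot[RING_SIZE];      /* stand-in for descriptor + mbuf */
        uint16_t  next_to_clean;        /* next slot the driver reads */
};

/* The driver always consumes the oldest filled slot.  With 2048 entries,
 * that slot (and its mbuf) was last touched by the CPU ~2047 refills ago,
 * so it has likely fallen out of cache by the time it is reused. */
static void *
rx_ring_next(struct rx_ring *r)
{
        void *m = r->slot[r->next_to_clean];
        r->next_to_clean = (r->next_to_clean + 1) & (RING_SIZE - 1);
        return m;
}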


* RE: cache miss increases when change rx descriptor from 512 to 2048
  2023-02-10  2:38     ` Stephen Hemminger
@ 2023-02-10  2:50       ` Xiaoping Yan (NSB)
  0 siblings, 0 replies; 5+ messages in thread
From: Xiaoping Yan (NSB) @ 2023-02-10  2:50 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: users

Hi Stephen,

OK, I get the point.
Thank you.

Br, Xiaoping

-----Original Message-----
From: Stephen Hemminger <stephen@networkplumber.org> 
Sent: 10 February 2023 10:39
To: Xiaoping Yan (NSB) <xiaoping.yan@nokia-sbell.com>
Cc: users@dpdk.org
Subject: Re: cache miss increases when change rx descriptor from 512 to 2048

On Fri, 10 Feb 2023 01:59:02 +0000
"Xiaoping Yan (NSB)" <xiaoping.yan@nokia-sbell.com> wrote:

> Hi Stephen,
> 
> Could you elaborate a bit on "since descriptors are used LIFO the rx descriptor and mbuf will not be in cache"?
> Does "last in" mean the mbuf that was just used and freed? And does "first out" mean that this last-in mbuf will be reused first, which would imply it has a good chance of still being in cache?
> 
> Thank you a lot.
> 
> Br, Xiaoping
> 
> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
Sent: 10 February 2023 0:38
> To: Xiaoping Yan (NSB) <xiaoping.yan@nokia-sbell.com>
> Cc: users@dpdk.org
> Subject: Re: cache miss increases when change rx descriptor from 512 
> to 2048
> 
> On Thu, 9 Feb 2023 03:58:56 +0000
> "Xiaoping Yan (NSB)" <xiaoping.yan@nokia-sbell.com> wrote:
> 
> > Hi experts,
> > 
> > I ran a traffic throughput test for my DPDK application with the same software and test case; the only difference is the number of rx/tx descriptors:
> > Rx/tx descriptors 512: test result 3.2 Mpps
> > Rx/tx descriptors 2048: test result 3 Mpps
> > From the perf data, the 2048-descriptor case has more cache misses and lower instructions per cycle.
> > Perf for 512 rx descriptors:
> >       114289237792      cpu-cycles
> >       365408402395      instructions              #    3.20  insn per cycle
> >        74186289932      branches
> >           36020793      branch-misses             #    0.05% of all branches
> >         1298741388      bus-cycles
> >            3413460      cache-misses              #    0.723 % of all cache refs
> >          472363654      cache-references
> > Perf for 2048 rx descriptors:
> >        57038451185      cpu-cycles
> >       173805485573      instructions              #    3.05  insn per cycle
> >        35289607389      branches
> >           15418885      branch-misses             #    0.04% of all branches
> >          648164239      bus-cycles
> >           13170596      cache-misses              #    1.702 % of all cache refs
> >          773765263      cache-references
> > 
> > I understand this to mean that more rx descriptors somehow cause more cache misses, hence fewer instructions per cycle and lower performance.
> > 
> > Has anyone observed similar results?
> > Any ideas on how to mitigate (or investigate further) the impact? (We want to use 2048 to better tolerate jitter/bursts.)
> > Any comments?
> > 
> > Thank you.
> > 
> > Br, Xiaoping
> >   
> 
> If the number of RX descriptors is small, there is a higher chance that, when the device driver walks the descriptor table or uses the resulting mbuf, the data is still in cache.
> 
> With a large number of descriptors, since descriptors are used LIFO the rx descriptor and mbuf will not be in cache.

The receive descriptors are a ring, so the mbuf returned by the hardware is the oldest one that was queued. My mistake, that would be FIFO.

So the oldest mbuf (and associated ring data) will be the one the driver uses.


