* [dpdk-users] eventdev performance
@ 2018-08-05 19:03 Anthony Hart
  2018-08-07  8:34 ` Van Haaren, Harry
  0 siblings, 1 reply; 5+ messages in thread
From: Anthony Hart @ 2018-08-05 19:03 UTC (permalink / raw)
  To: users
I’ve been taking some performance measurements with the eventdev_pipeline example application (to see how the eventdev library performs, DPDK 18.05), and I’m looking for some help in determining where the bottlenecks are in my testing.
I have 1 Rx, 1 Tx, 1 Scheduler and N worker cores (1 event_sw0 device).   In this configuration performance tops out with 3 workers (6 cores total), and adding more workers actually causes a reduction in throughput.   In my setup the peak is about 12Mpps.   The same setup running testpmd will reach >25Mpps using only 1 core.
This is the eventdev command line.
eventdev_pipeline -l 0,1-6 -w0000:02:00.0 --vdev event_sw0 -- -r2 -t4 -e8 -w70 -s1 -n0 -c128 -W0 -D
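For readers decoding the command line above: the -r, -t, -e and -w arguments are hexadecimal core masks, so each set bit selects one lcore. The following is a minimal sketch of that decoding; `mask_to_lcores` is a hypothetical helper name, not part of DPDK.

```python
def mask_to_lcores(hex_mask):
    """Return the lcore IDs selected by a hexadecimal core mask string."""
    value = int(hex_mask, 16)
    # Bit N set in the mask means lcore N is assigned to that role.
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


# Masks taken from the eventdev_pipeline command line in this message.
roles = {"rx (-r)": "2", "tx (-t)": "4", "sched (-e)": "8", "workers (-w)": "70"}
for role, mask in roles.items():
    print(role, mask_to_lcores(mask))
```

Read this way, the command assigns lcore 1 to RX, lcore 2 to TX, lcore 3 to the scheduler and lcores 4-6 to the three workers.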
This is the testpmd command line.
testpmd -w0000:02:00.0 -l 0,1 -- -i --nb-core 1 --numa --rxq 1 --txq 1 --port-topology=loop
I’m guessing that it’s either the RX or the scheduler that’s the bottleneck in my eventdev_pipeline setup.
So I first tried to use 2 cores for RX (-r6), performance went down.   It seems that configuring 2 RX cores still only sets up 1 h/w receive ring and access to that one ring is alternated between the two cores?    So that doesn’t help.
Next, I could use 2 scheduler cores, but how does that work? Do they again alternate?   In any case throughput is reduced by 50% in that test.
thanks for any insights,
tony
* Re: [dpdk-users] eventdev performance
  2018-08-05 19:03 [dpdk-users] eventdev performance Anthony Hart
@ 2018-08-07  8:34 ` Van Haaren, Harry
  2018-08-09 15:56   ` Anthony Hart
  0 siblings, 1 reply; 5+ messages in thread
From: Van Haaren, Harry @ 2018-08-07  8:34 UTC (permalink / raw)
  To: Anthony Hart, users
Hi Tony,
> -----Original Message-----
> From: users [mailto:users-bounces@dpdk.org] On Behalf Of Anthony Hart
> Sent: Sunday, August 5, 2018 8:03 PM
> To: users@dpdk.org
> Subject: [dpdk-users] eventdev performance
> 
> I’ve been doing some performance measurements with the eventdev_pipeline
> example application (to see how the eventdev library performs - dpdk 18.05)
> and I’m looking for some help in determining where the bottlenecks are in my
> testing.
If you have the "perf top" tool available, it is very useful in printing statistics
of where CPU cycles are spent during runtime. I use it regularly to identify
bottlenecks in the code for specific lcores.
> I have 1 Rx, 1 Tx, 1 Scheduler and N worker cores (1 event_sw0 device).   In
> this configuration performance tops out with 3 workers (6 cores total) and
> adding more workers actually causes a reduction in throughput.   In my setup
> this is about 12Mpps.   The same setup running testpmd will reach >25Mpps
> using only 1 core.
Raw forwarding of a packet is less work than forwarding and load-balancing
across multiple cores. More work means more CPU cycles spent per packet, hence fewer Mpps.
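As a rough illustration of the cycles-per-packet argument, the sketch below converts a packet rate into a per-packet cycle budget on a single core. The 2.0 GHz clock is an assumption (the thread does not state the CPU frequency), and `cycles_per_packet` is a hypothetical helper name.

```python
ASSUMED_CORE_HZ = 2.0e9  # assumption: the thread does not give the clock rate


def cycles_per_packet(core_hz, packets_per_second):
    """Cycles one core can spend on each packet at the given packet rate."""
    return core_hz / packets_per_second


for label, pps in (("testpmd, >25 Mpps", 25e6), ("eventdev_pipeline, ~12 Mpps", 12e6)):
    print(label, round(cycles_per_packet(ASSUMED_CORE_HZ, pps), 1), "cycles/packet")
```

Under that assumed clock, 25 Mpps leaves a core about 80 cycles per packet and 12 Mpps about 167, which gives a feel for how little per-event overhead any single core in the pipeline can afford.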
> This is the eventdev command line.
> eventdev_pipeline -l 0,1-6 -w0000:02:00.0 --vdev event_sw0 -- -r2 -t4 -e8 -w70 -s1 -n0 -c128 -W0 -D
The -W0 option requests zero cycles of work on each worker core.
This makes each of the 3 worker cores very fast in returning work to the
scheduler core, and puts extra pressure on the scheduler. Note that in a
real-world use-case you presumably want to do work on each of the worker
cores, so the command above (while valid for understanding how it works,
and performance of certain things) is not expected to be used in production.
I'm not sure how familiar you are with CPU caches, but it is worth understanding
that reading data "locally" from L1 or L2 cache is very fast compared to
communicating with another core.
Given that with -W0 the worker cores are very fast, the scheduler can rarely
read data locally - it almost always has to communicate with other cores.
Try adding -W1000 or so, and perhaps 4 worker cores. The 1000 cycles of work
per event mimic doing actual work on each event. 
> This is the testpmd command line.
> testpmd -w0000:02:00.0 -l 0,1 -- -i --nb-core 1 --numa --rxq 1 --txq 1 --port-topology=loop
> 
> 
> I’m guessing that it’s either the RX or Sched that’s the bottleneck in my
> eventdev_pipeline setup.
Given that you state testpmd is capable of forwarding at >25 mpps on your
platform it is safe to rule out RX, since testpmd is performing the RX in
that forwarding workload.
Which leaves the scheduler - and indeed the scheduler is probably what is
the limiting factor in this case.
> So I first tried to use 2 cores for RX (-r6), performance went down.   It
> seems that configuring 2 RX cores still only sets up 1 h/w receive ring and
> access to that one ring is alternated between the two cores?    So that
> doesn’t help.
Correct - it is invalid to use two CPU cores on a single RX queue without
some form of serialization (otherwise it causes race-conditions). The
eventdev_pipeline sample app helpfully provides that - but there is a performance
impact on doing so. Using two RX threads on a single RX queue is generally
not recommended.
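To make the serialization point concrete, here is a minimal sketch of two pollers sharing one queue behind a non-blocking try-lock. It is not the sample app's code: the lock, the `deque` standing in for the RX ring, and the name `drain_with_try_lock` are all illustrative assumptions.

```python
import threading
from collections import deque


def drain_with_try_lock(num_packets, num_pollers=2):
    """Drain one shared queue from several threads, one thread at a time."""
    rx_queue = deque(range(num_packets))  # stands in for a single RX ring
    in_use = threading.Lock()             # serializes access to the ring
    handled = [0] * num_pollers

    def poll(idx):
        while True:
            # Non-blocking attempt: if another poller owns the ring, retry.
            if not in_use.acquire(blocking=False):
                continue
            try:
                if not rx_queue:
                    return                # ring drained, this poller is done
                rx_queue.popleft()
                handled[idx] += 1
            finally:
                in_use.release()

    threads = [threading.Thread(target=poll, args=(i,)) for i in range(num_pollers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return handled


counts = drain_with_try_lock(10_000)
print(counts, sum(counts))
```

Every packet is handled exactly once, but only one poller makes progress at any instant; the second thread adds contention on the lock rather than extra receive capacity, which is consistent with the drop in throughput reported for -r6.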
> Next, I could use 2 scheduler cores,  but how does that work, do they again
> alternate?   In any case throughput is reduced by 50% in that test.
Yes, for the same reason. The event_sw0 PMD does not allow multiple threads
to run it at the same time, and hence the serialization is in place to ensure
that the results are valid.
> thanks for any insights,
> tony
Try the suggestion above of adding work to the worker cores - this should
"balance out" the current scheduling bottleneck, and place some more on
each worker core. Try values of 500, 750, 1000, 1500 and 2500 or so.
Apart from that, I should try to understand your intended use better.
Is this an academic investigation into the performance, or do you have
specific goals in mind? Is dynamic load-balancing as the event_sw provides
required, or would a simpler (and hence possibly more performant) method suffice?
Regards, -Harry
* Re: [dpdk-users] eventdev performance
  2018-08-07  8:34 ` Van Haaren, Harry
@ 2018-08-09 15:56   ` Anthony Hart
  2018-08-15 16:04     ` Van Haaren, Harry
  0 siblings, 1 reply; 5+ messages in thread
From: Anthony Hart @ 2018-08-09 15:56 UTC (permalink / raw)
  To: Van Haaren, Harry; +Cc: users
Hi Harry,
Thanks for the reply, please see responses inline
> On Aug 7, 2018, at 4:34 AM, Van Haaren, Harry <harry.van.haaren@intel.com> wrote:
> 
> Hi Tony,
> 
>> -----Original Message-----
>> From: users [mailto:users-bounces@dpdk.org] On Behalf Of Anthony Hart
>> Sent: Sunday, August 5, 2018 8:03 PM
>> To: users@dpdk.org
>> Subject: [dpdk-users] eventdev performance
>> 
>> I’ve been doing some performance measurements with the eventdev_pipeline
>> example application (to see how the eventdev library performs - dpdk 18.05)
>> and I’m looking for some help in determining where the bottlenecks are in my
>> testing.
> 
> If you have the "perf top" tool available, it is very useful in printing statistics
> of where CPU cycles are spent during runtime. I use it regularly to identify
> bottlenecks in the code for specific lcores.
Yes, I have perf. If there is something you’d like to see, I can post it.
> 
> 
>> I have 1 Rx, 1 Tx, 1 Scheduler and N worker cores (1 event_sw0 device).   In
>> this configuration performance tops out with 3 workers (6 cores total) and
>> adding more workers actually causes a reduction in throughput.   In my setup
>> this is about 12Mpps.   The same setup running testpmd will reach >25Mpps
>> using only 1 core.
> 
> Raw forwarding of a packet is less work than forwarding and load-balancing
> across multiple cores. More work means more CPU cycles spent per packet, hence less mpps.
ok.  
> 
> 
>> This is the eventdev command line.
>> eventdev_pipeline -l 0,1-6 -w0000:02:00.0 --vdev event_sw0 -- -r2 -t4 -e8 -w70 -s1 -n0 -c128 -W0 -D
> 
> The -W0 indicates to perform zero cycles of work on each worker core.
> This makes each of the 3 worker cores very fast in returning work to the
> scheduler core, and puts extra pressure on the scheduler. Note that in a
> real-world use-case you presumably want to do work on each of the worker
> cores, so the command above (while valid for understanding how it works,
> and performance of certain things) is not expected to be used in production.
> 
> I'm not sure how familiar you are with CPU caches, but it is worth understanding
> that reading this "locally" from L1 or L2 cache is very fast compared to
> communicating with another core.
> 
> Given that with -W0 the worker cores are very fast, the scheduler can rarely
> read data locally - it always has to communicate with other cores.
> 
> Try adding -W1000 or so, and perhaps 4 worker cores. The 1000 cycles of work
> per event mimic doing actual work on each event. 
Adding work with -W reduces performance.
I modified eventdev_pipeline to print the contents of rte_event_eth_rx_adapter_stats for the device.  In particular, I print the rx_enq_retry and rx_poll_count values for the receive thread.    Once I get to a load level where packets are dropped, I see that the number of retries equals or exceeds the poll count (as I increase the load, the retries exceed the poll count).
I think this indicates that the scheduler is not keeping up.  That could be (I assume) because the workers are not consuming fast enough.  However, if I increase the number of workers, the ratio of retries to poll_count (in the rx thread) goes up; for example, after adding 4 more workers the retries:poll ratio becomes 5:1.
Seems like this is indicating that the Scheduler is the bottleneck?
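The retry-to-poll comparison described above can be computed from two successive snapshots of the counters. The sketch below assumes the snapshots have already been copied into plain dictionaries keyed by the rx_enq_retry and rx_poll_count field names; the sample values are made up for illustration and are not measurements from this setup.

```python
def retry_to_poll_ratio(prev, curr):
    """Ratio of enqueue retries to RX polls between two counter snapshots.

    Returns None if no polls happened in the interval.
    """
    polls = curr["rx_poll_count"] - prev["rx_poll_count"]
    retries = curr["rx_enq_retry"] - prev["rx_enq_retry"]
    if polls == 0:
        return None
    return retries / polls


# Illustrative numbers only: 5000 retries against 1000 polls in the interval.
before = {"rx_poll_count": 10_000, "rx_enq_retry": 2_000}
after = {"rx_poll_count": 11_000, "rx_enq_retry": 7_000}
print(retry_to_poll_ratio(before, after))
```

Using deltas rather than the raw totals keeps the ratio tied to the current load level, so a ratio that rises as workers are added points at back-pressure from the event device rather than at history from an earlier test run.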
> 
> 
>> This is the testpmd command line.
>> testpmd -w0000:02:00.0 -l 0,1 -- -i --nb-core 1 --numa --rxq 1 --txq 1 --port-topology=loop
>> 
>> 
>> I’m guessing that it’s either the RX or Sched that’s the bottleneck in my
>> eventdev_pipeline setup.
> 
> Given that you state testpmd is capable of forwarding at >25 mpps on your
> platform it is safe to rule out RX, since testpmd is performing the RX in
> that forwarding workload.
> 
> Which leaves the scheduler - and indeed the scheduler is probably what is
> the limiting factor in this case.
yes seems so.
> 
> 
>> So I first tried to use 2 cores for RX (-r6), performance went down.   It
>> seems that configuring 2 RX cores still only sets up 1 h/w receive ring and
>> access to that one ring is alternated between the two cores?    So that
>> doesn’t help.
> 
> Correct - it is invalid to use two CPU cores on a single RX queue without
> some form of serialization (otherwise it causes race-conditions). The
> eventdev_pipeline sample app helpfully provides that - but there is a performance
> impact on doing so. Using two RX threads on a single RX queue is generally
> not recommended.
> 
> 
>> Next, I could use 2 scheduler cores,  but how does that work, do they again
>> alternate?   In any case throughput is reduced by 50% in that test.
> 
> Yes, for the same reason. The event_sw0 PMD does not allow multiple threads
> to run it at the same time, and hence the serialization is in place to ensure
> that the results are valid.
> 
> 
>> thanks for any insights,
>> tony
> 
> Try the suggestion above of adding work to the worker cores - this should
> "balance out" the current scheduling bottleneck, and place some more on
> each worker core. Try values of 500, 750, 1000, 1500 and 2500 or so.
> 
> Apart from that, I should try to understand your intended use better.
> Is this an academic investigation into the performance, or do you have
> specific goals in mind? Is dynamic load-balancing as the event_sw provides
> required, or would a simpler (and hence possibly more performant) method suffice?
> 
Our current app uses the standard testpmd style in which each core does rx->work->tx; the packets are spread across the cores using RSS in the Ethernet device.   This works fine provided the traffic is diverse.  Elephant flows are a problem, though, so we’d like the option of distributing the packets the way that eventdev_pipeline -p does (yes, I understand the implications for reordering).   So eventdev looks interesting, and I was trying to get an idea of what the performance implications of using it would be.
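The elephant-flow problem can be sketched with a toy model: per-flow placement (a modulo on the flow ID stands in for the NIC's RSS hash) pins every packet of a flow to one core, whereas per-packet placement spreads them regardless of flow. The traffic mix and both helper names are illustrative assumptions.

```python
from collections import Counter


def per_flow_load(packets, n_cores):
    """Packets per core when each flow is pinned to one core (RSS-like)."""
    load = Counter()
    for flow_id in packets:
        load[flow_id % n_cores] += 1  # modulo stands in for the RSS hash
    return [load[core] for core in range(n_cores)]


def per_packet_load(packets, n_cores):
    """Packets per core when packets are spread round-robin, ignoring flows."""
    load = Counter()
    for index, _ in enumerate(packets):
        load[index % n_cores] += 1
    return [load[core] for core in range(n_cores)]


# One elephant flow (ID 0, 9000 packets) plus 100 small flows of 10 packets.
packets = [0] * 9000 + [flow for flow in range(1, 101) for _ in range(10)]
print(per_flow_load(packets, 4))
print(per_packet_load(packets, 4))
```

In this model the core that owns the elephant flow carries 9250 of the 10000 packets under per-flow placement, while per-packet placement gives each of the four cores 2500, at the cost of possible reordering within a flow.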
> Regards, -Harry
* Re: [dpdk-users] eventdev performance
  2018-08-09 15:56   ` Anthony Hart
@ 2018-08-15 16:04     ` Van Haaren, Harry
  2018-08-20 16:05       ` Anthony Hart
  0 siblings, 1 reply; 5+ messages in thread
From: Van Haaren, Harry @ 2018-08-15 16:04 UTC (permalink / raw)
  To: Anthony Hart; +Cc: users
> From: Anthony Hart [mailto:ahart@domainhart.com]
> Sent: Thursday, August 9, 2018 4:56 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>
> Cc: users@dpdk.org
> Subject: Re: [dpdk-users] eventdev performance
> 
> Hi Harry,
> Thanks for the reply, please see responses inline
> 
> > On Aug 7, 2018, at 4:34 AM, Van Haaren, Harry <harry.van.haaren@intel.com>
> wrote:
> >
> > Hi Tony,
> >
> >> -----Original Message-----
> >> From: users [mailto:users-bounces@dpdk.org] On Behalf Of Anthony Hart
> >> Sent: Sunday, August 5, 2018 8:03 PM
> >> To: users@dpdk.org
> >> Subject: [dpdk-users] eventdev performance
> >>
> >> I’ve been doing some performance measurements with the eventdev_pipeline
> >> example application (to see how the eventdev library performs - dpdk
> 18.05)
> >> and I’m looking for some help in determining where the bottlenecks are in
> my
> >> testing.
> >
> > If you have the "perf top" tool available, it is very useful in printing
> statistics
> > of where CPU cycles are spent during runtime. I use it regularly to
> identify
> > bottlenecks in the code for specific lcores.
> 
> Yes I have perf if there is something you’d like to see I can post it.
I'll check the rest of your email first. Generally I use perf to check whether the
cycles on each core are being spent where expected. In this case, it might
be useful to look at the scheduler core and see where it is spending its time.
> >> I have 1 Rx, 1 Tx, 1 Scheduler and N worker cores (1 event_sw0 device).
> In
> >> this configuration performance tops out with 3 workers (6 cores total)
> and
> >> adding more workers actually causes a reduction in throughput.   In my
> setup
> >> this is about 12Mpps.   The same setup running testpmd will reach >25Mpps
> >> using only 1 core.
> >
> > Raw forwarding of a packet is less work than forwarding and load-balancing
> > across multiple cores. More work means more CPU cycles spent per packet,
> hence less mpps.
> 
> ok.
> 
> >
> >
> >> This is the eventdev command line.
> >> eventdev_pipeline -l 0,1-6 -w0000:02:00.0 --vdev event_sw0 -- -r2 -t4 -e8 -w70 -s1 -n0 -c128 -W0 -D
> >
> > The -W0 indicates to perform zero cycles of work on each worker core.
> > This makes each of the 3 worker cores very fast in returning work to the
> > scheduler core, and puts extra pressure on the scheduler. Note that in a
> > real-world use-case you presumably want to do work on each of the worker
> > cores, so the command above (while valid for understanding how it works,
> > and performance of certain things) is not expected to be used in
> production.
> >
> > I'm not sure how familiar you are with CPU caches, but it is worth
> understanding
> > that reading this "locally" from L1 or L2 cache is very fast compared to
> > communicating with another core.
> >
> > Given that with -W0 the worker cores are very fast, the scheduler can
> rarely
> > read data locally - it always has to communicate with other cores.
> >
> > Try adding -W1000 or so, and perhaps 4 worker cores. The 1000 cycles of
> work
> > per event mimic doing actual work on each event.
> 
> Adding work with -W reduces performance.
OK - that means that the worker cores are at least part of the bottleneck.
If they were very idle, adding some work to them would not have changed
the performance.
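One way to sanity-check this reasoning is a simple ceiling on what the workers alone could sustain if each event costs a fixed number of cycles. This is only a sketch: the 2.0 GHz clock is an assumption (the thread gives no CPU frequency), dequeue/enqueue overhead is ignored, and `worker_ceiling_pps` is a hypothetical helper name.

```python
ASSUMED_CORE_HZ = 2.0e9  # assumption: the thread does not give the clock rate


def worker_ceiling_pps(n_workers, core_hz, cycles_per_event):
    """Upper bound on events/second if workers are the only limit."""
    return n_workers * core_hz / cycles_per_event


for work_cycles in (500, 750, 1000, 1500, 2500):
    ceiling = worker_ceiling_pps(3, ASSUMED_CORE_HZ, work_cycles)
    print(f"-W{work_cycles}: at most {ceiling / 1e6:.1f} Mpps with 3 workers")
```

Under these assumptions, three workers at -W1000 top out at 6 Mpps, below the roughly 12 Mpps seen with -W0, so a drop in throughput after adding work is what this model would predict once the workers become the limiting stage.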
> I modified eventdev_pipeline to print the contents of
> rte_event_eth_rx_adapter_stats for the device.  In particular I print the
> rx_enq_retry and rx_poll_count values for the receive thread.    Once I get
> to a load level where packets are dropped I see that the number of retries
> equals or exceeds the poll count (as I increase the load the retries exceed
> the poll count).
>
> I think this indicates that the Scheduler is not keeping up.  That could be
> (I assume) because the workers are not consuming fast enough.  However if I
> increase the number of workers then the ratio of retry to poll_count (in the
> rx thread) goes up, for example adding 4 more workers and the retries:poll
> ratio becomes 5:1
> 
> Seems like this is indicating that the Scheduler is the bottleneck?
So I gather you have prototyped the pipeline you want to run with the
eventdev_pipeline sample app? Would you share the command line being
used with the eventdev_pipeline sample app, so I can try to reproduce / understand?
One of the easiest mistakes (that I make regularly :) is letting the RX/TX/Sched
cores overlap, which causes excessive work to be performed on one thread,
reducing overall performance.
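A quick way to catch overlapping role assignments is to AND the hexadecimal core masks pairwise. The sketch below does that; `overlapping_roles` is a hypothetical helper, and the second example assumes -r6 was combined with the otherwise unchanged masks, which the thread does not confirm.

```python
from itertools import combinations


def overlapping_roles(masks):
    """Return (role_a, role_b, shared_lcores) for every pair of roles that overlap."""
    clashes = []
    for (role_a, mask_a), (role_b, mask_b) in combinations(masks.items(), 2):
        shared = int(mask_a, 16) & int(mask_b, 16)
        if shared:
            lcores = [bit for bit in range(shared.bit_length()) if (shared >> bit) & 1]
            clashes.append((role_a, role_b, lcores))
    return clashes


# Masks from the original command line: no role shares an lcore.
print(overlapping_roles({"rx": "2", "tx": "4", "sched": "8", "worker": "70"}))
# Hypothetical: -r6 with the other masks left as they were.
print(overlapping_roles({"rx": "6", "tx": "4", "sched": "8", "worker": "70"}))
```

If the two-RX-core test did keep -t4, the second call shows RX and TX sharing lcore 2, which would be an instance of exactly this kind of overlap.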
> >> This is the testpmd command line.
> >> testpmd -w0000:02:00.0 -l 0,1 -- -i --nb-core 1 --numa --rxq 1 --txq 1 --port-topology=loop
> >>
> >>
> >> I’m guessing that it’s either the RX or Sched that’s the bottleneck in my
> >> eventdev_pipeline setup.
> >
> > Given that you state testpmd is capable of forwarding at >25 mpps on your
> > platform it is safe to rule out RX, since testpmd is performing the RX in
> > that forwarding workload.
> >
> > Which leaves the scheduler - and indeed the scheduler is probably what is
> > the limiting factor in this case.
> 
> yes seems so.
> 
> >
> >
> >> So I first tried to use 2 cores for RX (-r6), performance went down.   It
> >> seems that configuring 2 RX cores still only sets up 1 h/w receive ring
> and
> >> access to that one ring is alternated between the two cores?    So that
> >> doesn’t help.
> >
> > Correct - it is invalid to use two CPU cores on a single RX queue without
> > some form of serialization (otherwise it causes race-conditions). The
> > eventdev_pipeline sample app helpfully provides that - but there is a
> performance
> > impact on doing so. Using two RX threads on a single RX queue is generally
> > not recommended.
> >
> >
> >> Next, I could use 2 scheduler cores,  but how does that work, do they
> again
> >> alternate?   In any case throughput is reduced by 50% in that test.
> >
> > Yes, for the same reason. The event_sw0 PMD does not allow multiple
> threads
> > to run it at the same time, and hence the serialization is in place to
> ensure
> > that the results are valid.
> >
> >
> >> thanks for any insights,
> >> tony
> >
> > Try the suggestion above of adding work to the worker cores - this should
> > "balance out" the current scheduling bottleneck, and place some more on
> > each worker core. Try values of 500, 750, 1000, 1500 and 2500 or so.
> >
> > Apart from that, I should try to understand your intended use better.
> > Is this an academic investigation into the performance, or do you have
> > specific goals in mind? Is dynamic load-balancing as the event_sw provides
> > required, or would a simpler (and hence possibly more performant) method
> suffice?
> >
> 
> Our current app uses the standard testpmd style of each core does
> rx->work->tx, the packets are spread across the cores using RSS in the ethernet
> device.   This works fine provided the traffic is diverse.  Elephant flows
> are a problem though, so we’d like the option of distributing the packets in
> the way that eventdev_pipeline -p does (yes I understand implications with
> reordering).   So eventdev looks interesting.   So I was trying to get an
> idea of what the performance implication would be in using eventdev.
Yes, valid use case, you're on the right track I suppose. Have you thought about
what CPU budget you're willing to spend to get the functionality of dynamically
spreading (elephant or smaller) flows across cores?
* Re: [dpdk-users] eventdev performance
  2018-08-15 16:04     ` Van Haaren, Harry
@ 2018-08-20 16:05       ` Anthony Hart
  0 siblings, 0 replies; 5+ messages in thread
From: Anthony Hart @ 2018-08-20 16:05 UTC (permalink / raw)
  To: Van Haaren, Harry; +Cc: users
Hi Harry,
Here are two example command lines I’m using: the first with 1 worker, the second with 3 workers.
./examples/eventdev_pipeline/build/eventdev_pipeline -l 0-4 -w0000:02:00.0 --vdev event_sw0 -- -r2 -t4 -e8 -w 10 -s1 -n0 -c128 -W0 -D
./examples/eventdev_pipeline/build/eventdev_pipeline -l 0-6 -w0000:02:00.0 --vdev event_sw0 -- -r2 -t4 -e8 -s1 -n0 -c128 -W0 -D -w70
In terms of performance, one issue is the (apparent) bottleneck in the scheduler; the other is that I now have 3 cores (rx, tx and scheduler) that are not running my mission path.  Is there any way of scaling up the scheduler performance, i.e. adding more cores to the scheduling process?
many thanks 
tony