From: Anthony Hart <ahart@domainhart.com>
To: "Van Haaren, Harry" <harry.van.haaren@intel.com>
Cc: "users@dpdk.org" <users@dpdk.org>
Subject: Re: [dpdk-users] eventdev performance
Date: Thu, 9 Aug 2018 11:56:07 -0400
Message-ID: <9E092979-55BD-4AA8-9785-4D660E84105F@domainhart.com>
In-Reply-To: <E923DB57A917B54B9182A2E928D00FA65E298E7F@IRSMSX102.ger.corp.intel.com>

Hi Harry,
Thanks for the reply; please see my responses inline.

> On Aug 7, 2018, at 4:34 AM, Van Haaren, Harry <harry.van.haaren@intel.com> wrote:
> 
> Hi Tony,
> 
>> -----Original Message-----
>> From: users [mailto:users-bounces@dpdk.org] On Behalf Of Anthony Hart
>> Sent: Sunday, August 5, 2018 8:03 PM
>> To: users@dpdk.org
>> Subject: [dpdk-users] eventdev performance
>> 
>> I’ve been doing some performance measurements with the eventdev_pipeline
>> example application (to see how the eventdev library performs - dpdk 18.05)
>> and I’m looking for some help in determining where the bottlenecks are in my
>> testing.
> 
> If you have the "perf top" tool available, it is very useful in printing statistics
> of where CPU cycles are spent during runtime. I use it regularly to identify
> bottlenecks in the code for specific lcores.

Yes, I have perf; if there is something you’d like to see, I can post it.
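For example, I can restrict it to the scheduler core (which should be core 3 here, given -e8) with:

    perf top -C 3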

> 
> 
>> I have 1 Rx, 1 Tx, 1 Scheduler and N worker cores (1 sw_event0 device).   In
>> this configuration performance tops out with 3 workers (6 cores total) and
>> adding more workers actually causes a reduction in throughput.   In my setup
>> this is about 12Mpps.   The same setup running testpmd will reach >25Mpps
>> using only 1 core.
> 
> Raw forwarding of a packet is less work than forwarding and load-balancing
> across multiple cores. More work means more CPU cycles spent per packet, hence less mpps.

ok.  

> 
> 
>> This is the eventdev command line.
>> eventdev_pipeline -l 0,1-6 -w0000:02:00.0 --vdev event_sw0 -- -r2 -t4 -e8 -w70 -s1 -n0 -c128 -W0 -D
> 
> The -W0 indicates to perform zero cycles of work on each worker core.
> This makes each of the 3 worker cores very fast in returning work to the
> scheduler core, and puts extra pressure on the scheduler. Note that in a
> real-world use-case you presumably want to do work on each of the worker
> cores, so the command above (while valid for understanding how it works,
> and performance of certain things) is not expected to be used in production.
> 
> I'm not sure how familiar you are with CPU caches, but it is worth understanding
> that reading this "locally" from L1 or L2 cache is very fast compared to
> communicating with another core.
> 
> Given that with -W0 the worker cores are very fast, the scheduler can rarely
> read data locally - it always has to communicate with other cores.
> 
> Try adding -W1000 or so, and perhaps 4 worker cores. The 1000 cycles of work
> per event mimic doing actual work on each event. 

Adding work with -W reduces performance.
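
(For reference, my understanding is that -W spins for roughly that many TSC cycles per event, along the lines of this sketch; this is my reading of the option, not the sample app’s exact code.)

    #include <rte_cycles.h>
    #include <rte_pause.h>

    /* Busy-wait for ~worker_cycles TSC ticks to mimic per-event work. */
    static inline void
    fake_work(uint64_t worker_cycles)
    {
        uint64_t deadline = rte_rdtsc() + worker_cycles;

        while (rte_rdtsc() < deadline)
            rte_pause();
    }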

I modified eventdev_pipeline to print the contents of rte_event_eth_rx_adapter_stats for the device. In particular, I print the rx_enq_retry and rx_poll_count values for the receive thread. Once I reach a load level where packets are dropped, I see that the number of retries equals or exceeds the poll count (and as I increase the load further, the retries exceed the poll count).
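
The check I added is essentially the following (a rough sketch from memory; adapter id 0 is an assumption):

    #include <stdio.h>
    #include <inttypes.h>
    #include <rte_event_eth_rx_adapter.h>

    /* Print the Rx adapter counters I'm watching; adapter id 0 assumed. */
    static void
    dump_rx_adapter_stats(void)
    {
        struct rte_event_eth_rx_adapter_stats stats;

        if (rte_event_eth_rx_adapter_stats_get(0, &stats) != 0)
            return;

        printf("rx_poll_count=%" PRIu64 " rx_enq_retry=%" PRIu64 "\n",
               stats.rx_poll_count, stats.rx_enq_retry);
    }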

I think this indicates that the scheduler is not keeping up. That could be (I assume) because the workers are not consuming fast enough. However, if I increase the number of workers, the ratio of retries to poll count (in the rx thread) goes up; for example, adding 4 more workers brings the retries:polls ratio to 5:1.

That seems to indicate that the scheduler is the bottleneck?
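
If it would help confirm that, I can also dump the event_sw xstats; something like this sketch (event device id 0 assumed) should show where events back up:

    #include <stdio.h>
    #include <inttypes.h>
    #include <rte_eventdev.h>

    /* Dump all device-level xstats for event device 0 (assumed id). */
    static void
    dump_evdev_xstats(void)
    {
        struct rte_event_dev_xstats_name names[512];
        unsigned int ids[512];
        uint64_t values[512];
        int n, i;

        n = rte_event_dev_xstats_names_get(0, RTE_EVENT_DEV_XSTATS_DEVICE,
                                           0, names, ids, 512);
        if (n <= 0 || n > 512)
            return;

        if (rte_event_dev_xstats_get(0, RTE_EVENT_DEV_XSTATS_DEVICE,
                                     0, ids, values, n) != n)
            return;

        for (i = 0; i < n; i++)
            printf("%s: %" PRIu64 "\n", names[i].name, values[i]);
    }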


> 
> 
>> This is the testpmd command line.
>> testpmd -w0000:02:00.0 -l 0,1 -- -i --nb-core 1 --numa --rxq 1 --txq 1 --port-topology=loop
>> 
>> 
>> I’m guessing that it’s either the RX or Sched that’s the bottleneck in my
>> eventdev_pipeline setup.
> 
> Given that you state testpmd is capable of forwarding at >25 mpps on your
> platform it is safe to rule out RX, since testpmd is performing the RX in
> that forwarding workload.
> 
> Which leaves the scheduler - and indeed the scheduler is probably what is
> the limiting factor in this case.

Yes, it seems so.

> 
> 
>> So I first tried to use 2 cores for RX (-r6), and performance went down. It
>> seems that configuring 2 RX cores still only sets up 1 h/w receive ring and
>> access to that one ring is alternated between the two cores? So that
>> doesn’t help.
> 
> Correct - it is invalid to use two CPU cores on a single RX queue without
> some form of serialization (otherwise it causes race-conditions). The
> eventdev_pipeline sample app helpfully provides that - but there is a performance
> impact on doing so. Using two RX threads on a single RX queue is generally
> not recommended.
> 
> 
>> Next, I could use 2 scheduler cores, but how does that work? Do they again
>> alternate? In any case, throughput is reduced by 50% in that test.
> 
> Yes, for the same reason. The event_sw0 PMD does not allow multiple threads
> to run it at the same time, and hence the serialization is in place to ensure
> that the results are valid.
> 
> 
>> thanks for any insights,
>> tony
> 
> Try the suggestion above of adding work to the worker cores - this should
> "balance out" the current scheduling bottleneck, and place some more on
> each worker core. Try values of 500, 750, 1000, 1500 and 2500 or so.
> 
> Apart from that, I should try to understand your intended use better.
> Is this an academic investigation into the performance, or do you have
> specific goals in mind? Is dynamic load-balancing as the event_sw provides
> required, or would a simpler (and hence possibly more performant) method suffice?
> 

Our current app uses the standard testpmd style, where each core does rx->work->tx and packets are spread across the cores using RSS in the ethernet device. This works fine provided the traffic is diverse. Elephant flows are a problem though, so we’d like the option of distributing packets the way eventdev_pipeline -p does (yes, I understand the implications for reordering). So eventdev looks interesting, and I was trying to get an idea of what the performance implication of using it would be.
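
If we do go this way, my understanding is that spreading a single heavy flow across workers means configuring the stage queue for parallel scheduling rather than the default atomic, roughly like this sketch (dev id 0 and queue id 0 are assumptions):

    #include <rte_eventdev.h>

    /* A queue using parallel scheduling, so events of one flow can be
     * handed to several workers at once (ordering then becomes the
     * application's problem).  dev_id 0 / queue_id 0 assumed. */
    static int
    setup_parallel_queue(void)
    {
        struct rte_event_queue_conf qconf = {
            .schedule_type = RTE_SCHED_TYPE_PARALLEL,
            .priority = RTE_EVENT_DEV_PRIORITY_NORMAL,
        };

        return rte_event_queue_setup(0, 0, &qconf);
    }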



> Regards, -Harry
