From: "Van Haaren, Harry" <harry.van.haaren@intel.com>
To: Anthony Hart <ahart@domainhart.com>, "users@dpdk.org" <users@dpdk.org>
Subject: Re: [dpdk-users] eventdev performance
Date: Tue, 7 Aug 2018 08:34:56 +0000
Message-ID: <E923DB57A917B54B9182A2E928D00FA65E298E7F@IRSMSX102.ger.corp.intel.com>
In-Reply-To: <3338AA01-BB97-494F-B39C-6A510D085C79@domainhart.com>
Hi Tony,
> -----Original Message-----
> From: users [mailto:users-bounces@dpdk.org] On Behalf Of Anthony Hart
> Sent: Sunday, August 5, 2018 8:03 PM
> To: users@dpdk.org
> Subject: [dpdk-users] eventdev performance
>
> I’ve been doing some performance measurements with the eventdev_pipeline
> example application (to see how the eventdev library performs - dpdk 18.05)
> and I’m looking for some help in determining where the bottlenecks are in my
> testing.
If you have the "perf top" tool available, it is very useful for showing where
CPU cycles are spent at runtime. I use it regularly to identify bottlenecks
in the code on specific lcores.
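For example, to watch a single lcore (the core number here is an assumption -
substitute whichever core your scheduler thread is pinned to; if I read the
-e8 coremask in your command line below correctly, that would be core 3):

    perf top -C 3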
> I have 1 Rx, 1 Tx, 1 Scheduler and N worker cores (1 sw_event0 device). In
> this configuration performance tops out with 3 workers (6 cores total) and
> adding more workers actually causes a reduction in throughput. In my setup
> this is about 12Mpps. The same setup running testpmd will reach >25Mpps
> using only 1 core.
Raw forwarding of a packet is less work than forwarding and load-balancing
across multiple cores. More work means more CPU cycles spent per packet, and hence lower mpps.
> This is the eventdev command line.
> eventdev_pipeline -l 0,1-6 -w0000:02:00.0 --vdev event_sw0 -- -r2 -t4 -e8 -
> w70 -s1 -n0 -c128 -W0 -D
The -W0 indicates to perform zero cycles of work on each worker core.
This makes each of the 3 worker cores very fast in returning work to the
scheduler core, and puts extra pressure on the scheduler. Note that in a
real-world use-case you presumably want to do work on each of the worker
cores, so the command above (while valid for understanding how the application
behaves, and for measuring certain aspects of its performance) is not
representative of a production configuration.
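To illustrate what -W changes, here is a minimal sketch of the kind of
per-event busy-spin a worker performs when -W <cycles> is non-zero
(simplified; not the exact eventdev_pipeline code):

    #include <rte_cycles.h>
    #include <rte_pause.h>

    /* Busy-spin for roughly 'worker_cycles' TSC cycles on the worker
     * lcore, mimicking real per-packet processing work. */
    static void
    do_fake_work(uint64_t worker_cycles)
    {
            uint64_t start = rte_rdtsc();
            while (rte_rdtsc() - start < worker_cycles)
                    rte_pause();
    }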
I'm not sure how familiar you are with CPU caches, but it is worth understanding
that reading data "locally" from the L1 or L2 cache is very fast compared to
communicating with another core.
Given that with -W0 the worker cores return events very quickly, the scheduler
can rarely read data locally - it almost always has to communicate with other cores.
Try adding -W1000 or so, and perhaps 4 worker cores. The 1000 cycles of work
per event mimic doing actual work on each event.
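As a sketch only, your command line adjusted for that (the -wf0 coremask and
-l 0,1-7 assume the 4th worker goes on core 7):

    eventdev_pipeline -l 0,1-7 -w0000:02:00.0 --vdev event_sw0 -- \
        -r2 -t4 -e8 -wf0 -s1 -n0 -c128 -W1000 -D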
> This is the testpmd command line.
> testpmd -w0000:02:00.0 -l 0,1 -- -i --nb-core 1 --numa --rxq 1 --txq 1 --
> port-topology=loop
>
>
> I’m guessing that its either the RX or Sched that’s the bottleneck in my
> eventdev_pipeline setup.
Given that you state testpmd is capable of forwarding at >25 mpps on your
platform, it is safe to rule out RX, since testpmd performs the RX in
that forwarding workload.
Which leaves the scheduler - and indeed the scheduler is probably the
limiting factor in this case.
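If you want to confirm that, the sw PMD keeps internal statistics. The -D
flag already in your command line should dump them at exit (if I recall
correctly); a minimal sketch for dumping them programmatically (dev_id 0
assumes event_sw0 is the only eventdev present):

    #include <stdio.h>
    #include <rte_eventdev.h>

    /* Dump the eventdev's port/queue statistics (credits, scheduler
     * counters, queue depths) to stdout. */
    static void
    dump_sched_stats(void)
    {
            const uint8_t dev_id = 0;   /* assumption: event_sw0 */
            if (rte_event_dev_dump(dev_id, stdout) < 0)
                    printf("eventdev %u: stats dump failed\n",
                           (unsigned int)dev_id);
    }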
> So I first tried to use 2 cores for RX (-r6), performance went down. It
> seems that configuring 2 RX cores still only sets up 1 h/w receive ring and
> access to that one ring is alternated between the two cores? So that
> doesn’t help.
Correct - it is invalid to use two CPU cores on a single RX queue without
some form of serialization (otherwise it causes race-conditions). The
eventdev_pipeline sample app helpfully provides that serialization - but
doing so has a performance impact. Using two RX threads on a single RX
queue is generally not recommended.
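If you do want to scale RX beyond one core, the usual approach is to give
each RX core its own hardware queue (e.g. via RSS) rather than sharing one
queue. A minimal sketch outside of eventdev_pipeline (names such as
RX_RING_SIZE and mbuf_pool are illustrative; this is not the sample app's code):

    #include <rte_ethdev.h>
    #include <rte_mempool.h>

    #define RX_RING_SIZE 512   /* illustrative descriptor count */

    /* Configure a port with two RX queues and RSS so that two RX lcores
     * can each poll their own queue with rte_eth_rx_burst() and never
     * touch the same queue - no serialization needed. */
    static int
    setup_two_rx_queues(uint16_t port_id, struct rte_mempool *mbuf_pool)
    {
            struct rte_eth_conf port_conf = {
                    .rxmode = { .mq_mode = ETH_MQ_RX_RSS },
                    .rx_adv_conf.rss_conf = { .rss_hf = ETH_RSS_IP },
            };
            int ret = rte_eth_dev_configure(port_id, 2 /* RX queues */,
                                            1 /* TX queue */, &port_conf);
            if (ret < 0)
                    return ret;

            for (uint16_t q = 0; q < 2; q++) {
                    ret = rte_eth_rx_queue_setup(port_id, q, RX_RING_SIZE,
                                    rte_eth_dev_socket_id(port_id),
                                    NULL, mbuf_pool);
                    if (ret < 0)
                            return ret;
            }
            /* TX queue setup and rte_eth_dev_start() follow as usual. */
            return 0;
    }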
> Next, I could use 2 scheduler cores, but how does that work, do they again
> alternate? In any case throughput is reduced by 50% in that test.
Yes, for the same reason. The event_sw0 PMD does not allow multiple threads
to run it at the same time, so serialization is in place to ensure that the
results are valid.
> thanks for any insights,
> tony
Try the suggestion above of adding work to the worker cores - this should
"balance out" the current scheduling bottleneck, and place some more of the
load on each worker core. Try values of 500, 750, 1000, 1500 and 2500 or so.
Apart from that, I would like to understand your intended use-case better.
Is this an academic investigation into the performance, or do you have
specific goals in mind? Is dynamic load-balancing as the event_sw provides
required, or would a simpler (and hence possibly more performant) method suffice?
Regards, -Harry