DPDK usage discussions
From: Anthony Hart <ahart@domainhart.com>
To: "Van Haaren, Harry" <harry.van.haaren@intel.com>
Cc: "users@dpdk.org" <users@dpdk.org>
Subject: Re: [dpdk-users] eventdev performance
Date: Mon, 20 Aug 2018 12:05:00 -0400
Message-ID: <33FF153C-EBBA-4124-8B8A-28688607452A@domainhart.com>
In-Reply-To: <E923DB57A917B54B9182A2E928D00FA65E2B4F21@IRSMSX102.ger.corp.intel.com>


Hi Harry,

Here are two example command lines I’m using: the first with 1 worker, the second with 3 workers.

./examples/eventdev_pipeline/build/eventdev_pipeline -l 0-4 -w0000:02:00.0 --vdev event_sw0 -- -r2 -t4 -e8 -w 10 -s1 -n0 -c128 -W0 -D


./examples/eventdev_pipeline/build/eventdev_pipeline -l 0-6 -w0000:02:00.0 --vdev event_sw0 -- -r2 -t4 -e8 -s1 -n0 -c128 -W0 -D -w70



In terms of performance, one issue is the (apparent) bottleneck in the scheduler; the other is that I now have 3 cores (rx, tx and scheduler) that are not running my mission path.  Is there any way of scaling up scheduler performance, i.e. adding more cores to the scheduling process?
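
For context, my understanding (a minimal sketch, not the sample app's exact
code) is that the scheduler core just spins calling the sw PMD's service,
which is multi-thread unsafe, so a second scheduling core would end up taking
turns through the serialization rather than adding throughput.  "force_quit"
below is a hypothetical run/stop flag:

#include <rte_eventdev.h>
#include <rte_service.h>

static volatile int force_quit;

static int
sched_core(void *arg)
{
	uint8_t dev_id = *(uint8_t *)arg;
	uint32_t service_id;

	/* the sw eventdev exposes its scheduling work as a service */
	if (rte_event_dev_service_id_get(dev_id, &service_id) != 0)
		return -1;

	while (!force_quit)
		/* serialize_mt_unsafe = 1: take the lock, the service is MT-unsafe */
		rte_service_run_iter_on_app_lcore(service_id, 1);
	return 0;
}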

many thanks 
tony

> On Aug 15, 2018, at 12:04 PM, Van Haaren, Harry <harry.van.haaren@intel.com> wrote:
> 
>> From: Anthony Hart [mailto:ahart@domainhart.com]
>> Sent: Thursday, August 9, 2018 4:56 PM
>> To: Van Haaren, Harry <harry.van.haaren@intel.com>
>> Cc: users@dpdk.org
>> Subject: Re: [dpdk-users] eventdev performance
>> 
>> Hi Harry,
>> Thanks for the reply, please see responses inline
>> 
>>> On Aug 7, 2018, at 4:34 AM, Van Haaren, Harry <harry.van.haaren@intel.com>
>> wrote:
>>> 
>>> Hi Tony,
>>> 
>>>> -----Original Message-----
>>>> From: users [mailto:users-bounces@dpdk.org] On Behalf Of Anthony Hart
>>>> Sent: Sunday, August 5, 2018 8:03 PM
>>>> To: users@dpdk.org
>>>> Subject: [dpdk-users] eventdev performance
>>>> 
>>>> I’ve been doing some performance measurements with the eventdev_pipeline
>>>> example application (to see how the eventdev library performs - dpdk 18.05)
>>>> and I’m looking for some help in determining where the bottlenecks are in
>>>> my testing.
>>> 
>>> If you have the "perf top" tool available, it is very useful in printing
>>> statistics of where CPU cycles are spent during runtime. I use it regularly
>>> to identify bottlenecks in the code for specific lcores.
>> 
>> Yes I have perf if there is something you’d like to see I can post it.
> 
> I'll check the rest of your email first. Generally I use perf to see whether
> the cycles being spent on each core are where I expect. In this case, it might
> be useful to look at the scheduler core and see where it is spending its time.
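
For example (assuming the scheduler lcore maps to CPU 3, which is what -e8
with -l 0-6 would give), running "perf top -C 3" shows which functions the
scheduler core is spending its cycles in.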
> 
> 
>>>> I have 1 Rx, 1 Tx, 1 Scheduler and N worker cores (1 sw_event0 device).
>>>> In this configuration performance tops out with 3 workers (6 cores total)
>>>> and adding more workers actually causes a reduction in throughput.  In my
>>>> setup this is about 12Mpps.  The same setup running testpmd will reach
>>>> >25Mpps using only 1 core.
>>> 
>>> Raw forwarding of a packet is less work than forwarding and load-balancing
>>> across multiple cores. More work means more CPU cycles spent per packet,
>>> hence less mpps.
>> 
>> ok.
>> 
>>> 
>>> 
>>>> This is the eventdev command line.
>>>> eventdev_pipeline -l 0,1-6 -w0000:02:00.0 --vdev event_sw0 -- -r2 -t4 -e8
>>>> -w70 -s1 -n0 -c128 -W0 -D
>>> 
>>> The -W0 indicates to perform zero cycles of work on each worker core.
>>> This makes each of the 3 worker cores very fast in returning work to the
>>> scheduler core, and puts extra pressure on the scheduler. Note that in a
>>> real-world use-case you presumably want to do work on each of the worker
>>> cores, so the command above (while valid for understanding how it works,
>>> and performance of certain things) is not expected to be used in
>>> production.
>>> 
>>> I'm not sure how familiar you are with CPU caches, but it is worth
>>> understanding that reading this "locally" from L1 or L2 cache is very fast
>>> compared to communicating with another core.
>>> 
>>> Given that with -W0 the worker cores are very fast, the scheduler can rarely
>>> read data locally - it always has to communicate with other cores.
>>> 
>>> Try adding -W1000 or so, and perhaps 4 worker cores. The 1000 cycles of work
>>> per event mimic doing actual work on each event.
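
As a rough illustration of what -W <cycles> emulates on each worker
(illustrative only, not the sample app's exact code): dequeue a burst, burn
roughly work_cycles TSC cycles per event to mimic real processing, then
forward the events back to the device:

#include <rte_eventdev.h>
#include <rte_cycles.h>
#include <rte_pause.h>

static void
worker_burst(uint8_t dev_id, uint8_t port_id, uint64_t work_cycles)
{
	struct rte_event ev[16];
	uint16_t nb, i;

	nb = rte_event_dequeue_burst(dev_id, port_id, ev, 16, 0);
	for (i = 0; i < nb; i++) {
		uint64_t start = rte_rdtsc();

		/* emulated per-event work */
		while (rte_rdtsc() - start < work_cycles)
			rte_pause();
		ev[i].op = RTE_EVENT_OP_FORWARD;
	}
	rte_event_enqueue_burst(dev_id, port_id, ev, nb);
}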
>> 
>> Adding work with -W reduces performance.
> 
> OK - that means that the worker cores are at least part of the bottleneck.
> If they were very idle, adding some work to them would not have changed
> the performance.
> 
>> I modified eventdev_pipeline to print the contents of
>> rte_event_eth_rx_adapter_stats for the device.  In particular I print the
>> rx_enq_retry and rx_poll_count values for the receive thread.  Once I get
>> to a load level where packets are dropped I see that the number of retries
>> equals or exceeds the poll count (and as I increase the load, the retries
>> exceed the poll count).
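
(For reference, the kind of instrumentation I mean is roughly the following
sketch; it is illustrative rather than the exact modification, and
rx_adapter_id is whatever id the adapter was created with:)

#include <stdio.h>
#include <inttypes.h>
#include <rte_event_eth_rx_adapter.h>

static void
dump_rx_adapter_stats(uint8_t rx_adapter_id)
{
	struct rte_event_eth_rx_adapter_stats stats;

	if (rte_event_eth_rx_adapter_stats_get(rx_adapter_id, &stats) == 0)
		printf("rx_poll_count=%" PRIu64 " rx_enq_retry=%" PRIu64 "\n",
		       stats.rx_poll_count, stats.rx_enq_retry);
}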
>> 
>> I think this indicates that the Scheduler is not keeping up.  That could be
>> (I assume) because the workers are not consuming fast enough.  However if I
>> increase the number of workers then the ratio of retries to poll_count (in
>> the rx thread) goes up; for example, adding 4 more workers makes the
>> retries:poll ratio 5:1.
>> 
>> Seems like this is indicating that the Scheduler is the bottleneck?
> 
> So I gather you have prototyped the pipeline you want to run with the
> eventdev_pipeline sample app? Would you share the command line being
> used with the eventdev_pipeline sample app, and I can try reproduce / understand.
> 
> One of the easiest mistakes (that I make regularly :) is that the RX/TX/Sched
> core overlap, which causes excessive work to be performed on one thread,
> reducing overall performance.
> 
> 
>>>> This is the testpmd command line.
>>>> testpmd -w0000:02:00.0 -l 0,1 -- -i --nb-core 1 --numa --rxq 1 --txq 1
>>>> --port-topology=loop
>>>> 
>>>> 
>>>> I’m guessing that its either the RX or Sched that’s the bottleneck in my
>>>> eventdev_pipeline setup.
>>> 
>>> Given that you state testpmd is capable of forwarding at >25 mpps on your
>>> platform it is safe to rule out RX, since testpmd is performing the RX in
>>> that forwarding workload.
>>> 
>>> Which leaves the scheduler - and indeed the scheduler is probably what is
>>> the limiting factor in this case.
>> 
>> yes seems so.
>> 
>>> 
>>> 
>>>> So I first tried to use 2 cores for RX (-r6), but performance went down.
>>>> It seems that configuring 2 RX cores still only sets up 1 h/w receive ring
>>>> and access to that one ring is alternated between the two cores?  So that
>>>> doesn’t help.
>>> 
>>> Correct - it is invalid to use two CPU cores on a single RX queue without
>>> some form of serialization (otherwise it causes race-conditions). The
>>> eventdev_pipeline sample app helpfully provides that - but there is a
>>> performance
>>> impact on doing so. Using two RX threads on a single RX queue is generally
>>> not recommended.
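
(As an aside, the usual way to use two RX cores without sharing a queue is to
configure one RX queue per core and let RSS spread flows across the queues; a
minimal sketch, with hypothetical port id and queue counts:)

#include <rte_ethdev.h>

static int
configure_two_rx_queues(uint16_t port_id)
{
	struct rte_eth_conf conf = {
		.rxmode = { .mq_mode = ETH_MQ_RX_RSS },
		.rx_adv_conf = { .rss_conf = { .rss_hf = ETH_RSS_IP } },
	};

	/* 2 RX queues, 1 TX queue; each RX lcore then polls only its own
	 * queue with rte_eth_rx_burst(port_id, its_own_queue_id, ...) */
	return rte_eth_dev_configure(port_id, 2, 1, &conf);
}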
>>> 
>>> 
>>>> Next, I could use 2 scheduler cores, but how does that work? Do they again
>>>> alternate?  In any case throughput is reduced by 50% in that test.
>>> 
>>> Yes, for the same reason. The event_sw0 PMD does not allow multiple threads
>>> to run it at the same time, and hence the serialization is in place to
>>> ensure that the results are valid.
>>> 
>>> 
>>>> thanks for any insights,
>>>> tony
>>> 
>>> Try the suggestion above of adding work to the worker cores - this should
>>> "balance out" the current scheduling bottleneck, and place some more on
>>> each worker core. Try values of 500, 750, 1000, 1500 and 2500 or so.
>>> 
>>> Apart from that, I should try to understand your intended use better.
>>> Is this an academic investigation into the performance, or do you have
>>> specific goals in mind? Is dynamic load-balancing as the event_sw provides
>>> required, or would a simpler (and hence possibly more performant) method
>>> suffice?
>>> 
>> 
>> Our current app uses the standard testpmd style where each core does
>> rx->work->tx; the packets are spread across the cores using RSS in the
>> ethernet device.  This works fine provided the traffic is diverse.  Elephant
>> flows are a problem though, so we’d like the option of distributing the
>> packets in the way that eventdev_pipeline -p does (yes I understand the
>> implications with reordering).  So eventdev looks interesting, and I was
>> trying to get an idea of what the performance implications of using
>> eventdev would be.
> 
> Yes, valid use case, you're on the right track I suppose. Have you thought about
> what CPU budget you're willing to spend to get the functionality of dynamically
> spreading (elephant or smaller) flows across cores?
> 


Thread overview: 5+ messages
2018-08-05 19:03 Anthony Hart
2018-08-07  8:34 ` Van Haaren, Harry
2018-08-09 15:56   ` Anthony Hart
2018-08-15 16:04     ` Van Haaren, Harry
2018-08-20 16:05       ` Anthony Hart [this message]
