From: Anthony Hart <ahart@domainhart.com>
To: "Van Haaren, Harry"
Cc: users@dpdk.org
Date: Mon, 20 Aug 2018 12:05:00 -0400
Subject: Re: [dpdk-users] eventdev performance

Hi Harry,

Here are two example command lines I'm using, the first with 1 worker and the second with 3 workers.

./examples/eventdev_pipeline/build/eventdev_pipeline -l 0-4 -w0000:02:00.0 --vdev event_sw0 -- -r2 -t4 -e8 -w 10 -s1 -n0 -c128 -W0 -D

./examples/eventdev_pipeline/build/eventdev_pipeline -l 0-6 -w0000:02:00.0 --vdev event_sw0 -- -r2 -t4 -e8 -s1 -n0 -c128 -W0 -D -w70

In terms of performance, one issue is the (apparent) bottleneck in the scheduler; the other issue is that I now have 3 cores (rx, tx and schedule) that are not running my mission path. Is there any way of scaling up the scheduler performance, i.e. adding more cores to the scheduling process?
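(My understanding, from your earlier reply below, is that the event_sw scheduling logic is a single MT-unsafe service, so one dedicated lcore ends up driving it in a loop roughly like the sketch here. This is only an illustration of how I picture it, not the sample app's actual code; dev_id, keep_running and scheduler_loop are placeholder names.)

#include <rte_eventdev.h>
#include <rte_service.h>

static volatile int keep_running = 1;   /* placeholder run flag */

/* Sketch: drive the sw eventdev's scheduling service from one lcore. */
static int
scheduler_loop(void *arg)
{
        uint8_t dev_id = *(const uint8_t *)arg;
        uint32_t service_id;

        /* the sw PMD registers its scheduling logic as a service */
        if (rte_event_dev_service_id_get(dev_id, &service_id) != 0)
                return -1;

        while (keep_running)
                /* MT-unsafe service: only one thread may run it at a time */
                rte_service_run_iter_on_app_lcore(service_id, 1);

        return 0;
}

If that picture is right, it seems like only a faster core (or less scheduling work per event) helps, rather than more scheduler cores.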
many thanks
tony

> On Aug 15, 2018, at 12:04 PM, Van Haaren, Harry wrote:
>
>> From: Anthony Hart [mailto:ahart@domainhart.com]
>> Sent: Thursday, August 9, 2018 4:56 PM
>> To: Van Haaren, Harry
>> Cc: users@dpdk.org
>> Subject: Re: [dpdk-users] eventdev performance
>>
>> Hi Harry,
>> Thanks for the reply, please see responses inline.
>>
>>> On Aug 7, 2018, at 4:34 AM, Van Haaren, Harry wrote:
>>>
>>> Hi Tony,
>>>
>>>> -----Original Message-----
>>>> From: users [mailto:users-bounces@dpdk.org] On Behalf Of Anthony Hart
>>>> Sent: Sunday, August 5, 2018 8:03 PM
>>>> To: users@dpdk.org
>>>> Subject: [dpdk-users] eventdev performance
>>>>
>>>> I've been doing some performance measurements with the eventdev_pipeline
>>>> example application (to see how the eventdev library performs - dpdk 18.05)
>>>> and I'm looking for some help in determining where the bottlenecks are in
>>>> my testing.
>>>
>>> If you have the "perf top" tool available, it is very useful in printing
>>> statistics of where CPU cycles are spent during runtime. I use it regularly
>>> to identify bottlenecks in the code for specific lcores.
>>
>> Yes, I have perf; if there is something you'd like to see I can post it.
>
> I'll check the rest of your email first. Generally I use perf to check whether
> the cycles being spent on each core are where I expect them to be. In this
> case, it might be useful to look at the scheduler core and see where it is
> spending its time.
>
>>>> I have 1 Rx, 1 Tx, 1 Scheduler and N worker cores (1 sw_event0 device). In
>>>> this configuration performance tops out with 3 workers (6 cores total) and
>>>> adding more workers actually causes a reduction in throughput. In my setup
>>>> this is about 12Mpps. The same setup running testpmd will reach >25Mpps
>>>> using only 1 core.
>>>
>>> Raw forwarding of a packet is less work than forwarding and load-balancing
>>> across multiple cores. More work means more CPU cycles spent per packet,
>>> hence fewer mpps.
>>
>> ok.
>>
>>>> This is the eventdev command line.
>>>> eventdev_pipeline -l 0,1-6 -w0000:02:00.0 --vdev event_sw0 -- -r2 -t4 -e8
>>>> -w70 -s1 -n0 -c128 -W0 -D
>>>
>>> The -W0 indicates to perform zero cycles of work on each worker core.
>>> This makes each of the 3 worker cores very fast in returning work to the
>>> scheduler core, and puts extra pressure on the scheduler. Note that in a
>>> real-world use-case you presumably want to do work on each of the worker
>>> cores, so the command above (while valid for understanding how it works,
>>> and the performance of certain things) is not expected to be used in
>>> production.
>>>
>>> I'm not sure how familiar you are with CPU caches, but it is worth
>>> understanding that reading data "locally" from L1 or L2 cache is very fast
>>> compared to communicating with another core.
>>>
>>> Given that with -W0 the worker cores are very fast, the scheduler can rarely
>>> read data locally - it always has to communicate with other cores.
>>>
>>> Try adding -W1000 or so, and perhaps 4 worker cores. The 1000 cycles of work
>>> per event mimic doing actual work on each event.
>>
>> Adding work with -W reduces performance.
>
> OK - that means that the worker cores are at least part of the bottleneck.
> If they were very idle, adding some work to them would not have changed
> the performance.
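Just to be sure I understand what -W mimics: I take it "N cycles of work" per event is essentially a busy-wait on the TSC, roughly like the sketch below? (do_fake_work() is only my illustration, not necessarily how the sample app implements it.)

#include <rte_cycles.h>
#include <rte_pause.h>

/* Sketch: burn roughly 'cycles' CPU cycles per event, the way a -W
 * option might simulate real per-packet work. */
static inline void
do_fake_work(uint64_t cycles)
{
        uint64_t start = rte_rdtsc();

        while (rte_rdtsc() - start < cycles)
                rte_pause();    /* yield to the hyper-thread sibling */
}

With -W0 that loop is skipped entirely, so the workers hand events straight back and the scheduler never gets to work out of warm cache, which matches your explanation above.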
>
>> I modified eventdev_pipeline to print the contents of
>> rte_event_eth_rx_adapter_stats for the device. In particular I print the
>> rx_enq_retry and rx_poll_count values for the receive thread. Once I get
>> to a load level where packets are dropped I see that the number of retries
>> equals or exceeds the poll count (as I increase the load the retries exceed
>> the poll count).
>>
>> I think this indicates that the Scheduler is not keeping up. That could be
>> (I assume) because the workers are not consuming fast enough. However, if I
>> increase the number of workers then the ratio of retries to poll_count (in
>> the rx thread) goes up; for example, adding 4 more workers, the retries:poll
>> ratio becomes 5:1.
>>
>> Seems like this is indicating that the Scheduler is the bottleneck?
>
> So I gather you have prototyped the pipeline you want to run with the
> eventdev_pipeline sample app? Would you share the command line being
> used with the eventdev_pipeline sample app, and I can try to reproduce /
> understand.
>
> One of the easiest mistakes (that I make regularly :) is letting the
> RX/TX/Sched cores overlap, which causes excessive work to be performed on
> one thread, reducing overall performance.
>
>>>> This is the testpmd command line.
>>>> testpmd -w0000:02:00.0 -l 0,1 -- -i --nb-core 1 --numa --rxq 1 --txq 1
>>>> --port-topology=loop
>>>>
>>>> I'm guessing that it's either the RX or Sched that's the bottleneck in my
>>>> eventdev_pipeline setup.
>>>
>>> Given that you state testpmd is capable of forwarding at >25 mpps on your
>>> platform it is safe to rule out RX, since testpmd is performing the RX in
>>> that forwarding workload.
>>>
>>> Which leaves the scheduler - and indeed the scheduler is probably what is
>>> the limiting factor in this case.
>>
>> yes, seems so.
>>
>>>> So I first tried to use 2 cores for RX (-r6); performance went down. It
>>>> seems that configuring 2 RX cores still only sets up 1 h/w receive ring
>>>> and access to that one ring is alternated between the two cores? So that
>>>> doesn't help.
>>>
>>> Correct - it is invalid to use two CPU cores on a single RX queue without
>>> some form of serialization (otherwise it causes race-conditions). The
>>> eventdev_pipeline sample app helpfully provides that - but there is a
>>> performance impact in doing so. Using two RX threads on a single RX queue
>>> is generally not recommended.
>>>
>>>> Next, I could use 2 scheduler cores, but how does that work, do they again
>>>> alternate? In any case throughput is reduced by 50% in that test.
>>>
>>> Yes, for the same reason. The event_sw0 PMD does not allow multiple threads
>>> to run it at the same time, and hence the serialization is in place to
>>> ensure that the results are valid.
>>>
>>>> thanks for any insights,
>>>> tony
>>>
>>> Try the suggestion above of adding work to the worker cores - this should
>>> "balance out" the current scheduling bottleneck, and place some more on
>>> each worker core. Try values of 500, 750, 1000, 1500 and 2500 or so.
>>>
>>> Apart from that, I should try to understand your intended use better.
>>> Is this an academic investigation into the performance, or do you have
>>> specific goals in mind? Is dynamic load-balancing as the event_sw provides
>>> required, or would a simpler (and hence possibly more performant) method
>>> suffice?
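Regarding the Rx adapter counters I mentioned further up: the instrumentation I added is roughly the sketch below (print_rx_adapter_stats() is my own helper, and the adapter id of 0 is an assumption of my setup).

#include <inttypes.h>
#include <stdio.h>
#include <rte_event_eth_rx_adapter.h>

/* Sketch: print how often the Rx adapter had to retry enqueuing into
 * the eventdev versus how often it polled the NIC. */
static void
print_rx_adapter_stats(void)
{
        struct rte_event_eth_rx_adapter_stats stats;

        if (rte_event_eth_rx_adapter_stats_get(0, &stats) != 0) {
                printf("failed to read rx adapter stats\n");
                return;
        }

        /* retries >= polls suggests enqueues into the eventdev are being
         * backpressured, i.e. the scheduler is not draining fast enough */
        printf("rx_poll_count=%" PRIu64 " rx_enq_retry=%" PRIu64 "\n",
               stats.rx_poll_count, stats.rx_enq_retry);
}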
>>
>> Our current app uses the standard testpmd style where each core does
>> rx->work->tx, and the packets are spread across the cores using RSS in the
>> ethernet device. This works fine provided the traffic is diverse. Elephant
>> flows are a problem though, so we'd like the option of distributing the
>> packets in the way that eventdev_pipeline -p does (yes, I understand the
>> implications for reordering). So eventdev looks interesting, and I was
>> trying to get an idea of what the performance implication would be in
>> using eventdev.
>
> Yes, valid use case, you're on the right track I suppose. Have you thought
> about what CPU budget you're willing to spend to get the functionality of
> dynamically spreading (elephant or smaller) flows across cores?
>