From: Anthony Hart
To: "Van Haaren, Harry"
Cc: "users@dpdk.org"
Date: Thu, 9 Aug 2018 11:56:07 -0400
Message-Id: <9E092979-55BD-4AA8-9785-4D660E84105F@domainhart.com>
References: <3338AA01-BB97-494F-B39C-6A510D085C79@domainhart.com>
Subject: Re: [dpdk-users] eventdev performance
List-Id: DPDK usage discussions

Hi Harry,

Thanks for the reply; please see responses inline.

> On Aug 7, 2018, at 4:34 AM, Van Haaren, Harry wrote:
>
> Hi Tony,
>
>> -----Original Message-----
>> From: users [mailto:users-bounces@dpdk.org] On Behalf Of Anthony Hart
>> Sent: Sunday, August 5, 2018 8:03 PM
>> To: users@dpdk.org
>> Subject: [dpdk-users] eventdev performance
>>
>> I've been doing some performance measurements with the eventdev_pipeline
>> example application (to see how the eventdev library performs - DPDK 18.05)
>> and I'm looking for some help in determining where the bottlenecks are in
>> my testing.
>
> If you have the "perf top" tool available, it is very useful in printing
> statistics of where CPU cycles are spent during runtime. I use it regularly
> to identify bottlenecks in the code for specific lcores.

Yes, I have perf; if there is something you'd like to see I can post it.

>> I have 1 Rx, 1 Tx, 1 Scheduler and N worker cores (1 sw_event0 device).
>> In this configuration performance tops out with 3 workers (6 cores total),
>> and adding more workers actually causes a reduction in throughput. In my
>> setup this is about 12 Mpps. The same setup running testpmd will reach
>> >25 Mpps using only 1 core.
>
> Raw forwarding of a packet is less work than forwarding and load-balancing
> across multiple cores. More work means more CPU cycles spent per packet,
> hence fewer Mpps.

OK.

>> This is the eventdev command line:
>> eventdev_pipeline -l 0,1-6 -w0000:02:00.0 --vdev event_sw0 -- -r2 -t4 -e8 -w70 -s1 -n0 -c128 -W0 -D
>
> The -W0 indicates to perform zero cycles of work on each worker core.
> This makes each of the 3 worker cores very fast in returning work to the
> scheduler core, and puts extra pressure on the scheduler. Note that in a
> real-world use case you presumably want to do work on each of the worker
> cores, so the command above (while valid for understanding how it works,
> and the performance of certain things) is not expected to be used in
> production.
>
> I'm not sure how familiar you are with CPU caches, but it is worth
> understanding that reading data "locally" from L1 or L2 cache is very fast
> compared to communicating with another core.
>
> Given that with -W0 the worker cores are very fast, the scheduler can
> rarely read data locally - it always has to communicate with other cores.
>
> Try adding -W1000 or so, and perhaps 4 worker cores. The 1000 cycles of
> work per event mimic doing actual work on each event.

Adding work with -W reduces performance.

I modified eventdev_pipeline to print the contents of
rte_event_eth_rx_adapter_stats for the device; in particular I print the
rx_enq_retry and rx_poll_count values for the receive thread. Once I get to
a load level where packets are dropped, I see that the number of retries
equals or exceeds the poll count (and as I increase the load, the retries
exceed the poll count).
I think this indicates that the scheduler is not keeping up. That could be
(I assume) because the workers are not consuming fast enough. However, if I
increase the number of workers, the ratio of retries to poll count (in the
rx thread) goes up; for example, after adding 4 more workers the
retries:polls ratio becomes 5:1.

This seems to indicate that the scheduler is the bottleneck.

>
>
>> This is the testpmd command line:
>> testpmd -w0000:02:00.0 -l 0,1 -- -i --nb-core 1 --numa --rxq 1 --txq 1 --port-topology=loop
>>
>>
>> I'm guessing that it's either the RX or the scheduler that's the
>> bottleneck in my eventdev_pipeline setup.
>
> Given that you state testpmd is capable of forwarding at >25 Mpps on your
> platform, it is safe to rule out RX, since testpmd is performing the RX in
> that forwarding workload.
>
> Which leaves the scheduler - and indeed the scheduler is probably the
> limiting factor in this case.

Yes, it seems so.

>
>
>> So I first tried to use 2 cores for RX (-r6); performance went down. It
>> seems that configuring 2 RX cores still only sets up 1 h/w receive ring,
>> and access to that one ring is alternated between the two cores? So that
>> doesn't help.
>
> Correct - it is invalid to use two CPU cores on a single RX queue without
> some form of serialization (otherwise it causes race conditions). The
> eventdev_pipeline sample app helpfully provides that - but there is a
> performance impact in doing so. Using two RX threads on a single RX queue
> is generally not recommended.
>
>
>> Next, I could use 2 scheduler cores, but how does that work - do they
>> again alternate? In any case, throughput is reduced by 50% in that test.
>
> Yes, for the same reason. The event_sw0 PMD does not allow multiple
> threads to run it at the same time, and hence the serialization is in
> place to ensure that the results are valid.
>
>
>> thanks for any insights,
>> tony
>
> Try the suggestion above of adding work to the worker cores - this should
> "balance out" the current scheduling bottleneck, and place some more on
> each worker core. Try values of 500, 750, 1000, 1500 and 2500 or so.
>
> Apart from that, I should try to understand your intended use better.
> Is this an academic investigation into the performance, or do you have
> specific goals in mind? Is dynamic load-balancing as the event_sw provides
> required, or would a simpler (and hence possibly more performant) method
> suffice?

Our current app uses the standard testpmd style where each core does
rx -> work -> tx, and the packets are spread across the cores using RSS in
the ethernet device. This works fine provided the traffic is diverse.
Elephant flows are a problem, though, so we'd like the option of
distributing the packets the way eventdev_pipeline -p does (yes, I
understand the implications for reordering). So eventdev looks interesting,
and I was trying to get an idea of what the performance implication would
be in using eventdev.

> Regards, -Harry