From: Anthony Hart
To: "Van Haaren, Harry"
Cc: "users@dpdk.org"
Date: Thu, 9 Aug 2018 11:56:07 -0400
Message-Id: <9E092979-55BD-4AA8-9785-4D660E84105F@domainhart.com>
References: <3338AA01-BB97-494F-B39C-6A510D085C79@domainhart.com>
Subject: Re: [dpdk-users] eventdev performance
List-Id: DPDK usage discussions

Hi Harry,

Thanks for the reply; please see responses inline.

> On Aug 7, 2018, at 4:34 AM, Van Haaren, Harry wrote:
>
> Hi Tony,
>
>> -----Original Message-----
>> From: users [mailto:users-bounces@dpdk.org] On Behalf Of Anthony Hart
>> Sent: Sunday, August 5, 2018 8:03 PM
>> To: users@dpdk.org
>> Subject: [dpdk-users] eventdev performance
>>
>> I've been doing some performance measurements with the eventdev_pipeline
>> example application (to see how the eventdev library performs - DPDK 18.05)
>> and I'm looking for some help in determining where the bottlenecks are in
>> my testing.
>
> If you have the "perf top" tool available, it is very useful in printing
> statistics of where CPU cycles are spent during runtime. I use it regularly
> to identify bottlenecks in the code for specific lcores.

Yes, I have perf; if there is something you'd like to see I can post it.

>> I have 1 Rx, 1 Tx, 1 Scheduler and N worker cores (1 sw_event0 device).
>> In this configuration performance tops out with 3 workers (6 cores total),
>> and adding more workers actually causes a reduction in throughput. In my
>> setup this is about 12 Mpps. The same setup running testpmd will reach
>> >25 Mpps using only 1 core.
>
> Raw forwarding of a packet is less work than forwarding and load-balancing
> across multiple cores. More work means more CPU cycles spent per packet,
> hence fewer Mpps.

OK.

>> This is the eventdev command line:
>> eventdev_pipeline -l 0,1-6 -w0000:02:00.0 --vdev event_sw0 -- -r2 -t4 -e8 -w70 -s1 -n0 -c128 -W0 -D
>
> The -W0 indicates to perform zero cycles of work on each worker core.
> This makes each of the 3 worker cores very fast in returning work to the
> scheduler core, and puts extra pressure on the scheduler. Note that in a
> real-world use case you presumably want to do work on each of the worker
> cores, so the command above (while valid for understanding how it works,
> and the performance of certain things) is not expected to be used in
> production.
>
> I'm not sure how familiar you are with CPU caches, but it is worth
> understanding that reading data "locally" from L1 or L2 cache is very fast
> compared to communicating with another core.
>
> Given that with -W0 the worker cores are very fast, the scheduler can
> rarely read data locally - it always has to communicate with other cores.
>
> Try adding -W1000 or so, and perhaps 4 worker cores. The 1000 cycles of
> work per event mimic doing actual work on each event.

Adding work with -W reduces performance.

I modified eventdev_pipeline to print the contents of
rte_event_eth_rx_adapter_stats for the device; in particular I print the
rx_enq_retry and rx_poll_count values for the receive thread. Once I get to
a load level where packets are dropped, I see that the number of retries
equals or exceeds the poll count (and as I increase the load, the retries
exceed the poll count).
I think this indicates that the scheduler is not keeping up. That could be
(I assume) because the workers are not consuming fast enough. However, if I
increase the number of workers, the ratio of retries to poll count (in the
rx thread) goes up; for example, after adding 4 more workers the
retries:polls ratio becomes 5:1.

This seems to indicate that the scheduler is the bottleneck.

>
>
>> This is the testpmd command line:
>> testpmd -w0000:02:00.0 -l 0,1 -- -i --nb-core 1 --numa --rxq 1 --txq 1 --port-topology=loop
>>
>>
>> I'm guessing that it's either the RX or the scheduler that's the
>> bottleneck in my eventdev_pipeline setup.
>
> Given that you state testpmd is capable of forwarding at >25 Mpps on your
> platform, it is safe to rule out RX, since testpmd is performing the RX in
> that forwarding workload.
>
> Which leaves the scheduler - and indeed the scheduler is probably the
> limiting factor in this case.

Yes, it seems so.

>
>
>> So I first tried to use 2 cores for RX (-r6); performance went down. It
>> seems that configuring 2 RX cores still only sets up 1 h/w receive ring,
>> and access to that one ring is alternated between the two cores? So that
>> doesn't help.
>
> Correct - it is invalid to use two CPU cores on a single RX queue without
> some form of serialization (otherwise it causes race conditions). The
> eventdev_pipeline sample app helpfully provides that - but there is a
> performance impact in doing so. Using two RX threads on a single RX queue
> is generally not recommended.
>
>
>> Next, I could use 2 scheduler cores, but how does that work - do they
>> again alternate? In any case, throughput is reduced by 50% in that test.
>
> Yes, for the same reason. The event_sw0 PMD does not allow multiple
> threads to run it at the same time, and hence the serialization is in
> place to ensure that the results are valid.
>
>
>> thanks for any insights,
>> tony
>
> Try the suggestion above of adding work to the worker cores - this should
> "balance out" the current scheduling bottleneck, and place some more on
> each worker core. Try values of 500, 750, 1000, 1500 and 2500 or so.
>
> Apart from that, I should try to understand your intended use better.
> Is this an academic investigation into the performance, or do you have
> specific goals in mind? Is dynamic load-balancing as the event_sw provides
> required, or would a simpler (and hence possibly more performant) method
> suffice?

Our current app uses the standard testpmd style where each core does
rx -> work -> tx, and the packets are spread across the cores using RSS in
the ethernet device. This works fine provided the traffic is diverse.
Elephant flows are a problem, though, so we'd like the option of
distributing the packets the way eventdev_pipeline -p does (yes, I
understand the implications for reordering). So eventdev looks interesting,
and I was trying to get an idea of what the performance implication would
be in using eventdev.

> Regards, -Harry