From: Anthony Hart <ahart@domainhart.com>
To: "Van Haaren, Harry"
Cc: users@dpdk.org
Date: Mon, 20 Aug 2018 12:05:00 -0400
Subject: Re: [dpdk-users] eventdev performance

Hi Harry,

Here are two example command lines I'm using, the first with 1 worker and the second with 3 workers.

./examples/eventdev_pipeline/build/eventdev_pipeline -l 0-4 -w0000:02:00.0 --vdev event_sw0 -- -r2 -t4 -e8 -w 10 -s1 -n0 -c128 -W0 -D

./examples/eventdev_pipeline/build/eventdev_pipeline -l 0-6 -w0000:02:00.0 --vdev event_sw0 -- -r2 -t4 -e8 -s1 -n0 -c128 -W0 -D -w70

In terms of performance, one issue is the (apparent) bottleneck in the scheduler; the other issue is that I now have 3 cores (rx, tx and schedule) that are not running my mission path. Is there any way of scaling up the scheduler performance, i.e. adding more cores to the scheduling process?
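(My understanding, from your earlier reply below, is that the event_sw scheduling logic is a single MT-unsafe service, so one dedicated lcore ends up driving it in a loop roughly like the sketch here. This is only an illustration of how I picture it, not the sample app's actual code; dev_id, keep_running and scheduler_loop are placeholder names.)

#include <rte_eventdev.h>
#include <rte_service.h>

static volatile int keep_running = 1;   /* placeholder run flag */

/* Sketch: drive the sw eventdev's scheduling service from one lcore. */
static int
scheduler_loop(void *arg)
{
        uint8_t dev_id = *(const uint8_t *)arg;
        uint32_t service_id;

        /* the sw PMD registers its scheduling logic as a service */
        if (rte_event_dev_service_id_get(dev_id, &service_id) != 0)
                return -1;

        while (keep_running)
                /* MT-unsafe service: only one thread may run it at a time */
                rte_service_run_iter_on_app_lcore(service_id, 1);

        return 0;
}

If that picture is right, it seems like only a faster core (or less scheduling work per event) helps, rather than more scheduler cores.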
many thanks
tony

> On Aug 15, 2018, at 12:04 PM, Van Haaren, Harry wrote:
>
>> From: Anthony Hart [mailto:ahart@domainhart.com]
>> Sent: Thursday, August 9, 2018 4:56 PM
>> To: Van Haaren, Harry
>> Cc: users@dpdk.org
>> Subject: Re: [dpdk-users] eventdev performance
>>
>> Hi Harry,
>> Thanks for the reply, please see responses inline.
>>
>>> On Aug 7, 2018, at 4:34 AM, Van Haaren, Harry wrote:
>>>
>>> Hi Tony,
>>>
>>>> -----Original Message-----
>>>> From: users [mailto:users-bounces@dpdk.org] On Behalf Of Anthony Hart
>>>> Sent: Sunday, August 5, 2018 8:03 PM
>>>> To: users@dpdk.org
>>>> Subject: [dpdk-users] eventdev performance
>>>>
>>>> I've been doing some performance measurements with the eventdev_pipeline
>>>> example application (to see how the eventdev library performs - dpdk 18.05)
>>>> and I'm looking for some help in determining where the bottlenecks are in
>>>> my testing.
>>>
>>> If you have the "perf top" tool available, it is very useful in printing
>>> statistics of where CPU cycles are spent during runtime. I use it regularly
>>> to identify bottlenecks in the code for specific lcores.
>>
>> Yes, I have perf; if there is something you'd like to see I can post it.
>
> I'll check the rest of your email first. Generally I use perf to check whether
> the cycles being spent on each core are where I expect them to be. In this
> case, it might be useful to look at the scheduler core and see where it is
> spending its time.
>
>>>> I have 1 Rx, 1 Tx, 1 Scheduler and N worker cores (1 sw_event0 device). In
>>>> this configuration performance tops out with 3 workers (6 cores total) and
>>>> adding more workers actually causes a reduction in throughput. In my setup
>>>> this is about 12Mpps. The same setup running testpmd will reach >25Mpps
>>>> using only 1 core.
>>>
>>> Raw forwarding of a packet is less work than forwarding and load-balancing
>>> across multiple cores. More work means more CPU cycles spent per packet,
>>> hence fewer mpps.
>>
>> ok.
>>
>>>> This is the eventdev command line.
>>>> eventdev_pipeline -l 0,1-6 -w0000:02:00.0 --vdev event_sw0 -- -r2 -t4 -e8
>>>> -w70 -s1 -n0 -c128 -W0 -D
>>>
>>> The -W0 indicates to perform zero cycles of work on each worker core.
>>> This makes each of the 3 worker cores very fast in returning work to the
>>> scheduler core, and puts extra pressure on the scheduler. Note that in a
>>> real-world use-case you presumably want to do work on each of the worker
>>> cores, so the command above (while valid for understanding how it works,
>>> and the performance of certain things) is not expected to be used in
>>> production.
>>>
>>> I'm not sure how familiar you are with CPU caches, but it is worth
>>> understanding that reading data "locally" from L1 or L2 cache is very fast
>>> compared to communicating with another core.
>>>
>>> Given that with -W0 the worker cores are very fast, the scheduler can rarely
>>> read data locally - it always has to communicate with other cores.
>>>
>>> Try adding -W1000 or so, and perhaps 4 worker cores. The 1000 cycles of work
>>> per event mimic doing actual work on each event.
>>
>> Adding work with -W reduces performance.
>
> OK - that means that the worker cores are at least part of the bottleneck.
> If they were very idle, adding some work to them would not have changed
> the performance.
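Just to be sure I understand what -W mimics: I take it "N cycles of work" per event is essentially a busy-wait on the TSC, roughly like the sketch below? (do_fake_work() is only my illustration, not necessarily how the sample app implements it.)

#include <rte_cycles.h>
#include <rte_pause.h>

/* Sketch: burn roughly 'cycles' CPU cycles per event, the way a -W
 * option might simulate real per-packet work. */
static inline void
do_fake_work(uint64_t cycles)
{
        uint64_t start = rte_rdtsc();

        while (rte_rdtsc() - start < cycles)
                rte_pause();    /* yield to the hyper-thread sibling */
}

With -W0 that loop is skipped entirely, so the workers hand events straight back and the scheduler never gets to work out of warm cache, which matches your explanation above.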
>
>> I modified eventdev_pipeline to print the contents of
>> rte_event_eth_rx_adapter_stats for the device. In particular I print the
>> rx_enq_retry and rx_poll_count values for the receive thread. Once I get
>> to a load level where packets are dropped I see that the number of retries
>> equals or exceeds the poll count (as I increase the load the retries exceed
>> the poll count).
>>
>> I think this indicates that the Scheduler is not keeping up. That could be
>> (I assume) because the workers are not consuming fast enough. However, if I
>> increase the number of workers then the ratio of retries to poll_count (in
>> the rx thread) goes up; for example, adding 4 more workers, the retries:poll
>> ratio becomes 5:1.
>>
>> Seems like this is indicating that the Scheduler is the bottleneck?
>
> So I gather you have prototyped the pipeline you want to run with the
> eventdev_pipeline sample app? Would you share the command line being
> used with the eventdev_pipeline sample app, and I can try to reproduce /
> understand.
>
> One of the easiest mistakes (that I make regularly :) is letting the
> RX/TX/Sched cores overlap, which causes excessive work to be performed on
> one thread, reducing overall performance.
>
>>>> This is the testpmd command line.
>>>> testpmd -w0000:02:00.0 -l 0,1 -- -i --nb-core 1 --numa --rxq 1 --txq 1
>>>> --port-topology=loop
>>>>
>>>> I'm guessing that it's either the RX or Sched that's the bottleneck in my
>>>> eventdev_pipeline setup.
>>>
>>> Given that you state testpmd is capable of forwarding at >25 mpps on your
>>> platform it is safe to rule out RX, since testpmd is performing the RX in
>>> that forwarding workload.
>>>
>>> Which leaves the scheduler - and indeed the scheduler is probably what is
>>> the limiting factor in this case.
>>
>> yes, seems so.
>>
>>>> So I first tried to use 2 cores for RX (-r6); performance went down. It
>>>> seems that configuring 2 RX cores still only sets up 1 h/w receive ring
>>>> and access to that one ring is alternated between the two cores? So that
>>>> doesn't help.
>>>
>>> Correct - it is invalid to use two CPU cores on a single RX queue without
>>> some form of serialization (otherwise it causes race-conditions). The
>>> eventdev_pipeline sample app helpfully provides that - but there is a
>>> performance impact in doing so. Using two RX threads on a single RX queue
>>> is generally not recommended.
>>>
>>>> Next, I could use 2 scheduler cores, but how does that work, do they again
>>>> alternate? In any case throughput is reduced by 50% in that test.
>>>
>>> Yes, for the same reason. The event_sw0 PMD does not allow multiple threads
>>> to run it at the same time, and hence the serialization is in place to
>>> ensure that the results are valid.
>>>
>>>> thanks for any insights,
>>>> tony
>>>
>>> Try the suggestion above of adding work to the worker cores - this should
>>> "balance out" the current scheduling bottleneck, and place some more on
>>> each worker core. Try values of 500, 750, 1000, 1500 and 2500 or so.
>>>
>>> Apart from that, I should try to understand your intended use better.
>>> Is this an academic investigation into the performance, or do you have
>>> specific goals in mind? Is dynamic load-balancing as the event_sw provides
>>> required, or would a simpler (and hence possibly more performant) method
>>> suffice?
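Regarding the Rx adapter counters I mentioned further up: the instrumentation I added is roughly the sketch below (print_rx_adapter_stats() is my own helper, and the adapter id of 0 is an assumption of my setup).

#include <inttypes.h>
#include <stdio.h>
#include <rte_event_eth_rx_adapter.h>

/* Sketch: print how often the Rx adapter had to retry enqueuing into
 * the eventdev versus how often it polled the NIC. */
static void
print_rx_adapter_stats(void)
{
        struct rte_event_eth_rx_adapter_stats stats;

        if (rte_event_eth_rx_adapter_stats_get(0, &stats) != 0) {
                printf("failed to read rx adapter stats\n");
                return;
        }

        /* retries >= polls suggests enqueues into the eventdev are being
         * backpressured, i.e. the scheduler is not draining fast enough */
        printf("rx_poll_count=%" PRIu64 " rx_enq_retry=%" PRIu64 "\n",
               stats.rx_poll_count, stats.rx_enq_retry);
}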
>>
>> Our current app uses the standard testpmd style where each core does
>> rx->work->tx, and the packets are spread across the cores using RSS in the
>> ethernet device. This works fine provided the traffic is diverse. Elephant
>> flows are a problem though, so we'd like the option of distributing the
>> packets in the way that eventdev_pipeline -p does (yes, I understand the
>> implications for reordering). So eventdev looks interesting, and I was
>> trying to get an idea of what the performance implication would be in
>> using eventdev.
>
> Yes, valid use case, you're on the right track I suppose. Have you thought
> about what CPU budget you're willing to spend to get the functionality of
> dynamically spreading (elephant or smaller) flows across cores?
>