From: Mattias Rönnblom
To: Venky Venkatesh, dev@dpdk.org
Date: Wed, 28 Nov 2018 17:55:41 +0100
Subject: Re: [dpdk-dev] Application used for DSW event_dev performance testing

On 2018-11-27 23:33, Venky Venkatesh wrote:
> As you can see, the DSW overhead dominates the scene and very little real work is getting done. Is there some configuration or tuning to be done to get the sort of performance you are seeing with multiple cores?

I can't explain the behavior you are seeing based on the information you have supplied.

Attached is a small DSW throughput test program that I thought might help you find the issue. It works much like the pipeline simulator I used when developing the scheduler, but it's a lot simpler. Remember to supply "--vdev=event_dsw0".

I ran it on my 12-core Skylake desktop (@ 2.9 GHz, turbo disabled). With zero work and one stage, I get ~640 Mevent/s.

For the first few stages you add, you'll see a drop in performance. For example, with 3 stages, you are at ~310 Mevent/s.

If you increase DSW_MAX_PORT_OUT_BUFFER and DSW_MAX_PORT_OPS_PER_BG_TASK, you will see improvements in efficiency on high-core-count machines. On my system, the above goes to 675 Mevent/s for a 1-stage pipeline and 460 Mevent/s for a 3-stage pipeline, if I apply the following changes to dsw_evdev.h:

-#define DSW_MAX_PORT_OUT_BUFFER (32)
+#define DSW_MAX_PORT_OUT_BUFFER (64)

-#define DSW_MAX_PORT_OPS_PER_BG_TASK (128)
+#define DSW_MAX_PORT_OPS_PER_BG_TASK (512)

With 500 clock cycles of dummy work, the per-event overhead is ~16 TSC clock cycles per stage and event (i.e. per scheduled event; one enqueue plus one dequeue), if my quick-and-dirty benchmark program does the math correctly. This figure also includes the overhead of the benchmark program itself; the overhead with a real application will be higher.
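
Since the list's MIME filter strips attachments, here is a rough sketch of what the per-stage worker loop of such a throughput test could look like. This is not the attached program: NUM_STAGES, DUMMY_WORK_CYCLES, BURST_SIZE and worker_loop() are illustrative assumptions, and the usual EAL/eventdev setup (rte_eal_init(), rte_event_dev_configure(), queue and port setup, lcore launch) is omitted.

/*
 * Illustrative sketch only, not the attached benchmark. Events travel
 * through NUM_STAGES queues; each worker dequeues a burst, burns some
 * dummy cycles per event, forwards the event to the next queue, and
 * releases it after the last stage.
 */
#include <rte_cycles.h>
#include <rte_eventdev.h>
#include <rte_pause.h>

#define NUM_STAGES 3            /* hypothetical pipeline depth */
#define DUMMY_WORK_CYCLES 500   /* per-event busy-wait, as in the mail */
#define BURST_SIZE 32

static void
dummy_work(uint64_t cycles)
{
	uint64_t deadline = rte_rdtsc() + cycles;

	while (rte_rdtsc() < deadline)
		rte_pause();
}

static void
worker_loop(uint8_t dev_id, uint8_t port_id)
{
	struct rte_event evs[BURST_SIZE];

	for (;;) {
		uint16_t nb = rte_event_dequeue_burst(dev_id, port_id, evs,
						      BURST_SIZE, 0);
		uint16_t i;

		for (i = 0; i < nb; i++) {
			dummy_work(DUMMY_WORK_CYCLES);

			if (evs[i].queue_id + 1 < NUM_STAGES) {
				/* forward to the next pipeline stage */
				evs[i].queue_id++;
				evs[i].op = RTE_EVENT_OP_FORWARD;
			} else {
				/* last stage: drop the event */
				evs[i].op = RTE_EVENT_OP_RELEASE;
			}
		}

		if (nb > 0) {
			uint16_t sent = 0;

			/* retry until the whole burst is enqueued */
			do {
				sent += rte_event_enqueue_burst(dev_id,
								port_id,
								evs + sent,
								nb - sent);
			} while (sent < nb);
		}
	}
}

A loop along these lines would run on every worker lcore, with the application launched with "--vdev=event_dsw0" so that the DSW PMD provides the event device.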