From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 17 Jan 2023 13:00:14 +0000
From: Bruce Richardson
To: Cheng Jiang
Subject: Re: [PATCH v2] app/dma-perf: introduce dma-perf application
References: <20221220010619.31829-1-cheng1.jiang@intel.com>
 <20230117015623.17606-1-cheng1.jiang@intel.com>
In-Reply-To: <20230117015623.17606-1-cheng1.jiang@intel.com>
List-Id: DPDK patches and discussions
Errors-To: dev-bounces@dpdk.org

On Tue, Jan 17, 2023 at 01:56:23AM +0000, Cheng Jiang wrote:
> There are many high-performance DMA devices supported in DPDK now, and
> these DMA devices can also be integrated into other modules of DPDK as
> accelerators, such as Vhost. Before integrating DMA into applications,
> developers need to know the performance of these DMA devices in various
> scenarios and the performance of CPUs in the same scenario, such as
> different buffer lengths. Only in this way can we know the target
> performance of the application accelerated by using them.
> This patch introduces a high-performance testing tool, which supports
> comparing the performance of CPU and DMA in different scenarios
> automatically with a pre-set config file. Memory copy performance tests
> are supported for now.
>
> Signed-off-by: Cheng Jiang
> Signed-off-by: Jiayu Hu
> Signed-off-by: Yuan Wang
> Acked-by: Morten Brørup
> ---
> v2: fixed some CI issues.

Some first review comments inline below. More will likely follow as I
review it further and try testing it out.

/Bruce

>
>  app/meson.build               |   1 +
>  app/test-dma-perf/benchmark.c | 539 ++++++++++++++++++++++++++++++++++
>  app/test-dma-perf/benchmark.h |  12 +
>  app/test-dma-perf/config.ini  |  61 ++++
>  app/test-dma-perf/main.c      | 434 +++++++++++++++++++++++++++
>  app/test-dma-perf/main.h      |  53 ++++
>  app/test-dma-perf/meson.build |  22 ++
>  7 files changed, 1122 insertions(+)
>  create mode 100644 app/test-dma-perf/benchmark.c
>  create mode 100644 app/test-dma-perf/benchmark.h
>  create mode 100644 app/test-dma-perf/config.ini
>  create mode 100644 app/test-dma-perf/main.c
>  create mode 100644 app/test-dma-perf/main.h
>  create mode 100644 app/test-dma-perf/meson.build
>
> diff --git a/app/meson.build b/app/meson.build
> index e32ea4bd5c..a060ad2725 100644
> --- a/app/meson.build
> +++ b/app/meson.build
> @@ -28,6 +28,7 @@ apps = [
>          'test-regex',
>          'test-sad',
>          'test-security-perf',
> +        'test-dma-perf',
>  ]

Lists in DPDK are always alphabetical when no other order is required,
therefore this new app should be further up the list, after
"test-crypto-perf".
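That is, the hunk would look something like the below - the surrounding
entries are from memory of the current list, so please double-check them:

	         'test-compress-perf',
	         'test-crypto-perf',
	+        'test-dma-perf',
	         'test-eventdev',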
>
>  default_cflags = machine_args + ['-DALLOW_EXPERIMENTAL_API']
> diff --git a/app/test-dma-perf/benchmark.c b/app/test-dma-perf/benchmark.c
> new file mode 100644
> index 0000000000..1cb5b0b291
> --- /dev/null
> +++ b/app/test-dma-perf/benchmark.c
> @@ -0,0 +1,539 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2022 Intel Corporation
> + */
> +
> +#include
> +
> +#include
> +#include
> +#include
> +#include
> +#include
> +
> +#include "main.h"
> +#include "benchmark.h"
> +
> +
> +#define MAX_DMA_CPL_NB 255
> +
> +#define CSV_LINE_DMA_FMT "Scenario %u,%u,%u,%u,%u,%u,%" PRIu64 ",%.3lf,%" PRIu64 "\n"
> +#define CSV_LINE_CPU_FMT "Scenario %u,%u,NA,%u,%u,%u,%" PRIu64 ",%.3lf,%" PRIu64 "\n"
> +
> +struct lcore_params {
> +	uint16_t dev_id;
> +	uint32_t nr_buf;
> +	uint16_t kick_batch;
> +	uint32_t buf_size;
> +	uint32_t repeat_times;
> +	uint16_t mpool_iter_step;
> +	struct rte_mbuf **srcs;
> +	struct rte_mbuf **dsts;
> +	uint8_t scenario_id;
> +};
> +
> +struct buf_info {
> +	struct rte_mbuf **array;
> +	uint32_t nr_buf;
> +	uint32_t buf_size;
> +};
> +
> +static struct rte_mempool *src_pool;
> +static struct rte_mempool *dst_pool;
> +
> +uint16_t dmadev_ids[MAX_WORKER_NB];
> +uint32_t nb_dmadevs;
> +
> +#define PRINT_ERR(...) print_err(__func__, __LINE__, __VA_ARGS__)
> +
> +static inline int
> +__rte_format_printf(3, 4)
> +print_err(const char *func, int lineno, const char *format, ...)
> +{
> +	va_list ap;
> +	int ret;
> +
> +	ret = fprintf(stderr, "In %s:%d - ", func, lineno);
> +	va_start(ap, format);
> +	ret += vfprintf(stderr, format, ap);
> +	va_end(ap);
> +
> +	return ret;
> +}
> +
> +static inline void
> +calc_result(struct lcore_params *p, uint64_t cp_cycle_sum, double time_sec,
> +		uint32_t repeat_times, uint32_t *memory, uint64_t *ave_cycle,
> +		float *bandwidth, uint64_t *ops)
> +{
> +	*memory = (p->buf_size * p->nr_buf * 2) / (1024 * 1024);
> +	*ave_cycle = cp_cycle_sum / (p->repeat_times * p->nr_buf);
> +	*bandwidth = p->buf_size * 8 * rte_get_timer_hz() / (*ave_cycle * 1000 * 1000 * 1000.0);
> +	*ops = (double)p->nr_buf * repeat_times / time_sec;
> +}
> +
> +static void
> +output_result(uint8_t scenario_id, uint32_t lcore_id, uint16_t dev_id, uint64_t ave_cycle,
> +		uint32_t buf_size, uint32_t nr_buf, uint32_t memory,
> +		float bandwidth, uint64_t ops, bool is_dma)
> +{
> +	if (is_dma)
> +		printf("lcore %u, DMA %u:\n"
> +			"average cycles: %" PRIu64 ","
> +			" buffer size: %u, nr_buf: %u,"
> +			" memory: %uMB, frequency: %" PRIu64 ".\n",
> +			lcore_id,
> +			dev_id,
> +			ave_cycle,
> +			buf_size,
> +			nr_buf,
> +			memory,
> +			rte_get_timer_hz());

Longer lines are allowed for strings, so you can merge each line of
output to a single line, which will improve readability. Also, to
shorten the code, there is no reason each parameter needs to go on its
own line.

> +	else
> +		printf("lcore %u\n"
> +			"average cycles: %" PRIu64 ","
> +			" buffer size: %u, nr_buf: %u,"
> +			" memory: %uMB, frequency: %" PRIu64 ".\n",
> +			lcore_id,
> +			ave_cycle,
> +			buf_size,
> +			nr_buf,
> +			memory,
> +			rte_get_timer_hz());

Suggestion: rather than duplicating the whole output, only the first
line needs to change based on SW vs HW copies. How about:

	if (is_dma)
		printf("lcore %u, DMA %u\n", lcore_id, dev_id);
	else
		printf("lcore %u\n", lcore_id);
	printf("average cycles: ...", ...);

> +
> +	printf("Average bandwidth: %.3lfGbps, OPS: %" PRIu64 "\n", bandwidth, ops);
> +
> +	if (is_dma)
> +		snprintf(output_str[lcore_id], MAX_OUTPUT_STR_LEN,
> +			CSV_LINE_DMA_FMT,
> +			scenario_id, lcore_id, dev_id, buf_size,
> +			nr_buf, memory, ave_cycle, bandwidth, ops);
> +	else
> +		snprintf(output_str[lcore_id], MAX_OUTPUT_STR_LEN,
> +			CSV_LINE_CPU_FMT,
> +			scenario_id, lcore_id, buf_size,
> +			nr_buf, memory, ave_cycle, bandwidth, ops);
> +}
> +
> +static inline void
> +cache_flush_buf(void *arg)

For non-x86 builds, you probably need to mark "arg" as unused to avoid
compiler warnings.

Why is the parameter type given as a void pointer, when the type is
unconditionally cast below as "struct buf_info"? Void pointer type
should only be needed if you need to call this via a generic function
pointer.

> +{
> +#ifdef RTE_ARCH_X86_64
> +	char *data;
> +	char *addr;
> +	struct buf_info *info = arg;
> +	struct rte_mbuf **srcs = info->array;
> +	uint32_t i, k;
> +
> +	for (i = 0; i < info->nr_buf; i++) {
> +		data = rte_pktmbuf_mtod(srcs[i], char *);
> +		for (k = 0; k < info->buf_size / 64; k++) {
> +			addr = (k * 64 + data);
> +			__builtin_ia32_clflush(addr);
> +		}

The inner loop may be shorter by incrementing the loop variable by 64,
rather than dividing and then multiplying, since you can eliminate
variable "addr". It can also be made more readable with a variable
rename:

	for (offset = 0; offset < info->buf_size; offset += 64)
		__builtin_ia32_clflush(data + offset);

> +	}
> +#endif
> +}
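Putting those points together, a minimal sketch of what I have in mind -
untested, and keeping the x86-only behaviour of the original:

	static void
	cache_flush_buf(struct rte_mbuf **array, uint32_t nr_buf, uint32_t buf_size)
	{
	#ifdef RTE_ARCH_X86_64
		uint32_t i, offset;

		for (i = 0; i < nr_buf; i++) {
			char *data = rte_pktmbuf_mtod(array[i], char *);

			/* flush each 64-byte cache line of the buffer */
			for (offset = 0; offset < buf_size; offset += 64)
				__builtin_ia32_clflush(data + offset);
		}
	#else
		/* mark parameters as used to avoid warnings on non-x86 */
		RTE_SET_USED(array);
		RTE_SET_USED(nr_buf);
		RTE_SET_USED(buf_size);
	#endif
	}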
> +
> +/* Configuration of device. */
> +static void
> +configure_dmadev_queue(uint32_t dev_id, uint32_t ring_size)
> +{
> +	uint16_t vchan = 0;
> +	struct rte_dma_info info;
> +	struct rte_dma_conf dev_config = { .nb_vchans = 1 };
> +	struct rte_dma_vchan_conf qconf = {
> +		.direction = RTE_DMA_DIR_MEM_TO_MEM,
> +		.nb_desc = ring_size
> +	};
> +
> +	if (rte_dma_configure(dev_id, &dev_config) != 0)
> +		rte_exit(EXIT_FAILURE, "Error with rte_dma_configure()\n");
> +
> +	if (rte_dma_vchan_setup(dev_id, vchan, &qconf) != 0) {
> +		printf("Error with queue configuration\n");
> +		rte_panic();
> +	}
> +

Inconsistency here - and below too. Either use rte_exit on failure or
use rte_panic, but don't mix them. Panic seems a little severe, so I
suggest just using rte_exit() in all cases.

> +	rte_dma_info_get(dev_id, &info);
> +	if (info.nb_vchans != 1) {
> +		printf("Error, no configured queues reported on device id %u\n", dev_id);
> +		rte_panic();
> +	}
> +	if (rte_dma_start(dev_id) != 0)
> +		rte_exit(EXIT_FAILURE, "Error with rte_dma_start()\n");
> +}
> +
> +static int
> +config_dmadevs(uint32_t nb_workers, uint32_t ring_size)
> +{
> +	int16_t dev_id = rte_dma_next_dev(0);
> +	uint32_t i;
> +
> +	nb_dmadevs = 0;
> +
> +	for (i = 0; i < nb_workers; i++) {
> +		if (dev_id == -1)
> +			goto end;
> +
> +		dmadev_ids[i] = dev_id;
> +		configure_dmadev_queue(dmadev_ids[i], ring_size);
> +		dev_id = rte_dma_next_dev(dev_id + 1);
> +		++nb_dmadevs;

Very minor nit, but I'd suggest swapping these last two lines,
incrementing nb_dmadevs right after configuring the device, but before
finding a new one. It just makes more sense to me.

> +	}
> +
> +end:
> +	if (nb_dmadevs < nb_workers) {
> +		printf("Not enough dmadevs (%u) for all workers (%u).\n", nb_dmadevs, nb_workers);
> +		return -1;
> +	}
> +
> +	RTE_LOG(INFO, DMA, "Number of used dmadevs: %u.\n", nb_dmadevs);
> +
> +	return 0;
> +}
> +
> +static inline void
> +do_dma_mem_copy(uint16_t dev_id, uint32_t nr_buf, uint16_t kick_batch, uint32_t buf_size,
> +		uint16_t mpool_iter_step, struct rte_mbuf **srcs, struct rte_mbuf **dsts)
> +{
> +	int64_t async_cnt = 0;
> +	int nr_cpl = 0;
> +	uint32_t index;
> +	uint16_t offset;
> +	uint32_t i;
> +
> +	for (offset = 0; offset < mpool_iter_step; offset++) {
> +		for (i = 0; index = i * mpool_iter_step + offset, index < nr_buf; i++) {

Assignment in the condition part of a loop seems wrong. I suggest
reworking this to avoid it.

> +			if (unlikely(rte_dma_copy(dev_id,
> +					0,
> +					srcs[index]->buf_iova + srcs[index]->data_off,
> +					dsts[index]->buf_iova + dsts[index]->data_off,

rte_pktmbuf_iova() macro can be used here.

> +					buf_size,
> +					0) < 0)) {
> +				rte_dma_submit(dev_id, 0);
> +				while (rte_dma_burst_capacity(dev_id, 0) == 0) {
> +					nr_cpl = rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB,
> +							NULL, NULL);
> +					async_cnt -= nr_cpl;
> +				}
> +				if (rte_dma_copy(dev_id,
> +						0,
> +						srcs[index]->buf_iova + srcs[index]->data_off,
> +						dsts[index]->buf_iova + dsts[index]->data_off,
> +						buf_size,
> +						0) < 0) {
> +					printf("enqueue fail again at %u\n", index);
> +					printf("space:%d\n", rte_dma_burst_capacity(dev_id, 0));
> +					rte_exit(EXIT_FAILURE, "DMA enqueue failed\n");
> +				}
> +			}
> +			async_cnt++;
> +
> +			/**
> +			 * When '&' is used to wrap an index, mask must be a power of 2.
> +			 * That is, kick_batch must be 2^n.

I assume that is checked on input processing when parsing the config
file?

> +			 */
> +			if (unlikely((async_cnt % kick_batch) == 0)) {

This is an expected condition that will occur with repeatable frequency.
Therefore, unlikely is not really appropriate.
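To make the loop-condition and rte_pktmbuf_iova() suggestions concrete
(and dropping the unlikely), the enqueue loop could be reworked roughly
as below - an untested sketch, with the ring-full recovery path elided:

	for (i = 0; i * mpool_iter_step + offset < nr_buf; i++) {
		uint32_t idx = i * mpool_iter_step + offset;

		if (rte_dma_copy(dev_id, 0, rte_pktmbuf_iova(srcs[idx]),
				rte_pktmbuf_iova(dsts[idx]), buf_size, 0) < 0) {
			/* ring full: submit, poll for completions, then
			 * retry once, as in the original code */
		}
		async_cnt++;
		if ((async_cnt % kick_batch) == 0) {
			rte_dma_submit(dev_id, 0);
			/* poll to avoid the ring filling up */
			async_cnt -= rte_dma_completed(dev_id, 0,
					MAX_DMA_CPL_NB, NULL, NULL);
		}
	}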
> +				rte_dma_submit(dev_id, 0);
> +				/* add a poll to avoid ring full */
> +				nr_cpl = rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB, NULL, NULL);
> +				async_cnt -= nr_cpl;
> +			}
> +		}
> +
> +		rte_dma_submit(dev_id, 0);
> +		while (async_cnt > 0) {
> +			nr_cpl = rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB, NULL, NULL);
> +			async_cnt -= nr_cpl;
> +		}

Do we need a timeout here or in the loop above in case of errors that
cause us to not get all the elements back?
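For example, something along these lines would bound the wait - a sketch
only, and the one-second limit is an arbitrary choice:

	uint64_t deadline = rte_rdtsc() + rte_get_timer_hz(); /* approx. 1 sec */

	rte_dma_submit(dev_id, 0);
	while (async_cnt > 0 && rte_rdtsc() < deadline) {
		nr_cpl = rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB, NULL, NULL);
		async_cnt -= nr_cpl;
	}
	if (async_cnt > 0)
		PRINT_ERR("Timed out waiting for %" PRId64 " completions\n",
				async_cnt);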
> +	}
> +}
> +
> +static int
> +dma_mem_copy(void *p)
> +{

I see the call to this function within "remote_launch" uses a cast on
the function. I don't think that typecast should be necessary, but if
you keep it, you can avoid using the void pointer here and just mark the
input type as "struct lcore_params" directly.

> +	uint64_t ops;
> +	uint32_t memory;
> +	float bandwidth;
> +	double time_sec;
> +	uint32_t lcore_id = rte_lcore_id();
> +	struct lcore_params *params = (struct lcore_params *)p;
> +	uint32_t repeat_times = params->repeat_times;
> +	uint32_t buf_size = params->buf_size;
> +	uint16_t kick_batch = params->kick_batch;
> +	uint32_t lcore_nr_buf = params->nr_buf;
> +	uint16_t dev_id = params->dev_id;
> +	uint16_t mpool_iter_step = params->mpool_iter_step;
> +	struct rte_mbuf **srcs = params->srcs;
> +	struct rte_mbuf **dsts = params->dsts;
> +	uint64_t begin, end, total_cycles = 0, avg_cycles = 0;
> +	uint32_t r;
> +
> +	begin = rte_rdtsc();
> +
> +	for (r = 0; r < repeat_times; r++)
> +		do_dma_mem_copy(dev_id, lcore_nr_buf, kick_batch, buf_size,
> +			mpool_iter_step, srcs, dsts);
> +
> +	end = rte_rdtsc();
> +	total_cycles = end - begin;

You can do without "end" easily enough:

	total_cycles = rte_rdtsc() - begin;

> +	time_sec = (double)total_cycles / rte_get_timer_hz();
> +
> +	calc_result(params, total_cycles, time_sec, repeat_times, &memory,
> +			&avg_cycles, &bandwidth, &ops);
> +	output_result(params->scenario_id, lcore_id, dev_id, avg_cycles, buf_size, lcore_nr_buf,
> +			memory, bandwidth, ops, true);
> +
> +	rte_free(p);
> +
> +	return 0;
> +}
> +
> +static int
> +cpu_mem_copy(void *p)
> +{

Most of the comments from above also apply here.

> +	uint32_t idx;
> +	uint32_t lcore_id;
> +	uint32_t memory;
> +	uint64_t ops;
> +	float bandwidth;
> +	double time_sec;
> +	struct lcore_params *params = (struct lcore_params *)p;
> +	uint32_t repeat_times = params->repeat_times;
> +	uint32_t buf_size = params->buf_size;
> +	uint32_t lcore_nr_buf = params->nr_buf;
> +	uint16_t mpool_iter_step = params->mpool_iter_step;
> +	struct rte_mbuf **srcs = params->srcs;
> +	struct rte_mbuf **dsts = params->dsts;
> +	uint64_t begin, end, total_cycles = 0, avg_cycles = 0;
> +	uint32_t k, j, offset;
> +
> +	begin = rte_rdtsc();
> +
> +	for (k = 0; k < repeat_times; k++) {
> +		/* copy buffer from src to dst */
> +		for (offset = 0; offset < mpool_iter_step; offset++) {
> +			for (j = 0; idx = j * mpool_iter_step + offset, idx < lcore_nr_buf; j++) {
> +				rte_memcpy((void *)(uintptr_t)rte_mbuf_data_iova(dsts[idx]),
> +					(void *)(uintptr_t)rte_mbuf_data_iova(srcs[idx]),
> +					(size_t)buf_size);
> +			}
> +		}
> +	}
> +
> +	end = rte_rdtsc();
> +	total_cycles = end - begin;
> +	time_sec = (double)total_cycles / rte_get_timer_hz();
> +
> +	lcore_id = rte_lcore_id();
> +
> +	calc_result(params, total_cycles, time_sec, repeat_times, &memory,
> +			&avg_cycles, &bandwidth, &ops);
> +	output_result(params->scenario_id, lcore_id, 0, avg_cycles, buf_size, lcore_nr_buf,
> +			memory, bandwidth, ops, false);
> +
> +	rte_free(p);
> +
> +	return 0;
> +}
> +
> +static int
> +setup_memory_env(struct test_configure *cfg, struct rte_mbuf ***srcs,
> +			struct rte_mbuf ***dsts)
> +{
> +	uint32_t i;
> +	unsigned int buf_size = cfg->buf_size.cur;
> +	unsigned int nr_sockets;
> +	uint32_t nr_buf = cfg->nr_buf;
> +
> +	nr_sockets = rte_socket_count();
> +	if (cfg->src_numa_node >= nr_sockets ||
> +	    cfg->dst_numa_node >= nr_sockets) {
> +		printf("Error: Source or destination numa exceeds the actual numa nodes.\n");
> +		return -1;
> +	}
> +
> +	src_pool = rte_pktmbuf_pool_create("Benchmark_DMA_SRC",
> +			nr_buf, /* n == num elements */
> +			64,  /* cache size */
> +			0,   /* priv size */
> +			buf_size + RTE_PKTMBUF_HEADROOM,
> +			cfg->src_numa_node);
> +	if (src_pool == NULL) {
> +		PRINT_ERR("Error with source mempool creation.\n");
> +		return -1;
> +	}
> +
> +	dst_pool = rte_pktmbuf_pool_create("Benchmark_DMA_DST",
> +			nr_buf, /* n == num elements */
> +			64,  /* cache size */
> +			0,   /* priv size */
> +			buf_size + RTE_PKTMBUF_HEADROOM,
> +			cfg->dst_numa_node);
> +	if (dst_pool == NULL) {
> +		PRINT_ERR("Error with destination mempool creation.\n");
> +		return -1;
> +	}
> +
> +	*srcs = (struct rte_mbuf **)(malloc(nr_buf * sizeof(struct rte_mbuf *)));

Typecasts from void * to other types aren't actually necessary in C.

I note some inconsistency in this file with regard to malloc. Here you
use regular malloc, while when building the parameters to pass to the
memcpy functions you use rte_malloc. I suggest standardizing on one or
the other rather than mixing them.

> +	if (*srcs == NULL) {
> +		printf("Error: srcs malloc failed.\n");
> +		return -1;
> +	}
> +
> +	*dsts = (struct rte_mbuf **)(malloc(nr_buf * sizeof(struct rte_mbuf *)));
> +	if (*dsts == NULL) {
> +		printf("Error: dsts malloc failed.\n");
> +		return -1;
> +	}
> +
> +	for (i = 0; i < nr_buf; i++) {
> +		(*srcs)[i] = rte_pktmbuf_alloc(src_pool);
> +		(*dsts)[i] = rte_pktmbuf_alloc(dst_pool);

Rather than individually allocating, you may well manage with
rte_mempool_get_bulk() to allocate all mbufs in one call.
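A sketch of that, using the rte_pktmbuf_alloc_bulk() wrapper (which also
takes care of resetting the mbufs), assuming it's acceptable to fail
outright if the pool runs dry:

	if (rte_pktmbuf_alloc_bulk(src_pool, *srcs, nr_buf) != 0) {
		printf("Error: src mbuf bulk allocation failed.\n");
		return -1;
	}
	if (rte_pktmbuf_alloc_bulk(dst_pool, *dsts, nr_buf) != 0) {
		printf("Error: dst mbuf bulk allocation failed.\n");
		return -1;
	}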
> +		if ((!(*srcs)[i]) || (!(*dsts)[i])) {
> +			printf("src: %p, dst: %p\n", (*srcs)[i], (*dsts)[i]);
> +			return -1;
> +		}
> +
> +		(*srcs)[i]->data_len = (*srcs)[i]->pkt_len = buf_size;
> +		(*dsts)[i]->data_len = (*dsts)[i]->pkt_len = buf_size;

The rte_pktmbuf_append() function can be used here, rather than setting
the lengths manually. However, these values are not actually used
anywhere else in the code, I believe, so setting them is unnecessary.
You are manually tracking the copy lengths throughout the test, and
nothing else is working on the mbufs, so the length the mbuf reports is
immaterial.

> +	}
> +
> +	return 0;
> +}
> +
> +void
> +dma_mem_copy_benchmark(struct test_configure *cfg)
> +{
> +	uint32_t i;
> +	uint32_t offset;
> +	unsigned int lcore_id = 0;
> +	struct rte_mbuf **srcs = NULL, **dsts = NULL;
> +	unsigned int buf_size = cfg->buf_size.cur;
> +	uint16_t kick_batch = cfg->kick_batch.cur;
> +	uint16_t mpool_iter_step = cfg->mpool_iter_step;
> +	uint32_t nr_buf = cfg->nr_buf = (cfg->mem_size.cur * 1024 * 1024) / (cfg->buf_size.cur * 2);
> +	uint16_t nb_workers = cfg->nb_workers;
> +	uint32_t repeat_times = cfg->repeat_times;
> +
> +	if (setup_memory_env(cfg, &srcs, &dsts) < 0)
> +		goto out;
> +
> +	if (config_dmadevs(nb_workers, cfg->ring_size.cur) < 0)
> +		goto out;
> +
> +	if (cfg->cache_flush) {
> +		struct buf_info info;
> +
> +		info.array = srcs;
> +		info.buf_size = buf_size;
> +		info.nr_buf = nr_buf;
> +		cache_flush_buf(&info);
> +

From what I can see, struct buf_info is only used for passing parameters
to the cache_flush_buf function. The code would be a lot simpler to
remove the structure and just pass 3 parameters to the function
directly.
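With the three-parameter signature sketched earlier, this whole block
would then reduce to something like:

	if (cfg->cache_flush) {
		cache_flush_buf(srcs, nr_buf, buf_size);
		cache_flush_buf(dsts, nr_buf, buf_size);
		rte_mb();
	}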
> +		info.array = dsts;
> +		cache_flush_buf(&info);
> +		rte_mb();
> +	}
> +
> +	printf("Start testing....\n");
> +
> +	for (i = 0; i < nb_workers; i++) {
> +		lcore_id = rte_get_next_lcore(lcore_id, true, true);
> +		offset = nr_buf / nb_workers * i;
> +
> +		struct lcore_params *p = rte_malloc(NULL, sizeof(*p), 0);
> +		if (!p) {
> +			printf("lcore parameters malloc failure for lcore %d\n", lcore_id);
> +			break;
> +		}
> +		*p = (struct lcore_params) {
> +			dmadev_ids[i],
> +			(uint32_t)(nr_buf/nb_workers),
> +			kick_batch,
> +			buf_size,
> +			repeat_times,
> +			mpool_iter_step,
> +			srcs + offset,
> +			dsts + offset,
> +			cfg->scenario_id
> +		};
> +
> +		rte_eal_remote_launch((lcore_function_t *)dma_mem_copy, p, lcore_id);
> +	}
> +
> +	rte_eal_mp_wait_lcore();
> +
> +out:
> +	/* free env */
> +	if (srcs) {
> +		for (i = 0; i < nr_buf; i++)
> +			rte_pktmbuf_free(srcs[i]);
> +		free(srcs);
> +	}
> +	if (dsts) {
> +		for (i = 0; i < nr_buf; i++)
> +			rte_pktmbuf_free(dsts[i]);
> +		free(dsts);
> +	}
> +
> +	if (src_pool)
> +		rte_mempool_free(src_pool);
> +	if (dst_pool)
> +		rte_mempool_free(dst_pool);
> +
> +	for (i = 0; i < nb_dmadevs; i++) {
> +		printf("Stopping dmadev %d\n", dmadev_ids[i]);
> +		rte_dma_stop(dmadev_ids[i]);
> +	}
> +}
> +
> +void
> +cpu_mem_copy_benchmark(struct test_configure *cfg)
> +{
> +	uint32_t i, offset;
> +	uint32_t repeat_times = cfg->repeat_times;
> +	uint32_t kick_batch = cfg->kick_batch.cur;
> +	uint32_t buf_size = cfg->buf_size.cur;
> +	uint32_t nr_buf = cfg->nr_buf = (cfg->mem_size.cur * 1024 * 1024) / (cfg->buf_size.cur * 2);
> +	uint16_t nb_workers = cfg->nb_workers;
> +	uint16_t mpool_iter_step = cfg->mpool_iter_step;
> +	struct rte_mbuf **srcs = NULL, **dsts = NULL;
> +	unsigned int lcore_id = 0;
> +
> +	if (setup_memory_env(cfg, &srcs, &dsts) < 0)
> +		goto out;
> +
> +	for (i = 0; i < nb_workers; i++) {
> +		lcore_id = rte_get_next_lcore(lcore_id, rte_lcore_count() > 1 ? 1 : 0, 1);
> +		offset = nr_buf / nb_workers * i;
> +		struct lcore_params *p = rte_malloc(NULL, sizeof(*p), 0);
> +		if (!p) {
> +			printf("lcore parameters malloc failure for lcore %d\n", lcore_id);
> +			break;
> +		}
> +		*p = (struct lcore_params) { 0, nr_buf/nb_workers, kick_batch,
> +						buf_size, repeat_times, mpool_iter_step,
> +						srcs + offset, dsts + offset, cfg->scenario_id };

Formatting should be the same as the function above.

> +		rte_eal_remote_launch((lcore_function_t *)cpu_mem_copy, p, lcore_id);
> +	}
> +
> +	rte_eal_mp_wait_lcore();
> +
> +out:
> +	/* free env */
> +	if (srcs) {
> +		for (i = 0; i < nr_buf; i++)
> +			rte_pktmbuf_free(srcs[i]);
> +		free(srcs);
> +	}
> +	if (dsts) {
> +		for (i = 0; i < nr_buf; i++)
> +			rte_pktmbuf_free(dsts[i]);
> +		free(dsts);
> +	}
> +
> +	if (src_pool)
> +		rte_mempool_free(src_pool);
> +	if (dst_pool)
> +		rte_mempool_free(dst_pool);
> +}

There seems to be quite a bit of common code between
dma_mem_copy_benchmark and cpu_mem_copy_benchmark. Might be worth
investigating if they can be merged while still keeping readability.
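As a rough, untested sketch of what such a merge might look like - a
single function taking an is_dma flag, which as a bonus would also let
you drop the function-pointer casts on the launch calls, since both
worker functions already take a void pointer:

	static void
	mem_copy_benchmark(struct test_configure *cfg, bool is_dma)
	{
		/* ... common setup of srcs/dsts, plus config_dmadevs() when
		 * is_dma is set ... */

		for (i = 0; i < nb_workers; i++) {
			lcore_id = rte_get_next_lcore(lcore_id, true, true);
			offset = nr_buf / nb_workers * i;

			struct lcore_params *p = rte_malloc(NULL, sizeof(*p), 0);
			if (p == NULL) {
				printf("lcore parameters malloc failure for lcore %d\n",
						lcore_id);
				break;
			}
			*p = (struct lcore_params) {
				is_dma ? dmadev_ids[i] : 0,
				nr_buf / nb_workers,
				kick_batch, buf_size, repeat_times, mpool_iter_step,
				srcs + offset, dsts + offset, cfg->scenario_id
			};
			/* no cast needed, both workers match lcore_function_t */
			rte_eal_remote_launch(is_dma ? dma_mem_copy : cpu_mem_copy,
					p, lcore_id);
		}
		rte_eal_mp_wait_lcore();

		/* ... common cleanup, plus rte_dma_stop() when is_dma ... */
	}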
> diff --git a/app/test-dma-perf/benchmark.h b/app/test-dma-perf/benchmark.h
> new file mode 100644
> index 0000000000..f5ad8d6d99
> --- /dev/null
> +++ b/app/test-dma-perf/benchmark.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2022 Intel Corporation
> + */
> +
> +#ifndef _BENCHMARK_H_
> +#define _BENCHMARK_H_
> +
> +void dma_mem_copy_benchmark(struct test_configure *cfg);
> +
> +void cpu_mem_copy_benchmark(struct test_configure *cfg);
> +
> +#endif /* _BENCHMARK_H_ */

You don't really need two separate headers in this application. Both
main.h and benchmark.h can be merged into one header, since both are
always included in both c files.

> diff --git a/app/test-dma-perf/config.ini b/app/test-dma-perf/config.ini
> new file mode 100644
> index 0000000000..e24bb19414
> --- /dev/null
> +++ b/app/test-dma-perf/config.ini
> @@ -0,0 +1,61 @@
> +
> +; Supported test types:
> +; DMA_MEM_COPY|CPU_MEM_COPY
> +
> +; Parameters:
> +; "mem_size","buf_size","dma_ring_size","kick_batch".
> +; "mem_size" means the size of the memory footprint.
> +; "buf_size" means the memory size of a single operation.
> +; "dma_ring_size" means the dma ring buffer size.
> +; "kick_batch" means dma operation batch size.
> +
> +; Format: variable=first[,last,increment[,ADD|MUL]]
> +; ADD is the default mode.
> +
> +; src_numa_node is used to control the numa node where the source memory is allocated.
> +; dst_numa_node is used to control the numa node where the destination memory is allocated.
> +
> +; cache_flush is used to control if the cache should be flushed.
> +
> +; repeat_times is used to control the repeat times of the whole case.
> +
> +; worker_threads is used to control the threads number of the test app.
> +; It should be less than the core number.
> +
> +; mpool_iter_step is used to control the buffer continuity.
> +
> +; Bind DMA to lcore:
> +; Specify the "lcore_dma" parameter.
> +; The number of "lcore_dma" should be greater than or equal to the number of "worker_threads".
> +; Otherwise the remaining DMA devices will be automatically allocated to threads that are not
> +; specified. If EAL parameters "-l" and "-a" are specified, the "lcore_dma" should be within
> +; their range.
> +
> +[case1]
> +type=DMA_MEM_COPY
> +mem_size=10
> +buf_size=64,8192,2,MUL
> +dma_ring_size=1024
> +kick_batch=32
> +src_numa_node=0
> +dst_numa_node=0
> +cache_flush=0
> +repeat_times=10
> +worker_threads=1
> +mpool_iter_step=1
> +lcore_dma=lcore3@0000:00:04.0
> +eal_args=--legacy-mem --file-prefix=test
> +
> +[case2]
> +type=CPU_MEM_COPY
> +mem_size=10
> +buf_size=64,8192,2,MUL
> +dma_ring_size=1024
> +kick_batch=32
> +src_numa_node=0
> +dst_numa_node=1
> +cache_flush=0
> +repeat_times=100
> +worker_threads=1
> +mpool_iter_step=1
> +eal_args=--no-pci
> diff --git a/app/test-dma-perf/main.c b/app/test-dma-perf/main.c
> new file mode 100644
> index 0000000000..94ba369539
> --- /dev/null
> +++ b/app/test-dma-perf/main.c
> @@ -0,0 +1,434 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2022 Intel Corporation
> + */
> +
> +#include
> +#if !defined(RTE_EXEC_ENV_LINUX)
> +
> +int
> +main(int argc, char *argv[])
> +{
> +	printf("OS not supported, skipping test\n");
> +	return 0;
> +}
> +

What is Linux-specific about this app? If we do need to limit the app to
Linux-only, I suggest using meson to do so rather than putting #ifdefs
in the code.

> +#else
> +
> +#include
> +#include
> +#include