From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Hu, Jiayu"
To: "Richardson, Bruce", "Jiang, Cheng1"
Cc: thomas@monjalon.net, mb@smartsharesystems.com, dev@dpdk.org,
 "Ding, Xuan", "Ma, WenwuX", "Wang, YuanX", "He, Xingguang"
Subject: RE: [PATCH v3] app/dma-perf: introduce dma-perf application
Date: Tue, 31 Jan 2023 05:27:45 +0000
References: <20221220010619.31829-1-cheng1.jiang@intel.com>
 <20230117120526.39375-1-cheng1.jiang@intel.com>
List-Id: DPDK patches and discussions
Hi Bruce,

> -----Original Message-----
> From: Richardson, Bruce
> Sent: Wednesday, January 18, 2023 12:52 AM
> To: Jiang, Cheng1
> Cc: thomas@monjalon.net; mb@smartsharesystems.com; dev@dpdk.org;
> Hu, Jiayu; Ding, Xuan; Ma, WenwuX; Wang, YuanX; He, Xingguang
> Subject: Re: [PATCH v3] app/dma-perf: introduce dma-perf application
>
> > +static inline void
> > +do_dma_mem_copy(uint16_t dev_id, uint32_t nr_buf, uint16_t kick_batch, uint32_t buf_size,
> > +			uint16_t mpool_iter_step, struct rte_mbuf **srcs,
> > +			struct rte_mbuf **dsts)
> > +{
> > +	int64_t async_cnt = 0;
> > +	int nr_cpl = 0;
> > +	uint32_t index;
> > +	uint16_t offset;
> > +	uint32_t i;
> > +
> > +	for (offset = 0; offset < mpool_iter_step; offset++) {
> > +		for (i = 0; index = i * mpool_iter_step + offset, index < nr_buf; i++) {
> > +			if (unlikely(rte_dma_copy(dev_id,
> > +					0,
> > +					srcs[index]->buf_iova + srcs[index]->data_off,
> > +					dsts[index]->buf_iova + dsts[index]->data_off,
> > +					buf_size,
> > +					0) < 0)) {
> > +				rte_dma_submit(dev_id, 0);
> > +				while (rte_dma_burst_capacity(dev_id, 0) == 0) {
> > +					nr_cpl = rte_dma_completed(dev_id, 0,
> > +							MAX_DMA_CPL_NB, NULL, NULL);
> > +					async_cnt -= nr_cpl;
> > +				}
> > +				if (rte_dma_copy(dev_id,
> > +						0,
> > +						srcs[index]->buf_iova + srcs[index]->data_off,
> > +						dsts[index]->buf_iova + dsts[index]->data_off,
> > +						buf_size,
> > +						0) < 0) {
> > +					printf("enqueue fail again at %u\n", index);
> > +					printf("space:%d\n", rte_dma_burst_capacity(dev_id, 0));
> > +					rte_exit(EXIT_FAILURE, "DMA enqueue failed\n");
> > +				}
> > +			}
> > +			async_cnt++;
> > +
> > +			/**
> > +			 * When '&' is used to wrap an index, mask must be a power of 2.
> > +			 * That is, kick_batch must be 2^n.
> > +			 */
> > +			if (unlikely((async_cnt % kick_batch) == 0)) {
> > +				rte_dma_submit(dev_id, 0);
> > +				/* add a poll to avoid ring full */
> > +				nr_cpl = rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB, NULL, NULL);
> > +				async_cnt -= nr_cpl;
> > +			}
> > +		}
> > +
> > +		rte_dma_submit(dev_id, 0);
> > +		while (async_cnt > 0) {
> > +			nr_cpl = rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB, NULL, NULL);
> > +			async_cnt -= nr_cpl;
> > +		}
>
> I have a couple of concerns about the methodology for testing the HW DMA
> performance. For example, the inclusion of that final block means that we
> are including the latency of the copy operation in the result.

The waiting time will not exceed the time needed to complete the jobs. We
also include the initial startup time of waiting for the DMA engine to start
completing jobs. If the total number of jobs is large, the impact of both is
negligible. However, the problem is that we cannot guarantee this in the
current implementation, especially when the memory footprint is small.

> If the objective of the test application is to determine if it is cheaper for
> software to offload a copy operation to HW or do it in SW, then the primary
> concern is the HW offload cost. That offload cost should remain constant
> irrespective of the size of the copy - since all you are doing is writing a
> descriptor and reading a completion result. However, seeing the results of
> running the app, I notice that the reported average cycles increases as the
> packet size increases, which would tend to indicate that we are not giving a
> realistic measurement of offload cost.

I agree that the offload cost is very important. When using DMA in an async
manner, for N jobs, the total DMA processing cycles, the total offload cost
cycles, and the total core higher-level processing cycles determine the
final performance. The total offload cost can be roughly calculated as the
average offload cost times the number of offloading operations. As a
benchmark tool, we can report the average offload cost (i.e., the total
cycles of one submit, one kick and one poll function call), which is
independent of copy size. For example, for 1024 jobs and a batch size of 16,
the total offload cost is 64 (1024/16) times the average offload cost:

Repeat 64 times {
	submit 16 jobs;	// always succeeds
	kick;
	poll;		// 16 jobs are done
}
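To make that concrete, the per-batch measurement could look roughly like the
sketch below. This is only an illustration, not code from the patch: the
single src/dst pair and the assumption that ring space is always available
are simplifications, and MAX_DMA_CPL_NB and the vchan-0 convention are taken
from the code above.

#include <rte_cycles.h>
#include <rte_dmadev.h>

static uint64_t
measure_offload_cost(uint16_t dev_id, uint32_t nr_batches, uint16_t kick_batch,
		rte_iova_t src, rte_iova_t dst, uint32_t buf_size)
{
	uint64_t total_cycles = 0;
	uint32_t b;
	uint16_t j;

	for (b = 0; b < nr_batches; b++) {
		uint64_t start = rte_rdtsc();

		/* enqueue one batch; assumes the ring never fills up */
		for (j = 0; j < kick_batch; j++)
			rte_dma_copy(dev_id, 0, src, dst, buf_size, 0);
		rte_dma_submit(dev_id, 0);	/* kick */
		/* one poll; completions are drained but not waited on */
		rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB, NULL, NULL);

		total_cycles += rte_rdtsc() - start;
	}
	/* average cycles of one submit + kick + poll sequence */
	return total_cycles / nr_batches;
}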
As you pointed out, the above estimation method may become inaccurate if an
application only calls the poll function occasionally, or has to wait for
space when a submit fails. But the offload cost is still a very important
value for applications, as it shows the basic cost of using the DPDK dmadev
library and provides a rough estimation for applications.

> The trouble then becomes how to do so in a more realistic manner. The most
> accurate way I can think of in a unit test like this is to offload
> entries to the device and measure the cycles taken there. Then wait until
> such time as all copies are completed (to eliminate the latency time, which in
> a real-world case would be spent by a core doing something else), and then
> do a second measurement of the time taken to process all the completions.
> In the same way as for a SW copy, any time not spent in memcpy is not copy
> time; for HW copies, any time spent not writing descriptors or reading
> completions is not part of the offload cost.
>
> That said, doing the above is still not fully realistic, as a real-world app will
> likely still have some amount of other overhead, for example, polling
> occasionally for completions in between doing other work (though one
> would expect this to be relatively cheap). Similarly, if the submission queue
> fills, the app may have to delay waiting for space to submit jobs, and
> therefore see some of the HW copy latency.
>
> Therefore, I think the most realistic way to measure this is to look at the rate
> of operations while processing is being done in the middle of the test. For
> example, if we have a simple packet processing application, running the
> application just doing RX and TX and measuring the rate allows us to
> determine the basic packet I/O cost. Adding in an offload to HW for each
> packet and again measuring the rate will then allow us to compute the true
> offload copy cost of the operation, and should give us a number that remains
> flat even as packet size increases. For previous work done on vhost with
> DMA acceleration, I believe we saw exactly that - while SW PPS reduced as
> packet size increased, with HW copies the PPS remained constant even as
> packet size increased.
>
> The challenge, to my mind, is therefore how to implement this in a suitable
> unit-test style way, to fit into the framework you have given here. I would
> suggest that the actual performance measurement needs to be done - not
> on a total time - but on a fixed time basis within each test. For example,
> when doing HW copies, 1ms into each test run, we need to snapshot the
> completed entries, and then say 1ms later measure the number that have
> been completed since. In this way, we avoid the initial startup latency while
> we wait for jobs to start completing, and we avoid the final latency as we
> await the last job to complete. We would also include time for some
> potentially empty polls, and if a queue size is too small, see that reflected in
> the performance too.

The method you mentioned above is like how we measure NIC RX/TX PPS, where
the main thread is in charge of snapshotting the completed jobs of all
worker threads for a fixed time. But the trouble is deciding when to finish
one test, as the current framework runs N test cases until all are
completed. We could fix the test duration per case, say 1s, and within each
test the core repeatedly feeds N jobs to the DMA device until the time runs
out. A rough sketch of such a fixed-duration, snapshot-based measurement
follows below.
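The sketch is only an illustration, not part of the patch: enqueue_one_batch()
is a hypothetical helper that enqueues kick_batch jobs with rte_dma_copy(),
and the warm-up snapshot follows your fixed-time suggestion only loosely.

#include <stdbool.h>
#include <rte_cycles.h>
#include <rte_dmadev.h>

/* hypothetical helper: enqueue kick_batch jobs via rte_dma_copy() */
static void enqueue_one_batch(uint16_t dev_id, uint16_t kick_batch);

static double
measure_rate(uint16_t dev_id, uint16_t kick_batch, double test_secs)
{
	const uint64_t hz = rte_get_timer_hz();
	const uint64_t end = rte_get_timer_cycles() + (uint64_t)(test_secs * hz);
	uint64_t done = 0, snap_jobs = 0, snap_cycles = 0;
	bool warmed_up = false;

	while (rte_get_timer_cycles() < end) {
		enqueue_one_batch(dev_id, kick_batch);
		rte_dma_submit(dev_id, 0);
		done += rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB, NULL, NULL);

		/* take the first snapshot once jobs start completing,
		 * so the initial startup latency is excluded */
		if (!warmed_up && done > 0) {
			warmed_up = true;
			snap_jobs = done;
			snap_cycles = rte_get_timer_cycles();
		}
	}
	if (!warmed_up)
		return 0;
	/* completions per second between the snapshot and the end of the
	 * test window; the final drain of in-flight jobs is not counted */
	return (double)(done - snap_jobs) * hz /
			(double)(rte_get_timer_cycles() - snap_cycles);
}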
Lastly, I want to point out that the current DMA throughput is measured in
an async manner. The result would be different with a sync usage model, so
the benchmark tool needs to state this in the final results.

Thanks,
Jiayu

> Thoughts, input from others?
>
> /Bruce