From: Bruce Richardson
To: Morten Brørup
Cc: Honnappa Nagarahalli, Tom Barbette, dev@dpdk.org, nd, Alireza Farshin, "Van Haaren, Harry"
Date: Wed, 7 Apr 2021 10:58:53 +0100
Subject: Re: [dpdk-dev] Minutes of Technical Board Meeting, 2021-03-10
Message-ID: <20210407095853.GA1644@bricha3-MOBL.ger.corp.intel.com>
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35C616C1@smartserver.smartshare.dk>

On Wed, Apr 07, 2021 at 09:11:23AM +0200, Morten Brørup wrote:
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Honnappa Nagarahalli
> > Sent: Wednesday, April 7, 2021 2:48 AM
> >
> > >
> > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Tom Barbette
> > > > Sent: Wednesday, March 31, 2021 10:53 AM
> > > >
> > > > On 31-03-21 at 02:44, Honnappa Nagarahalli wrote:
> > > > > - Ability to tune the values of #defines
> > > > >   * Few prominent points discussed
> > > > >     - This will result in #ifdefs in the code (for ex: in testpmd)
> > > > >     - One option is for all the PMDs to document their configurable
> > > > >       #defines in PMD specific header files. Having these
> > > > >       distributed is much easier to search.
> > > > >     - Can some of the existing #defines be converted to runtime
> > > > >       configurations? For ex: RTE_MAX_LCORE? This might impact ABI.
> > > > >   * Bruce to think about converting the doc to a blog or an email
> > > > >     on the mailing list. But soliciting feedback is most important.
> > > >
> > > > One alternative path worth looking at is to encourage the use of LTO,
> > > > and modify APIs so the configuration can be provided at linking time,
> > > > and propagated by the compiler.
> > > >
> > > > E.g. one can define rte_max_lcore as a weak constant symbol, equal to
> > > > 128. At linking time the user may provide a rte_max_lcore that is more
> > > > tailored, and still, dynamic arrays[rte_max_lcore] will be allocatable
> > > > on the .bss section, avoiding an indirection. The compiler will be
> > > > able to optimize loops etc which is impossible with pure runtime
> > > > configuration.
> > > >
> > > > In packetmill.io we actually pushed this to the next level where the
> > > > driver can completely change its behavior without recompiling DPDK
> > > > itself and spawning ifdefs everywhere.
> > > >
> > > > However the price is the slowness of LTO...
> > > >
> > > > My 2 cents.
> > > >
> > > > Tom
> > > >
> > >
> > > If we are moving away from Compile Time parameters, I certainly prefer
> > > Tom's suggestion of Link Time parameters, rather than Run Time
> > > parameters.
> > I think compile time constants are fine if they are not used in #ifdef.
> > For ex: if they are used in 'if (...)', it will help eliminate code and
> > branches.
>
> Yes!
>
> And "if (...)" is more flexible than #ifdef/#if because it allows the
> expression to be mixed with non-constants.
>
> Then perhaps Bruce's script to automatically make C constants out of
> #defines was not so silly anyway. :-)
>
> >
> > >
> > > This might also provide a middle ground for optimizations where Compile
> > > Time parameters are considered unacceptable by the DPDK community. I'm
> > > thinking about something along the lines of the "constant size"
> > > rte_event array presented at the 2020 Userspace Summit by Harry
> > > (https://static.sched.com/hosted_files/dpdkuserspace2020/d3/dpdk_userspace_20_api_performance_hvh.pdf).
> > > Taking this thinking even further out, a Link Time parameter could
> > > perhaps replace the nb_pkts parameter in an optimized
> > > rte_eth_rx_burst() function.
> > >
>
> Optimally, I would like to see e.g. the RX burst size being so constant
> that the PMD's RX function knows it and can use vector functions and
> possibly loop unrolling, without having to implement a pre-check on
> nb_pkts and a trailing non-vector loop for receiving any remaining odd
> nb_pkts. All the DPDK examples use #define MAX_PKT_BURST 32 or similar,
> and I assume most DPDK applications do too.
>
> I do not trust the compiler to be clever enough to realize that the PMD's
> RX function is always called with a specific nb_pkts and optimize all
> this cruft away at compile time (or at link time), unless it is a #define
> or a compile time constant.
>
It certainly is not possible to do at compile time, because the calls are
in a different compilation unit from the functions themselves, not to
mention that at link time the RX functions are called via a function
pointer. Therefore the only way to do this that I am aware of is to have a
wrapper function used for the common values inside the drivers themselves.

For example, inside the i40e driver (which I'm using because it's the one
I'm most familiar with), the main receive function is already a wrapper
around a raw receive function, using constant-expansion by the compiler of
the final parameter (NULL) to automatically remove the code for tracking
scattered packets.

  uint16_t
  i40e_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
                          uint16_t nb_pkts)
  {
          /* constant NULL removes the scattered-packet tracking code */
          return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
  }

We can produce a version of this optimized for 32-element dequeues by
special-casing where nb_pkts == 32:

  uint16_t
  i40e_recv_pkts_vec_avx2(void *rx_queue, struct rte_mbuf **rx_pkts,
                          uint16_t nb_pkts)
  {
          /* literal 32 gives the inlined copy a constant burst size */
          if (nb_pkts == 32)
                  return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, 32, NULL);
          return _recv_raw_pkts_vec_avx2(rx_queue, rx_pkts, nb_pkts, NULL);
  }

Now the compiler, when inlining the _raw_ function, can see that it needs
two copies and, for the first, that nb_pkts is a compile-time constant of
32.

However, I'm not sure how useful an optimization like this is, and I'd be
interested to see what benefits testing shows. Beyond the loop iteration
count of 32, there is also the check after each burst of 8 dequeues in the
driver to check that we have a full set of 8 - and abort the loop if not.
It's also the case that unless an app is already at maximum load (or
overloaded), one would probably not expect to always get a full set of 32
packets each time, as you would then have no additional headroom for more.

/Bruce
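
As a rough, self-contained illustration of the special-casing idea in this
mail (not the i40e code), here is a toy sketch. All names in it (toy_rxq,
toy_recv_raw, toy_recv_pkts, BURST_GROUP) are hypothetical, and it assumes
the GCC/Clang always_inline attribute. The wrapper passes a literal 32 to a
force-inlined raw routine, while the early exit after a partial group of 8
stays in place whatever the requested count is. Rounding the request down
to a multiple of 8 is a simplification made for this toy only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BURST_GROUP 8   /* packets handled per loop iteration in this toy */

/* Hypothetical stand-in for a driver RX queue: only a count of ready packets. */
struct toy_rxq {
    uint16_t ready;
};

/*
 * Toy "raw" receive routine. It is force-inlined so that every call site in
 * the wrapper gets its own copy, and a literal nb_pkts can be propagated
 * into that copy by the compiler.
 */
static inline __attribute__((always_inline)) uint16_t
toy_recv_raw(struct toy_rxq *q, int *pkts, uint16_t nb_pkts)
{
    uint16_t received = 0;

    /* Toy simplification: only whole groups are requested. */
    nb_pkts -= nb_pkts % BURST_GROUP;

    while (received < nb_pkts) {
        uint16_t n = q->ready < BURST_GROUP ? q->ready : BURST_GROUP;

        for (uint16_t i = 0; i < n; i++)
            pkts[received + i] = 1;   /* pretend to fill in a packet */
        q->ready -= n;
        received += n;

        if (n < BURST_GROUP)
            break;   /* partial group: stop early, whatever nb_pkts was */
    }
    return received;
}

/* Wrapper that special-cases the common burst size of 32. */
uint16_t
toy_recv_pkts(struct toy_rxq *q, int *pkts, uint16_t nb_pkts)
{
    assert(q != NULL && pkts != NULL);

    if (nb_pkts == 32)
        return toy_recv_raw(q, pkts, 32);   /* copy with a constant count */
    return toy_recv_raw(q, pkts, nb_pkts);  /* general-purpose copy */
}
```

With 20 packets ready, toy_recv_pkts(&q, pkts, 32) takes the specialized
path but still returns 20, because the partial-group check ends the loop
early. Whether the constant count buys anything measurable would need
testing, as noted above.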