From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id 1FDA5A0C4A; Thu, 17 Jun 2021 18:13:06 +0200 (CEST) Received: from [217.70.189.124] (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id 9B38A4067A; Thu, 17 Jun 2021 18:13:05 +0200 (CEST) Received: from mga14.intel.com (mga14.intel.com [192.55.52.115]) by mails.dpdk.org (Postfix) with ESMTP id 9F16040150 for ; Thu, 17 Jun 2021 18:13:03 +0200 (CEST) IronPort-SDR: 8xrUPRb4dsTK6bOLa2/jWDnaRJhtFcY3NhaC/Sw/knbshupadYCJ8tLBTjsa2LHQVbz5c8teS3 UxoYkQkm8hjw== X-IronPort-AV: E=McAfee;i="6200,9189,10018"; a="206219109" X-IronPort-AV: E=Sophos;i="5.83,280,1616482800"; d="scan'208";a="206219109" Received: from orsmga004.jf.intel.com ([10.7.209.38]) by fmsmga103.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 17 Jun 2021 09:13:02 -0700 IronPort-SDR: 2FyPCR7AqRE1hKrdG9PaAO/TAjZprecgIyzdghyq9Vjug2dRY9Tllw628nUnH3zQjDX1XY51Dk qpVHwKcF9AsA== X-IronPort-AV: E=Sophos;i="5.83,280,1616482800"; d="scan'208";a="554446663" Received: from fyigit-mobl1.ger.corp.intel.com (HELO [10.213.201.111]) ([10.213.201.111]) by orsmga004-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 17 Jun 2021 09:12:58 -0700 To: =?UTF-8?Q?Morten_Br=c3=b8rup?= , "Ananyev, Konstantin" , Thomas Monjalon , "Richardson, Bruce" Cc: dev@dpdk.org, olivier.matz@6wind.com, andrew.rybchenko@oktetlabs.ru, honnappa.nagarahalli@arm.com, jerinj@marvell.com, gakhil@marvell.com References: <20210614105839.3379790-1-thomas@monjalon.net> <98CBD80474FA8B44BF855DF32C47DC35C6184E@smartserver.smartshare.dk> <2004320.XGyPsaEoyj@thomas> <98CBD80474FA8B44BF855DF32C47DC35C61868@smartserver.smartshare.dk> From: Ferruh Yigit X-User: ferruhy Message-ID: Date: Thu, 17 Jun 2021 17:12:53 +0100 MIME-Version: 1.0 In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35C61868@smartserver.smartshare.dk> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit Subject: Re: [dpdk-dev] [PATCH] parray: introduce internal API for dynamic arrays X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" On 6/17/2021 4:17 PM, Morten Brørup wrote: >> From: Ananyev, Konstantin [mailto:konstantin.ananyev@intel.com] >> Sent: Thursday, 17 June 2021 16.59 >> >>>>>> >>>>>> 14/06/2021 15:15, Bruce Richardson: >>>>>>> On Mon, Jun 14, 2021 at 02:22:42PM +0200, Morten Brørup wrote: >>>>>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas >> Monjalon >>>>>>>>> Sent: Monday, 14 June 2021 12.59 >>>>>>>>> >>>>>>>>> Performance of access in a fixed-size array is very good >>>>>>>>> because of cache locality >>>>>>>>> and because there is a single pointer to dereference. >>>>>>>>> The only drawback is the lack of flexibility: >>>>>>>>> the size of such an array cannot be increase at runtime. >>>>>>>>> >>>>>>>>> An approach to this problem is to allocate the array at >> runtime, >>>>>>>>> being as efficient as static arrays, but still limited to a >> maximum. >>>>>>>>> >>>>>>>>> That's why the API rte_parray is introduced, >>>>>>>>> allowing to declare an array of pointer which can be resized >>>>>>>>> dynamically >>>>>>>>> and automatically at runtime while keeping a good read >> performance. >>>>>>>>> >>>>>>>>> After resize, the previous array is kept until the next resize >>>>>>>>> to avoid crashs during a read without any lock. >>>>>>>>> >>>>>>>>> Each element is a pointer to a memory chunk dynamically >> allocated. >>>>>>>>> This is not good for cache locality but it allows to keep the >> same >>>>>>>>> memory per element, no matter how the array is resized. >>>>>>>>> Cache locality could be improved with mempools. >>>>>>>>> The other drawback is having to dereference one more pointer >>>>>>>>> to read an element. >>>>>>>>> >>>>>>>>> There is not much locks, so the API is for internal use only. >>>>>>>>> This API may be used to completely remove some compilation- >> time >>>>>>>>> maximums. >>>>>>>> >>>>>>>> I get the purpose and overall intention of this library. >>>>>>>> >>>>>>>> I probably already mentioned that I prefer "embedded style >> programming" with fixed size arrays, rather than runtime >> configurability. >>>>> It's >>>>>> my personal opinion, and the DPDK Tech Board clearly prefers >> reducing the amount of compile time configurability, so there is no way >>> for >>>>>> me to stop this progress, and I do not intend to oppose to this >> library. :-) >>>>>>>> >>>>>>>> This library is likely to become a core library of DPDK, so I >> think it is important getting it right. Could you please mention a few >>>>> examples >>>>>> where you think this internal library should be used, and where >> it should not be used. Then it is easier to discuss if the border line >>> between >>>>>> control path and data plane is correct. E.g. this library is not >> intended to be used for dynamically sized packet queues that grow and >>> shrink >>>>> in >>>>>> the fast path. >>>>>>>> >>>>>>>> If the library becomes a core DPDK library, it should probably >> be public instead of internal. E.g. if the library is used to make >>>>>> RTE_MAX_ETHPORTS dynamic instead of compile time fixed, then some >> applications might also need dynamically sized arrays for their >>>>>> application specific per-port runtime data, and this library >> could serve that purpose too. >>>>>>>> >>>>>>> >>>>>>> Thanks Thomas for starting this discussion and Morten for >> follow-up. >>>>>>> >>>>>>> My thinking is as follows, and I'm particularly keeping in mind >> the cases >>>>>>> of e.g. RTE_MAX_ETHPORTS, as a leading candidate here. >>>>>>> >>>>>>> While I dislike the hard-coded limits in DPDK, I'm also not >> convinced that >>>>>>> we should switch away from the flat arrays or that we need fully >> dynamic >>>>>>> arrays that grow/shrink at runtime for ethdevs. I would suggest >> a half-way >>>>>>> house here, where we keep the ethdevs as an array, but one >> allocated/sized >>>>>>> at runtime rather than statically. This would allow us to have a >>>>>>> compile-time default value, but, for use cases that need it, >> allow use of a >>>>>>> flag e.g. "max-ethdevs" to change the size of the parameter >> given to the >>>>>>> malloc call for the array. This max limit could then be >> provided to apps >>>>>>> too if they want to match any array sizes. [Alternatively those >> apps could >>>>>>> check the provided size and error out if the size has been >> increased beyond >>>>>>> what the app is designed to use?]. There would be no extra >> dereferences per >>>>>>> rx/tx burst call in this scenario so performance should be the >> same as >>>>>>> before (potentially better if array is in hugepage memory, I >> suppose). >>>>>> >>>>>> I think we need some benchmarks to decide what is the best >> tradeoff. >>>>>> I spent time on this implementation, but sorry I won't have time >> for benchmarks. >>>>>> Volunteers? >>>>> >>>>> I had only a quick look at your approach so far. >>>>> But from what I can read, in MT environment your suggestion will >> require >>>>> extra synchronization for each read-write access to such parray >> element (lock, rcu, ...). >>>>> I think what Bruce suggests will be much ligther, easier to >> implement and less error prone. >>>>> At least for rte_ethdevs[] and friends. >>>>> Konstantin >>>> >>>> One more thought here - if we are talking about rte_ethdev[] in >> particular, I think we can: >>>> 1. move public function pointers (rx_pkt_burst(), etc.) from >> rte_ethdev into a separate flat array. >>>> We can keep it public to still use inline functions for 'fast' >> calls rte_eth_rx_burst(), etc. to avoid >>>> any regressions. >>>> That could still be flat array with max_size specified at >> application startup. >>>> 2. Hide rest of rte_ethdev struct in .c. >>>> That will allow us to change the struct itself and the whole >> rte_ethdev[] table in a way we like >>>> (flat array, vector, hash, linked list) without ABI/API breakages. >>>> >>>> Yes, it would require all PMDs to change prototype for >> pkt_rx_burst() function >>>> (to accept port_id, queue_id instead of queue pointer), but the >> change is mechanical one. >>>> Probably some macro can be provided to simplify it. >>>> >>> >>> We are already planning some tasks for ABI stability for v21.11, I >> think >>> splitting 'struct rte_eth_dev' can be part of that task, it enables >> hiding more >>> internal data. >> >> Ok, sounds good. >> >>> >>>> The only significant complication I can foresee with implementing >> that approach - >>>> we'll need a an array of 'fast' function pointers per queue, not >> per device as we have now >>>> (to avoid extra indirection for callback implementation). >>>> Though as a bonus we'll have ability to use different RX/TX >> funcions per queue. >>>> >>> >>> What do you think split Rx/Tx callback into its own struct too? >>> >>> Overall 'rte_eth_dev' can be split into three as: >>> 1. rte_eth_dev >>> 2. rte_eth_dev_burst >>> 3. rte_eth_dev_cb >>> >>> And we can hide 1 from applications even with the inline functions. >> >> As discussed off-line, I think: >> it is possible. >> My absolute preference would be to have just 1/2 (with CB hidden). >> But even with 1/2/3 in place I think it would be a good step forward. >> Probably worth to start with 1/2/3 first and then see how difficult it >> would be to switch to 1/2. >> Do you plan to start working on it? >> >> Konstantin > > If you do proceed with this, be very careful. E.g. the inlined rx/tx burst functions should not touch more cache lines than they do today - especially if there are many active ports. The inlined rx/tx burst functions are very simple, so thorough code review (and possibly also of the resulting assembly) is appropriate. Simple performance testing might not detect if more cache lines are accessed than before the modifications. > > Don't get me wrong... I do consider this an improvement of the ethdev library; I'm only asking you to take extra care! > ack If we split as above, I think device specific data 'struct rte_eth_dev_data' should be part of 1 (rte_eth_dev). Which means Rx/Tx inline functions access additional cache line. To prevent this, what about duplicating 'data' in 2 (rte_eth_dev_burst)? We have enough space for it to fit into single cache line, currently it is: struct rte_eth_dev { eth_rx_burst_t rx_pkt_burst; /* 0 8 */ eth_tx_burst_t tx_pkt_burst; /* 8 8 */ eth_tx_prep_t tx_pkt_prepare; /* 16 8 */ eth_rx_queue_count_t rx_queue_count; /* 24 8 */ eth_rx_descriptor_done_t rx_descriptor_done; /* 32 8 */ eth_rx_descriptor_status_t rx_descriptor_status; /* 40 8 */ eth_tx_descriptor_status_t tx_descriptor_status; /* 48 8 */ struct rte_eth_dev_data * data; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ 'rx_descriptor_done' is deprecated and will be removed;