From: Ferruh Yigit
To: Morten Brørup, "Ananyev, Konstantin", Thomas Monjalon, "Richardson, Bruce"
Cc: dev@dpdk.org, olivier.matz@6wind.com, andrew.rybchenko@oktetlabs.ru, honnappa.nagarahalli@arm.com, jerinj@marvell.com, gakhil@marvell.com
Date: Fri, 18 Jun 2021 11:21:30 +0100
References: <20210614105839.3379790-1-thomas@monjalon.net> <98CBD80474FA8B44BF855DF32C47DC35C6184E@smartserver.smartshare.dk> <2004320.XGyPsaEoyj@thomas> <98CBD80474FA8B44BF855DF32C47DC35C61868@smartserver.smartshare.dk> <98CBD80474FA8B44BF855DF32C47DC35C61869@smartserver.smartshare.dk>
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35C61869@smartserver.smartshare.dk>
Subject: Re: [dpdk-dev] [PATCH] parray: introduce internal API for dynamic arrays

On 6/17/2021 5:55 PM, Morten Brørup wrote:
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ferruh Yigit
>> Sent: Thursday, 17 June 2021 18.13
>>
>> On 6/17/2021 4:17 PM, Morten Brørup wrote:
>>>> From: Ananyev, Konstantin [mailto:konstantin.ananyev@intel.com]
>>>> Sent: Thursday, 17 June 2021 16.59
>>>>
>>>>>>>>
>>>>>>>> 14/06/2021 15:15, Bruce Richardson:
>>>>>>>>> On Mon, Jun 14, 2021 at 02:22:42PM +0200, Morten Brørup wrote:
>>>>>>>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
>>>>>>>>>>> Sent: Monday, 14 June 2021 12.59
>>>>>>>>>>>
>>>>>>>>>>> Performance of access in a fixed-size array is very good
>>>>>>>>>>> because of cache locality and because there is a single
>>>>>>>>>>> pointer to dereference. The only drawback is the lack of
>>>>>>>>>>> flexibility: the size of such an array cannot be increased
>>>>>>>>>>> at runtime.
>>>>>>>>>>>
>>>>>>>>>>> An approach to this problem is to allocate the array at
>>>>>>>>>>> runtime, remaining as efficient as a static array, but still
>>>>>>>>>>> limited to a maximum.
>>>>>>>>>>>
>>>>>>>>>>> That's why the API rte_parray is introduced, allowing an
>>>>>>>>>>> array of pointers to be declared which can be resized
>>>>>>>>>>> dynamically and automatically at runtime while keeping good
>>>>>>>>>>> read performance.
>>>>>>>>>>>
>>>>>>>>>>> After a resize, the previous array is kept until the next
>>>>>>>>>>> resize to avoid crashes during a read without any lock.
>>>>>>>>>>>
>>>>>>>>>>> Each element is a pointer to a dynamically allocated memory
>>>>>>>>>>> chunk.
>>>>>>>>>>> This is not good for cache locality, but it allows keeping
>>>>>>>>>>> the same memory per element no matter how the array is
>>>>>>>>>>> resized. Cache locality could be improved with mempools.
>>>>>>>>>>> The other drawback is having to dereference one more pointer
>>>>>>>>>>> to read an element.
>>>>>>>>>>>
>>>>>>>>>>> There are not many locks, so the API is for internal use
>>>>>>>>>>> only. This API may be used to completely remove some
>>>>>>>>>>> compile-time maximums.
>>>>>>>>>>
>>>>>>>>>> I get the purpose and overall intention of this library.
>>>>>>>>>>
>>>>>>>>>> I probably already mentioned that I prefer "embedded style
>>>>>>>>>> programming" with fixed-size arrays, rather than runtime
>>>>>>>>>> configurability. It's my personal opinion, and the DPDK Tech
>>>>>>>>>> Board clearly prefers reducing the amount of compile-time
>>>>>>>>>> configurability, so there is no way for me to stop this
>>>>>>>>>> progress, and I do not intend to oppose this library. :-)
>>>>>>>>>>
>>>>>>>>>> This library is likely to become a core library of DPDK, so I
>>>>>>>>>> think it is important to get it right. Could you please
>>>>>>>>>> mention a few examples of where you think this internal
>>>>>>>>>> library should be used, and where it should not be used? Then
>>>>>>>>>> it is easier to discuss whether the borderline between
>>>>>>>>>> control path and data plane is correct. E.g. this library is
>>>>>>>>>> not intended to be used for dynamically sized packet queues
>>>>>>>>>> that grow and shrink in the fast path.
>>>>>>>>>>
>>>>>>>>>> If the library becomes a core DPDK library, it should
>>>>>>>>>> probably be public instead of internal. E.g. if the library
>>>>>>>>>> is used to make RTE_MAX_ETHPORTS dynamic instead of
>>>>>>>>>> compile-time fixed, then some applications might also need
>>>>>>>>>> dynamically sized arrays for their application-specific
>>>>>>>>>> per-port runtime data, and this library could serve that
>>>>>>>>>> purpose too.
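The resize scheme described above (publish a new array, but keep the previous one alive so lock-free readers never dereference freed memory) can be sketched roughly as below. Names and fields are illustrative only, not the actual rte_parray API:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of the "keep the previous array until the next
 * resize" idea; not the real rte_parray implementation. */
struct parray {
	void **data;     /* current array, read without any lock */
	void **old_data; /* previous array, kept alive for in-flight readers */
	size_t size;
};

/* Grow the array. A single writer is assumed (internal API); a real
 * implementation would publish 'data' with a release store. */
static int parray_resize(struct parray *pa, size_t new_size)
{
	void **new_data = calloc(new_size, sizeof(*new_data));

	if (new_data == NULL)
		return -1;
	if (pa->data != NULL)
		memcpy(new_data, pa->data, pa->size * sizeof(*new_data));
	free(pa->old_data);      /* array from two resizes ago: no readers left */
	pa->old_data = pa->data; /* readers may still be walking this one */
	pa->data = new_data;     /* publish the new array */
	pa->size = new_size;
	return 0;
}
```

This shows why only one previous generation needs to be retained: by the time a second resize happens, readers of the generation before that are assumed to have finished.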
>>>>>>>>>>
>>>>>>>>> Thanks Thomas for starting this discussion and Morten for the
>>>>>>>>> follow-up.
>>>>>>>>>
>>>>>>>>> My thinking is as follows, and I'm particularly keeping in
>>>>>>>>> mind cases such as RTE_MAX_ETHPORTS, as a leading candidate
>>>>>>>>> here.
>>>>>>>>>
>>>>>>>>> While I dislike the hard-coded limits in DPDK, I'm also not
>>>>>>>>> convinced that we should switch away from the flat arrays, or
>>>>>>>>> that we need fully dynamic arrays that grow/shrink at runtime
>>>>>>>>> for ethdevs. I would suggest a half-way house here, where we
>>>>>>>>> keep the ethdevs as an array, but one allocated/sized at
>>>>>>>>> runtime rather than statically. This would allow us to have a
>>>>>>>>> compile-time default value but, for use cases that need it,
>>>>>>>>> allow use of a flag e.g. "max-ethdevs" to change the size of
>>>>>>>>> the parameter given to the malloc call for the array. This max
>>>>>>>>> limit could then be provided to apps too, if they want to
>>>>>>>>> match any array sizes. [Alternatively, those apps could check
>>>>>>>>> the provided size and error out if the size has been increased
>>>>>>>>> beyond what the app is designed to use?] There would be no
>>>>>>>>> extra dereferences per rx/tx burst call in this scenario, so
>>>>>>>>> performance should be the same as before (potentially better
>>>>>>>>> if the array is in hugepage memory, I suppose).
>>>>>>>>
>>>>>>>> I think we need some benchmarks to decide what is the best
>>>>>>>> tradeoff. I spent time on this implementation, but sorry, I
>>>>>>>> won't have time for benchmarks. Volunteers?
>>>>>>>
>>>>>>> I had only a quick look at your approach so far. But from what I
>>>>>>> can read, in an MT environment your suggestion will require
>>>>>>> extra synchronization for each read-write access to such a
>>>>>>> parray element (lock, RCU, ...). I think what Bruce suggests
>>>>>>> will be much lighter, easier to implement and less error prone.
>>>>>>> At least for rte_ethdevs[] and friends.
>>>>>>> Konstantin
>>>>>>
>>>>>> One more thought here - if we are talking about rte_ethdev[] in
>>>>>> particular, I think we can:
>>>>>> 1. Move the public function pointers (rx_pkt_burst(), etc.) from
>>>>>> rte_ethdev into a separate flat array. We can keep it public to
>>>>>> still use inline functions for 'fast' calls like
>>>>>> rte_eth_rx_burst(), etc., to avoid any regressions. That could
>>>>>> still be a flat array with max_size specified at application
>>>>>> startup.
>>>>>> 2. Hide the rest of the rte_ethdev struct in the .c file. That
>>>>>> will allow us to change the struct itself and the whole
>>>>>> rte_ethdev[] table in any way we like (flat array, vector, hash,
>>>>>> linked list) without ABI/API breakages.
>>>>>>
>>>>>> Yes, it would require all PMDs to change the prototype of the
>>>>>> pkt_rx_burst() function (to accept port_id and queue_id instead
>>>>>> of a queue pointer), but the change is a mechanical one. Probably
>>>>>> some macro can be provided to simplify it.
>>>>>
>>>>> We are already planning some tasks for ABI stability for v21.11;
>>>>> I think splitting 'struct rte_eth_dev' can be part of that task,
>>>>> as it enables hiding more internal data.
>>>>
>>>> Ok, sounds good.
>>>>
>>>>>> The only significant complication I can foresee with implementing
>>>>>> that approach: we'll need an array of 'fast' function pointers
>>>>>> per queue, not per device as we have now (to avoid extra
>>>>>> indirection for the callback implementation). Though as a bonus
>>>>>> we'll have the ability to use different RX/TX functions per
>>>>>> queue.
>>>>>
>>>>> What do you think about splitting the Rx/Tx callbacks into their
>>>>> own struct too?
>>>>>
>>>>> Overall, 'rte_eth_dev' can be split into three as:
>>>>> 1. rte_eth_dev
>>>>> 2. rte_eth_dev_burst
>>>>> 3. rte_eth_dev_cb
>>>>>
>>>>> And we can hide 1 from applications, even with the inline
>>>>> functions.
>>>>
>>>> As discussed off-line, I think it is possible.
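The split Konstantin proposes (points 1 and 2 above) could be sketched as follows. All names and types here are simplified stand-ins, not the real ethdev API, and the fixed array size stands in for a startup-time max_size:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the proposed split: only the fast-path function pointers
 * stay public, indexed by port_id; everything else about the device
 * would live in a struct hidden in a .c file. */
typedef uint16_t (*rx_burst_t)(uint16_t port_id, uint16_t queue_id,
			       void **pkts, uint16_t nb_pkts);

/* 1. Public flat array: just what the inline burst wrappers need.
 * In the real proposal its size would be set at application startup. */
struct eth_burst_api {
	rx_burst_t rx_pkt_burst;
};
struct eth_burst_api eth_burst_apis[32];

/* The 'fast' call stays inline with no extra indirection vs today;
 * the PMD receives port_id/queue_id instead of a queue pointer. */
static inline uint16_t eth_rx_burst(uint16_t port_id, uint16_t queue_id,
				    void **pkts, uint16_t nb_pkts)
{
	return eth_burst_apis[port_id].rx_pkt_burst(port_id, queue_id,
						    pkts, nb_pkts);
}

/* A toy PMD rx function with the new prototype, pretending at most
 * 4 packets are available. */
static uint16_t dummy_rx(uint16_t port_id, uint16_t queue_id,
			 void **pkts, uint16_t nb_pkts)
{
	(void)port_id; (void)queue_id; (void)pkts;
	return nb_pkts < 4 ? nb_pkts : 4;
}
```

Because applications only ever see `eth_burst_apis[]`, the hidden device struct behind it can later become a vector, hash, or list without touching the ABI.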
>>>> My absolute preference would be to have just 1/2 (with CB hidden).
>>>> But even with 1/2/3 in place, I think it would be a good step
>>>> forward. Probably worth starting with 1/2/3 first and then seeing
>>>> how difficult it would be to switch to 1/2.
>>>> Do you plan to start working on it?
>>>>
>>>> Konstantin
>>>
>>> If you do proceed with this, be very careful. E.g. the inlined
>>> rx/tx burst functions should not touch more cache lines than they
>>> do today, especially if there are many active ports. The inlined
>>> rx/tx burst functions are very simple, so thorough code review (and
>>> possibly also of the resulting assembly) is appropriate. Simple
>>> performance testing might not detect if more cache lines are
>>> accessed than before the modifications.
>>>
>>> Don't get me wrong... I do consider this an improvement of the
>>> ethdev library; I'm only asking you to take extra care!
>>
>> ack
>>
>> If we split as above, I think the device-specific data 'struct
>> rte_eth_dev_data' should be part of 1 (rte_eth_dev), which means the
>> Rx/Tx inline functions would access an additional cache line.
>>
>> To prevent this, what about duplicating 'data' in 2
>> (rte_eth_dev_burst)? We have enough space for it to fit into a
>> single cache line; currently it is:
>>
>> struct rte_eth_dev {
>>     eth_rx_burst_t             rx_pkt_burst;         /*  0  8 */
>>     eth_tx_burst_t             tx_pkt_burst;         /*  8  8 */
>>     eth_tx_prep_t              tx_pkt_prepare;       /* 16  8 */
>>     eth_rx_queue_count_t       rx_queue_count;       /* 24  8 */
>>     eth_rx_descriptor_done_t   rx_descriptor_done;   /* 32  8 */
>>     eth_rx_descriptor_status_t rx_descriptor_status; /* 40  8 */
>>     eth_tx_descriptor_status_t tx_descriptor_status; /* 48  8 */
>>     struct rte_eth_dev_data   *data;                 /* 56  8 */
>>     /* --- cacheline 1 boundary (64 bytes) --- */
>>
>> 'rx_descriptor_done' is deprecated and will be removed;

> Makes sense.
>
> Also consider moving 'data' to the top of the new struct, so there is
> room to add future functions below.
> (Without growing to more than one cache line in size, one new
> function can be added once 'rx_descriptor_done' has been removed.)

+1
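Putting the suggestions together, the fast-path struct being discussed could look roughly like this. The struct name and function-pointer typedefs are stand-ins for the real eth_*_t types, and the size check assumes a 64-bit ABI with 64-byte cache lines:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-cache-line fast-path struct per the discussion:
 * 'data' duplicated at the top, 'rx_descriptor_done' dropped, and one
 * 8-byte slot left spare for a future function pointer. Assumes a
 * 64-bit ABI (8-byte pointers) and 64-byte cache lines. */
typedef uint16_t (*burst_fn_t)(void *queue, void **pkts, uint16_t nb_pkts);
typedef int (*status_fn_t)(void *queue, uint16_t offset);

struct eth_dev_burst {
	void *data;                       /* duplicated device data, offset  0 */
	burst_fn_t rx_pkt_burst;          /* offset  8 */
	burst_fn_t tx_pkt_burst;          /* offset 16 */
	burst_fn_t tx_pkt_prepare;        /* offset 24 */
	status_fn_t rx_queue_count;       /* offset 32 */
	status_fn_t rx_descriptor_status; /* offset 40 */
	status_fn_t tx_descriptor_status; /* offset 48 */
	void *reserved;                   /* offset 56: room for one future fn */
};

/* Guard the cache-line budget at compile time. */
_Static_assert(sizeof(struct eth_dev_burst) == 64,
	       "fast-path struct must stay within one cache line");
```

A compile-time assertion like this would catch exactly the failure mode Morten warns about above: a later addition silently pushing the inline rx/tx path onto a second cache line.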