Subject: Re: Optimizations are not features
Date: Sun, 3 Jul 2022 20:38:21 +0100
From: Konstantin Ananyev
To: Honnappa Nagarahalli, Andrew Rybchenko, Morten Brørup, Jerin Jacob
Cc: dpdk-dev, techboard@dpdk.org, nd

29/06/2022 21:44, Honnappa Nagarahalli wrote:
>
>>
>> 04/06/2022 13:51, Andrew Rybchenko wrote:
>>> On 6/4/22 15:19, Morten Brørup wrote:
>>>>> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
>>>>> Sent: Saturday, 4 June 2022 13.10
>>>>>
>>>>> On Sat, Jun 4, 2022 at 3:30 PM Andrew Rybchenko wrote:
>>>>>>
>>>>>> On 6/4/22 12:33, Jerin Jacob wrote:
>>>>>>> On Sat, Jun 4, 2022 at 2:39 PM Morten Brørup wrote:
>>>>>>>>
>>>>>>>> I would like the DPDK community to change its view on compile
>>>>>>>> time options. Here is why:
>>>>>>>>
>>>>>>>> Application specific performance micro-optimizations like
>>>>>>>> “fast mbuf free” and “mbuf direct re-arm” are being added to
>>>>>>>> DPDK and presented as features.
>>>>>>>>
>>>>>>>> They are not features, but optimizations, and I don’t
>>>>>>>> understand the need for them to be available at run-time!
>>>>>>>>
>>>>>>>> Instead of adding a bunch of exotic exceptions to the fast path
>>>>>>>> of the PMDs, they should be compile time options. This will
>>>>>>>> improve performance by avoiding branches in the fast path, both
>>>>>>>> for the applications using them, and for generic applications
>>>>>>>> (where the exotic code is omitted).
>>>>>>>
>>>>>>> Agree. I think keeping the best of both worlds would be:
>>>>>>> - Enable the feature/optimization at runtime.
>>>>>>> - Have a compile-time option to disable the feature/optimization
>>>>>>>   as an override.
>>>>>>
>>>>>> It is hard to find the right balance, but in general compile time
>>>>>> options are a nightmare for maintenance. The number of required
>>>>>> builds will grow exponentially.
>>>>
>>>> Test combinations are exponential for N features, regardless of
>>>> whether the N features are runtime or compile time options.
>>>
>>> But since I'm talking about build checks, I don't care about
>>> exponential growth in run time. Yes, testing should care, but it is
>>> a separate story.
>>>
>>>>>> Of course, we can limit the number of checked combinations, but
>>>>>> it will result in a flow of patches to fix the build in other
>>>>>> cases.
>>>>>
>>>>> The build breakage can be fixed if we use (2) vs (1):
>>>>>
>>>>> 1)
>>>>> #ifdef ...
>>>>> My feature
>>>>> #endif
>>>>>
>>>>> 2)
>>>>> static __rte_always_inline int
>>>>> rte_has_xyz_feature(void)
>>>>> {
>>>>> #ifdef RTE_LIBRTE_XYZ_FEATURE
>>>>>         return RTE_LIBRTE_XYZ_FEATURE;
>>>>> #else
>>>>>         return 0;
>>>>> #endif
>>>>> }
>>>>>
>>>>> if (rte_has_xyz_feature()) {
>>>>>         My feature code
>>>>> }
>>>>>
>>>
>>> Jerin, thanks, very good example.
>>>
>>>> I'm not sure all the features can be covered by that, e.g. added
>>>> fields in structures.
>>>
>>> +1
>>>
>>>> Also, I would consider such features "opt in" at compile time only.
>>>> As such, they could be allowed to break the ABI/API.
>>>>
>>>>>> Also, compile time options tend to make code less readable, which
>>>>>> makes all aspects of development harder.
>>>>>>
>>>>>> Yes, compile time is nice for micro optimizations, but I have
>>>>>> great concerns about whether it is the right way to go.
>>>>>>
>>>>>>>> Please note that I am only talking about the performance
>>>>>>>> optimizations that are limited to application specific use
>>>>>>>> cases. I think it makes sense to require that performance
>>>>>>>> optimizing an application also requires recompiling the
>>>>>>>> performance critical libraries used by it.
>>>>>>>>
>>>>>>>> abandon some of existing functionality to create a 'short-cut'
>>>>>>>>
>>>>>>>> Allowing compile time options for application specific
>>>>>>>> performance optimizations in DPDK would also open a path for
>>>>>>>> other optimizations, which can only be achieved at compile
>>>>>>>> time, such as “no fragmented packets”, “no attached mbufs” and
>>>>>>>> “single mbuf pool”. And even more exotic optimizations, such as
>>>>>>>> the “indexed mempool cache”, which was rejected due to ABI
>>>>>>>> violations – they could be marked as “risky and untested” or
>>>>>>>> similar, but still be part of the DPDK main repository.
>>>>>>>>
>>
>> Thanks Morten for bringing it up, it is an interesting topic.
>> Though I look at it from a different angle.
>> All the optimizations you mentioned above introduce new limitations:
>> MBUF_FAST_FREE - no indirect mbufs and no multiple mempools; mempool
>> object indexes - mempool size is limited to 4GB; direct rearm - drops
>> the ability to stop/reconfigure a TX queue while the RX queue is
>> still running; etc.
>> Note that all these limitations are not forced by HW.
>> All of them are pure SW limitations that developers forced in (or
>> tried to) to get a little extra performance.
>> That's a concerning tendency.
>>
>> As more and more such 'optimizations via limitation' come in:
>> - The DPDK feature list will become more and more fragmented.
>> - It would cause more and more confusion for users.
>> - Unmet expectations - the difference in performance between the
>>   'default' and 'optimized' versions of DPDK will become bigger and
>>   bigger.
>> - As Andrew already mentioned, maintaining all these 'sub-flavours'
>>   of DPDK will become more and more difficult.
>
> The point that we need to remember is, these features/optimizations
> are introduced after seeing performance issues in practical use cases.

Sorry, I didn't get it: which performance issues are you talking about?
If, let's say, our mempool code is sub-optimal in some place for some
architecture due to bad design or bad implementation - please point to
it and let's try to fix it, instead of avoiding the mempool API.
If you are just saying that avoiding the mempool in some cases could
buy us a little extra performance (a short-cut), then yes, it surely
could. Another question - is it really worth it?
Having all mbuf management covered by one SW abstraction helps a lot in
terms of project maintainability, further extensions, introducing new
common optimizations, etc.

> DPDK is not being used in just one use case, it is being used in
> several use cases which have their own unique requirements. Is 4GB
> enough for packet buffers - yes, it is enough in certain use cases.
> Are there NICs with a single port - yes, there are.

Sure, there are NICs with one port. But there are also NICs with 2
ports, 4 ports, etc.
Should we maintain specific DPDK sub-versions for all these cases?
From my perspective - no.
It would be an overwhelming effort for the DPDK community, plus many
customers use DPDK to build their own products that are supposed to
work seamlessly across multiple use cases/platforms.

> HW is being created because use cases and business cases exist. It is
> obvious that as DPDK gets adopted on more platforms that differ
> largely, the features will increase and it will become complex.
> Complexity should not be used as a criterion to reject patches.

Well, we do have plenty of HW-specific optimizations inside DPDK, and
we put a lot of effort into making all this HW-specific stuff as
transparent to the user as possible.
I don't see why it should be different for SW-specific optimizations.

> There is a different perspective to what you are calling
> 'limitations'.

By 'limitations' I mean a situation when the user has to cut off
existing functionality to enable these 'optimizations'.

> I can argue that multiple mempools and stop/reconfigure of a TX queue
> while the RX queue is still running are exotic. Just because those
> are allowed currently (probably accidentally) does not mean they are
> being used. Are there use cases that make use of these features?

If DPDK examples/l3fwd doesn't use these features, it doesn't mean they
are useless :)
I believe both multiple mempools (indirect mbufs) and the ability to
start/stop queues separately are major DPDK features that are used
across many real-world deployments.
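Just to illustrate what that functionality means in practice, here is a
rough sketch (not taken from any existing application; names are only
for illustration) of how an application can reconfigure one TX queue
through the public ethdev API while RX on the same port keeps running.
It assumes the PMD supports per-queue start/stop and runtime queue
setup:

#include <rte_ethdev.h>

/* Resize one TX queue descriptor ring while RX queues of the same port
 * keep receiving traffic. Assumes the PMD supports per-queue start/stop
 * and runtime TX queue setup.
 */
static int
resize_tx_queue(uint16_t port_id, uint16_t txq_id, uint16_t new_nb_desc,
                const struct rte_eth_txconf *txconf)
{
        int ret;

        ret = rte_eth_dev_tx_queue_stop(port_id, txq_id);
        if (ret != 0)
                return ret;

        ret = rte_eth_tx_queue_setup(port_id, txq_id, new_nb_desc,
                        rte_eth_dev_socket_id(port_id), txconf);
        if (ret != 0)
                return ret;

        return rte_eth_dev_tx_queue_start(port_id, txq_id);
}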
>
> The base/existing design for DPDK was done with one particular HW
> architecture in mind, where there was an abundance of resources.
> Unfortunately, that HW architecture is fast evolving and DPDK is
> adopted in use cases where that kind of resources is not available.
> For example, efficiency cores are being introduced by every CPU
> vendor now. Soon enough, we will see big-little architecture in
> networking as well. The existing PMD design introduces 512B of stores
> (256B for copying to a stack variable and 256B to store the lcore
> cache) and a 256B load/store on the RX side every 32 packets, back to
> back. It doesn't make sense to have that kind of memcopy for
> little/efficiency cores just for the driver code.

I don't object to specific use-case optimizations, especially if the
use case is a common one.
But I think such changes have to be as transparent to the user as
possible and shouldn't cause further DPDK code fragmentation (new
CONFIG options, etc.).
I understand that it is not always possible, but for pure SW-based
optimizations I think it is a reasonable expectation.

>>
>> So, probably instead of making such changes easier, we need somehow
>> to persuade developers to think more about optimizations that would
>> be generic and transparent to the user.
>
> Or maybe we need to think of creating alternate ways of programming.
>
>> I do realize that it is not always possible due to various reasons
>> (HW limitations, external dependencies, etc.) but that's another
>> story.
>>
>> Let's take for example MBUF_FAST_FREE.
>> In fact, I am not sure that we need it as a TX offload flag at all.
>> The PMD TX path has all the necessary information to decide at run
>> time whether it can do fast_free() or not:
>> at tx_burst() the PMD can check whether all mbufs satisfy these
>> conditions (same mempool, refcnt == 1) and update some fields and/or
>> counters inside the TXQ to reflect it.
>> Then, at tx_free() we can use this info to decide between fast_free()
>> and normal_free().
>> As we read the mbuf fields at tx_burst() anyway, the impact of this
>> extra step would, I guess, be minimal.
>> Yes, most likely it wouldn't be as fast as with the current TX
>> offload flag or the conditional compilation approach.
>> But it might still be significantly faster than normal_free(), plus
>> such an approach would be generic and transparent to the user.
>
> IMO, this depends on the philosophy that we want to adopt. I would
> prefer to make the control plane complex for performance gains on the
> data plane. The performance on the data plane has a multiplying
> effect due to the ratio of the number of cores assigned to the data
> plane vs the control plane.
>
> I am not against evaluating alternatives, but the alternative
> approaches need to have similar (not the same) performance.
>
>>
>> Konstantin
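To make the MBUF_FAST_FREE idea quoted above a bit more concrete, below
is a rough sketch of how such run-time detection could look inside a
PMD. The structure and helper names are made up for illustration (this
is not code from any existing driver), and details such as multi-segment
mbufs and mbuf field resets are omitted:

#include <stdbool.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Hypothetical per-TX-queue state: tracks whether every pending mbuf
 * is direct, has refcnt == 1 and comes from one common mempool.
 */
struct txq_fast_free_state {
        struct rte_mempool *mp; /* common mempool of pending mbufs, or NULL */
        bool fast_free_ok;      /* all pending mbufs eligible for fast free */
};

/* Called from the tx_burst() path for each mbuf being queued; these
 * mbuf fields are read there anyway, so the extra cost should be small.
 */
static inline void
txq_track_mbuf(struct txq_fast_free_state *st, struct rte_mbuf *m)
{
        if (!st->fast_free_ok)
                return;
        if (!RTE_MBUF_DIRECT(m) || rte_mbuf_refcnt_read(m) != 1 ||
                        (st->mp != NULL && st->mp != m->pool)) {
                st->fast_free_ok = false;
                return;
        }
        st->mp = m->pool;
}

/* Called when the TXQ releases completed mbufs: pick between the fast
 * bulk return to a single mempool and the normal per-mbuf free.
 */
static inline void
txq_free_done_mbufs(struct txq_fast_free_state *st,
                struct rte_mbuf **mbufs, unsigned int n)
{
        unsigned int i;

        if (st->fast_free_ok && st->mp != NULL) {
                rte_mempool_put_bulk(st->mp, (void **)mbufs, n);
        } else {
                for (i = 0; i != n; i++)
                        rte_pktmbuf_free(mbufs[i]);
        }

        /* reset tracking for the next batch */
        st->mp = NULL;
        st->fast_free_ok = true;
}

Obviously, whether this gets close enough to the offload-flag approach
in performance is something that would need to be measured.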