From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Nicolau, Radu"
To: "Ananyev, Konstantin", "Van Haaren, Harry", Jerin Jacob
Cc: Honnappa Nagarahalli, "Richardson, Bruce", dev@dpdk.org,
 jerinj@marvell.com, nd
Date: Wed, 7 Oct 2020 11:44:39 +0100
Subject: Re: [dpdk-dev] [PATCH v1] event/sw: performance improvements
References: <20200908105211.10066-1-radu.nicolau@intel.com>
 <46118f3466274596a663d7d44abb680a@intel.com>
 <20200925102805.GD923@bricha3-MOBL.ger.corp.intel.com>

On 10/6/2020 11:13 AM, Ananyev, Konstantin wrote:
>>> -----Original Message-----
>>> From: Jerin Jacob
>>> Sent: Monday, October 5, 2020 5:35 PM
>>> To: Nicolau, Radu
>>> Cc: Honnappa Nagarahalli; Richardson, Bruce; Ananyev, Konstantin;
>>> Van Haaren, Harry; dev@dpdk.org; jerinj@marvell.com; nd
>>> Subject: Re: [dpdk-dev] [PATCH v1] event/sw: performance improvements
>>>
>>> On Tue, Sep 29, 2020 at 2:32 PM Nicolau, Radu wrote:
>>>>
>>>> On 9/28/2020 5:02 PM, Honnappa Nagarahalli wrote:
>>>>>
>>>>>>> Add minimum burst throughout the scheduler pipeline and a flush
>>>>>>> counter. Replace ring API calls with a local single-threaded
>>>>>>> implementation where possible.
>>>>>>>
>>>>>>> Signed-off-by: Radu Nicolau <radu.nicolau@intel.com>
>>>>>>>
>>>>>>> Thanks for the patch, a few comments inline.
>>>>>>>
>>>>>>> Why not make these APIs part of the rte_ring library? You could
>>>>>>> further optimize them by keeping the indices on the same cacheline.
>>>>>>> I'm not sure there is any need for non-thread-safe rings outside
>>>>>>> this particular case.
>>>>>>> [Honnappa] I think if we add the APIs, we will find the use cases.
>>>>>>> But, more than that, I understand that the rte_ring structure is
>>>>>>> exposed to the application.
>>>>>>> The reason for doing that is the inline functions that rte_ring
>>>>>>> provides. IMO, we should still maintain modularity and should not
>>>>>>> use the internals of the rte_ring structure outside of the library.
>>>>>>> +1 to that.
>>>>>>>
>>>>>>> BTW, is there any real perf benefit from such micro-optimisation?
>>>>>> I'd tend to view these as use-case specific, and I'm not sure we
>>>>>> should clutter up the ring library with yet more functions,
>>>>>> especially since they can't be mixed with the existing
>>>>>> enqueue/dequeue functions, as they don't use the head pointers.
>>>>> IMO, the ring library is pretty well organized with the recent
>>>>> addition of the HTS/RTS modes. This can be one of the modes and
>>>>> should allow us to use the existing functions (though additional
>>>>> functions are required as well).
>>>>> The other concern I have is that this implementation can be further
>>>>> optimized by using a single cache line for the pointers. It uses 2
>>>>> cache lines just because of the layout of the rte_ring structure.
>>>>> There was a question earlier about the performance improvements of
>>>>> this patch. Are there any % performance improvements that can be
>>>>> shared?
>>>>> It is also possible to change the above functions to use the
>>>>> head/tail pointers from the producer or the consumer cache line
>>>>> alone to check for perf differences.
>>>> I don't have a % for the final improvement for this change alone,
>>>> but there was some improvement in the memory overhead measurable
>>>> during development, which very likely resulted in the whole
>>>> optimization having more headroom.
>>>>
>>>> I agree that this may be further optimized, maybe by having a local
>>>> implementation of a ring-like container instead.
>>> Have we decided on the next steps for this patch? Is the plan to
>>> supersede this patch and have a different one in the rte_ring
>>> subsystem?
>> My preference is to merge this version of the patch:
>> 1) The ring helper functions are stripped down to the SW PMD usage,
>> and are not valid for general use.
>> 2) Adding static inline APIs in an LTS release without extensive
>> testing doesn't seem a good idea.
>>
>> If Honnappa is OK with the above solution for 20.11, we can see about
>> moving the rings part of the code to the rte_ring library location in
>> 21.02, and give ourselves some time to settle the usage/API before
>> the next LTS.
>>
> As ring library maintainer I share Honnappa's concern that another
> library does not use the public ring API but instead accesses ring
> internals directly. Obviously such a coding practice is not welcome,
> as it makes it harder to maintain/extend the ring library in the
> future.
> About 2) - these new APIs can (/should) be marked as experimental
> anyway.
> As another thing - it is still unclear what performance gain we are
> talking about here. Is it really worth it compared to just using
> SP/SC?

The change itself came after I analyzed the memory-bound sections of
the code. I just did a quick test and got about a 3.5% improvement in
throughput - maybe not much, but significant for such a small change,
and depending on the use case it may be more.

As for the implementation itself, I would favour having a custom
ring-like container in the PMD code; this will solve the issue of
using rte_ring internals while still allowing for full optimisation.
If this is acceptable, I will follow up by tomorrow.
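To make the idea concrete, below is a minimal sketch of what such a
PMD-local container could look like. This is illustrative only, not
the actual patch code: the names (iq_ring and friends) and the size
are made up, and it assumes a single-threaded caller - which is what
lets us drop the atomics/barriers and keep both indices on one cache
line, unlike rte_ring where prod and cons sit on separate cache lines
by design.

/* Sketch of a single-threaded ring-like container local to the PMD.
 * No atomics or memory barriers: only one thread ever touches it.
 * Both indices share a cache line with the start of the data array,
 * avoiding the two-cache-line index layout of struct rte_ring.
 */
#include <stdint.h>

#define IQ_RING_SZ 512			/* power of two, illustrative */
#define IQ_RING_MASK (IQ_RING_SZ - 1)

struct iq_ring {
	uint16_t write_idx;	/* next slot to write; wraps naturally */
	uint16_t read_idx;	/* next slot to read; wraps naturally */
	void *ring[IQ_RING_SZ];
};

static inline uint16_t
iq_ring_count(const struct iq_ring *r)
{
	/* valid across uint16_t wraparound for power-of-two sizes */
	return r->write_idx - r->read_idx;
}

static inline uint16_t
iq_ring_enqueue_burst(struct iq_ring *r, void **objs, uint16_t n)
{
	uint16_t space = IQ_RING_SZ - iq_ring_count(r);
	uint16_t i;

	if (n > space)
		n = space;
	for (i = 0; i < n; i++)
		r->ring[(r->write_idx + i) & IQ_RING_MASK] = objs[i];
	r->write_idx += n;
	return n;
}

static inline uint16_t
iq_ring_dequeue_burst(struct iq_ring *r, void **objs, uint16_t n)
{
	uint16_t avail = iq_ring_count(r);
	uint16_t i;

	if (n > avail)
		n = avail;
	for (i = 0; i < n; i++)
		objs[i] = r->ring[(r->read_idx + i) & IQ_RING_MASK];
	r->read_idx += n;
	return n;
}

Since in this sketch the scheduler core would be the only reader and
writer of these internal queues, plain loads and stores on a single
shared cache line are safe, and the compiler is free to keep both
indices in registers across a burst.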