Date: Thu, 5 Jul 2018 15:45:34 +0100
From: "Liang, Ma"
To: Kevin Traynor
Cc: Radu Nicolau, dev@dpdk.org, david.hunt@intel.com
Subject: Re: [dpdk-dev] [PATCH v4 1/2] lib/librte_power: traffic pattern aware power control
Message-ID: <20180705144534.GB25741@sivswdev01.ir.intel.com>
In-Reply-To: <014ba7a7-c9a7-86da-e705-a688e53b83b3@redhat.com>
References: <1529505898-6458-2-git-send-email-liang.j.ma@intel.com>
 <1530013217-22300-1-git-send-email-radu.nicolau@intel.com>
 <014ba7a7-c9a7-86da-e705-a688e53b83b3@redhat.com>
List-Id: DPDK patches and discussions

On 27 Jun 18:33, Kevin Traynor wrote:
> On 06/26/2018 12:40 PM, Radu Nicolau wrote:
> > From: Liang Ma
> >
> > 1. Abstract
> >
> > For packet processing workloads such as DPDK, polling is continuous.
> > This means CPU cores always show 100% busy, independent of how much
> > work those cores are doing. It is critical to accurately determine
> > how busy a core is, for the following reasons:
> >
> > * No indication of overload conditions
> >
> > * Users do not know how much real load is on a system, resulting in
> >   wasted energy as no power management is utilized
> >
> > Tried and failed schemes include calculating the cycles required from
> > the load on the core, in other words the busyness. For example,
> > working out how many cycles it costs to handle each packet and
> > determining the frequency cost per core. Due to the varying nature of
> > traffic, types of frames and cost in cycles to process, this
> > mechanism becomes complex quickly, where a simple scheme is required
> > to solve the problems.
> >
> > 2. Proposed solution
> >
> > For all polling mechanisms, the proposed solution focuses on how many
> > empty polls are executed instead of calculating how many cycles it
> > costs to handle each packet. A lower empty poll count means the
> > current core is busy processing workload; therefore, a higher
> > frequency is needed. A high empty poll count indicates the current
> > core has a lot of spare time; therefore, we can lower the frequency.
> >
>
> Hi Liang/Radu,
>
> I can see the benefit of providing an API for the application to provide
> the num rx from each poll, and then have the library step down/up the
> freq based on that. However, not sure I follow why you are adding the
> complexity of defining power states and training modes.
>
> > 2.1 Power state definition:
> >
> > LOW: the frequency is used for purge mode.
> >
> > MED: the frequency is used to process modest traffic workload.
> >
> > HIGH: the frequency is used to process busy traffic workload.
> >
>
> Why does there need to be user defined freq levels? Why not just keep
> stepping down the freq until there is some user-defined threshold of
> zero polls reached? e.g. keep stepping down until 10% of polls are zero
> poll and have a tail of some time (perhaps user defined) for the step down.

Transferring from one P-state to another requires an MSR update, which is
expensive, and switching states too many times will disturb worker core
performance.

>
> > 2.2 There are two phases to establish the power management system:
> >
> > a. Initialization/Training phase. With no traffic passing through,
> >    the system measures the average empty poll count in the
> >    LOW/MED/HIGH power states. Those averages will be the baseline
> >    for the normal phase. The system will collect every core's
> >    counter every 100ms. The training phase will take 5 seconds.
> >
>
> This is requiring an application to sit for 5 secs in order to train and
> align poll numbers with states? That doesn't seem realistic to me.

Because each CPU SKU has a different configuration, micro-architecture,
cache size, number of power states, etc., it has to be tested in the
training phase to find the baseline. A simple app can block RX for the
first 5 secs.

>
> > b. Normal phase. When real traffic passes through, the system will
> >    compare the run-time empty poll moving average with the baseline,
> >    then decide whether to move to the HIGH power state or the MED
> >    power state. The system will collect every core's counter every
> >    10ms.
> >
>
> I only reviewed this commit msg and API usage, so maybe I didn't fully
> get the use case or details, but it seems quite awkward from an
> application perspective IMHO.
>
> > 3. Proposed API
> >
> > 1. rte_power_empty_poll_stat_init(void);
> > which is used to initialize the power management system.
> >
> > 2. rte_power_empty_poll_stat_free(void);
> > which is used to free the resources held by the power management
> > system.
> >
> > 3. rte_power_empty_poll_stat_update(unsigned int lcore_id);
> > which is used to update a specific core's empty poll counter, not
> > thread safe
> >
> > 4. rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> > which is used to update a specific core's valid poll counter, not
> > thread safe
> >
>
> I think 4 could be dropped and 3 used instead. It could be a simple API
> that takes in the core and nb_pkts from a poll. Seems clearer than
> making a separate API for a special value of nb_pkts (i.e. 0) and the
> application having to check to know which API should be called.

Agree.

>
> > 5. rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
> > which is used to get a specific core's empty poll counter.
> >
> > 6. rte_power_poll_stat_fetch(unsigned int lcore_id);
> > which is used to get a specific core's valid poll counter.
> >
> > 7. rte_power_empty_poll_set_freq(enum freq_val index, uint32_t limit);
> > which allows the user to customize the frequency of a power state.
> >
> > 8. rte_power_empty_poll_setup_timer(void);
> > which is used to set up the timer/callback to process all of the
> > above counters.
> >
>
> The new API should be experimental
>
> > ChangeLog:
> > v2: fix some coding style issues
> > v3: rename the filename, API name.
> > v4: updated makefile and symbol list
> >
> > Signed-off-by: Liang Ma
> > Signed-off-by: Radu Nicolau
> > ---
> >  lib/librte_power/Makefile               |   5 +-
> >  lib/librte_power/meson.build            |   5 +-
> >  lib/librte_power/rte_power_empty_poll.c | 521 ++++++++++++++++++++++++++++++++
> >  lib/librte_power/rte_power_empty_poll.h | 202 +++++++++++++
> >  lib/librte_power/rte_power_version.map  |  14 +-
> >  5 files changed, 742 insertions(+), 5 deletions(-)
> >  create mode 100644 lib/librte_power/rte_power_empty_poll.c
> >  create mode 100644 lib/librte_power/rte_power_empty_poll.h
> >
>
> Is there any in-tree documentation planned?
>
> Kevin.
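
For illustration only, here is a rough standalone sketch of two points
from this thread: a single update call that takes nb_pkts from each poll
(the merge of API 3 and 4 suggested by Kevin), and the normal-phase
comparison of an empty poll moving average against a trained baseline.
It is not the code from the patch. Every sketch_/SKETCH_ name, the
window size, the 30%/70% thresholds and the use of a single idle
baseline are assumptions made up for this example; the patch description
only says that a moving average is compared with the baseline.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_LCORES 4  /* hypothetical core count for the sketch */
#define SKETCH_WINDOW     8  /* hypothetical moving average window */
#define SKETCH_HIGH_PCT   30 /* assumed: below 30% of baseline => HIGH */
#define SKETCH_MED_PCT    70 /* assumed: below 70% of baseline => MED */

enum sketch_freq { SKETCH_LOW, SKETCH_MED, SKETCH_HIGH };

struct sketch_poll_stats {
	uint64_t empty_polls; /* polls that returned no packets */
	uint64_t valid_polls; /* polls that returned at least one packet */
};

static struct sketch_poll_stats sketch_stats[SKETCH_MAX_LCORES];

/*
 * Merged update call: the application reports nb_pkts after every poll
 * and the library decides which counter to bump. Not thread safe; each
 * lcore is expected to update only its own slot.
 */
static void
sketch_poll_stat_update(unsigned int lcore_id, uint16_t nb_pkts)
{
	assert(lcore_id < SKETCH_MAX_LCORES);
	if (nb_pkts == 0)
		sketch_stats[lcore_id].empty_polls++;
	else
		sketch_stats[lcore_id].valid_polls++;
}

/* Simple moving average over the last SKETCH_WINDOW interval samples. */
struct sketch_avg {
	uint64_t samples[SKETCH_WINDOW];
	uint64_t sum;
	unsigned int next;  /* slot that the next sample overwrites */
	unsigned int count; /* number of valid samples, up to the window */
};

static uint64_t
sketch_avg_push(struct sketch_avg *a, uint64_t sample)
{
	if (a->count == SKETCH_WINDOW)
		a->sum -= a->samples[a->next]; /* drop the oldest sample */
	else
		a->count++;
	a->samples[a->next] = sample;
	a->sum += sample;
	a->next = (a->next + 1) % SKETCH_WINDOW;
	return a->sum / a->count;
}

/*
 * Compare the run-time empty poll average with the baseline measured
 * while idle. Few empty polls relative to the baseline means the core
 * is busy, so a higher frequency is requested.
 */
static enum sketch_freq
sketch_pick_state(uint64_t empty_avg, uint64_t baseline)
{
	uint64_t pct = baseline ? (empty_avg * 100) / baseline : 100;

	if (pct < SKETCH_HIGH_PCT)
		return SKETCH_HIGH;
	if (pct < SKETCH_MED_PCT)
		return SKETCH_MED;
	return SKETCH_LOW;
}
```

In a worker loop, sketch_poll_stat_update() would be called once after
each RX burst, and a periodic timer callback would feed the per-interval
empty poll count into sketch_avg_push() and then sketch_pick_state().
Deciding only at timer ticks, not per poll, also limits how often a
P-state change (and its MSR write) can happen.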