From: Kevin Traynor
Organization: Red Hat
To: Radu Nicolau, dev@dpdk.org, liang.j.ma@intel.com
Cc: david.hunt@intel.com
Subject: Re: [dpdk-dev] [PATCH v4 1/2] lib/librte_power: traffic pattern aware power control
Date: Wed, 27 Jun 2018 18:33:04 +0100
Message-ID: <014ba7a7-c9a7-86da-e705-a688e53b83b3@redhat.com>
In-Reply-To: <1530013217-22300-1-git-send-email-radu.nicolau@intel.com>
References: <1529505898-6458-2-git-send-email-liang.j.ma@intel.com> <1530013217-22300-1-git-send-email-radu.nicolau@intel.com>
List-Id: DPDK patches and discussions

On 06/26/2018 12:40 PM, Radu Nicolau wrote:
> From: Liang Ma
>
> 1. Abstract
>
> For packet processing workloads such as DPDK polling is continuous.
> This means CPU cores always show 100% busy independent of how much work
> those cores are doing. It is critical to accurately determine how busy
> a core is hugely important for the following reasons:
>
> * No indication of overload conditions
>
> * User do not know how much real load is on a system meaning resulted in
> wasted energy as no power management is utilized
>
> Tried and failed schemes include calculating the cycles required from
> the load on the core, in other words the busyness. For example,
> how many cycles it costs to handle each packet and determining the
> frequency cost per core. Due to the varying nature of traffic, types of
> frames and cost in cycles to process, this mechanism becomes complex
> quickly where a simple scheme is required to solve the problems.
>
> 2. Proposed solution
>
> For all polling mechanism, the proposed solution focus on how many times
> empty poll executed instead of calculating how many cycles it cost to
> handle each packet. The less empty poll number means current core is busy
> with processing workload, therefore, the higher frequency is needed. The
> high empty poll number indicate current core has lots spare time,
> therefore, we can lower the frequency.
>

Hi Liang/Radu,

I can see the benefit of providing an API for the application to provide
the num rx from each poll, and then have the library step down/up the
freq based on that. However, not sure I follow why you are adding the
complexity of defining power states and training modes.

> 2.1 Power state definition:
>
> LOW: the frequency is used for purge mode.
>
> MED: the frequency is used to process modest traffic workload.
>
> HIGH: the frequency is used to process busy traffic workload.
>

Why does there need to be user defined freq levels?
Why not just keep stepping down the freq until some user-defined
threshold of zero polls is reached? e.g. keep stepping down until 10% of
polls are zero polls, and have a tail of some time (perhaps user
defined) for the step down.

> 2.2 There are two phases to establish the power management system:
>
> a.Initialization/Training phase. There is no traffic pass-through,
> the system will test average empty poll numbers with
> LOW/MED/HIGH power state. Those average empty poll numbers
> will be the baseline
> for the normal phase. The system will collect all core's counter
> every 100ms. The Training phase will take 5 seconds.
>

This is requiring an application to sit for 5 secs in order to train and
align poll numbers with states? That doesn't seem realistic to me.

> b.Normal phase. When the real traffic pass-though, the system will
> compare run-time empty poll moving average value with base line
> then make decision to move to HIGH power state of MED power
> state. The system will collect all core's counter every 10ms.
>

I only reviewed this commit msg and API usage, so maybe I didn't fully
get the use case or details, but it seems quite awkward from an
application perspective IMHO.

> 3. Proposed API
>
> 1. rte_power_empty_poll_stat_init(void);
> which is used to initialize the power management system.
>
> 2. rte_power_empty_poll_stat_free(void);
> which is used to free the resource hold by power management system.
>
> 3. rte_power_empty_poll_stat_update(unsigned int lcore_id);
> which is used to update specific core empty poll counter, not thread safe
>
> 4. rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> which is used to update specific core valid poll counter, not thread safe
>

I think 4 could be dropped and 3 used instead. It could be a simple API
that takes in the core and nb_pkts from a poll. Seems clearer than
making a separate API for a special value of nb_pkts (i.e.
0) and the application having to check to know which API should be
called.

> 5. rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
> which is used to get specific core empty poll counter.
>
> 6. rte_power_poll_stat_fetch(unsigned int lcore_id);
> which is used to get specific core valid poll counter.
>
> 7. rte_power_empty_poll_set_freq(enum freq_val index, uint32_t limit);
> which allow user customize the frequency of power state.
>
> 8. rte_power_empty_poll_setup_timer(void);
> which is used to setup the timer/callback to process all above counter.
>

The new API should be marked experimental.

> ChangeLog:
> v2: fix some coding style issues
> v3: rename the filename, API name.
> v4: updated makefile and symbol list
>
> Signed-off-by: Liang Ma
> Signed-off-by: Radu Nicolau
> ---
> lib/librte_power/Makefile | 5 +-
> lib/librte_power/meson.build | 5 +-
> lib/librte_power/rte_power_empty_poll.c | 521 ++++++++++++++++++++++++++++++++
> lib/librte_power/rte_power_empty_poll.h | 202 +++++++++++++
> lib/librte_power/rte_power_version.map | 14 +-
> 5 files changed, 742 insertions(+), 5 deletions(-)
> create mode 100644 lib/librte_power/rte_power_empty_poll.c
> create mode 100644 lib/librte_power/rte_power_empty_poll.h
>

Is there any in-tree documentation planned?

Kevin.
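
To make the two suggestions above concrete (a single update call that
takes the core and nb_pkts, and stepping the freq against a zero-poll
threshold with a tail), here is a minimal sketch in C. It is only an
illustration under stated assumptions, not the patch's implementation:
every sketch_* name and constant is hypothetical and not part of
librte_power, the step-up rule is an assumption (only stepping down is
described above), and the frequency change is merely recorded in a
field rather than requested from the hardware.

```c
#include <assert.h>
#include <stdint.h>

/* All names and values below are hypothetical, chosen for illustration. */
#define SKETCH_MAX_LCORES   128   /* assumed core limit */
#define SKETCH_MAX_STEP     8     /* assumed highest frequency step */
#define SKETCH_WINDOW_POLLS 1000  /* polls per evaluation window */
#define SKETCH_EMPTY_PCT    10    /* user-defined zero-poll threshold (%) */
#define SKETCH_TAIL_WINDOWS 3     /* idle windows required before a step down */

struct sketch_poll_state {
	uint64_t polls;            /* polls seen in the current window */
	uint64_t empty_polls;      /* polls in the window that returned 0 pkts */
	unsigned int idle_windows; /* consecutive windows above the threshold */
	int freq_step;             /* 0 = lowest frequency step */
};

static struct sketch_poll_state sketch_state[SKETCH_MAX_LCORES];

/* Reset the counters of one core and set its starting frequency step. */
static void
sketch_poll_init(unsigned int lcore_id, int start_step)
{
	assert(lcore_id < SKETCH_MAX_LCORES);
	assert(start_step >= 0 && start_step <= SKETCH_MAX_STEP);
	sketch_state[lcore_id] =
		(struct sketch_poll_state){ .freq_step = start_step };
}

/*
 * Single update call: the application passes nb_pkts from every poll and
 * never has to special-case nb_pkts == 0. Not thread safe; one caller per
 * core is assumed. Returns the frequency step after this poll.
 */
static int
sketch_poll_update(unsigned int lcore_id, uint16_t nb_pkts)
{
	struct sketch_poll_state *s;
	uint64_t empty_pct;

	assert(lcore_id < SKETCH_MAX_LCORES);
	s = &sketch_state[lcore_id];

	s->polls++;
	if (nb_pkts == 0)
		s->empty_polls++;
	if (s->polls < SKETCH_WINDOW_POLLS)
		return s->freq_step;

	/* Window complete: evaluate the share of empty polls and reset. */
	empty_pct = s->empty_polls * 100 / s->polls;
	s->polls = 0;
	s->empty_polls = 0;

	if (empty_pct > SKETCH_EMPTY_PCT) {
		/* Mostly idle: step down only after the tail has elapsed. */
		s->idle_windows++;
		if (s->idle_windows >= SKETCH_TAIL_WINDOWS) {
			s->idle_windows = 0;
			if (s->freq_step > 0)
				s->freq_step--; /* real code would request a lower freq */
		}
	} else {
		s->idle_windows = 0;
		/* Assumption: step up as soon as a window falls below the threshold. */
		if (empty_pct < SKETCH_EMPTY_PCT && s->freq_step < SKETCH_MAX_STEP)
			s->freq_step++; /* real code would request a higher freq */
	}
	return s->freq_step;
}
```

In a polling loop the application would call sketch_poll_init() once per
core and then sketch_poll_update(lcore_id, nb_rx) after every rx burst.
No training phase or LOW/MED/HIGH table is needed in this scheme; the
only tunables are the threshold, the window length and the tail.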