From: "Burakov, Anatoly"
To: "Ananyev, Konstantin", "Ma, Liang J", "dev@dpdk.org"
Cc: "Hunt, David", "stephen@networkplumber.org"
Date: Tue, 13 Oct 2020 10:45:05 +0100
Message-ID: <679f2f31-17b0-faf1-ac2c-cea541a4258b@intel.com>
References: <1599214740-3927-1-git-send-email-liang.j.ma@intel.com>
 <1601647919-25312-1-git-send-email-liang.j.ma@intel.com>
 <1601647919-25312-2-git-send-email-liang.j.ma@intel.com>
 <665bcb31-dcf0-553b-bae1-054e5f50e77f@intel.com>
 <3609c5b3-f431-3954-6350-cb2de77b72a7@intel.com>
 <7e3bedc9-db3d-262e-c0ab-62b53d60fc7c@intel.com>
Subject: Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics

On 12-Oct-20 2:13 PM, Burakov, Anatoly wrote:
> On 12-Oct-20 1:50 PM, Ananyev, Konstantin wrote:
>>
>>>>>
>>>>>>>>>>>> Add two new power management intrinsics, and provide an
>>>>>>>>>>>> implementation in eal/x86 based on the UMONITOR/UMWAIT
>>>>>>>>>>>> instructions. The instructions are implemented as raw byte
>>>>>>>>>>>> opcodes because there is not yet widespread compiler
>>>>>>>>>>>> support for them.
>>>>>>>>>>>>
>>>>>>>>>>>> The power management instructions provide an
>>>>>>>>>>>> architecture-specific function to either wait until a
>>>>>>>>>>>> specified TSC timestamp is reached, or optionally wait
>>>>>>>>>>>> until either a TSC timestamp is reached or a memory
>>>>>>>>>>>> location is written to. The monitor function also provides
>>>>>>>>>>>> an optional comparison, to avoid sleeping when the expected
>>>>>>>>>>>> write has already happened and no more writes are expected.
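
For illustration, a caller of the proposed monitor intrinsic might look
roughly like this. This is a sketch only: the argument order follows the
rte_power_monitor_sync() prototype quoted further down in this thread,
and the header name and final signature are assumptions, not taken
verbatim from the patch:

  #include <rte_cycles.h>           /* rte_get_tsc_cycles(), rte_get_tsc_hz() */
  #include <rte_power_intrinsics.h> /* header name assumed */

  /* Sleep until 'addr' is written, until the expected value is already
   * there (in which case we return right away), or until ~1 ms worth
   * of TSC cycles have elapsed - whichever comes first. */
  static void
  wait_for_work(const volatile void *addr, uint64_t expected, uint64_t mask)
  {
          const uint64_t deadline =
                  rte_get_tsc_cycles() + rte_get_tsc_hz() / 1000;

          /* 0 is the 'state' argument from the quoted prototype; its
           * exact semantics are defined by the patch itself. */
          rte_power_monitor(addr, expected, mask, 0, deadline);
  }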
>>>>>>>>>>> I think what this API is missing is a function to wake up a
>>>>>>>>>>> sleeping core. If the user can/should use some system call
>>>>>>>>>>> to achieve that, then at least it has to be clearly
>>>>>>>>>>> documented - even better, with some wrapper provided.
>>>>>>>>>>
>>>>>>>>>> I don't think it's possible to do that without severely
>>>>>>>>>> overcomplicating the intrinsic and its usage, because AFAIK
>>>>>>>>>> the only way to wake up a sleeping core would be to send some
>>>>>>>>>> kind of interrupt to the core, or trigger a write to the
>>>>>>>>>> cache line in question.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yes, I think we either need a syscall that would do an IPI for
>>>>>>>>> us (off the top of my head, membarrier() does that; there
>>>>>>>>> might be some other syscalls too), or something hand-made.
>>>>>>>>> For hand-made, I wonder whether something like this would be
>>>>>>>>> safe and sufficient:
>>>>>>>>>
>>>>>>>>> uint64_t val = atomic_load(addr);
>>>>>>>>> CAS(addr, val, &val);
>>>>>>>>>
>>>>>>>>> ?
>>>>>>>>> Anyway, one way or another, I think the ability to wake up a
>>>>>>>>> core we put to sleep has to be an essential part of this
>>>>>>>>> feature.
>>>>>>>>> As I understand it, the Linux kernel will limit the max amount
>>>>>>>>> of sleep time for these instructions:
>>>>>>>>> https://lwn.net/Articles/790920/
>>>>>>>>> But relying just on that seems too vague to me:
>>>>>>>>> - the user can adjust that value
>>>>>>>>> - it wouldn't apply to older kernels and non-Linux cases
>>>>>>>>> Konstantin
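
Spelled out with GCC's __atomic builtins, the hand-made wakeup
Konstantin sketches above would look roughly like this - purely
illustrative, and whether it is sufficient is exactly the open question
here:

  #include <stdint.h>

  /* CAS the current value over itself. Even though the stored value
   * is unchanged, the locked read-modify-write takes the cache line
   * exclusive, which should trip the monitor armed by UMONITOR and
   * wake the core sitting in UMWAIT. */
  static inline void
  wakeup_sketch(volatile uint64_t *addr)
  {
          uint64_t v = __atomic_load_n(addr, __ATOMIC_RELAXED);

          __atomic_compare_exchange_n(addr, &v, v, 0 /* strong */,
                          __ATOMIC_RELAXED, __ATOMIC_RELAXED);
  }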
>>>>>>>> This implies knowing the value the core is sleeping on.
>>>>>>>
>>>>>>> You don't need the value to wait for, you just need an address.
>>>>>>> And you can make the wakeup function accept an address as a
>>>>>>> parameter, same as monitor() does.
>>>>>>
>>>>>> Sorry, I meant the address. We don't know the address we're
>>>>>> sleeping on.
>>>>>>
>>>>>>>
>>>>>>>> That's not always the case - with this particular PMD power
>>>>>>>> management scheme, we get the address from the PMD and it
>>>>>>>> stays inside the callback.
>>>>>>>
>>>>>>> That's fine - you can store the address inside your callback
>>>>>>> metadata and do the wakeup as part of the _disable_ function.
>>>>>>
>>>>>> The address may be different, and by the time we access the
>>>>>> address it may have become stale, so I don't see how that would
>>>>>> help, unless you're suggesting to have some kind of
>>>>>> synchronization mechanism there.
>>>>>
>>>>> Yes, we'll need something to sync here, for sure.
>>>>> Sorry, I should have said that straight away, to avoid further
>>>>> misunderstanding.
>>>>> Let's say we associate a spin_lock with monitor(), by analogy
>>>>> with pthread_cond_wait().
>>>>> Konstantin
>>>>
>>>> The idea was to provide an intrinsic-like function - as in, a raw
>>>> instruction call, without anything extra. We even added the
>>>> masks/values etc. only because there's no race-less way to combine
>>>> UMONITOR/UMWAIT without those.
>>>>
>>>> Perhaps we can provide a synchronizable wrapper around it, to
>>>> avoid adding overhead for callers that don't need the sync
>>>> mechanism?
>>
>> Yes, might be two flavours, something like
>> rte_power_monitor() and rte_power_monitor_sync(),
>> or whatever would be a better name.
>>
>>>
>>> Also, how would having a spinlock help to synchronize? Are you
>>> suggesting we do UMWAIT on a spinlock address, or something to that
>>> effect?
>>>
>>
>> I thought about something very similar to the cond_wait() working
>> model:
>>
>> /*
>>  * Caller has to obtain the lock before calling this function.
>>  */
>> static inline int
>> rte_power_monitor_sync(const volatile void *p,
>>                 const uint64_t expected_value,
>>                 const uint64_t value_mask, const uint32_t state,
>>                 const uint64_t tsc_timestamp, rte_spinlock_t *lck)
>> {
>>         /* do whatever preparations are needed */
>>         ....
>>         umonitor(p);
>>
>>         if (value_mask != 0 &&
>>                         (*(const uint64_t *)p & value_mask) ==
>>                                 expected_value)
>>                 return 0;
>>
>>         /* release the lock and go to sleep */
>>         rte_spinlock_unlock(lck);
>>         umwait();
>>
>>         /* grab the lock back after wakeup */
>>         rte_spinlock_lock(lck);
>>
>>         /* do the rest of the processing */
>>         ....
>> }
>>
>> /* similar to cond_signal() */
>> static inline void
>> rte_power_monitor_wakeup(volatile void *p)
>> {
>>         uint64_t v;
>>
>>         v = __atomic_load_n((volatile uint64_t *)p, __ATOMIC_RELAXED);
>>         __atomic_compare_exchange_n((volatile uint64_t *)p, &v, v, 0,
>>                         __ATOMIC_RELAXED, __ATOMIC_RELAXED);
>> }
>>
>> Now in librte_power:
>>
>> struct pmd_queue_cfg {
>>         /* protects pwr_mgmt_state and wait_addr */
>>         rte_spinlock_t lck;
>>         enum pmd_mgmt_state pwr_mgmt_state;
>>         void *wait_addr;
>>         /* rest of the fields */
>>         ....
>> } __rte_cache_aligned;
>>
>> static uint16_t
>> rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
>>                 struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>>                 uint16_t max_pkts __rte_unused, void *_ __rte_unused)
>> {
>>         struct pmd_queue_cfg *q_conf;
>>
>>         q_conf = &port_cfg[port_id].queue_cfg[qidx];
>>
>>         if (unlikely(nb_rx == 0)) {
>>                 q_conf->empty_poll_stats++;
>>                 if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
>>                         volatile void *target_addr;
>>                         uint64_t expected, mask;
>>                         uint16_t ret;
>>
>>                         /* grab the lock and check the state */
>>                         rte_spinlock_lock(&q_conf->lck);
>>                         if (q_conf->pwr_mgmt_state == ENABLED) {
>>                                 ret = rte_eth_get_wake_addr(port_id,
>>                                         qidx, &target_addr,
>>                                         &expected, &mask);
>>                                 if (ret == 0) {
>>                                         q_conf->wait_addr = target_addr;
>>                                         rte_power_monitor_sync(
>>                                                 target_addr, ...,
>>                                                 &q_conf->lck);
>>                                 }
>>                                 /* reset the wait_addr */
>>                                 q_conf->wait_addr = NULL;
>>                         }
>>                         rte_spinlock_unlock(&q_conf->lck);
>>                         ....
>> }
>>
>> int
>> rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
>>                                  uint16_t port_id,
>>                                  uint16_t queue_id)
>> {
>>         ...
>>         /* grab the lock and change the state */
>>         rte_spinlock_lock(&queue_cfg->lck);
>>         queue_cfg->pwr_mgmt_state = DISABLED;
>>
>>         /* wake up the core if necessary */
>>         if (queue_cfg->wait_addr != NULL)
>>                 rte_power_monitor_wakeup(queue_cfg->wait_addr);
>>
>>         rte_spinlock_unlock(&queue_cfg->lck);
>>         ...
>> }
>>
>
> Yeah, seems that I understood you correctly the first time, then. I'm
> not completely convinced that this overhead and complexity is worth
> the trouble, to be honest. I mean, it's not like we're going to sleep
> indefinitely - this isn't like pthread wait. The biggest sleep time
> I've seen was around half a second, and I'm not sure there is a use
> case for enabling/disabling this functionality willy-nilly every 5
> seconds.
>
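
A side note on the kernel-imposed cap discussed above (the LWN article,
and the "around half a second" sleeps observed here): on Linux 5.2+ the
limit is exposed via sysfs as
/sys/devices/system/cpu/umwait_control/max_time. A quick sketch for
reading it:

  #include <stdio.h>

  int main(void)
  {
          unsigned long max_time;
          FILE *f = fopen(
                  "/sys/devices/system/cpu/umwait_control/max_time", "r");

          if (f == NULL) {
                  perror("umwait_control/max_time");
                  return 1;
          }
          /* the value is in TSC cycles; 0 means no limit */
          if (fscanf(f, "%lu", &max_time) == 1)
                  printf("umwait max_time: %lu TSC cycles\n", max_time);
          fclose(f);
          return 0;
  }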
Back story: we've had a little internal chat and basically agreed to
Konstantin's proposal, with slight modifications. That is, we need to
be able to wake up the core, because otherwise we have no deterministic
way of stopping the sleeping RX path. However, there is no need to
expose this mechanism in a public API - it can be kept inside the power
library instead.

-- 
Thanks,
Anatoly
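
To illustrate that conclusion - the wakeup kept private to librte_power
rather than exposed as a public intrinsic - the disable path could look
roughly like this (a sketch reusing the names from Konstantin's
pseudocode above, not the final patch):

  /* librte_power internal - not part of any public header */
  static void
  queue_stop_and_wakeup(struct pmd_queue_cfg *queue_cfg)
  {
          rte_spinlock_lock(&queue_cfg->lck);
          queue_cfg->pwr_mgmt_state = DISABLED;

          /* if the RX callback parked the core on an address, poke
           * that cache line so UMWAIT returns before its deadline */
          if (queue_cfg->wait_addr != NULL)
                  rte_power_monitor_wakeup(queue_cfg->wait_addr);

          rte_spinlock_unlock(&queue_cfg->lck);
  }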