From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dev-bounces@dpdk.org>
Received: from dpdk.org (dpdk.org [92.243.14.124])
	by inbox.dpdk.org (Postfix) with ESMTP id DF72AA04B6;
	Mon, 12 Oct 2020 15:13:16 +0200 (CEST)
Received: from [92.243.14.124] (localhost [127.0.0.1])
	by dpdk.org (Postfix) with ESMTP id B67911BAA8;
	Mon, 12 Oct 2020 15:13:15 +0200 (CEST)
Received: from mga11.intel.com (mga11.intel.com [192.55.52.93])
 by dpdk.org (Postfix) with ESMTP id 8E4551BAA8
 for <dev@dpdk.org>; Mon, 12 Oct 2020 15:13:12 +0200 (CEST)
IronPort-SDR: zOhXUHgVM2ZPTGiqyB2OUhvMrB0oRSRburTXWzJE9S2qqPGN+8BgnidstTVGQaoGIuFNL0Glh0
 O4MpZ3lRmmtQ==
X-IronPort-AV: E=McAfee;i="6000,8403,9771"; a="162270990"
X-IronPort-AV: E=Sophos;i="5.77,366,1596524400"; d="scan'208";a="162270990"
X-Amp-Result: SKIPPED(no attachment in message)
X-Amp-File-Uploaded: False
Received: from fmsmga004.fm.intel.com ([10.253.24.48])
 by fmsmga102.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 12 Oct 2020 06:13:11 -0700
IronPort-SDR: tIdFqUBUKDI3gvoLrZuChwy0gH/LzKa6N35NIvc1qbSS7lcMZ3420Y9/AxS7Dai3erPVP9A85B
 0/pqlH9GLhxQ==
X-IronPort-AV: E=Sophos;i="5.77,366,1596524400"; d="scan'208";a="344882418"
Received: from aburakov-mobl.ger.corp.intel.com (HELO [10.213.195.67])
 ([10.213.195.67])
 by fmsmga004-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 12 Oct 2020 06:13:09 -0700
To: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>,
 "Ma, Liang J" <liang.j.ma@intel.com>, "dev@dpdk.org" <dev@dpdk.org>
Cc: "Hunt, David" <david.hunt@intel.com>,
 "stephen@networkplumber.org" <stephen@networkplumber.org>
References: <1599214740-3927-1-git-send-email-liang.j.ma@intel.com>
 <1601647919-25312-1-git-send-email-liang.j.ma@intel.com>
 <1601647919-25312-2-git-send-email-liang.j.ma@intel.com>
 <BYAPR11MB33014A47B298FA6F574E26759A0B0@BYAPR11MB3301.namprd11.prod.outlook.com>
 <665bcb31-dcf0-553b-bae1-054e5f50e77f@intel.com>
 <BYAPR11MB33014547C6DB0D2EB33524F59A080@BYAPR11MB3301.namprd11.prod.outlook.com>
 <3609c5b3-f431-3954-6350-cb2de77b72a7@intel.com>
 <BYAPR11MB330130192DA4C05EF4F6352B9A080@BYAPR11MB3301.namprd11.prod.outlook.com>
 <7e3bedc9-db3d-262e-c0ab-62b53d60fc7c@intel.com>
 <BYAPR11MB3301B668B9B4A99BC13BE2D79A090@BYAPR11MB3301.namprd11.prod.outlook.com>
 <a5a4612b-26cb-18de-bb77-339e7f0954ab@intel.com>
 <c5f56104-4947-812c-6390-c34c62ac3c9c@intel.com>
 <BYAPR11MB330186B71AA7EBC3C6214BC39A070@BYAPR11MB3301.namprd11.prod.outlook.com>
From: "Burakov, Anatoly" <anatoly.burakov@intel.com>
Message-ID: <b95be7af-db3f-1a33-15a0-2330b559fc1c@intel.com>
Date: Mon, 12 Oct 2020 14:13:07 +0100
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
 Thunderbird/68.12.0
MIME-Version: 1.0
In-Reply-To: <BYAPR11MB330186B71AA7EBC3C6214BC39A070@BYAPR11MB3301.namprd11.prod.outlook.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Subject: Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org
Sender: "dev" <dev-bounces@dpdk.org>

On 12-Oct-20 1:50 PM, Ananyev, Konstantin wrote:
> 
>>>>
>>>>>>>>>>> Add two new power management intrinsics, and provide an
>>>>>>>>>>> implementation
>>>>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>>>>>> are implemented as raw byte opcodes because there is not yet
>>>>>>>>>>> widespread
>>>>>>>>>>> compiler support for these instructions.
>>>>>>>>>>>
>>>>>>>>>>> The power management instructions provide an architecture-specific
>>>>>>>>>>> function to either wait until a specified TSC timestamp is
>>>>>>>>>>> reached, or
>>>>>>>>>>> optionally wait until either a TSC timestamp is reached or a
>>>>>>>>>>> memory
>>>>>>>>>>> location is written to. The monitor function also provides an
>>>>>>>>>>> optional
>>>>>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>>>>>> happened, and no more writes are expected.
>>>>>>>>>>
>>>>>>>>>> I think what this API is missing is a function to wake up a
>>>>>>>>>> sleeping core.
>>>>>>>>>> If the user can/should use some system call to achieve that, then
>>>>>>>>>> at least it has to be clearly documented - even better, with some
>>>>>>>>>> wrapper provided.
>>>>>>>>>
>>>>>>>>> I don't think it's possible to do that without severely
>>>>>>>>> overcomplicating
>>>>>>>>> the intrinsic and its usage, because AFAIK the only way to wake up a
>>>>>>>>> sleeping core would be to send some kind of interrupt to the
>>>>>>>>> core, or
>>>>>>>>> trigger a write to the cache-line in question.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Yes, I think we either need a syscall that would do an IPI for us
>>>>>>>> (on top of my head - membarrier() does that, might be there are
>>>>>>>> some other syscalls too),
>>>>>>>> or something hand-made. For hand-made, I wonder whether something
>>>>>>>> like that would be safe and sufficient:
>>>>>>>> uint64_t val = atomic_load(addr);
>>>>>>>> CAS(addr, val, &val);
>>>>>>>> ?
>>>>>>>> Anyway, one way or another - I think the ability to wake up a core
>>>>>>>> we put to sleep has to be an essential part of this feature.
>>>>>>>> As I understand it, the Linux kernel will limit the max amount of
>>>>>>>> sleep time for these instructions:
>>>>>>>> https://lwn.net/Articles/790920/
>>>>>>>> But relying just on that seems too vague to me:
>>>>>>>> - the user can adjust that value
>>>>>>>> - it wouldn't apply to older kernels and non-Linux cases
>>>>>>>> Konstantin
>>>>>>>>
>>>>>>>
>>>>>>> This implies knowing the value the core is sleeping on.
>>>>>>
>>>>>> You don't need the value to wait for, you just need an address.
>>>>>> And you can make wakeup function to accept address as a parameter,
>>>>>> same as monitor() does.
>>>>>
>>>>> Sorry, I meant the address. We don't know the address we're sleeping on.
>>>>>
>>>>>>
>>>>>>> That's not
>>>>>>> always the case - with this particular PMD power management scheme, we
>>>>>>> get the address from the PMD and it stays inside the callback.
>>>>>>
>>>>>> That's fine - you can store the address inside your callback metadata
>>>>>> and do the wakeup as part of the _disable_ function.
>>>>>>
>>>>>
>>>>> The address may be different, and by the time we access the address it
>>>>> may become stale, so I don't see how that would help unless you're
>>>>> suggesting we have some kind of synchronization mechanism there.
>>>>
>>>> Yes, we'll need something to sync here for sure.
>>>> Sorry, I should have said it straight away, to avoid further
>>>> misunderstanding.
>>>> Let's say we associate a spin_lock with monitor(), by analogy with
>>>> pthread_cond_wait().
>>>> Konstantin
>>>>
>>>
>>> The idea was to provide an intrinsic-like function - as in, raw
>>> instruction call, without anything extra. We even added the masks/values
>>> etc. only because there's no race-less way to combine UMONITOR/UMWAIT
>>> without those.
>>>
>>> Perhaps we can provide a synchronize-able wrapper around it, to avoid
>>> adding overhead for code that calls the function but doesn't need the
>>> sync mechanism?
> 
> Yes, might be two flavours, something like
> rte_power_monitor() and rte_power_monitor_sync()
> or whatever would be a better name.
> 
>>>
>>
>> Also, how would having a spinlock help to synchronize? Are you
>> suggesting we do UMWAIT on a spinlock address, or something to that effect?
>>
> 
> I thought about something very similar to the cond_wait() working model:
> 
> /*
>  * Caller has to obtain the lock before calling this function.
>  */
> static inline int rte_power_monitor_sync(const volatile void *p,
>                 const uint64_t expected_value, const uint64_t value_mask,
>                 const uint32_t state, const uint64_t tsc_timestamp,
>                 rte_spinlock_t *lck)
> {
>         /* do whatever preparations are needed */
>         ....
>         umonitor(p);
> 
>         if (value_mask != 0 &&
>             (*((const uint64_t *)p) & value_mask) == expected_value) {
>                 return 0;
>         }
> 
>         /* release the lock and go to sleep */
>         rte_spinlock_unlock(lck);
>         rflags = umwait();
> 
>         /* grab the lock back after wakeup */
>         rte_spinlock_lock(lck);
> 
>         /* do the rest of the processing */
>         ....
> }
> 
> /* similar to cond_signal() */
> static inline void rte_power_monitor_wakeup(volatile void *p)
> {
>         uint64_t v;
> 
>         v = __atomic_load_n((volatile uint64_t *)p, __ATOMIC_RELAXED);
>         /* expected is passed by pointer, desired by value */
>         __atomic_compare_exchange_n((volatile uint64_t *)p, &v, v, 0,
>                         __ATOMIC_RELAXED, __ATOMIC_RELAXED);
> }
> 
> 
> Now in librte_power:
> 
> struct pmd_queue_cfg {
>         /* protects state and wait_addr */
>         rte_spinlock_t lck;
>         enum pmd_mgmt_state pwr_mgmt_state;
>         void *wait_addr;
>         /* remaining fields */
>         ....
> } __rte_cache_aligned;
> 
> 
> static uint16_t
> rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
>                 struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>                 uint16_t max_pkts __rte_unused, void *_ __rte_unused)
> {
>         struct pmd_queue_cfg *q_conf;
> 
>         q_conf = &port_cfg[port_id].queue_cfg[qidx];
> 
>         if (unlikely(nb_rx == 0)) {
>                 q_conf->empty_poll_stats++;
>                 if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
>                         volatile void *target_addr;
>                         uint64_t expected, mask;
>                         uint16_t ret;
> 
>                         /* grab the lock and check the state */
>                         rte_spinlock_lock(&q_conf->lck);
>                         if (q_conf->pwr_mgmt_state == ENABLED) {
>                                 ret = rte_eth_get_wake_addr(port_id, qidx,
>                                                 &target_addr, &expected,
>                                                 &mask);
>                                 if (ret == 0) {
>                                         q_conf->wait_addr = target_addr;
>                                         rte_power_monitor(target_addr, ...,
>                                                         &q_conf->lck);
>                                 }
>                                 /* reset the wait_addr */
>                                 q_conf->wait_addr = NULL;
>                         }
>                         rte_spinlock_unlock(&q_conf->lck);
>                         ....
> }
> 
> int
> rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
>                 uint16_t port_id,
>                 uint16_t queue_id)
> {
>         ...
>         /* grab the lock and change the state */
>         rte_spinlock_lock(&q_conf->lck);
>         q_conf->pwr_mgmt_state = DISABLED;
> 
>         /* wake up the core if necessary */
>         if (q_conf->wait_addr != NULL)
>                 rte_power_monitor_wakeup(q_conf->wait_addr);
> 
>         rte_spinlock_unlock(&q_conf->lck);
>         ...
> }
> 

Yeah, seems that I understood you correctly the first time, then. I'm not 
completely convinced that this overhead and complexity is worth the 
trouble, to be honest. I mean, it's not like we're going to sleep 
indefinitely - this isn't like pthread wait. The biggest sleep time I've 
seen was around half a second, and I'm not sure there is a use case for 
enabling/disabling this functionality willy-nilly every 5 seconds.

-- 
Thanks,
Anatoly