From: "Burakov, Anatoly"
Date: Mon, 12 Oct 2020 14:13:07 +0100
To: "Ananyev, Konstantin", "Ma, Liang J", "dev@dpdk.org"
Cc: "Hunt, David", "stephen@networkplumber.org"
Subject: Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics

On 12-Oct-20 1:50 PM, Ananyev, Konstantin wrote:
>
>>>>
>>>>>>>>>>> Add two new power management intrinsics, and provide an
>>>>>>>>>>> implementation in eal/x86 based on UMONITOR/UMWAIT instructions.
>>>>>>>>>>> The instructions are implemented as raw byte opcodes because
>>>>>>>>>>> there is not yet widespread compiler support for them.
>>>>>>>>>>>
>>>>>>>>>>> The power management instructions provide an architecture-specific
>>>>>>>>>>> function to either wait until a specified TSC timestamp is
>>>>>>>>>>> reached, or optionally wait until either a TSC timestamp is
>>>>>>>>>>> reached or a memory location is written to. The monitor function
>>>>>>>>>>> also provides an optional comparison, to avoid sleeping when the
>>>>>>>>>>> expected write has already happened and no more writes are
>>>>>>>>>>> expected.
>>>>>>>>>>
>>>>>>>>>> I think what this API is missing is a function to wake up a
>>>>>>>>>> sleeping core. If the user can/should use some system call to
>>>>>>>>>> achieve that, then at least that has to be clearly documented,
>>>>>>>>>> or better yet, some wrapper provided.
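(Side note for anyone following along: the "raw byte opcodes" mentioned
above boil down to hand-encoded inline asm. A minimal sketch of the idea,
assuming the parameter list matches the v4 signature quoted later in this
thread - an illustration, not the literal patch contents:

static inline void rte_power_monitor(const volatile void *p,
		const uint64_t expected_value, const uint64_t value_mask,
		const uint32_t state, const uint64_t tsc_timestamp)
{
	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);

	/* raw encoding of `umonitor %rdi`: arm address monitoring on p */
	asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf7;"
			:
			: "D"(p));

	/* don't bother sleeping if the expected write already happened */
	if (value_mask != 0 &&
			(*(const volatile uint64_t *)p & value_mask) ==
					expected_value)
		return;

	/* raw encoding of `umwait %edi`: sleep until the monitored cache
	 * line is written to, or until EDX:EAX reaches the TSC deadline */
	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
			: /* ignore rflags */
			: "D"(state), "a"(tsc_l), "d"(tsc_h));
}

The comparison between UMONITOR and UMWAIT is the race avoidance the
masks/values exist for: a write that lands before the monitor is armed is
caught by the comparison, and one that lands after it wakes the core.)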
>>>>>>>>>
>>>>>>>>> I don't think it's possible to do that without severely
>>>>>>>>> overcomplicating the intrinsic and its usage, because AFAIK the
>>>>>>>>> only way to wake up a sleeping core would be to send some kind of
>>>>>>>>> interrupt to the core, or trigger a write to the cache line in
>>>>>>>>> question.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Yes, I think we either need a syscall that would do an IPI for us
>>>>>>>> (off the top of my head, membarrier() does that; there might be
>>>>>>>> some other syscalls too), or something hand-made. For hand-made, I
>>>>>>>> wonder whether something like this would be safe and sufficient:
>>>>>>>>
>>>>>>>> uint64_t val = atomic_load(addr);
>>>>>>>> CAS(addr, val, &val);
>>>>>>>>
>>>>>>>> ? Anyway, one way or another, I think the ability to wake up a
>>>>>>>> core we put to sleep has to be an essential part of this feature.
>>>>>>>> As I understand it, the Linux kernel will limit the max amount of
>>>>>>>> sleep time for these instructions:
>>>>>>>> https://lwn.net/Articles/790920/
>>>>>>>> But relying just on that seems too vague to me:
>>>>>>>> - the user can adjust that value
>>>>>>>> - it wouldn't apply to older kernels and non-Linux cases
>>>>>>>> Konstantin
>>>>>>>>
>>>>>>>
>>>>>>> This implies knowing the value the core is sleeping on.
>>>>>>
>>>>>> You don't need the value to wait for, you just need an address.
>>>>>> And you can make the wakeup function accept an address as a
>>>>>> parameter, same as monitor() does.
>>>>>
>>>>> Sorry, I meant the address. We don't know the address we're
>>>>> sleeping on.
>>>>>
>>>>>>
>>>>>>> That's not always the case - with this particular PMD power
>>>>>>> management scheme, we get the address from the PMD and it stays
>>>>>>> inside the callback.
>>>>>>
>>>>>> That's fine - you can store the address inside your callback
>>>>>> metadata and do the wakeup as part of the _disable_ function.
>>>>>>
>>>>>
>>>>> The address may be different, and by the time we access the address
>>>>> it may have become stale, so I don't see how that would help unless
>>>>> you're suggesting we have some kind of synchronization mechanism
>>>>> there.
>>>>
>>>> Yes, we'll need something to sync here for sure.
>>>> Sorry, I should have said that straight away, to avoid further
>>>> misunderstanding. Let's say we associate a spinlock with monitor(),
>>>> by analogy with pthread_cond_wait().
>>>> Konstantin
>>>>
>>>
>>> The idea was to provide an intrinsic-like function - as in, a raw
>>> instruction call, without anything extra. We even added the
>>> masks/values etc. only because there's no race-less way to combine
>>> UMONITOR/UMWAIT without those.
>>>
>>> Perhaps we can provide a synchronize-able wrapper around it, to avoid
>>> adding overhead to callers that use the function but don't need the
>>> sync mechanism?
>
> Yes, there might be two flavours, something like
> rte_power_monitor() and rte_power_monitor_sync(),
> or whatever would be a better name.
>
>>>
>>
>> Also, how would having a spinlock help to synchronize? Are you
>> suggesting we do UMWAIT on a spinlock address, or something to that
>> effect?
>>
>
> I thought about something very similar to the cond_wait() working model:
>
> /*
>  * Caller has to obtain the lock before calling this function.
>  */
> static inline int rte_power_monitor_sync(const volatile void *p,
> 	const uint64_t expected_value, const uint64_t value_mask,
> 	const uint32_t state, const uint64_t tsc_timestamp,
> 	rte_spinlock_t *lck)
> {
> 	/* do whatever preparations are needed */
> 	....
> 	umonitor(p);
>
> 	/* don't sleep if the expected write has already happened */
> 	if (value_mask != 0 &&
> 			(*((const uint64_t *)p) & value_mask) == expected_value) {
> 		return 0;
> 	}
>
> 	/* release lock and go to sleep */
> 	rte_spinlock_unlock(lck);
> 	rflags = umwait();
>
> 	/* grab lock back after wakeup */
> 	rte_spinlock_lock(lck);
>
> 	/* do rest of processing */
> 	....
> }
>
> /* similar to cond_signal() */
> static inline void rte_power_monitor_wakeup(volatile void *p)
> {
> 	uint64_t v;
>
> 	v = __atomic_load_n((volatile uint64_t *)p, __ATOMIC_RELAXED);
> 	__atomic_compare_exchange_n((volatile uint64_t *)p, &v, v, 0,
> 		__ATOMIC_RELAXED, __ATOMIC_RELAXED);
> }
>
> Now in librte_power:
>
> struct pmd_queue_cfg {
> 	/* protects pwr_mgmt_state and wait_addr */
> 	rte_spinlock_t lck;
> 	enum pmd_mgmt_state pwr_mgmt_state;
> 	void *wait_addr;
> 	/* rest of the fields */
> 	....
> } __rte_cache_aligned;
>
> static uint16_t
> rte_power_mgmt_umwait(uint16_t port_id, uint16_t qidx,
> 	struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
> 	uint16_t max_pkts __rte_unused, void *_ __rte_unused)
> {
> 	struct pmd_queue_cfg *q_conf;
>
> 	q_conf = &port_cfg[port_id].queue_cfg[qidx];
>
> 	if (unlikely(nb_rx == 0)) {
> 		q_conf->empty_poll_stats++;
> 		if (unlikely(q_conf->empty_poll_stats > EMPTYPOLL_MAX)) {
> 			volatile void *target_addr;
> 			uint64_t expected, mask;
> 			uint16_t ret;
>
> 			/* grab the lock and check the state */
> 			rte_spinlock_lock(&q_conf->lck);
> 			if (q_conf->pwr_mgmt_state == ENABLED) {
> 				ret = rte_eth_get_wake_addr(port_id, qidx,
> 					&target_addr, &expected, &mask);
> 				if (ret == 0) {
> 					q_conf->wait_addr = target_addr;
> 					rte_power_monitor_sync(target_addr, ...,
> 						&q_conf->lck);
> 				}
> 				/* reset the wait_addr */
> 				q_conf->wait_addr = NULL;
> 			}
> 			rte_spinlock_unlock(&q_conf->lck);
> 		}
> 	}
> 	....
> }
>
> int
> rte_power_pmd_mgmt_queue_disable(unsigned int lcore_id,
> 	uint16_t port_id, uint16_t queue_id)
> {
> 	...
> 	/* grab the lock and change the state */
> 	rte_spinlock_lock(&q_conf->lck);
> 	q_conf->pwr_mgmt_state = DISABLED;
>
> 	/* wake up the core if necessary */
> 	if (q_conf->wait_addr != NULL)
> 		rte_power_monitor_wakeup(q_conf->wait_addr);
>
> 	rte_spinlock_unlock(&q_conf->lck);
> 	...
> }

Yeah, seems that I understood you correctly the first time, then. I'm not
completely convinced this overhead and complexity is worth the trouble,
to be honest. I mean, it's not like we're going to sleep indefinitely -
this isn't like pthread wait. The biggest sleep time I've seen was around
half a second, and I'm not sure there is a use case for enabling/disabling
this functionality willy-nilly every 5 seconds.
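As an aside, for anyone who wants to try the no-op CAS wakeup idea
outside of DPDK: a small, compilable version using plain GCC builtins
could look like the below. The function name is mine, not the patch's,
and this is a sketch of the trick, not a reviewed implementation.

#include <stdint.h>

static inline void
monitor_wakeup(volatile uint64_t *addr)
{
	uint64_t v = __atomic_load_n(addr, __ATOMIC_RELAXED);

	/* Write the current value back to the monitored cache line; the
	 * write itself, not a value change, is what satisfies the
	 * UMONITOR/UMWAIT pair on the sleeping core. A strong CAS is used
	 * so a spurious failure can't skip the write; if the CAS fails
	 * because the value changed, someone else has already written to
	 * the line, which wakes the sleeper just as well. */
	__atomic_compare_exchange_n(addr, &v, v, 0 /* strong */,
			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
}

The subtlety the spinlock is covering for is that the address itself can
go stale between monitor() and wakeup(); the CAS only helps if both sides
agree on which address is being monitored at that moment.

-- 
Thanks,
Anatoly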