From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dev-bounces@dpdk.org>
Received: from dpdk.org (dpdk.org [92.243.14.124])
	by inbox.dpdk.org (Postfix) with ESMTP id 21C85A04B6;
	Mon, 12 Oct 2020 12:35:34 +0200 (CEST)
Received: from [92.243.14.124] (localhost [127.0.0.1])
	by dpdk.org (Postfix) with ESMTP id 71EC01D66C;
	Mon, 12 Oct 2020 12:35:29 +0200 (CEST)
Received: from mga17.intel.com (mga17.intel.com [192.55.52.151])
 by dpdk.org (Postfix) with ESMTP id DA85E1D666
 for <dev@dpdk.org>; Mon, 12 Oct 2020 12:35:27 +0200 (CEST)
IronPort-SDR: ZDkAz9/JaY5IC/hjoGeF2dE4UCvHKFWI5pZ9do8ywOt+tlAxzVd7rBhgHy3P8d0k+dAdS7XULD
 pC6jFSIbi8XQ==
X-IronPort-AV: E=McAfee;i="6000,8403,9771"; a="145584673"
X-IronPort-AV: E=Sophos;i="5.77,366,1596524400"; d="scan'208";a="145584673"
X-Amp-Result: SKIPPED(no attachment in message)
X-Amp-File-Uploaded: False
Received: from fmsmga004.fm.intel.com ([10.253.24.48])
 by fmsmga107.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 12 Oct 2020 03:35:25 -0700
IronPort-SDR: ilTsQQUqdmUmyj9E7rjRTPXaCgeG+wnF7Wf/8ZB5Mv2QaGOOfZ7OoySU5wJC9+X1ozU3FMpydc
 afvCOpE3EL9A==
X-IronPort-AV: E=Sophos;i="5.77,366,1596524400"; d="scan'208";a="344830561"
Received: from aburakov-mobl.ger.corp.intel.com (HELO [10.213.195.67])
 ([10.213.195.67])
 by fmsmga004-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 12 Oct 2020 03:35:24 -0700
To: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>,
 "Ma, Liang J" <liang.j.ma@intel.com>, "dev@dpdk.org" <dev@dpdk.org>
Cc: "Hunt, David" <david.hunt@intel.com>,
 "stephen@networkplumber.org" <stephen@networkplumber.org>
References: <1599214740-3927-1-git-send-email-liang.j.ma@intel.com>
 <1601647919-25312-1-git-send-email-liang.j.ma@intel.com>
 <1601647919-25312-2-git-send-email-liang.j.ma@intel.com>
 <BYAPR11MB33014A47B298FA6F574E26759A0B0@BYAPR11MB3301.namprd11.prod.outlook.com>
 <665bcb31-dcf0-553b-bae1-054e5f50e77f@intel.com>
 <BYAPR11MB33014547C6DB0D2EB33524F59A080@BYAPR11MB3301.namprd11.prod.outlook.com>
 <3609c5b3-f431-3954-6350-cb2de77b72a7@intel.com>
 <BYAPR11MB330130192DA4C05EF4F6352B9A080@BYAPR11MB3301.namprd11.prod.outlook.com>
 <7e3bedc9-db3d-262e-c0ab-62b53d60fc7c@intel.com>
 <BYAPR11MB3301B668B9B4A99BC13BE2D79A090@BYAPR11MB3301.namprd11.prod.outlook.com>
From: "Burakov, Anatoly" <anatoly.burakov@intel.com>
Message-ID: <a5a4612b-26cb-18de-bb77-339e7f0954ab@intel.com>
Date: Mon, 12 Oct 2020 11:35:21 +0100
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101
 Thunderbird/68.12.0
MIME-Version: 1.0
In-Reply-To: <BYAPR11MB3301B668B9B4A99BC13BE2D79A090@BYAPR11MB3301.namprd11.prod.outlook.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Subject: Re: [dpdk-dev] [PATCH v4 02/10] eal: add power management intrinsics
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org
Sender: "dev" <dev-bounces@dpdk.org>

On 10-Oct-20 2:19 PM, Ananyev, Konstantin wrote:
> 
> 
>>>>>>>> Add two new power management intrinsics, and provide an implementation
>>>>>>>> in eal/x86 based on UMONITOR/UMWAIT instructions. The instructions
>>>>>>>> are implemented as raw byte opcodes because there is not yet widespread
>>>>>>>> compiler support for these instructions.
>>>>>>>>
>>>>>>>> The power management instructions provide an architecture-specific
>>>>>>>> function to either wait until a specified TSC timestamp is reached, or
>>>>>>>> optionally wait until either a TSC timestamp is reached or a memory
>>>>>>>> location is written to. The monitor function also provides an optional
>>>>>>>> comparison, to avoid sleeping when the expected write has already
>>>>>>>> happened, and no more writes are expected.
>>>>>>>
>>>>>>> I think what this API is missing - a function to wakeup sleeping core.
>>>>>>> If user can/should use some system call to achieve that, then at least
>>>>>>> it has to be clearly documented, even better some wrapper provided.
>>>>>>
>>>>>> I don't think it's possible to do that without severely overcomplicating
>>>>>> the intrinsic and its usage, because AFAIK the only way to wake up a
>>>>>> sleeping core would be to send some kind of interrupt to the core, or
>>>>>> trigger a write to the cache-line in question.
>>>>>>
>>>>>
>>>>> Yes, I think we either need a syscall that would do an IPI for us
>>>>> (on top of my head - membarrier() does that, might be there are some other syscalls too),
>>>>> or something hand-made. For hand-made, I wonder would something like that
>>>>> be safe and sufficient:
>>>>> uint64_t val = atomic_load(addr);
>>>>> CAS(addr, val, &val);
>>>>> ?
>>>>> Anyway, one way or another - I think ability to wakeup core we put to sleep
>>>>> have to be an essential part of this feature.
>>>>> As I understand linux kernel will limit max amount of sleep time for these instructions:
>>>>> https://lwn.net/Articles/790920/
>>>>> But relying just on that, seems too vague for me:
>>>>> - user can adjust that value
>>>>> - wouldn't apply to older kernels and non-linux cases
>>>>> Konstantin
>>>>>
>>>>
>>>> This implies knowing the value the core is sleeping on.
>>>
>>> You don't the value to wait for, you just need an address.
>>> And you can make wakeup function to accept address as a parameter,
>>> same as monitor() does.
>>
>> Sorry, i meant the address. We don't know the address we're sleeping on.
>>
>>>
>>>> That's not
>>>> always the case - with this particular PMD power management scheme, we
>>>> get the address from the PMD and it stays inside the callback.
>>>
>>> That's fine - you can store address inside you callback metadata
>>> and do wakeup as part of _disable_ function.
>>>
>>
>> The address may be different, and by the time we access the address it
>> may become stale, so i don't see how that would help unless you're
>> suggesting to have some kind of synchronization mechanism there.
> 
> Yes, we'll need something to sync here for sure.
> Sorry, I should say it straightway, to avoid further misunderstanding.
> Let say, associate a spin_lock with monitor(), by analogy with pthread_cond_wait().
> Konstantin
> 

The idea was to provide an intrinsic-like function - as in, raw 
instruction call, without anything extra. We even added the masks/values 
etc. only because there's no race-less way to combine UMONITOR/UMWAIT 
without those.

Perhaps we can provide a synchronize-able wrapper around it to avoid 
adding overhead to calls that function but doesn't need the sync mechanism?

-- 
Thanks,
Anatoly