From: "Burakov, Anatoly"
To: "Ananyev, Konstantin", "dev@dpdk.org", "Hunt, David", Ray Kinsella, Neil Horman
Cc: "Loftus, Ciara"
Date: Wed, 23 Jun 2021 10:56:49 +0100
Message-ID: <10918197-9793-0c46-4021-34145092b868@intel.com>
Subject: Re: [dpdk-dev] [PATCH v1 5/7] power: support callbacks for multiple Rx queues

On 23-Jun-21 10:49 AM, Ananyev, Konstantin wrote:
>
>>
>> On 22-Jun-21 10:41 AM, Ananyev, Konstantin wrote:
>>>
>>>> Currently, there is a hard limitation on the PMD power management
>>>> support that only allows it to support a single queue per lcore. This
>>>> is not ideal, as most DPDK use cases will poll multiple queues per
>>>> core.
>>>>
>>>> The PMD power management mechanism relies on ethdev Rx callbacks, so
>>>> it is very difficult to implement such support because callbacks are
>>>> effectively stateless and have no visibility into what the other
>>>> ethdev devices are doing. This places limitations on what we can do
>>>> within the framework of Rx callbacks, but the basics of this
>>>> implementation are as follows:
>>>>
>>>> - Replace per-queue structures with per-lcore ones, so that any device
>>>>   polled from the same lcore can share data
>>>> - Any queue that is going to be polled from a specific lcore has to be
>>>>   added to the list of queues to poll, so that the callback is aware
>>>>   of other queues being polled by the same lcore
>>>> - Both the empty poll counter and the actual power saving mechanism
>>>>   are shared between all queues polled on a particular lcore, and are
>>>>   only activated when a specially designated "power saving" queue is
>>>>   polled. To put it another way, we have no idea which queue the user
>>>>   will poll in what order, so we rely on them telling us that queue X
>>>>   is the last one in the polling loop, so any power management should
>>>>   happen there.
>>>> - A new API is added to mark a specific Rx queue as "power saving".
>>>
>>> Honestly, I don't understand the logic behind that new function.
>>> I understand that depending on HW we can monitor either one or
>>> multiple queues. That's OK, but why do we now need to mark one queue
>>> as a 'very special' one?
>>
>> Because we don't know which of the queues we are supposed to sleep on.
>>
>> Imagine a situation where you have 3 queues. What usually happens is
>> you poll them in a loop, so q0, q1, q2, q0, q1, q2... etc. We only want
>> to enter a power-optimized state on polling q2, because otherwise we're
>> risking going into a power-optimized state while q1 or q2 have traffic.
>
> That's why before going to sleep we need to make sure that for *all*
> queues we have at least EMPTYPOLL_MAX empty polls.
> Then the order of queue checking wouldn't matter.
> With your example it should be:
> if (q0.empty_polls > EMPTYPOLL_MAX && q1.empty_polls > EMPTYPOLL_MAX &&
>     q2.empty_polls > EMPTYPOLL_MAX)
>         goto_sleep;
>
> Don't get me wrong, I am not suggesting making *precisely* those checks
> in the actual code (it could be time-consuming if the number of checks
> is big), but the logic needs to remain.

The empty poll counter is *per core*, not *per queue*. All the shared
data is per core. We only increment the empty poll counter on the last
queue, but we drop it to 0 on any queue that has received traffic. That
way, we can avoid checking/incrementing empty poll counters for multiple
queues.

In other words, this is effectively achieving what you're suggesting,
but without per-queue checks. Of course, I could make it per-queue like
before, but then we just end up doing way more checks on every callback
and basically need to have the same logic anyway, so why bother?
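To make that concrete, the fast path boils down to something like this
(a simplified, untested sketch of the idea, not the literal patch code;
should_sleep() and the threshold value are made up for illustration):

#include <stdbool.h>
#include <stdint.h>

#define EMPTYPOLL_MAX 512 /* illustrative threshold, not the real value */

struct queue {
	uint16_t portid;
	uint16_t qid;
};

struct pmd_core_cfg {
	struct queue power_save_queue; /* the designated "last" queue */
	uint64_t empty_poll_stats;     /* shared by all queues on this lcore */
};

/*
 * Called from each queue's Rx callback with the number of packets just
 * received; returns true when it is safe to enter a power-optimized state.
 */
static inline bool
should_sleep(struct pmd_core_cfg *cfg, const struct queue *q, uint16_t nb_rx)
{
	if (nb_rx != 0) {
		/* traffic on *any* queue resets the shared per-lcore counter */
		cfg->empty_poll_stats = 0;
		return false;
	}
	/* empty polls are only counted at the designated power save queue */
	if (q->portid != cfg->power_save_queue.portid ||
			q->qid != cfg->power_save_queue.qid)
		return false;
	return ++cfg->empty_poll_stats > EMPTYPOLL_MAX;
}

So the per-queue checks collapse into one counter reset plus one
port/queue comparison per callback.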
>>
>> Worst case scenario, we enter sleep after polling q0, then traffic
>> arrives at q2, we wake up, and then attempt to go to sleep on q1
>> instead of skipping it. Essentially, we will be attempting to sleep at
>> every queue, instead of once per loop. This *might* be OK for
>> multi-monitor, because we'll be aborting sleep due to a sleep condition
>> check failure, but for modes like rte_pause()/rte_power_pause()-based
>> sleep, we will be entering sleep unconditionally, and will risk
>> sleeping at q1 while there's traffic at q2.
>>
>> So, we need this mechanism to be activated once every *loop*, not per
>> queue.
>>
>>> Why can't rte_power_ethdev_pmgmt_queue_enable() just:
>>> Check if the number of monitored queues exceeds HW/SW capabilities,
>>> and if so, then just return a failure.
>>> Otherwise, add the queue to the list and treat them all equally, i.e.:
>>> go to power save mode when the number of sequential empty polls on
>>> all monitored queues exceeds the EMPTYPOLL_MAX threshold?
>>>
>>>> Failing to call this API will result in no power management; however,
>>>> when having only one queue per core it is obvious which queue is the
>>>> "power saving" one, so things will still work without this new API
>>>> for use cases that were previously working without it.
>>>> - The limitation on UMWAIT-based polling is not removed, because
>>>>   UMWAIT is incapable of monitoring more than one address.
>>>>
>>>> Signed-off-by: Anatoly Burakov
>>>> ---
>>>>  lib/power/rte_power_pmd_mgmt.c | 335 ++++++++++++++++++++++++++-------
>>>>  lib/power/rte_power_pmd_mgmt.h |  34 ++++
>>>>  lib/power/version.map          |   3 +
>>>>  3 files changed, 306 insertions(+), 66 deletions(-)
>>>>
>>>> diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
>>>> index 0707c60a4f..60dd21a19c 100644
>>>> --- a/lib/power/rte_power_pmd_mgmt.c
>>>> +++ b/lib/power/rte_power_pmd_mgmt.c
>>>> @@ -33,7 +33,19 @@ enum pmd_mgmt_state {
>>>>  	PMD_MGMT_ENABLED
>>>>  };
>>>>
>>>> -struct pmd_queue_cfg {
>>>> +struct queue {
>>>> +	uint16_t portid;
>>>> +	uint16_t qid;
>>>> +};
>>>
>>> Just a thought: if that would help somehow, it can be changed to:
>>> union queue {
>>>     uint32_t raw;
>>>     struct { uint16_t portid, qid; };
>>> };
>>>
>>> That way, in the queue find/cmp functions below, you can operate on
>>> single raw 32-bit values. Probably not that important, as all these
>>> functions are on the slow path, but it might look nicer.
>>
>> Sure, that can work. We actually do comparisons with the power save
>> queue on the fast path, so maybe that'll help.
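E.g. the fast-path power save queue check would then become a single
32-bit compare, along these lines (untested sketch):

#include <stdbool.h>
#include <stdint.h>

union queue {
	uint32_t raw;
	struct {
		uint16_t portid;
		uint16_t qid;
	};
};

static inline bool
queue_equal(const union queue *l, const union queue *r)
{
	/* compare the port/queue pair as one 32-bit value */
	return l->raw == r->raw;
}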
>>
>>>
>>>> +struct pmd_core_cfg {
>>>> +	struct queue queues[RTE_MAX_ETHPORTS];
>>>
>>> If we'll have the ability to monitor multiple queues per lcore, would
>>> it always be enough?
>>> On the other hand, it is updated on the control path only.
>>> Wouldn't a normal list with malloc()/rte_malloc() be more suitable
>>> here?
>>
>> You're right, it should be dynamically allocated.
>>
>>>> +	/**< Which port-queue pairs are associated with this lcore? */
>>>> +	struct queue power_save_queue;
>>>> +	/**< When polling multiple queues, all but this one will be ignored */
>>>> +	bool power_save_queue_set;
>>>> +	/**< When polling multiple queues, power save queue must be set */
>>>> +	size_t n_queues;
>>>> +	/**< How many queues are in the list? */
>>>>  	volatile enum pmd_mgmt_state pwr_mgmt_state;
>>>>  	/**< State of power management for this queue */
>>>>  	enum rte_power_pmd_mgmt_type cb_mode;
>>>> @@ -43,8 +55,97 @@ struct pmd_queue_cfg {
>>>>  	uint64_t empty_poll_stats;
>>>>  	/**< Number of empty polls */
>>>>  } __rte_cache_aligned;
>>>> +static struct pmd_core_cfg lcore_cfg[RTE_MAX_LCORE];
>>>>
>>>> -static struct pmd_queue_cfg port_cfg[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT];
>>>> +static inline bool
>>>> +queue_equal(const struct queue *l, const struct queue *r)
>>>> +{
>>>> +	return l->portid == r->portid && l->qid == r->qid;
>>>> +}
>>>> +
>>>> +static inline void
>>>> +queue_copy(struct queue *dst, const struct queue *src)
>>>> +{
>>>> +	dst->portid = src->portid;
>>>> +	dst->qid = src->qid;
>>>> +}
>>>> +
>>>> +static inline bool
>>>> +queue_is_power_save(const struct pmd_core_cfg *cfg, const struct queue *q) {
>>>
>>> Here and in other places - any reason why the standard DPDK coding
>>> style is not used?
>>
>> Just accidental :)
>>
>>>> +	const struct queue *pwrsave = &cfg->power_save_queue;
>>>> +
>>>> +	/* if there's only single queue, no need to check anything */
>>>> +	if (cfg->n_queues == 1)
>>>> +		return true;
>>>> +	return cfg->power_save_queue_set && queue_equal(q, pwrsave);
>>>> +}
>>>> +

--
Thanks,
Anatoly