From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dev-bounces@dpdk.org>
Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124])
	by inbox.dpdk.org (Postfix) with ESMTP id 6E278A0A0C;
	Tue, 29 Jun 2021 15:23:24 +0200 (CEST)
Received: from [217.70.189.124] (localhost [127.0.0.1])
	by mails.dpdk.org (Postfix) with ESMTP id 012D5411C5;
	Tue, 29 Jun 2021 15:23:24 +0200 (CEST)
Received: from mga07.intel.com (mga07.intel.com [134.134.136.100])
 by mails.dpdk.org (Postfix) with ESMTP id AABE140E01
 for <dev@dpdk.org>; Tue, 29 Jun 2021 15:23:21 +0200 (CEST)
X-IronPort-AV: E=McAfee;i="6200,9189,10029"; a="271999226"
X-IronPort-AV: E=Sophos;i="5.83,308,1616482800"; d="scan'208";a="271999226"
Received: from orsmga005.jf.intel.com ([10.7.209.41])
 by orsmga105.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 29 Jun 2021 06:23:20 -0700
X-IronPort-AV: E=Sophos;i="5.83,308,1616482800"; d="scan'208";a="625629966"
Received: from aburakov-mobl.ger.corp.intel.com (HELO [10.213.252.170])
 ([10.213.252.170])
 by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 29 Jun 2021 06:23:19 -0700
To: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>,
 "dev@dpdk.org" <dev@dpdk.org>, "Hunt, David" <david.hunt@intel.com>
Cc: "Loftus, Ciara" <ciara.loftus@intel.com>
References: <cover.1624629506.git.anatoly.burakov@intel.com>
 <cover.1624884053.git.anatoly.burakov@intel.com>
 <676eab0e1eb6c63acb170893675daa5a39eac29d.1624884053.git.anatoly.burakov@intel.com>
 <DM6PR11MB4491030DED8F1E21880416C89A039@DM6PR11MB4491.namprd11.prod.outlook.com>
 <77f30ce1-40b0-b427-c0f9-359a350358a1@intel.com>
 <DM6PR11MB44913223DE75F7C0702A656C9A029@DM6PR11MB4491.namprd11.prod.outlook.com>
 <c5013b55-14f5-927f-d4f8-8632ada51f30@intel.com>
 <DM6PR11MB449187EDE097F16B91D7EF5F9A029@DM6PR11MB4491.namprd11.prod.outlook.com>
From: "Burakov, Anatoly" <anatoly.burakov@intel.com>
Message-ID: <7c40a1b5-18bf-eb40-19ee-13eaa0bdf45e@intel.com>
Date: Tue, 29 Jun 2021 14:23:15 +0100
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
 Thunderbird/78.11.0
MIME-Version: 1.0
In-Reply-To: <DM6PR11MB449187EDE097F16B91D7EF5F9A029@DM6PR11MB4491.namprd11.prod.outlook.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Subject: Re: [dpdk-dev] [PATCH v3 6/7] power: support monitoring multiple Rx
 queues
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org
Sender: "dev" <dev-bounces@dpdk.org>

On 29-Jun-21 1:14 PM, Ananyev, Konstantin wrote:
> 
> 
>> -----Original Message-----
>> From: Burakov, Anatoly <anatoly.burakov@intel.com>
>> Sent: Tuesday, June 29, 2021 12:40 PM
>> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; dev@dpdk.org; Hunt, David <david.hunt@intel.com>
>> Cc: Loftus, Ciara <ciara.loftus@intel.com>
>> Subject: Re: [dpdk-dev] [PATCH v3 6/7] power: support monitoring multiple Rx queues
>>
>> On 29-Jun-21 1:07 AM, Ananyev, Konstantin wrote:
>>>
>>>
>>>>>> Use the new multi-monitor intrinsic to allow monitoring multiple ethdev
>>>>>> Rx queues while entering the energy efficient power state. The multi
>>>>>> version will be used unconditionally if supported, and the UMWAIT one
>>>>>> will only be used when multi-monitor is not supported by the hardware.
>>>>>>
>>>>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>>>>>> ---
>>>>>>     doc/guides/prog_guide/power_man.rst |  9 ++--
>>>>>>     lib/power/rte_power_pmd_mgmt.c      | 76 ++++++++++++++++++++++++++++-
>>>>>>     2 files changed, 80 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
>>>>>> index fac2c19516..3245a5ebed 100644
>>>>>> --- a/doc/guides/prog_guide/power_man.rst
>>>>>> +++ b/doc/guides/prog_guide/power_man.rst
>>>>>> @@ -221,13 +221,16 @@ power saving whenever empty poll count reaches a certain number.
>>>>>>     The "monitor" mode is only supported in the following configurations and scenarios:
>>>>>>
>>>>>>     * If ``rte_cpu_get_intrinsics_support()`` function indicates that
>>>>>> +  ``rte_power_monitor_multi()`` function is supported by the platform, then
>>>>>> +  monitoring multiple Ethernet Rx queues for traffic will be supported.
>>>>>> +
>>>>>> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that only
>>>>>>       ``rte_power_monitor()`` is supported by the platform, then monitoring will be
>>>>>>       limited to a mapping of 1 core 1 queue (thus, each Rx queue will have to be
>>>>>>       monitored from a different lcore).
>>>>>>
>>>>>> -* If ``rte_cpu_get_intrinsics_support()`` function indicates that the
>>>>>> -  ``rte_power_monitor()`` function is not supported, then monitor mode will not
>>>>>> -  be supported.
>>>>>> +* If ``rte_cpu_get_intrinsics_support()`` function indicates that neither of the
>>>>>> +  two monitoring functions are supported, then monitor mode will not be supported.
>>>>>>
>>>>>>     * Not all Ethernet devices support monitoring, even if the underlying
>>>>>>       platform may support the necessary CPU instructions. Please refer to
>>>>>> diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
>>>>>> index 7762cd39b8..aab2d4f1ee 100644
>>>>>> --- a/lib/power/rte_power_pmd_mgmt.c
>>>>>> +++ b/lib/power/rte_power_pmd_mgmt.c
>>>>>> @@ -155,6 +155,24 @@ queue_list_remove(struct pmd_core_cfg *cfg, const union queue *q)
>>>>>>          return 0;
>>>>>>     }
>>>>>>
>>>>>> +static inline int
>>>>>> +get_monitor_addresses(struct pmd_core_cfg *cfg,
>>>>>> +             struct rte_power_monitor_cond *pmc)
>>>>>> +{
>>>>>> +     const struct queue_list_entry *qle;
>>>>>> +     size_t i = 0;
>>>>>> +     int ret;
>>>>>> +
>>>>>> +     TAILQ_FOREACH(qle, &cfg->head, next) {
>>>>>> +             struct rte_power_monitor_cond *cur = &pmc[i];
>>>>>
>>>>> Looks like you never increment the 'i' value inside that function.
>>>>> It would also probably be safer to add a 'num' parameter to check that
>>>>> we will never overrun the pmc[] boundaries.
>>>>
>>>> Will fix in v4, good catch!
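>>>>
>>>> For reference, the fixed version would look something like this (just a
>>>> sketch of the fix, with the bounds check you suggested):
>>>>
>>>> static inline int
>>>> get_monitor_addresses(struct pmd_core_cfg *cfg,
>>>>              struct rte_power_monitor_cond *pmc, size_t len)
>>>> {
>>>>      const struct queue_list_entry *qle;
>>>>      size_t i = 0;
>>>>      int ret;
>>>>
>>>>      TAILQ_FOREACH(qle, &cfg->head, next) {
>>>>              const union queue *q = &qle->queue;
>>>>
>>>>              /* prevent overrunning the pmc[] array */
>>>>              if (i >= len)
>>>>                      return -1;
>>>>
>>>>              /* this time, actually advance the index */
>>>>              ret = rte_eth_get_monitor_addr(q->portid, q->qid, &pmc[i++]);
>>>>              if (ret < 0)
>>>>                      return ret;
>>>>      }
>>>>      return 0;
>>>> }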
>>>>
>>>>>
>>>>>> +             const union queue *q = &qle->queue;
>>>>>> +             ret = rte_eth_get_monitor_addr(q->portid, q->qid, cur);
>>>>>> +             if (ret < 0)
>>>>>> +                     return ret;
>>>>>> +     }
>>>>>> +     return 0;
>>>>>> +}
>>>>>> +
>>>>>>     static void
>>>>>>     calc_tsc(void)
>>>>>>     {
>>>>>> @@ -183,6 +201,48 @@ calc_tsc(void)
>>>>>>          }
>>>>>>     }
>>>>>>
>>>>>> +static uint16_t
>>>>>> +clb_multiwait(uint16_t port_id, uint16_t qidx,
>>>>>> +             struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>>>>>> +             uint16_t max_pkts __rte_unused, void *addr __rte_unused)
>>>>>> +{
>>>>>> +     const unsigned int lcore = rte_lcore_id();
>>>>>> +     const union queue q = {.portid = port_id, .qid = qidx};
>>>>>> +     const bool empty = nb_rx == 0;
>>>>>> +     struct pmd_core_cfg *q_conf;
>>>>>> +
>>>>>> +     q_conf = &lcore_cfg[lcore];
>>>>>> +
>>>>>> +     /* early exit */
>>>>>> +     if (likely(!empty)) {
>>>>>> +             q_conf->empty_poll_stats = 0;
>>>>>> +     } else {
>>>>>> +             /* do we care about this particular queue? */
>>>>>> +             if (!queue_is_power_save(q_conf, &q))
>>>>>> +                     return nb_rx;
>>>>>
>>>>> I still don't understand the need for a 'special' power_save queue here...
>>>>> Why can't we just have a function:
>>>>>
>>>>> get_number_of_queues_whose_sequential_empty_polls_less_than_threshold(struct pmd_core_cfg *lcore_cfg),
>>>>> and then just:
>>>>>
>>>>> /* all queues have at least EMPTYPOLL_MAX sequential empty polls */
>>>>> if (get_number_of_queues_whose_sequential_empty_polls_less_than_threshold(q_conf) == 0) {
>>>>>        /* go into power-save mode here */
>>>>> }
>>>>
>>>> Okay, let's go through this step by step :)
>>>>
>>>> Let's suppose we have three queues - q0, q1 and q2. We want to sleep
>>>> whenever there's no traffic on *all of them*, however we cannot know
>>>> that until we have checked all of them.
>>>>
>>>> So, let's suppose that q0, q1 and q2 were empty all this time, but now
>>>> some traffic arrived at q2 while we're still checking q0. We see that q0
>>>> is empty, and all of the queues were empty for the last N polls, so we
>>>> think we will be safe to sleep at q0 despite the fact that traffic has
>>>> just arrived at q2.
>>>> This is not an issue with MONITOR mode, because we will be able to see
>>>> whether the current Rx ring descriptor is busy via the NIC callback, *but
>>>> this is not possible* with PAUSE and SCALE modes, because they don't
>>>> have the sneaky lookahead function of MONITOR! So, with PAUSE and SCALE
>>>> modes, it is possible to end up in a situation where you *think* you
>>>> don't have any traffic, but you actually do, you just haven't checked
>>>> the relevant queue yet.
>>>
>>> I think such a situation is unavoidable.
>>> Yes, traffic can arrive at *any* queue at *any* time.
>>> With your example above - the user chooses q2 as the 'special' queue, but
>>> traffic actually arrives on q0 or q1.
>>> And yes, if the user chooses the PAUSE or SCALE methods, they *can* miss the
>>> traffic, because, as you said, these methods have no notification mechanism.
>>> I think these are just unavoidable limitations of these power-save methods.
>>>
>>>> In order to prevent this from happening, we do not sleep on every queue,
>>>> instead we sleep *once* per loop.
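>>>>
>>>> To make that concrete, the polling loop conceptually looks like this
>>>> (pseudocode with made-up helper names, not the actual library code):
>>>>
>>>> while (running) {
>>>>      /* poll every queue first... */
>>>>      for (i = 0; i < n_queues; i++)
>>>>              nb_rx[i] = rte_eth_rx_burst(port[i], qid[i], pkts, len);
>>>>      /* ...and only then make at most one sleep decision per iteration */
>>>>      if (all_queues_empty_for_at_least(EMPTYPOLL_MAX))
>>>>              sleep_once();
>>>> }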
>>>
>>> Yes, totally agree we shouldn't sleep on *every* queue.
>>> We need to go to sleep when there is no traffic on *any* of the queues we monitor.
>>>
>>>> That is, we check q0, check q1, check
>>>> q2, and only then we decide whether we want to sleep or not.
>>>
>>>> Of course, with such a scheme it is still possible to e.g. sleep in q2
>>>> while there's traffic waiting in q0,
>>>
>>> Yes, exactly.
>>>
>>>> but the worst case is less bad with
>>>> this scheme, because we'll do at worst 1 extra sleep.
>>>
>>> Hmm, I think it would be one extra sleep anyway.
>>>
>>>> Whereas with what you're suggesting, if we had e.g. 10 queues to poll,
>>>> and we checked q1 just as traffic arrived at q0, we'd sleep at q1, then
>>>> at q2, then at q3, then at q4, then at q5... and 9 sleeps later we'd
>>>> finally reach q0 and find out, after all this time, that we shouldn't
>>>> have slept in the first place.
>>>
>>> Ah, OK, I think I understand now what you are saying.
>>> Sure, to avoid such a situation, we'll need to maintain extra counters and
>>> update them properly when we go to sleep.
>>> I should have stated that clearly at the beginning.
>>> It might be easier to explain what I meant by code snippet:
>>>
>>> lcore_conf needs 2 counters:
>>> uint64_t   nb_queues_ready_to_sleep;
>>> uint64_t   nb_sleeps;
>>>
>>> Plus each queue needs 2 counters:
>>> uint64_t nb_empty_polls;
>>> uint64_t nb_sleeps;
>>>
>>> Now, at rx_callback():
>>>
>>> /* check whether a sleep happened since the previous call;
>>>  * if yes, then reset the queue counters */
>>> if (queue->nb_sleeps != lcore_conf->nb_sleeps) {
>>>      queue->nb_sleeps = lcore_conf->nb_sleeps;
>>>      queue->nb_empty_polls = 0;
>>> }
>>>
>>> /* packet arrived, reset counters */
>>> if (nb_rx != 0) {
>>>      /* queue is not 'ready_to_sleep' any more */
>>>      if (queue->nb_empty_polls > EMPTYPOLL_MAX)
>>>              lcore_conf->nb_queues_ready_to_sleep--;
>>>      queue->nb_empty_polls = 0;
>>>
>>> /* empty poll */
>>> } else {
>>>      /* queue reached the EMPTYPOLL_MAX threshold, mark it 'ready_to_sleep' */
>>>      if (queue->nb_empty_polls == EMPTYPOLL_MAX)
>>>              lcore_conf->nb_queues_ready_to_sleep++;
>>>      queue->nb_empty_polls++;
>>> }
>>>
>>> /* no traffic on any queue for at least EMPTYPOLL_MAX iterations */
>>> if (lcore_conf->nb_queues_ready_to_sleep == lcore_conf->n_queues) {
>>>      /* update counters and sleep */
>>>      lcore_conf->nb_sleeps++;
>>>      lcore_conf->nb_queues_ready_to_sleep = 0;
>>>      goto_sleep();
>>> }
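>>>
>>> This way the callback does a constant amount of work per invocation -
>>> there is no need to scan all the other queues on every poll.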
>>>
>> Actually, I don't think this is going to work, because I can see no
>> (easy) way to get from an lcore to a specific queue. I mean, you could have
>> an O(N) for loop that loops over the list of queues every time we
>> enter the callback, but I don't think that's such a good idea.
> 
> I think something like this will work:
>
> struct queue_list_entry {
>        TAILQ_ENTRY(queue_list_entry) next;
>        union queue queue;
> +      /* pointer to the lcore config that this queue is managed by */
> +      struct pmd_core_cfg *lcore_cfg;
> +      /* queue RX callback */
> +      const struct rte_eth_rxtx_callback *cur_cb;
> };
>
> At rte_power_ethdev_pmgmt_queue_enable():
>
> +      struct queue_list_entry *qle;
> ...
> -      ret = queue_list_add(queue_cfg, &qdata);
> +      qle = queue_list_add(queue_cfg, &qdata);
> ...
> -      queue_cfg->cur_cb = rte_eth_add_rx_callback(port_id, queue_id, clb, NULL);
> +      qle->cur_cb = rte_eth_add_rx_callback(port_id, queue_id, clb, qle);
>
> And at the actual clb_xxx(uint16_t port_id, uint16_t qidx,
>                  struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
>                  uint16_t max_pkts __rte_unused, void *addr)
> {
>     ...
>     struct queue_list_entry *qle = addr;
>     struct pmd_core_cfg *lcore_cfg = qle->lcore_cfg;
>     ....
> }
> 
> 

Hm, that's actually a clever solution :) Thanks!
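
Putting your suggestion together with the multi-monitor path, I'd expect
the v4 callback to end up looking roughly like the sketch below. To be
clear, this is only a sketch: queue_reset(), queue_can_sleep(),
lcore_can_sleep() and the n_queues field are placeholder names for the
per-queue/per-lcore counter bookkeeping we discussed above, not final code.

static uint16_t
clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
		uint16_t max_pkts __rte_unused, void *arg)
{
	struct queue_list_entry *queue_conf = arg;
	struct pmd_core_cfg *lcore_conf = queue_conf->lcore_cfg;
	const bool empty = nb_rx == 0;

	if (likely(!empty)) {
		/* traffic arrived, reset this queue's empty poll counters */
		queue_reset(lcore_conf, queue_conf);
	} else {
		struct rte_power_monitor_cond pmc[lcore_conf->n_queues];

		/* both this queue and every other queue on this lcore must
		 * have reached the empty poll threshold before we sleep */
		if (!queue_can_sleep(lcore_conf, queue_conf) ||
				!lcore_can_sleep(lcore_conf))
			return nb_rx;

		/* gather monitor conditions for every queue on this lcore */
		if (get_monitor_addresses(lcore_conf, pmc,
				lcore_conf->n_queues) < 0)
			return nb_rx;

		/* wake up on a write to any of the monitored addresses */
		rte_power_monitor_multi(pmc, lcore_conf->n_queues, UINT64_MAX);
	}
	return nb_rx;
}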

-- 
Thanks,
Anatoly