From: Sagi Grimberg
To: Yongseok Koh
Cc: adrien.mazarguil@6wind.com, nelio.laranjeiro@6wind.com, dev@dpdk.org
Date: Thu, 27 Jul 2017 14:12:41 +0300
In-Reply-To: <20170725074356.GA4034@minint-98vp2qg>
Subject: Re: [dpdk-dev] [PATCH] net/mlx5: poll completion queue once per a call

>> Yes I realize that, but can't the device still complete in a burst (of
>> unsuppressed completions)? I mean its not guaranteed that for every
>> txq_complete a signaled completion is pending right? What happens if
>> the device has inconsistent completion pacing? Can't the sw grow a
>> batch of completions if txq_complete will process a single completion
>> unconditionally?

> Speculation. First of all, device doesn't delay completion notifications for no
> reason. ASIC is not a SW running on top of a OS.

I'm sorry, but this statement is not correct. It might hold in a lab
environment, but in practice there are many things that can affect the
device's completion timing.

> If a completion comes up late,
> this means device really can't keep up the rate of posting descriptors.
> If so,
> tx_burst() should generate back-pressure by returning partial Tx, then app can
> make a decision between drop or retry. Retry on Tx means back-pressuring Rx side
> if app is forwarding packets.

Not arguing with that; I was simply suggesting that better heuristics
could be applied than "process one completion unconditionally".

> More serious problem I expected was a case that the THRESH is smaller than
> burst size. In that case, txq->elts[] will be short of slots all the time. But
> fortunately, in MLX PMD, we request one completion per a burst at most, not
> every THRESH of packets.
>
> If there's some SW jitter on Tx processiong, the Tx CQ can grow for sure.
> Question to myself was "when does it shrink?". It shrinks when Tx burst is light
> (burst size is smaller than THRESH) because mlx5_tx_complete() is always called
> every time tx_burst() is called. What if it keeps growing? Then, drop is
> necessary and natural like I mentioned above.
>
> It doesn't make sense for SW to absorb any possible SW jitters. Cost is high.
> It is usually done by increasing queue depth. Keeping steady state is more
> important.

Again, I agree jitters are bad, but with proper heuristics in place mlx5
can still keep a low jitter _and_ consume completions faster than
consecutive tx_burst invocations.

> Rather, this patch is helpful for reducing jitters. When I run a profiler, the
> most cycle-consuming part on Tx is still freeing buffers. If we allow loops on
> checking valid CQE, many buffers could be freed in a single call of
> mlx5_tx_complete() at some moment, then it would cause a long delay. This would
> aggravate jitter.

I didn't dispute that this patch addresses an issue, but mlx5 is a driver
that needs to serve applications which may behave differently from your
test case.

> Of course. I appreciate your time for the review. And keep in mind that nothing
> is impossible in an open source community. I always like to discuss about ideas
> with anyone.
> But I was just asking to hear more details about your suggestion if
> you wanted me to implement it, rather than giving me one-sentence question :-)

Good to know.

>>> Does "budget" mean the
>>> threshold? If so, calculation of stats for adaptive threshold can impact single
>>> core performance. With multiple cores, adjusting threshold doesn't affect much.
>>
>> If you look at mlx5e driver in the kernel, it maintains online stats on
>> its RX and TX queues. It maintain these stats mostly for adaptive
>> interrupt moderation control (but not only).
>>
>> I was suggesting maintaining per TX queue stats on average completions
>> consumed for each TX burst call, and adjust the stopping condition
>> according to a calculated stat.

> In case of interrupt mitigation, it could be beneficial because interrupt
> handling cost is too costly. But, the beauty of DPDK is polling, isn't it?

If you read my comment again, I didn't suggest applying the stats to
interrupt moderation; I only gave that as an example of a use case. I was
suggesting maintaining the online stats to adjust a threshold for how many
completions to process in a tx burst call (instead of processing one
unconditionally).

> And please remember to ack at the end of this discussion if you are okay so that
> this patch can gets merged. One data point is, single core performance (fwd) of
> vectorized PMD gets improved by more than 6% with this patch. 6% is never small.

Yeah, I don't mind merging it, given that I don't have time to come up
with anything better (or worse :))

Reviewed-by: Sagi Grimberg