From: Vlad Zolotarov
Date: Wed, 25 Feb 2015 18:46:44 +0200
To: Bruce Richardson
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] : ixgbe: why bulk allocation is not used for a scattered Rx flow?
Message-ID: <54EDFC74.7050404@cloudius-systems.com>
In-Reply-To: <20150225110228.GA4896@bricha3-MOBL3>
References: <54ED9894.3050409@cloudius-systems.com> <20150225110228.GA4896@bricha3-MOBL3>

On 02/25/15 13:02, Bruce Richardson wrote:
> On Wed, Feb 25, 2015 at 11:40:36AM +0200, Vlad Zolotarov wrote:
>> Hi, I have a question about the "scattered Rx" feature: why does
>> enabling it disable the "bulk allocation" feature?
>
> The "bulk-allocation" feature is one where a more optimized RX code
> path is used. For the sake of performance, certain assumptions were
> made in that code path, one of which is that packets fit inside a
> single mbuf. Dropping this assumption makes receiving packets much
> more complicated and therefore slower. [For similar reasons, the
> optimized TX routines, e.g. vector TX, are only used if it is
> guaranteed that no hardware offload features are going to be used.]
>
> Now, it is possible, though challenging, to write optimized code for
> these more complicated cases, such as scattered RX, or TX with
> offloads or scattered packets. In general, we will always want
> separate routines for the simple case and the complicated cases,
> because the cost of checking for offloads or multi-mbuf packets is
> significant enough to hurt performance badly when those features are
> not needed. In the case of the vector PMD for ixgbe - our highest
> performance path right now - we indeed have two receive routines,
> for the simple and scattered cases. For TX, we only have an
> optimized path for the simple case, but that is not to say that
> someone may not provide one for the offload case too at some point.
>
> A final note on scattered packets in particular: if packets are too
> big to fit in a single mbuf, then they are not small packets, and
> the processing time available per packet is, by definition, larger
> than for packets that fit in a single mbuf. For 64-byte packets, the
> packet inter-arrival time is 67 ns @ 10G, or approx. 200 cycles at
> 3 GHz. If we assume a standard 2k mbuf, then a packet which spans
> two mbufs takes at least 1654 ns on the wire, and therefore a 3 GHz
> CPU has nearly 5000 cycles to process that same packet. Since the
> processing budget is so much bigger, the need to optimize is much
> smaller; it's therefore more important to focus on the small-packet
> case, which is what we have done.
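
Just to spell out the arithmetic behind those numbers (my own
back-of-the-envelope check, assuming 8B of preamble plus 12B of
inter-frame gap per frame, i.e. 20B of wire overhead, and a 3 GHz
core):

#include <stdio.h>

int main(void)
{
	const double line_bps = 10e9;	/* 10G line rate */
	const double cpu_hz   = 3e9;	/* 3 GHz core */

	/* 64B frame + 8B preamble + 12B IFG = 84B on the wire */
	double t_small = 84 * 8 / line_bps;
	/* a packet filling one 2k mbuf and spilling into a second
	 * occupies at least 2048B + 20B of overhead on the wire */
	double t_big = (2048 + 20) * 8 / line_bps;

	printf("64B   : %6.1f ns, %4.0f cycles\n",
	       t_small * 1e9, t_small * cpu_hz);
	printf("2-mbuf: %6.1f ns, %4.0f cycles\n",
	       t_big * 1e9, t_big * cpu_hz);
	return 0;
}

This prints 67.2 ns / ~202 cycles and 1654.4 ns / ~4963 cycles, which
matches the figures above: the per-packet budget in the two-mbuf case
really is roughly 25x larger.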

Sure. I'm doing my best not to harm the existing code paths: the RSC
handler is a separate function (I first patched the scalar scattered
function, but now I'm rewriting it as a stand-alone routine), I don't
change igb_rx_entry (it stays a plain mbuf pointer), and I keep the
additional RSC info in separate descriptors, in a separate ring that
is never accessed in a non-RSC flow.
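
Schematically, what I have in mind is something like the following
(all names here are mine and purely illustrative; this is a sketch of
the idea, not the actual patch):

struct rte_mbuf;	/* forward declaration, to keep the sketch standalone */

/* The existing sw_ring entry stays exactly as it is today:
 * one mbuf pointer per RX descriptor. */
struct igb_rx_entry {
	struct rte_mbuf *mbuf;
};

/* The extra RSC state gets its own entry type... */
struct igb_rsc_entry {
	struct rte_mbuf *fbuf;	/* first segment of the in-flight chain */
};

/* ...and its own ring, allocated only when RSC is enabled, so the
 * non-RSC receive paths never touch it and their per-entry cache
 * footprint stays unchanged. */
struct igb_rx_queue_sketch {
	struct igb_rx_entry  *sw_ring;		/* unchanged */
	struct igb_rsc_entry *sw_rsc_ring;	/* NULL unless RSC is on */
};

That way only the RSC receive routine ever dereferences the second
ring.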

>
>> There is some unclear comment in ixgbe_recv_scattered_pkts():
>>
>> /*
>>  * Descriptor done.
>>  *
>>  * Allocate a new mbuf to replenish the RX ring descriptor.
>>  * If the allocation fails:
>>  *    - arrange for that RX descriptor to be the first one
>>  *      being parsed the next time the receive function is
>>  *      invoked [on the same queue].
>>  *
>>  *    - Stop parsing the RX ring and return immediately.
>>  *
>>  * This policy does not drop the packet received in the RX
>>  * descriptor for which the allocation of a new mbuf failed.
>>  * Thus, it allows that packet to be later retrieved if
>>  * mbufs have been freed in the meantime.
>>  * As a side effect, holding RX descriptors instead of
>>  * systematically giving them back to the NIC may lead to
>>  * RX ring exhaustion situations.
>>  * However, the NIC can gracefully prevent such situations
>>  * from happening by sending specific "back-pressure" flow
>>  * control frames to its peer(s).
>>  */
>>
>> Why can't the same "policy" be applied to the bulk-context
>> allocation? Don't advance the RDT until you've refilled the ring.
>> What am I missing here?
>
> A lot of the optimizations done in other code paths, such as bulk
> alloc, may well be applicable here; it's just that the work has not
> been done yet, as the focus is elsewhere. For vector PMD RX, we now
> have routines that work on both regular and scattered packets, and
> both perform much better than the scalar equivalents. Also note
> that in every RX (and TX) routine, the NIC tail pointer update is
> always done just once, at the end of the function.

I see. Thanks for the thorough clarification. Although I've spent
some time with DPDK, I still sometimes feel that I don't fully
understand the original author's intent, and clarifications like
yours really help. I looked at the vectorized receive function
(_recv_raw_pkts_vec()) and it is one cryptic piece of code! ;) Since
you've brought it up, could you point me to the measurements
comparing the vectorized and scalar DPDK data paths? I wonder how
working without CSUM offload, for instance, can be faster even for
the small packets you mentioned above. One would have to calculate
the checksum in SW in that case, and I'm puzzled how that could be
faster than letting the HW do it...

>
>> Another question is about the LRO feature: is there a reason why
>> it's not implemented? I've implemented LRO support in the ixgbe PMD
>> to begin with - I used the "scattered Rx" code as a template and
>> now I'm tuning it (things like the stuff above).
>>
>> Is there any philosophical reason why it hasn't been implemented in
>> *any* PMD so far? ;)
>
> I'm not aware of any philosophical reasons why it hasn't been done.
> Patches are welcome, as always. :-)

Great! So I'll send what I have once it's ready... ;) Again, thanks
for the great clarification.

>
> /Bruce