From: Vlad Zolotarov
Date: Wed, 25 Feb 2015 18:46:44 +0200
To: Bruce Richardson
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] : ixgbe: why bulk allocation is not used for a scattered Rx flow?
Message-ID: <54EDFC74.7050404@cloudius-systems.com>
In-Reply-To: <20150225110228.GA4896@bricha3-MOBL3>
References: <54ED9894.3050409@cloudius-systems.com> <20150225110228.GA4896@bricha3-MOBL3>

On 02/25/15 13:02, Bruce Richardson wrote:
> On Wed, Feb 25, 2015 at 11:40:36AM +0200, Vlad Zolotarov wrote:
>> Hi, I have a question about the "scattered Rx" feature: why does
>> enabling it disable the "bulk allocation" feature?
>
> The "bulk-allocation" feature is one where a more optimized RX code
> path is used. For the sake of performance, certain assumptions were
> made in that code path, one of which is that packets fit inside a
> single mbuf. Dropping this assumption makes receiving packets much
> more complicated and therefore slower. [For similar reasons, the
> optimized TX routines, e.g. vector TX, are only used if it is
> guaranteed that no hardware offload features are going to be used.]
>
> Now, it is possible, though challenging, to write optimized code for
> these more complicated cases, such as scattered RX, or TX with
> offloads or scattered packets. In general, we will always want
> separate routines for the simple case and the complicated cases,
> because the cost of checking for offloads or multi-mbuf packets is
> significant enough to hurt performance badly when those features are
> not needed. In the case of the vector PMD for ixgbe - our highest
> performance path right now - we indeed have two receive routines,
> for the simple and scattered cases. For TX, we only have an
> optimized path for the simple case, but that is not to say that
> someone may not provide one for the offload case too at some point.
>
> A final note on scattered packets in particular: if packets are too
> big to fit in a single mbuf, then they are not small packets, and
> the processing time available per packet is, by definition, larger
> than for packets that fit in a single mbuf. For 64-byte packets, the
> packet inter-arrival time is 67 ns @ 10G, or approx. 200 cycles at
> 3 GHz. If we assume a standard 2k mbuf, then a packet which spans
> two mbufs takes at least 1654 ns on the wire, and therefore a 3 GHz
> CPU has nearly 5000 cycles to process that same packet. Since the
> processing budget is so much bigger, the need to optimize is much
> smaller; it's therefore more important to focus on the small-packet
> case, which is what we have done.
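
Just to spell out the arithmetic behind those numbers (my own
back-of-the-envelope check, assuming 8B of preamble plus 12B of
inter-frame gap per frame, i.e. 20B of wire overhead, and a 3 GHz
core):

#include <stdio.h>

int main(void)
{
	const double line_bps = 10e9;	/* 10G line rate */
	const double cpu_hz   = 3e9;	/* 3 GHz core */

	/* 64B frame + 8B preamble + 12B IFG = 84B on the wire */
	double t_small = 84 * 8 / line_bps;
	/* a packet filling one 2k mbuf and spilling into a second
	 * occupies at least 2048B + 20B of overhead on the wire */
	double t_big = (2048 + 20) * 8 / line_bps;

	printf("64B   : %6.1f ns, %4.0f cycles\n",
	       t_small * 1e9, t_small * cpu_hz);
	printf("2-mbuf: %6.1f ns, %4.0f cycles\n",
	       t_big * 1e9, t_big * cpu_hz);
	return 0;
}

This prints 67.2 ns / ~202 cycles and 1654.4 ns / ~4963 cycles, which
matches the figures above: the per-packet budget in the two-mbuf case
really is roughly 25x larger.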

Sure. I'm doing my best not to harm the existing code paths: the RSC
handler is a separate function (I first patched the scalar scattered
function, but now I'm rewriting it as a stand-alone routine), I don't
change igb_rx_entry (it stays a plain mbuf pointer), and I keep the
additional RSC info in separate descriptors, in a separate ring that
is never accessed in a non-RSC flow.
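
Schematically, what I have in mind is something like the following
(all names here are mine and purely illustrative; this is a sketch of
the idea, not the actual patch):

struct rte_mbuf;	/* forward declaration, to keep the sketch standalone */

/* The existing sw_ring entry stays exactly as it is today:
 * one mbuf pointer per RX descriptor. */
struct igb_rx_entry {
	struct rte_mbuf *mbuf;
};

/* The extra RSC state gets its own entry type... */
struct igb_rsc_entry {
	struct rte_mbuf *fbuf;	/* first segment of the in-flight chain */
};

/* ...and its own ring, allocated only when RSC is enabled, so the
 * non-RSC receive paths never touch it and their per-entry cache
 * footprint stays unchanged. */
struct igb_rx_queue_sketch {
	struct igb_rx_entry  *sw_ring;		/* unchanged */
	struct igb_rsc_entry *sw_rsc_ring;	/* NULL unless RSC is on */
};

That way only the RSC receive routine ever dereferences the second
ring.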

>
>> There is some unclear comment in ixgbe_recv_scattered_pkts():
>>
>> /*
>>  * Descriptor done.
>>  *
>>  * Allocate a new mbuf to replenish the RX ring descriptor.
>>  * If the allocation fails:
>>  *    - arrange for that RX descriptor to be the first one
>>  *      being parsed the next time the receive function is
>>  *      invoked [on the same queue].
>>  *
>>  *    - Stop parsing the RX ring and return immediately.
>>  *
>>  * This policy does not drop the packet received in the RX
>>  * descriptor for which the allocation of a new mbuf failed.
>>  * Thus, it allows that packet to be later retrieved if
>>  * mbufs have been freed in the meantime.
>>  * As a side effect, holding RX descriptors instead of
>>  * systematically giving them back to the NIC may lead to
>>  * RX ring exhaustion situations.
>>  * However, the NIC can gracefully prevent such situations
>>  * from happening by sending specific "back-pressure" flow
>>  * control frames to its peer(s).
>>  */
>>
>> Why can't the same "policy" be applied to the bulk-context
>> allocation? Don't advance the RDT until you've refilled the ring.
>> What am I missing here?
>
> A lot of the optimizations done in other code paths, such as bulk
> alloc, may well be applicable here; it's just that the work has not
> been done yet, as the focus is elsewhere. For vector PMD RX, we now
> have routines that work on both regular and scattered packets, and
> both perform much better than the scalar equivalents. Also note
> that in every RX (and TX) routine, the NIC tail pointer update is
> always done just once, at the end of the function.

I see. Thanks for the thorough clarification. Although I've spent
some time with DPDK, I still sometimes feel that I don't fully
understand the original author's intent, and clarifications like
yours really help. I looked at the vectorized receive function
(_recv_raw_pkts_vec()) and it is one cryptic piece of code! ;) Since
you've brought it up, could you point me to the measurements
comparing the vectorized and scalar DPDK data paths? I wonder how
working without CSUM offload, for instance, can be faster even for
the small packets you mentioned above. One would have to calculate
the checksum in SW in that case, and I'm puzzled how that could be
faster than letting the HW do it...

>
>> Another question is about the LRO feature: is there a reason why
>> it's not implemented? I've implemented LRO support in the ixgbe PMD
>> to begin with - I used the "scattered Rx" code as a template and
>> now I'm tuning it (things like the stuff above).
>>
>> Is there any philosophical reason why it hasn't been implemented in
>> *any* PMD so far? ;)
>
> I'm not aware of any philosophical reasons why it hasn't been done.
> Patches are welcome, as always. :-)

Great! So I'll send what I have once it's ready... ;) Again, thanks
for the great clarification.

>
> /Bruce