From: Vlad Zolotarov
To: Stephen Hemminger
Cc: "dev@dpdk.org"
Date: Thu, 30 Jul 2015 20:14:29 +0300
Subject: Re: [dpdk-dev] RFC: i40e xmit path HW limitation

On 07/30/15 20:01, Stephen Hemminger wrote:
> On Thu, 30 Jul 2015 19:50:27 +0300
> Vlad Zolotarov wrote:
>
>> On 07/30/15 19:20, Avi Kivity wrote:
>>>
>>> On 07/30/2015 07:17 PM, Stephen Hemminger wrote:
>>>> On Thu, 30 Jul 2015 17:57:33 +0300
>>>> Vlad Zolotarov wrote:
>>>>
>>>>> Hi, Konstantin, Helin,
>>>>> there is a documented limitation of the xl710 controllers (i40e
>>>>> driver) which is not handled in any way by the DPDK driver.
>>>>> From the datasheet, chapter 8.4.1:
>>>>>
>>>>> "• A single transmit packet may span up to 8 buffers (up to 8 data
>>>>> descriptors per packet, including both the header and payload
>>>>> buffers).
>>>>> • The total number of data descriptors for the whole TSO (explained
>>>>> later on in this chapter) is unlimited as long as each segment
>>>>> within the TSO obeys the previous rule (up to 8 data descriptors
>>>>> per segment for both the TSO header and the segment payload
>>>>> buffers)."
>>>>>
>>>>> This means that, for instance, a long cluster with small fragments
>>>>> has to be linearized before it may be placed on the HW ring.
>>>>> In more standard environments like the Linux or FreeBSD drivers the
>>>>> solution is straightforward - call skb_linearize()/m_collapse(),
>>>>> respectively.
>>>>> In the non-conformist environment like DPDK, life is not that easy -
>>>>> there is no easy way to collapse the cluster into a linear buffer
>>>>> from inside the device driver, since the device driver doesn't
>>>>> allocate memory on the fast path and uses only the pools allocated
>>>>> by the user.
>>>>>
>>>>> Here are two proposals for a solution:
>>>>>
>>>>>     1. We may provide a callback that would return TRUE if a given
>>>>>        cluster has to be linearized, and it should always be called
>>>>>        before rte_eth_tx_burst(). Alternatively it may be called
>>>>>        from inside rte_eth_tx_burst(), and rte_eth_tx_burst() is
>>>>>        changed to return an error code for the case when one of the
>>>>>        clusters it is given has to be linearized.
>>>>>     2. Another option is to allocate a mempool in the driver with
>>>>>        elements consuming a single page each (standard 2KB buffers
>>>>>        would do). The number of elements in the pool should be the
>>>>>        Tx ring length multiplied by "64KB/(linear data length of the
>>>>>        buffer in the pool above)". Here I use 64KB as the maximum
>>>>>        packet length and do not take into account esoteric things
>>>>>        like the "Giant" TSO mentioned in the spec above. Then we may
>>>>>        actually go and linearize the cluster if needed on top of the
>>>>>        buffers from the pool above, post the buffer from that
>>>>>        mempool on the HW ring, link the original cluster to the new
>>>>>        one (using the private data) and release it when the send is
>>>>>        done.
>>>> Or just silently drop heavily scattered packets (and increment
>>>> oerrors) with a PMD_TX_LOG debug message.
>>>>
>>>> I think a DPDK driver doesn't have to accept all possible mbufs and
>>>> do extra work. It seems reasonable to expect the caller to be well
>>>> behaved in this restricted ecosystem.
>>>>
>>> How can the caller know what's well behaved? It's device dependent.
>> +1
>>
>> Stephen, how do you imagine this well-behaved application? Having a
>> switch-case on the underlying device type and then "well-behaving"
>> accordingly? Not to mention that to "well-behave" the application
>> writer has to read the HW specs and understand them, which would limit
>> the number of DPDK developers to a very small group of people... ;)
>> Not to mention that the switch-case mentioned above would be a super
>> ugly thing to find in an application, and it would raise a big question
>> about the justification of DPDK's existence as an SDK providing a
>> device driver interface. ;)
> Either have a RTE_MAX_MBUF_SEGMENTS

And what would it be in our case? 8? This would limit the maximum TSO
packet to 16KB for 2KB buffers.

> that is global or
> a mbuf_linearize function? Driver already can stash the
> mbuf pool used for Rx and reuse it for the transient Tx buffers.

First of all, who can guarantee that that pool would meet our needs -
namely, have large enough buffers? Secondly, using the user's Rx mempool
for that would be really not nice (read - dirty) towards the user, who may
have allocated a specific number of buffers in it according to
calculations that didn't include any usage from the Tx flow. And lastly,
and most importantly, this would require atomic operations when accessing
the Rx mempool, which would both require a specific mempool initialization
and significantly hit the performance.
>
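
For reference, here is a minimal sketch of the kind of check described in
proposal 1 - deciding whether an mbuf chain would run into the
8-descriptor rule. The macro I40E_TX_MAX_SEG and the function name
i40e_pkt_needs_linearization() are made up for this example and are not
part of the existing i40e PMD; the TSO branch is only a rough
approximation of the per-segment rule from the datasheet.

    /*
     * Illustrative sketch only: report whether an mbuf chain could
     * violate the xl710 limit of 8 data descriptors per packet
     * (per TSO segment).  Not part of the current i40e PMD.
     */
    #include <stdbool.h>
    #include <rte_mbuf.h>

    #define I40E_TX_MAX_SEG 8  /* datasheet 8.4.1: up to 8 descriptors */

    static bool
    i40e_pkt_needs_linearization(const struct rte_mbuf *m)
    {
            const struct rte_mbuf *seg;

            if (!(m->ol_flags & PKT_TX_TCP_SEG))
                    /* Non-TSO: the whole chain must fit in 8 descriptors. */
                    return m->nb_segs > I40E_TX_MAX_SEG;

            /*
             * TSO: each MSS-worth of payload (plus the headers) must fit
             * in 8 descriptors.  Rough approximation: flag the chain if
             * any run of 8 consecutive buffers carries less than one MSS,
             * i.e. a single segment could end up spread over more than 8
             * buffers.  A production check would also account for the
             * header descriptors and for buffers shared by two segments.
             */
            for (seg = m; seg != NULL; seg = seg->next) {
                    const struct rte_mbuf *it = seg;
                    uint32_t bytes = 0;
                    uint16_t bufs = 0;

                    while (it != NULL && bufs < I40E_TX_MAX_SEG) {
                            bytes += it->data_len;
                            bufs++;
                            it = it->next;
                    }
                    if (bufs == I40E_TX_MAX_SEG && bytes < m->tso_segsz)
                            return true;
            }
            return false;
    }

An application could call such a helper before rte_eth_tx_burst() and
linearize or drop the offending chains; alternatively, the same logic
could live inside the PMD and be reported through an error code, as the
first proposal suggests.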