From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mga04.intel.com (mga04.intel.com [192.55.52.120]) by dpdk.org (Postfix) with ESMTP id 43E1E106A for ; Mon, 12 Dec 2016 12:52:48 +0100 (CET) Received: from orsmga003.jf.intel.com ([10.7.209.27]) by fmsmga104.fm.intel.com with ESMTP; 12 Dec 2016 03:52:47 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.33,336,1477983600"; d="scan'208";a="911234895" Received: from irsmsx106.ger.corp.intel.com ([163.33.3.31]) by orsmga003.jf.intel.com with ESMTP; 12 Dec 2016 03:52:46 -0800 Received: from irsmsx105.ger.corp.intel.com ([169.254.7.43]) by IRSMSX106.ger.corp.intel.com ([169.254.8.112]) with mapi id 14.03.0248.002; Mon, 12 Dec 2016 11:51:32 +0000 From: "Ananyev, Konstantin" To: "Kulasek, TomaszX" , Olivier Matz CC: Thomas Monjalon , "dev@dpdk.org" Thread-Topic: [dpdk-dev] [PATCH v12 1/6] ethdev: add Tx preparation Thread-Index: AQHSRbBXtZ+k3KL6/UOfLLeqKZ5NC6DzVKAAgACPLdCAAHvNgIAAf21AgAmFZoCAAZDxAIAESfRg Date: Mon, 12 Dec 2016 11:51:31 +0000 Message-ID: <2601191342CEEE43887BDE71AB9772583F0E689E@irsmsx105.ger.corp.intel.com> References: <1477486575-25148-1-git-send-email-tomaszx.kulasek@intel.com> <1479922585-8640-1-git-send-email-tomaszx.kulasek@intel.com> <1479922585-8640-2-git-send-email-tomaszx.kulasek@intel.com> <7834627.cBDVu3uoNi@xps13> <2601191342CEEE43887BDE71AB9772583F0E2AB0@irsmsx105.ger.corp.intel.com> <20161202092425.13752e2f@platinum> <2601191342CEEE43887BDE71AB9772583F0E3405@irsmsx105.ger.corp.intel.com> <20161208182417.32ec3933@platinum> <3042915272161B4EB253DA4D77EB373A14F5A409@IRSMSX102.ger.corp.intel.com> In-Reply-To: <3042915272161B4EB253DA4D77EB373A14F5A409@IRSMSX102.ger.corp.intel.com> Accept-Language: en-IE, en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [163.33.239.181] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Subject: Re: [dpdk-dev] [PATCH v12 1/6] ethdev: add Tx preparation X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 12 Dec 2016 11:52:50 -0000 Hi Olivier and Tomasz, > -----Original Message----- > From: Kulasek, TomaszX > Sent: Friday, December 9, 2016 5:19 PM > To: Olivier Matz ; Ananyev, Konstantin > Cc: Thomas Monjalon ; dev@dpdk.org > Subject: RE: [dpdk-dev] [PATCH v12 1/6] ethdev: add Tx preparation >=20 > Hi Oliver, >=20 > My 5 cents below: >=20 > > -----Original Message----- > > From: Olivier Matz [mailto:olivier.matz@6wind.com] > > Sent: Thursday, December 8, 2016 18:24 > > To: Ananyev, Konstantin > > Cc: Thomas Monjalon ; Kulasek, TomaszX > > ; dev@dpdk.org > > Subject: Re: [dpdk-dev] [PATCH v12 1/6] ethdev: add Tx preparation > > > > Hi Konstantin, > > > > On Fri, 2 Dec 2016 16:17:51 +0000, "Ananyev, Konstantin" > > wrote: > > > Hi Olivier, > > > > > > > -----Original Message----- > > > > From: Olivier Matz [mailto:olivier.matz@6wind.com] > > > > Sent: Friday, December 2, 2016 8:24 AM > > > > To: Ananyev, Konstantin > > > > Cc: Thomas Monjalon ; Kulasek, TomaszX > > > > ; dev@dpdk.org Subject: Re: [dpdk-dev] > > > > [PATCH v12 1/6] ethdev: add Tx preparation > > > > > > > > Hi Konstantin, > > > > > > > > On Fri, 2 Dec 2016 01:06:30 +0000, "Ananyev, Konstantin" > > > > wrote: > > > > > > > > > > > > 2016-11-23 18:36, Tomasz Kulasek: > > > > > > > +/** > > > > > > > + * Process a burst of output packets on a transmit queue of > > > > > > > an Ethernet device. > > > > > > > + * > > > > > > > + * The rte_eth_tx_prepare() function is invoked to prepare > > > > > > > output packets to be > > > > > > > + * transmitted on the output queue *queue_id* of the Etherne= t > > > > > > > device designated > > > > > > > + * by its *port_id*. > > > > > > > + * The *nb_pkts* parameter is the number of packets to be > > > > > > > prepared which are > > > > > > > + * supplied in the *tx_pkts* array of *rte_mbuf* structures, > > > > > > > each of them > > > > > > > + * allocated from a pool created with > > > > > > > rte_pktmbuf_pool_create(). > > > > > > > + * For each packet to send, the rte_eth_tx_prepare() functio= n > > > > > > > performs > > > > > > > + * the following operations: > > > > > > > + * > > > > > > > + * - Check if packet meets devices requirements for tx > > > > > > > offloads. > > > > > > > + * > > > > > > > + * - Check limitations about number of segments. > > > > > > > + * > > > > > > > + * - Check additional requirements when debug is enabled. > > > > > > > + * > > > > > > > + * - Update and/or reset required checksums when tx offload > > > > > > > is set for packet. > > > > > > > + * > > > > > > > + * Since this function can modify packet data, provided mbuf= s > > > > > > > must be safely > > > > > > > + * writable (e.g. modified data cannot be in shared > > > > > > > segment). > > > > > > > > > > > > I think we will have to remove this limitation in next releases= . > > > > > > As we don't know how it could affect the API, I suggest to > > > > > > declare this API EXPERIMENTAL. > > > > > > > > > > While I don't really mind to mart it as experimental, I don't > > > > > really understand the reasoning: Why " this function can modify > > > > > packet data, provided mbufs must be safely writable" suddenly > > > > > becomes a problem? That seems like and obvious limitation to me > > > > > and let say tx_burst() has the same one. Second, I don't see how > > > > > you are going to remove it without introducing a heavy performanc= e > > > > > impact. Konstantin > > > > > > > > About tx_burst(), I don't think we should force the user to provide > > > > a writable mbuf. There are many use cases where passing a clone > > > > already works as of today and it avoids duplicating the mbuf data. > > > > For instance: traffic generator, multicast, bridging/tap, etc... > > > > > > > > Moreover, this requirement would be inconsistent with the model you > > > > are proposing in case of pipeline: > > > > - tx_prepare() on core X, may update the data > > > > - tx_burst() on core Y, should not touch the data to avoid cache > > > > misses > > > > > > Probably I wasn't very clear in my previous mail. > > > I am not saying that we should force the user to pass a writable mbuf= . > > > What I am saying that for tx_burst() current expectation is that afte= r > > > mbuf is handled to tx_burst() user shouldn't try to modify its buffer > > > contents till TX engine is done with the buffer (mbuf_free() is calle= d > > > by TX func for it). For tx_prep(), I think, it is the same though > > > restrictions are a bit more strict: user should not try to read/write > > > to the mbuf while tx_prep() is not finished with it. What puzzles me > > > is that why that should be the reason to mark tx_prep() as > > > experimental. Konstantin > > > > To be sure we're on the same page, let me reword: > > > > - mbufs passed to tx_prepare() by the application must have their > > headers (l2_len + l3_len + l4_len) writable because the phdr checksum > > can be replaced. It could be precised in the api comment. > > > > - mbufs passed to tx_burst() must not be modified by the driver/hw, nor > > by the application. Yes, agree with both. > > > > > > About the API itself, I have one more question. I know you've already > > discussed this a bit with Adrien, I don't want to spawn a new big threa= d > > from here ;) > > > > The API provides tx_prepare() to check the packets have the proper form= at, > > and possibly modify them (ex: csum offload / tso) to match hw > > requirements. So it does both checks (number of segments) and fixes > > (csum/tso). What determines things that should be checked and things th= at > > should be fixed? > > Just to echo Tomasz detailed reply below: I think right now tx_prepare() doesn't make any fixing. It just: checks that packet satisfies PMD/HW requirements, if not returns an error without trying to fix anything/ If the packet seems valid - calculates and fills L4 cksum to fulfill HW re= qs .=20 About the question what should be checked - as usual there is a tradeoff be= tween=20 additional checks and performance. As Tomasz stated below right now in non-debug mode we do checks only for th= ings that can have a critical impact on HW itself (tx hang, etc.). All extra checks are implemented only in DEBUG mode right now. =20 >=20 > 1) Performance -- we may assume that packets are created with the common = rules (e.g. doesn't tries to count IP checksum for IPv6 > packet, sets all required fields, etc.). For now, additional checks are d= one only in DEBUG mode. >=20 > 2) Uncommon requirements, such a number of segments for TSO/non-TSO, wher= e invalid values can cause unexpected results (or > even hang device on burst), so it is critical (at least for ixgbe and i40= e) and must be checked. >=20 > 3) Checksum field must be initialized in a hardware specific way to count= it properly. >=20 > 4) If packet is invalid its content isn't modified nor restored. >=20 > > The application gets few information from tx_prepare() about what shoul= d > > be done to make the packet accepted by the hw, and the actions will > > probably be different depending on hardware. That's true. I am open to suggestions how in future to provide extra information to the = upper layer. Set rte_errno to different values depending on type of error, OR extra parameter in tx_prepare() that will provide more detailed error in= formation, OR something else? > So could we imagine that in > > the future the function also tries to fix the packet? I've seen your > > comment saying that it has to be an application decision, so what about > > having a parameter saying "fix the packet" or "don't fix it"? > > While I am not directly opposed to that idea, I am a bit skeptical that it could be implemented in useful and effective way. =20 As I said before it may be fixed in different ways: coalesce mbufs, split mbuf chain into multiple ones, might be some sort of combination of these 2 approaches. Again it seems a bit ineffective from performance point of view: form packet first then check and re-adjust it again. Seems more plausible to have it formed in a right way straightway. Also it would mean that we'll need to pass to tx_prepare() information about which mempool to use for new mbufs, probably some extra information.= =20 That's why I think it might be better to do a proper (re)formatting of the = packet before passing it down to tx_prepare(). Might be some sort of generic helper function in rte_net that would use inf= ormation from rte_eth_desc_lim and could be called by upper layer before tx_prepare(= ). But if you think it could be merged into tx_prepare() into some smart and e= ffective way - =20 sure thing, let's look at particular implementation ideas you have.=20 >=20 > The main question here is how invasive in packet data tx_prepare should b= e. >=20 > If I understand you correctly, you're talking about the situation when tx= _prepare will try to restore packet internally, e.g. linearize data > for i40e to meet number of segments requirements, split mbufs, when are t= oo large, maybe someone will wanted to have full > software checksum computation fallback for devices which doesn't support = it, and so on? And then this parameter will indicate "do > the required work for me", or "just check, initialize what should be init= ialized, if fails, let me decide"? >=20 > Implemented model is quite simple and clear: >=20 > 1) Validate uncommon requirements which may cause a problem and let appli= cation to deal with it. > 2) If packet is valid, initialize required fields according to the hardwa= re requirements. >=20 > In fact, it doesn't fix anything for now, but initializes checksums, and = the main reason why it is done here, is the fact that it cannot be > done in a trivial way in the application itself. >=20 > IMHO, we should to keep this API as simple as it possible and not let it = grow in unexpected way (also for performance reason). >=20 > > About rte_eth_desc_lim->nb_seg_max and > > rte_eth_desc_lim->nb_mtu_seg_max, I'm still quite reserved, especially = for > > the 2nd one, because I don't see how it can be used by the application. > > Well, it does not hurt to have them, but for me it looks a bit useless. > > >=20 > * "nb_seg_max": Maximum number of segments per whole packet. > * "nb_mtu_seg_max": Maximum number of segments per one MTU. >=20 > Application can use provided values in a fallowed way: >=20 > * For non-TSO packet, a single transmit packet may span up to "nb_mtu_seg= _max" buffers. > * For TSO packet the total number of data descriptors is "nb_seg_max", an= d each segment within the TSO may span up to > "nb_mtu_seg_max". I suppose these fields will be useful. Let say you are about to form and send packets. Right now, how would you know what are HW limitations on number of segments= per packet? =20 >=20 > > Last thing, I think this API should become the default in the future. > > For instance, it would prevent the application to calculate a phdr csum > > that will not be used by the hw. Not calling tx_prepare() would require > > the user/application to exactly knows the underlying hw and the kind of > > packet that are generated. So for me it means we'll need to also update > > other examples (other testpmd engines, l2fwd, ...). Do you agree? > > >=20 > Most of sample applications doesn't use TX offloads at all and doesn't co= unt checksums. So the question is "should tx_prepare be > used even if not required?" I suppose it wouldn't be a big harm to add tx_prepare() into sample apps wh= en appropriate. In most cases (when simple TS path used) it will be NOP anyway. Konstantin