From: "Ananyev, Konstantin"
To: Adrien Mazarguil
CC: Thomas Monjalon, dev@dpdk.org, Rahul Lakkireddy, Stephen Hurd,
 Jan Medala, Jakub Palider, John Daley, Alejandro Lucero, Harish Patil,
 Rasesh Mody, Jerin Jacob, Yuanhan Liu, Yong Wang, "Kulasek, TomaszX",
 olivier.matz@6wind.com
Subject: Re: [dpdk-dev] [PATCH v12 0/6] add Tx preparation
Date: Mon, 5 Dec 2016 16:43:52 +0000

Hi Adrien,

> Hi Konstantin,
>
> On Fri, Dec 02, 2016 at 01:00:55AM +0000, Ananyev, Konstantin wrote:
> [...]
> > > On Wed, Nov 30, 2016 at 10:54:50AM +0000, Ananyev, Konstantin wrote:
> > > [...]
> > > > > Something is definitely needed here, and only PMDs can provide it.
> > > > > I think applications should not have to clear checksum fields or
> > > > > initialize them to some magic value, same goes for any other
> > > > > offload or hardware limitation that needs to be worked around.
> > > > >
> > > > > tx_prep() is one possible answer to this issue, however as
> > > > > mentioned in the original patch it can be very expensive if
> > > > > exposed by the PMD.
> > > > >
> > > > > Another issue I'm more concerned about is the way limitations are
> > > > > managed (struct rte_eth_desc_lim). While not officially tied to
> > > > > tx_prep(), this structure contains new fields that are only
> > > > > relevant to a few devices, and I fear it will keep growing with
> > > > > each new hardware quirk to manage, breaking ABIs in the process.
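
As an aside, for reference: with the new fields from this series, the
structure in question looks roughly like below (a sketch from memory, not
copied verbatim from the patch; comments mine):

	struct rte_eth_desc_lim {
		uint16_t nb_max;         /**< Max allowed number of descriptors. */
		uint16_t nb_min;         /**< Min allowed number of descriptors. */
		uint16_t nb_align;       /**< Descriptor count alignment. */
		uint16_t nb_seg_max;     /**< Max number of segments per whole packet. */
		uint16_t nb_mtu_seg_max; /**< Max number of segments per one MTU. */
	};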
> > > >
> > > > Well, if some new HW capability/limitation would arise and we'd like
> > > > to support it in DPDK, then yes we probably would need to think how
> > > > to incorporate it here.
> > > > Do you have anything particular in mind here?
> > >
> > > Nothing in particular, so for the sake of the argument, let's suppose
> > > that I would like to add a field to expose some limitation that only
> > > applies to my PMD during TX but looks generic enough to make sense,
> > > e.g. maximum packet size when VLAN tagging is requested.
> >
> > Hmm, I didn't hear about such limitations so far, but if it is a real
> > case - sure, feel free to submit the patch.
>
> I won't, that was hypothetical.

Then why are we discussing it? :)

> > > PMDs are free to set that field to some special value (say, 0) if
> > > they do not care.
> > >
> > > Since that field exists however, conscious applications should check
> > > its value for each packet that needs to be transmitted. This extra
> > > code causes a slowdown just by sitting in the data path. Since it is
> > > not the only field in that structure, the performance impact can be
> > > significant.
> > >
> > > Even though this code is inside applications, it remains unfair to
> > > PMDs for which these tests are irrelevant. This problem is identified
> > > and addressed by tx_prepare().
> >
> > I suppose the question is why do we need:
> > uint16_t nb_seg_max;
> > uint16_t nb_mtu_seg_max;
> > as we now have tx_prepare(), right?
> >
> > For two reasons:
> > 1. Some people might feel that tx_prepare() is not good (smart/fast)
> > enough for them and would prefer to do the necessary preparations for
> > TX offloads themselves.
> >
> > 2. Even if people do use tx_prepare() they still should take this
> > information into account.
> > As an example, ixgbe can't TX packets with more than 40 segments.
> > Obviously ixgbe_tx_prep() performs that check and returns an error.
>
> Problem is that tx_prepare() also provides safeties which are not part of
> tx_burst(), such as not going over nb_mtu_seg_max. Because of this and
> the fact struct rte_eth_desc_lim can grow new fields anytime, application
> developers will be tempted to just call tx_prepare() and focus on more
> useful things.

No problem with that, that was the intention behind introducing it.

> Put another way, from a user's point of view, tx_prepare() is an opaque
> function that greatly increases tx_burst()'s ability to send mbufs as
> requested, with extra error checking on top; applications not written to
> run on a specific PMD/device (all of them ideally) will thus call
> tx_prepare() at some point.
>
> > But it wouldn't try to merge/reallocate mbufs for you.
> > User still has to do it himself, or just prevent creating such long
> > chains somehow.
>
> Yes, that's another debate. PMDs could still implement a software
> fallback for unlikely slow events like these. The number of PMDs is not
> going to decrease, each device having its own set of weird limitations in
> specific cases, so PMDs should do their best to process mbufs even if
> that means doing it slowly due to the lack of preparation.
>
> tx_prepare() has its uses but should really be optional, in the sense
> that if that function is not called, tx_burst() should deal with it
> somehow.

As I said before, I don't think it is a good idea to put everything into
tx_burst(). If a PMD prefers things that way, then yes, its tx_burst() can
deal with each and every possible offload requirement itself, but it
shouldn't be mandatory.
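
To illustrate the division of labour I have in mind, a conscious
application would simply do something like below (a sketch only, assuming
the rte_eth_tx_prepare() name proposed in this series; port_id, queue_id
and pkts[] are as usual for tx_burst(); handle_bad_packet() is a
hypothetical application callback, error handling simplified):

	uint16_t nb_prep, nb_tx;

	nb_prep = rte_eth_tx_prepare(port_id, queue_id, pkts, nb_pkts);
	if (nb_prep < nb_pkts) {
		/* pkts[nb_prep] failed a check (e.g. it exceeds
		 * nb_seg_max); rte_errno holds the reason. */
		handle_bad_packet(pkts[nb_prep]);
	}

	/* Transmit the packets that passed preparation. */
	nb_tx = rte_eth_tx_burst(port_id, queue_id, pkts, nb_prep);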
>
> > > Thanks to tx_prepare(), these checks are moved back into PMDs where
> > > they belong. PMDs that do not need them do not have to provide support
> > > for tx_prepare() and do not suffer any performance impact as a result;
> > > applications only have to make sure tx_prepare() is always called at
> > > some point before tx_burst().
> > >
> > > Once you reach this stage, you've effectively made tx_prepare()
> > > mandatory before tx_burst(). If some bug occurs, then perhaps you
> > > forgot to call tx_prepare(), you just need to add it. The total cost
> > > for doing TX is therefore tx_prepare() + tx_burst().
> > >
> > > I'm perhaps a bit pessimistic mind you, but I do not think
> > > tx_prepare() will remain optional for long. Sure, PMDs that do not
> > > implement it do not care, I'm focusing on applications, for which the
> > > performance impact of calling tx_prepare() followed by tx_burst() is
> > > higher than a single tx_burst() performing all the necessary
> > > preparation at once.
> > >
> > > [...]
> > > > > Following the same logic, why can't such a thing be made part of
> > > > > the TX burst function as well (through a direct call to
> > > > > rte_phdr_cksum_fix() whenever necessary). From an application
> > > > > standpoint, what are the advantages of having to:
> > > > >
> > > > >  if (tx_prep()) // iterate and update mbufs as needed
> > > > >          tx_burst(); // iterate and send
> > > > >
> > > > > Compared to:
> > > > >
> > > > >  tx_burst(); // iterate, update as needed and send
> > > >
> > > > I think that was discussed extensively quite a lot previously here:
> > > > As Thomas already replied - the main motivation is to allow the user
> > > > to execute them on different stages of the packet TX pipeline,
> > > > and probably on different cores.
> > > > I think that provides better flexibility to the user as to
> > > > when/where to do these preparations and hopefully would lead to
> > > > better performance.
> > >
> > > And I agree, I think this use case is valid but does not warrant such
> > > a high penalty when your application does not need that much
> > > flexibility. Simple (yet conscious) applications need the highest
> > > performance. Complex ones as you described already suffer quite a bit
> > > from IPCs and won't mind a couple of extra CPU cycles, right?
> >
> > It would mean an extra cache-miss for every packet, so I think the
> > performance hit would be quite significant.
>
> A performance hit has to occur somewhere regardless, because something
> has to be done in order to send packets that need it. Whether this cost
> is in application code or in a PMD function, it remains part of TX.

Depending on the place, the final cost would differ quite a lot.
If you call tx_prepare() somewhere close to the place where you fill the
packet header contents, then most likely the data that tx_prepare() has to
access will already be in the cache, so the performance penalty will be
minimal.
If you try to access the same data later (at tx_burst()), the chance that
it is still in cache is much lower.
If you call tx_burst() from another core, the data will for sure be out of
cache, and even worse, it can still sit in another core's cache.
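
In other words, the split lets the application keep tx_prepare() at the
stage where the data is already hot. A rough sketch of such a pipeline
(build_headers(), tx_ring and MAX_BURST are hypothetical application-side
names, ring handling simplified; this assumes tx_prepare() may be called
from a different core than tx_burst(), which is the stated motivation):

	/* Worker core: headers were just written, so the cache lines
	 * tx_prepare() needs are still warm. */
	build_headers(pkts, n);
	n = rte_eth_tx_prepare(port_id, queue_id, pkts, n);
	rte_ring_enqueue_burst(tx_ring, (void **)pkts, n);

	/* TX core: nothing left to do that touches the packet data. */
	n = rte_ring_dequeue_burst(tx_ring, (void **)pkts, MAX_BURST);
	rte_eth_tx_burst(port_id, queue_id, pkts, n);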
>
> > About the 'simple' case when tx_prep() and tx_burst() are called on the
> > same core, why do you believe that:
> > tx_prep(); tx_burst(); would be much slower than
> > tx_burst() {tx_prep(), ...}?
>
> I mean instead of two function calls with their own loops:
>
>  tx_prepare() { foreach (pkt) { check(); extra_check(); ... } }
>
>  tx_burst() { foreach (pkt) { check(); stuff(); ... } }
>
> You end up with one:
>
>  tx_burst() { foreach (pkt) { check(); extra_check(); stuff(); ... } }
>
> Which usually is more efficient.

I really doubt that.
If that were the case, what would be the point of processing packets in
bulk? Usually dividing processing into different stages, and at each stage
processing multiple packets at once, helps to improve performance, at
least on IA.
Look for example at how we had to change l3fwd to improve its performance.

>
> > tx_prep() itself is quite expensive; let's say for Intel HW it
> > includes:
> > - read mbuf fields (2 cache-lines),
> > - read packet header (1/2 cache-lines),
> > - calculate pseudo-header csum,
> > - update packet header.
> > Compared to that, the price of an extra function call seems negligible
> > (if we TX packets in bursts of course).
>
> We agree its performance is a critical issue then, sharing half the read
> steps with tx_burst() would make sense to me.

I didn't understand that sentence.

> > > Yes they will, therefore we need a method that satisfies both cases.
> > >
> > > As a possible solution, a special mbuf flag could be added to each
> > > mbuf having gone through tx_prepare(). That way, tx_burst() could
> > > skip some checks and things it would otherwise have done.
> >
> > That's an interesting idea, but it has one drawback:
> > As I understand it, it means that from now on, if a user does the
> > preparations on his own, he has to set this flag, otherwise tx_burst()
> > would do extra unnecessary work.
> > So any existing applications that use TX offloads and do the
> > preparation by themselves would have to be modified to avoid a
> > performance loss.
>
> In my opinion, users should not do preparation on their own.

People already do it now.

> If we provide a generic method, it has to be fast enough to replace
> theirs. Perhaps not as fast since it would work with all PMDs (usual
> trade-off), but acceptably so.
>
> > > Another possibility, telling the PMD first that you always intend to
> > > use tx_prepare() and getting a simpler/faster tx_burst() callback as
> > > a result.
> >
> > That's what we have right now (at least for Intel HW):
> > it is the user's responsibility to do the necessary preparations/checks
> > before calling tx_burst().
> > With tx_prepare() we just remove from the user the headache of
> > implementing tx_prepare() on his own.
> > Now he can use a 'proper' PMD-provided function.
> >
> > My vote still would be for that model.
>
> OK, then in a nutshell:
>
> 1. Users are not expected to perform preparation/checks themselves
>    anymore, if they do, it's their problem.

I think we need to be backward compatible here.
If an existing app is doing what tx_prepare() is supposed to do, it should
keep working.

> 2. If configured through an API to be defined, tx_burst() can be split in
>    two and applications must call tx_prepare() at some point before
>    tx_burst().
>
> 3. Otherwise tx_burst() should perform the necessary preparation and
>    checks on its own by default (when tx_prepare() is not expected).

As I said before, I don't think it should be mandatory for tx_burst() to
do what tx_prepare() does.
If some particular implementation of tx_burst() prefers to do things that
way, that's fine, but it shouldn't be required to.

> 4. We probably still need some mbuf flag to mark mbufs that cannot be
>    modified, the refcount could also serve as a hint.

If an mbuf can't be modified, you probably just wouldn't call the function
that is supposed to modify it, tx_prepare() in that case.

Konstantin
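
PS: since the pseudo-header csum step in the cost breakdown above may not
be obvious to everyone, for an IPv4/TCP packet it amounts to roughly the
following (a simplified, single-segment sketch of what a tx_prep()
implementation has to do for Intel HW, using the existing rte_ip.h helper;
m is the mbuf, and PKT_TX_IPV4 | PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM are
assumed to be already set in m->ol_flags):

	struct ipv4_hdr *ip;
	struct tcp_hdr *tcp;

	ip = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *, m->l2_len);
	tcp = rte_pktmbuf_mtod_offset(m, struct tcp_hdr *,
			m->l2_len + m->l3_len);

	ip->hdr_checksum = 0;
	/* HW expects the pseudo-header csum in the TCP csum field
	 * and computes the rest itself on transmit. */
	tcp->cksum = rte_ipv4_phdr_cksum(ip, m->ol_flags);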