From: "Ananyev, Konstantin"
To: Adrien Mazarguil
Cc: Thomas Monjalon, dev@dpdk.org, Rahul Lakkireddy, Stephen Hurd, Jan Medala,
 Jakub Palider, John Daley, Alejandro Lucero, Harish Patil, Rasesh Mody,
 Jerin Jacob, Yuanhan Liu, Yong Wang, "Kulasek, TomaszX", olivier.matz@6wind.com
Date: Tue, 6 Dec 2016 10:56:26 +0000
Subject: Re: [dpdk-dev] [PATCH v12 0/6] add Tx preparation

Hi Adrien,

> On Mon, Dec 05, 2016 at 04:43:52PM +0000, Ananyev, Konstantin wrote:
> [...]
> > > On Fri, Dec 02, 2016 at 01:00:55AM +0000, Ananyev, Konstantin wrote:
> > > [...]
> > > > > On Wed, Nov 30, 2016 at 10:54:50AM +0000, Ananyev, Konstantin wrote:
> > > > > [...]
> > > > > > Do you have anything particular in mind here?
> > > > >
> > > > > Nothing in particular, so for the sake of the argument, let's suppose that I
> > > > > would like to add a field to expose some limitation that only applies to my
> > > > > PMD during TX but looks generic enough to make sense, e.g. maximum packet
> > > > > size when VLAN tagging is requested.
> > > >
> > > > Hmm, I didn't hear about such limitations so far, but if it is a real case -
> > > > sure, feel free to submit the patch.
> > >
> > > I won't, that was hypothetical.
> >
> > Then why are we discussing it? :)
>
> Just to make a point, which is that new limitations may appear anytime and
> tx_prepare() can now be used to check for them. First patch of the series
> does it:
>
> +	uint16_t nb_seg_max;     /**< Max number of segments per whole packet. */
> +	uint16_t nb_mtu_seg_max; /**< Max number of segments per one MTU */
>
> And states that:
>
> + * For each packet to send, the rte_eth_tx_prepare() function performs
> + * the following operations:
> + *
> + * - Check if packet meets devices requirements for tx offloads.
> + *
> + * - Check limitations about number of segments.
> + *
> + * - Check additional requirements when debug is enabled.
> + *
> + * - Update and/or reset required checksums when tx offload is set for packet.

I think I already explained in my previous email why I think that
nb_seg_max and nb_mtu_seg_max are not redundant because of tx_prepare().
From my point they are a complement to tx_prepare():
even if people do use tx_prepare() they still should take this information into account.
As an example, ixgbe can't TX packets with more than 40 segments.
tx_prepare() for ixgbe will flag that issue, but it can't make a decision on the
user's behalf about what to do in that case: drop the packet, try to coalesce it
into a packet with a smaller number of segments, split the packet into several
smaller ones, etc.
That's up to the user to decide, and to make that decision, the user might need this information.

> It's like making this function mandatory IMO.

That's probably where the confusion starts: I don't think that tx_prepare()
should be mandatory for the user to call.
Yes, it should be the recommended way.
But the user should still have the ability to bypass it, if he believes there is
no need for it, or he prefers to implement the same functionality on his own.
As an example, if the user knows that he is going to send a group of
one-segment packets that don't require any tx offloads, he can safely skip
tx_prepare() for them.

> > > > > PMDs are free to set that field to some
> > > > > special value (say, 0) if they do not care.
> > > > >
> > > > > Since that field exists however, conscious applications should check its
> > > > > value for each packet that needs to be transmitted. This extra code causes a
> > > > > slowdown just by sitting in the data path. Since it is not the only field in
> > > > > that structure, the performance impact can be significant.

A conscious user will probably use this information at the stage of packet formation.
He probably has to do this sort of thing for large packets anyway:
check what the underlying MTU is, to decide whether he needs to split the packet,
enable TSO for it, etc.
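To illustrate, here is a minimal sketch of such an application-level check done at
packet-formation time. It assumes the nb_seg_max field added to struct
rte_eth_desc_lim by this series and the 16.11-era ethdev API; the function names
and the drop policy are purely illustrative:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Queried once at init time; nb_seg_max is the field added by this series. */
static uint16_t tx_seg_max;

static void
app_query_tx_limits(uint8_t port_id)
{
	struct rte_eth_dev_info dev_info;

	rte_eth_dev_info_get(port_id, &dev_info);
	tx_seg_max = dev_info.tx_desc_lim.nb_seg_max; /* 0 = no limit reported */
}

/*
 * Called while forming packets. Dropping is only one possible policy;
 * a real application might coalesce or split the chain instead - exactly
 * the decision tx_prepare() cannot make on the user's behalf.
 */
static struct rte_mbuf *
app_enforce_seg_limit(struct rte_mbuf *m)
{
	if (tx_seg_max != 0 && m->nb_segs > tx_seg_max) {
		rte_pktmbuf_free(m);
		return NULL;
	}
	return m;
}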
> > > > > Even though this code is inside applications, it remains unfair to PMDs for
> > > > > which these tests are irrelevant. This problem is identified and addressed
> > > > > by tx_prepare().
> > > >
> > > > I suppose the question is why do we need:
> > > > uint16_t nb_seg_max;
> > > > uint16_t nb_mtu_seg_max;
> > > > as we now have tx_prepare(), right?
> > > >
> > > > For two reasons:
> > > > 1. Some people might feel that tx_prepare() is not good (smart/fast) enough
> > > > for them and would prefer to do the necessary preparations for TX offloads themselves.
> > > >
> > > > 2. Even if people do use tx_prepare() they still should take this information into account.
> > > > As an example ixgbe can't TX packets with more than 40 segments.
> > > > Obviously ixgbe_tx_prep() performs that check and returns an error.
> > >
> > > Problem is that tx_prepare() also provides safeties which are not part of
> > > tx_burst(), such as not going over nb_mtu_seg_max. Because of this and the
> > > fact struct rte_eth_desc_lim can grow new fields anytime, application
> > > developers will be tempted to just call tx_prepare() and focus on more
> > > useful things.
> >
> > NP with that, that was the intention behind introducing it.
> >
> > > Put another way, from a user's point of view, tx_prepare() is an opaque
> > > function that greatly increases tx_burst()'s ability to send mbufs as
> > > requested, with extra error checking on top; applications not written to run
> > > on a specific PMD/device (all of them ideally) will thus call tx_prepare()
> > > at some point.
> > >
> > > > But it wouldn't try to merge/reallocate mbufs for you.
> > > > The user still has to do it himself, or just prevent creating such long chains somehow.
> > >
> > > Yes, that's another debate. PMDs could still implement a software fallback
> > > for unlikely slow events like these. The number of PMDs is not going to
> > > decrease, each device having its own set of weird limitations in specific
> > > cases, PMDs should do their best to process mbufs even if that means doing so
> > > slowly due to the lack of preparation.
> > >
> > > tx_prepare() has its uses but should really be optional, in the sense that
> > > if that function is not called, tx_burst() should deal with it somehow.
> >
> > As I said before, I don't think it is a good idea to put everything into tx_burst().
> > If a PMD driver prefers things that way, yes, tx_burst() can deal with each and
> > every possible offload requirement itself, but it shouldn't be mandatory.
>
> In effect, having to call tx_prepare() otherwise makes this step mandatory
> anyway.

Looks like we are not going to agree here.

> > > > > Thanks to tx_prepare(), these checks are moved back into PMDs where they
> > > > > belong. PMDs that do not need them do not have to provide support for
> > > > > tx_prepare() and do not suffer any performance impact as a result;
> > > > > applications only have to make sure tx_prepare() is always called at some
> > > > > point before tx_burst().
> > > > >
> > > > > Once you reach this stage, you've effectively made tx_prepare() mandatory
> > > > > before tx_burst(). If some bug occurs, then perhaps you forgot to call
> > > > > tx_prepare(), you just need to add it. The total cost for doing TX is
> > > > > therefore tx_prepare() + tx_burst().
> > > > >
> > > > > I'm perhaps a bit pessimistic mind you, but I do not think tx_prepare() will
> > > > > remain optional for long. Sure, PMDs that do not implement it do not care,
> > > > > I'm focusing on applications, for which the performance impact of calling
> > > > > tx_prepare() followed by tx_burst() is higher than a single tx_burst()
> > > > > performing all the necessary preparation at once.
> > > > >
> > > > > [...]
> > > > > > > Following the same logic, why can't such a thing be made part of the TX
> > > > > > > burst function as well (through a direct call to rte_phdr_cksum_fix()
> > > > > > > whenever necessary)?
> > > > > > > From an application standpoint, what are the advantages
> > > > > > > of having to:
> > > > > > >
> > > > > > >  if (tx_prep()) // iterate and update mbufs as needed
> > > > > > >      tx_burst(); // iterate and send
> > > > > > >
> > > > > > > Compared to:
> > > > > > >
> > > > > > >  tx_burst(); // iterate, update as needed and send
> > > > > >
> > > > > > I think that was discussed quite extensively here previously:
> > > > > > As Thomas already replied - the main motivation is to allow the user
> > > > > > to execute them at different stages of the packet TX pipeline,
> > > > > > and probably on different cores.
> > > > > > I think that provides better flexibility to the user as to when/where
> > > > > > to do these preparations and hopefully would lead to better performance.
> > > > >
> > > > > And I agree, I think this use case is valid but does not warrant such a high
> > > > > penalty when your application does not need that much flexibility. Simple
> > > > > (yet conscious) applications need the highest performance. Complex ones as
> > > > > you described already suffer quite a bit from IPCs and won't mind a couple
> > > > > of extra CPU cycles, right?
> > > >
> > > > It would mean an extra cache miss for every packet, so I think the performance
> > > > hit would be quite significant.
> > >
> > > A performance hit has to occur somewhere regardless, because something has
> > > to be done in order to send packets that need it. Whether this cost is in
> > > application code or in a PMD function, it remains part of TX.
> >
> > Depending on the place, the final cost would differ quite a lot.
> > If you call tx_prepare() somewhere close to the place where you fill the packet header
> > contents, then most likely the data that tx_prepare() has to access will already be in the cache.
> > So the performance penalty will be minimal.
> > If you try to access the same data later (at tx_burst), then the possibility that it would still
> > be in cache is much lower.
> > If you call tx_burst() from another core, then the data would for sure be out of cache,
> > and even worse, could still be in another core's cache.
>
> Well sure, that's why I also think tx_prepare() has its uses, only that
> since tx_prepare() is optional, tx_burst() should provide the same
> functionality when tx_prepare() is not called.

As I understand, to implement what you are proposing (a TX_PREPARED mbuf->ol_flags bit)
it would be required to:
a) Modify all existing applications that do tx_prepare()-like work on their own,
otherwise they would hit a performance penalty.
b) Modify at least all Intel PMDs and maybe some others too (vmxnet3?).
Step b) probably wouldn't cause any significant performance impact straight away,
but it for sure wouldn't make things faster, and it would increase tx_burst() code
complexity quite a lot.
From the other side, I can't see any real benefit that we would get in return.
So I am still opposed to that idea.
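For reference, a minimal sketch of the two-stage flow discussed above - tx_prepare()
close to where the headers are filled, tx_burst() later and possibly on another
core - written against the API as proposed in this series (16.11-era uint8_t port
id); the drop-on-error policy and function names are only illustrative:

#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_errno.h>

/*
 * Stage 1: run right after the application has filled the packet headers,
 * while they are still hot in this core's cache. Returns the number of
 * packets that passed preparation.
 */
static uint16_t
app_stage_prepare(uint8_t port_id, uint16_t queue_id,
		  struct rte_mbuf **pkts, uint16_t nb_pkts)
{
	uint16_t nb_prep = rte_eth_tx_prepare(port_id, queue_id, pkts, nb_pkts);

	if (nb_prep < nb_pkts) {
		/*
		 * pkts[nb_prep] violates a device requirement (rte_errno says why).
		 * This sketch simply drops it and everything after it; a real
		 * application could fix, split or re-queue those packets instead.
		 */
		uint16_t i;

		for (i = nb_prep; i < nb_pkts; i++)
			rte_pktmbuf_free(pkts[i]);
	}
	return nb_prep;
}

/* Stage 2: possibly later, possibly from another core or pipeline stage. */
static uint16_t
app_stage_transmit(uint8_t port_id, uint16_t queue_id,
		   struct rte_mbuf **pkts, uint16_t nb_prep)
{
	return rte_eth_tx_burst(port_id, queue_id, pkts, nb_prep);
}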
> > > > About the 'simple' case when tx_prep() and tx_burst() are called on the same core:
> > > > why do you believe that
> > > > tx_prep(); tx_burst(); would be much slower than tx_burst() {tx_prep(), ...}?
> > >
> > > I mean instead of two function calls with their own loops:
> > >
> > > tx_prepare() { foreach (pkt) { check(); extra_check(); ... } }
> > >
> > > tx_burst() { foreach (pkt) { check(); stuff(); ... } }
> > >
> > > You end up with one:
> > >
> > > tx_burst() { foreach (pkt) { check(); extra_check(); stuff(); ... } }
> > >
> > > Which usually is more efficient.
> >
> > I really doubt that.
> > If it were so, what would be the point of processing packets in bulk?
> > Usually dividing processing into different stages, and at each stage processing
> > multiple packets at once, helps to improve performance.
> > At least for IA.
> > Look for example at how we had to change l3fwd to improve its performance.
>
> Depends quite a bit on usage pattern. It is less efficient for applications
> that do not modify mbuf contents because of the additional function call and
> inner loop.

If the application doesn't modify mbuf contents, then it can simply skip calling tx_prepare().

> Note that I'm only pushing for the ability to conveniently address both
> cases with maximum performance.
>
> > > > tx_prep() itself is quite expensive; let's say for Intel HW it includes:
> > > > - read mbuf fields (2 cache lines),
> > > > - read packet header (1/2 cache line),
> > > > - calculate pseudo-header csum,
> > > > - update packet header.
> > > > Compared to that, the price of an extra function call seems negligible
> > > > (if we TX packets in bursts, of course).
> > >
> > > We agree its performance is a critical issue then, sharing half the read
> > > steps with tx_burst() would make sense to me.
> >
> > I didn't understand that sentence.
>
> I meant this step can be shared (in addition to loop etc):
>
> - read mbuf fields (2 cache lines),

Ah, ok, you still believe that mixing tx_burst and tx_prepare code together
would give us a noticeable performance benefit.
As I said above, I don't think it would, but you are welcome to try and prove me wrong.

> > > > > Yes they will, therefore we need a method that satisfies both cases.
> > > > >
> > > > > As a possible solution, a special mbuf flag could be added to each mbuf
> > > > > having gone through tx_prepare(). That way, tx_burst() could skip some
> > > > > checks and things it would otherwise have done.
> > > >
> > > > That's an interesting idea, but it has one drawback:
> > > > As I understand it, it means that from now on, if the user does preparations on his own,
> > > > he has to set up this flag, otherwise tx_burst() would do extra unnecessary work.
> > > > So any existing applications that use TX offloads and do the preparation by themselves
> > > > would have to be modified to avoid a performance loss.
> > >
> > > In my opinion, users should not do preparation on their own.
> >
> > People already do it now.
>
> But we do not want them to anymore thanks to this new API, for reasons
> described in the motivation section of the cover letter, right?

We probably wouldn't recommend that, but if people would like to use their own
stuff, or shortcuts - I don't want to stop them here.

> > > If we provide a
> > > generic method, it has to be fast enough to replace theirs. Perhaps not as
> > > fast since it would work with all PMDs (usual trade-off), but acceptably so.
> > >
> > > > > Another possibility, telling the PMD first that you always intend to use
> > > > > tx_prepare() and getting a simpler/faster tx_burst() callback as a result.
> > > >
> > > > That's what we have right now (at least for Intel HW):
> > > > it is the user's responsibility to do the necessary preparations/checks before calling tx_burst().
> > > > With tx_prepare() we just remove from the user the headache of implementing tx_prepare() on his own.
> > > > Now he can use a 'proper' PMD-provided function.
> > > >
> > > > My vote still would be for that model.
> > > OK, then in a nutshell:
> > >
> > > 1. Users are not expected to perform preparation/checks themselves anymore,
> > > if they do, it's their problem.
> >
> > I think we need to be backward compatible here.
> > If an existing app is doing what tx_prepare() is supposed to do, it should keep working.
>
> It should work, only if they keep doing it as well as call tx_burst()
> directly, they will likely get lower performance.
>
> > > 2. If configured through an API to be defined, tx_burst() can be split in
> > > two and applications must call tx_prepare() at some point before
> > > tx_burst().
> > >
> > > 3. Otherwise tx_burst() should perform the necessary preparation and checks
> > > on its own by default (when tx_prepare() is not expected).
> >
> > As I said before, I don't think it should be mandatory for tx_burst() to do what tx_prepare() does.
> > If some particular implementation of tx_burst() prefers to do things that way - that's fine.
> > But it shouldn't be required to.
>
> You're right, however applications might find it convenient. I think most
> will end up with something like the following:
>
> if (tx_prepare(pkts))
>     tx_burst(pkts);

Looking at existing DPDK apps - most of them do use some sort of TX buffering.
So, even in a simplistic app it would probably be:

tx_prepare(pkts);
tx_buffer(pkts);

> > > 4. We probably still need some mbuf flag to mark mbufs that cannot be
> > > modified, the refcount could also serve as a hint.
> >
> > If an mbuf can't be modified, you probably just wouldn't call the function that is
> > supposed to do that, tx_prepare() in that case.
>
> I think it would be easier to document which offload flags may cause the
> tx_burst() function to modify mbuf contents, so applications have the
> ability to set or strip these flags on a per-mbuf basis. That way there is no
> need to call tx_prepare() without knowing exactly what it's going to do.

Not sure I understand what exactly you propose in the last paragraph?

Konstantin