Date: Wed, 7 Dec 2016 11:08:57 +0100
From: Adrien Mazarguil
To: "Ananyev, Konstantin"
Cc: Thomas Monjalon, "dev@dpdk.org", Rahul Lakkireddy, Stephen Hurd,
 Jan Medala, Jakub Palider, John Daley, Alejandro Lucero, Harish Patil,
 Rasesh Mody, Jerin Jacob, Yuanhan Liu, Yong Wang, "Kulasek, TomaszX",
 "olivier.matz@6wind.com"
Message-ID: <20161207100857.GF10340@6wind.com>
References: <20161130074003.GD10340@6wind.com>
 <2601191342CEEE43887BDE71AB9772583F0E20C8@irsmsx105.ger.corp.intel.com>
 <20161201071518.GG10340@6wind.com>
 <2601191342CEEE43887BDE71AB9772583F0E2A9A@irsmsx105.ger.corp.intel.com>
 <20161205150327.GP10340@6wind.com>
 <2601191342CEEE43887BDE71AB9772583F0E43B5@irsmsx105.ger.corp.intel.com>
 <20161205181021.GT10340@6wind.com>
 <2601191342CEEE43887BDE71AB9772583F0E45D6@irsmsx105.ger.corp.intel.com>
 <20161206135951.GY10340@6wind.com>
 <2601191342CEEE43887BDE71AB9772583F0E4C13@irsmsx105.ger.corp.intel.com>
In-Reply-To: <2601191342CEEE43887BDE71AB9772583F0E4C13@irsmsx105.ger.corp.intel.com>
Subject: Re: [dpdk-dev] [PATCH v12 0/6] add Tx preparation
List-Id: DPDK patches and discussions

On Tue, Dec 06, 2016 at 08:31:35PM +0000, Ananyev, Konstantin wrote:
> > Hi Konstantin,
> >
> > On Tue, Dec 06, 2016 at 10:56:26AM +0000, Ananyev, Konstantin wrote:
> > >
> > > Hi Adrien,
> > >
> > > >
> > > > On Mon, Dec 05, 2016 at 04:43:52PM +0000, Ananyev, Konstantin wrote:
> > > > [...]
> > > > > > On Fri, Dec 02, 2016 at 01:00:55AM +0000, Ananyev, Konstantin wrote:
> > > > > > [...]
> > > > > > > > On Wed, Nov 30, 2016 at 10:54:50AM +0000, Ananyev, Konstantin wrote:
> > > > > > > > [...]
> > > > > > > > > Do you have anything particular in mind here?
> > > > > > > >
> > > > > > > > Nothing in particular, so for the sake of the argument, let's suppose that I
> > > > > > > > would like to add a field to expose some limitation that only applies to my
> > > > > > > > PMD during TX but looks generic enough to make sense, e.g. maximum packet
> > > > > > > > size when VLAN tagging is requested.
> > > > > > > >
> > > > > > > Hmm, I didn't hear about such limitations so far, but if it is a real case -
> > > > > > > sure, feel free to submit the patch.
> > > > > > >
> > > > > > I won't, that was hypothetical.
> > > > > >
> > > > > Then why are we discussing it? :)
> > > > >
> > > > Just to make a point, which is that new limitations may appear anytime and
> > > > tx_prepare() can now be used to check for them. First patch of the series
> > > > does it:
> > > >
> > > > + uint16_t nb_seg_max; /**< Max number of segments per whole packet. */
> > > > + uint16_t nb_mtu_seg_max; /**< Max number of segments per one MTU */
> > > >
> > > > And states that:
> > > >
> > > > + * For each packet to send, the rte_eth_tx_prepare() function performs
> > > > + * the following operations:
> > > > + *
> > > > + * - Check if packet meets devices requirements for tx offloads.
> > > > + *
> > > > + * - Check limitations about number of segments.
> > > > + *
> > > > + * - Check additional requirements when debug is enabled.
> > > > + *
> > > > + * - Update and/or reset required checksums when tx offload is set for packet.
> > > I think I already explained in my previous email why I think that
> > > nb_seg_max and nb_mtu_seg_max are not redundant because of tx_prepare().
> > > From my point of view they are complementary to tx_prepare():
> > > Even if people do use tx_prepare() they still should take this information into account.
> > > As an example ixgbe can't TX packets with more than 40 segments.
> > > tx_prepare() for ixgbe will flag that issue, but it can't make a decision on user behalf > > > what to do in that case: drop the packet, try to coalesce it into the packet with less > > > number of segments, split the packet into several smaller, etc. > > > That's up to user to make such decision, and to make it, user might need this information. > > > > Yet tx_prepare() has already the ability to update mbuf contents, issue is > > what will this function do in the future, where will it stop? It is defined > > in a way that each PMD does what it wants to make mbufs edible for > > tx_burst(), because of this applications will just always call it to be on > > the safe side. > > > > > > It's like making this function mandatory IMO. > > > > > > That's probably where confusion starts: I don't think that > > > tx_prepare() should be mandatory for the user to call. > > > Yes, it should be a recommended way. > > > But the user still should have the ability to by-pass it, > > > if he believes there is no need for it, or he prefers to implement > > > the same functionality on his own. > > > As an example, if the user knows that he is going to send a group > > > of one-segment packets that don't require any tx offloads, he can safely skip > > > tx_prepare() for them. > > > > I understand your point, and agree with the example you provide. Many > > applications do not know what's inside mbufs though, except perhaps that > > they contain TCP and may want to perform TSO because of that. Those will > > have to call tx_prepare() to be future-proof. > > > > > > > > > > PMDs are free to set that field to some > > > > > > > > special value (say, 0) if they do not care. > > > > > > > > > > > > > > > > Since that field exists however, conscious applications should check its > > > > > > > > value for each packet that needs to be transmitted. This extra code causes a > > > > > > > > slowdown just by sitting in the data path. 
Since it is not the only field in > > > > > > > > that structure, the performance impact can be significant. > > > > > > Conscious user will probably use this information at the stage of packet formation. > > > He probably has to do this sort of things for large packets anyway: > > > Check what is the underlying mtu, to decide does he need to split the packet, > > > or enable tso for it, etc. > > > > There are already too many things to check, > > There always been, that patch exposes them, before that upper layer probably > had to use some hard-coded defines. > > > applications probably won't mind > > a little help from PMDs. If we keep adding fields to this structure, we'll > > have to provide some sort of PMD-specific function that checks what is > > relevant. > > Why PMD specific? > These fields are generic enough and could be used by upper layer to consult > when packet is formed. I've used the wrong term here, I meant a generic function provided by PMDs, just like tx_prepare(). Some sort of generic tx_check(), whose functionality is currently partially covered by tx_prepare() if I'm not mistaken. > > Furthermore, assuming most packets are fine and do not require any extra > > processing, what is rejected by tx_burst() could enter some unlikely() path > > that attempts to rectify and re-send them. That would at least optimize the > > common scenario. > > It is up to the upper layer to decide what to do with ill-formed packets: > drop/log/try to cure/etc. > Obviously different applications would have different logic and make different decisions here. > If you'd like to introduce a new function (in rte_net or whatever) that would be smart and > generic enough to cure ill-formed packets - you are more than welcome to try. > Though discussion of such fallback function is far of scope of that patch, I believe. 
I agree it's up to the upper layer, so just to clarify, I meant the
tx_burst() function could fail some check and not transmit the remaining
buffers, leaving the application to do whatever with them in some
unlikely() path:

    // tx_prepare() has never been called
    n = tx_burst(pkts, num);
    if (unlikely(n < num)) {
            n += recover_here(pkts + n, num - n);
    }

> > > > > > > > Even though this code is inside applications, it remains unfair to PMDs for
> > > > > > > > which these tests are irrelevant. This problem is identified and addressed
> > > > > > > > by tx_prepare().
> > > > > > > >
> > > > > > > I suppose the question is why do we need:
> > > > > > > uint16_t nb_seg_max;
> > > > > > > uint16_t nb_mtu_seg_max;
> > > > > > > as we now have tx_prepare(), right?
> > > > > > >
> > > > > > > For two reasons:
> > > > > > > 1. Some people might feel that tx_prepare() is not good (smart/fast) enough
> > > > > > > for them and would prefer to do necessary preparations for TX offloads themselves.
> > > > > > >
> > > > > > > 2. Even if people do use tx_prepare() they still should take this information into account.
> > > > > > > As an example ixgbe can't TX packets with more than 40 segments.
> > > > > > > Obviously ixgbe_tx_prep() performs that check and returns an error.
> > > > > > >
> > > > > > Problem is that tx_prepare() also provides safeties which are not part of
> > > > > > tx_burst(), such as not going over nb_mtu_seg_max. Because of this and the
> > > > > > fact struct rte_eth_desc_lim can grow new fields anytime, application
> > > > > > developers will be tempted to just call tx_prepare() and focus on more
> > > > > > useful things.
> > > > > >
> > > > > NP with that, that was the intention behind introducing it.
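[Editor's note: for illustration, the unlikely() recovery path sketched above can be modeled in self-contained C. This is only a sketch of the control flow; mock_tx_burst(), recover_and_resend() and the use of plain ints as stand-in packets are hypothetical, not the DPDK API, and the recovery policy (drop, coalesce, split) remains the application's choice.]

```c
#include <stddef.h>

/* A mock burst function: sends packets until the first "ill-formed"
 * one (modeled here as a negative value) and returns the count sent,
 * like a tx_burst() that rejects a packet mid-burst. */
static size_t mock_tx_burst(int *pkts, size_t num)
{
	size_t i;

	for (i = 0; i < num; i++)
		if (pkts[i] < 0) /* device would reject this packet */
			break;
	return i;
}

/* Application-chosen recovery: "cure" the bad packets, then retry. */
static size_t recover_and_resend(int *pkts, size_t num)
{
	for (size_t i = 0; i < num; i++)
		if (pkts[i] < 0)
			pkts[i] = -pkts[i]; /* fix up the packet */
	return mock_tx_burst(pkts, num);
}

/* The pattern from the snippet above: fast path first, recovery code
 * only on the (unlikely) partial send. */
size_t send_all(int *pkts, size_t num)
{
	size_t n = mock_tx_burst(pkts, num);

	if (n < num) /* unlikely() in real code */
		n += recover_and_resend(pkts + n, num - n);
	return n;
}
```

The point of the shape is that well-formed bursts never pay for the recovery logic; it only runs when the burst function stops short.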
> > > > > > > > > > > Put another way, from a user's point of view, tx_prepare() is an opaque > > > > > > function that greatly increases tx_burst()'s ability to send mbufs as > > > > > > requested, with extra error checking on top; applications not written to run > > > > > > on a specific PMD/device (all of them ideally) will thus call tx_prepare() > > > > > > at some point. > > > > > > > > > > > > > But it wouldn't try to merge/reallocate mbufs for you. > > > > > > > User still has to do it himself, or just prevent creating such long chains somehow. > > > > > > > > > > > > Yes, that's another debate. PMDs could still implement a software fallback > > > > > > for unlikely slow events like these. The number of PMDs is not going to > > > > > > decrease, each device having its own set of weird limitations in specific > > > > > > cases, PMDs should do their best to process mbufs even if that means slowly > > > > > > due to the lack of preparation. > > > > > > > > > > > > tx_prepare() has its uses but should really be optional, in the sense that > > > > > > if that function is not called, tx_burst() should deal with it somehow. > > > > > > > > > > As I said before, I don't think it is a good idea to put everything in tx_burst(). > > > > > If PMD driver prefer things that way, yes tx_burst() can deal with each and > > > > > possible offload requirement itself, but it shouldn't be mandatory. > > > > > > > > In effect, having to call tx_prepare() otherwise makes this step mandatory > > > > anyway. Looks like we are not going to agree here. > > > > > > > > > > > > Thanks to tx_prepare(), these checks are moved back into PMDs where they > > > > > > > > belong. PMDs that do not need them do not have to provide support for > > > > > > > > tx_prepare() and do not suffer any performance impact as result; > > > > > > > > applications only have to make sure tx_prepare() is always called at some > > > > > > > > point before tx_burst(). 
> > > > > > > > > > > > > > > > Once you reach this stage, you've effectively made tx_prepare() mandatory > > > > > > > > before tx_burst(). If some bug occurs, then perhaps you forgot to call > > > > > > > > tx_prepare(), you just need to add it. The total cost for doing TX is > > > > > > > > therefore tx_prepare() + tx_burst(). > > > > > > > > > > > > > > > > I'm perhaps a bit pessimistic mind you, but I do not think tx_prepare() will > > > > > > > > remain optional for long. Sure, PMDs that do not implement it do not care, > > > > > > > > I'm focusing on applications, for which the performance impact of calling > > > > > > > > tx_prepare() followed by tx_burst() is higher than a single tx_burst() > > > > > > > > performing all the necessary preparation at once. > > > > > > > > > > > > > > > > [...] > > > > > > > > > > Following the same logic, why can't such a thing be made part of the TX > > > > > > > > > > burst function as well (through a direct call to rte_phdr_cksum_fix() > > > > > > > > > > whenever necessary). From an application standpoint, what are the advantages > > > > > > > > > > of having to: > > > > > > > > > > > > > > > > > > > > if (tx_prep()) // iterate and update mbufs as needed > > > > > > > > > > tx_burst(); // iterate and send > > > > > > > > > > > > > > > > > > > > Compared to: > > > > > > > > > > > > > > > > > > > > tx_burst(); // iterate, update as needed and send > > > > > > > > > > > > > > > > > > I think that was discussed extensively quite a lot previously here: > > > > > > > > > As Thomas already replied - main motivation is to allow user > > > > > > > > > to execute them on different stages of packet TX pipeline, > > > > > > > > > and probably on different cores. > > > > > > > > > I think that provides better flexibility to the user to when/where > > > > > > > > > do these preparations and hopefully would lead to better performance. 
> > > > > > > > > > > > > > > > And I agree, I think this use case is valid but does not warrant such a high > > > > > > > > penalty when your application does not need that much flexibility. Simple > > > > > > > > (yet conscious) applications need the highest performance. Complex ones as > > > > > > > > you described already suffer quite a bit from IPCs and won't mind a couple > > > > > > > > of extra CPU cycles right? > > > > > > > > > > > > > > It would mean an extra cache-miss for every packet, so I think performance hit > > > > > > > would be quite significant. > > > > > > > > > > > > A performance hit has to occur somewhere regardless, because something has > > > > > > to be done in order to send packets that need it. Whether this cost is in > > > > > > application code or in a PMD function, it remains part of TX. > > > > > > > > > > Depending on the place the final cost would differ quite a lot. > > > > > If you call tx_prepare() somewhere close to the place where you fill the packet header > > > > > contents, then most likely the data that tx_prepare() has to access will be already in the cache. > > > > > So the performance penalty will be minimal. > > > > > If you'll try to access the same data later (at tx_burst), then the possibility that it would still > > > > > be in cache is much less. > > > > > If you calling tx_burst() from other core then data would for sure be out of cache, > > > > > and even worse can still be in another core cache. > > > > > > > > Well sure, that's why I also think tx_prepare() has its uses, only that > > > > since tx_prepare() is optional, tx_burst() should provide the same > > > > functionality when tx_prepare() is not called. > > > > > > As I understand, to implement what you are proposing (TX_PREPARED mbuf->ol_flag) > > > it will be required: > > > > > > a) Modify all existing applications that do similar to tx_prepare() stuff on their own, > > > otherwise they'll would hit performance penalty. 
> > > b) Modify at least all Intel PMDs and might be some others too (vmxnet3?).
> > >
> > > Step b) probably wouldn't cause any significant performance impact straight away,
> > > but it for sure wouldn't make things faster, and would increase tx_burst() code
> > > complexity quite a lot.
> > > From the other side, I can't see any real benefit that we will have in return.
> > > So I'm still opposed to that idea.
> >
> > Applications gain the ability to perform tx_burst() with offloads without
> > having to prepare anything.
>
> What does 'without preparing anything' mean?
> This is just not possible I think.
> One way or another the application would have to decide
> what exactly it likes to TX and what HW offloads it likes to use for it.
> So at least, it still needs to fill relevant mbuf fields:
> pkt_len, data_len, nb_segs, ol_flags, tx_offload, etc.
>
> > Currently these applications either cannot use
> > offloads at all or need to perform PMD-specific voodoo first.
>
> That's why tx_prepare() is introduced.

Yes, and the fact it appeared made me wonder why tx_burst() could not
implement its functionality as well. Why can't tx_burst() call
rte_net_intel_cksum_prepare() directly? I mean, unless offloads are
requested, this additional code should not impact performance.

> > The generic
> > alternative to this scenario being tx_prepare(), PMDs have to make this step
> > as cheap as possible.
> >
> > Yes that would slow down existing applications, people may find it
> > acceptable since we're modifying the TX API here.
> >
> > > > > > > About the 'simple' case when tx_prep() and tx_burst() are called on the same core,
> > > > > > > Why do you believe that:
> > > > > > > tx_prep(); tx_burst(); would be much slower than tx_burst() {tx_prep(), ...}?
> > > > > >
> > > > > > I mean instead of two function calls with their own loops:
> > > > > >
> > > > > > tx_prepare() { foreach (pkt) { check(); extra_check(); ... } }
> > > > > >
> > > > > > tx_burst() { foreach (pkt) { check(); stuff(); ... } }
> > > > > >
> > > > > > You end up with one:
> > > > > >
> > > > > > tx_burst() { foreach (pkt) { check(); extra_check(); stuff(); ... } }
> > > > > >
> > > > > > Which usually is more efficient.
> > > > >
> > > > > I really doubt that.
> > > > > If that was the case, what would be the point of processing packets in bulk?
> > > > > Usually dividing processing into different stages and at each stage processing
> > > > > multiple packets at once helps to improve performance.
> > > > > At least for IA.
> > > > > Look for example how we had to change l3fwd to improve its performance.
> > > >
> > > > Depends quite a bit on usage pattern. It is less efficient for applications
> > > > that do not modify mbuf contents because of the additional function call and
> > > > inner loop.
> > >
> > > If the application doesn't modify mbuf contents then it can simply skip calling tx_prepare().
> >
> > What if that same application wants to enable some offload as well?
>
> Hmm, wasn't that your use-case when no offloads (modifications) are required just 3 lines above?

Not exactly, this is a case where an application does not touch mbuf
contents yet wants to perform some HW offload on them. I'll concede this
scenario is likely rare enough for existing offloads. What is likely not,
though, is when an application discovers it can request a particular
offload right before calling tx_burst().

> > > > Note that I'm only pushing for the ability to conveniently address both
> > > > cases with maximum performance.
> > > >
> > > > > > > tx_prep() itself is quite expensive, let's say for Intel HW it includes:
> > > > > > > - read mbuf fields (2 cache-lines),
> > > > > > > - read packet header (1/2 cache-lines)
> > > > > > > - calculate pseudo-header csum
> > > > > > > - update packet header
> > > > > > > Comparing to that, the price of an extra function call seems negligible
> > > > > > > (if we TX packets in bursts of course).
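[Editor's note: the two shapes being debated above can be written out as a toy model in self-contained C. prepare_one() and send_one() are hypothetical stand-ins for per-packet work such as checks and checksum fix-ups; this is not DPDK code.]

```c
#include <stddef.h>
#include <stdint.h>

struct pkt {
	uint32_t len;
	uint32_t cksum;
};

/* Stand-in for tx_prepare()'s per-packet work. */
static void prepare_one(struct pkt *p)
{
	p->cksum = p->len ^ 0xffffffffu;
}

/* Stand-in for tx_burst()'s per-packet work; returns 1 if "sent". */
static int send_one(const struct pkt *p)
{
	return p->cksum != 0;
}

/* tx_prepare(); tx_burst(); -- two loops over the same burst. */
size_t two_pass(struct pkt *pkts, size_t num)
{
	size_t sent = 0;

	for (size_t i = 0; i < num; i++) /* prepare pass */
		prepare_one(&pkts[i]);
	for (size_t i = 0; i < num; i++) /* burst pass */
		sent += send_one(&pkts[i]);
	return sent;
}

/* The fused alternative: prepare and send in a single loop. */
size_t fused(struct pkt *pkts, size_t num)
{
	size_t sent = 0;

	for (size_t i = 0; i < num; i++) {
		prepare_one(&pkts[i]);
		sent += send_one(&pkts[i]);
	}
	return sent;
}
```

Both produce identical results; the disagreement is purely about instruction count and cache behavior of the two shapes (one pass over hot data versus staged processing of the whole burst).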
> > > > > >
> > > > > > We agree its performance is a critical issue then, sharing half the read
> > > > > > steps with tx_burst() would make sense to me.
> > > > >
> > > > > I didn't understand that sentence.
> > > >
> > > > I meant this step can be shared (in addition to loop etc):
> > > >
> > > > - read mbuf fields (2 cache-lines),
> > >
> > > Ah ok, you still believe that mixing tx_burst and tx_prepare code together
> > > would give us noticeable performance benefit.
> > > As I said above, I don't think it would, but you are welcome to try and
> > > prove me wrong.
> >
> > Depends on what you call noticeable. I guess we can at least agree having
> > two separate functions and loops causes more instructions to be generated and
> > executed.
> >
> > Now for the number of spent CPU cycles, it depends of course whether mbufs are
> > still hot in the cache or not, and as I told you, in my opinion we'll
> > usually see applications calling tx_prepare() just before tx_burst() to
> > benefit from offloads.
>
> Honestly, Adrien, we are going in circles here.
> Just to be clear:
> The current patch introduces tx_prepare() without affecting in any way:
> 1) existing applications
> 2) existing PMD code (tx_burst)
> Meanwhile I still believe it is useful and provides a big step forward in terms
> of generalizing usage of HW TX offloads.

Yes and agreed. I do not oppose this new API on the basis that it's better
than what was available before.

> What you propose requires modifications for both existing applications and existing PMD code
> (full-featured tx_burst() for at least all Intel PMDs and vmxnet3 has to be significantly modified).
> You believe that with these modifications that new tx_burst() implementation
> would be noticeably faster than just current: tx_prepare(); tx_burst();
> I personally doubt that it really would (at least on modern IA).
> But as I said, you are more than welcome to prove me wrong here.
> Let say, provide a patch for ixgbe (or i40e) full-featured tx_burst() implementation, > so that it would combine both tx_prepare() and tx_burst() functionalities into one function. > Then we can run some performance tests with current and yours patches and compare results. > Without that, I don't see any point to discuss your proposition any further. > I just won't agree for such big change in existing PMDs without some solid justification beyond it. I worry that proving you wrong would make you even stronger :), you raise good points and I'm also a bit tired of this discussion. Without performance numbers, my concerns are baseless. Since non-Intel/vmxnet3 PMDs do not need tx_prepare() (yet), standardizing on the case where it might be needed instead of assuming tx_burst() can do all that work remains not better from a usability standpoint. It is done because moving that stuff inside tx_burst() would be expensive even when offloads are not requested, right? > > > > > > > > Yes they will, therefore we need a method that satisfies both cases. > > > > > > > > > > > > > > > > As a possible solution, a special mbuf flag could be added to each mbuf > > > > > > > > having gone through tx_prepare(). That way, tx_burst() could skip some > > > > > > > > checks and things it would otherwise have done. > > > > > > > > > > > > > > That's an interesting idea, but it has one drawback: > > > > > > > As I understand, it means that from now on if user doing preparations on his own, > > > > > > > he had to setup this flag, otherwise tx_burst() would do extra unnecessary work. > > > > > > > So any existing applications that using TX offloads and do preparation by themselves > > > > > > > would have to be modified to avoid performance loss. > > > > > > > > > > > > In my opinion, users should not do preparation on their own. > > > > > > > > > > People already do it now. 
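[Editor's note: the claim that a fused tx_burst() need not be expensive when no offloads are requested boils down to one cheap branch on a field the TX path already reads. A minimal sketch in self-contained C, with hypothetical flag values and helpers rather than the real PKT_TX_* definitions:]

```c
#include <stdint.h>

/* Hypothetical offload request flags, standing in for PKT_TX_*. */
#define MODEL_TX_TCP_CKSUM	(1ull << 0)
#define MODEL_TX_TSO		(1ull << 1)
#define MODEL_TX_OFFLOAD_MASK	(MODEL_TX_TCP_CKSUM | MODEL_TX_TSO)

struct model_mbuf {
	uint64_t ol_flags;
	uint32_t prep_runs; /* counts how often preparation ran */
};

/* Stand-in for the expensive part (pseudo-header csum, etc.). */
static void costly_prepare(struct model_mbuf *m)
{
	m->prep_runs++;
}

void model_tx_one(struct model_mbuf *m)
{
	/* One cheap test on a field tx_burst() reads anyway: packets
	 * with no offloads requested skip the preparation entirely. */
	if (m->ol_flags & MODEL_TX_OFFLOAD_MASK)
		costly_prepare(m);
	/* ... descriptor setup and doorbell would follow here ... */
}
```

Whether that branch (and the extra code it pulls into the TX path) is really free on modern hardware is exactly the measurement being asked for above.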
> > > > > > > > But we do not want them to anymore thanks to this new API, for reasons > > > > described in the motivation section of the cover letter, right? > > > > > > We probably wouldn't recommend that, but if people would like to use their own stuff, > > > or shortcuts - I don't want to stop them here. > > > > > > > > > > > > > If we provide a > > > > > > generic method, it has to be fast enough to replace theirs. Perhaps not as > > > > > > fast since it would work with all PMDs (usual trade-off), but acceptably so. > > > > > > > > > > > > > > Another possibility, telling the PMD first that you always intend to use > > > > > > > > tx_prepare() and getting a simpler/faster tx_burst() callback as a result. > > > > > > > > > > > > > > That what we have right now (at least for Intel HW): > > > > > > > it is a user responsibility to do the necessary preparations/checks before calling tx_burst(). > > > > > > > With tx_prepare() we just remove from user the headache to implement tx_prepare() on his own. > > > > > > > Now he can use a 'proper' PMD provided function. > > > > > > > > > > > > > > My vote still would be for that model. > > > > > > > > > > > > OK, then in a nutshell: > > > > > > > > > > > > 1. Users are not expected to perform preparation/checks themselves anymore, > > > > > > if they do, it's their problem. > > > > > > > > > > I think we need to be backward compatible here. > > > > > If the existing app doing what tx_prepare() supposed to do, it should keep working. > > > > > > > > It should work, only if they keep doing it as well as call tx_burst() > > > > directly, they will likely get lower performance. > > > > > > > > > > 2. If configured through an API to be defined, tx_burst() can be split in > > > > > > two and applications must call tx_prepare() at some point before > > > > > > tx_burst(). > > > > > > > > > > > > 3. 
> > > > > > Otherwise tx_burst() should perform the necessary preparation and checks
> > > > > > on its own by default (when tx_prepare() is not expected).
> > > > >
> > > > > As I said before, I don't think it should be mandatory for tx_burst() to do what tx_prepare() does.
> > > > > If some particular implementation of tx_burst() prefers to do things that way - that's fine.
> > > > > But it shouldn't be required to.
> > > >
> > > > You're right, however applications might find it convenient. I think most
> > > > will end up with something like the following:
> > > >
> > > > if (tx_prepare(pkts))
> > > >         tx_burst(pkts);
> > >
> > > Looking at existing DPDK apps - most of them do use some sort of TX bufferization.
> > > So, even in a simplistic app it would probably be:
> > >
> > > tx_prepare(pkts);
> > > tx_buffer(pkts);
> >
> > We're down to my word against yours here I guess; to leave the choice to
> > application developers, we'd need to provide tx_prepare() and a simpler
> > tx_burst() as well as the ability to call tx_burst() directly and still get
> > offloads.
>
> From what I've seen, most DPDK libs/apps do buffer data packets for TX in one or another way:
> mtcp, warp17, seastar.

I was arguing most applications wouldn't call tx_prepare() far from
tx_burst() and that merging them would therefore make sense, which seems to
be the case in these examples. Obviously they're doing bufferization.

- warp17 would call tx_prepare() inside pkt_flush_tx_q() right before
  tx_burst(), it wouldn't make sense elsewhere.

- mtcp has an interesting loop on tx_burst(), however again, tx_prepare()
  would be performed inside dpdk_send_pkts(), not where mbufs are populated.

- seastar seems to expose the ability to send bursts as well,
  dpdk_qp::_send() would also call tx_prepare() before tx_burst().

> Not to mention sample apps.
> But ok, as said above, if you can prove that the tx_burst() you are proposing is really much faster than
> tx_prepare(); tx_burst();
> I'll be happy to reconsider.
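[Editor's note: the bufferized pattern referred to above (prepare, accumulate, send as a burst once the buffer fills or is flushed) mirrors the shape of rte_eth_tx_buffer()/rte_eth_tx_buffer_flush(), and can be modeled in self-contained C; everything below is a stand-in, not the DPDK API.]

```c
#include <stddef.h>

#define BUF_SIZE 4

struct tx_buf {
	int pkts[BUF_SIZE];
	size_t count; /* packets currently buffered */
	size_t sent;  /* total packets pushed to the "wire" */
};

/* Stand-in for a burst send; pretend the whole burst goes out. */
static void mock_burst(struct tx_buf *b)
{
	b->sent += b->count;
	b->count = 0;
}

/* Buffer one (already prepared) packet; flush automatically when the
 * buffer fills, like rte_eth_tx_buffer() does. */
void tx_buffer(struct tx_buf *b, int pkt)
{
	b->pkts[b->count++] = pkt;
	if (b->count == BUF_SIZE)
		mock_burst(b);
}

/* Explicit flush of whatever remains, like rte_eth_tx_buffer_flush(). */
void tx_flush(struct tx_buf *b)
{
	if (b->count > 0)
		mock_burst(b);
}
```

In this shape the prepare step naturally happens per packet at buffering time, while the burst happens later, which is the separation the thread keeps coming back to.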
Nowhere I wrote "much" faster, I was just pointing out it would be more efficient if the work was done in a single function, at least for the use cases found in several examples you've provided. Now if you tell me it cannot be significant, fine, but still. > > > > > > 4. We probably still need some mbuf flag to mark mbufs that cannot be > > > > > > modified, the refcount could also serve as a hint. > > > > > > > > > > If mbuf can't be modified, you probably just wouldn't call the function that supposed to do that, > > > > > tx_prepare() in that case. > > > > > > > > I think it would be easier to document what offload flags may cause the > > > > tx_burst() function to modify mbuf contents, so applications have the > > > > ability to set or strip these flags on a mbuf basis. That way there is no > > > > need to call tx_prepare() without knowing exactly what it's going to do. > > > > > > Not sure I understand what exactly do you propose in the last paragraph? > > > > That for each TX offload flag, we document whether preparation might cause a > > mbuf to be written to during the tx_prepare()/tx_burst() phase. One of the > > reasons for tx_prepare() being: > > > > 4) Fields in packet may require different initialization (like e.g. will > > require pseudo-header checksum precalculation, sometimes in a > > different way depending on packet type, and so on). Now application > > needs to care about it. > > > > If we determine what offloads may cause mbuf contents to change (all of them > > perhaps?), then applications can easily strip those flags from outgoing > > "const" mbufs. Then it becomes acceptable for tx_burst() to modify mbuf > > contents as per user request, which removes one reason to rely on > > tx_prepare() for these. > > Hmm, I didn't get you here. > If let say I don't need TX TCP cksum offload, why would I set this flag inside mbuf at first place? 
> If I do expect PMD to do TX TCP cksum offload for me, then I have to set that flag, > otherwise how PMD would know that I did require that offload? Let me take an example, if you assume a cloned mbuf must be sent, pointed data is not owned by the TX function, and software preparation for TX offloads might cause side effects elsewhere, I'm not talking about the changes performed by HW to the outgoing frame here. One reason for tx_prepare() according to 4) is to make this clear; by calling tx_prepare(), applications fully acknowledge and take responsibility for possible side-effects. Hence my suggestion, we could make this acceptable for tx_burst() by documenting that requesting offloads may cause it to modify mbuf contents as well. Before you reply, note that I've neither acked nor nacked that series, and intend to leave this to other people. So far it's received several acks already so I guess it's probably ready for inclusion. I'm not comfortable with it as an application developer because of the extra API call that I think could be included in tx_burst(), but as a PMD maintainer I do not mind. Might even prove useful someday if some corner-case offload too expensive to fully handle in tx_burst() had to be supported. An optional offload if you will. -- Adrien Mazarguil 6WIND