From: "Ananyev, Konstantin"
To: Adrien Mazarguil
CC: Thomas Monjalon, dev@dpdk.org, Rahul Lakkireddy, Stephen Hurd, Jan Medala, Jakub Palider, John Daley, Alejandro Lucero, Harish Patil, Rasesh Mody, Jerin Jacob, Yuanhan Liu, Yong Wang, "Kulasek, TomaszX", olivier.matz@6wind.com
Date: Fri, 2 Dec 2016 01:00:55 +0000
Subject: Re: [dpdk-dev] [PATCH v12 0/6] add Tx preparation
In-Reply-To: <20161201071518.GG10340@6wind.com>
Hi Adrien,

> Hi Konstantin,
>
> On Wed, Nov 30, 2016 at 10:54:50AM +0000, Ananyev, Konstantin wrote:
> [...]
> > > Something is definitely needed here, and only PMDs can provide it. I think
> > > applications should not have to clear checksum fields or initialize them to
> > > some magic value; the same goes for any other offload or hardware limitation
> > > that needs to be worked around.
> > >
> > > tx_prep() is one possible answer to this issue; however, as mentioned in the
> > > original patch, it can be very expensive if exposed by the PMD.
> > >
> > > Another issue I'm more concerned about is the way limitations are managed
> > > (struct rte_eth_desc_lim). While not officially tied to tx_prep(), this
> > > structure contains new fields that are only relevant to a few devices, and I
> > > fear it will keep growing with each new hardware quirk to manage, breaking
> > > ABIs in the process.
> >
> > Well, if some new HW capability/limitation were to arise and we'd like to
> > support it in DPDK, then yes, we would probably need to think about how to
> > incorporate it here.
> > Do you have anything particular in mind?
>
> Nothing in particular, so for the sake of the argument, let's suppose that I
> would like to add a field to expose some limitation that only applies to my
> PMD during TX but looks generic enough to make sense, e.g. maximum packet
> size when VLAN tagging is requested.

Hmm, I haven't heard of such a limitation so far, but if it is a real case,
sure, feel free to submit a patch.

> PMDs are free to set that field to some
> special value (say, 0) if they do not care.
>
> Since that field exists, however, conscious applications should check its
> value for each packet that needs to be transmitted. This extra code causes a
> slowdown just by sitting in the data path.
> Since it is not the only field in
> that structure, the performance impact can be significant.
>
> Even though this code is inside applications, it remains unfair to PMDs for
> which these tests are irrelevant. This problem is identified and addressed
> by tx_prepare().

I suppose the question is why we still need:

    uint16_t nb_seg_max;
    uint16_t nb_mtu_seg_max;

now that we have tx_prepare(), right?
For two reasons:
1. Some people might feel that tx_prepare() is not good (smart/fast) enough
for them and would prefer to do the necessary preparations for TX offloads
themselves.
2. Even if people do use tx_prepare(), they still have to take this
information into account.
As an example, ixgbe can't TX packets with more than 40 segments.
Obviously, ixgbe_tx_prep() performs that check and returns an error,
but it wouldn't try to merge/reallocate mbufs for you.
The user still has to do that himself, or just prevent creating such long
chains somehow.

> Thanks to tx_prepare(), these checks are moved back into PMDs where they
> belong. PMDs that do not need them do not have to provide support for
> tx_prepare() and do not suffer any performance impact as a result;
> applications only have to make sure tx_prepare() is always called at some
> point before tx_burst().
>
> Once you reach this stage, you've effectively made tx_prepare() mandatory
> before tx_burst(). If some bug occurs, then perhaps you forgot to call
> tx_prepare(); you just need to add it. The total cost of doing TX is
> therefore tx_prepare() + tx_burst().
>
> I'm perhaps a bit pessimistic, mind you, but I do not think tx_prepare()
> will remain optional for long. Sure, PMDs that do not implement it do not
> care; I'm focusing on applications, for which the performance impact of
> calling tx_prepare() followed by tx_burst() is higher than a single
> tx_burst() performing all the necessary preparation at once.
>
> [...]
> > > Following the same logic, why can't such a thing be made part of the TX
> > > burst function as well (through a direct call to rte_phdr_cksum_fix()
> > > whenever necessary)? From an application standpoint, what are the
> > > advantages of having to:
> > >
> > >  if (tx_prep()) // iterate and update mbufs as needed
> > >          tx_burst(); // iterate and send
> > >
> > > Compared to:
> > >
> > >  tx_burst(); // iterate, update as needed and send
> >
> > I think that was discussed extensively quite a lot previously here:
> > As Thomas already replied, the main motivation is to allow the user
> > to execute them on different stages of the packet TX pipeline,
> > and possibly on different cores.
> > I think that provides better flexibility to the user as to when/where
> > to do these preparations, and hopefully would lead to better performance.
>
> And I agree, I think this use case is valid but does not warrant such a high
> penalty when your application does not need that much flexibility. Simple
> (yet conscious) applications need the highest performance. Complex ones as
> you described already suffer quite a bit from IPCs and won't mind a couple
> of extra CPU cycles, right?

It would mean an extra cache miss for every packet, so I think the
performance hit would be quite significant.

About the 'simple' case when tx_prep() and tx_burst() are called on the same
core: why do you believe that
    tx_prep(); tx_burst();
would be much slower than tx_burst() {tx_prep(), ...}?
tx_prep() itself is quite expensive; let's say for Intel HW it includes:
- reading mbuf fields (2 cache lines),
- reading the packet header (1/2 cache lines),
- calculating the pseudo-header csum,
- updating the packet header.
Compared to that, the price of an extra function call seems negligible
(if we TX packets in bursts, of course).

>
> Yes they will, therefore we need a method that satisfies both cases.
>
> As a possible solution, a special mbuf flag could be added to each mbuf
> having gone through tx_prepare().
> That way, tx_burst() could skip some
> checks and things it would otherwise have done.

That's an interesting idea, but it has one drawback:
as I understand it, it means that from now on, if a user does the
preparations on his own, he has to set this flag, otherwise tx_burst() would
do extra unnecessary work.
So any existing applications that use TX offloads and do the preparation
themselves would have to be modified to avoid a performance loss.

> Another possibility, telling the PMD first that you always intend to use
> tx_prepare() and getting a simpler/faster tx_burst() callback as a result.

That's what we have right now (at least for Intel HW):
it is the user's responsibility to do the necessary preparations/checks
before calling tx_burst().
With tx_prepare() we just remove from the user the headache of implementing
tx_prepare() on his own. Now he can use a 'proper' PMD-provided function.

My vote still would be for that model.

Konstantin