From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <konstantin.ananyev@intel.com>
Received: from mga06.intel.com (mga06.intel.com [134.134.136.31])
 by dpdk.org (Postfix) with ESMTP id C5B5F4CC5
 for <dev@dpdk.org>; Mon,  5 Dec 2016 17:46:13 +0100 (CET)
Received: from orsmga004.jf.intel.com ([10.7.209.38])
 by orsmga104.jf.intel.com with ESMTP; 05 Dec 2016 08:46:12 -0800
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.33,305,1477983600"; d="scan'208";a="37253525"
Received: from irsmsx101.ger.corp.intel.com ([163.33.3.153])
 by orsmga004.jf.intel.com with ESMTP; 05 Dec 2016 08:46:10 -0800
Received: from irsmsx105.ger.corp.intel.com ([169.254.7.43]) by
 IRSMSX101.ger.corp.intel.com ([163.33.3.153]) with mapi id 14.03.0248.002;
 Mon, 5 Dec 2016 16:43:52 +0000
From: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>
To: Adrien Mazarguil <adrien.mazarguil@6wind.com>
CC: Thomas Monjalon <thomas.monjalon@6wind.com>, "dev@dpdk.org"
 <dev@dpdk.org>, Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>, Stephen Hurd
 <stephen.hurd@broadcom.com>, Jan Medala <jan@semihalf.com>, Jakub Palider
 <jpa@semihalf.com>, John Daley <johndale@cisco.com>, Alejandro Lucero
 <alejandro.lucero@netronome.com>, Harish Patil <harish.patil@qlogic.com>,
 Rasesh Mody <rasesh.mody@qlogic.com>, Jerin Jacob
 <jerin.jacob@caviumnetworks.com>, Yuanhan Liu <yuanhan.liu@linux.intel.com>,
 Yong Wang <yongwang@vmware.com>, "Kulasek, TomaszX"
 <tomaszx.kulasek@intel.com>, "olivier.matz@6wind.com"
 <olivier.matz@6wind.com>
Thread-Topic: [dpdk-dev] [PATCH v12 0/6] add Tx preparation
Thread-Index: AQHSRbBDKacvBhx2eUCvrHI7LRX1L6DuQp0AgALr7oCAACwIoIABX2MAgAEdxrCABa5agIAAE4Sw
Date: Mon, 5 Dec 2016 16:43:52 +0000
Message-ID: <2601191342CEEE43887BDE71AB9772583F0E43B5@irsmsx105.ger.corp.intel.com>
References: <1477486575-25148-1-git-send-email-tomaszx.kulasek@intel.com>
 <1479922585-8640-1-git-send-email-tomaszx.kulasek@intel.com>
 <8317180.L80Qf11uiu@xps13> <20161130074003.GD10340@6wind.com>
 <2601191342CEEE43887BDE71AB9772583F0E20C8@irsmsx105.ger.corp.intel.com>
 <20161201071518.GG10340@6wind.com>
 <2601191342CEEE43887BDE71AB9772583F0E2A9A@irsmsx105.ger.corp.intel.com>
 <20161205150327.GP10340@6wind.com>
In-Reply-To: <20161205150327.GP10340@6wind.com>
Accept-Language: en-IE, en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-originating-ip: [163.33.239.182]
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Subject: Re: [dpdk-dev] [PATCH v12 0/6] add Tx preparation
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <http://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <http://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Mon, 05 Dec 2016 16:46:14 -0000

Hi Adrien,

>=20
> Hi Konstantin,
>=20
> On Fri, Dec 02, 2016 at 01:00:55AM +0000, Ananyev, Konstantin wrote:
> [...]
> > > On Wed, Nov 30, 2016 at 10:54:50AM +0000, Ananyev, Konstantin wrote:
> > > [...]
> > > > > Something is definitely needed here, and only PMDs can provide it=
. I think
> > > > > applications should not have to clear checksum fields or initiali=
ze them to
> > > > > some magic value, same goes for any other offload or hardware lim=
itation
> > > > > that needs to be worked around.
> > > > >
> > > > > tx_prep() is one possible answer to this issue, however as mentio=
ned in the
> > > > > original patch it can be very expensive if exposed by the PMD.
> > > > >
> > > > > Another issue I'm more concerned about is the way limitations are=
 managed
> > > > > (struct rte_eth_desc_lim). While not officially tied to tx_prep()=
, this
> > > > > structure contains new fields that are only relevant to a few dev=
ices, and I
> > > > > fear it will keep growing with each new hardware quirk to manage,=
 breaking
> > > > > ABIs in the process.
> > > >
> > > > Well, if some new HW capability/limitation would arise and we'd lik=
e to support
> > > > it in DPDK, then yes we probably would need to think how to incorpo=
rate it here.
> > > > Do you have anything particular in mind here?
> > >
> > > Nothing in particular, so for the sake of the argument, let's suppose=
 that I
> > > would like to add a field to expose some limitation that only applies=
 to my
> > > PMD during TX but looks generic enough to make sense, e.g. maximum pa=
cket
> > > size when VLAN tagging is requested.
> >
> > Hmm, I didn't hear about such limitations so far, but if it is real cas=
e -
> > sure, feel free to submit the patch.
>=20
> I won't, that was hypothetical.

Then why we discussing it? :)

>=20
> > > PMDs are free to set that field to some
> > > special value (say, 0) if they do not care.
> > >
> > > Since that field exists however, conscious applications should check =
its
> > > value for each packet that needs to be transmitted. This extra code c=
auses a
> > > slowdown just by sitting in the data path. Since it is not the only f=
ield in
> > > that structure, the performance impact can be significant.
> > >
> > > Even though this code is inside applications, it remains unfair to PM=
Ds for
> > > which these tests are irrelevant. This problem is identified and addr=
essed
> > > by tx_prepare().
> >
> > I suppose the question is why do we need:
> > uint16_t nb_seg_max;
> > uint16_t nb_mtu_seg_max;
> > as we now have tx_prepare(), right?
> >
> > For two reasons:
> > 1. Some people might feel that tx_prepare() is not good (smart/fast) en=
ough
> > for them and would prefer to do necessary preparations for TX offloads =
themselves.
> >
> > 2. Even if people do use tx_prepare() they still should take this infor=
mation into accout.
> > As an example ixgbe can't TX packets with then 40 segments.
> > Obviously ixbge_tx_prep() performs that check and returns an error.
>=20
> Problem is that tx_prepare() also provides safeties which are not part of
> tx_burst(), such as not going over nb_mtu_seg_max. Because of this and th=
e
> fact struct rte_eth_desc_lim can grow new fields anytime, application
> developers will be tempted to just call tx_prepare() and focus on more
> useful things.

NP with that, that was an intention beyond introducing it.

>=20
> Put another way, from a user's point of view, tx_prepare() is an opaque
> function that greatly increases tx_burst()'s ability to send mbufs as
> requested, with extra error checking on top; applications not written to =
run
> on a specific PMD/device (all of them ideally) will thus call tx_prepare(=
)
> at some point.
>=20
> > But it wouldn't try to merge/reallocate mbufs for you.
> > User still has to do it himself, or just prevent creating such long cha=
ins somehow.
>=20
> Yes, that's another debate. PMDs could still implement a software fallbac=
k
> for unlikely slow events like these. The number of PMDs is not going to
> decrease, each device having its own set of weird limitations in specific
> cases, PMDs should do their best to process mbufs even if that means slow=
ly
> due to the lack of preparation.
>=20
> tx_prepare() has its uses but should really be optional, in the sense tha=
t
> if that function is not called, tx_burst() should deal with it somehow.

As I said before, I don't think it is a good idea to put everything in tx_b=
urst().
If PMD driver prefer things that way, yes tx_burst() can deal with each and
possible offload requirement itself, but it shouldn't be mandatory.=20

>=20
> > > Thanks to tx_prepare(), these checks are moved back into PMDs where t=
hey
> > > belong. PMDs that do not need them do not have to provide support for
> > > tx_prepare() and do not suffer any performance impact as result;
> > > applications only have to make sure tx_prepare() is always called at =
some
> > > point before tx_burst().
> > >
> > > Once you reach this stage, you've effectively made tx_prepare() manda=
tory
> > > before tx_burst(). If some bug occurs, then perhaps you forgot to cal=
l
> > > tx_prepare(), you just need to add it. The total cost for doing TX is
> > > therefore tx_prepare() + tx_burst().
> > >
> > > I'm perhaps a bit pessimistic mind you, but I do not think tx_prepare=
() will
> > > remain optional for long. Sure, PMDs that do not implement it do not =
care,
> > > I'm focusing on applications, for which the performance impact of cal=
ling
> > > tx_prepare() followed by tx_burst() is higher than a single tx_burst(=
)
> > > performing all the necessary preparation at once.
> > >
> > > [...]
> > > > > Following the same logic, why can't such a thing be made part of =
the TX
> > > > > burst function as well (through a direct call to rte_phdr_cksum_f=
ix()
> > > > > whenever necessary). From an application standpoint, what are the=
 advantages
> > > > > of having to:
> > > > >
> > > > >  if (tx_prep()) // iterate and update mbufs as needed
> > > > >      tx_burst(); // iterate and send
> > > > >
> > > > > Compared to:
> > > > >
> > > > >  tx_burst(); // iterate, update as needed and send
> > > >
> > > > I think that was discussed extensively quite a lot previously here:
> > > > As Thomas already replied - main motivation is to allow user
> > > > to execute them on different stages of packet TX pipeline,
> > > > and probably on different cores.
> > > > I think that provides better flexibility to the user to when/where
> > > > do these preparations and hopefully would lead to better performanc=
e.
> > >
> > > And I agree, I think this use case is valid but does not warrant such=
 a high
> > > penalty when your application does not need that much flexibility. Si=
mple
> > > (yet conscious) applications need the highest performance. Complex on=
es as
> > > you described already suffer quite a bit from IPCs and won't mind a c=
ouple
> > > of extra CPU cycles right?
> >
> > It would mean an extra cache-miss for every packet, so I think performa=
nce hit
> > would be quite significant.
>=20
> A performance hit has to occur somewhere regardless, because something ha=
s
> to be done in order to send packets that need it. Whether this cost is in
> application code or in a PMD function, it remains part of TX.

Depending on the place the final cost would differ quite a lot.
If you call tx_prepare() somewhere close to the place where you fill the pa=
cket header
contents, then most likely the data that tx_prepare() has to access will be=
 already in the cache.
So the performance penalty will be minimal.
If you'll try to access the same data later (at tx_burst), then the possibi=
lity that it would still
be in cache is much less.
If you calling tx_burst() from other core then data would for sure be out o=
f cache,
and  even worse can still be in another core cache.

>=20
> > About the 'simple' case when tx_prep() and tx_burst() are called on the=
 same core,
> > Why do you believe that:
> > tx_prep(); tx_burst(); would be much slower than tx_burst() {tx_prep(),=
 ...}?
>=20
> I mean instead of two function calls with their own loops:
>=20
>  tx_prepare() { foreach (pkt) { check(); extra_check(); ... } }
>=20
>  tx_burst() { foreach (pkt) { check(); stuff(); ... } }
>=20
> You end up with one:
>=20
>  tx_burst() { foreach (pkt) { check(); extra_check(); stuff(); ... } }
>=20
> Which usually is more efficient.

I really doubt that.
If it would be that, what is the point to process packet in bulks?
Usually dividing processing into different stages and at each stage process=
ing
multiple packet at once helps to improve performance.
At  least for IA.
Look for example how we had to change l3fwd to improve its performance.

>=20
> > tx_prep() itself is quite expensive, let say for Intel HW it includes:
> > - read mbuf fileds (2 cache-lines),
> > - read packet header (1/2 cache-lines)
> > - calculate pseudo-header csum
> >  - update packet header
> > Comparing to that price of extra function call seems neglectable
> > (if we TX packets in bursts of course).
>=20
> We agree its performance is a critical issue then, sharing half the read
> steps with tx_burst() would make sense to me.

I didn't understand that sentence.

>=20
> > > Yes they will, therefore we need a method that satisfies both cases.
> > >
> > > As a possible solution, a special mbuf flag could be added to each mb=
uf
> > > having gone through tx_prepare(). That way, tx_burst() could skip som=
e
> > > checks and things it would otherwise have done.
> >
> > That's an interesting idea, but it has one drawback:
> > As I understand, it means that from now on if user doing preparations o=
n his own,
> > he had to setup this flag, otherwise tx_burst() would do extra unnecess=
ary work.
> > So any existing applications that using TX offloads and do preparation =
by themselves
> > would have to be modified to avoid performance loss.
>=20
> In my opinion, users should not do preparation on their own.

People already do it now.

> If we provide a
> generic method, it has to be fast enough to replace theirs. Perhaps not a=
s
> fast since it would work with all PMDs (usual trade-off), but acceptably =
so.
>=20
> > > Another possibility, telling the PMD first that you always intend to =
use
> > > tx_prepare() and getting a simpler/faster tx_burst() callback as a re=
sult.
> >
> > That what we have right now (at least for Intel HW):
> > it is a user responsibility to do the necessary preparations/checks bef=
ore calling tx_burst().
> > With tx_prepare() we just remove from user the headache to implement tx=
_prepare() on his own.
> > Now he can use a 'proper' PMD provided function.
> >
> > My vote still would be for that model.
>=20
> OK, then in a nutshell:
>=20
> 1. Users are not expected to perform preparation/checks themselves anymor=
e,
>    if they do, it's their problem.

I think we need to be backward compatible here.
If the existing app doing what tx_prepare() supposed to do, it should keep =
working.
=20
>=20
> 2. If configured through an API to be defined, tx_burst() can be split in
>    two and applications must call tx_prepare() at some point before
>    tx_burst().
>=20
> 3. Otherwise tx_burst() should perform the necessary preparation and chec=
ks
>    on its own by default (when tx_prepare() is not expected).

As I said before, I don't think it should be mandatory for tx_burst() to do=
 what tx_prepare() does.
If some particular implementation of tx_burst() prefers to do things that w=
ay - that's fine.
But it shouldn't be required to.

>=20
> 4. We probably still need some mbuf flag to mark mbufs that cannot be
>    modified, the refcount could also serve as a hint.

If mbuf can't be modified, you probably just wouldn't call the function tha=
t supposed to do that,
tx_prepare() in that case. =20

Konstantin