From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <tomaszx.kulasek@intel.com>
Received: from mga11.intel.com (mga11.intel.com [192.55.52.93])
 by dpdk.org (Postfix) with ESMTP id 1FA712C60
 for <dev@dpdk.org>; Fri,  9 Dec 2016 18:19:34 +0100 (CET)
Received: from fmsmga001.fm.intel.com ([10.253.24.23])
 by fmsmga102.fm.intel.com with ESMTP; 09 Dec 2016 09:19:20 -0800
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.33,324,1477983600"; d="scan'208";a="1079788608"
Received: from irsmsx110.ger.corp.intel.com ([163.33.3.25])
 by fmsmga001.fm.intel.com with ESMTP; 09 Dec 2016 09:19:19 -0800
Received: from irsmsx102.ger.corp.intel.com ([169.254.2.79]) by
 irsmsx110.ger.corp.intel.com ([163.33.3.25]) with mapi id 14.03.0248.002;
 Fri, 9 Dec 2016 17:19:18 +0000
From: "Kulasek, TomaszX" <tomaszx.kulasek@intel.com>
To: Olivier Matz <olivier.matz@6wind.com>, "Ananyev, Konstantin"
 <konstantin.ananyev@intel.com>
CC: Thomas Monjalon <thomas.monjalon@6wind.com>, "dev@dpdk.org" <dev@dpdk.org>
Thread-Topic: [dpdk-dev] [PATCH v12 1/6] ethdev: add Tx preparation
Thread-Index: AQHSRbBkXBK52xh5bEGv2Yak9bo9p6DzVKAAgACQoACAAHpagIAAhEeAgAmAjICAAWYaMA==
Date: Fri, 9 Dec 2016 17:19:18 +0000
Message-ID: <3042915272161B4EB253DA4D77EB373A14F5A409@IRSMSX102.ger.corp.intel.com>
References: <1477486575-25148-1-git-send-email-tomaszx.kulasek@intel.com>
 <1479922585-8640-1-git-send-email-tomaszx.kulasek@intel.com>
 <1479922585-8640-2-git-send-email-tomaszx.kulasek@intel.com>
 <7834627.cBDVu3uoNi@xps13>
 <2601191342CEEE43887BDE71AB9772583F0E2AB0@irsmsx105.ger.corp.intel.com>
 <20161202092425.13752e2f@platinum>
 <2601191342CEEE43887BDE71AB9772583F0E3405@irsmsx105.ger.corp.intel.com>
 <20161208182417.32ec3933@platinum>
In-Reply-To: <20161208182417.32ec3933@platinum>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-originating-ip: [163.33.239.180]
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Subject: Re: [dpdk-dev] [PATCH v12 1/6] ethdev: add Tx preparation
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <http://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <http://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Fri, 09 Dec 2016 17:19:35 -0000

Hi Oliver,

My 5 cents below:

> -----Original Message-----
> From: Olivier Matz [mailto:olivier.matz@6wind.com]
> Sent: Thursday, December 8, 2016 18:24
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: Thomas Monjalon <thomas.monjalon@6wind.com>; Kulasek, TomaszX
> <tomaszx.kulasek@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v12 1/6] ethdev: add Tx preparation
>=20
> Hi Konstantin,
>=20
> On Fri, 2 Dec 2016 16:17:51 +0000, "Ananyev, Konstantin"
> <konstantin.ananyev@intel.com> wrote:
> > Hi Olivier,
> >
> > > -----Original Message-----
> > > From: Olivier Matz [mailto:olivier.matz@6wind.com]
> > > Sent: Friday, December 2, 2016 8:24 AM
> > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > > Cc: Thomas Monjalon <thomas.monjalon@6wind.com>; Kulasek, TomaszX
> > > <tomaszx.kulasek@intel.com>; dev@dpdk.org Subject: Re: [dpdk-dev]
> > > [PATCH v12 1/6] ethdev: add Tx preparation
> > >
> > > Hi Konstantin,
> > >
> > > On Fri, 2 Dec 2016 01:06:30 +0000, "Ananyev, Konstantin"
> > > <konstantin.ananyev@intel.com> wrote:
> > > > >
> > > > > 2016-11-23 18:36, Tomasz Kulasek:
> > > > > > +/**
> > > > > > + * Process a burst of output packets on a transmit queue of
> > > > > > an Ethernet device.
> > > > > > + *
> > > > > > + * The rte_eth_tx_prepare() function is invoked to prepare
> > > > > > output packets to be
> > > > > > + * transmitted on the output queue *queue_id* of the Ethernet
> > > > > > device designated
> > > > > > + * by its *port_id*.
> > > > > > + * The *nb_pkts* parameter is the number of packets to be
> > > > > > prepared which are
> > > > > > + * supplied in the *tx_pkts* array of *rte_mbuf* structures,
> > > > > > each of them
> > > > > > + * allocated from a pool created with
> > > > > > rte_pktmbuf_pool_create().
> > > > > > + * For each packet to send, the rte_eth_tx_prepare() function
> > > > > > performs
> > > > > > + * the following operations:
> > > > > > + *
> > > > > > + * - Check if packet meets devices requirements for tx
> > > > > > offloads.
> > > > > > + *
> > > > > > + * - Check limitations about number of segments.
> > > > > > + *
> > > > > > + * - Check additional requirements when debug is enabled.
> > > > > > + *
> > > > > > + * - Update and/or reset required checksums when tx offload
> > > > > > is set for packet.
> > > > > > + *
> > > > > > + * Since this function can modify packet data, provided mbufs
> > > > > > must be safely
> > > > > > + * writable (e.g. modified data cannot be in shared
> > > > > > segment).
> > > > >
> > > > > I think we will have to remove this limitation in next releases.
> > > > > As we don't know how it could affect the API, I suggest to
> > > > > declare this API EXPERIMENTAL.
> > > >
> > > > While I don't really mind to mart it as experimental, I don't
> > > > really understand the reasoning: Why " this function can modify
> > > > packet data, provided mbufs must be safely writable" suddenly
> > > > becomes a problem? That seems like and obvious limitation to me
> > > > and let say tx_burst() has the same one. Second, I don't see how
> > > > you are going to remove it without introducing a heavy performance
> > > > impact. Konstantin
> > >
> > > About tx_burst(), I don't think we should force the user to provide
> > > a writable mbuf. There are many use cases where passing a clone
> > > already works as of today and it avoids duplicating the mbuf data.
> > > For instance: traffic generator, multicast, bridging/tap, etc...
> > >
> > > Moreover, this requirement would be inconsistent with the model you
> > > are proposing in case of pipeline:
> > >  - tx_prepare() on core X, may update the data
> > >  - tx_burst() on core Y, should not touch the data to avoid cache
> > > misses
> >
> > Probably I wasn't very clear in my previous mail.
> > I am not saying that we should force the user to pass a writable mbuf.
> > What I am saying that for tx_burst() current expectation is that after
> > mbuf is handled to tx_burst() user shouldn't try to modify its buffer
> > contents till TX engine is done with the buffer (mbuf_free() is called
> > by TX func for it). For tx_prep(), I think, it is the same though
> > restrictions are a bit more strict: user should not try to read/write
> > to the mbuf while tx_prep() is not finished with it. What puzzles me
> > is that why that should be the reason to mark tx_prep() as
> > experimental. Konstantin
>=20
> To be sure we're on the same page, let me reword:
>=20
> - mbufs passed to tx_prepare() by the application must have their
>   headers (l2_len + l3_len + l4_len) writable because the phdr checksum
>   can be replaced. It could be precised in the api comment.
>=20
> - mbufs passed to tx_burst() must not be modified by the driver/hw, nor
>   by the application.
>=20
>=20
> About the API itself, I have one more question. I know you've already
> discussed this a bit with Adrien, I don't want to spawn a new big thread
> from here ;)
>=20
> The API provides tx_prepare() to check the packets have the proper format=
,
> and possibly modify them (ex: csum offload / tso) to match hw
> requirements. So it does both checks (number of segments) and fixes
> (csum/tso). What determines things that should be checked and things that
> should be fixed?
>=20

1) Performance -- we may assume that packets are created with the common ru=
les (e.g. doesn't tries to count IP checksum for IPv6 packet, sets all requ=
ired fields, etc.). For now, additional checks are done only in DEBUG mode.

2) Uncommon requirements, such a number of segments for TSO/non-TSO, where =
invalid values can cause unexpected results (or even hang device on burst),=
 so it is critical (at least for ixgbe and i40e) and must be checked.

3) Checksum field must be initialized in a hardware specific way to count i=
t properly.

4) If packet is invalid its content isn't modified nor restored.

> The application gets few information from tx_prepare() about what should
> be done to make the packet accepted by the hw, and the actions will
> probably be different depending on hardware. So could we imagine that in
> the future the function also tries to fix the packet? I've seen your
> comment saying that it has to be an application decision, so what about
> having a parameter saying "fix the packet" or "don't fix it"?
>=20

The main question here is how invasive in packet data tx_prepare should be.

If I understand you correctly, you're talking about the situation when tx_p=
repare will try to restore packet internally, e.g. linearize data for i40e =
to meet number of segments requirements, split mbufs, when are too large, m=
aybe someone will wanted to have full software checksum computation fallbac=
k for devices which doesn't support it, and so on? And then this parameter =
will indicate "do the required work for me", or "just check, initialize wha=
t should be initialized, if fails, let me decide"?

Implemented model is quite simple and clear:

1) Validate uncommon requirements which may cause a problem and let applica=
tion to deal with it.
2) If packet is valid, initialize required fields according to the hardware=
 requirements.

In fact, it doesn't fix anything for now, but initializes checksums, and th=
e main reason why it is done here, is the fact that it cannot be done in a =
trivial way in the application itself.

IMHO, we should to keep this API as simple as it possible and not let it gr=
ow in unexpected way (also for performance reason).

> About rte_eth_desc_lim->nb_seg_max and
> rte_eth_desc_lim->nb_mtu_seg_max, I'm still quite reserved, especially fo=
r
> the 2nd one, because I don't see how it can be used by the application.
> Well, it does not hurt to have them, but for me it looks a bit useless.
>=20

* "nb_seg_max": Maximum number of segments per whole packet.
* "nb_mtu_seg_max": Maximum number of segments per one MTU.

Application can use provided values in a fallowed way:

* For non-TSO packet, a single transmit packet may span up to "nb_mtu_seg_m=
ax" buffers.
* For TSO packet the total number of data descriptors is "nb_seg_max", and =
each segment within the TSO may span up to "nb_mtu_seg_max".

> Last thing, I think this API should become the default in the future.
> For instance, it would prevent the application to calculate a phdr csum
> that will not be used by the hw. Not calling tx_prepare() would require
> the user/application to exactly knows the underlying hw and the kind of
> packet that are generated. So for me it means we'll need to also update
> other examples (other testpmd engines, l2fwd, ...). Do you agree?
>=20

Most of sample applications doesn't use TX offloads at all and doesn't coun=
t checksums. So the question is "should tx_prepare be used even if not requ=
ired?"

>=20
> Regards,
> Olivier
>=20

Tomasz