From: "Kulasek, TomaszX"
To: Olivier Matz, "dev@dpdk.org"
CC: "Ananyev, Konstantin", "thomas.monjalon@6wind.com"
Date: Wed, 19 Oct 2016 15:42:48 +0000
Subject: Re: [dpdk-dev] [PATCH v6 1/6] ethdev: add Tx preparation

Hi Olivier,

> -----Original Message-----
> From: Olivier Matz
> [mailto:olivier.matz@6wind.com]
> Sent: Tuesday, October 18, 2016 16:57
> To: Kulasek, TomaszX; dev@dpdk.org
> Cc: Ananyev, Konstantin; thomas.monjalon@6wind.com
> Subject: Re: [dpdk-dev] [PATCH v6 1/6] ethdev: add Tx preparation
>
> Hi Tomasz,
>
> I think the principle of tx_prep() is good, it may for instance help to
> remove the function virtio_tso_fix_cksum() from the virtio, and maybe even
> change the mbuf TSO/cksum API.
>
> I have some questions/comments below, I'm sorry it comes very late.
>
> On 10/14/2016 05:05 PM, Tomasz Kulasek wrote:
> > Added API for `rte_eth_tx_prep`
> >
> > uint16_t rte_eth_tx_prep(uint8_t port_id, uint16_t queue_id,
> > 	struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
> >
> > Added fields to the `struct rte_eth_desc_lim`:
> >
> > 	uint16_t nb_seg_max;
> > 		/**< Max number of segments per whole packet. */
> >
> > 	uint16_t nb_mtu_seg_max;
> > 		/**< Max number of segments per one MTU */
>
> Not sure I understand the second one. Is this in case of TSO?
>
> Is it a usual limitation in different network hardware?
> Can this info be retrieved/used by the application?
>

Yes, the limit on the number of segments may differ depending on TSO/non-TSO, e.g. for the Fortville NIC. This information is available to the application.

> >
> > Created `rte_pkt.h` header with commonly used functions:
> >
> > int rte_validate_tx_offload(struct rte_mbuf *m)
> > 	to validate general requirements for tx offload in a packet, such as
> > 	flag completeness. In the current implementation this function is
> > 	called optionally when RTE_LIBRTE_ETHDEV_DEBUG is enabled.
> >
> > int rte_phdr_cksum_fix(struct rte_mbuf *m)
> > 	to fix pseudo header checksum for TSO and non-TSO tcp/udp packets
> > 	before hardware tx checksum offload.
> > 	- for non-TSO tcp/udp packets full pseudo-header checksum is
> > 	counted and set.
> > 	- for TSO the IP payload length is not included.
>
> Why not in rte_net.h?
>
>
> > [...]
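Coming back to the segment limits: a self-contained sketch of how an application might use them. The struct below only mirrors the two new fields (in a real application they would come from rte_eth_dev_info_get() via dev_info.tx_desc_lim), and the helper name is made up for illustration:

```c
#include <stdint.h>
#include <errno.h>

/* Stand-in for the two fields added to struct rte_eth_desc_lim. */
struct desc_lim {
	uint16_t nb_seg_max;     /* max segments per whole (e.g. TSO) packet */
	uint16_t nb_mtu_seg_max; /* max segments per single MTU-sized frame  */
};

/* Reject a packet whose segment count exceeds the relevant device limit. */
static int
check_seg_limits(const struct desc_lim *lim, uint16_t nb_segs, int is_tso)
{
	uint16_t limit = is_tso ? lim->nb_seg_max : lim->nb_mtu_seg_max;

	return (nb_segs > limit) ? -EINVAL : 0;
}
```

The point is that an application can pre-filter packets against these limits instead of learning about the violation only from a failed transmit.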
> >
> > @@ -2816,6 +2825,82 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
> >  	return (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id], tx_pkts,
> >  		nb_pkts); }
> >
> > +/**
> > + * Process a burst of output packets on a transmit queue of an Ethernet device.
> > + *
> > + * The rte_eth_tx_prep() function is invoked to prepare output packets to be
> > + * transmitted on the output queue *queue_id* of the Ethernet device designated
> > + * by its *port_id*.
> > + * The *nb_pkts* parameter is the number of packets to be prepared which are
> > + * supplied in the *tx_pkts* array of *rte_mbuf* structures, each of them
> > + * allocated from a pool created with rte_pktmbuf_pool_create().
> > + * For each packet to send, the rte_eth_tx_prep() function performs
> > + * the following operations:
> > + *
> > + * - Check if the packet meets device requirements for tx offloads.
>
> Do you mean hardware requirements?
> Can the application be aware of these requirements? I mean capability
> flags, or something in dev_infos?

Yes. If some offloads cannot be handled by hardware, it fails. Also if, e.g., the number of segments is invalid, and so on.

>
> Maybe the comment could be more precise?
>
> > + * - Check limitations about number of segments.
> > + *
> > + * - Check additional requirements when debug is enabled.
>
> What kind of additional requirements?
>

We may assume that the application sets things right, e.g. the IP header version, which is required for most checksum offloads. To avoid additional performance overhead, these checks are done only when debug is on.

> > + *
> > + * - Update and/or reset required checksums when tx offload is set for a packet.
> > + *
>
> By reading this, I think it may not be clear for the user about what
> should be set in the mbuf. In the mbuf API, it is said:
>
>  * TCP segmentation offload.
To enable this offload feature for a
>  * packet to be transmitted on hardware supporting TSO:
>  *  - set the PKT_TX_TCP_SEG flag in mbuf->ol_flags (this flag implies
>  *    PKT_TX_TCP_CKSUM)
>  *  - set the flag PKT_TX_IPV4 or PKT_TX_IPV6
>  *  - if it's IPv4, set the PKT_TX_IP_CKSUM flag and write the IP checksum
>  *    to 0 in the packet
>  *  - fill the mbuf offload information: l2_len, l3_len, l4_len, tso_segsz
>  *  - calculate the pseudo header checksum without taking ip_len in account,
>  *    and set it in the TCP header. Refer to rte_ipv4_phdr_cksum() and
>  *    rte_ipv6_phdr_cksum() that can be used as helpers.
>
>
> If I understand well, using tx_prep(), the user will have to do the same
> except writing the IP checksum to 0, and without setting the TCP pseudo
> header checksum, right?
>

Right, but this header information is still valid for a tx_burst operation done without the preparation stage.

>
> > + * The rte_eth_tx_prep() function returns the number of packets ready to be
> > + * sent. A return value equal to *nb_pkts* means that all packets are valid and
> > + * ready to be sent.
> > + *
> > + * @param port_id
> > + *   The port identifier of the Ethernet device.
> > + * @param queue_id
> > + *   The index of the transmit queue through which output packets must be
> > + *   sent.
> > + *   The value must be in the range [0, nb_tx_queue - 1] previously supplied
> > + *   to rte_eth_dev_configure().
> > + * @param tx_pkts
> > + *   The address of an array of *nb_pkts* pointers to *rte_mbuf* structures
> > + *   which contain the output packets.
> > + * @param nb_pkts
> > + *   The maximum number of packets to process.
> > + * @return
> > + *   The number of packets correct and ready to be sent. The return value can be
> > + *   less than the value of the *tx_pkts* parameter when some packet doesn't
> > + *   meet device requirements, with rte_errno set appropriately.
> > + */
>
> Can we add the constraint that invalid packets are left untouched?
>
> I think most of the time there will be a software fallback in that case,
> so it would be good to ensure that this function does not change the flags
> or the packet data.

In the current implementation, if a packet is invalid, its data is never modified; only checks are done. The only exception is when a checksum needs to be updated or initialized, but that is done after packet validation. If we want to use/restore a packet in the application, an invalid packet should not be changed in any way.

>
> Another thing that could be interesting for the caller is to know the
> reason of the failure. Maybe the different errno types could be detailed
> here. For instance:
> - EINVAL: offload flags are not correctly set (i.e. would fail whatever
>   the hardware)
> - ENOTSUP: the offload feature is not supported by the hardware
> - ...
>

Ok.

> > +
> > +#ifdef RTE_ETHDEV_TX_PREP
> > +
> > +static inline uint16_t
> > +rte_eth_tx_prep(uint8_t port_id, uint16_t queue_id, struct rte_mbuf **tx_pkts,
> > +		uint16_t nb_pkts)
> > +{
> > +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> > +
> > +	if (!dev->tx_pkt_prep)
> > +		return nb_pkts;
> > +
> > +#ifdef RTE_LIBRTE_ETHDEV_DEBUG
> > +	if (queue_id >= dev->data->nb_tx_queues) {
> > +		RTE_PMD_DEBUG_TRACE("Invalid TX queue_id=%d\n", queue_id);
> > +		rte_errno = -EINVAL;
> > +		return 0;
> > +	}
> > +#endif
>
> Why checking the queue_id but not the port_id?
>
> Maybe the API comment should also be modified to say that the port_id has
> to be valid, because most ethdev functions do the check.
>

I can add this check when debug is on to make it more complete, and update the API comment for the default case.
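The contract discussed above (return value less than nb_pkts at the first bad packet, with an errno telling why) can be modelled with a self-contained toy. tx_prep_sim(), sim_errno and the flag bits below are all made up for illustration; they are not the real API:

```c
#include <stdint.h>
#include <errno.h>

#define SIM_TX_TCP_CKSUM (1ULL << 0) /* illustrative flag bits, not PKT_TX_* */
#define SIM_TX_TSO       (1ULL << 1)

static int sim_errno; /* stands in for rte_errno */

/* Toy model of the tx_prep contract: validate packets (here, just their
 * offload flags) in order and stop at the first one the "hardware" cannot
 * handle; the return value is the count of valid leading packets. */
static uint16_t
tx_prep_sim(const uint64_t *ol_flags, uint16_t nb_pkts, uint64_t hw_caps)
{
	uint16_t i;

	for (i = 0; i < nb_pkts; i++) {
		if (ol_flags[i] & ~hw_caps) {
			sim_errno = ENOTSUP; /* offload unsupported by HW */
			return i;
		}
	}
	return nb_pkts;
}
```

A caller would then either apply a software fallback to the failing packet or drop it, and transmit the leading valid ones.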
> > +
> > +	return (*dev->tx_pkt_prep)(dev->data->tx_queues[queue_id],
> > +			tx_pkts, nb_pkts);
> > +}
> > +
> > +#else
> > +
> > +static inline uint16_t
> > +rte_eth_tx_prep(__rte_unused uint8_t port_id, __rte_unused uint16_t queue_id,
> > +		__rte_unused struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
> > +{
> > +	return nb_pkts;
> > +}
> > +
> > +#endif
> > +
>
> nit: I wonder if the #else part should be inside the function instead
> (with the proper RTE_SET_USED()), it would avoid to define the prototype
> twice.
>
> >  typedef void (*buffer_tx_error_fn)(struct rte_mbuf **unsent, uint16_t count,
> >  		void *userdata);
> >
> > diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> > index 109e666..cfd6284 100644
> > --- a/lib/librte_mbuf/rte_mbuf.h
> > +++ b/lib/librte_mbuf/rte_mbuf.h
> > @@ -276,6 +276,15 @@ extern "C" {
> >   */
> >  #define PKT_TX_OUTER_IPV4   (1ULL << 59)
> >
> > +#define PKT_TX_OFFLOAD_MASK (    \
> > +		PKT_TX_IP_CKSUM |        \
> > +		PKT_TX_L4_MASK |         \
> > +		PKT_TX_OUTER_IP_CKSUM |  \
> > +		PKT_TX_TCP_SEG |         \
> > +		PKT_TX_QINQ_PKT |        \
> > +		PKT_TX_VLAN_PKT |        \
> > +		PKT_TX_TUNNEL_MASK)
> > +
>
> Could you add an API comment?
>
> > --- /dev/null
> > +++ b/lib/librte_net/rte_pkt.h
> >
> > [...]
> > +/**
> > + * Validate general requirements for tx offload in packet.
> > + */
>
> The API comment does not have the usual format.
>
> > +static inline int
> > +rte_validate_tx_offload(struct rte_mbuf *m)
>
> should be const struct rte_mbuf *m
>
> > +{
> > +	uint64_t ol_flags = m->ol_flags;
> > +
> > +	/* Does packet set any of available offloads?
 */
> > +	if (!(ol_flags & PKT_TX_OFFLOAD_MASK))
> > +		return 0;
> > +
> > +	/* IP checksum can be counted only for IPv4 packet */
> > +	if ((ol_flags & PKT_TX_IP_CKSUM) && (ol_flags & PKT_TX_IPV6))
> > +		return -EINVAL;
> > +
> > +	if (ol_flags & (PKT_TX_L4_MASK | PKT_TX_TCP_SEG))
> > +		/* IP type not set */
> > +		if (!(ol_flags & (PKT_TX_IPV4 | PKT_TX_IPV6)))
> > +			return -EINVAL;
> > +
> > +	if (ol_flags & PKT_TX_TCP_SEG)
> > +		/* PKT_TX_IP_CKSUM offload not set for IPv4 TSO packet */
> > +		if ((m->tso_segsz == 0) ||
> > +				((ol_flags & PKT_TX_IPV4) && !(ol_flags & PKT_TX_IP_CKSUM)))
> > +			return -EINVAL;
> > +
> > +	/* PKT_TX_OUTER_IP_CKSUM set for non outer IPv4 packet. */
> > +	if ((ol_flags & PKT_TX_OUTER_IP_CKSUM) && !(ol_flags & PKT_TX_OUTER_IPV4))
> > +		return -EINVAL;
> > +
> > +	return 0;
> > +}
>
> It looks this function is only used when RTE_LIBRTE_ETHDEV_DEBUG is set.
>

Yes, for performance reasons we expect that the application sets these values right. These are most of the "additional checks" mentioned before.

> I'd say this function should go in rte_mbuf.h, because it's purely related
> to mbuf flags, and does not rely on packet data or network headers.
>
>
> > +
> > +/**
> > + * Fix pseudo header checksum for TSO and non-TSO tcp/udp packets before
> > + * hardware tx checksum.
> > + * For non-TSO tcp/udp packets full pseudo-header checksum is counted and set.
> > + * For TSO the IP payload length is not included.
> > + */
>
> The API comment should be fixed.

Ok.
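For concreteness, the flag rules that the validation above enforces can be sketched as a self-contained toy; the bit positions are illustrative, not DPDK's real PKT_TX_* values:

```c
#include <stdint.h>
#include <errno.h>

#define F_IP_CKSUM (1ULL << 0) /* illustrative bits, not the real PKT_TX_* */
#define F_IPV4     (1ULL << 1)
#define F_IPV6     (1ULL << 2)
#define F_TCP_SEG  (1ULL << 3)

/* Toy model of the checks done by rte_validate_tx_offload(). */
static int
validate_tx_offload_sim(uint64_t ol_flags, uint16_t tso_segsz)
{
	/* IP checksum can be offloaded only for IPv4. */
	if ((ol_flags & F_IP_CKSUM) && (ol_flags & F_IPV6))
		return -EINVAL;

	/* TSO needs a segment size and, for IPv4, IP checksum offload. */
	if (ol_flags & F_TCP_SEG) {
		if (tso_segsz == 0)
			return -EINVAL;
		if ((ol_flags & F_IPV4) && !(ol_flags & F_IP_CKSUM))
			return -EINVAL;
	}
	return 0;
}
```

These are pure flag checks, which supports the suggestion that the real function could live in rte_mbuf.h.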
>
> > +static inline int
> > +rte_phdr_cksum_fix(struct rte_mbuf *m)
> > +{
> > +	struct ipv4_hdr *ipv4_hdr;
> > +	struct ipv6_hdr *ipv6_hdr;
> > +	struct tcp_hdr *tcp_hdr;
> > +	struct udp_hdr *udp_hdr;
> > +	uint64_t ol_flags = m->ol_flags;
> > +	uint64_t inner_l3_offset = m->l2_len;
> > +
> > +	if (ol_flags & PKT_TX_OUTER_IP_CKSUM)
> > +		inner_l3_offset += m->outer_l2_len + m->outer_l3_len;
> > +
> > +	if ((ol_flags & PKT_TX_UDP_CKSUM) == PKT_TX_UDP_CKSUM) {
> > +		if (ol_flags & PKT_TX_IPV4) {
> > +			ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
> > +					inner_l3_offset);
> > +
> > +			if (ol_flags & PKT_TX_IP_CKSUM)
> > +				ipv4_hdr->hdr_checksum = 0;
> > +
> > +			udp_hdr = (struct udp_hdr *)((char *)ipv4_hdr + m->l3_len);
> > +			udp_hdr->dgram_cksum = rte_ipv4_phdr_cksum(ipv4_hdr, ol_flags);
> > +		} else {
> > +			ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct ipv6_hdr *,
> > +					inner_l3_offset);
> > +			/* non-TSO udp */
> > +			udp_hdr = rte_pktmbuf_mtod_offset(m, struct udp_hdr *,
> > +					inner_l3_offset + m->l3_len);
> > +			udp_hdr->dgram_cksum = rte_ipv6_phdr_cksum(ipv6_hdr, ol_flags);
> > +		}
> > +	} else if ((ol_flags & PKT_TX_TCP_CKSUM) ||
> > +			(ol_flags & PKT_TX_TCP_SEG)) {
> > +		if (ol_flags & PKT_TX_IPV4) {
> > +			ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
> > +					inner_l3_offset);
> > +
> > +			if (ol_flags & PKT_TX_IP_CKSUM)
> > +				ipv4_hdr->hdr_checksum = 0;
> > +
> > +			/* non-TSO tcp or TSO */
> > +			tcp_hdr = (struct tcp_hdr *)((char *)ipv4_hdr + m->l3_len);
> > +			tcp_hdr->cksum = rte_ipv4_phdr_cksum(ipv4_hdr, ol_flags);
> > +		} else {
> > +			ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct ipv6_hdr *,
> > +					inner_l3_offset);
> > +			/* non-TSO tcp or TSO */
> > +			tcp_hdr = rte_pktmbuf_mtod_offset(m, struct tcp_hdr *,
> > +					inner_l3_offset + m->l3_len);
> > +			tcp_hdr->cksum = rte_ipv6_phdr_cksum(ipv6_hdr, ol_flags);
> > +		}
> > +	}
> > +	return 0;
> > +}
> > +
> > +#endif /* _RTE_PKT_H_ */
> >
>
> The
> function expects that all the network headers are in the first mbuf segment,
> and that each of them is contiguous.
>

Yes, I see...

> Also, I had an interesting remark from Stephen [1] on a similar code.
> If the mbuf is a clone, it will modify the data of the direct mbuf, which
> should be read-only. Note that it is likely to happen in a TCP stack,
> because the packet is kept locally in case it has to be retransmitted.
> Cloning a mbuf is more efficient than duplicating it.
>
> I plan to fix it in virtio code by "uncloning" the headers.
>
> [1] http://dpdk.org/ml/archives/dev/2016-October/048873.html
>
>
>
> Regards,
> Olivier

Thanks,
Tomasz