From: "Kulasek, TomaszX"
To: Olivier Matz, "dev@dpdk.org"
CC: "Ananyev, Konstantin", "thomas.monjalon@6wind.com"
Date: Wed, 19 Oct 2016 15:42:48 +0000
Subject: Re: [dpdk-dev] [PATCH v6 1/6] ethdev: add Tx preparation

Hi Olivier,

> -----Original Message-----
> From: Olivier Matz
> [mailto:olivier.matz@6wind.com]
> Sent: Tuesday, October 18, 2016 16:57
> To: Kulasek, TomaszX; dev@dpdk.org
> Cc: Ananyev, Konstantin; thomas.monjalon@6wind.com
> Subject: Re: [dpdk-dev] [PATCH v6 1/6] ethdev: add Tx preparation
>
> Hi Tomasz,
>
> I think the principle of tx_prep() is good, it may for instance help to
> remove the function virtio_tso_fix_cksum() from the virtio, and maybe even
> change the mbuf TSO/cksum API.
>
> I have some questions/comments below, I'm sorry it comes very late.
>
> On 10/14/2016 05:05 PM, Tomasz Kulasek wrote:
> > Added API for `rte_eth_tx_prep`
> >
> > uint16_t rte_eth_tx_prep(uint8_t port_id, uint16_t queue_id,
> > 	struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
> >
> > Added fields to the `struct rte_eth_desc_lim`:
> >
> > 	uint16_t nb_seg_max;
> > 		/**< Max number of segments per whole packet. */
> >
> > 	uint16_t nb_mtu_seg_max;
> > 		/**< Max number of segments per one MTU */
>
> Not sure I understand the second one. Is this in case of TSO?
>
> Is it a usual limitation in different network hardware?
> Can this info be retrieved/used by the application?
>

Yes, the limit on the number of segments may differ depending on TSO/non-TSO, e.g. for the Fortville NIC. This information is available to the application.

> >
> > Created `rte_pkt.h` header with commonly used functions:
> >
> > int rte_validate_tx_offload(struct rte_mbuf *m)
> > 	to validate general requirements for tx offload in a packet, such as
> > 	flag completeness. In the current implementation this function is
> > 	called optionally when RTE_LIBRTE_ETHDEV_DEBUG is enabled.
> >
> > int rte_phdr_cksum_fix(struct rte_mbuf *m)
> > 	to fix pseudo header checksum for TSO and non-TSO tcp/udp packets
> > 	before hardware tx checksum offload.
> > 	- for non-TSO tcp/udp packets full pseudo-header checksum is
> > 	counted and set.
> > 	- for TSO the IP payload length is not included.
>
> Why not in rte_net.h?
>
>
> > [...]
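Coming back to the segment limits: a self-contained sketch of how an application might use them. The struct below only mirrors the two new fields (in a real application they would come from rte_eth_dev_info_get() via dev_info.tx_desc_lim), and the helper name is made up for illustration:

```c
#include <stdint.h>
#include <errno.h>

/* Stand-in for the two fields added to struct rte_eth_desc_lim. */
struct desc_lim {
	uint16_t nb_seg_max;     /* max segments per whole (e.g. TSO) packet */
	uint16_t nb_mtu_seg_max; /* max segments per single MTU-sized frame  */
};

/* Reject a packet whose segment count exceeds the relevant device limit. */
static int
check_seg_limits(const struct desc_lim *lim, uint16_t nb_segs, int is_tso)
{
	uint16_t limit = is_tso ? lim->nb_seg_max : lim->nb_mtu_seg_max;

	return (nb_segs > limit) ? -EINVAL : 0;
}
```

The point is that an application can pre-filter packets against these limits instead of learning about the violation only from a failed transmit.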
> >
> > @@ -2816,6 +2825,82 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
> >  	return (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id], tx_pkts,
> >  		nb_pkts); }
> >
> > +/**
> > + * Process a burst of output packets on a transmit queue of an Ethernet device.
> > + *
> > + * The rte_eth_tx_prep() function is invoked to prepare output packets to be
> > + * transmitted on the output queue *queue_id* of the Ethernet device designated
> > + * by its *port_id*.
> > + * The *nb_pkts* parameter is the number of packets to be prepared which are
> > + * supplied in the *tx_pkts* array of *rte_mbuf* structures, each of them
> > + * allocated from a pool created with rte_pktmbuf_pool_create().
> > + * For each packet to send, the rte_eth_tx_prep() function performs
> > + * the following operations:
> > + *
> > + * - Check if the packet meets device requirements for tx offloads.
>
> Do you mean hardware requirements?
> Can the application be aware of these requirements? I mean capability
> flags, or something in dev_infos?

Yes. If some offloads cannot be handled by hardware, it fails. Also if, e.g., the number of segments is invalid, and so on.

>
> Maybe the comment could be more precise?
>
> > + * - Check limitations about number of segments.
> > + *
> > + * - Check additional requirements when debug is enabled.
>
> What kind of additional requirements?
>

We may assume that the application sets things right, e.g. the IP header version, which is required for most checksum offloads. To avoid additional performance overhead, these checks are done only when debug is on.

> > + *
> > + * - Update and/or reset required checksums when tx offload is set for a packet.
> > + *
>
> By reading this, I think it may not be clear for the user about what
> should be set in the mbuf. In the mbuf API, it is said:
>
>  * TCP segmentation offload.
To enable this offload feature for a
>  * packet to be transmitted on hardware supporting TSO:
>  *  - set the PKT_TX_TCP_SEG flag in mbuf->ol_flags (this flag implies
>  *    PKT_TX_TCP_CKSUM)
>  *  - set the flag PKT_TX_IPV4 or PKT_TX_IPV6
>  *  - if it's IPv4, set the PKT_TX_IP_CKSUM flag and write the IP checksum
>  *    to 0 in the packet
>  *  - fill the mbuf offload information: l2_len, l3_len, l4_len, tso_segsz
>  *  - calculate the pseudo header checksum without taking ip_len in account,
>  *    and set it in the TCP header. Refer to rte_ipv4_phdr_cksum() and
>  *    rte_ipv6_phdr_cksum() that can be used as helpers.
>
>
> If I understand well, using tx_prep(), the user will have to do the same
> except writing the IP checksum to 0, and without setting the TCP pseudo
> header checksum, right?
>

Right, but this header information is still valid for a tx_burst operation done without the preparation stage.

>
> > + * The rte_eth_tx_prep() function returns the number of packets ready to be
> > + * sent. A return value equal to *nb_pkts* means that all packets are valid and
> > + * ready to be sent.
> > + *
> > + * @param port_id
> > + *   The port identifier of the Ethernet device.
> > + * @param queue_id
> > + *   The index of the transmit queue through which output packets must be
> > + *   sent.
> > + *   The value must be in the range [0, nb_tx_queue - 1] previously supplied
> > + *   to rte_eth_dev_configure().
> > + * @param tx_pkts
> > + *   The address of an array of *nb_pkts* pointers to *rte_mbuf* structures
> > + *   which contain the output packets.
> > + * @param nb_pkts
> > + *   The maximum number of packets to process.
> > + * @return
> > + *   The number of packets correct and ready to be sent. The return value can be
> > + *   less than the value of the *tx_pkts* parameter when some packet doesn't
> > + *   meet device requirements, with rte_errno set appropriately.
> > + */
>
> Can we add the constraint that invalid packets are left untouched?
>
> I think most of the time there will be a software fallback in that case,
> so it would be good to ensure that this function does not change the flags
> or the packet data.

In the current implementation, if a packet is invalid, its data is never modified; only checks are done. The only exception is when a checksum needs to be updated or initialized, but that is done after packet validation. If we want to use/restore a packet in the application, an invalid packet should not be changed in any way.

>
> Another thing that could be interesting for the caller is to know the
> reason of the failure. Maybe the different errno types could be detailed
> here. For instance:
> - EINVAL: offload flags are not correctly set (i.e. would fail whatever
>   the hardware)
> - ENOTSUP: the offload feature is not supported by the hardware
> - ...
>

Ok.

> > +
> > +#ifdef RTE_ETHDEV_TX_PREP
> > +
> > +static inline uint16_t
> > +rte_eth_tx_prep(uint8_t port_id, uint16_t queue_id, struct rte_mbuf **tx_pkts,
> > +		uint16_t nb_pkts)
> > +{
> > +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> > +
> > +	if (!dev->tx_pkt_prep)
> > +		return nb_pkts;
> > +
> > +#ifdef RTE_LIBRTE_ETHDEV_DEBUG
> > +	if (queue_id >= dev->data->nb_tx_queues) {
> > +		RTE_PMD_DEBUG_TRACE("Invalid TX queue_id=%d\n", queue_id);
> > +		rte_errno = -EINVAL;
> > +		return 0;
> > +	}
> > +#endif
>
> Why checking the queue_id but not the port_id?
>
> Maybe the API comment should also be modified to say that the port_id has
> to be valid, because most ethdev functions do the check.
>

I can add this check when debug is on to make it more complete, and update the API comment for the default case.
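The contract discussed above (return value less than nb_pkts at the first bad packet, with an errno telling why) can be modelled with a self-contained toy. tx_prep_sim(), sim_errno and the flag bits below are all made up for illustration; they are not the real API:

```c
#include <stdint.h>
#include <errno.h>

#define SIM_TX_TCP_CKSUM (1ULL << 0) /* illustrative flag bits, not PKT_TX_* */
#define SIM_TX_TSO       (1ULL << 1)

static int sim_errno; /* stands in for rte_errno */

/* Toy model of the tx_prep contract: validate packets (here, just their
 * offload flags) in order and stop at the first one the "hardware" cannot
 * handle; the return value is the count of valid leading packets. */
static uint16_t
tx_prep_sim(const uint64_t *ol_flags, uint16_t nb_pkts, uint64_t hw_caps)
{
	uint16_t i;

	for (i = 0; i < nb_pkts; i++) {
		if (ol_flags[i] & ~hw_caps) {
			sim_errno = ENOTSUP; /* offload unsupported by HW */
			return i;
		}
	}
	return nb_pkts;
}
```

A caller would then either apply a software fallback to the failing packet or drop it, and transmit the leading valid ones.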
> > +
> > +	return (*dev->tx_pkt_prep)(dev->data->tx_queues[queue_id],
> > +			tx_pkts, nb_pkts);
> > +}
> > +
> > +#else
> > +
> > +static inline uint16_t
> > +rte_eth_tx_prep(__rte_unused uint8_t port_id, __rte_unused uint16_t queue_id,
> > +		__rte_unused struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
> > +{
> > +	return nb_pkts;
> > +}
> > +
> > +#endif
> > +
>
> nit: I wonder if the #else part should be inside the function instead
> (with the proper RTE_SET_USED()), it would avoid to define the prototype
> twice.
>
> >  typedef void (*buffer_tx_error_fn)(struct rte_mbuf **unsent, uint16_t count,
> >  		void *userdata);
> >
> > diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> > index 109e666..cfd6284 100644
> > --- a/lib/librte_mbuf/rte_mbuf.h
> > +++ b/lib/librte_mbuf/rte_mbuf.h
> > @@ -276,6 +276,15 @@ extern "C" {
> >   */
> >  #define PKT_TX_OUTER_IPV4   (1ULL << 59)
> >
> > +#define PKT_TX_OFFLOAD_MASK (    \
> > +		PKT_TX_IP_CKSUM |        \
> > +		PKT_TX_L4_MASK |         \
> > +		PKT_TX_OUTER_IP_CKSUM |  \
> > +		PKT_TX_TCP_SEG |         \
> > +		PKT_TX_QINQ_PKT |        \
> > +		PKT_TX_VLAN_PKT |        \
> > +		PKT_TX_TUNNEL_MASK)
> > +
>
> Could you add an API comment?
>
> > --- /dev/null
> > +++ b/lib/librte_net/rte_pkt.h
> >
> > [...]
> > +/**
> > + * Validate general requirements for tx offload in packet.
> > + */
>
> The API comment does not have the usual format.
>
> > +static inline int
> > +rte_validate_tx_offload(struct rte_mbuf *m)
>
> should be const struct rte_mbuf *m
>
> > +{
> > +	uint64_t ol_flags = m->ol_flags;
> > +
> > +	/* Does packet set any of available offloads?
 */
> > +	if (!(ol_flags & PKT_TX_OFFLOAD_MASK))
> > +		return 0;
> > +
> > +	/* IP checksum can be counted only for IPv4 packet */
> > +	if ((ol_flags & PKT_TX_IP_CKSUM) && (ol_flags & PKT_TX_IPV6))
> > +		return -EINVAL;
> > +
> > +	if (ol_flags & (PKT_TX_L4_MASK | PKT_TX_TCP_SEG))
> > +		/* IP type not set */
> > +		if (!(ol_flags & (PKT_TX_IPV4 | PKT_TX_IPV6)))
> > +			return -EINVAL;
> > +
> > +	if (ol_flags & PKT_TX_TCP_SEG)
> > +		/* PKT_TX_IP_CKSUM offload not set for IPv4 TSO packet */
> > +		if ((m->tso_segsz == 0) ||
> > +				((ol_flags & PKT_TX_IPV4) && !(ol_flags & PKT_TX_IP_CKSUM)))
> > +			return -EINVAL;
> > +
> > +	/* PKT_TX_OUTER_IP_CKSUM set for non outer IPv4 packet. */
> > +	if ((ol_flags & PKT_TX_OUTER_IP_CKSUM) && !(ol_flags & PKT_TX_OUTER_IPV4))
> > +		return -EINVAL;
> > +
> > +	return 0;
> > +}
>
> It looks this function is only used when RTE_LIBRTE_ETHDEV_DEBUG is set.
>

Yes, for performance reasons we expect that the application sets these values right. These are most of the "additional checks" mentioned before.

> I'd say this function should go in rte_mbuf.h, because it's purely related
> to mbuf flags, and does not rely on packet data or network headers.
>
>
> > +
> > +/**
> > + * Fix pseudo header checksum for TSO and non-TSO tcp/udp packets before
> > + * hardware tx checksum.
> > + * For non-TSO tcp/udp packets full pseudo-header checksum is counted and set.
> > + * For TSO the IP payload length is not included.
> > + */
>
> The API comment should be fixed.

Ok.
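For concreteness, the flag rules that the validation above enforces can be sketched as a self-contained toy; the bit positions are illustrative, not DPDK's real PKT_TX_* values:

```c
#include <stdint.h>
#include <errno.h>

#define F_IP_CKSUM (1ULL << 0) /* illustrative bits, not the real PKT_TX_* */
#define F_IPV4     (1ULL << 1)
#define F_IPV6     (1ULL << 2)
#define F_TCP_SEG  (1ULL << 3)

/* Toy model of the checks done by rte_validate_tx_offload(). */
static int
validate_tx_offload_sim(uint64_t ol_flags, uint16_t tso_segsz)
{
	/* IP checksum can be offloaded only for IPv4. */
	if ((ol_flags & F_IP_CKSUM) && (ol_flags & F_IPV6))
		return -EINVAL;

	/* TSO needs a segment size and, for IPv4, IP checksum offload. */
	if (ol_flags & F_TCP_SEG) {
		if (tso_segsz == 0)
			return -EINVAL;
		if ((ol_flags & F_IPV4) && !(ol_flags & F_IP_CKSUM))
			return -EINVAL;
	}
	return 0;
}
```

These are pure flag checks, which supports the suggestion that the real function could live in rte_mbuf.h.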
>
> > +static inline int
> > +rte_phdr_cksum_fix(struct rte_mbuf *m)
> > +{
> > +	struct ipv4_hdr *ipv4_hdr;
> > +	struct ipv6_hdr *ipv6_hdr;
> > +	struct tcp_hdr *tcp_hdr;
> > +	struct udp_hdr *udp_hdr;
> > +	uint64_t ol_flags = m->ol_flags;
> > +	uint64_t inner_l3_offset = m->l2_len;
> > +
> > +	if (ol_flags & PKT_TX_OUTER_IP_CKSUM)
> > +		inner_l3_offset += m->outer_l2_len + m->outer_l3_len;
> > +
> > +	if ((ol_flags & PKT_TX_UDP_CKSUM) == PKT_TX_UDP_CKSUM) {
> > +		if (ol_flags & PKT_TX_IPV4) {
> > +			ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
> > +					inner_l3_offset);
> > +
> > +			if (ol_flags & PKT_TX_IP_CKSUM)
> > +				ipv4_hdr->hdr_checksum = 0;
> > +
> > +			udp_hdr = (struct udp_hdr *)((char *)ipv4_hdr + m->l3_len);
> > +			udp_hdr->dgram_cksum = rte_ipv4_phdr_cksum(ipv4_hdr, ol_flags);
> > +		} else {
> > +			ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct ipv6_hdr *,
> > +					inner_l3_offset);
> > +			/* non-TSO udp */
> > +			udp_hdr = rte_pktmbuf_mtod_offset(m, struct udp_hdr *,
> > +					inner_l3_offset + m->l3_len);
> > +			udp_hdr->dgram_cksum = rte_ipv6_phdr_cksum(ipv6_hdr, ol_flags);
> > +		}
> > +	} else if ((ol_flags & PKT_TX_TCP_CKSUM) ||
> > +			(ol_flags & PKT_TX_TCP_SEG)) {
> > +		if (ol_flags & PKT_TX_IPV4) {
> > +			ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
> > +					inner_l3_offset);
> > +
> > +			if (ol_flags & PKT_TX_IP_CKSUM)
> > +				ipv4_hdr->hdr_checksum = 0;
> > +
> > +			/* non-TSO tcp or TSO */
> > +			tcp_hdr = (struct tcp_hdr *)((char *)ipv4_hdr + m->l3_len);
> > +			tcp_hdr->cksum = rte_ipv4_phdr_cksum(ipv4_hdr, ol_flags);
> > +		} else {
> > +			ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct ipv6_hdr *,
> > +					inner_l3_offset);
> > +			/* non-TSO tcp or TSO */
> > +			tcp_hdr = rte_pktmbuf_mtod_offset(m, struct tcp_hdr *,
> > +					inner_l3_offset + m->l3_len);
> > +			tcp_hdr->cksum = rte_ipv6_phdr_cksum(ipv6_hdr, ol_flags);
> > +		}
> > +	}
> > +	return 0;
> > +}
> > +
> > +#endif /* _RTE_PKT_H_ */
> >
>
> The
> function expects that all the network headers are in the first mbuf segment,
> and that each of them is contiguous.
>

Yes, I see...

> Also, I had an interesting remark from Stephen [1] on a similar code.
> If the mbuf is a clone, it will modify the data of the direct mbuf, which
> should be read-only. Note that it is likely to happen in a TCP stack,
> because the packet is kept locally in case it has to be retransmitted.
> Cloning a mbuf is more efficient than duplicating it.
>
> I plan to fix it in virtio code by "uncloning" the headers.
>
> [1] http://dpdk.org/ml/archives/dev/2016-October/048873.html
>
>
>
> Regards,
> Olivier

Thanks,
Tomasz