From: "Ananyev, Konstantin"
To: Andrew Rybchenko, "Yang, Zhiyong", "dev@dpdk.org"
Cc: "thomas.monjalon@6wind.com", "Richardson, Bruce"
Date: Fri, 20 Jan 2017 11:24:40 +0000
Subject: Re: [dpdk-dev] [RFC] lib/librte_ether: consistent PMD batching behavior

>
> From: Andrew Rybchenko [mailto:arybchenko@solarflare.com]
> Sent: Friday, January 20, 2017 10:26 AM
> To: Yang, Zhiyong; dev@dpdk.org
> Cc: thomas.monjalon@6wind.com; Richardson, Bruce; Ananyev, Konstantin
> Subject: Re: [dpdk-dev] [RFC] lib/librte_ether: consistent PMD batching behavior
>
> On 01/20/2017 12:51 PM, Zhiyong Yang wrote:
> The rte_eth_tx_burst() function in the file rte_ethdev.h is invoked to
> transmit output packets on the output queue of a DPDK application, as
> follows:
>
> static inline uint16_t
> rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
>                  struct rte_mbuf **tx_pkts, uint16_t nb_pkts);
>
> Note: the fourth parameter, nb_pkts, is the number of packets to transmit.
> The rte_eth_tx_burst() function returns the number of packets it actually
> sent. A return value equal to *nb_pkts* means that all packets have been
> sent, which likely signifies that further output packets could be
> transmitted immediately. Applications that implement a "send as many
> packets as possible" policy can check for this specific case and keep
> invoking the rte_eth_tx_burst() function until a value less than
> *nb_pkts* is returned.
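>
> For illustration only, such a "send until done" policy at the application
> level is usually a small loop like the following sketch (whether to retry,
> buffer or drop the leftover packets when the queue fills up is an
> application decision, not something the API mandates):
>
> uint16_t sent = 0;
>
> while (sent < nb_pkts) {
>         uint16_t n = rte_eth_tx_burst(port_id, queue_id,
>                                       &tx_pkts[sent], nb_pkts - sent);
>         if (n == 0)     /* TX queue full: retry, buffer or drop the rest */
>                 break;
>         sent += n;
> }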
> When you call TX only once via rte_eth_tx_burst(), you may get different
> behaviors from different PMDs. One problem every DPDK user has to face is
> that this policy must be considered at the application level for each
> specific PMD, whether or not it is actually necessary there. This adds
> usage complexity and easily confuses DPDK users, who have to learn the TX
> burst limits of each specific PMD and interpret the return value (the
> number of packets transmitted successfully) differently for each of them.
> Some PMD TX functions can send at most 32 packets per invocation, some at
> most 64, while others send as many packets as requested, and so on. This
> easily leads to incorrect usage.
>
> This patch proposes to implement the above policy in the DPDK library in
> order to simplify application code and avoid incorrect invocations. DPDK
> users then no longer need to consider the implementation policy or write
> the same duplicated code at the application level when sending packets.
> They also do not need to know the TX differences between specific PMDs:
> they can pass an arbitrary number of packets to rte_eth_tx_burst() and
> check the return value for the number of packets actually sent.
>
> How can the policy be implemented in the DPDK library? Two solutions are
> proposed below.
>
> Solution 1:
> Implement wrapper functions that remove such limits for each specific
> PMD, as i40e_xmit_pkts_simple() and ixgbe_xmit_pkts_simple() already do.
>
> > IMHO, this solution is a bit better since it:
> > 1. Does not affect other PMDs at all
> > 2. Could be a bit faster for the PMDs which require it, since there is
> >    no indirect function call on each iteration
> > 3. Requires no ABI change

I also would prefer solution number 1, for the reasons outlined by Andrew
above.
Also, IMO the current limits on the number of packets to TX in some Intel
PMD TX routines are somewhat artificial:
- they are not caused by any real HW limitations
- avoiding them at the PMD level shouldn't cause any performance or
  functional degradation.
So I don't see any good reason why, instead of fixing these limitations in
our own PMDs, we are trying to push them to the upper (rte_ethdev) layer.

Konstantin

>
> Solution 2:
> Implement the policy in the rte_eth_tx_burst() function at the ethdev
> layer, in a more consistent batching way: make a best effort to send
> *nb_pkts* packets in bursts of no more than 32 by default, since many DPDK
> TX PMDs use this max TX burst size (32). In addition, a data member
> defining the max TX burst size, such as "uint16_t max_tx_burst_pkts;",
> will be added to rte_eth_dev_data, which drivers can override if they work
> with bursts of 64 or some other size (thanks to Bruce for the suggestion).
> This keeps the performance impact as low as possible.
>
> > I see no noticeable difference in performance, so I don't mind if this
> > is finally chosen. Just be sure that you update all PMDs to set
> > reasonable default values, or, maybe even better, set UINT16_MAX in a
> > generic place; 0 is a bad default here. (I lost a few seconds wondering
> > why nothing was sent and could not stop it.)
>
> I prefer the latter of the two solutions because it makes the DPDK code
> more consistent and simpler, and avoids writing much duplicate logic in
> the DPDK source code. In addition, I think solution 2 brings no, or only
> a small, performance drop. It does, however, introduce an ABI change.
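>
> As a rough sketch only (not the actual patch), the solution 2 loop in
> rte_ethdev.h could look something like this, assuming the proposed
> max_tx_burst_pkts field has been added to rte_eth_dev_data:
>
> static inline uint16_t
> rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
>                  struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
> {
>         struct rte_eth_dev *dev = &rte_eth_devices[port_id];
>         uint16_t nb_tx = 0;
>
>         while (nb_pkts) {
>                 /* never ask the driver for more than its burst limit */
>                 uint16_t n = RTE_MIN(nb_pkts,
>                                      dev->data->max_tx_burst_pkts);
>                 uint16_t ret = (*dev->tx_pkt_burst)(
>                         dev->data->tx_queues[queue_id],
>                         &tx_pkts[nb_tx], n);
>
>                 nb_tx += ret;
>                 nb_pkts -= ret;
>                 if (ret < n)    /* the queue is full: stop early */
>                         break;
>         }
>         return nb_tx;
> }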
>
> In fact, the current rte_eth_rx_burst() function uses a similar mechanism
> and faces the same problem as rte_eth_tx_burst():
>
> static inline uint16_t
> rte_eth_rx_burst(uint8_t port_id, uint16_t queue_id,
>                  struct rte_mbuf **rx_pkts, const uint16_t nb_pkts);
>
> Applications are responsible for implementing the "retrieve as many
> received packets as possible" policy: they check for this specific case
> and keep invoking the rte_eth_rx_burst() function until a value less than
> *nb_pkts* is returned (see the RX sketch at the end of this mail).
>
> The patch proposes to apply the above method to rte_eth_rx_burst() as
> well.
>
> In summary, the purpose of this RFC is to make the job easier and simpler
> for driver writers and to avoid writing much duplicate code at the
> application level.
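>
> For completeness, the RX-side policy mentioned above is typically a loop
> of the following shape; BURST_SZ and the rx_pkts array are hypothetical
> application-level names used only for this sketch:
>
> uint16_t nb_rx;
>
> do {
>         nb_rx = rte_eth_rx_burst(port_id, queue_id, rx_pkts, BURST_SZ);
>         /* ... process the nb_rx retrieved packets here ... */
> } while (nb_rx == BURST_SZ);    /* a full burst: more may be waiting */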