From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <cristian.dumitrescu@intel.com>
Received: from mga09.intel.com (mga09.intel.com [134.134.136.24])
 by dpdk.org (Postfix) with ESMTP id 81BF15913
 for <dev@dpdk.org>; Wed,  2 Dec 2015 17:48:21 +0100 (CET)
Received: from orsmga003.jf.intel.com ([10.7.209.27])
 by orsmga102.jf.intel.com with ESMTP; 02 Dec 2015 08:48:19 -0800
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.20,373,1444719600"; d="scan'208";a="698867302"
Received: from irsmsx151.ger.corp.intel.com ([163.33.192.59])
 by orsmga003.jf.intel.com with ESMTP; 02 Dec 2015 08:48:18 -0800
Received: from irsmsx108.ger.corp.intel.com ([169.254.11.23]) by
 IRSMSX151.ger.corp.intel.com ([169.254.4.95]) with mapi id 14.03.0248.002;
 Wed, 2 Dec 2015 16:48:17 +0000
From: "Dumitrescu, Cristian" <cristian.dumitrescu@intel.com>
To: Stephen Hemminger <stephen@networkplumber.org>
Thread-Topic: [PATCH 3/3] rte_sched: eliminate floating point in calculating
 byte clock
Thread-Index: AQHRKtZSchh4r7nFGEq5a2WmoKzUP5636/HQ
Date: Wed, 2 Dec 2015 16:48:17 +0000
Message-ID: <3EB4FA525960D640B5BDFFD6A3D8912647925BD2@IRSMSX108.ger.corp.intel.com>
References: <1448822809-8350-1-git-send-email-stephen@networkplumber.org>
 <1448822809-8350-4-git-send-email-stephen@networkplumber.org>
In-Reply-To: <1448822809-8350-4-git-send-email-stephen@networkplumber.org>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-inteldataclassification: CTP_IC
x-titus-metadata-40: eyJDYXRlZ29yeUxhYmVscyI6IiIsIk1ldGFkYXRhIjp7Im5zIjoiaHR0cDpcL1wvd3d3LnRpdHVzLmNvbVwvbnNcL0ludGVsIiwiaWQiOiI0MjYxNjgxNS0xOTBjLTQ2NzgtOTU0Yy05YjIyMzYzMmFlNDMiLCJwcm9wcyI6W3sibiI6IkludGVsRGF0YUNsYXNzaWZpY2F0aW9uIiwidmFscyI6W3sidmFsdWUiOiJDVFBfSUMifV19XX0sIlN1YmplY3RMYWJlbHMiOltdLCJUTUNWZXJzaW9uIjoiMTUuNC4xMC4xOSIsIlRydXN0ZWRMYWJlbEhhc2giOiJmQXFXYkNsMDlubERHb2FaZSszRklCSEsxRGw1b2QwWEtmcmxkTjdScWlZPSJ9
x-originating-ip: [163.33.239.181]
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Cc: "dev@dpdk.org" <dev@dpdk.org>
Subject: Re: [dpdk-dev] [PATCH 3/3] rte_sched: eliminate floating point in
 calculating byte clock
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: patches and discussions about DPDK <dev.dpdk.org>
List-Unsubscribe: <http://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <http://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Wed, 02 Dec 2015 16:48:22 -0000



> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Sunday, November 29, 2015 8:47 PM
> To: Dumitrescu, Cristian <cristian.dumitrescu@intel.com>
> Cc: dev@dpdk.org; Stephen Hemminger <stephen@networkplumber.org>
> Subject: [PATCH 3/3] rte_sched: eliminate floating point in calculating b=
yte
> clock
>=20
> The old code was doing a floating point divide for each rte_dequeue()
> which is very expensive. Change to using fixed point scaled inverse
> multiply. To maintain equivalent precision, scaled math is used.
> The application ABI is the same.
>=20
> This improved performance from 5Gbit/sec to 10 Gbit/sec when configured
> for 10 Gbit/sec rate.
>=20
> There was some feedback from Cristian that he wanted a better
> solution and was going to give one, but none was provided.
> For 2.2 this is a better solution than existing code, if someone
> has a better version I would love to see it.
>=20
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> ---
>  lib/librte_sched/rte_sched.c | 23 ++++++++++++++++++-----
>  1 file changed, 18 insertions(+), 5 deletions(-)
>=20
> diff --git a/lib/librte_sched/rte_sched.c b/lib/librte_sched/rte_sched.c
> index 16acd6b..cfae136 100644
> --- a/lib/librte_sched/rte_sched.c
> +++ b/lib/librte_sched/rte_sched.c
> @@ -47,6 +47,7 @@
>  #include "rte_bitmap.h"
>  #include "rte_sched_common.h"
>  #include "rte_approx.h"
> +#include "rte_reciprocal.h"
>=20
>  #ifdef __INTEL_COMPILER
>  #pragma warning(disable:2259) /* conversion may lose significant bits */
> @@ -62,6 +63,11 @@
>  #define RTE_SCHED_PIPE_INVALID                UINT32_MAX
>  #define RTE_SCHED_BMP_POS_INVALID             UINT32_MAX
>=20
> +/* Scaling for cycles_per_byte calculation
> + * Chosen so that minimum rate is 480 bit/sec
> + */
> +#define RTE_SCHED_TIME_SHIFT		      8

Stephen, can you please elaborate why we need to shift the dividend at all =
and why the shift value was picked as 8? Is 8 a hard constraint? How does t=
his affect the scheduling precision/accuracy?

> +
>  struct rte_sched_subport {
>  	/* Token bucket (TB) */
>  	uint64_t tb_time; /* time of last update */
> @@ -215,7 +221,7 @@ struct rte_sched_port {
>  	uint64_t time_cpu_cycles;     /* Current CPU time measured in CPU
> cyles */
>  	uint64_t time_cpu_bytes;      /* Current CPU time measured in bytes
> */
>  	uint64_t time;                /* Current NIC TX time measured in bytes =
*/
> -	double cycles_per_byte;       /* CPU cycles per byte */
> +	struct rte_reciprocal inv_cycles_per_byte; /* CPU cycles per byte */
>=20
>  	/* Scheduling loop detection */
>  	uint32_t pipe_loop;
> @@ -610,7 +616,7 @@ struct rte_sched_port *
>  rte_sched_port_config(struct rte_sched_port_params *params)
>  {
>  	struct rte_sched_port *port =3D NULL;
> -	uint32_t mem_size, bmp_mem_size, n_queues_per_port, i;
> +	uint32_t mem_size, bmp_mem_size, n_queues_per_port, i,
> cycles_per_byte;
>=20
>  	/* Check user parameters. Determine the amount of memory to
> allocate */
>  	mem_size =3D rte_sched_port_get_memory_footprint(params);
> @@ -661,7 +667,10 @@ rte_sched_port_config(struct
> rte_sched_port_params *params)
>  	port->time_cpu_cycles =3D rte_get_tsc_cycles();
>  	port->time_cpu_bytes =3D 0;
>  	port->time =3D 0;
> -	port->cycles_per_byte =3D ((double) rte_get_tsc_hz()) / ((double)
> params->rate);
> +
> +	cycles_per_byte =3D (rte_get_tsc_hz() << RTE_SCHED_TIME_SHIFT)
> +		/ params->rate;
> +	port->inv_cycles_per_byte =3D rte_reciprocal_value(cycles_per_byte);
>=20
>  	/* Scheduling loop detection */
>  	port->pipe_loop =3D RTE_SCHED_PIPE_INVALID;
> @@ -2088,11 +2097,15 @@ rte_sched_port_time_resync(struct
> rte_sched_port *port)
>  {
>  	uint64_t cycles =3D rte_get_tsc_cycles();
>  	uint64_t cycles_diff =3D cycles - port->time_cpu_cycles;
> -	double bytes_diff =3D ((double) cycles_diff) / port->cycles_per_byte;
> +	uint64_t bytes_diff;
> +
> +	/* Compute elapsed time in bytes */
> +	bytes_diff =3D rte_reciprocal_divide(cycles_diff <<
> RTE_SCHED_TIME_SHIFT,
> +					   port->inv_cycles_per_byte);
>=20
>  	/* Advance port time */
>  	port->time_cpu_cycles =3D cycles;
> -	port->time_cpu_bytes +=3D (uint64_t) bytes_diff;
> +	port->time_cpu_bytes +=3D bytes_diff;
>  	if (port->time < port->time_cpu_bytes)
>  		port->time =3D port->time_cpu_bytes;
>=20
> --
> 2.1.4

Can you provide some insight into how you tested this code and the test vec=
tors you used?