From: "Dumitrescu, Cristian"
To: "alangordondewar@gmail.com"
CC: "dev@dpdk.org", Alan Dewar
Date: Tue, 2 Jan 2018 16:13:34 +0000
Message-ID: <3EB4FA525960D640B5BDFFD6A3D891267BAF7E9D@IRSMSX108.ger.corp.intel.com>
References: <1512032696-30765-1-git-send-email-alan.dewar@att.com>
In-Reply-To: <1512032696-30765-1-git-send-email-alan.dewar@att.com>
Subject: Re: [dpdk-dev] [PATCH v2] sched: fix overflow errors in WRR weighting code

Hi Alan,

NAK for now.

There is a good reason for truncating the WRR cost to an 8-bit value, which is keeping the size of the rte_sched_pipe structure to a single cache line (64 bytes). This is done for performance reasons, at the expense of some accuracy loss for the scheduling of the 4x queues per traffic class.

Is there a way to make the improvement while working with 8-bit WRR cost values in the pipe structure?

> -----Original Message-----
> From: alangordondewar@gmail.com [mailto:alangordondewar@gmail.com]
> Sent: Thursday, November 30, 2017 9:05 AM
> To: Dumitrescu, Cristian
> Cc: dev@dpdk.org; Alan Dewar
> Subject: [PATCH v2] sched: fix overflow errors in WRR weighting code
>
> From: Alan Dewar
>
> Revised patch - this version fixes an issue when a small wrr_cost is
> shifted so far right that its value becomes zero.
>
> The WRR code calculates the lowest common denominator between the four
> WRR weights as a uint32_t value, divides the LCD by each of the WRR
> weights and casts the results to a uint8_t. This casting can cause
> the ratios of the computed wrr costs to be wrong. For example, with
> WRR weights of 3, 5, 7 and 11, the LCD is computed to be
> 1155. The WRR costs get computed as:

Picking prime numbers for the weights is generally a bad idea. If you picked e.g. 4, 6, 8 and 12 rather than 3, 5, 7 and 11, you would avoid any issues due to the 8-bit truncation.
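To make the truncation concrete, here is a small standalone sketch (illustration only, not librte_sched code; the gcd()/lcm() helpers below simply stand in for rte_get_lcd()):

#include <stdint.h>
#include <stdio.h>

static uint32_t gcd(uint32_t a, uint32_t b)
{
	while (b != 0) {
		uint32_t t = b;

		b = a % b;
		a = t;
	}
	return a;
}

static uint32_t lcm(uint32_t a, uint32_t b)
{
	return a / gcd(a, b) * b;
}

static void show_costs(const uint32_t w[4])
{
	uint32_t lcd = lcm(lcm(w[0], w[1]), lcm(w[2], w[3]));
	uint32_t i;

	printf("weights %u/%u/%u/%u, lcd %u:", w[0], w[1], w[2], w[3], lcd);
	for (i = 0; i < 4; i++) {
		uint32_t cost = lcd / w[i];

		/* The uint8_t cast is what distorts the cost ratios. */
		printf("  %u -> %u", cost, (unsigned)(uint8_t)cost);
	}
	printf("\n");
}

int main(void)
{
	const uint32_t primes[4] = {3, 5, 7, 11};
	const uint32_t composite[4] = {4, 6, 8, 12};

	show_costs(primes);
	show_costs(composite);
	return 0;
}

With 3/5/7/11 the costs come out as 385/231/165/105 and the cast wraps 385 to 129, which is exactly the distortion described below; with 4/6/8/12 the LCD is 24, the costs are 6/4/3/2, and they survive the cast unchanged.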
>
> 1155/3 = 385, 1155/5 = 231, 1155/7 = 165, 1155/11 = 105.
>
> When the value 385 is cast into a uint8_t it ends up as 129.
> Rather than casting straight into a uint8_t, this patch shifts the
> computed WRR costs right so that the largest value is only eight bits
> wide.
>
> In grinder_schedule, the packet length is multiplied by the WRR cost
> and added to the grinder's wrr_tokens value. The grinder's wrr_tokens
> field is a uint16_t, so the combination of a packet length of 1500 bytes
> and a wrr cost of 44 will overflow this field on the first packet.
>
> This patch increases the width of the grinder's wrr_tokens and
> wrr_mask fields from uint16_t to uint32_t.
>

Increasing the size of the grinder fields is OK, but I am not sure whether it is really helpful, as the values saved in the pipe structure are 8-bit.

> In grinder_wrr_store, the remaining tokens in the grinder's wrr_tokens
> array are copied to the appropriate pipe's wrr_tokens array. However,
> the pipe's wrr_tokens array is only a uint8_t array, so unused tokens
> were quite frequently lost, which upsets the balance of traffic across
> the four WRR queues.
>
> This patch increases the width of the pipe's wrr_tokens array from
> a uint8_t to a uint32_t.
>

This is not allowed for performance reasons, as having 16-bit or 32-bit WRR cost values in the pipe structure would increase the size of the pipe structure from one cache line to two cache lines.
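To put numbers on the cache line argument (standalone illustration only, assuming the usual 16 queues per pipe, i.e. 4 traffic classes x 4 WRR queues; this is not the real rte_sched_pipe layout):

#include <stdint.h>
#include <stdio.h>

#define QUEUES_PER_PIPE 16	/* 4 traffic classes x 4 WRR queues */

int main(void)
{
	/* Today: 16 x 1 byte of WRR tokens, leaving room for the rest of
	 * rte_sched_pipe inside a single 64-byte cache line. */
	printf("uint8_t tokens:  %zu bytes\n",
		sizeof(uint8_t[QUEUES_PER_PIPE]));	/* 16 */

	/* With the patch: the token array alone is already 64 bytes, so
	 * the pipe structure necessarily spills into a second cache line. */
	printf("uint32_t tokens: %zu bytes\n",
		sizeof(uint32_t[QUEUES_PER_PIPE]));	/* 64 */
	return 0;
}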
> Signed-off-by: Alan Dewar
> ---
> v2 - fixed bug in the wrr_cost calculation code that could result
>      in a zero wrr_cost
>
>  lib/librte_sched/rte_sched.c        | 59 +++++++++++++++++++++++++++++-------
>  lib/librte_sched/rte_sched_common.h | 15 ++++++++++
>  2 files changed, 61 insertions(+), 13 deletions(-)
>
> diff --git a/lib/librte_sched/rte_sched.c b/lib/librte_sched/rte_sched.c
> index 7252f85..324743d 100644
> --- a/lib/librte_sched/rte_sched.c
> +++ b/lib/librte_sched/rte_sched.c
> @@ -130,7 +130,7 @@ struct rte_sched_pipe {
>  	uint32_t tc_credits[RTE_SCHED_TRAFFIC_CLASSES_PER_PIPE];
>
>  	/* Weighted Round Robin (WRR) */
> -	uint8_t wrr_tokens[RTE_SCHED_QUEUES_PER_PIPE];
> +	uint32_t wrr_tokens[RTE_SCHED_QUEUES_PER_PIPE];
>
>  	/* TC oversubscription */
>  	uint32_t tc_ov_credits;
> @@ -205,8 +205,8 @@ struct rte_sched_grinder {
>  	struct rte_mbuf *pkt;
>
>  	/* WRR */
> -	uint16_t wrr_tokens[RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS];
> -	uint16_t wrr_mask[RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS];
> +	uint32_t wrr_tokens[RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS];
> +	uint32_t wrr_mask[RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS];
>  	uint8_t wrr_cost[RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS];
>  };
>
> @@ -542,6 +542,17 @@ rte_sched_time_ms_to_bytes(uint32_t time_ms, uint32_t rate)
>  	return time;
>  }
>
> +static uint32_t rte_sched_reduce_to_byte(uint32_t value)
> +{
> +	uint32_t shift = 0;
> +
> +	while (value & 0xFFFFFF00) {
> +		value >>= 1;
> +		shift++;
> +	}
> +	return shift;
> +}
> +
>  static void
>  rte_sched_port_config_pipe_profile_table(struct rte_sched_port *port, struct rte_sched_port_params *params)
>  {
> @@ -583,6 +594,8 @@ rte_sched_port_config_pipe_profile_table(struct rte_sched_port *port, struct rte
>  			uint32_t wrr_cost[RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS];
>  			uint32_t lcd, lcd1, lcd2;
>  			uint32_t qindex;
> +			uint32_t low_pos;
> +			uint32_t shift;
>
>  			qindex = j * RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS;
>
> @@ -594,12 +607,28 @@ rte_sched_port_config_pipe_profile_table(struct rte_sched_port *port, struct rte
>  			lcd1 = rte_get_lcd(wrr_cost[0], wrr_cost[1]);
>  			lcd2 = rte_get_lcd(wrr_cost[2], wrr_cost[3]);
>  			lcd = rte_get_lcd(lcd1, lcd2);
> +			low_pos = rte_min_pos_4_u32(wrr_cost);
>
>  			wrr_cost[0] = lcd / wrr_cost[0];
>  			wrr_cost[1] = lcd / wrr_cost[1];
>  			wrr_cost[2] = lcd / wrr_cost[2];
>  			wrr_cost[3] = lcd / wrr_cost[3];
>
> +			shift = rte_sched_reduce_to_byte(wrr_cost[low_pos]);
> +			wrr_cost[0] >>= shift;
> +			wrr_cost[1] >>= shift;
> +			wrr_cost[2] >>= shift;
> +			wrr_cost[3] >>= shift;
> +
> +			if (wrr_cost[0] == 0)
> +				wrr_cost[0]++;
> +			if (wrr_cost[1] == 0)
> +				wrr_cost[1]++;
> +			if (wrr_cost[2] == 0)
> +				wrr_cost[2]++;
> +			if (wrr_cost[3] == 0)
> +				wrr_cost[3]++;
> +
>  			dst->wrr_cost[qindex] = (uint8_t) wrr_cost[0];
>  			dst->wrr_cost[qindex + 1] = (uint8_t) wrr_cost[1];
>  			dst->wrr_cost[qindex + 2] = (uint8_t) wrr_cost[2];
> @@ -1941,15 +1970,19 @@ grinder_wrr_load(struct rte_sched_port *port, uint32_t pos)
>
>  	qindex = tc_index * 4;
>
> -	grinder->wrr_tokens[0] = ((uint16_t) pipe->wrr_tokens[qindex]) << RTE_SCHED_WRR_SHIFT;
> -	grinder->wrr_tokens[1] = ((uint16_t) pipe->wrr_tokens[qindex + 1]) << RTE_SCHED_WRR_SHIFT;
> -	grinder->wrr_tokens[2] = ((uint16_t) pipe->wrr_tokens[qindex + 2]) << RTE_SCHED_WRR_SHIFT;
> -	grinder->wrr_tokens[3] = ((uint16_t) pipe->wrr_tokens[qindex + 3]) << RTE_SCHED_WRR_SHIFT;
> +	grinder->wrr_tokens[0] = pipe->wrr_tokens[qindex] <<
> +		RTE_SCHED_WRR_SHIFT;
> +	grinder->wrr_tokens[1] = pipe->wrr_tokens[qindex + 1] <<
> +		RTE_SCHED_WRR_SHIFT;
> +	grinder->wrr_tokens[2] = pipe->wrr_tokens[qindex + 2] <<
> +		RTE_SCHED_WRR_SHIFT;
> +	grinder->wrr_tokens[3] = pipe->wrr_tokens[qindex + 3] <<
> +		RTE_SCHED_WRR_SHIFT;
>
> -	grinder->wrr_mask[0] = (qmask & 0x1) * 0xFFFF;
> -	grinder->wrr_mask[1] = ((qmask >> 1) & 0x1) * 0xFFFF;
> -	grinder->wrr_mask[2] = ((qmask >> 2) & 0x1) * 0xFFFF;
> -	grinder->wrr_mask[3] = ((qmask >> 3) & 0x1) * 0xFFFF;
> +	grinder->wrr_mask[0] = (qmask & 0x1) * 0xFFFFFFFF;
> +	grinder->wrr_mask[1] = ((qmask >> 1) & 0x1) * 0xFFFFFFFF;
> +	grinder->wrr_mask[2] = ((qmask >> 2) & 0x1) * 0xFFFFFFFF;
> +	grinder->wrr_mask[3] = ((qmask >> 3) & 0x1) * 0xFFFFFFFF;
>
>  	grinder->wrr_cost[0] = pipe_params->wrr_cost[qindex];
>  	grinder->wrr_cost[1] = pipe_params->wrr_cost[qindex + 1];
> @@ -1981,14 +2014,14 @@ static inline void
>  grinder_wrr(struct rte_sched_port *port, uint32_t pos)
>  {
>  	struct rte_sched_grinder *grinder = port->grinder + pos;
> -	uint16_t wrr_tokens_min;
> +	uint32_t wrr_tokens_min;
>
>  	grinder->wrr_tokens[0] |= ~grinder->wrr_mask[0];
>  	grinder->wrr_tokens[1] |= ~grinder->wrr_mask[1];
>  	grinder->wrr_tokens[2] |= ~grinder->wrr_mask[2];
>  	grinder->wrr_tokens[3] |= ~grinder->wrr_mask[3];
>
> -	grinder->qpos = rte_min_pos_4_u16(grinder->wrr_tokens);
> +	grinder->qpos = rte_min_pos_4_u32(grinder->wrr_tokens);
>  	wrr_tokens_min = grinder->wrr_tokens[grinder->qpos];
>
>  	grinder->wrr_tokens[0] -= wrr_tokens_min;
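A note for anyone reading the grinder hunks above: the multiply by 0xFFFFFFFF parks inactive queues at the maximum token count, so the minimum-position scan only ever picks an active queue (as long as at least one queue in the group is active). For illustration, a standalone sketch of that selection step (simplified, not the real grinder code):

#include <stdint.h>
#include <stdio.h>

/* Same selection logic as rte_min_pos_4_u32 in the patch. */
static uint32_t min_pos_4_u32(const uint32_t *x)
{
	uint32_t pos0 = 0;
	uint32_t pos1 = 2;

	if (x[1] <= x[0])
		pos0 = 1;
	if (x[3] <= x[2])
		pos1 = 3;
	if (x[pos1] <= x[pos0])
		pos0 = pos1;

	return pos0;
}

int main(void)
{
	uint32_t tokens[4] = {300, 120, 500, 80};
	uint32_t mask[4];
	uint32_t qmask = 0x5;	/* only queues 0 and 2 are active */
	uint32_t i;

	for (i = 0; i < 4; i++) {
		/* bit * 0xFFFFFFFF: all-ones for active, zero for inactive */
		mask[i] = ((qmask >> i) & 0x1) * 0xFFFFFFFF;
		/* inactive queues saturate to UINT32_MAX and never win */
		tokens[i] |= ~mask[i];
	}

	/* Picks queue 0 (300 tokens), the minimum among the active ones. */
	printf("selected queue %u\n", min_pos_4_u32(tokens));
	return 0;
}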
> diff --git a/lib/librte_sched/rte_sched_common.h b/lib/librte_sched/rte_sched_common.h
> index aed144b..a3c6bc2 100644
> --- a/lib/librte_sched/rte_sched_common.h
> +++ b/lib/librte_sched/rte_sched_common.h
> @@ -77,6 +77,21 @@ rte_min_pos_4_u16(uint16_t *x)
>  	return pos0;
>  }
>
> +static inline uint32_t
> +rte_min_pos_4_u32(uint32_t *x)
> +{
> +	uint32_t pos0 = 0;
> +	uint32_t pos1 = 2;
> +
> +	if (x[1] <= x[0])
> +		pos0 = 1;
> +	if (x[3] <= x[2])
> +		pos1 = 3;
> +	if (x[pos1] <= x[pos0])
> +		pos0 = pos1;
> +
> +	return pos0;
> +}
>  #endif
>
>  /*
> --
> 2.1.4

Regards,
Cristian