From: "Dewar, Alan"
To: "'Dumitrescu, Cristian'", "'alangordondewar@gmail.com'"
Cc: "'dev@dpdk.org'", "'Alan Dewar'"
Date: Wed, 3 Jan 2018 13:45:52 +0000
Subject: Re: [dpdk-dev] [PATCH v2] sched: fix overflow errors in WRR weighting code
Message-ID: <3F9268EEC0E43747A5FFFC6B48EF0321FBE61D@gbcdcmbx03.intl.att.com>
References: <1512032696-30765-1-git-send-email-alan.dewar@att.com> <3EB4FA525960D640B5BDFFD6A3D891267BAF7E9D@IRSMSX108.ger.corp.intel.com>
In-Reply-To: <3EB4FA525960D640B5BDFFD6A3D891267BAF7E9D@IRSMSX108.ger.corp.intel.com>

Hi Cristian,

Responses in-line below...
> -----Original Message-----
> From: Dumitrescu, Cristian [mailto:cristian.dumitrescu@intel.com]
> Sent: Tuesday, January 02, 2018 4:14 PM
> To: alangordondewar@gmail.com
> Cc: dev@dpdk.org; Alan Dewar
> Subject: RE: [PATCH v2] sched: fix overflow errors in WRR weighting code
>
> Hi Alan,
>
> NAK for now.
>
> There is a good reason for truncating the WRR cost to an 8-bit value,
> which is keeping the size of the rte_sched_pipe structure to a single
> cache line (64 bytes). This is done for performance reasons at the
> expense of some accuracy loss for the scheduling of the 4x queues per
> traffic class.
>

Ah, I wasn't aware of this restriction, but I fully understand why you
don't want to increase the size of the rte_sched_pipe structure.

> Is there a way to make the improvement while working with 8-bit WRR
> cost values in the pipe structure?

Off the top of my head, I think that we might be able to improve the
behavior of the WRR queues a little, but not as well as we could with a
32-bit wrr_cost array in the rte_sched_pipe structure.  I'll need to play
around with it and see what happens.

> > -----Original Message-----
> > From: alangordondewar@gmail.com [mailto:alangordondewar@gmail.com]
> > Sent: Thursday, November 30, 2017 9:05 AM
> > To: Dumitrescu, Cristian
> > Cc: dev@dpdk.org; Alan Dewar
> > Subject: [PATCH v2] sched: fix overflow errors in WRR weighting code
> >
> > From: Alan Dewar
> >
> > Revised patch - this version fixes an issue when a small wrr_cost is
> > shifted so far right that its value becomes zero.
> >
> > The WRR code calculates the lowest common denominator between the four
> > WRR weights as a uint32_t value, divides the LCD by each of the WRR
> > weights, and casts the results to uint8_t.  This casting can cause
> > the ratios of the computed wrr costs to be wrong.  For example, with
> > WRR weights of 3, 5, 7 and 11, the LCD is computed to be 1155.  The
> > WRR costs get computed as:

> Picking prime numbers for the weights is generally a bad idea.  If you
> would pick e.g. 4, 6, 8 and 12 rather than 3, 5, 7 and 11 you would
> avoid any issues due to the 8-bit truncation.

I selected 3, 5, 7 and 11 as a pathological case to illustrate part of
the problem; I think that it is unlikely that customers will actually use
these numbers.

However, the DPDK doesn't impose any restrictions on what values
wrr_weights can have: any value between 1 and 255 is allowed.  In the
product where we are using the DPDK as part of a software dataplane, we
currently restrict wrr_weights to values between 1 and 100, but we have
no control over the values that our customers select within that range.

Imposing any further restriction would not be backwards compatible with
previous versions of our software.  That is not acceptable to customers:
when they upgrade the software on their routers, a previously accepted
configuration must not be rejected by the new version.

> >
> > 1155/3 = 385, 1155/5 = 231, 1155/7 = 165, 1155/11 = 105.
> >
> > When the value 385 is cast into an uint8_t it ends up as 129.
> > Rather than casting straight into a uint8_t, this patch shifts the
> > computed WRR costs right so that the largest value is only eight bits
> > wide.
> >
> > In grinder_schedule, the packet length is multiplied by the WRR cost
> > and added to the grinder's wrr_tokens value.  The grinder's wrr_tokens
> > field is a uint16_t, so the combination of a packet length of 1500
> > bytes and a wrr cost of 44 will overflow this field on the first
> > packet.
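To make both failure modes concrete, here is a minimal stand-alone sketch
(not part of the patch; plain C with no DPDK dependencies) showing the
uint8_t cost truncation and the uint16_t token wrap-around:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Failure 1: the computed cost is cast to uint8_t.
	 * LCD(3, 5, 7, 11) = 1155, so the cost for weight 3 is 385. */
	uint32_t cost = 1155 / 3;
	uint8_t cost8 = (uint8_t)cost;
	printf("cost %u truncates to %u\n",
	       (unsigned)cost, (unsigned)cost8);	/* 385 -> 129 */

	/* Failure 2: wrr_tokens is a uint16_t accumulator.
	 * One 1500-byte packet at wrr cost 44 adds 66000 tokens. */
	uint16_t wrr_tokens = 0;
	wrr_tokens += 1500 * 44;
	printf("tokens after one packet: %u\n",
	       (unsigned)wrr_tokens);			/* wraps to 464 */
	return 0;
}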
> >
> > This patch increases the width of the grinder's wrr_tokens and
> > wrr_mask fields from uint16_t to uint32_t.
> >

> Increasing the size of the grinder fields is OK, but I am not sure
> whether it is really helpful, as the values saved in the pipe structure
> are 8-bit.

I will investigate this further, but I don't think that we will see much
benefit.

> > In grinder_wrr_store, the remaining tokens in the grinder's wrr_tokens
> > array are copied to the appropriate pipe's wrr_tokens array.  However,
> > the pipe's wrr_tokens array is only a uint8_t array, so unused tokens
> > were quite frequently lost, which upsets the balance of traffic across
> > the four WRR queues.
> >
> > This patch increases the width of the pipe's wrr_tokens array from
> > uint8_t to uint32_t.

> This is not allowed for performance reasons, as having 16-bit or 32-bit
> WRR cost values in the pipe structure would increase the size of the
> pipe structure from one cache line to two cache lines.

Understood; it might be interesting to try to determine what the
performance hit is.

Regards
Alan
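For reference, a stand-alone sketch of the cache-line constraint Cristian
describes.  The structs below are illustrative stand-ins with made-up
fields, not the real struct rte_sched_pipe definition; they only show how
widening a 16-entry token array from uint8_t to uint32_t pushes a roughly
64-byte structure past one cache line:

#include <stdint.h>

#define QUEUES_PER_PIPE 16	/* 4 traffic classes x 4 WRR queues */

struct pipe_narrow {		/* current layout, 8-bit tokens */
	uint64_t tb_time;
	uint32_t tb_credits;
	uint32_t tc_time;
	uint32_t tc_credits[4];
	uint32_t tc_ov_credits;
	uint8_t tc_ov_period_id;
	uint8_t wrr_tokens[QUEUES_PER_PIPE];
};

struct pipe_wide {		/* proposed layout, 32-bit tokens */
	uint64_t tb_time;
	uint32_t tb_credits;
	uint32_t tc_time;
	uint32_t tc_credits[4];
	uint32_t tc_ov_credits;
	uint8_t tc_ov_period_id;
	uint32_t wrr_tokens[QUEUES_PER_PIPE];
};

/* 56 bytes on a typical LP64 ABI: fits one 64-byte cache line */
_Static_assert(sizeof(struct pipe_narrow) <= 64, "one cache line");
/* 104 bytes: spills into a second cache line */
_Static_assert(sizeof(struct pipe_wide) > 64, "two cache lines");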
> >
> > Signed-off-by: Alan Dewar
> > ---
> > v2 - fixed bug in the wrr_cost calculation code that could result in a
> > zero wrr_cost
> >
> >  lib/librte_sched/rte_sched.c        | 59 +++++++++++++++++++++++++++++--------
> >  lib/librte_sched/rte_sched_common.h | 15 ++++++++++
> >  2 files changed, 61 insertions(+), 13 deletions(-)
> >
> > diff --git a/lib/librte_sched/rte_sched.c b/lib/librte_sched/rte_sched.c
> > index 7252f85..324743d 100644
> > --- a/lib/librte_sched/rte_sched.c
> > +++ b/lib/librte_sched/rte_sched.c
> > @@ -130,7 +130,7 @@ struct rte_sched_pipe {
> >  	uint32_t tc_credits[RTE_SCHED_TRAFFIC_CLASSES_PER_PIPE];
> >
> >  	/* Weighted Round Robin (WRR) */
> > -	uint8_t wrr_tokens[RTE_SCHED_QUEUES_PER_PIPE];
> > +	uint32_t wrr_tokens[RTE_SCHED_QUEUES_PER_PIPE];
> >
> >  	/* TC oversubscription */
> >  	uint32_t tc_ov_credits;
> > @@ -205,8 +205,8 @@ struct rte_sched_grinder {
> >  	struct rte_mbuf *pkt;
> >
> >  	/* WRR */
> > -	uint16_t wrr_tokens[RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS];
> > -	uint16_t wrr_mask[RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS];
> > +	uint32_t wrr_tokens[RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS];
> > +	uint32_t wrr_mask[RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS];
> >  	uint8_t wrr_cost[RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS];
> >  };
> >
> > @@ -542,6 +542,17 @@ rte_sched_time_ms_to_bytes(uint32_t time_ms, uint32_t rate)
> >  	return time;
> >  }
> >
> > +static uint32_t rte_sched_reduce_to_byte(uint32_t value)
> > +{
> > +	uint32_t shift = 0;
> > +
> > +	while (value & 0xFFFFFF00) {
> > +		value >>= 1;
> > +		shift++;
> > +	}
> > +	return shift;
> > +}
> > +
> >  static void
> >  rte_sched_port_config_pipe_profile_table(struct rte_sched_port *port,
> >  	struct rte_sched_port_params *params)
> >  {
> > @@ -583,6 +594,8 @@ rte_sched_port_config_pipe_profile_table(struct rte_sched_port *port, struct rte
> >  			uint32_t wrr_cost[RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS];
> >  			uint32_t lcd, lcd1, lcd2;
> >  			uint32_t qindex;
> > +			uint32_t low_pos;
> > +			uint32_t shift;
> >
> >  			qindex = j * RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS;
> >
> > @@ -594,12 +607,28 @@ rte_sched_port_config_pipe_profile_table(struct rte_sched_port *port, struct rte
> >  			lcd1 = rte_get_lcd(wrr_cost[0], wrr_cost[1]);
> >  			lcd2 = rte_get_lcd(wrr_cost[2], wrr_cost[3]);
> >  			lcd = rte_get_lcd(lcd1, lcd2);
> > +			low_pos = rte_min_pos_4_u32(wrr_cost);
> >
> >  			wrr_cost[0] = lcd / wrr_cost[0];
> >  			wrr_cost[1] = lcd / wrr_cost[1];
> >  			wrr_cost[2] = lcd / wrr_cost[2];
> >  			wrr_cost[3] = lcd / wrr_cost[3];
> >
> > +			shift = rte_sched_reduce_to_byte(wrr_cost[low_pos]);
> > +			wrr_cost[0] >>= shift;
> > +			wrr_cost[1] >>= shift;
> > +			wrr_cost[2] >>= shift;
> > +			wrr_cost[3] >>= shift;
> > +
> > +			if (wrr_cost[0] == 0)
> > +				wrr_cost[0]++;
> > +			if (wrr_cost[1] == 0)
> > +				wrr_cost[1]++;
> > +			if (wrr_cost[2] == 0)
> > +				wrr_cost[2]++;
> > +			if (wrr_cost[3] == 0)
> > +				wrr_cost[3]++;
> > +
> >  			dst->wrr_cost[qindex] = (uint8_t) wrr_cost[0];
> >  			dst->wrr_cost[qindex + 1] = (uint8_t) wrr_cost[1];
> >  			dst->wrr_cost[qindex + 2] = (uint8_t) wrr_cost[2];
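Worked through by hand for the 3, 5, 7 and 11 example from the commit
message, the new logic behaves as follows.  This is a stand-alone sketch
that copies the reduce-to-byte loop from the hunk above, not a verbatim
extract from the patch:

#include <stdint.h>
#include <stdio.h>

/* Copy of the patch's rte_sched_reduce_to_byte(): count the right
 * shifts needed before the value fits in eight bits. */
static uint32_t reduce_to_byte(uint32_t value)
{
	uint32_t shift = 0;

	while (value & 0xFFFFFF00) {
		value >>= 1;
		shift++;
	}
	return shift;
}

int main(void)
{
	/* LCD(3, 5, 7, 11) = 1155 */
	uint32_t wrr_cost[4] = { 1155 / 3, 1155 / 5, 1155 / 7, 1155 / 11 };
	/* weight 3 is the smallest, so index 0 holds the largest cost
	 * (385) and determines the shift, as low_pos does in the patch */
	uint32_t shift = reduce_to_byte(wrr_cost[0]);
	unsigned i;

	for (i = 0; i < 4; i++) {
		wrr_cost[i] >>= shift;
		if (wrr_cost[i] == 0)	/* the zero-cost guard added in v2 */
			wrr_cost[i] = 1;
		printf("wrr_cost[%u] = %u\n", i, (unsigned)wrr_cost[i]);
	}
	/* Prints 192, 115, 82 and 52: every value now fits in a uint8_t
	 * and the 385:231:165:105 ratios survive almost intact, instead
	 * of 385 wrapping to 129 as with the plain cast. */
	return 0;
}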
> > @@ -1941,15 +1970,19 @@ grinder_wrr_load(struct rte_sched_port *port, uint32_t pos)
> >
> >  	qindex = tc_index * 4;
> >
> > -	grinder->wrr_tokens[0] = ((uint16_t) pipe->wrr_tokens[qindex]) << RTE_SCHED_WRR_SHIFT;
> > -	grinder->wrr_tokens[1] = ((uint16_t) pipe->wrr_tokens[qindex + 1]) << RTE_SCHED_WRR_SHIFT;
> > -	grinder->wrr_tokens[2] = ((uint16_t) pipe->wrr_tokens[qindex + 2]) << RTE_SCHED_WRR_SHIFT;
> > -	grinder->wrr_tokens[3] = ((uint16_t) pipe->wrr_tokens[qindex + 3]) << RTE_SCHED_WRR_SHIFT;
> > +	grinder->wrr_tokens[0] = pipe->wrr_tokens[qindex] <<
> > +		RTE_SCHED_WRR_SHIFT;
> > +	grinder->wrr_tokens[1] = pipe->wrr_tokens[qindex + 1] <<
> > +		RTE_SCHED_WRR_SHIFT;
> > +	grinder->wrr_tokens[2] = pipe->wrr_tokens[qindex + 2] <<
> > +		RTE_SCHED_WRR_SHIFT;
> > +	grinder->wrr_tokens[3] = pipe->wrr_tokens[qindex + 3] <<
> > +		RTE_SCHED_WRR_SHIFT;
> >
> > -	grinder->wrr_mask[0] = (qmask & 0x1) * 0xFFFF;
> > -	grinder->wrr_mask[1] = ((qmask >> 1) & 0x1) * 0xFFFF;
> > -	grinder->wrr_mask[2] = ((qmask >> 2) & 0x1) * 0xFFFF;
> > -	grinder->wrr_mask[3] = ((qmask >> 3) & 0x1) * 0xFFFF;
> > +	grinder->wrr_mask[0] = (qmask & 0x1) * 0xFFFFFFFF;
> > +	grinder->wrr_mask[1] = ((qmask >> 1) & 0x1) * 0xFFFFFFFF;
> > +	grinder->wrr_mask[2] = ((qmask >> 2) & 0x1) * 0xFFFFFFFF;
> > +	grinder->wrr_mask[3] = ((qmask >> 3) & 0x1) * 0xFFFFFFFF;
> >
> >  	grinder->wrr_cost[0] = pipe_params->wrr_cost[qindex];
> >  	grinder->wrr_cost[1] = pipe_params->wrr_cost[qindex + 1];
> > @@ -1981,14 +2014,14 @@ static inline void
> >  grinder_wrr(struct rte_sched_port *port, uint32_t pos)
> >  {
> >  	struct rte_sched_grinder *grinder = port->grinder + pos;
> > -	uint16_t wrr_tokens_min;
> > +	uint32_t wrr_tokens_min;
> >
> >  	grinder->wrr_tokens[0] |= ~grinder->wrr_mask[0];
> >  	grinder->wrr_tokens[1] |= ~grinder->wrr_mask[1];
> >  	grinder->wrr_tokens[2] |= ~grinder->wrr_mask[2];
> >  	grinder->wrr_tokens[3] |= ~grinder->wrr_mask[3];
> >
> > -	grinder->qpos = rte_min_pos_4_u16(grinder->wrr_tokens);
> > +	grinder->qpos = rte_min_pos_4_u32(grinder->wrr_tokens);
> >  	wrr_tokens_min = grinder->wrr_tokens[grinder->qpos];
> >
> >  	grinder->wrr_tokens[0] -= wrr_tokens_min;
> >
> > diff --git a/lib/librte_sched/rte_sched_common.h b/lib/librte_sched/rte_sched_common.h
> > index aed144b..a3c6bc2 100644
> > --- a/lib/librte_sched/rte_sched_common.h
> > +++ b/lib/librte_sched/rte_sched_common.h
> > @@ -77,6 +77,21 @@ rte_min_pos_4_u16(uint16_t *x)
> >  	return pos0;
> >  }
> >
> > +static inline uint32_t
> > +rte_min_pos_4_u32(uint32_t *x)
> > +{
> > +	uint32_t pos0 = 0;
> > +	uint32_t pos1 = 2;
> > +
> > +	if (x[1] <= x[0])
> > +		pos0 = 1;
> > +	if (x[3] <= x[2])
> > +		pos1 = 3;
> > +	if (x[pos1] <= x[pos0])
> > +		pos0 = pos1;
> > +
> > +	return pos0;
> > +}
> >  #endif
> >
> >  /*
> > --
> > 2.1.4
>
> Regards,
> Cristian
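As a footnote, a quick stand-alone check of the rte_min_pos_4_u32() helper
added by the patch.  The test values are made up; the second case shows
that ties resolve to the higher index:

#include <stdint.h>
#include <stdio.h>

/* Same branch structure as the rte_min_pos_4_u32() quoted above. */
static uint32_t min_pos_4_u32(uint32_t *x)
{
	uint32_t pos0 = 0;
	uint32_t pos1 = 2;

	if (x[1] <= x[0])
		pos0 = 1;
	if (x[3] <= x[2])
		pos1 = 3;
	if (x[pos1] <= x[pos0])
		pos0 = pos1;

	return pos0;
}

int main(void)
{
	uint32_t weights[4] = { 3, 5, 7, 11 };
	uint32_t tokens[4] = { 500, 120, 900, 120 };

	printf("%u\n", (unsigned)min_pos_4_u32(weights));	/* 0 */
	printf("%u\n", (unsigned)min_pos_4_u32(tokens));	/* 3, not 1 */
	return 0;
}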