From: "Singh, Jasvinder" <jasvinder.singh@intel.com>
To: "Dumitrescu, Cristian" <cristian.dumitrescu@intel.com>, "dev@dpdk.org"
 <dev@dpdk.org>
CC: "Tovar, AbrahamX" <abrahamx.tovar@intel.com>, "Krakowiak, LukaszX"
 <lukaszx.krakowiak@intel.com>
Thread-Topic: [PATCH v4 01/11] sched: remove wrr from strict priority tc queues
Thread-Index: AQHVO2gF8Eb5CF6PqkOBnuz8supfx6bO5Saw
Date: Wed, 17 Jul 2019 14:49:17 +0000
Message-ID: <54CBAA185211B4429112C315DA58FF6D3FD947AF@IRSMSX103.ger.corp.intel.com>
References: <20190711102659.59001-2-jasvinder.singh@intel.com>
 <20190712095729.159767-1-jasvinder.singh@intel.com>
 <20190712095729.159767-2-jasvinder.singh@intel.com>
 <3EB4FA525960D640B5BDFFD6A3D891268E8EEDA2@IRSMSX108.ger.corp.intel.com>
In-Reply-To: <3EB4FA525960D640B5BDFFD6A3D891268E8EEDA2@IRSMSX108.ger.corp.intel.com>
Accept-Language: en-US
Content-Language: en-US
Subject: Re: [dpdk-dev] [PATCH v4 01/11] sched: remove wrr from strict priority tc queues


<snip>
> > +version = 3
> >  sources = files('rte_sched.c', 'rte_red.c', 'rte_approx.c')
> >  headers = files('rte_sched.h', 'rte_sched_common.h',
> >  		'rte_red.h', 'rte_approx.h')
> > diff --git a/lib/librte_sched/rte_sched.c b/lib/librte_sched/rte_sched.c
> > index bc06bc3f4..b1f521794 100644
> > --- a/lib/librte_sched/rte_sched.c
> > +++ b/lib/librte_sched/rte_sched.c
> > @@ -37,6 +37,8 @@
> >
> >  #define RTE_SCHED_TB_RATE_CONFIG_ERR          (1e-7)
> >  #define RTE_SCHED_WRR_SHIFT                   3
> > +#define RTE_SCHED_TRAFFIC_CLASS_BE    (RTE_SCHED_TRAFFIC_CLASSES_PER_PIPE - 1)
> > +#define RTE_SCHED_MAX_QUEUES_PER_TC   RTE_SCHED_BE_QUEUES_PER_PIPE
> >  #define RTE_SCHED_GRINDER_PCACHE_SIZE         (64 / RTE_SCHED_QUEUES_PER_PIPE)
> >  #define RTE_SCHED_PIPE_INVALID                UINT32_MAX
> >  #define RTE_SCHED_BMP_POS_INVALID             UINT32_MAX
> > @@ -84,8 +86,9 @@ struct rte_sched_pipe_profile {
> >  	uint32_t tc_credits_per_period[RTE_SCHED_TRAFFIC_CLASSES_PER_PIPE];
> >  	uint8_t tc_ov_weight;
> >
> > -	/* Pipe queues */
> > -	uint8_t  wrr_cost[RTE_SCHED_QUEUES_PER_PIPE];
> > +	/* Pipe best-effort traffic class queues */
> > +	uint8_t n_be_queues;
>
> The n_be_queues is the same for all pipes within the same port, so it does not
> make sense to save this per-port value in each pipe profile. At the very least,
> let's move it to the port data structure, please.
>
> In fact, a better solution (that also simplifies the implementation) is to enforce
> the same queue size for all BE queues, as it does not make sense to have
> queues within the same traffic class of different sizes (see my comment in the
> other patch where you update the API). So n_be_queues should always be 4,
> therefore there is no need for this variable.
>

Thanks for your time and comments. I have removed n_be_queues in v5.

> > +	uint8_t  wrr_cost[RTE_SCHED_BE_QUEUES_PER_PIPE];
> >  };
> >
> >  struct rte_sched_pipe {
> > @@ -100,8 +103,10 @@ struct rte_sched_pipe {
> >  	uint64_t tc_time; /* time of next update */
> >  	uint32_t tc_credits[RTE_SCHED_TRAFFIC_CLASSES_PER_PIPE];
> >
> > +	uint8_t n_be_queues; /* Best effort traffic class queues */
>
> Same comment here, even more important, as we need to strive to reduce the
> size of this struct for performance reasons.
>
> > +
> >  	/* Weighted Round Robin (WRR) */
> > -	uint8_t wrr_tokens[RTE_SCHED_QUEUES_PER_PIPE];
> > +	uint8_t wrr_tokens[RTE_SCHED_BE_QUEUES_PER_PIPE];
> >
> >  	/* TC oversubscription */
> >  	uint32_t tc_ov_credits;
> > @@ -153,16 +158,16 @@ struct rte_sched_grinder {
> >  	uint32_t tc_index;
> >  	struct rte_sched_queue *queue[RTE_SCHED_TRAFFIC_CLASSES_PER_PIPE];
> >  	struct rte_mbuf **qbase[RTE_SCHED_TRAFFIC_CLASSES_PER_PIPE];
> > -	uint32_t qindex[RTE_SCHED_TRAFFIC_CLASSES_PER_PIPE];
> > -	uint16_t qsize;
> > +	uint32_t qindex[RTE_SCHED_MAX_QUEUES_PER_TC];
> > +	uint16_t qsize[RTE_SCHED_MAX_QUEUES_PER_TC];
> >  	uint32_t qmask;
> >  	uint32_t qpos;
> >  	struct rte_mbuf *pkt;
> >
> >  	/* WRR */
> > -	uint16_t wrr_tokens[RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS];
> > -	uint16_t wrr_mask[RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS];
> > -	uint8_t wrr_cost[RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS];
> > +	uint16_t wrr_tokens[RTE_SCHED_BE_QUEUES_PER_PIPE];
> > +	uint16_t wrr_mask[RTE_SCHED_BE_QUEUES_PER_PIPE];
> > +	uint8_t wrr_cost[RTE_SCHED_BE_QUEUES_PER_PIPE];
> >  };
> >
> >  struct rte_sched_port {
> > @@ -301,7 +306,6 @@ pipe_profile_check(struct rte_sched_pipe_params *params,
> >  		if (params->wrr_weights[i] == 0)
> >  			return -16;
> >  	}
> > -
> >  	return 0;
> >  }
> >
> > @@ -483,7 +487,7 @@ rte_sched_port_log_pipe_profile(struct rte_sched_port *port, uint32_t i)
> >  		"    Token bucket: period = %u, credits per period = %u, size = %u\n"
> >  		"    Traffic classes: period = %u, credits per period = [%u, %u, %u, %u]\n"
> >  		"    Traffic class 3 oversubscription: weight = %hhu\n"
> > -		"    WRR cost: [%hhu, %hhu, %hhu, %hhu], [%hhu, %hhu, %hhu, %hhu], [%hhu, %hhu, %hhu, %hhu], [%hhu, %hhu, %hhu, %hhu]\n",
> > +		"    WRR cost: [%hhu, %hhu, %hhu, %hhu]\n",
> >  		i,
> >
> >  		/* Token bucket */
> > @@ -502,10 +506,7 @@ rte_sched_port_log_pipe_profile(struct rte_sched_port *port, uint32_t i)
> >  		p->tc_ov_weight,
> >
> >  		/* WRR */
> > -		p->wrr_cost[ 0], p->wrr_cost[ 1], p->wrr_cost[ 2], p->wrr_cost[ 3],
> > -		p->wrr_cost[ 4], p->wrr_cost[ 5], p->wrr_cost[ 6], p->wrr_cost[ 7],
> > -		p->wrr_cost[ 8], p->wrr_cost[ 9], p->wrr_cost[10], p->wrr_cost[11],
> > -		p->wrr_cost[12], p->wrr_cost[13], p->wrr_cost[14], p->wrr_cost[15]);
> > +		p->wrr_cost[0], p->wrr_cost[1], p->wrr_cost[2], p->wrr_cost[3]);
> >  }
> >  }
> >
> >  static inline uint64_t
> > @@ -519,10 +520,12 @@ rte_sched_time_ms_to_bytes(uint32_t time_ms, uint32_t rate)
> >  }
> >
> >  static void
> > -rte_sched_pipe_profile_convert(struct rte_sched_pipe_params *src,
> > +rte_sched_pipe_profile_convert(struct rte_sched_port *port,
> > +	struct rte_sched_pipe_params *src,
> >  	struct rte_sched_pipe_profile *dst,
> >  	uint32_t rate)
> >  {
> > +	uint32_t wrr_cost[RTE_SCHED_BE_QUEUES_PER_PIPE];
> >  	uint32_t i;
> >
> >  	/* Token Bucket */
> > @@ -553,18 +556,36 @@ rte_sched_pipe_profile_convert(struct rte_sched_pipe_params *src,
> >  	dst->tc_ov_weight = src->tc_ov_weight;
> >  #endif
> >
> > -	/* WRR */
> > -	for (i = 0; i < RTE_SCHED_TRAFFIC_CLASSES_PER_PIPE; i++) {
> > -		uint32_t wrr_cost[RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS];
> > -		uint32_t lcd, lcd1, lcd2;
> > -		uint32_t qindex;
> > +	/* WRR queues */
> > +	for (i = 0; i < RTE_SCHED_BE_QUEUES_PER_PIPE; i++)
> > +		if (port->qsize[i])
> > +			dst->n_be_queues++;
> > +
> > +	if (dst->n_be_queues == 1)
> > +		dst->wrr_cost[0] = src->wrr_weights[0];
> > +
> > +	if (dst->n_be_queues == 2) {
> > +		uint32_t lcd;
> > +
> > +		wrr_cost[0] = src->wrr_weights[0];
> > +		wrr_cost[1] = src->wrr_weights[1];
> > +
> > +		lcd = rte_get_lcd(wrr_cost[0], wrr_cost[1]);
> > +
> > +		wrr_cost[0] = lcd / wrr_cost[0];
> > +		wrr_cost[1] = lcd / wrr_cost[1];
> >
> > -		qindex = i * RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS;
> > +		dst->wrr_cost[0] = (uint8_t) wrr_cost[0];
> > +		dst->wrr_cost[1] = (uint8_t) wrr_cost[1];
> > +	}
> >
> > -		wrr_cost[0] = src->wrr_weights[qindex];
> > -		wrr_cost[1] = src->wrr_weights[qindex + 1];
> > -		wrr_cost[2] = src->wrr_weights[qindex + 2];
> > -		wrr_cost[3] = src->wrr_weights[qindex + 3];
> > +	if (dst->n_be_queues == 4) {
>
> See the above comment, it is better and simpler to enforce n_be_queues == 4,
> which simplifies this code a lot, as it keeps only this branch and removes the
> need for the above two.
>

Fixed in v5.

> > +		uint32_t lcd1, lcd2, lcd;
> > +
> > +		wrr_cost[0] = src->wrr_weights[0];
> > +		wrr_cost[1] = src->wrr_weights[1];
> > +		wrr_cost[2] = src->wrr_weights[2];
> > +		wrr_cost[3] = src->wrr_weights[3];
> >
> >  		lcd1 = rte_get_lcd(wrr_cost[0], wrr_cost[1]);
> >  		lcd2 = rte_get_lcd(wrr_cost[2], wrr_cost[3]);
> > @@ -575,10 +596,10 @@ rte_sched_pipe_profile_convert(struct rte_sched_pipe_params *src,
> >  		wrr_cost[2] = lcd / wrr_cost[2];
> >  		wrr_cost[3] = lcd / wrr_cost[3];
> >
> > -		dst->wrr_cost[qindex] = (uint8_t) wrr_cost[0];
> > -		dst->wrr_cost[qindex + 1] = (uint8_t) wrr_cost[1];
> > -		dst->wrr_cost[qindex + 2] = (uint8_t) wrr_cost[2];
> > -		dst->wrr_cost[qindex + 3] = (uint8_t) wrr_cost[3];
> > +		dst->wrr_cost[0] = (uint8_t) wrr_cost[0];
> > +		dst->wrr_cost[1] = (uint8_t) wrr_cost[1];
> > +		dst->wrr_cost[2] = (uint8_t) wrr_cost[2];
> > +		dst->wrr_cost[3] = (uint8_t) wrr_cost[3];
> >  	}
> >  }
> >
> > @@ -592,7 +613,7 @@ rte_sched_port_config_pipe_profile_table(struct rte_sched_port *port,
> >  		struct rte_sched_pipe_params *src = params->pipe_profiles + i;
> >  		struct rte_sched_pipe_profile *dst = port->pipe_profiles + i;
> >
> > -		rte_sched_pipe_profile_convert(src, dst, params->rate);
> > +		rte_sched_pipe_profile_convert(port, src, dst, params->rate);
> >  		rte_sched_port_log_pipe_profile(port, i);
> >  	}
> >
> > @@ -976,7 +997,7 @@ rte_sched_port_pipe_profile_add(struct rte_sched_port *port,
> >  		return status;
> >
> >  	pp = &port->pipe_profiles[port->n_pipe_profiles];
> > -	rte_sched_pipe_profile_convert(params, pp, port->rate);
> > +	rte_sched_pipe_profile_convert(port, params, pp, port->rate);
> >
> >  	/* Pipe profile not exists */
> >  	for (i = 0; i < port->n_pipe_profiles; i++)
> > @@ -1715,6 +1736,7 @@ grinder_schedule(struct rte_sched_port *port, uint32_t pos)
> >  	struct rte_sched_queue *queue = grinder->queue[grinder->qpos];
> >  	struct rte_mbuf *pkt = grinder->pkt;
> >  	uint32_t pkt_len = pkt->pkt_len + port->frame_overhead;
> > +	int be_tc_active;
> >
> >  	if (!grinder_credits_check(port, pos))
> >  		return 0;
> > @@ -1725,13 +1747,18 @@ grinder_schedule(struct rte_sched_port *port, uint32_t pos)
> >  	/* Send packet */
> >  	port->pkts_out[port->n_pkts_out++] = pkt;
> >  	queue->qr++;
> > -	grinder->wrr_tokens[grinder->qpos] += pkt_len * grinder->wrr_cost[grinder->qpos];
> > +
> > +	be_tc_active = (grinder->tc_index == RTE_SCHED_TRAFFIC_CLASS_BE);
> > +	grinder->wrr_tokens[grinder->qpos] +=
> > +		pkt_len * grinder->wrr_cost[grinder->qpos] * be_tc_active;
> > +
>
> Integer multiplication is very expensive; you can easily avoid it by doing a
> bitwise AND with a mask whose values are either 0 or all-ones.
>

Replaced the multiplication with a bitwise AND operation in v5.


> >  	if (queue->qr == queue->qw) {
> >  		uint32_t qindex = grinder->qindex[grinder->qpos];
> >
> >  		rte_bitmap_clear(port->bmp, qindex);
> >  		grinder->qmask &= ~(1 << grinder->qpos);
> > -		grinder->wrr_mask[grinder->qpos] = 0;
> > +		if (be_tc_active)
> > +			grinder->wrr_mask[grinder->qpos] = 0;
> >  		rte_sched_port_set_queue_empty_timestamp(port, qindex);
> >  	}
> >
> > @@ -1877,7 +1904,7 @@ grinder_next_tc(struct rte_sched_port *port, uint32_t pos)
> >
> >  	grinder->tc_index = (qindex >> 2) & 0x3;
> >  	grinder->qmask = grinder->tccache_qmask[grinder->tccache_r];
> > -	grinder->qsize = qsize;
> > +	grinder->qsize[grinder->tc_index] = qsize;
> >
> >  	grinder->qindex[0] =3D qindex;
> >  	grinder->qindex[1] =3D qindex + 1;
> > @@ -1962,26 +1989,15 @@ grinder_wrr_load(struct rte_sched_port *port, uint32_t pos)
> >  	struct rte_sched_grinder *grinder = port->grinder + pos;
> >  	struct rte_sched_pipe *pipe = grinder->pipe;
> >  	struct rte_sched_pipe_profile *pipe_params = grinder->pipe_params;
> > -	uint32_t tc_index = grinder->tc_index;
> >  	uint32_t qmask = grinder->qmask;
> > -	uint32_t qindex;
> > -
> > -	qindex = tc_index * 4;
> > -
> > -	grinder->wrr_tokens[0] = ((uint16_t) pipe->wrr_tokens[qindex]) << RTE_SCHED_WRR_SHIFT;
> > -	grinder->wrr_tokens[1] = ((uint16_t) pipe->wrr_tokens[qindex + 1]) << RTE_SCHED_WRR_SHIFT;
> > -	grinder->wrr_tokens[2] = ((uint16_t) pipe->wrr_tokens[qindex + 2]) << RTE_SCHED_WRR_SHIFT;
> > -	grinder->wrr_tokens[3] = ((uint16_t) pipe->wrr_tokens[qindex + 3]) << RTE_SCHED_WRR_SHIFT;
> > -
> > -	grinder->wrr_mask[0] = (qmask & 0x1) * 0xFFFF;
> > -	grinder->wrr_mask[1] = ((qmask >> 1) & 0x1) * 0xFFFF;
> > -	grinder->wrr_mask[2] = ((qmask >> 2) & 0x1) * 0xFFFF;
> > -	grinder->wrr_mask[3] = ((qmask >> 3) & 0x1) * 0xFFFF;
> > +	uint32_t i;
> >
> > -	grinder->wrr_cost[0] = pipe_params->wrr_cost[qindex];
> > -	grinder->wrr_cost[1] = pipe_params->wrr_cost[qindex + 1];
> > -	grinder->wrr_cost[2] = pipe_params->wrr_cost[qindex + 2];
> > -	grinder->wrr_cost[3] = pipe_params->wrr_cost[qindex + 3];
> > +	for (i = 0; i < pipe->n_be_queues; i++) {
> > +		grinder->wrr_tokens[i] = ((uint16_t) pipe->wrr_tokens[i]) << RTE_SCHED_WRR_SHIFT;
> > +		grinder->wrr_mask[i] = ((qmask >> i) & 0x1) * 0xFFFF;
> > +		grinder->wrr_cost[i] = pipe_params->wrr_cost[i];
> > +	}
> >  }
> >
> >  static inline void
> > @@ -1989,19 +2005,12 @@ grinder_wrr_store(struct rte_sched_port *port, uint32_t pos)
> >  {
> >  	struct rte_sched_grinder *grinder = port->grinder + pos;
> >  	struct rte_sched_pipe *pipe = grinder->pipe;
> > -	uint32_t tc_index = grinder->tc_index;
> > -	uint32_t qindex;
> > -
> > -	qindex = tc_index * 4;
> > +	uint32_t i;
> >
> > -	pipe->wrr_tokens[qindex] = (grinder->wrr_tokens[0] & grinder->wrr_mask[0])
> > -		>> RTE_SCHED_WRR_SHIFT;
> > -	pipe->wrr_tokens[qindex + 1] = (grinder->wrr_tokens[1] & grinder->wrr_mask[1])
> > -		>> RTE_SCHED_WRR_SHIFT;
> > -	pipe->wrr_tokens[qindex + 2] = (grinder->wrr_tokens[2] & grinder->wrr_mask[2])
> > -		>> RTE_SCHED_WRR_SHIFT;
> > -	pipe->wrr_tokens[qindex + 3] = (grinder->wrr_tokens[3] & grinder->wrr_mask[3])
> > -		>> RTE_SCHED_WRR_SHIFT;
> > +	for (i = 0; i < pipe->n_be_queues; i++)
> > +		pipe->wrr_tokens[i] =
> > +			(grinder->wrr_tokens[i] & grinder->wrr_mask[i]) >>
> > +				RTE_SCHED_WRR_SHIFT;
> >  }
> >
> >  static inline void
> > @@ -2040,22 +2049,31 @@ static inline void
> >  grinder_prefetch_tc_queue_arrays(struct rte_sched_port *port, uint32_t pos)
> >  {
> >  	struct rte_sched_grinder *grinder = port->grinder + pos;
> > -	uint16_t qsize, qr[4];
> > +	struct rte_sched_pipe *pipe = grinder->pipe;
> > +	struct rte_sched_queue *queue;
> > +	uint32_t i;
> > +	uint16_t qsize, qr[RTE_SCHED_MAX_QUEUES_PER_TC];
> >
> > -	qsize = grinder->qsize;
> > -	qr[0] = grinder->queue[0]->qr & (qsize - 1);
> > -	qr[1] = grinder->queue[1]->qr & (qsize - 1);
> > -	qr[2] = grinder->queue[2]->qr & (qsize - 1);
> > -	qr[3] = grinder->queue[3]->qr & (qsize - 1);
> > +	grinder->qpos = 0;
> > +	if (grinder->tc_index < RTE_SCHED_TRAFFIC_CLASS_BE) {
> > +		queue = grinder->queue[0];
> > +		qsize = grinder->qsize[0];
> > +		qr[0] = queue->qr & (qsize - 1);
> >
> > -	rte_prefetch0(grinder->qbase[0] + qr[0]);
> > -	rte_prefetch0(grinder->qbase[1] + qr[1]);
> > +		rte_prefetch0(grinder->qbase[0] + qr[0]);
> > +		return;
> > +	}
> > +
> > +	for (i = 0; i < pipe->n_be_queues; i++) {
> > +		queue = grinder->queue[i];
> > +		qsize = grinder->qsize[i];
> > +		qr[i] = queue->qr & (qsize - 1);
> > +
> > +		rte_prefetch0(grinder->qbase[i] + qr[i]);
> > +	}
> >
> >  	grinder_wrr_load(port, pos);
> >  	grinder_wrr(port, pos);
> > -
> > -	rte_prefetch0(grinder->qbase[2] + qr[2]);
> > -	rte_prefetch0(grinder->qbase[3] + qr[3]);
> >  }
> >
> >  static inline void
> > @@ -2064,7 +2082,7 @@ grinder_prefetch_mbuf(struct rte_sched_port *port, uint32_t pos)
> >  	struct rte_sched_grinder *grinder = port->grinder + pos;
> >  	uint32_t qpos = grinder->qpos;
> >  	struct rte_mbuf **qbase = grinder->qbase[qpos];
> > -	uint16_t qsize = grinder->qsize;
> > +	uint16_t qsize = grinder->qsize[qpos];
> >  	uint16_t qr = grinder->queue[qpos]->qr & (qsize - 1);
> >
> >  	grinder->pkt = qbase[qr];
> > @@ -2118,18 +2136,24 @@ grinder_handle(struct rte_sched_port *port, uint32_t pos)
> >
> >  	case e_GRINDER_READ_MBUF:
> >  	{
> > -		uint32_t result = 0;
> > +		uint32_t wrr_active, result = 0;
> >
> >  		result = grinder_schedule(port, pos);
> >
> > +		wrr_active = (grinder->tc_index == RTE_SCHED_TRAFFIC_CLASS_BE);
> > +
> >  		/* Look for next packet within the same TC */
> >  		if (result && grinder->qmask) {
> > -			grinder_wrr(port, pos);
> > +			if (wrr_active)
> > +				grinder_wrr(port, pos);
> > +
> >  			grinder_prefetch_mbuf(port, pos);
> >
> >  			return 1;
> >  		}
> > -		grinder_wrr_store(port, pos);
> > +
> > +		if (wrr_active)
> > +			grinder_wrr_store(port, pos);
> >
> >  		/* Look for another active TC within same pipe */
> >  		if (grinder_next_tc(port, pos)) {
> > diff --git a/lib/librte_sched/rte_sched.h b/lib/librte_sched/rte_sched.h
> > index d61dda9f5..2a935998a 100644
> > --- a/lib/librte_sched/rte_sched.h
> > +++ b/lib/librte_sched/rte_sched.h
> > @@ -66,6 +66,22 @@ extern "C" {
> >  #include "rte_red.h"
> >  #endif
> >
> > +/** Maximum number of queues per pipe.
> > + * Note that multiple queues (power of 2) can only be assigned to the
> > + * lowest priority (best-effort) traffic class. Other higher priority traffic
> > + * classes can only have one queue.
> > + * Cannot be changed.
> > + *
> > + * @see struct rte_sched_port_params
> > + */
> > +#define RTE_SCHED_QUEUES_PER_PIPE    16
> > +
> > +/** Number of WRR queues for best-effort traffic class per pipe.
> > + *
> > + * @see struct rte_sched_pipe_params
> > + */
> > +#define RTE_SCHED_BE_QUEUES_PER_PIPE    4
> > +
> >  /** Number of traffic classes per pipe (as well as subport).
> >   * Cannot be changed.
> >   */
> > @@ -74,11 +90,6 @@ extern "C" {
> >  /** Number of queues per pipe traffic class. Cannot be changed. */
> >  #define RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS    4
> >
> > -/** Number of queues per pipe. */
> > -#define RTE_SCHED_QUEUES_PER_PIPE             \
> > -	(RTE_SCHED_TRAFFIC_CLASSES_PER_PIPE *     \
> > -	RTE_SCHED_QUEUES_PER_TRAFFIC_CLASS)
> > -
> >  /** Maximum number of pipe profiles that can be defined per port.
> >   * Compile-time configurable.
> >   */
> > @@ -165,7 +176,7 @@ struct rte_sched_pipe_params {
> >  #endif
> >
> >  	/* Pipe queues */
> > -	uint8_t  wrr_weights[RTE_SCHED_QUEUES_PER_PIPE]; /**< WRR weights */
> > +	uint8_t  wrr_weights[RTE_SCHED_BE_QUEUES_PER_PIPE]; /**< WRR weights */
> >  };
> >
> >  /** Queue statistics */
> > --
> > 2.21.0