From: Thomas Monjalon
To: "Dumitrescu, Cristian", "Singh, Jasvinder", "Ananyev, Konstantin"
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH v2 2/2] qos: rearrange enqueue procedure
Date: Wed, 17 Nov 2021 11:33:14 +0100
Message-ID: <3756923.HVIXXKutY6@thomas>
References: <20210316170723.22036-1-konstantin.ananyev@intel.com>

Hi all,

What is the conclusion for this patch and for patch 1/2?
04/04/2021 01:53, Ananyev, Konstantin:
> > Hi guys,
> > 
> > > > In many usage scenarios input mbufs for rte_sched_port_enqueue() are not
> > > > yet in the CPU cache(s). That causes quite significant stalls due to
> > > > memory latency. The current implementation tries to mitigate it using SW
> > > > pipeline and SW prefetch techniques, but stalls are still present.
> > > > Rework rte_sched_port_enqueue() to do an actual fetch of all mbuf
> > > > metadata as a first stage of that function.
> > > > That helps to minimise load stalls at further stages of enqueue() and
> > > > improves overall enqueue performance.
> > > > With examples/qos_sched I observed:
> > > > on ICX box: up to 30% cycles reduction
> > > > on CSX and BDX: 15-20% cycles reduction
> > > > I also ran tests with mbufs already in the cache (one core doing RX,
> > > > QOS and TX). In that scenario, no performance drop was observed on any
> > > > of the IA boxes mentioned above.
> > > >
> > > > Signed-off-by: Konstantin Ananyev
> > > > ---
> > > > v2: fix clang and checkpatch complaints
> > > > ---
> > > >  lib/librte_sched/rte_sched.c | 219 +++++------------------------------
> > > >  1 file changed, 31 insertions(+), 188 deletions(-)
> > > >
> > > > diff --git a/lib/librte_sched/rte_sched.c b/lib/librte_sched/rte_sched.c
> > > > index 7c5688068..41ef147e0 100644
> > > > --- a/lib/librte_sched/rte_sched.c
> > > > +++ b/lib/librte_sched/rte_sched.c
> > > > @@ -1861,24 +1861,23 @@ debug_check_queue_slab(struct rte_sched_subport *subport, uint32_t bmp_pos,
> > > >  #endif /* RTE_SCHED_DEBUG */
> > > > 
> > > >  static inline struct rte_sched_subport *
> > > > -rte_sched_port_subport(struct rte_sched_port *port,
> > > > -	struct rte_mbuf *pkt)
> > > > +sched_port_subport(const struct rte_sched_port *port, struct rte_mbuf_sched sch)
> > > >  {
> > > > -	uint32_t queue_id = rte_mbuf_sched_queue_get(pkt);
> > > > +	uint32_t queue_id = sch.queue_id;
> > > >  	uint32_t subport_id = queue_id >> (port->n_pipes_per_subport_log2 + 4);
> > > > 
> > > >  	return port->subports[subport_id];
> > > >  }
> > > > 
> > > >  static inline uint32_t
> > > > -rte_sched_port_enqueue_qptrs_prefetch0(struct rte_sched_subport *subport,
> > > > -	struct rte_mbuf *pkt, uint32_t subport_qmask)
> > > > +sched_port_enqueue_qptrs_prefetch0(const struct rte_sched_subport *subport,
> > > > +	struct rte_mbuf_sched sch, uint32_t subport_qmask)
> > > >  {
> > > >  	struct rte_sched_queue *q;
> > > >  #ifdef RTE_SCHED_COLLECT_STATS
> > > >  	struct rte_sched_queue_extra *qe;
> > > >  #endif
> > > > -	uint32_t qindex = rte_mbuf_sched_queue_get(pkt);
> > > > +	uint32_t qindex = sch.queue_id;
> > > >  	uint32_t subport_queue_id = subport_qmask & qindex;
> > > > 
> > > >  	q = subport->queue + subport_queue_id;
> > > > @@ -1971,197 +1970,41 @@ int
> > > >  rte_sched_port_enqueue(struct rte_sched_port *port, struct rte_mbuf **pkts,
> > > >  	uint32_t n_pkts)
> > > >  {
> > > > -	struct rte_mbuf *pkt00, *pkt01, *pkt10, *pkt11, *pkt20, *pkt21,
> > > > -		*pkt30, *pkt31, *pkt_last;
> > > > -	struct rte_mbuf **q00_base, **q01_base, **q10_base, **q11_base,
> > > > -		**q20_base, **q21_base, **q30_base, **q31_base, **q_last_base;
> > > > -	struct rte_sched_subport *subport00, *subport01, *subport10, *subport11,
> > > > -		*subport20, *subport21, *subport30, *subport31, *subport_last;
> > > > -	uint32_t q00, q01, q10, q11, q20, q21, q30, q31, q_last;
> > > > -	uint32_t r00, r01, r10, r11, r20, r21, r30, r31, r_last;
> > > > -	uint32_t subport_qmask;
> > > >  	uint32_t result, i;
> > > > +	struct rte_mbuf_sched sch[n_pkts];
> > > > +	struct rte_sched_subport *subports[n_pkts];
> > > > +	struct rte_mbuf **q_base[n_pkts];
> > > > +	uint32_t q[n_pkts];
> > > > +
> > > > +	const uint32_t subport_qmask =
> > > > +		(1 << (port->n_pipes_per_subport_log2 + 4)) - 1;
> > > > 
> > > >  	result = 0;
> > > > -	subport_qmask = (1 << (port->n_pipes_per_subport_log2 + 4)) - 1;
> > > > 
> > > > -	/*
> > > > -	 * Less then 6 input packets available, which is not enough to
> > > > -	 * feed the pipeline
> > > > -	 */
> > > > -	if (unlikely(n_pkts < 6)) {
> > > > -		struct rte_sched_subport *subports[5];
> > > > -		struct rte_mbuf **q_base[5];
> > > > -		uint32_t q[5];
> > > > -
> > > > -		/* Prefetch the mbuf structure of each packet */
> > > > -		for (i = 0; i < n_pkts; i++)
> > > > -			rte_prefetch0(pkts[i]);
> > > > -
> > > > -		/* Prefetch the subport structure for each packet */
> > > > -		for (i = 0; i < n_pkts; i++)
> > > > -			subports[i] = rte_sched_port_subport(port, pkts[i]);
> > > > -
> > > > -		/* Prefetch the queue structure for each queue */
> > > > -		for (i = 0; i < n_pkts; i++)
> > > > -			q[i] = rte_sched_port_enqueue_qptrs_prefetch0(subports[i],
> > > > -					pkts[i], subport_qmask);
> > > > -
> > > > -		/* Prefetch the write pointer location of each queue */
> > > > -		for (i = 0; i < n_pkts; i++) {
> > > > -			q_base[i] = rte_sched_subport_pipe_qbase(subports[i], q[i]);
> > > > -			rte_sched_port_enqueue_qwa_prefetch0(port, subports[i],
> > > > -					q[i], q_base[i]);
> > > > -		}
> > > > +	/* Prefetch the mbuf structure of each packet */
> > > > +	for (i = 0; i < n_pkts; i++)
> > > > +		sch[i] = pkts[i]->hash.sched;
> > > 
> > > Hi Konstantin, thanks for the patch. In the above case, all packets are
> > > touched straight away, without any prefetch. If we consider an input burst
> > > size of 64 pkts, that means 512 bytes of packet addresses (8 cache lines),
> > > which are likely to be available in the cache. For larger bursts, e.g. 128
> > > or 256, there might be instances when some addresses are not available in
> > > the cache, which may stall the core. How about adding an explicit prefetch
> > > before starting to iterate through the packets, if that helps?
> 
> I don't think we need any prefetch here.
> pkts[] is a sequential array; the HW prefetcher should be able to do a good
> job here. Again, in the majority of use-cases the pkts[] contents will
> already be present in the cache.
> Though there is a valid concern here: n_pkts can be big, and in that case we
> probably don't want to store too much on the stack or read too much from
> pkts[]. It is better to work in fixed-size chunks (64 or so).
> I can prepare a v2 with these changes, if there is still interest in this patch.
> 
> > Exactly. Konstantin, you might not be a fan of prefetches, but the current
> > enqueue implementation (as well as the dequeue) uses a prefetch state
> > machine. Please keep the prefetch state machine in the scalar code.
> 
> It is not about our own preferences.
> From my measurements, the new version is faster, and it is definitely simpler.
> 
> > Even if the examples/qos_sched might not show an advantage,
> > this is just a sample app and there are some more relevant use-cases as well.
> 
> Well, I hope that examples/qos_sched reflects at least some real-world
> use-cases for the QoS library. Otherwise, why do we have it inside the DPDK
> codebase?
> About 'more relevant use-cases': if you know of such, can you try them with
> this patch? I would really appreciate that.
> In fact, this is an ask not only to Cristian, but to all other interested
> parties: if your app uses librte_sched, please try this patch and provide
> feedback. If some tests flag a regression, I am absolutely OK to drop the
> patch.
> 
> Konstantin
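To make the discussion concrete, a minimal sketch of the fixed-chunk approach
suggested above could look as follows. This is only an illustration, not code
from the patch: the chunk size, the enqueue_chunk() and
sched_port_enqueue_chunked() names, and the next-chunk prefetch (along the
lines of Jasvinder's suggestion) are all assumptions; enqueue_chunk() stands
in for the staged per-burst body of rte_sched_port_enqueue() from the patch.

#include <rte_common.h>
#include <rte_mbuf.h>
#include <rte_prefetch.h>
#include <rte_sched.h>

#define ENQ_CHUNK 64	/* fixed chunk size ("64 or so") */

/* Hypothetical per-chunk body: gather scheduling metadata first,
 * then run the lookup/write-out stages on the chunk. */
static inline uint32_t
enqueue_chunk(struct rte_sched_port *port, struct rte_mbuf **pkts, uint32_t n)
{
	struct rte_mbuf_sched sch[ENQ_CHUNK];
	uint32_t i;

	/* First stage: fetch the scheduling metadata of each packet,
	 * as the patch does for the whole burst. */
	for (i = 0; i < n; i++)
		sch[i] = pkts[i]->hash.sched;

	/* ... subport/queue lookup and write-out stages as in the patch ... */
	RTE_SET_USED(port);
	RTE_SET_USED(sch);
	return n;
}

/* Walk the burst in bounded chunks, so stack usage stays fixed no matter
 * how big n_pkts is. */
static uint32_t
sched_port_enqueue_chunked(struct rte_sched_port *port,
	struct rte_mbuf **pkts, uint32_t n_pkts)
{
	uint32_t done = 0, next, j;

	while (done < n_pkts) {
		uint32_t n = RTE_MIN(n_pkts - done, (uint32_t)ENQ_CHUNK);

		/* Optionally prefetch the next chunk's mbufs while the
		 * current chunk is processed (Jasvinder's suggestion). */
		next = RTE_MIN(done + 2 * n, n_pkts);
		for (j = done + n; j < next; j++)
			rte_prefetch0(pkts[j]);

		done += enqueue_chunk(port, &pkts[done], n);
	}
	return done;
}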