From: Thomas Monjalon <thomas@monjalon.net>
To: Morten Brørup
Cc: dev@dpdk.org, Ajit Khaparde, "Ananyev, Konstantin", Andrew Rybchenko,
 "Yigit, Ferruh", david.marchand@redhat.com, "Richardson, Bruce",
 olivier.matz@6wind.com, jerinj@marvell.com, viacheslavo@nvidia.com,
 honnappa.nagarahalli@arm.com, maxime.coquelin@redhat.com,
 stephen@networkplumber.org, hemant.agrawal@nxp.com, Matan Azrad,
 Shahaf Shuler
Date: Sun, 01 Nov 2020 17:21:05 +0100
Message-ID: <11884450.PhTIdSRZFC@thomas>
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35C613CB@smartserver.smartshare.dk>
References: <20201029092751.3837177-1-thomas@monjalon.net> <3458411.u7HY7OY5Un@thomas> <98CBD80474FA8B44BF855DF32C47DC35C613CB@smartserver.smartshare.dk>
Subject: Re: [dpdk-dev] [PATCH 15/15] mbuf: move pool pointer in hotter first half
List-Id: DPDK patches and discussions
That's very interesting food for thought.
I hope we will have a good community discussion
on this list during this week to make some decisions.

01/11/2020 10:12, Morten Brørup:
> > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > Sent: Saturday, October 31, 2020 9:41 PM
> >
> > 31/10/2020 19:20, Morten Brørup:
> > > Thomas,
> > >
> > > Adding my thoughts to the already detailed feedback on this
> > > important patch...
> > >
> > > The first cache line is not inherently "hotter" than the second.
> > > The hotness depends on their usage.
> > >
> > > The mbuf cacheline1 marker has the following comment:
> > > /* second cache line - fields only used in slow path or on TX */
> > >
> > > In other words, the second cache line is intended not to be touched
> > > in fast path RX.
> > >
> > > I do not think this is true anymore. Not even with simple
> > > non-scattered RX. And regression testing probably didn't catch
> > > this, because the tests perform TX after RX, so the cache miss
> > > moved from TX to RX and became a cache hit in TX instead. (I may be
> > > wrong about this claim, but it's not important for the discussion.)
> > >
> > > I think the right question for this patch is: Can we achieve this -
> > > not using the second cache line for fast path RX - again by putting
> > > the right fields in the first cache line?
> > >
> > > Probably not in all cases, but perhaps for some...
> > >
> > > Consider the application scenarios.
> > >
> > > When a packet is received, one of three things happens to it:
> > > 1. It is immediately transmitted on one or more ports.
> > > 2. It is immediately discarded, e.g. by a firewall rule.
> > > 3. It is put in some sort of queue, e.g. a ring for the next
> > > pipeline stage, or in a QoS queue.
> > >
> > > 1. If the packet is immediately transmitted, the m->tx_offload
> > > field in the second cache line will be touched by the application
> > > and the TX function anyway, so we don't need to optimize the mbuf
> > > layout for this scenario.
> > >
> > > 2. The second scenario touches m->pool no matter how it is
> > > implemented. The application can avoid touching m->next by using
> > > rte_mbuf_raw_free(), knowing that the mbuf came directly from RX
> > > and thus no other fields have been touched. In this scenario, we
> > > want m->pool in the first cache line.
> > >
> > > 3. Now, let's consider the third scenario, where RX is followed by
> > > enqueue into a ring. If the application does nothing but put the
> > > packet into a ring, we don't need to move anything into the first
> > > cache line. But applications usually do more... So it is
> > > application specific what would be good to move to the first cache
> > > line:
> > >
> > > A. If the application does not use segmented mbufs, and performs
> > > analysis and preparation for transmission in the initial pipeline
> > > stages, and only the last pipeline stage performs TX, we could move
> > > m->tx_offload to the first cache line, which would keep the second
> > > cache line cold until the actual TX happens in the last pipeline
> > > stage - maybe even after the packet has waited in a QoS queue for a
> > > long time, and its cache lines have gone cold.
> > >
> > > B. If the application uses segmented mbufs on RX, it might make
> > > sense to move m->next to the first cache line. (We don't use
> > > segmented mbufs, so I'm not sure about this.)
> > >
> > > However, reality perhaps beats theory:
> > >
> > > Looking at the E1000 PMD, it seems like even its non-scattered RX
> > > function, eth_igb_recv_pkts(), sets m->next. If it only kept its
> > > own free pool pre-initialized instead... I haven't investigated
> > > other PMDs, except briefly looking at the mlx5 PMD, and it seems
> > > like it doesn't touch m->next in RX.
> > > I haven't looked deeper into how m->pool is being used by RX in
> > > PMDs, but I suppose that it isn't touched in RX.
> > >
> > > If only we had a performance test where RX was not immediately
> > > followed by TX, but the packets were passed through a large queue
> > > in-between, so RX cache misses were not free of charge because
> > > they transform TX cache misses into cache hits instead...
> > >
> > > Whatever you choose, I am sure that most applications will find it
> > > more useful than the timestamp. :-)
> >
> > Thanks for the thoughts Morten.
> > I believe we need benchmarks of different scenarios with different
> > drivers.
>
> If we are only allowed to modify the mbuf structure this one more
> time, we should look forward, not backwards!
>
> If we move m->tx_offload to the first cache line, applications using
> simple, non-scattered packet mbufs would never even need to touch the
> second cache line, except for freeing the mbuf (which needs to read
> m->pool).
>
> And this leads to my next suggestion...
>
> One thing has always puzzled me: Why do we use 64 bits to indicate
> which memory pool an mbuf belongs to? The portid only uses 16 bits and
> an indirection index. Why don't we use the same kind of indirection
> index for mbuf pools?
>
> I can easily imagine using one mbuf pool (or perhaps a few pools) per
> CPU socket (or per physical memory bus closest to an attached NIC),
> but not more than 256 mbuf memory pools in total. So, let's introduce
> an mbufpoolid like the portid, and cut this mbuf field down from 64 to
> 8 bits.
>
> If we also cut down m->pkt_len from 32 to 24 bits, we can get the
> 8 bit mbuf pool index into the first cache line at no additional cost.
>
> In other words: This would free up another 64 bit field in the mbuf
> structure!
>
> And even though the m->next pointer for scattered packets resides in
> the second cache line, the libraries and applications know that
> m->next is NULL when m->nb_segs is 1. This proves that my suggestion
> would make touching the second cache line unnecessary (in simple
> cases), even for re-initializing the mbuf.
>
> And now I will proceed out on a tangent with two more independent
> thoughts, so feel free to ignore.
>
> Consider a multi CPU socket system with one mbuf pool per CPU socket,
> where the NICs attached to each CPU socket use an RX mbuf pool with
> RAM on the same CPU socket. I would imagine that (re-)initializing
> these mbufs could be faster if performed only on a CPU on the same
> socket. If this is the case, mbufs should be re-initialized as part of
> the RX preparation at ingress, not as part of the mbuf free at egress.
>
> Perhaps some microarchitectures are faster to compare nb_segs==0 than
> nb_segs==1. If so, nb_segs could be redefined to mean the number of
> additional segments, rather than the number of segments.