Subject: RE: [PATCH] app/dma-perf: replace pktmbuf with mempool objects
Date: Tue, 12 Dec 2023 19:09:50 +0100
From: Morten Brørup
To: "Varghese, Vipin" , "Bruce Richardson"
Cc: "Yigit, Ferruh" , , , , "P, Thiyagarajan"
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35E9F0C1@smartserver.smartshare.dk>
References: <20231212103746.1910-1-vipin.varghese@amd.com> <98CBD80474FA8B44BF855DF32C47DC35E9F0BF@smartserver.smartshare.dk> <98CBD80474FA8B44BF855DF32C47DC35E9F0C0@smartserver.smartshare.dk>
List-Id: DPDK patches and discussions <dev.dpdk.org>

From: Varghese, Vipin [mailto:Vipin.Varghese@amd.com]
Sent: Tuesday, 12 December 2023 18.14

Sharing a few critical points based on my exposure to the dma-perf application below.

<Snipped>

On Tue, Dec 12, 2023 at 04:16:20PM +0100, Morten Brørup wrote:
> +TO: Bruce, please stop me if I'm completely off track here.
>
> > From: Ferruh Yigit [mailto:ferruh.yigit@amd.com] Sent: Tuesday, 12
> > December 2023 15.38
> >
> > On 12/12/2023 11:40 AM, Morten Brørup wrote:
> > >> From: Vipin Varghese [mailto:vipin.varghese@amd.com] Sent: Tuesday,
> > >> 12 December 2023 11.38
> > >>
> > >> Replace the pktmbuf pool with a mempool; this allows an increase in MOPS,
> > >> especially at lower buffer sizes. Using a mempool reduces the
> > >> extra CPU cycles.
> > >
> > > I get the point of this change: It tests the performance of copying
> > raw memory objects using respectively rte_memcpy and DMA, without the
> > mbuf indirection overhead.
> > >
> > > However, I still consider the existing test relevant: The performance
> > of copying packets using respectively rte_memcpy and DMA.
> > >
> >
> > This is a DMA performance test application and packets are not used;
> > using pktmbuf just introduces overhead to the main focus of the
> > application.
> >
> > I am not sure if pktmbuf was selected intentionally for this test
> > application, but I assume it is there because of historical reasons.
>
> I think pktmbuf was selected intentionally, to provide more accurate
> results for application developers trying to determine when to use
> rte_memcpy and when to use DMA. Much like the "copy breakpoint" in Linux
> Ethernet drivers is used to determine which code path to take for each
> received packet.

Yes Ferruh, this is the right understanding. In the DPDK examples we already have
the dma-forward application, which makes use of the pktmbuf payload to copy over
to a new pktmbuf payload area.

By moving to mempool, we are now focusing on the source and destination buffers.
This allows creating mempool objects with 2MB and 1GB src-dst areas, thus allowing
us to focus on the src-to-dst copy. With pktmbuf we were not able to achieve the
same (a minimal illustrative sketch of this difference follows at the end of this mail).

>
> Most applications will be working with pktmbufs, so these applications
> will also experience the pktmbuf overhead. Performance testing with the
> same overhead as the application will be better to help the application
> developer determine when to use rte_memcpy and when to use DMA when
> working with pktmbufs.

Morten, thank you for the input, but as shared above, the DPDK example dma-fwd does
justice to such a scenario. In line with test-compress-perf and test-crypto-perf, IMHO
test-dma-perf should focus on getting the best values of the DMA engine and memcpy
comparison.

>
> (Furthermore, for the pktmbuf tests, I wonder if copying performance
> could also depend on IOVA mode and RTE_IOVA_IN_MBUF.)
>
> Nonetheless, there may also be use cases where raw mempool objects are
> being copied by rte_memcpy or DMA, so adding tests for these use cases
> is useful.
>
>
> @Bruce, you were also deeply involved in the DMA library, and probably
> have more up-to-date practical experience with it. Am I right that
> pktmbuf overhead in these tests provides more "real life use"-like
> results? Or am I completely off track with my thinking here, i.e. the
> pktmbuf overhead is only noise?
>
I'm actually not that familiar with the dma-test application, so I can't
comment on the specific overhead involved here. In the general case, if we
are just talking about the overhead of dereferencing the mbufs, then I would
expect the overhead to be negligible. However, if we are looking to include
the cost of allocation and freeing of buffers, I'd try to avoid that, as it
is a cost that would have to be paid for both SW copies and HW copies, so
it should not count when calculating offload cost.

Bruce, as per test-dma-perf there is no repeated pktmbuf-alloc or pktmbuf-free,
so the pktmbuf overhead discussed here is not related to alloc and free. Rather,
as per my investigation, the cost goes into fetching the cacheline and performing
mtod on each iteration.

/Bruce

I can rewrite the logic to make use of pktmbuf objects by sending the src and dst
with pre-computed mtod to avoid the overhead. But this will not resolve the 2MB
and 1GB huge page copy alloc failures.

IMHO, in similar lines to the other perf applications, the dma-perf application
should focus on actual device performance over application performance.

[MB:]

OK, Vipin has multiple good arguments for this patch. I am convinced, let's proceed with it.

Acked-by: Morten Brørup <mb@smartsharesystems.com>
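
---

A minimal sketch of the difference being discussed, assuming a raw mempool of flat
buffers. This is an illustration only, not the actual test-dma-perf code; pool name,
object count, and element size are made up. The measured loop works on plain element
pointers fetched once up front, whereas the pktmbuf variant would dereference each
mbuf and resolve the data address via rte_pktmbuf_mtod() on every iteration.

#include <rte_eal.h>
#include <rte_lcore.h>
#include <rte_mempool.h>
#include <rte_memcpy.h>

#define NB_OBJS  64          /* illustrative object count */
#define OBJ_SIZE 2048        /* illustrative element size */

int main(int argc, char **argv)
{
	void *src[NB_OBJS], *dst[NB_OBJS];
	unsigned int i;

	if (rte_eal_init(argc, argv) < 0)
		return -1;

	/* Raw mempool: elements are flat buffers without mbuf metadata. */
	struct rte_mempool *mp = rte_mempool_create("dma_perf_pool",
			NB_OBJS * 2, OBJ_SIZE, 0, 0,
			NULL, NULL, NULL, NULL, rte_socket_id(), 0);
	if (mp == NULL)
		return -1;

	/* Fetch all src/dst objects once, outside the measured loop. */
	if (rte_mempool_get_bulk(mp, src, NB_OBJS) != 0 ||
			rte_mempool_get_bulk(mp, dst, NB_OBJS) != 0)
		return -1;

	/*
	 * Measured loop: plain pointer-to-pointer copies, no per-iteration
	 * mbuf dereference. With pktmbufs the copy would instead look like:
	 *   rte_memcpy(rte_pktmbuf_mtod(dst_m[i], void *),
	 *              rte_pktmbuf_mtod(src_m[i], void *), OBJ_SIZE);
	 */
	for (i = 0; i < NB_OBJS; i++)
		rte_memcpy(dst[i], src[i], OBJ_SIZE);

	rte_mempool_put_bulk(mp, src, NB_OBJS);
	rte_mempool_put_bulk(mp, dst, NB_OBJS);
	return 0;
}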

 

From:= = Varghese, Vipin [mailto:Vipin.Varghese@amd.com]
Sent: = Tuesday, 12 December 2023 = 18.14

Sharing a few = critical points based on my exposure to the dma-perf application = below

 

<Snipped>

On Tue, Dec 12, 2023 at 04:16:20PM +0100, Morten = Br=F8rup wrote:
> +TO: Bruce, please stop me if I'm completely off = track here.
>
> > From: Ferruh Yigit [mailto:ferruh.yigit@amd.com<= /a>] Sent: Tuesday, 12
> > December 2023 15.38
> = >
> > On 12/12/2023 11:40 AM, Morten Br=F8rup wrote:
> = > >> From: Vipin Varghese [
mailto:vipin.varghese@amd.co= m] Sent: Tuesday,
> > >> 12 December 2023 = 11.38
> > >>
> > >> Replace pktmbuf pool = with mempool, this allows increase in MOPS
> > >> = especially in lower buffer size. Using Mempool, allows to reduce = the
> > >> extra CPU cycles.
> > >
> = > > I get the point of this change: It tests the performance of = copying
> > raw memory objects using respectively rte_memcpy = and DMA, without the
> > mbuf indirection overhead.
> = > >
> > > However, I still consider the existing test = relevant: The performance
> > of copying packets using = respectively rte_memcpy and DMA.
> > >
> >
> = > This is DMA performance test application and packets are not = used,
> > using pktmbuf just introduces overhead to the main = focus of the
> > application.
> >
> > I am = not sure if pktmuf selected intentionally for this test
> > = application, but I assume it is there because of historical = reasons.
>
> I think pktmbuf was selected intentionally, to = provide more accurate
> results for application developers trying = to determine when to use
> rte_memcpy and when to use DMA. Much = like the "copy breakpoint" in Linux
> Ethernet drivers = is used to determine which code path to take for each
> received = packet.

 

yes Ferruh, this is the right understanding. In DPDK example we already = have 

dma-forward application which makes use of pktmbuf payload to copy = over

new pktmbuf payload area. 

 

by moving to mempool, we are actually now focusing on source and = destination buffers.

This allows to create mempool objects with 2MB and 1GB src-dst areas. = Thus allowing

to focus src to dst copy. With pktmbuf we were not able to achieve the = same.

 


>
> Most = applications will be working with pktmbufs, so these = applications
> will also experience the pktmbuf overhead. = Performance testing with the
> same overhead as the application = will be better to help the application
> developer determine when = to use rte_memcpy and when to use DMA when
> working with = pktmbufs.

 

Morten thank you for = the input, but as shared above DPDK example dma-fwd = does 

justice to such scenario. inline to = test-compress-perf & test-crypto-perf IMHO = test-dma-perf

should focus on getting best values of dma = engine and memcpy comparision.


>
> = (Furthermore, for the pktmbuf tests, I wonder if copying = performance
> could also depend on IOVA mode and = RTE_IOVA_IN_MBUF.)
>
> Nonetheless, there may also be use = cases where raw mempool objects are
> being copied by rte_memcpy = or DMA, so adding tests for these use cases
> are = useful.
>
>
> @Bruce, you were also deeply involved in = the DMA library, and probably
> have more up-to-date practical = experience with it. Am I right that
> pktmbuf overhead in these = tests provides more "real life use"-like
> results? Or = am I completely off track with my thinking here, i.e. the
> = pktmbuf overhead is only noise?
>
I'm actually not that = familiar with the dma-test application, so can't
comment on the = specific overhead involved here. In the general case, if we
are just = talking about the overhead of dereferencing the mbufs then I = would
expect the overhead to be negligible. However, if we are = looking to include
the cost of allocation and freeing of buffers, I'd = try to avoid that as it
is a cost that would have to be paid for both = SW copies and HW copies, so
should not count when calculating offload = cost.

 

Bruce, as per = test-dma-perf there is no repeated pktmbuf-alloc or = pktmbuf-free. 

Hence I disagree that = the overhead discussed for pkmbuf here is not related to alloc and = free.

But the cost as per my investigation goes = into fetching the cacheline and performing mtod = on

each = iteration.

/Bruce

I can rewrite the = logic to make use pktmbuf objects by sending the src and dst with = pre-computed 

mtod to avoid the = overhead. But this will not resolve the 2MB and 1GB huge page copy alloc = failures.

IMHO, I believe in similar lines to other = perf application, dma-perf application should focus on acutal = device

performance over application application = performance.

 

[MB:]

OK, Vipin has multiple good arguments for this patch. I am convinced, = let’s proceed with it.

 

Acked-by: Morten Br=F8rup = <mb@smartsharesystems.com>

 

------_=_NextPart_001_01DA2D26.65F74DDB--