From: "Jiang, Cheng1" <cheng1.jiang@intel.com>
To: "Richardson, Bruce" <bruce.richardson@intel.com>
Cc: "thomas@monjalon.net" <thomas@monjalon.net>,
	"mb@smartsharesystems.com" <mb@smartsharesystems.com>,
	"dev@dpdk.org" <dev@dpdk.org>, "Hu, Jiayu" <jiayu.hu@intel.com>,
	"Ding, Xuan" <xuan.ding@intel.com>,
	"Ma, WenwuX" <wenwux.ma@intel.com>,
	"Wang, YuanX" <yuanx.wang@intel.com>,
	"He, Xingguang" <xingguang.he@intel.com>
Subject: RE: [PATCH v3] app/dma-perf: introduce dma-perf application
Date: Mon, 6 Feb 2023 14:20:13 +0000	[thread overview]
Message-ID: <SN7PR11MB70197FA951CF388F7BB965C4DCDA9@SN7PR11MB7019.namprd11.prod.outlook.com> (raw)
In-Reply-To: <Y9eL0stPr5fxED4C@bricha3-MOBL.ger.corp.intel.com>

Hi Bruce,

Replies are inline.

Thanks,
Cheng

> -----Original Message-----
> From: Richardson, Bruce <bruce.richardson@intel.com>
> Sent: Monday, January 30, 2023 5:20 PM
> To: Jiang, Cheng1 <cheng1.jiang@intel.com>
> Cc: thomas@monjalon.net; mb@smartsharesystems.com; dev@dpdk.org;
> Hu, Jiayu <jiayu.hu@intel.com>; Ding, Xuan <xuan.ding@intel.com>; Ma,
> WenwuX <wenwux.ma@intel.com>; Wang, YuanX
> <yuanx.wang@intel.com>; He, Xingguang <xingguang.he@intel.com>
> Subject: Re: [PATCH v3] app/dma-perf: introduce dma-perf application
> 
> On Sat, Jan 28, 2023 at 01:32:05PM +0000, Jiang, Cheng1 wrote:
> > Hi Bruce,
> >
> > Sorry for the late reply. We were on the Spring Festival holiday last week.
> > Thanks for your comments.
> > Replies are inline.
> >
> > Thanks,
> > Cheng
> >
> > > -----Original Message-----
> > > From: Richardson, Bruce <bruce.richardson@intel.com>
> > > Sent: Wednesday, January 18, 2023 12:52 AM
> > > To: Jiang, Cheng1 <cheng1.jiang@intel.com>
> > > Cc: thomas@monjalon.net; mb@smartsharesystems.com; dev@dpdk.org;
> > > Hu, Jiayu <jiayu.hu@intel.com>; Ding, Xuan <xuan.ding@intel.com>;
> > > Ma, WenwuX <wenwux.ma@intel.com>; Wang, YuanX <yuanx.wang@intel.com>;
> > > He, Xingguang <xingguang.he@intel.com>
> > > Subject: Re: [PATCH v3] app/dma-perf: introduce dma-perf application
> > >
> > > On Tue, Jan 17, 2023 at 12:05:26PM +0000, Cheng Jiang wrote:
> > > > There are many high-performance DMA devices supported in DPDK now,
> > > > and these DMA devices can also be integrated into other modules of
> > > > DPDK as accelerators, such as Vhost. Before integrating DMA into
> > > > applications, developers need to know the performance of these DMA
> > > > devices in various scenarios and the performance of CPUs in the
> > > > same scenario, such as different buffer lengths. Only in this way
> > > > can we know the target performance of the application accelerated
> > > > by using them. This patch introduces a high-performance testing
> > > > tool, which supports comparing the performance of CPU and DMA in
> > > > different scenarios automatically with a pre-set config file.
> > > > Memory copy performance tests are supported for now.
> > > >
> > > > Signed-off-by: Cheng Jiang <cheng1.jiang@intel.com>
> > > > Signed-off-by: Jiayu Hu <jiayu.hu@intel.com>
> > > > Signed-off-by: Yuan Wang <yuanx.wang@intel.com>
> > > > Acked-by: Morten Brørup <mb@smartsharesystems.com>
> > > > ---
> > >
> > > More input based on trying out the application, including some
> > > thoughts on the testing methodology below.
> > >
> > >
> > > > +static void
> > > > +output_result(uint8_t scenario_id, uint32_t lcore_id, uint16_t dev_id, uint64_t ave_cycle,
> > > > +			uint32_t buf_size, uint32_t nr_buf, uint32_t memory,
> > > > +			float bandwidth, uint64_t ops, bool is_dma) {
> > > > +	if (is_dma)
> > > > +		printf("lcore %u, DMA %u:\n"
> > > > +				"average cycles: %" PRIu64 ","
> > > > +				" buffer size: %u, nr_buf: %u,"
> > > > +				" memory: %uMB, frequency: %" PRIu64 ".\n",
> > > > +				lcore_id,
> > > > +				dev_id,
> > > > +				ave_cycle,
> > > > +				buf_size,
> > > > +				nr_buf,
> > > > +				memory,
> > > > +				rte_get_timer_hz());
> > > > +	else
> > > > +		printf("lcore %u\n"
> > > > +			"average cycles: %" PRIu64 ","
> > > > +			" buffer size: %u, nr_buf: %u,"
> > > > +			" memory: %uMB, frequency: %" PRIu64 ".\n",
> > > > +			lcore_id,
> > > > +			ave_cycle,
> > > > +			buf_size,
> > > > +			nr_buf,
> > > > +			memory,
> > > > +			rte_get_timer_hz());
> > > > +
> > >
> > > The term "average cycles" is unclear here - is it average cycles per
> > > test iteration, or average cycles per buffer copy?
> >
> > The "average cycles" value is the average cycles per buffer copy; I'll clarify it in
> > the next version.
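To spell out how I read the reported numbers as relating to each other (including the bandwidth/OPS line quoted below), here is an illustrative helper. The name and exact formulas are my assumptions for discussion, not necessarily what the patch computes:

	#include <inttypes.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <rte_cycles.h>

	/* Hypothetical helper (not the patch's code): derive the reported figures
	 * from one timed run of nr_buf copies of buf_size bytes that took
	 * total_cycles TSC cycles in total. */
	static void
	report_one_run(uint64_t total_cycles, uint32_t nr_buf, uint32_t buf_size)
	{
		uint64_t hz = rte_get_timer_hz();
		uint64_t ave_cycle = total_cycles / nr_buf;            /* avg cycles per buffer copy */
		double time_sec = (double)total_cycles / hz;
		uint64_t ops = (uint64_t)(nr_buf / time_sec);          /* completed copies per second */
		double bandwidth = (double)buf_size * 8 * ops / 1e9;   /* payload throughput in Gbps */

		printf("average cycles/copy: %" PRIu64 ", OPS: %" PRIu64 ", bandwidth: %.3f Gbps\n",
				ave_cycle, ops, bandwidth);
	}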
> >
> > >
> > >
> > > > +	printf("Average bandwidth: %.3lfGbps, OPS: %" PRIu64 "\n", bandwidth, ops);
> > > > +
> > >
> > > <snip>
> > >
> > > > +
> > > > +static inline void
> > > > +do_dma_mem_copy(uint16_t dev_id, uint32_t nr_buf, uint16_t kick_batch, uint32_t buf_size,
> > > > +			uint16_t mpool_iter_step, struct rte_mbuf **srcs, struct rte_mbuf **dsts) {
> > > > +	int64_t async_cnt = 0;
> > > > +	int nr_cpl = 0;
> > > > +	uint32_t index;
> > > > +	uint16_t offset;
> > > > +	uint32_t i;
> > > > +
> > > > +	for (offset = 0; offset < mpool_iter_step; offset++) {
> > > > +		for (i = 0; index = i * mpool_iter_step + offset, index < nr_buf; i++) {
> > > > +			if (unlikely(rte_dma_copy(dev_id,
> > > > +						0,
> > > > +						srcs[index]->buf_iova + srcs[index]->data_off,
> > > > +						dsts[index]->buf_iova + dsts[index]->data_off,
> > > > +						buf_size,
> > > > +						0) < 0)) {
> > > > +				rte_dma_submit(dev_id, 0);
> > > > +				while (rte_dma_burst_capacity(dev_id, 0) == 0) {
> > > > +					nr_cpl = rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB,
> > > > +								NULL, NULL);
> > > > +					async_cnt -= nr_cpl;
> > > > +				}
> > > > +				if (rte_dma_copy(dev_id,
> > > > +						0,
> > > > +						srcs[index]->buf_iova + srcs[index]->data_off,
> > > > +						dsts[index]->buf_iova + dsts[index]->data_off,
> > > > +						buf_size,
> > > > +						0) < 0) {
> > > > +					printf("enqueue fail again at %u\n", index);
> > > > +					printf("space:%d\n", rte_dma_burst_capacity(dev_id, 0));
> > > > +					rte_exit(EXIT_FAILURE, "DMA enqueue failed\n");
> > > > +				}
> > > > +			}
> > > > +			async_cnt++;
> > > > +
> > > > +			/**
> > > > +			 * When '&' is used to wrap an index, mask must be a power of 2.
> > > > +			 * That is, kick_batch must be 2^n.
> > > > +			 */
> > > > +			if (unlikely((async_cnt % kick_batch) == 0)) {
> > > > +				rte_dma_submit(dev_id, 0);
> > > > +				/* add a poll to avoid ring full */
> > > > +				nr_cpl = rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB, NULL, NULL);
> > > > +				async_cnt -= nr_cpl;
> > > > +			}
> > > > +		}
> > > > +
> > > > +		rte_dma_submit(dev_id, 0);
> > > > +		while (async_cnt > 0) {
> > > > +			nr_cpl = rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB, NULL, NULL);
> > > > +			async_cnt -= nr_cpl;
> > > > +		}
> > >
> > > I have a couple of concerns about the methodology for testing the HW
> > > DMA performance. For example, the inclusion of that final block
> > > means that we are including the latency of the copy operation in the result.
> > >
> > > If the objective of the test application is to determine if it is
> > > cheaper for software to offload a copy operation to HW or do it in
> > > SW, then the primary concern is the HW offload cost. That offload
> > > cost should remain constant irrespective of the size of the copy -
> > > since all you are doing is writing a descriptor and reading a
> > > completion result. However, seeing the results of running the app, I
> > > notice that the reported average cycles increases as the packet size
> > > increases, which would tend to indicate that we are not giving a realistic
> > > measurement of offload cost.
> >
> > We are trying to compare the time required to complete a certain
> > amount of work using DMA with the time required to complete it using
> > the CPU. I think that, in addition to the offload cost, the capability of
> > the DMA engine itself is also an important factor to consider.
> > The offload cost should be constant, but when the DMA engine copies memory
> > regions of different lengths, the time costs differ. So the reported
> > average cycles increase as the packet size increases.
> > Therefore, this test result includes both the offload cost and the DMA
> > operation cost. To some extent, it should be a relatively realistic
> > measurement result.
> >
> > Does that make sense to you?
> >
> 
> Hi,
> 
> Yes, I get your point about the job latency being different when the
> packet/copy sizes increase, but on the other hand, as I state above the actual
> cycle cost to the application should not increase. If any application is doing
> what this test app is doing, just sitting around waiting for job completion (in
> the fast path), then it is likely that the programmer should look at improving
> the offload into the app.
> 
> The main issue here is that by outputting a single number, you are mixing
> two separate values - both offload cost and job latency. If you want to show
> the effects of larger/smaller packets on both, then you should output both
> values separately. For most applications where you will offload copies and do
> other work while the copy is being done, the offload cost is of primary
> concern. For some applications the latency figure may also be important, but
> in those cases the user will want to see the latency called out explicitly, not
> just mixed up in a single figure with offload cost.

Sure, makes sense to me, thanks.
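Something along these lines is what I have in mind for splitting the two figures (a rough sketch with placeholder names, assuming the whole batch fits in the vchan ring; not final code):

	#include <inttypes.h>
	#include <stdio.h>
	#include <rte_cycles.h>
	#include <rte_dmadev.h>

	#define MAX_DMA_CPL_NB 255

	/* Illustrative sketch only, not the patch's code: report the enqueue
	 * (offload) cost and the end-to-end job latency as two separate numbers.
	 * Assumes the nr_buf jobs fit in the vchan ring, so no retry logic. */
	static void
	dma_copy_split_report(int16_t dev_id, rte_iova_t *src, rte_iova_t *dst,
			uint32_t nr_buf, uint32_t buf_size)
	{
		uint64_t t0, enq_cycles, total_cycles;
		uint32_t i, done = 0;

		t0 = rte_rdtsc();
		for (i = 0; i < nr_buf; i++) {
			if (rte_dma_copy(dev_id, 0, src[i], dst[i], buf_size, 0) < 0) {
				printf("enqueue failed at %u\n", i);
				return;
			}
		}
		rte_dma_submit(dev_id, 0);
		enq_cycles = rte_rdtsc() - t0;        /* offload cost: descriptors + doorbell */

		while (done < nr_buf)                 /* now wait for the hardware to finish */
			done += rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB, NULL, NULL);
		total_cycles = rte_rdtsc() - t0;      /* includes the HW copy latency */

		printf("offload cost: %" PRIu64 " cycles/op, job latency: %" PRIu64 " cycles/op\n",
				enq_cycles / nr_buf, total_cycles / nr_buf);
	}

With that, latency would be shown alongside the offload cost rather than folded into a single figure.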

> 
> > >
> > > The trouble then becomes how to do so in a more realistic manner.
> > > The most accurate way I can think of in a unit test like this is to
> > > offload <queue_size> entries to the device and measure the cycles
> > > taken there. Then wait until such time as all copies are completed
> > > (to eliminate the latency time, which in a real-world case would be
> > > spent by a core doing something else), and then do a second
> > > measurement of the time taken to process all the completions. In the
> > > same way as for a SW copy, any time not spent in memcpy is not copy
> > > time; for HW copies, any time spent not writing descriptors or reading
> > > completions is not part of the offload cost.
> >
> > Agreed, we are thinking about adding the offload cost as one of the test
> > results in the future.
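For the record, my reading of the measurement you describe is roughly the sketch below; the helper name, the fixed delay and the other details are assumptions on my side:

	#include <inttypes.h>
	#include <stdint.h>
	#include <rte_cycles.h>
	#include <rte_dmadev.h>

	/* Sketch of my understanding of the suggested measurement: time only the
	 * descriptor writes/doorbell and the completion harvesting; the wait in
	 * between is deliberately excluded. Assumes queue_size jobs fit in the
	 * vchan ring. */
	static uint64_t
	offload_cost_cycles(int16_t dev_id, rte_iova_t *src, rte_iova_t *dst,
			uint16_t queue_size, uint32_t buf_size)
	{
		uint64_t t0, submit_cycles, harvest_cycles;
		uint16_t i, enq = 0, done = 0;

		/* 1) Timed: write descriptors and ring the doorbell. */
		t0 = rte_rdtsc();
		for (i = 0; i < queue_size; i++) {
			if (rte_dma_copy(dev_id, 0, src[i], dst[i], buf_size, 0) >= 0)
				enq++;
		}
		rte_dma_submit(dev_id, 0);
		submit_cycles = rte_rdtsc() - t0;

		/* 2) Not timed: give the hardware time to finish; a real core would
		 * be doing other useful work here. */
		rte_delay_us_block(1000);

		/* 3) Timed: harvest all the completions. */
		t0 = rte_rdtsc();
		while (done < enq)
			done += rte_dma_completed(dev_id, 0, enq, NULL, NULL);
		harvest_cycles = rte_rdtsc() - t0;

		return enq ? (submit_cycles + harvest_cycles) / enq : 0;  /* offload cost per job */
	}

A more robust version would poll (untimed) until the completion count stops rising instead of relying on a fixed delay.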
> >
> > >
> > > That said, doing the above is still not fully realistic, as a
> > > real-world app will likely still have some amount of other overhead,
> > > for example, polling occasionally for completions in between doing
> > > other work (though one would expect this to be relatively cheap).
> > > Similarly, if the submission queue fills, the app may have to delay
> > > waiting for space to submit jobs, and therefore see some of the HW copy latency.
> > >
> > > Therefore, I think the most realistic way to measure this is to look
> > > at the rate of operations while processing is being done in the
> > > middle of the test. For example, if we have a simple packet
> > > processing application, running the application just doing RX and TX
> > > and measuring the rate allows us to determine the basic packet I/O
> > > cost. Adding in an offload to HW for each packet and again measuring
> > > the rate, will then allow us to compute the true offload copy cost
> > > of the operation, and should give us a number that remains flat even
> > > as packet size increases. For previous work done on vhost with DMA
> > > acceleration, I believe we saw exactly that - while SW PPS reduced as
> > > packet size increased, with HW copies the PPS remained constant even as
> > > packet size increased.
> > >
> > > The challenge to my mind, is therefore how to implement this in a
> > > suitable unit- test style way, to fit into the framework you have
> > > given here. I would suggest that the actual performance measurement
> > > needs to be done - not on a total time - but on a fixed time basis
> > > within each test. For example, when doing HW copies, 1ms into each
> > > test run, we need to snapshot the completed entries, and then say
> > > 1ms later measure the number that have been completed since. In this
> > > way, we avoid the initial startup latency while we wait for jobs to start
> > > completing, and we avoid the final latency as we await the last job to complete.
> > > We would also include time for some potentially empty polls, and if
> > > a queue size is too small see that reflected in the performance too.
> >
> > I understand your concerns, but I think maybe we are not discussing the
> > same performance number here.
> > We are trying to test the maximum bandwidth of the DMA hardware, and what
> > you describe is how to measure the offload cost more accurately, if I
> > understand it correctly.
> > I think both of these performance numbers are important. Maybe we can add
> > your test methodology as one of the performance aspects for DMA in the
> > future; I need to reconsider it and get back to you later.
> >
> 
> Max bandwidth of HW is a third and separate number from that of offload cost
> and latency. Again, it should be measured and reported separately if you
> want the app to provide it.

OK, got it. We will try to implement such a test method in the next version.
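Roughly along these lines, I think (a sketch with assumed names, to be refined when we implement it):

	#include <inttypes.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <rte_cycles.h>

	/* Sketch with assumed names: the copy loop on another lcore keeps bumping
	 * *completed_ctr for every finished copy; this samples it twice and
	 * derives the steady-state rate, skipping start-up and tail latency. */
	static void
	sample_steady_state_rate(const uint64_t *completed_ctr, uint32_t buf_size)
	{
		uint64_t start_cnt, end_cnt, t0, hz = rte_get_timer_hz();
		double secs, ops;

		rte_delay_ms(1);                      /* let the pipeline fill first */
		start_cnt = __atomic_load_n(completed_ctr, __ATOMIC_RELAXED);
		t0 = rte_rdtsc();

		rte_delay_ms(1);                      /* measurement window */
		end_cnt = __atomic_load_n(completed_ctr, __ATOMIC_RELAXED);
		secs = (double)(rte_rdtsc() - t0) / hz;

		ops = (double)(end_cnt - start_cnt) / secs;
		printf("steady-state rate: %.0f copies/s, %.3f Gbps\n",
				ops, ops * buf_size * 8 / 1e9);
	}

The max bandwidth figure could then come from the same windowed rate, reported separately from offload cost and latency.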

Thanks.
Cheng

> 
> Regards,
> 
> /Bruce

