From: Bruce Richardson <bruce.richardson@intel.com>
To: Cheng Jiang <cheng1.jiang@intel.com>
Cc: <thomas@monjalon.net>, <mb@smartsharesystems.com>, <dev@dpdk.org>,
	<jiayu.hu@intel.com>, <xuan.ding@intel.com>,
	<wenwux.ma@intel.com>, <yuanx.wang@intel.com>,
	<xingguang.he@intel.com>
Subject: Re: [PATCH v3] app/dma-perf: introduce dma-perf application
Date: Tue, 17 Jan 2023 16:51:46 +0000
Message-ID: <Y8bSItTJo3QLWwBy@bricha3-MOBL.ger.corp.intel.com>
In-Reply-To: <20230117120526.39375-1-cheng1.jiang@intel.com>

On Tue, Jan 17, 2023 at 12:05:26PM +0000, Cheng Jiang wrote:
> There are many high-performance DMA devices supported in DPDK now, and
> these DMA devices can also be integrated into other modules of DPDK as
> accelerators, such as Vhost. Before integrating DMA into applications,
> developers need to know the performance of these DMA devices in various
> scenarios and the performance of CPUs in the same scenario, such as
> different buffer lengths. Only in this way can we know the target
> performance of the application accelerated by using them. This patch
> introduces a high-performance testing tool, which supports comparing the
> performance of CPU and DMA in different scenarios automatically with a
> pre-set config file. Memory copy performance tests are supported for now.
> 
> Signed-off-by: Cheng Jiang <cheng1.jiang@intel.com>
> Signed-off-by: Jiayu Hu <jiayu.hu@intel.com>
> Signed-off-by: Yuan Wang <yuanx.wang@intel.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> ---

More input based on trying out the application, including some thoughts
on the testing methodology below.


> +static void
> +output_result(uint8_t scenario_id, uint32_t lcore_id, uint16_t dev_id, uint64_t ave_cycle,
> +			uint32_t buf_size, uint32_t nr_buf, uint32_t memory,
> +			float bandwidth, uint64_t ops, bool is_dma)
> +{
> +	if (is_dma)
> +		printf("lcore %u, DMA %u:\n"
> +				"average cycles: %" PRIu64 ","
> +				" buffer size: %u, nr_buf: %u,"
> +				" memory: %uMB, frequency: %" PRIu64 ".\n",
> +				lcore_id,
> +				dev_id,
> +				ave_cycle,
> +				buf_size,
> +				nr_buf,
> +				memory,
> +				rte_get_timer_hz());
> +	else
> +		printf("lcore %u\n"
> +			"average cycles: %" PRIu64 ","
> +			" buffer size: %u, nr_buf: %u,"
> +			" memory: %uMB, frequency: %" PRIu64 ".\n",
> +			lcore_id,
> +			ave_cycle,
> +			buf_size,
> +			nr_buf,
> +			memory,
> +			rte_get_timer_hz());
> +

The term "average cycles" is unclear here - is it average cycles per test
iteration, or average cycles per buffer copy?
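
If it is per-iteration, it would also be worth reporting the per-buffer
figure, so the two can be compared directly. A rough, untested sketch
(total_cycles and nr_iters are hypothetical counters, not values from the
patch):

	printf("avg cycles/iteration: %" PRIu64 ", avg cycles/buffer: %" PRIu64 "\n",
			total_cycles / nr_iters,
			total_cycles / ((uint64_t)nr_iters * nr_buf));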


> +	printf("Average bandwidth: %.3lfGbps, OPS: %" PRIu64 "\n", bandwidth, ops);
> +

<snip>

> +
> +static inline void
> +do_dma_mem_copy(uint16_t dev_id, uint32_t nr_buf, uint16_t kick_batch, uint32_t buf_size,
> +			uint16_t mpool_iter_step, struct rte_mbuf **srcs, struct rte_mbuf **dsts)
> +{
> +	int64_t async_cnt = 0;
> +	int nr_cpl = 0;
> +	uint32_t index;
> +	uint16_t offset;
> +	uint32_t i;
> +
> +	for (offset = 0; offset < mpool_iter_step; offset++) {
> +		for (i = 0; index = i * mpool_iter_step + offset, index < nr_buf; i++) {
> +			if (unlikely(rte_dma_copy(dev_id,
> +						0,
> +						srcs[index]->buf_iova + srcs[index]->data_off,
> +						dsts[index]->buf_iova + dsts[index]->data_off,
> +						buf_size,
> +						0) < 0)) {
> +				rte_dma_submit(dev_id, 0);
> +				while (rte_dma_burst_capacity(dev_id, 0) == 0) {
> +					nr_cpl = rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB,
> +								NULL, NULL);
> +					async_cnt -= nr_cpl;
> +				}
> +				if (rte_dma_copy(dev_id,
> +						0,
> +						srcs[index]->buf_iova + srcs[index]->data_off,
> +						dsts[index]->buf_iova + dsts[index]->data_off,
> +						buf_size,
> +						0) < 0) {
> +					printf("enqueue fail again at %u\n", index);
> +					printf("space:%d\n", rte_dma_burst_capacity(dev_id, 0));
> +					rte_exit(EXIT_FAILURE, "DMA enqueue failed\n");
> +				}
> +			}
> +			async_cnt++;
> +
> +			/**
> +			 * When '&' is used to wrap an index, mask must be a power of 2.
> +			 * That is, kick_batch must be 2^n.
> +			 */
> +			if (unlikely((async_cnt % kick_batch) == 0)) {
> +				rte_dma_submit(dev_id, 0);
> +				/* add a poll to avoid ring full */
> +				nr_cpl = rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB, NULL, NULL);
> +				async_cnt -= nr_cpl;
> +			}
> +		}
> +
> +		rte_dma_submit(dev_id, 0);
> +		while (async_cnt > 0) {
> +			nr_cpl = rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB, NULL, NULL);
> +			async_cnt -= nr_cpl;
> +		}

I have a couple of concerns about the methodology for testing the HW DMA
performance. For example, the inclusion of that final block means that we
are including the latency of the copy operation in the result.

If the objective of the test application is to determine if it is cheaper
for software to offload a copy operation to HW or do it in SW, then the
primary concern is the HW offload cost. That offload cost should remain
constant irrespective of the size of the copy - since all you are doing is
writing a descriptor and reading a completion result. However, seeing the
results of running the app, I notice that the reported average cycle count
increases as the packet size increases, which would tend to indicate that
we are not getting a realistic measurement of the offload cost.

The trouble then becomes how to do so in a more realistic manner. The most
accurate way I can think of in a unit test like this is to offload
<queue_size> entries to the device and measure the cycles taken there. Then
wait until such time as all copies are completed (to eliminate the latency
time, which in a real-world case would be spent by a core doing something
else), and then do a second measurement of the time taken to process all
the completions. In the same way that, for a SW copy, any time not spent
in memcpy is not copy time, for HW copies any time not spent writing
descriptors or reading completions is not part of the offload cost.
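
In rough code terms, the two-phase measurement might look like the below
(untested sketch; queue_size stands in for whatever ring depth the vchan
was configured with, and the 100us sleep is just an arbitrary way to let
the copies land untimed):

	uint64_t start = rte_rdtsc();
	for (uint32_t i = 0; i < queue_size; i++)
		if (rte_dma_copy(dev_id, 0,
				srcs[i]->buf_iova + srcs[i]->data_off,
				dsts[i]->buf_iova + dsts[i]->data_off,
				buf_size, 0) < 0)
			rte_exit(EXIT_FAILURE, "DMA enqueue failed\n");
	rte_dma_submit(dev_id, 0);
	uint64_t enq_cycles = rte_rdtsc() - start;	/* descriptor-write cost */

	/* untimed wait: the HW copy latency would be hidden by other work
	 * in a real app, so keep it out of the measurement */
	rte_delay_us_sleep(100);

	start = rte_rdtsc();
	uint32_t done = 0;
	while (done < queue_size)
		done += rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB, NULL, NULL);
	uint64_t cpl_cycles = rte_rdtsc() - start;	/* completion-read cost */

	/* per-job offload cost ~= (enq_cycles + cpl_cycles) / queue_size */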

That said, doing the above is still not fully realistic, as a real-world
app will likely still have some amount of other overhead, for example,
polling occasionally for completions in between doing other work (though
one would expect this to be relatively cheap).  Similarly, if the
submission queue fills, the app may have to delay waiting for space to
submit jobs, and therefore see some of the HW copy latency.

Therefore, I think the most realistic way to measure this is to look at the
rate of operations while processing is being done in the middle of the
test. For example, if we have a simple packet processing application,
running the application just doing RX and TX and measuring the rate allows
us to determine the basic packet I/O cost. Adding in an offload to HW for
each packet and again measuring the rate will then allow us to compute the
true offload copy cost of the operation, and should give us a number that
remains flat even as packet size increases. For previous work done on vhost
with DMA acceleration, I believe we saw exactly that - while SW PPS reduced
as packet size increased, with HW copies the PPS remained constant even as
packet size increased.
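
To put made-up numbers on that: on a 2GHz core, 20Mpps of basic RX/TX is
100 cycles/packet; if adding a HW copy offload per packet drops the rate
to 16Mpps, i.e. 125 cycles/packet, the offload cost is ~25 cycles/packet,
and that figure, unlike the SW memcpy cost, should stay roughly flat as
packet size grows.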

The challenge, to my mind, is therefore how to implement this in a suitable
unit-test style way, to fit into the framework you have given here. I would
suggest that the actual performance measurement needs to be done not on a
total-time basis, but on a fixed-time basis within each test. For example, when
doing HW copies, 1ms into each test run, we need to snapshot the completed
entries, and then say 1ms later measure the number that have been completed
since. In this way, we avoid the initial startup latency while we wait for
jobs to start completing, and we avoid the final latency as we await the
last job to complete. We would also include the time for some potentially
empty polls, and, if the queue size is too small, see that reflected in the
performance too.
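
As a rough illustration (untested; the 1ms window is arbitrary, and in the
real test this polling would be interleaved with the enqueue side rather
than running on its own):

	const uint64_t window = rte_get_timer_hz() / 1000;	/* ~1ms in cycles */
	uint64_t deadline, done = 0;

	/* warm-up window: discard completions while jobs start flowing,
	 * so the initial startup latency is excluded */
	deadline = rte_rdtsc() + window;
	while (rte_rdtsc() < deadline)
		rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB, NULL, NULL);

	/* measurement window: count completions, empty polls included */
	deadline = rte_rdtsc() + window;
	while (rte_rdtsc() < deadline)
		done += rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB, NULL, NULL);

	uint64_t ops_per_sec = done * 1000;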

Thoughts, input from others?

/Bruce
