DPDK patches and discussions
From: "Hu, Jiayu" <jiayu.hu@intel.com>
To: "Richardson, Bruce" <bruce.richardson@intel.com>,
	"Jiang, Cheng1" <cheng1.jiang@intel.com>
Cc: "thomas@monjalon.net" <thomas@monjalon.net>,
	"mb@smartsharesystems.com" <mb@smartsharesystems.com>,
	"dev@dpdk.org" <dev@dpdk.org>, "Ding, Xuan" <xuan.ding@intel.com>,
	"Ma, WenwuX" <wenwux.ma@intel.com>,
	"Wang, YuanX" <yuanx.wang@intel.com>,
	"He, Xingguang" <xingguang.he@intel.com>
Subject: RE: [PATCH v3] app/dma-perf: introduce dma-perf application
Date: Tue, 31 Jan 2023 05:27:45 +0000	[thread overview]
Message-ID: <CY5PR11MB64872EFDC726EB34DC1755B492D09@CY5PR11MB6487.namprd11.prod.outlook.com> (raw)
In-Reply-To: <Y8bSItTJo3QLWwBy@bricha3-MOBL.ger.corp.intel.com>

Hi Bruce,

> -----Original Message-----
> From: Richardson, Bruce <bruce.richardson@intel.com>
> Sent: Wednesday, January 18, 2023 12:52 AM
> To: Jiang, Cheng1 <cheng1.jiang@intel.com>
> Cc: thomas@monjalon.net; mb@smartsharesystems.com; dev@dpdk.org;
> Hu, Jiayu <jiayu.hu@intel.com>; Ding, Xuan <xuan.ding@intel.com>; Ma,
> WenwuX <wenwux.ma@intel.com>; Wang, YuanX
> <yuanx.wang@intel.com>; He, Xingguang <xingguang.he@intel.com>
> Subject: Re: [PATCH v3] app/dma-perf: introduce dma-perf application
> 
> > +
> > +static inline void
> > +do_dma_mem_copy(uint16_t dev_id, uint32_t nr_buf, uint16_t kick_batch, uint32_t buf_size,
> > +			uint16_t mpool_iter_step, struct rte_mbuf **srcs, struct rte_mbuf **dsts)
> > +{
> > +	int64_t async_cnt = 0;
> > +	int nr_cpl = 0;
> > +	uint32_t index;
> > +	uint16_t offset;
> > +	uint32_t i;
> > +
> > +	for (offset = 0; offset < mpool_iter_step; offset++) {
> > +		for (i = 0; index = i * mpool_iter_step + offset, index < nr_buf; i++) {
> > +			if (unlikely(rte_dma_copy(dev_id,
> > +						0,
> > +						srcs[index]->buf_iova + srcs[index]->data_off,
> > +						dsts[index]->buf_iova + dsts[index]->data_off,
> > +						buf_size,
> > +						0) < 0)) {
> > +				rte_dma_submit(dev_id, 0);
> > +				while (rte_dma_burst_capacity(dev_id, 0) == 0) {
> > +					nr_cpl = rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB,
> > +								NULL, NULL);
> > +					async_cnt -= nr_cpl;
> > +				}
> > +				if (rte_dma_copy(dev_id,
> > +						0,
> > +						srcs[index]->buf_iova + srcs[index]->data_off,
> > +						dsts[index]->buf_iova + dsts[index]->data_off,
> > +						buf_size,
> > +						0) < 0) {
> > +					printf("enqueue fail again at %u\n", index);
> > +					printf("space:%d\n", rte_dma_burst_capacity(dev_id, 0));
> > +					rte_exit(EXIT_FAILURE, "DMA enqueue failed\n");
> > +				}
> > +			}
> > +			async_cnt++;
> > +
> > +			/**
> > +			 * When '&' is used to wrap an index, mask must be a power of 2.
> > +			 * That is, kick_batch must be 2^n.
> > +			 */
> > +			if (unlikely((async_cnt % kick_batch) == 0)) {
> > +				rte_dma_submit(dev_id, 0);
> > +				/* add a poll to avoid ring full */
> > +				nr_cpl = rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB, NULL, NULL);
> > +				async_cnt -= nr_cpl;
> > +			}
> > +		}
> > +
> > +		rte_dma_submit(dev_id, 0);
> > +		while (async_cnt > 0) {
> > +			nr_cpl = rte_dma_completed(dev_id, 0, MAX_DMA_CPL_NB, NULL, NULL);
> > +			async_cnt -= nr_cpl;
> > +		}
> 
> I have a couple of concerns about the methodology for testing the HW DMA
> performance. For example, the inclusion of that final block means that we
> are including the latency of the copy operation in the result.

The waiting time will not exceed the time needed to complete <SW queue size> jobs.
We also introduce an initial startup period while waiting for the DMA to start
completing jobs. If the total number of jobs is large, the impact of both can be
ignored. However, the problem is that the current implementation cannot guarantee
that, especially when the memory footprint is small.

> 
> If the objective of the test application is to determine if it is cheaper for
> software to offload a copy operation to HW or do it in SW, then the primary
> concern is the HW offload cost. That offload cost should remain constant
> irrespective of the size of the copy - since all you are doing is writing a
> descriptor and reading a completion result. However, seeing the results of
> running the app, I notice that the reported average cycles increases as the
> packet size increases, which would tend to indicate that we are not giving a
> realistic measurement of offload cost.

I agree with the point that offload cost is very important. When using DMA
in an async manner, for N jobs, the total DMA processing cycles, the total offload
cost cycles, and the total cycles the core spends on higher-level processing
determine the final performance. The total offload cost can be roughly calculated
as the average offload cost multiplied by the number of offload operations. As a
benchmark tool, we can report the average offload cost (i.e., the total cycles of
one submit, one kick and one poll function call), which is independent of the copy
size. For example, for 1024 jobs and a batch size of 16, the total offload cost is
64 (1024/16) times the average offload cost.

Repeat 64 times {
	submit 16 jobs; // always succeeds
	kick;
	poll; // 16 jobs are done
}

As you pointed out, the above estimation method may become inaccurate if
applications only call the poll function occasionally, or have to wait for space
when a submit fails. But the offload cost is still a very important value for
applications, as it shows the basic cost of using the DPDK dmadev library and
provides a rough estimate for applications.
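
To make that concrete, below is a rough, untested sketch (not code from the patch;
the function name and parameters are made up for illustration, and nr_buf is assumed
to be a multiple of the batch size) of how the average offload cost per batch could
be measured with the dmadev API, counting only the enqueue/kick/poll cycles and
leaving the final drain out of the measurement:

#include <stdint.h>
#include <rte_cycles.h>
#include <rte_dmadev.h>
#include <rte_mbuf.h>

static uint64_t
estimate_offload_cost(int16_t dev_id, struct rte_mbuf **srcs, struct rte_mbuf **dsts,
		      uint32_t nr_buf, uint16_t batch, uint32_t buf_size)
{
	uint64_t busy_cycles = 0;
	uint32_t done = 0, i;

	for (i = 0; i < nr_buf; i += batch) {
		uint64_t start = rte_rdtsc_precise();
		uint16_t j;

		/* enqueue one batch; assume the ring always has room here */
		for (j = 0; j < batch; j++)
			rte_dma_copy(dev_id, 0,
				     rte_mbuf_data_iova(srcs[i + j]),
				     rte_mbuf_data_iova(dsts[i + j]),
				     buf_size, 0);
		rte_dma_submit(dev_id, 0);		/* kick */
		done += rte_dma_completed(dev_id, 0,	/* one poll */
					  batch, NULL, NULL);

		/* only the enqueue/kick/poll cycles count as offload cost */
		busy_cycles += rte_rdtsc_precise() - start;
	}

	/* drain what is still in flight; this time is *not* offload cost */
	while (done < nr_buf)
		done += rte_dma_completed(dev_id, 0, batch, NULL, NULL);

	return busy_cycles / (nr_buf / batch);	/* average cycles per batch */
}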

> 
> The trouble then becomes how to do so in a more realistic manner. The most
> accurate way I can think of in a unit test like this is to offload <queue_size>
> entries to the device and measure the cycles taken there. Then wait until
> such time as all copies are completed (to eliminate the latency time, which in
> a real-world case would be spent by a core doing something else), and then
> do a second measurement of the time taken to process all the completions.
> In the same way as for a SW copy, any time not spent in memcpy is not copy
> time, for HW copies any time spent not writing descriptors or reading
> completions is not part of the offload cost.
> 
> That said, doing the above is still not fully realistic, as a real-world app will
> likely still have some amount of other overhead, for example, polling
> occasionally for completions in between doing other work (though one
> would expect this to be relatively cheap).  Similarly, if the submission queue
> fills, the app may have to delay waiting for space to submit jobs, and
> therefore see some of the HW copy latency.
> 
> Therefore, I think the most realistic way to measure this is to look at the rate
> of operations while processing is being done in the middle of the test. For
> example, if we have a simple packet processing application, running the
> application just doing RX and TX and measuring the rate allows us to
> determine the basic packet I/O cost. Adding in an offload to HW for each
> packet and again measuring the rate, will then allow us to compute the true
> offload copy cost of the operation, and should give us a number that remains
> flat even as packet size increases. For previous work done on vhost with
> DMA acceleration, I believe we saw exactly that - while SW PPS reduced as
> packet size increased, with HW copies the PPS remained constant even as
> packet size increased.
> 
> The challenge to my mind, is therefore how to implement this in a suitable
> unit-test style way, to fit into the framework you have given here. I would
> suggest that the actual performance measurement needs to be done - not
> on a total time - but on a fixed time basis within each test. For example,
> when doing HW copies, 1ms into each test run, we need to snapshot the
> completed entries, and then say 1ms later measure the number that have
> been completed since. In this way, we avoid the initial startup latency while
> we wait for jobs to start completing, and we avoid the final latency as we
> await the last job to complete. We would also include time for some
> potentially empty polls, and if a queue size is too small see that reflected in
> the performance too.

The method you mentioned above is similar to how we measure NIC RX/TX PPS,
where the main thread snapshots the completed jobs of all worker threads over a
fixed time window. The trouble is deciding when one test case finishes, as the
current framework runs N test cases until all of them complete. We may fix the
testing time per case, e.g. 1s, and in each test the core repeatedly feeds N jobs
to the DMA until the time runs out.
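
As a rough illustration of that fixed-window idea (again not real code from the
tool; the counter name and the warm-up handling are assumptions), the main core
could do something like:

#include <stdint.h>
#include <rte_cycles.h>
#include <rte_pause.h>

/* incremented by the worker core for every completed copy
 * (e.g. with __atomic_fetch_add); only read here on the main core */
static uint64_t worker_completed;

static double
sample_copy_rate(double window_sec)
{
	uint64_t hz = rte_get_tsc_hz();
	uint64_t start_cnt, end_cnt, start_tsc;

	/* first snapshot, taken after the worker has warmed up */
	start_cnt = __atomic_load_n(&worker_completed, __ATOMIC_RELAXED);
	start_tsc = rte_rdtsc();

	/* wait for the measurement window to elapse */
	while (rte_rdtsc() - start_tsc < (uint64_t)(window_sec * hz))
		rte_pause();

	end_cnt = __atomic_load_n(&worker_completed, __ATOMIC_RELAXED);

	/* copies completed per second during the window only */
	return (double)(end_cnt - start_cnt) / window_sec;
}

The worker would keep feeding jobs to the DMA for the whole test, so only the
steady-state rate is reported and the startup and drain latency drop out.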

Lastly, I want to point out that the current DMA throughput is measured in an async
manner. The result will be different when using a sync manner, so the benchmark
tool needs to state this in the final results.
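
For reference, a sync usage would look roughly like the sketch below (purely
illustrative, not part of the patch): each job is submitted and fully completed
before the next one starts, so the per-job completion latency is always included
in the measured time:

#include <rte_dmadev.h>
#include <rte_mbuf.h>

/* Copy nr_buf buffers one at a time, waiting for each job to complete
 * before submitting the next ("sync" usage). */
static void
do_sync_copies(int16_t dev_id, struct rte_mbuf **srcs, struct rte_mbuf **dsts,
	       uint32_t nr_buf, uint32_t buf_size)
{
	uint32_t i;

	for (i = 0; i < nr_buf; i++) {
		while (rte_dma_copy(dev_id, 0,
				    rte_mbuf_data_iova(srcs[i]),
				    rte_mbuf_data_iova(dsts[i]),
				    buf_size, 0) < 0)
			;	/* retry until the ring accepts the job */
		rte_dma_submit(dev_id, 0);

		/* busy-wait for this single job to finish */
		while (rte_dma_completed(dev_id, 0, 1, NULL, NULL) == 0)
			;
	}
}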

Thanks,
Jiayu
> 
> Thoughts, input from others?
> 
> /Bruce

Thread overview: 15+ messages
2022-12-20  1:06 [PATCH] " Cheng Jiang
2023-01-17  1:56 ` [PATCH v2] " Cheng Jiang
2023-01-17 13:00   ` Bruce Richardson
2023-01-17 13:54     ` Jiang, Cheng1
2023-01-17 14:03       ` Bruce Richardson
2023-01-18  1:46         ` Jiang, Cheng1
2023-01-17 12:05 ` [PATCH v3] " Cheng Jiang
2023-01-17 15:44   ` Bruce Richardson
2023-01-19  7:18     ` Jiang, Cheng1
2023-01-17 16:51   ` Bruce Richardson
2023-01-28 13:32     ` Jiang, Cheng1
2023-01-30  9:20       ` Bruce Richardson
2023-02-06 14:20         ` Jiang, Cheng1
2023-01-31  5:27     ` Hu, Jiayu [this message]
2023-04-20  7:22 [PATCH] " Cheng Jiang
2023-05-17  7:31 ` [PATCH v3] " Cheng Jiang
