DPDK patches and discussions
* [dpdk-dev] cost of reading tsc register
@ 2015-04-20 14:37 Ravi Kumar Iyer
  2015-04-20 15:37 ` Stephen Hemminger
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Ravi Kumar Iyer @ 2015-04-20 14:37 UTC (permalink / raw)
  To: dev

Hi,
While optimizing some DPDK-based applications, we found that rte_rdtsc() [which reads the TSC timestamp register] costs on the order of 100 clock cycles per call, varying by up to 40 cycles either way [60-140 cycles].

We are building a CPU-intensive application that is also very sensitive to clock cycles, and this is hurting our implementation.

To validate this in isolation, we wrote a small test and ran it on a single core.
Has anyone else faced a similar issue, or are we doing something really atrocious here?

Below is a snippet of the test:


<snip start>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#include <rte_cycles.h>   /* rte_rdtsc() */
#include <rte_debug.h>    /* rte_panic() */
#include <rte_eal.h>      /* rte_eal_init() */
#include <rte_memory.h>   /* __rte_cache_aligned */

uint64_t g_tsc_cost[8] __rte_cache_aligned;

void test_tsc_cost(void)
{
    uint8_t i;

    for (i = 0; i < 8; i++)
        g_tsc_cost[i] = rte_rdtsc();
}

int
main(int argc, char **argv)
{
    int ret;
    uint8_t i;

    ret = rte_eal_init(argc, argv);
    if (ret < 0)
        rte_panic("Cannot init EAL\n");

    memset(g_tsc_cost, 0, sizeof(g_tsc_cost)); /* warm the cache */

    uint64_t sc = rte_rdtsc(); /* start count */
    test_tsc_cost();
    uint64_t ec = rte_rdtsc(); /* end count */

    printf("\n Total cost = %" PRIu64 "\n", ec - sc);

    for (i = 0; i < 8; i++) {
        /* consecutive values printed here are 60-140 units apart */
        printf("\n g_tsc_cost[%d]=%" PRIu64, i, g_tsc_cost[i]);
    }
    return 0;
}
<snip end>
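
The figure of interest in the test above is the gap between consecutive samples rather than the raw counter values. A minimal sketch of reducing the samples to that gap follows; `tsc_deltas` and `struct delta_stats` are hypothetical helper names, not DPDK APIs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type, not part of DPDK. */
struct delta_stats {
	uint64_t min;	/* smallest gap between consecutive samples */
	uint64_t max;	/* largest gap between consecutive samples */
};

/* Reduce n raw counter samples to the smallest and largest gap between
 * consecutive samples. Returns 0 on success, -1 if n < 2. */
static int
tsc_deltas(const uint64_t *samples, size_t n, struct delta_stats *out)
{
	size_t i;

	if (n < 2)
		return -1;

	out->min = UINT64_MAX;
	out->max = 0;
	for (i = 1; i < n; i++) {
		uint64_t d = samples[i] - samples[i - 1];

		if (d < out->min)
			out->min = d;
		if (d > out->max)
			out->max = d;
	}
	return 0;
}
```

Calling `tsc_deltas(g_tsc_cost, 8, &st)` after `test_tsc_cost()` would give the spread directly. Each gap includes the store into the array and the loop overhead as well as the counter read itself.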

Just to compare, on a few bare-metal implementations of non-Intel processors, we see similar code print values with a delta of 3-4 cycles, so this is a bit difficult to digest.  Grateful for any help/guidance here.

Thanks
ravi



* Re: [dpdk-dev] cost of reading tsc register
From: Stephen Hemminger @ 2015-04-20 15:37 UTC (permalink / raw)
  To: Ravi Kumar Iyer; +Cc: dev

The RDTSC instruction has its quirks. As far as I can tell:
 1. It kills instruction pipelining
 2. It is as expensive as a cache miss
 3. Counter values are not stable on some CPUs

In general, it is best to avoid depending on it in real code.
Intel seems to test only on current-generation Intel CPUs in their
lab and on bare metal. Don't read too much into the demo applications.

To get reasonable performance, I gave up on TSC and used approximate
loop cycles for tuning.
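
The message does not say how the loop cycles were approximated. One possible reading, offered here only as an assumption, is to bracket a large number of iterations with a single pair of clock reads and divide, so that the cost of the read itself is spread across all iterations. All names below are hypothetical; in a DPDK application `now` could be a thin wrapper around `rte_rdtsc()`.

```c
#include <stdint.h>

/* Hypothetical types, not DPDK API. */
typedef uint64_t (*clock_fn)(void);
typedef void (*work_fn)(void);

/* Average cost of one call to work(), in clock units. Two clock reads
 * bracket all `iters` calls, so the read overhead is amortized.
 * Returns 0 if iters is 0. */
static uint64_t
avg_cost(clock_fn now, work_fn work, uint64_t iters)
{
	uint64_t start, end, i;

	if (iters == 0)
		return 0;

	start = now();
	for (i = 0; i < iters; i++)
		work();
	end = now();

	return (end - start) / iters;
}

/* Stand-in clock and workload, for illustration only: each clock read
 * costs 100 ticks and each unit of work costs 7. */
static uint64_t fake_ticks;

static uint64_t
fake_now(void)
{
	fake_ticks += 100;
	return fake_ticks;
}

static void
fake_work(void)
{
	fake_ticks += 7;
}
```

With the stand-ins, `avg_cost(fake_now, fake_work, 1)` reports 107 because the clock read dominates, while `avg_cost(fake_now, fake_work, 1000)` reports 7, the cost of the work alone.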


* Re: [dpdk-dev] cost of reading tsc register
From: Matthew Hall @ 2015-04-20 16:21 UTC (permalink / raw)
  To: Ravi Kumar Iyer; +Cc: dev

On Mon, Apr 20, 2015 at 02:37:53PM +0000, Ravi Kumar Iyer wrote:
> While optimizing some DPDK-based applications, we found that rte_rdtsc() [which reads the TSC timestamp register] costs on the order of 100 clock cycles per call, varying by up to 40 cycles either way [60-140 cycles].
> 
> We are building a CPU-intensive application that is also very sensitive to clock cycles, and this is hurting our implementation.
> 
> To validate this in isolation, we wrote a small test and ran it on a single core.
> Has anyone else faced a similar issue, or are we doing something really atrocious here?

What happened when you tried rte_rdtsc_precise ?

Matthew.


* Re: [dpdk-dev] cost of reading tsc register
From: Pawel Wodkowski @ 2015-04-22  7:53 UTC (permalink / raw)
  To: dev

On 2015-04-20 16:37, Ravi Kumar Iyer wrote:
> Hi,
> While optimizing some DPDK-based applications, we found that rte_rdtsc() [which reads the TSC timestamp register] costs on the order of 100 clock cycles per call, varying by up to 40 cycles either way [60-140 cycles].
>
> We are building a CPU-intensive application that is also very sensitive to clock cycles, and this is hurting our implementation.
>
> To validate this in isolation, we wrote a small test and ran it on a single core.
> Has anyone else faced a similar issue, or are we doing something really atrocious here?
>
> Below is a snippet of the test:
>
>
...
>      for (i = 0; i < 8 ; i++)
>      {
>          g_tsc_cost[i] = rte_rdtsc();
>        }
...
>
>      uint64_t sc = rte_rdtsc(); /* start count */
>      test_tsc_cost();
>      uint64_t ec = rte_rdtsc(); /* end count */
>

I am not an expert on this topic, but I can share what I learned
while implementing lib jobstats (I think you will find it useful in
your case, with a small modification to how the time is read).

rte_rdtsc() (a wrapper around the asm rdtsc instruction) is pretty
useless in this particular use case. The instruction is pipelined (the
CPU can execute it out of order with the surrounding code), and
because of this you won't get a precise time.

The same is true for rte_rdtsc_precise(), which is a memory barrier
followed by rte_rdtsc(). I was surprised that in most cases the compiler
removes the memory barrier at '-Os' and '-O3', so the final code might
be no different from rte_rdtsc().
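
For readers without the DPDK sources at hand, the barrier-then-read shape described above can be sketched outside DPDK with compiler intrinsics. This is an illustration, not the DPDK implementation: `fenced_tsc` is a hypothetical name, and `_mm_mfence()` and `__rdtsc()` are the GCC/Clang x86 intrinsics from `<x86intrin.h>`, so the sketch builds on x86 targets only.

```c
#include <stdint.h>
#include <x86intrin.h>	/* _mm_mfence(), __rdtsc(); x86 with GCC or Clang */

/* Hypothetical helper: full memory fence, then read the time-stamp counter. */
static inline uint64_t
fenced_tsc(void)
{
	_mm_mfence();		/* orders earlier loads and stores */
	return __rdtsc();	/* raw counter read, itself not serializing */
}
```

Note that MFENCE orders memory accesses; on Intel CPUs it is not documented as a serializing instruction, so this shape still leaves room for non-memory instructions around the read to be reordered.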

There is no perfect solution for your problem.
Assuming you want to measure pure code execution time, you need to use the
... CPUID instruction :D together with RDTSC and RDTSCP. Yes, this is not a
joke. CPUID acts as a barrier to out-of-order execution.
In pseudo code:

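#include <stdint.h>
#include <rte_cycles.h>	/* rte_rdtsc() */

extern uint64_t g_tsc_cost[8];	/* array from the test above */

static inline uint64_t
rte_rdtscp(void)
{
	union {
		uint64_t tsc_64;
		struct {
			uint32_t lo_32;
			uint32_t hi_32;
		};
	} tsc;
	uint32_t aux;	/* RDTSCP also writes IA32_TSC_AUX to ECX */

#ifdef RTE_LIBRTE_EAL_VMWARE_TSC_MAP_SUPPORT
	/* What ever is needed here */
#endif

	asm volatile("rdtscp" :
		     "=a" (tsc.lo_32),
		     "=d" (tsc.hi_32),
		     "=c" (aux));
	(void)aux;
	return tsc.tsc_64;
}

/* Execution barrier: CPUID is a serializing instruction.
 * It overwrites EAX, EBX, ECX and EDX. */
static inline void
cpuid_barrier(void)
{
	uint32_t eax = 0, ebx, ecx = 0, edx;

	asm volatile("cpuid" :
		     "+a" (eax), "=b" (ebx), "+c" (ecx), "=d" (edx) :
		     : "memory");
	(void)ebx;
	(void)edx;
}

uint64_t
timestamp_start(void)
{
	cpuid_barrier();
	return rte_rdtsc(); /* without 'p' */
}

uint64_t
timestamp_get(void)
{
	uint64_t tsc = rte_rdtscp(); /* with 'p' */

	cpuid_barrier();
	return tsc;
}

void
do_some_task(void)
{
	unsigned int i;

	for (i = 0; i < 8; i++)
		g_tsc_cost[i] = timestamp_get();
}

uint64_t
measure_task(void)
{
	uint64_t start_time, end_time;

	/* warmup cache */
	timestamp_start();
	timestamp_start();
	timestamp_start();

	start_time = timestamp_start();
	do_some_task();
	end_time = timestamp_get();
	/* ... */
	return end_time - start_time;
}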

And some words about performance here:
If you use it many times in your code and measure intervals shorter than a
few thousand cycles, you will kill your application because the
processor stalls at the CPUID and RDTSCP instructions, so use it wisely.
I tested these cases while implementing the l2_fwd_jobstats example.
With the original rte_rdtscp() the app was able to handle 64B packets at about
2x7.5GiB of traffic per core. When I used CPUID and RDTSCP to get
"accurate" timestamps I got at most 2x4.5GiB. So again: use it wisely.

And one word that is purely my own opinion: these CPUs are not
designed for very precise time measurements, because there is no easy
way to implement them without a significant performance penalty.
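
Because the serialized timestamp pair has a cost of its own, a common companion technique (not something described in this message) is to measure that cost once and subtract it from later measurements. The sketch below takes back-to-back start/stop readings with nothing in between and keeps the minimum. All names are hypothetical; the two function pointers stand for `timestamp_start()` and `timestamp_get()` from the pseudo code above.

```c
#include <stdint.h>

/* Hypothetical type: any function that returns a counter reading. */
typedef uint64_t (*stamp_fn)(void);

/* Smallest observed gap between a start stamp and a stop stamp taken
 * back to back, over `rounds` tries. Returns UINT64_MAX if rounds is 0. */
static uint64_t
stamp_overhead(stamp_fn start, stamp_fn stop, unsigned int rounds)
{
	uint64_t best = UINT64_MAX;
	unsigned int i;

	for (i = 0; i < rounds; i++) {
		uint64_t t0 = start();
		uint64_t t1 = stop();

		if (t1 - t0 < best)
			best = t1 - t0;
	}
	return best;
}

/* Subtract the calibrated overhead from a raw interval, clamping at zero. */
static uint64_t
net_cycles(uint64_t raw, uint64_t overhead)
{
	return raw > overhead ? raw - overhead : 0;
}
```

The minimum is used rather than the average so that interrupts and cache misses during calibration do not inflate the estimate. This removes the bias from each measurement but not the throughput cost of taking it.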

>
> Just to compare, on a few bare-metal implementations of non-Intel processors, we see similar code print values with a delta of 3-4 cycles, so this is a bit difficult to digest.  Grateful for any help/guidance here.
>

I think you should also isolate the CPU from the scheduler and use IRQ
affinity to remove any unwanted interference from the system.
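
Pinning can also be done from inside the process. A minimal Linux-only sketch using glibc's `sched_setaffinity()` follows; `pin_to_cpu` is a hypothetical helper name. DPDK's EAL already pins each lcore thread to its core, so this matters mainly for non-EAL threads, and it does not replace kernel-side isolation (for example the `isolcpus=` boot parameter) or the IRQ affinity masks under `/proc/irq/`.

```c
#define _GNU_SOURCE	/* CPU_SET() and sched_setaffinity() in glibc */
#include <assert.h>
#include <sched.h>

/* Hypothetical helper: restrict the calling thread to a single CPU.
 * Returns 0 on success, -1 with errno set on failure. */
static int
pin_to_cpu(int cpu)
{
	cpu_set_t set;

	assert(cpu >= 0 && cpu < CPU_SETSIZE);

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);	/* 0 = calling thread */
}
```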

-- 
Pawel
