From: Pawel Wodkowski
Date: Wed, 22 Apr 2015 09:53:25 +0200
To: dev@dpdk.org
Subject: Re: [dpdk-dev] cost of reading tsc register
Message-ID: <55375375.6000605@intel.com>
In-Reply-To: <115e8a38d223487488d22a99f53cc926@GURMBXV03.AD.ARICENT.COM>

On 2015-04-20 16:37, Ravi Kumar Iyer wrote:
> Hi,
> We were doing some code optimizations, running DPDK based
> applications, and chanced upon the rte_rdtsc function [ to read tsc
> timestamp register value ] consuming cpu cycles of the order of 100
> clock cycles with a delta of up to 40 cycles at times [ 60-140 cycles ]
>
> We are actually building up a cpu intensive application which is also
> very clock cycle sensitive and this is impacting our implementation.
>
> To validate the same using a small/vanilla application we wrote a
> small code and tested on a single core.
> Has anyone else faced a similar issue or are we doing something really
> atrocious here.
>
> Below is the pseudo snip of the same:
>
> ...
> for (i = 0; i < 8; i++)
> {
>     g_tsc_cost[i] = rte_rdtsc();
> }
> ...
>
> uint64_t sc = rte_rdtsc(); /* start count */
> test_tsc_cost();
> uint64_t ec = rte_rdtsc(); /* end count */
>

I am not an expert in this topic, but I can share what I learned while
implementing the jobstats library (I think you may find it useful in
your case, with a small modification to how the time is read).

rte_rdtsc() (a wrapper around the RDTSC instruction) is pretty useless
in this particular use case. The instruction is pipelined and is not
ordered with respect to the instructions around it, so you won't get a
precise time. The same is true for rte_rdtsc_precise(), which is a
memory barrier followed by rte_rdtsc(). I was surprised to see that in
most cases the compiler removes the memory barrier at '-Os' and '-O3',
so the final code might be no different from rte_rdtsc().

There is no perfect solution to your problem. Assuming you want to
measure pure code execution time, you need to use the ... CPUID
instruction :D together with RDTSC and RDTSCP. Yes, this is not a joke.
CPUID acts as a kind of barrier to out-of-order execution. In pseudo
code:

    static inline uint64_t
    rte_rdtscp(void)
    {
        union {
            uint64_t tsc_64;
            struct {
                uint32_t lo_32;
                uint32_t hi_32;
            };
        } tsc;

    #ifdef RTE_LIBRTE_EAL_VMWARE_TSC_MAP_SUPPORT
        /* Whatever is needed here */
    #endif

        /* RDTSCP also writes ECX, so mark it as clobbered */
        asm volatile("rdtscp" :
                     "=a" (tsc.lo_32),
                     "=d" (tsc.hi_32) : : "ecx");
        return tsc.tsc_64;
    }

    uint64_t timestamp_start(void)
    {
        /* Execution barrier */
        asm CPUID;
        return rte_rdtsc(); /* without 'p' */
    }

    uint64_t timestamp_get(void)
    {
        uint64_t time = rte_rdtscp(); /* with 'p' */
        /* Execution barrier */
        asm CPUID;
        return time;
    }

    void do_some_task(void)
    {
        g_tsc_cost[i] = timestamp_get();
    }

    /* warm up cache */
    timestamp_start();
    timestamp_start();
    timestamp_start();

    start_time = timestamp_start();
    do_some_task();
    end_time = timestamp_get();
    ...
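
For readers who want something that compiles, here is a minimal sketch of the CPUID + RDTSC / RDTSCP + CPUID pattern described above. It assumes an x86-64 CPU with RDTSCP support and GCC or Clang inline assembly. The helpers serialize_cpuid() and timer_overhead_min() are hypothetical names introduced only for this illustration and are not part of DPDK; timestamp_start() and timestamp_get() follow the pseudo code but read the TSC directly instead of calling rte_rdtsc().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: run CPUID purely for its serializing effect.
 * CPUID writes EAX, EBX, ECX and EDX, so all four are declared. */
static inline void serialize_cpuid(void)
{
    uint32_t eax = 0, ebx, ecx = 0, edx;

    __asm__ volatile("cpuid"
                     : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                     :
                     : "memory");
    (void)eax; (void)ebx; (void)ecx; (void)edx;
}

/* Start of a measured region: barrier first, then plain RDTSC. */
static inline uint64_t timestamp_start(void)
{
    uint32_t lo, hi;

    serialize_cpuid();
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* End of a measured region: RDTSCP waits for earlier instructions to
 * execute, then CPUID keeps later instructions from starting early.
 * RDTSCP also writes ECX, captured here in 'aux' and ignored. */
static inline uint64_t timestamp_get(void)
{
    uint32_t lo, hi, aux;

    __asm__ volatile("rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux) : : "memory");
    serialize_cpuid();
    (void)aux;
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical helper: smallest start/get delta seen over 'iterations'
 * back-to-back pairs, i.e. a rough floor for the timer's own cost. */
static inline uint64_t timer_overhead_min(unsigned int iterations)
{
    uint64_t best = UINT64_MAX;

    assert(iterations > 0);
    for (unsigned int i = 0; i < iterations; i++) {
        uint64_t start = timestamp_start();
        uint64_t end = timestamp_get();

        if (end - start < best)
            best = end - start;
    }
    return best;
}
```

Subtracting the value returned by timer_overhead_min() from a measured interval gives only a rough estimate of the measured code's own cost; treat it as an illustration, not a calibrated method.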

And a few words about performance: if you use this many times in your
code and measure intervals shorter than a few thousand cycles, you will
kill your application's performance because the processor stalls at the
CPUID and RDTSCP instructions, so use it wisely.

I tested these cases while implementing the l2_fwd_jobstats example.
With the original rte_rdtscp() the app was able to handle 64B packets at
about 2x7.5GiB of traffic per core. When I used CPUID and RDTSCP to get
"accurate" timestamps I got at most 2x4.5GiB. So again: use it wisely.

And one remark that is purely my own opinion: these CPUs are not
designed for very precise time measurements, because there is no easy
way to implement that without a significant performance penalty.

>
> Just to compare, on a few bare metal implementations of non-intel
> processors, we are seeing similar code print values with a delta
> of 3-4 cycles, and thus it's becoming a bit difficult to digest as
> well. Grateful for any help/guidance here.
>

I think you should also isolate the CPU from the scheduler and use IRQ
affinity to remove any unwanted interference from the system.

-- 
Pawel