From: Paul Emmerich
To: dev@dpdk.org
Date: Mon, 11 May 2015 02:14:58 +0200
Subject: [dpdk-dev] TX performance regression caused by the mbuf cacheline split

Hi,

this is a follow-up to my post from 3 weeks ago [1]. I'm starting a new thread here since I now have a completely new test setup for improved reproducibility.

Background for anyone who didn't catch my last post: I'm investigating a performance regression in my packet generator [2] that occurs since I tried to upgrade from DPDK 1.7.1 to 1.8 or 2.0. DPDK 1.7.1 is about 25% faster than 2.0 in my application. I suspected that this is due to the new 2-cacheline mbufs, which I have now confirmed with a bisect.

My old test setup was based on the l2fwd example, required an external packet generator, and was kind of hard to reproduce. I therefore built a simple tx benchmark application that just sends nonsensical packets with a sequence number as fast as possible on two ports with a single core. You can download the benchmark app at [3].

Hardware setup:

CPU:  E5-2620 v3 underclocked to 1.2 GHz
RAM:  4x 8 GB 1866 MHz DDR4 memory
NIC:  X540-T2

Baseline test results:

DPDK     simple tx    full-featured tx
1.7.1    14.1 Mpps    10.7 Mpps
2.0.0    11.0 Mpps     9.3 Mpps

DPDK 1.7.1 is 28%/15% faster than 2.0 with simple/full-featured tx in this benchmark.

I then did a few runs of git bisect to identify commits that caused a significant drop in performance. You can find the script that I used to quickly test the performance of a version at [4].

Commit                                      simple    full-featured
7869536f3f8edace05043be6f322b835702b201c    13.9      10.4
mbuf: flatten struct vlan_macip

The commit log explains that there is a perf regression and that it cannot be avoided if the struct is to stay future-compatible. The log claims < 5%, which is consistent with my test results (the old code is 4% faster). I guess that is okay and cannot be avoided.

Commit                                      simple    full-featured
08b563ffb19d8baf59dd84200f25bc85031d18a7    12.8      10.4
mbuf: replace data pointer by an offset

This affects the simple tx path significantly. This regression is probably simply caused by the (temporarily) disabled vector tx code that is mentioned in the commit log; not investigated further.

Commit                                      simple    full-featured
f867492346bd271742dd34974e9cf8ac55ddb869    10.7       9.1
mbuf: split mbuf across two cache lines.

This one is the real culprit. The commit log does not mention any performance evaluation, and a quick scan of the mailing list also doesn't reveal any evaluation of the impact of this change. It looks like the main problem for tx is that the mempool pointer is in the second cacheline.
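To illustrate why that hurts, here is a rough sketch of what reclaiming completed tx descriptors has to do. This is not the actual ixgbe code: rte_mempool_put() and the mbuf's pool field are real DPDK API, but the surrounding function is made up for illustration.

#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Illustration only: returning transmitted mbufs to their mempool
 * needs m->pool, which sits in the second cache line since the split. */
static inline void
tx_free_completed(struct rte_mbuf **done, unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n; i++) {
		struct rte_mbuf *m = done[i];
		/* The first cache line is usually still hot from the tx
		 * burst; this load touches the second one and can miss. */
		rte_mempool_put(m->pool, m);
	}
}

With the old single-cacheline layout the pool pointer was in the same line that the tx path already touches, so this load was presumably free; now it can mean an extra cache miss per packet.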
I think the new mbuf structure is too bloated. It forces you to pay for features that you don't need or don't want. I understand that it needs to support all possible filters and offload features, but it's kind of hard to justify a 25% difference in performance for a framework that sets performance above everything (does it? I picked that up from the discussion in the "Beyond DPDK 2.0" thread).

I've counted 56 bytes in use in the first cacheline in v2.0.0. Would it be possible to move the pool pointer and the tx offload fields to the first cacheline? We would just need to free up 8 bytes. One candidate would be the seqn field: does it really have to be in the first cache line? Another candidate is the size of the ol_flags field: do we really need 64 flags? Sharing bits between rx and tx worked fine.

I naively tried to move the pool pointer into the first cache line in the v2.0.0 tag and the performance actually decreased; I'm not yet sure why this happens. There are probably assumptions about the cacheline locations and prefetching in the code that would need to be adjusted.

Another possible solution would be a more dynamic approach to mbufs: the mbuf struct could be made configurable to fit the requirements of the application. This would probably require code generation or a lot of ugly preprocessor hacks (a crude sketch of what I mean is appended at the very end of this mail) and would add a lot of complexity to the code. The question is whether DPDK really values performance above everything else.

Paul

P.S.: I'm kind of disappointed by the lack of performance regression tests. I think such tests should be an integral part of a framework whose explicit goal is to be fast. For example, the main page at dpdk.org claims a performance of "usually less than 80 cycles" for an rx or tx operation. This claim is no longer true :( Touching the layout of a core data structure like the mbuf shouldn't be done without carefully evaluating the performance impact. But this discussion probably belongs in the "Beyond DPDK 2.0" thread.

P.P.S.: Benchmarking an rx-only application (e.g. traffic analysis) would also be interesting, but that's not really on my todo list right now. Mixed rx/tx like forwarding is also affected, as discussed in my last thread [1].

[1] http://dpdk.org/ml/archives/dev/2015-April/016921.html
[2] https://github.com/emmericp/MoonGen
[3] https://github.com/emmericp/dpdk-tx-performance
[4] https://gist.github.com/emmericp/02c5885908c3cb5ac5b7
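Appendix, purely for illustration: the kind of preprocessor-configurable mbuf meant above. struct app_mbuf and the APP_MBUF_WANT_* macros are made up; this is not existing or proposed DPDK code, it just shows the idea of compiling optional fields in only when an application asks for them.

#include <stdint.h>
#include <rte_memory.h>   /* __rte_cache_aligned */
#include <rte_mempool.h>  /* struct rte_mempool */

/* Made-up example of a build-time configurable mbuf; not real DPDK code. */
struct app_mbuf {
	/* fields every application needs, kept in the first cache line */
	void *buf_addr;            /* virtual address of the data buffer */
	uint16_t buf_len;          /* size of the data buffer */
	uint16_t data_off;         /* start of the packet data */
	uint16_t data_len;         /* amount of data in this segment */
	uint32_t pkt_len;          /* total packet length */
	uint64_t ol_flags;         /* offload flags */
	struct rte_mempool *pool;  /* pool pointer stays next to the hot fields */

	/* optional features, compiled in only when the application wants them */
#ifdef APP_MBUF_WANT_SEQN
	uint32_t seqn;             /* only apps using reordering pay for this */
#endif
#ifdef APP_MBUF_WANT_USERDATA
	uint64_t udata64;          /* only apps attaching metadata pay for this */
#endif
} __rte_cache_aligned;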