From: Paul Emmerich
Date: Tue, 12 May 2015 00:32:05 +0200
To: dev@dpdk.org
Subject: Re: [dpdk-dev] TX performance regression caused by the mbuf cachline split
Message-ID: <55512DE5.7010800@net.in.tum.de>
In-Reply-To: <554FF482.9080103@net.in.tum.de>
References: <554FF482.9080103@net.in.tum.de>

Paul Emmerich:
> I naively tried to move the pool pointer into the first cache line in
> the v2.0.0 tag and the performance actually decreased, I'm not yet sure
> why this happens. There are probably assumptions about the cacheline
> locations and prefetching in the code that would need to be adjusted.

This happens because the next-pointer in the mbuf is touched almost
everywhere, even for mbufs with only one segment because it is used
to determine if there is another segment (instead of using the nb_segs
field).

I guess a solution for me would be to use a custom layout that is
optimized for tx. I can shrink ol_flags to 32 bits and move the seqn
and hash fields to the second cache line.
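
To make the idea concrete, here is a rough sketch of what such a tx-oriented layout could look like. This is not the real struct rte_mbuf: the struct name tx_mbuf, the field order, the field sizes and the two helper functions are assumptions made up for illustration. It only shows the principle of keeping next and pool in the first cache line by shrinking ol_flags and pushing seqn and hash into the second one, plus the two ways of testing for a single-segment packet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64

/* Hypothetical TX-oriented layout. NOT the real struct rte_mbuf:
 * field order and sizes are assumptions for illustration only. */
struct tx_mbuf {
    /* First cache line: fields the full-featured TX path reads. */
    void           *buf_addr;
    uint64_t        buf_physaddr;
    uint16_t        data_off;
    uint16_t        refcnt;
    uint8_t         nb_segs;
    uint8_t         port;
    uint16_t        data_len;
    uint32_t        pkt_len;
    uint32_t        ol_flags;    /* shrunk from 64 to 32 bits */
    uint64_t        tx_offload;
    struct tx_mbuf *next;        /* checked on almost every code path */
    void           *pool;        /* needed when the mbuf is freed after TX */

    /* Second cache line: fields the TX path does not need. */
    _Alignas(CACHE_LINE) uint32_t seqn;
    uint64_t        hash;        /* simplified; a union in DPDK */
    void           *userdata;
};

/* Compile-time checks that the layout matches the intent. */
static_assert(offsetof(struct tx_mbuf, next) + sizeof(struct tx_mbuf *)
              <= CACHE_LINE, "next must sit in the first cache line");
static_assert(offsetof(struct tx_mbuf, pool) + sizeof(void *)
              <= CACHE_LINE, "pool must sit in the first cache line");
static_assert(offsetof(struct tx_mbuf, seqn) == CACHE_LINE,
              "seqn must start the second cache line");

/* Single-segment test via the next pointer. */
static inline int single_seg_by_next(const struct tx_mbuf *m)
{
    return m->next == NULL;
}

/* Alternative test via the segment counter. */
static inline int single_seg_by_count(const struct tx_mbuf *m)
{
    return m->nb_segs == 1;
}
```

With both next and nb_segs in the first cache line, either check stays within one line. The static asserts are there so that a later field change that pushes next or pool over the 64-byte boundary fails at compile time instead of showing up as a silent slowdown.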

A quick-and-dirty test shows that this even gives me slightly higher
performance than DPDK 1.7 in the full-featured tx path. This is probably
going to break the vector rx/tx path, but I can't use that anyway since
I always need offloading features (timestamping and checksums).

I'll have to see how this affects the rx path. But I value tx performance
over rx performance. My rx logic is usually very simple.

This solution is kind of ugly. I would prefer to be able to use an
unmodified version of DPDK :/

By the way, I think there is something wrong with this assumption in
commit f867492346bd271742dd34974e9cf8ac55ddb869:
> The general approach that we are looking to take is to focus the first
> cache line on fields that are updated on RX, so that receive only deals
> with one cache line.

I think this might be wrong due to the next pointer. I'll probably build
a simple rx-only benchmark in a few weeks or so. I suspect that it will
also be significantly slower. But that should be fixable.

Paul