From: Zoltan Kiss
To: "dev@dpdk.org", dev@openvswitch.org
Subject: Re: [dpdk-dev] OVS-DPDK performance problem on ixgbe vector PMD
Date: Wed, 26 Aug 2015 19:49:45 +0100
Message-ID: <55DE0A49.8060803@linaro.org>
In-Reply-To: <55D76854.5010306@linaro.org>
References: <55D76854.5010306@linaro.org>

Hi,

I've checked it further; based on Stephen's suggestion I tried perf top as well. The results were the same: it spends a lot of time in that part of the code, and there is a high number of mispredicted branches (BR_MISS_PRED_RETIRED) around there too.

I've also started to strip down miniflow_extract() to remove parts which are not relevant to this very simple test case. I've removed the metadata-checking branches and the "size < sizeof(struct eth_header)" check. I've removed the size check from emc_processing, and placed log messages in flow_extract and netdev_flow_key_from_flow to make sure the excessive time spent in miniflow_extract is not because these two are somehow calling it. That way I've ruled out all of the branches preceding this instruction. Oddly, the high sample count has now moved down a few instructions:

...
dp_packet_reset_offsets
  5113eb:  b8 ff ff ff ff          mov    $0xffffffff,%eax
  5113f0:  66 89 8f 86 00 00 00    mov    %cx,0x86(%rdi)
  5113f7:  c6 87 81 00 00 00 00    movb   $0x0,0x81(%rdi)
  5113fe:  66 89 87 82 00 00 00    mov    %ax,0x82(%rdi)
data_pull
  511405:  48 8d 4d 0c             lea    0xc(%rbp),%rcx
dp_packet_reset_offsets
  511409:  66 89 97 84 00 00 00    mov    %dx,0x84(%rdi)
memcpy
  511410:  48 8b 45 00             mov    0x0(%rbp),%rax
  511414:  48 89 46 18             mov    %rax,0x18(%rsi)

This last instruction moves the first 8 bytes of the MAC addresses (coming from 0x0(%rbp)) to 0x18(%rsi), which is the memory pointed to by the "struct miniflow *dst" parameter.
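For reference, here is a simplified C sketch of what that instruction sequence does; this is not the exact OVS source, and the type and function names below are illustrative only. The layer offsets are reset to "not present", data_pull() advances the read pointer past the Ethernet addresses, and the 8-byte load/store at 511410/511414 is the start of the memcpy() into the miniflow buffer:

#include <stdint.h>
#include <string.h>

/* Only the packet fields touched in the disassembly above. */
struct dp_packet_sketch {
    void *data;
    uint8_t l2_pad_size;
    uint16_t l2_5_ofs, l3_ofs, l4_ofs;
};

/* Advance past 'size' bytes of packet data, returning the old pointer. */
static inline const void *
data_pull(const void **datap, size_t *sizep, size_t size)
{
    const char *data = *datap;
    *datap = data + size;
    *sizep -= size;
    return data;
}

/* Mark the layer offsets as not present (the 0xffff stores above), then
 * copy the 12 bytes of destination + source MAC into the caller-provided
 * miniflow buffer 'dst'. */
static void
extract_l2_sketch(struct dp_packet_sketch *pkt, size_t size, uint64_t *dst)
{
    const void *data = pkt->data;

    pkt->l2_pad_size = 0;
    pkt->l2_5_ofs = pkt->l3_ofs = pkt->l4_ofs = UINT16_MAX;

    memcpy(dst, data_pull(&data, &size, 12), 12);  /* 2 * 6-byte MACs */
}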
That miniflow is allocated on the stack by emc_processing. I couldn't find any branch that could cause these mispredictions, but then I checked the PMD stats:

pmd thread numa_id 0 core_id 1:
        emc hits:4395834176
        megaflow hits:1
        miss:1
        lost:0
        polling cycles:166083129380 (16.65%)
        processing cycles:831536059972 (83.35%)
        avg cycles per packet: 226.95 (997619189352/4395834178)
        avg processing cycles per packet: 189.16 (831536059972/4395834178)

So everything hits the EMC; when I measured the change of that counter over 10 seconds, the result was around 13.3 Mpps as well. The cycle statistics show that it should be able to handle more than 15M packets per second (see the quick arithmetic at the end of this mail), yet it doesn't receive that much, while with the non-vector PMD it can max out the link.

Any more suggestions?

Regards,

Zoltan

On 21/08/15 19:05, Zoltan Kiss wrote:
> Hi,
>
> I've set up a simple packet forwarding perf test on a dual-port 10G
> 82599ES: one port receives 64-byte UDP packets, the other sends them out,
> and one core is used. I've used the latest OVS with DPDK 2.1, and the
> first result was only 13.2 Mpps, which was a bit far from the 13.9 Mpps I
> saw last year with the same test. The first thing I changed was to revert
> to the old behaviour regarding this issue:
>
> http://permalink.gmane.org/gmane.comp.networking.dpdk.devel/22731
>
> So instead of the new default I've passed 2048 + RTE_PKTMBUF_HEADROOM.
> That increased the performance to 13.5 Mpps, but to figure out what's
> wrong I started to play with the receive functions. First I disabled the
> vector PMD, but ixgbe_recv_pkts_bulk_alloc() was even worse, only 12.5
> Mpps. So then I enabled scattered RX, and with
> ixgbe_recv_pkts_lro_bulk_alloc() I could manage to get 13.98 Mpps, which
> is, I guess, as close as possible to the 14.2 Mpps line rate (on my HW at
> least, with one core).
> Does anyone have a good explanation for why the vector PMD performs so
> significantly worse? I would expect that on a 3.2 GHz i5-4570 one core
> should be able to reach ~14 Mpps; SG and vector PMD shouldn't make a
> difference.
> I've tried to look into it with oprofile, but the results were quite
> strange: 35% of the samples were from miniflow_extract, the part where
> parse_vlan calls data_pull to jump past the MAC addresses. The oprofile
> snippet (1M samples; columns are address, samples, %, and source line):
>
>   511454       19   0.0037  flow.c:511
>   511458      149   0.0292  dp-packet.h:266
>   51145f     4264   0.8357  dp-packet.h:267
>   511466       18   0.0035  dp-packet.h:268
>   51146d       43   0.0084  dp-packet.h:269
>   511474      172   0.0337  flow.c:511
>   51147a     4320   0.8467  string3.h:51
>   51147e   358763  70.3176  flow.c:99
>   511482        2  3.9e-04  string3.h:51
>   511485     3060   0.5998  string3.h:51
>   511488     1693   0.3318  string3.h:51
>   51148c     2933   0.5749  flow.c:326
>   511491       47   0.0092  flow.c:326
>
> And the corresponding disassembled code:
>
>   511454:  49 83 f9 0d             cmp    r9,0xd
>   511458:  c6 83 81 00 00 00 00    mov    BYTE PTR [rbx+0x81],0x0
>   51145f:  66 89 83 82 00 00 00    mov    WORD PTR [rbx+0x82],ax
>   511466:  66 89 93 84 00 00 00    mov    WORD PTR [rbx+0x84],dx
>   51146d:  66 89 8b 86 00 00 00    mov    WORD PTR [rbx+0x86],cx
>   511474:  0f 86 af 01 00 00       jbe    511629
>
>   51147a:  48 8b 45 00             mov    rax,QWORD PTR [rbp+0x0]
>   51147e:  4c 8d 5d 0c             lea    r11,[rbp+0xc]
>   511482:  49 89 00                mov    QWORD PTR [r8],rax
>   511485:  8b 45 08                mov    eax,DWORD PTR [rbp+0x8]
>   511488:  41 89 40 08             mov    DWORD PTR [r8+0x8],eax
>   51148c:  44 0f b7 55 0c          movzx  r10d,WORD PTR [rbp+0xc]
>   511491:  66 41 81 fa 81 00       cmp    r10w,0x81
>
> My only explanation so far is that I'm misunderstanding something
> about the oprofile results.
>
> Regards,
>
> Zoltan
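P.S. The "more than 15M packets per second" figure above comes from the PMD counters together with the 3.2 GHz clock of the i5-4570 mentioned in the quoted mail. A minimal sketch of that back-of-the-envelope arithmetic, assuming a fixed 3.2 GHz clock (turbo and frequency scaling ignored), purely for illustration:

#include <stdio.h>

int main(void)
{
    const double clock_hz = 3.2e9;               /* i5-4570 clock, assumed fixed */
    const double total_cycles_per_pkt = 226.95;  /* polling + processing, from the pmd stats */
    const double proc_cycles_per_pkt = 189.16;   /* processing only, from the pmd stats */

    /* ~14.1 Mpps if every spent cycle is charged to a packet... */
    printf("total-cycle bound:     %.1f Mpps\n", clock_hz / total_cycles_per_pkt / 1e6);
    /* ...and ~16.9 Mpps counting processing cycles only, yet the PMD
     * only sees ~13.3 Mpps arriving from the port. */
    printf("processing-only bound: %.1f Mpps\n", clock_hz / proc_cycles_per_pkt / 1e6);
    return 0;
}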