From: "Wiles, Keith"
To: Matt Laswell
CC: "dev@dpdk.org"
Date: Thu, 4 May 2017 16:34:58 +0000
Message-ID: <7B7539B0-8CB0-4DDB-B329-D11F295A2604@intel.com>
Subject: Re: [dpdk-dev] Occasional instability in RSS Hashes/Queues from X540 NIC

> On May 4, 2017, at 8:04 AM, Matt Laswell wrote:
>
> Hey Folks,
>
> I'm seeing some strange behavior with regard to the RSS hash values in my
> applications and was hoping somebody might have
> some pointers on where to look. In my application, I'm using RSS to
> divide work among multiple cores, each of which services a single RX
> queue. When dealing with a single long-lived TCP connection, I
> occasionally see packets going to the wrong core. That is, almost all of
> the packets in the connection go to core 5 in this case, but every once
> in a while, one goes to core 0 instead.
>
> Upon further investigation, I find two problems are occurring. The first
> is that problem packets have the RSS hash value in the mbuf incorrectly
> set to zero. They are therefore put in queue zero, where they are read by
> core zero. Other packets from the same connection that occur immediately
> before and after the packet in question have the correct hash value and
> therefore go to a different core. The second problem is that we sometimes
> see packets in which the RSS hash in the mbuf appears correct, but the
> packets are incorrectly put into queue zero. As with the first, this
> results in the wrong core getting the packet. Either one of these
> confuses the state tracking we're doing per-core.
>
> A few details:
>
> - Using an Intel X540-AT2 NIC and the igb_uio driver
> - DPDK 16.04
> - A particular packet in our workflow always encounters this problem.
> - Retransmissions of the packet in question also encounter the problem
> - The packet is IPv4, with header length of 20 (so no options), no
>   fragmentation.
> - The only differences I can see in the IP header between packets that
>   get the right hash value and those that get the wrong one are in the
>   IP ID, total length, and checksum fields.
> - Using ETH_RSS_IPV4
> - The packet is TCP with about 100 bytes of payload - it's not a jumbo
>   or a runt
> - We fill the key in with 0x6d5a to get symmetric hashing of both sides
>   of the connection
> - We only configure RSS information at boot; things like the key or
>   header fields are not being changed dynamically
> - Traffic load is light when the problem occurs
>
> Is anybody aware of any errata, either in the NIC or the PMD's
> configuration of it, that might explain something like this? Failing
> that, if you ran into this sort of behavior, how would you approach
> finding the reason for the error? Every failure mode I can think of would
> tend to affect all of the packets in the connection consistently, even if
> incorrectly.

Just to add more information to this email, can you provide hexdumps of the
packets to help someone maybe spot the problem?

We would need the last good packet before the failure, the good packet
after it, and the failing packets you are seeing.

I do not know why this is happening, and I am not aware of any errata that
would explain it.

>
> Thanks in advance for any ideas.
>
> --
> Matt Laswell
> laswell@infinite.io

Regards,
Keith
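
One way to narrow this down, offered as a sketch and not as anything taken
from the thread: recompute the expected RSS hash in software and compare it
with the value the NIC left in the mbuf (the hash.rss field). The code below
is a plain Toeplitz hash over the IPv4 source and destination addresses,
which is the usual input for an IPv4-only RSS type such as ETH_RSS_IPV4. The
function names (toeplitz_hash, rss_fill_symmetric_key, rss_ipv4_hash,
rss_reta_index) are hypothetical helpers, the 40-byte key length is an
assumption, and the 128-entry redirection table indexed by the low 7 bits of
the hash is an assumption about the X540 that should be checked against the
datasheet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RSS_KEY_LEN 40 /* assumed key size in bytes */

/* Toeplitz hash: for every set input bit (MSB first), XOR in the 32-bit
 * window of the key that starts at that bit position. */
static uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
                              const uint8_t *data, size_t data_len)
{
    assert(key_len >= 4);

    uint32_t hash = 0;
    uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                      ((uint32_t)key[2] << 8) | (uint32_t)key[3];
    size_t next_key_bit = 32; /* next key bit to shift into the window */

    for (size_t i = 0; i < data_len; i++) {
        for (int b = 7; b >= 0; b--) {
            if (data[i] & (1u << b))
                hash ^= window;

            /* Slide the window one bit to the left along the key. */
            uint32_t bit = 0;
            if (next_key_bit < key_len * 8)
                bit = (key[next_key_bit / 8] >> (7 - next_key_bit % 8)) & 1u;
            window = (window << 1) | bit;
            next_key_bit++;
        }
    }
    return hash;
}

/* Fill the key with the repeating 16-bit pattern 0x6d5a. Because the key
 * repeats every 16 bits, swapping the two 32-bit addresses does not change
 * the hash, which is what makes it symmetric. */
static void rss_fill_symmetric_key(uint8_t key[RSS_KEY_LEN])
{
    for (size_t i = 0; i < RSS_KEY_LEN; i += 2) {
        key[i] = 0x6d;
        key[i + 1] = 0x5a;
    }
}

/* IPv4-only hash input: source address followed by destination address,
 * both in network byte order. */
static uint32_t rss_ipv4_hash(const uint8_t key[RSS_KEY_LEN],
                              const uint8_t src[4], const uint8_t dst[4])
{
    uint8_t input[8];

    memcpy(input, src, 4);
    memcpy(input + 4, dst, 4);
    return toeplitz_hash(key, RSS_KEY_LEN, input, sizeof(input));
}

/* ASSUMPTION: 128-entry redirection table indexed by the 7 LSBs of the
 * hash. The queue is whatever the application programmed at that index. */
static unsigned int rss_reta_index(uint32_t hash)
{
    return hash & 0x7fu;
}
```

Only the two addresses feed this hash, so the IP ID, total length, and
checksum should not change the result. If the software value for the
failing packet matches the hash of its neighbours while the mbuf reports
zero, that points at the hash not being computed for that packet rather
than at the key or the redirection table.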