From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <bruce.richardson@intel.com>
Received: from mga03.intel.com (mga03.intel.com [134.134.136.65])
 by dpdk.org (Postfix) with ESMTP id CC35A7FFD
 for <dev@dpdk.org>; Fri, 10 Oct 2014 12:51:44 +0200 (CEST)
Received: from orsmga001.jf.intel.com ([10.7.209.18])
 by orsmga103.jf.intel.com with ESMTP; 10 Oct 2014 03:56:26 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.04,691,1406617200"; d="scan'208";a="586632229"
Received: from bricha3-mobl.ger.corp.intel.com (HELO
 bricha3-mobl.ir.intel.com) ([10.243.20.24])
 by orsmga001.jf.intel.com with SMTP; 10 Oct 2014 03:59:07 -0700
Received: by bricha3-mobl.ir.intel.com (sSMTP sendmail emulation);
 Fri, 10 Oct 2014 11:59:06 +0001
Date: Fri, 10 Oct 2014 11:59:06 +0100
From: Bruce Richardson <bruce.richardson@intel.com>
To: Matthew Hall <mhall@mhcomputing.net>
Message-ID: <20141010105906.GA12696@BRICHA3-MOBL>
References: <3AEA2BF9852C6F48A459DA490692831FE21954@IRSMSX109.ger.corp.intel.com>
 <20141008224111.GC29243@mhcomputing.net>
 <20141008225540.GA15850@hmsreliant.think-freely.org>
 <20141008230728.GA29712@mhcomputing.net>
 <20141009091421.GB14308@BRICHA3-MOBL>
 <20141009171135.GA8620@mhcomputing.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20141009171135.GA8620@mhcomputing.net>
Organization: Intel Shannon Ltd.
User-Agent: Mutt/1.5.22 (2013-10-16)
Cc: "dev@dpdk.org" <dev@dpdk.org>
Subject: Re: [dpdk-dev] [PATCH RFC] librte_reorder: new reorder library
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: patches and discussions about DPDK <dev.dpdk.org>
List-Unsubscribe: <http://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <http://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Fri, 10 Oct 2014 10:51:45 -0000

On Thu, Oct 09, 2014 at 10:11:35AM -0700, Matthew Hall wrote:
> On Thu, Oct 09, 2014 at 10:14:21AM +0100, Bruce Richardson wrote:
> > Hi Matthew,
> > 
> > What you are doing will indeed work, and it's the way the vast majority of 
> > the sample apps are written. However, this will not always work for everyone 
> > else, sadly.
> > 
> > First off, with RSS, there are a number of limitations. On the 1G and 10G 
> > NICs RSS works only with IP traffic, and won't work in cases with other 
> > protocols or where IP is encapsulated in anything other than a single VLAN.  
> > Those cases need software load distribution. As well as this, you have very 
> > little control over where flows get put, as the separation into queues 
> > (which go to cores), is only done on the low seven bits. For applications 
> > which work with a small number of flows, e.g. where multiple flows are 
> > contained inside a single tunnel, you get a large flow imbalance, 
> > where you get far more traffic coming to one queue/core than to another.  
> > Again in this instance, software load balancing is needed.
> > 
> > Secondly, then, based off that, it is entirely possible when doing software 
> > load balancing to strictly process packets for a flow in order - and indeed 
> > this is what the existing packet distributor does. However, for certain 
> > types of flow where processing of packets for that flow can be done in 
> > parallel, forcing things to be done serially can slow things down. As well 
> > as this, there can sometimes be requirements for the load balancing between 
> > cores to be done as fairly as possible so that it is guaranteed that all 
> > cores have approx the same load, irrespective of the number of input flows.  
> > In these cases, having the option to blindly distribute traffic to cores and 
> > then reorder packets on TX is the best way to ensure even load distribution.  
> > It's not going to be for everyone, but it's good to have the option - and 
> > there are a number of people doing things this way already.
> > 
> > Lastly, there is also the assumption being made that all flows are 
> > independent, which again may not always be the case. If you need ordering 
> > across flows and to share load between cores then reordering on transmission 
> > is the only way to do things.
> > 
> > Hope this helps,
> > 
> > Regards,
> > /Bruce
> 
> Bruce,
> 
> This explanation is of excellent quality.
> 
> It would be nice if it could be made into a whitepaper about the different 
> L2-L7 acceleration technologies available in the Intel NICs, popular VNICs 
> (virtio-net and vmxnet3), Intel CPUs, and DPDK code, all working together. Or 
> incorporated into such a document if it already exists.
> 
> Without things like this it's very hard to understand when and how the 
> different accelerations can be used together, when they'll work, and when 
> they won't work.
> 
> For example, I didn't know RSS only worked on IP... I was assuming it would do 
> a consistent-hash of MAC's for non-IP packets at least... also, when it 
> doesn't know what to do, does it send them to the default queue, or a random 
> FIFO RX queue picks it up or what?
>

When RSS gets a non-IP packet, or a packet it can't hash, that packet will 
be put into queue 0. This leads to a number of little tricks we can use if 
we have a mix of IP/non-IP traffic.

1. To simply separate out IP traffic from non-IP traffic, we just turn on 
RSS and update the redirection table (RETA) to have all entries set to 
queue 1. This means that all IP traffic goes to queue 1, and all other 
traffic to queue 0.
2. If you want to separate IPv6 from IPv4, you can do the exact same thing 
as in point 1, except only turn on RSS for one of the protocols. If you only 
turn on RSS for IPv4, then IPv6 traffic should be treated as non-IP and go 
to queue 0.
3. If you have IP and non-IP traffic going to a set of ports and are using 
multiple RSS queues to split that traffic across multiple cores, such that 
each core also reads from each port [e.g. 4 ports, and 4 cores,  where each 
core reads one RSS queue on each port], you can "rotate" the RSS table 
between ports so that you also load-balance the non-IP traffic coming in.  
Taking the referenced example, instead of having core 0 read queue 0 on each 
port, you have the values that hash to queue 0 on port 0 get directed to 
queue 1 on port 1, queue 2 on port 2, etc. Then core 0 [and every other 
core] reads a different queue number on each port - while still getting the 
same flows.  Furthermore, since the non-IP traffic is unaffected and always 
goes to queue 0, the non-IP traffic to each port gets handled by a different 
core, rather than all non-IP traffic going to core 0 as would be the case in 
the default setup. [Yes, this would be the case too if you took the simple 
option of just having one core per port, but doing things this way also 
gives you load balancing if one port is busier than the others.]
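
As a rough illustration of points 1 and 3, here is a small standalone C 
model of the redirection table. It is only a sketch and does not touch any 
hardware or DPDK API: the 128-entry table size follows from the "low seven 
bits" indexing mentioned earlier in the thread, and all of the names 
(RETA_SIZE, reta_fill_single, reta_fill_rotated, rx_queue, core_for_queue) 
are made up for this example. In a real application the table contents 
would be programmed through the ethdev RETA update call 
(rte_eth_dev_rss_reta_update) rather than a plain array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: table indexed by the low seven bits of the RSS hash. */
#define RETA_SIZE 128

/* Point 1: every entry points at one queue, so all hashable (IP)
 * traffic lands there while unhashable traffic stays on queue 0. */
static void reta_fill_single(uint8_t reta[RETA_SIZE], uint8_t queue)
{
    for (unsigned i = 0; i < RETA_SIZE; i++)
        reta[i] = queue;
}

/* Point 3: rotate the table by the port number. The entries that map
 * to queue 0 on port 0 map to queue 1 on port 1, queue 2 on port 2... */
static void reta_fill_rotated(uint8_t reta[RETA_SIZE], unsigned port,
                              unsigned nb_queues)
{
    assert(nb_queues > 0 && nb_queues <= 256);
    for (unsigned i = 0; i < RETA_SIZE; i++)
        reta[i] = (uint8_t)((i + port) % nb_queues);
}

/* Model of queue selection: packets that cannot be hashed go to
 * queue 0, everything else is looked up in the table. */
static uint8_t rx_queue(const uint8_t reta[RETA_SIZE], bool hashable,
                        uint32_t hash)
{
    return hashable ? reta[hash & (RETA_SIZE - 1)] : 0;
}

/* Which core should poll a given queue on a given port so that each
 * core keeps seeing the same flows on every port. */
static unsigned core_for_queue(unsigned port, unsigned queue,
                               unsigned nb_queues)
{
    assert(nb_queues > 0);
    return (queue % nb_queues + nb_queues - port % nb_queues) % nb_queues;
}
```

With 4 ports and 4 queues, a hash whose low seven bits are 4 maps to queue 
p on port p, and core_for_queue() returns core 0 for it on every port. The 
unhashable traffic, always on queue 0, ends up on cores 0, 3, 2 and 1 for 
ports 0 to 3 respectively, i.e. a different core per port.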

Finally, I'd just note that RSS is documented in section 7.1.2.8 of the 
datasheet for the Intel 82599 10 GbE Controller, which is the place to look 
for any more information.

Regards,
/Bruce