From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Doherty, Declan"
To: "John Daley (johndale)", Shahaf Shuler, "dev@dpdk.org"
References: <345C63BAECC1AD42A2EC8C63AFFC3ADCC488E501@IRSMSX102.ger.corp.intel.com> <8580655d-a481-4a4a-2c9b-bba725c39485@intel.com> <6d26f10919d74934a569c7546bb6836b@XCH-RCD-007.cisco.com>
Message-ID: <4010a723-5e8c-39de-8f4a-a59f8fd0118b@intel.com>
In-Reply-To: <6d26f10919d74934a569c7546bb6836b@XCH-RCD-007.cisco.com>
Date: Tue, 23 Jan 2018 14:46:44 +0000
Subject: Re: [dpdk-dev] [RFC] tunnel endpoint hw acceleration enablement

On 11/01/2018 9:45 PM, John Daley (johndale) wrote:
> Hi Declan and Shahaf,
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Doherty, Declan
>> Sent: Tuesday, January 09, 2018 9:31 AM
>> To: Shahaf Shuler; dev@dpdk.org
>> Subject: Re: [dpdk-dev] [RFC] tunnel endpoint hw acceleration enablement
>>
>> On 24/12/2017 5:30 PM, Shahaf Shuler wrote:
>>> Hi Declan,
>>>
>>
>> Hey Shahaf, apologies for the delay in responding, I have been out of
>> office for the last 2 weeks.
>>
>>> Friday, December 22, 2017 12:21 AM, Doherty, Declan:
>>>> This RFC contains a proposal to add a new tunnel endpoint API to DPDK
>>>> that, when used in conjunction with rte_flow, enables the configuration
>>>> of inline data path encapsulation and decapsulation of tunnel endpoint
>>>> network overlays on accelerated IO devices.
>>>>
>>>> The proposed new API would provide for the creation, destruction, and
>>>> monitoring of a tunnel endpoint in supporting hw, as well as
>>>> capabilities APIs to allow the acceleration features to be discovered
>>>> by applications.
>>>>
>> ....
>>>
>>>
>>> I am not sure I understand why there is a need for the above control
>>> methods.
>>> Are you introducing a new "tep device"? As the tunnel endpoint is
>>> sending and receiving Ethernet packets from the network, I think it
>>> should still be counted as an Ethernet device, but with more
>>> capabilities (for example, it supports encap/decap etc.), and it should
>>> therefore use the ethdev layer API to query statistics (for example).
>>
>> No, the new APIs are only intended to be a method of creating, monitoring
>> and deleting tunnel endpoints on an existing ethdev. The rationale for
>> APIs separate from rte_flow is the same as that in rte_security: there is
>> not a 1:1 mapping of TEPs to flows.
>> Many flows (VNIs in VxLAN, for example) can originate/terminate on the
>> same TEP, so managing the TEP independently of the flows being
>> transmitted on it is important, to allow visibility of that endpoint's
>> stats for example.
>
> I don't quite understand what you mean by tunnel and flow here. Can you
> define exactly what you mean? Flow is an overloaded word in our world. I
> think that defining it will make understanding the RFC a little easier.
>

Hey John, I think that's a good idea. For me the tunnel endpoint defines
the l3/l4 parameters of the endpoint, so for VxLAN over IPv4 this would
include the IPv4, UDP and VxLAN headers, excluding the VNI (flow id). I'm
not sure if it makes more sense that each TEP contains the VNI (flow id)
or not. I believe the model used by OvS today is similar to the RFC in
that many VNIs can be terminated in the same TEP port context. In terms of
flow definitions, for encapsulated ingress I would see the definition of a
flow as including the l2 and l3/l4 headers of the outer, including the
flow id of the tunnel, and optionally any or all of the inner headers. For
non-encapsulated egress traffic the flow defines any combination of the
l2, l3 and l4 headers, as defined by the user.

> Taking VxLAN, I think of the tunnel as including up through the VxLAN
> header, including the VNI. If you go by this definition, I would consider
> a flow to be all packets with the same VNI and the same 5-tuple hash of
> the inner packet. Is this what you mean by tunnel (or TEP) and flow here?

Yes, with the exception that I had excluded the VNI/flow id from the TEP
definition and made it part of the flow, but otherwise essentially yes.

>
> With these definitions, VPP for example might need up to a couple
> thousand TEPs on an interface and each TEP could have hundreds or
> thousands of flows. It would be quite possible to have 1 rte_flow rule
> per TEP (or 2 - ingress/decap and egress/encap). The COUNT action could
> be used to count the number of packets through each TEP. Is this
> adequate, or are you proposing that we need a mechanism to get stats of
> flows within each TEP? Is that the main point of the API? Assuming no
> need for stats on a per TEP/flow basis, is there anything else the API
> adds?

Yes, the basis of having the TEP as a separate API is to allow flows to be
tracked independently of the overlay they may be transported on. I believe
this will be a requirement for acceleration of any vswitch, as we could
have a case where flows bypass the host vswitch completely and are
encapped/decapped and switched in hw directly between the guest and the
physical port. OvS can currently track both flow and TEP statistics, and I
think we need to support this model.

>> I can't see how the existing ethdev API could be used for statistics, as
>> a single ethdev could be supporting many concurrent TEPs; therefore we
>> would either need to use the extended stats with many entries, one for
>> each TEP, or, if we treat a TEP as an attribute of a port in a similar
>> manner to the way rte_security manages an IPsec SA, the state of each
>> TEP can be monitored and managed independently of both the overall port
>> and the flows being transported on that endpoint.
>
> Assuming we can define one rte_flow rule per TEP, does what you propose
> give us anything more than just using the COUNT action?

This still won't allow individual flow statistics to be tracked in the
full offload model.
As you state above, you could have a couple of thousand TEPs terminated on
a single or small number of physical ports, with tens or hundreds of
thousands of flows on each TEP. I think for management of the system we
need to be able to monitor all of these statistics independently.

>>
>>> As for the capabilities - what specifically did you have in mind? The
>>> current usage you show with tep is with rte_flow rules. There are
>>> currently no capabilities for rte_flow supported actions/patterns; to
>>> check such capabilities the application uses rte_flow_validate.
>>
>> I envisaged that the application should be able to see if an ethdev can
>> support TEP in the rx/tx offloads, and then the rte_tep_capabilities
>> would allow applications to query what tunnel endpoint protocols are
>> supported etc. I would like a simple mechanism to allow users to see if
>> a particular tunnel endpoint type is supported without having to build
>> actual flows to validate.
>
> I can see the value of that, but in the end wouldn't the API call
> rte_flow_validate anyways? Maybe we don't add the layer now or maybe it
> doesn't really belong in DPDK? I'm in favor of deferring the capabilities
> API until we know it's really needed. I hate to see special capabilities
> APIs start sneaking in after we decided to go the rte_flow_validate route
> and users are starting to get used to it.

Flow validation will still always be required, but I think having a rich
capability API will also be very important to allow application control
planes to figure out what accelerations are available and define the
application pipeline accordingly. I can envisage scenarios where, on the
same platform, you could have two devices which both support TEP in hw but
only one of which supports switching as well; the way the host application
would use these 2 devices may be radically different, and rte_flow_validate
does not allow that sort of capability to be clearly discovered. This may
be as simple as a new feature bit in the ethdev.

>>
>>> Regarding the creation/destroy of tep. Why not simply use the rte_flow
>>> API and avoid this extra control?
>>> For example - with 17.11 APIs, the application can put the port in
>>> isolate mode and insert a flow rule to catch only IPv4 VXLAN traffic
>>> and direct it to some queue/do RSS. Such an operation, per my
>>> understanding, will create a tunnel endpoint. What are the downsides of
>>> doing it with the current APIs?
>>
>> That doesn't enable encapsulation and decapsulation of the outer tunnel
>> endpoint in the hw as far as I know, apart from the inability to monitor
>> the endpoint statistics I mentioned above. It would also require that
>> you redefine the endpoint's parameters every time you wish to add a new
>> flow to it. I think having the rte_tep object semantics should also
>> simplify the ability to enable a full vswitch offload of TEP, where the
>> hw is handling both encap/decap and switching to a particular port.
>
> If we have the ingress/decap and egress/encap actions and 1 rte_flow rule
> per TEP and use the COUNT action, I think we get all but the last bit.
> For that, perhaps the application could keep ingress and egress rte_flow
> templates for each tunnel type (VxLAN, GRE, ..). Then copying the
> template and filling in the outer packet info and tunnel id is all that
> would be required. We could also define these in rte_flow.h?

Again, the main issue here is that one flow per TEP doesn't work when the
device also supports switching on the inner flows.
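
Purely as an illustrative sketch of what I have in mind (the rte_tep_*
calls, RTE_TEP_TYPE_* and RTE_FLOW_ITEM_TYPE_TEP names below are
placeholders for the direction of the RFC, not a finalised API, and
port_id/vni are just example values), the ingress side could look roughly
like this:

/*
 * Illustrative only: rte_tep_create()/rte_tep_stats_get(), rte_tep_params,
 * RTE_TEP_TYPE_VXLAN and RTE_FLOW_ITEM_TYPE_TEP are placeholder names for
 * the proposed TEP API; the rte_flow calls and COUNT action are the
 * existing ones.
 */
uint16_t port_id = 0;       /* example port */
uint32_t vni = 100;         /* example overlay flow id */

struct rte_tep_params tep_params = {
        .type = RTE_TEP_TYPE_VXLAN,
        /* outer l2/l3/l4 endpoint definition, excluding the VNI */
};

/* One endpoint context per outer l3/l4 tunnel termination. */
struct rte_tep *tep = rte_tep_create(port_id, &tep_params);

/* Many overlay flows (VNIs) terminate on the same endpoint; each one is
 * added as a normal rte_flow rule referencing the shared TEP context. */
struct rte_flow_item_tep tep_item = { .tep = tep, .flow_id = vni };

struct rte_flow_item pattern[] = {
        { .type = RTE_FLOW_ITEM_TYPE_TEP, .spec = &tep_item },
        /* optionally inner headers, for switching on the inner flow */
        { .type = RTE_FLOW_ITEM_TYPE_END },
};

/* Per-flow counters still come from the existing COUNT action ... */
struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_COUNT },
        { .type = RTE_FLOW_ACTION_TYPE_END },
};

struct rte_flow_attr attr = { .ingress = 1 };
struct rte_flow_error error;
struct rte_flow *flow =
        rte_flow_create(port_id, &attr, pattern, actions, &error);

/* ... while endpoint-level statistics are queried on the TEP itself,
 * independently of how many flows are attached to it. */
struct rte_tep_stats stats;
rte_tep_stats_get(port_id, tep, &stats);

The point is that the TEP context is created once, its statistics remain
available whether its flows are handled by the host vswitch or fully
offloaded, and adding or removing a flow never requires redefining the
endpoint.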
>
>>
>>>
>>>>
>>>> To direct traffic flows to a hw-terminated tunnel endpoint, the
>>>> rte_flow API is enhanced to add a new flow item type. This contains a
>>>> pointer to the TEP context as well as the overlay flow id with which
>>>> the traffic flow is associated.
>>>>
>>>> struct rte_flow_item_tep {
>>>>         struct rte_tep *tep;
>>>>         uint32_t flow_id;
>>>> }
>>>
>>> Can you provide a more detailed definition of the flow id? To which
>>> field from the packet headers does it refer?
>>> In your examples below it looks like it matches the VXLAN vni in the
>>> case of VXLAN; what about the other protocols? And also, why not use
>>> the already existing VXLAN item?
>>
>> I have initially only been looking at a couple of the tunnel endpoint
>> protocols, namely Geneve, NvGRE and VxLAN, but the idea here is to allow
>> the user to define the VNI in the case of Geneve and VxLAN, and the VSID
>> in the case of NvGRE, on a per-flow basis, as per my understanding these
>> are used to identify the source/destination hosts on the overlay network
>> independently of the endpoint they are transported across.
>>
>> The VxLAN item is used in the creation of the TEP object. Using the TEP
>> object just removes the need for the user to constantly redefine all the
>> tunnel parameters, and, depending on the hw implementation, I also think
>> it may simplify the driver's work if it knows the exact endpoint the
>> action is for instead of having to look it up on each flow addition.
>>
>>>
>>> Generally I like the idea of separating the encap/decap context from
>>> the action. However, it looks like the rte_flow_item has a double
>>> meaning in this RFC, once for the classification and once for the
>>> action.
>>> Off the top of my head I would think of an API which separates those
>>> and re-uses the existing flow items. Something like:
>>>
>>> struct rte_flow_item pattern[] = {
>>>         { set of already existing patterns },
>>>         { ... },
>>>         { .type = RTE_FLOW_ITEM_TYPE_END } };
>>>
>>> encap_ctx = create_encap_context(pattern);
>>>
>>> struct rte_flow_action actions[] = {
>>>         { .type = RTE_FLOW_ITEM_ENCAP, .conf = encap_ctx } };
>>
>> I'm not sure I fully understand what you're asking here, but in general
>> for encap you would only define the inner part of the packet in the
>> match pattern criteria, and the actual outer tunnel headers would be
>> defined in the action.
>>
>> I guess there is some replication on the decap side as proposed, as the
>> TEP object is used in both the pattern and the action. Possibly you
>> could get away with having no TEP object defined in the action data, but
>> I prefer keeping the API symmetrical for the encap/decap actions at the
>> cost of some extra verbosity.
>>
>> ...
>>>
>
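
For the encap direction described above, a similarly rough sketch, reusing
tep/vni/port_id/error from the earlier sketch; the
RTE_FLOW_ACTION_TYPE_TEP_ENCAP action and struct rte_flow_action_tep_encap
are illustrative names only, not a finalised definition:

/* Egress/encap: match only the inner packet; the outer tunnel headers
 * come from the TEP object carried in the action, with flow_id selecting
 * the VNI/VSID for this particular overlay flow. */
struct rte_flow_item_eth inner_eth = { 0 };    /* inner l2 match criteria */
struct rte_flow_item_ipv4 inner_ipv4 = { 0 };  /* inner l3 match criteria */

struct rte_flow_item inner_pattern[] = {
        { .type = RTE_FLOW_ITEM_TYPE_ETH,  .spec = &inner_eth },
        { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &inner_ipv4 },
        { .type = RTE_FLOW_ITEM_TYPE_END },
};

/* Placeholder encap action conf: the TEP supplies the outer l2/l3/l4 and
 * tunnel header template, flow_id the VNI/VSID. */
struct rte_flow_action_tep_encap encap = { .tep = tep, .flow_id = vni };

struct rte_flow_action egress_actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_TEP_ENCAP, .conf = &encap },
        { .type = RTE_FLOW_ACTION_TYPE_END },
};

struct rte_flow_attr egress_attr = { .egress = 1 };
struct rte_flow *encap_flow = rte_flow_create(port_id, &egress_attr,
                inner_pattern, egress_actions, &error);

The decap side would be the mirror image, with the TEP appearing both in
the pattern (to classify the outer headers) and in a corresponding decap
action, which is the replication I mention above.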